Improved Frontera: Web crawling at scale with Python 3 support

Posted on

September 1, 2016

By

Sibiryakov Alexander

Python is our go-to language, and Python 2 is losing traction. To survive, older programs need to be Python 3 compatible.

And so we’re pleased to announce that Frontera will remain alive and kicking because it now fully supports Python 3! Frontera joins the ranks of Scrapy and Scrapy Cloud, so you can continue to quickly create and scale fully formed crawlers in your Python 3-ready stack.

As a key web crawling toolbox that works with Scrapy, along with other web crawling systems, Frontera provides a crawl frontier framework that is ideal for broad crawls. Frontera manages when and what to crawl next, and checks for crawling goal accomplishment. This is especially useful for building a distributed architecture with multiple web spider processes consuming URLs from a frontier.

Once you’re done cheering with joy, read on to see how you can use this upgrade in your stack.

Python 3 Perks and Frontera Installation

This move to Python 3 covers all run modes, workers, message buses, and backends, as well as the HBase, ZeroMQ and Kafka clients. The development process is now a lot more reliable since we have tests that cover all major components as well as integration tests running HBase and Kafka.

Frontera is already available on PyPI. All you need to do is run pip install --upgrade frontera. Then run it with a Python 3 interpreter, and you’re ready to scale your crawlers!

Shiny New Features

The request object is now propagated throughout the whole pipeline, allowing you to schedule requests with custom methods, headers, cookies and body parameters.
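To make "propagated throughout the whole pipeline" concrete, here is a minimal sketch of a request that keeps its method, headers, cookies and body across a serialize/deserialize hop. SketchRequest, encode and decode are hypothetical names invented for this illustration, and JSON is just a convenient wire format here; this is not Frontera's own request class or codec.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a frontier request; not Frontera's class."""
    url: str
    method: str = "GET"
    headers: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)
    body: str = ""
    meta: dict = field(default_factory=dict)


def encode(request):
    """Serialize every request field so nothing is lost between components."""
    return json.dumps(asdict(request))


def decode(payload):
    """Rebuild the request on the receiving side of the pipeline."""
    return SketchRequest(**json.loads(payload))


original = SketchRequest(
    url="https://example.com/search",
    method="POST",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    cookies={"session": "abc123"},
    body="q=frontera",
)
restored = decode(encode(original))
```

The point of the sketch is only that the receiving component sees the same method, headers, cookies and body that were scheduled, rather than a bare URL.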

HBaseQueue now supports delayed requests. Setting the ‘crawl_at’ field in meta to a timestamp makes a request available to spiders only after that time has passed.
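The delayed-request behaviour can be pictured with a toy in-memory queue. This is a sketch of the semantics only: DelayedQueue and its methods are hypothetical, it uses a heap rather than HBase, and it assumes that a request without ‘crawl_at’ is eligible immediately.

```python
import heapq
import itertools


class DelayedQueue:
    """Toy in-memory queue; a hypothetical stand-in, not HBaseQueue."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps

    def schedule(self, url, meta=None, now=0.0):
        # Assumption: requests without 'crawl_at' are eligible right away.
        crawl_at = (meta or {}).get("crawl_at", now)
        heapq.heappush(self._heap, (crawl_at, next(self._counter), url))

    def get_next_requests(self, max_n, now):
        """Return up to max_n URLs whose crawl_at timestamp has passed."""
        ready = []
        while self._heap and len(ready) < max_n and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


queue = DelayedQueue()
queue.schedule("https://example.com/now", now=100.0)
queue.schedule("https://example.com/later", meta={"crawl_at": 160.0}, now=100.0)

first_batch = queue.get_next_requests(10, now=100.0)   # only the immediate URL
second_batch = queue.get_next_requests(10, now=200.0)  # the delayed URL is due
```

Passing the current time in explicitly keeps the sketch deterministic; a real queue would read the clock itself.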

A new option, MESSAGE_BUS_CODEC, lets you choose a message bus codec other than the default (MsgPack) or plug in a custom one.
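In a settings module, switching the codec might look like the fragment below. The option name comes from this post; the module paths are an assumption based on how the Frontera documentation lists its bundled codecs, so check the settings reference for your installed version before relying on them.

```python
# Fragment of a Frontera settings module (sketch).
# Assumed default, matching the MsgPack codec mentioned above:
# MESSAGE_BUS_CODEC = 'frontera.contrib.backends.remote.codecs.msgpack'

# Assumed path of the bundled JSON codec; replace it with the dotted path of
# your own module to use a custom codec.
MESSAGE_BUS_CODEC = 'frontera.contrib.backends.remote.codecs.json'
```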

Upgrades from the Original

Frontera now guarantees that extracted links are assigned exclusively to strategy workers based on each link's hostname.

Links from a specific host will therefore always be assigned to the same strategy worker instance, which prevents errors and greatly simplifies design.
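One common way to get this kind of host-to-worker stickiness is to hash the hostname and take it modulo the number of workers. The sketch below shows that idea under stated assumptions: worker_for is a hypothetical helper, and the CRC32-based scheme is an illustration, not necessarily the partitioner Frontera uses.

```python
from urllib.parse import urlparse
from zlib import crc32


def worker_for(url, num_workers):
    """Map a URL to a strategy worker index using only its hostname."""
    hostname = urlparse(url).hostname or ""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return crc32(hostname.encode("utf-8")) % num_workers


links = [
    "https://example.com/a",
    "https://example.com/b?page=2",
    "https://docs.example.org/index.html",
]
assignments = {link: worker_for(link, num_workers=4) for link in links}
```

Because only the hostname feeds the hash, both example.com links land on the same worker regardless of path or query string, so per-host state never has to be shared between workers.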

Upcoming Improvements

In the long run, we want Frontera and Frontera-based crawlers to be the number one software for large-scale web crawling. Our next step is to ease the deployment of Frontera in Docker environments, including scaling and management.

We’re aiming for Frontera to be easily deployable to the infrastructure of major cloud providers such as Google Cloud Platform and AWS. It’s quite likely we will choose Kubernetes as our orchestration platform. Along with this goal, we will develop a good Web UI to manage and monitor Frontera-based crawlers. So stay tuned!

Wrap Up

Have we piqued your interest? Here’s a quick guide to get started.

Well, what are you waiting for? Take full advantage of Frontera with Python 3 support and start scaling your crawls. Check out our use cases to see what's possible.