Frontera: The brain behind the crawls

Posted on April 22, 2015 by Pablo Hoffman

At Zyte we're always building and running large crawls: last year we had 11 billion requests made on Scrapy Cloud alone. Crawling millions of pages from the internet requires more sophistication than grabbing a few contacts off a list, as we need to make sure that we get reliable data and up-to-date lists of item pages, and that we optimize the crawl as much as possible.

From these complex projects emerge technologies that can be used across all of our spiders, and we're very pleased to release Frontera, a flexible frontier for web crawlers.

Frontera, formerly Crawl Frontier, is an open-source framework we developed to facilitate building a crawl frontier, helping manage our crawling logic and sharing it between spiders in our Scrapy projects.

What is a crawl frontier?

A crawl frontier is the system in charge of the logic and policies to follow when crawling websites, and plays a key role in more sophisticated crawling systems. It allows us to set rules about what pages should be crawled next, visiting priorities and ordering, how often pages are revisited, and any behaviour we may want to build into the crawl.

While Frontera was originally designed for use with Scrapy, it’s completely agnostic and can be used with any other crawling framework or standalone project.

In this post, we’re going to demonstrate how Frontera can improve the way you crawl using Scrapy. We’ll show you how you can use Scrapy to scrape articles from Hacker News while using Frontera to ensure the same articles aren’t visited again in subsequent crawls.

The frontier needs to be initialized with a set of starting URLs (seeds), after which the crawler asks the frontier which pages it should visit next. As the crawler visits pages, it reports back to the frontier with each page’s response and the URLs extracted from it.

The frontier will decide how to use this information according to the defined logic. This process continues until an end condition is reached. Some crawlers may never stop; we refer to these as continuous crawls.
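To make that conversation concrete, here is a minimal, self-contained sketch of the crawler/frontier loop. It is not Frontera's actual API: the SimpleFrontier class, the fetch() stub, and its hard-coded link data are all hypothetical, purely to illustrate how seeds, next-page requests, and extracted links flow back and forth.

from collections import deque

# Fake link data so the example runs without any network access
# (both the data and the fetch() stub are purely illustrative).
FAKE_WEB = {
    'https://news.ycombinator.com/': (200, ['https://news.ycombinator.com/news?p=2']),
    'https://news.ycombinator.com/news?p=2': (200, []),
}

def fetch(url):
    # Pretend to download a page; returns (status_code, extracted_links).
    return FAKE_WEB.get(url, (404, []))

class SimpleFrontier(object):
    # Toy frontier: FIFO ordering, never revisits a URL. Not Frontera's API.

    def __init__(self, seeds):
        self.queue = deque(seeds)
        self.seen = set(seeds)

    def get_next_requests(self, count):
        batch = []
        while self.queue and len(batch) < count:
            batch.append(self.queue.popleft())
        return batch

    def page_crawled(self, url, status, links):
        # The frontier decides what to do with the results; here we simply
        # enqueue any link we haven't seen before.
        for link in links:
            if link not in self.seen:
                self.seen.add(link)
                self.queue.append(link)

crawled = []
frontier = SimpleFrontier(['https://news.ycombinator.com/'])
while True:
    batch = frontier.get_next_requests(count=10)
    if not batch:  # end condition reached; a continuous crawl would keep waiting instead
        break
    for url in batch:
        status, links = fetch(url)
        frontier.page_crawled(url, status, links)
        crawled.append((url, status))

Frontera plays the role of the frontier in this loop, with real backends deciding the ordering and revisit policy instead of a plain FIFO queue.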

Creating a Spider for HackerNews

Hopefully, you're now familiar with what Frontera does. If not, take a look at this textbook section for more theory on how a crawl frontier works.

You can check out the project we'll be developing in this example from GitHub.

Let’s start by creating a new project and spider:

scrapy startproject hn_scraper
cd hn_scraper
scrapy genspider HackerNews news.ycombinator.com

You should have a directory structure similar to the following:

hn_scraper
hn_scraper/hn_scraper
hn_scraper/hn_scraper/__init__.py
hn_scraper/hn_scraper/__init__.pyc
hn_scraper/hn_scraper/items.py
hn_scraper/hn_scraper/pipelines.py
hn_scraper/hn_scraper/settings.py
hn_scraper/hn_scraper/settings.pyc
hn_scraper/hn_scraper/spiders
hn_scraper/hn_scraper/spiders/__init__.py
hn_scraper/hn_scraper/spiders/__init__.pyc
hn_scraper/hn_scraper/spiders/HackerNews.py
hn_scraper/scrapy.cfg

Due to the way the spider template is set up, your start_urls in spiders/HackerNews.py will look like this:

start_urls = (
    'http://www.news.ycombinator.com/',
)

So you will want to correct it like so:

start_urls = (
    'https://news.ycombinator.com/',
)

We also need to create an item definition for the article we're scraping:

items.py

import scrapy


class HnArticleItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    item_id = scrapy.Field()

Here the url field refers to the outbound URL, the title to the article's title, the author to the submitter's HN username, and the item_id to HN's item ID.
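For illustration, a populated item might look something like the following; the values here are made up, not taken from a real crawl:

# Hypothetical example of a yielded HnArticleItem (illustrative values only)
item = HnArticleItem()
item['url'] = 'http://example.com/some-article'
item['title'] = 'Show HN: An example article'
item['author'] = 'someuser'
item['item_id'] = 9375093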

We then need to define a link extractor so Scrapy will know which links to follow and extract data from.

Hacker News doesn’t make use of CSS classes for each item row, and the article's item URL, author, and comment count are on a separate row from the article title and outbound URL, so we’ll need to use XPath in this case.

First, let's gather all of the rows containing a title and outbound URL. If you inspect the DOM, you will notice these rows contain 3 cells, whereas the subtext rows contain 2 cells. So we can use something like the following:

selector = Selector(response)
rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]'
                      '//table[count(tr) > 1]//tr[count(td) = 3]')

We then iterate over each row, retrieving the article URL and title. We also need to retrieve the item URL and author from the subtext row, which we can find using the following-sibling axis. You should create a method similar to the following:

def parse_item(self, response):
    selector = Selector(response)
    rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]'
                          '//table[count(tr) > 1]//tr[count(td) = 3]')
    for row in rows:
        item = HnArticleItem()

        article = row.xpath('td[@class="title" and count(a) = 1]//a')
        item['url'] = self.extract_one(article, './@href', '')
        item['title'] = self.extract_one(article, './text()', '')

        subtext = row.xpath(
            './following-sibling::tr[1]'
            '//td[@class="subtext" and count(a) = 3]')
        if subtext:
            item_author = self.extract_one(subtext, './/a[1]/@href', '')
            item_id = self.extract_one(subtext, './/a[2]/@href', '')
            item['author'] = item_author[8:]    # strip the "user?id=" prefix
            item['item_id'] = int(item_id[8:])  # strip the "item?id=" prefix

        yield item

The extract_one method is a helper function to extract the first result:

def extract_one(self, selector, xpath, default=None):
    extracted = selector.xpath(xpath).extract()
    if extracted:
        return extracted[0]
    return default

There’s currently a bug in Frontera's SQLAlchemy middleware where callbacks aren’t called, so for now we need to inherit from Spider, override the parse method, and have it call our parse_item function. Here's an example of what the spider should look like:

spiders/HackerNews.py

# -*- coding: utf-8 -*-
from scrapy.http import Request
from scrapy.spider import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from hn_scraper.items import HnArticleItem


class HackernewsSpider(Spider):
    name = "HackerNews"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = ('https://news.ycombinator.com/', )

    link_extractor = SgmlLinkExtractor(
        allow=('news', ),
        restrict_xpaths=('//a[text()="More"]', ))

    def extract_one(self, selector, xpath, default=None):
        extracted = selector.xpath(xpath).extract()
        if extracted:
            return extracted[0]
        return default

    def parse(self, response):
        # Follow the "More" pagination links ourselves and hand every
        # response to parse_item, since callbacks aren't invoked while
        # the Frontera scheduler bug mentioned above is present.
        for link in self.link_extractor.extract_links(response):
            request = Request(url=link.url)
            request.meta.update(link_text=link.text)
            yield request
        for item in self.parse_item(response):
            yield item

    def parse_item(self, response):
        selector = Selector(response)
        rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]'
                              '//table[count(tr) > 1]//tr[count(td) = 3]')
        for row in rows:
            item = HnArticleItem()

            article = row.xpath('td[@class="title" and count(a) = 1]//a')
            item['url'] = self.extract_one(article, './@href', '')
            item['title'] = self.extract_one(article, './text()', '')

            subtext = row.xpath(
                './following-sibling::tr[1]'
                '//td[@class="subtext" and count(a) = 3]')
            if subtext:
                item_author = self.extract_one(subtext, './/a[1]/@href', '')
                item_id = self.extract_one(subtext, './/a[2]/@href', '')
                item['author'] = item_author[8:]    # strip the "user?id=" prefix
                item['item_id'] = int(item_id[8:])  # strip the "item?id=" prefix

            yield item

Enabling Frontera in Our Project

Now all we need to do is configure the Scrapy project to use Frontera with the SQLAlchemy backend. First, install Frontera:

pip install frontera

Next, enable Frontera's middlewares and scheduler by adding the following to settings.py:

SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}

SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
})
DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
})

SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

FRONTERA_SETTINGS = 'hn_scraper.frontera_settings'

Next, create a file named frontera_settings.py, as specified above in FRONTERA_SETTINGS, to store any settings related to the frontier:

BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///hn_frontier.db'
MAX_REQUESTS = 2000
MAX_NEXT_REQUESTS = 10
DELAY_ON_EMPTY = 0.0

Here we specify hn_frontier.db as the SQLite database file, which is where Frontera will store pages it has crawled.
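SQLALCHEMYBACKEND_ENGINE is an SQLAlchemy engine URL, so in principle you should be able to point it at any database SQLAlchemy supports. The PostgreSQL line below is an untested illustration with placeholder credentials, not something used in this tutorial:

# SQLite file, as used in this tutorial
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///hn_frontier.db'

# Hypothetical alternative: a PostgreSQL database (placeholder credentials)
# SQLALCHEMYBACKEND_ENGINE = 'postgresql://user:password@localhost/hn_frontier'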

Running the Spider

Let’s run the spider:

scrapy crawl HackerNews -o results.csv -t csv

You can review the items being scraped in results.csv while the spider is running.

You will notice that the hn_frontier.db file we specified earlier has been created. You can browse it using the sqlite3 command-line tool:

sqlite> attach "hn_frontier.db" as hns;
sqlite> .tables
hns.pages
sqlite> select * from hns.pages;
https://news.ycombinator.com/|f1f3bd09de659fc955d2db1e439e3200802c4645|0|20150413231805460038|200|CRAWLED|
https://news.ycombinator.com/news?p=2|e273a7bbcf16fdcdb74191eb0e6bddf984be6487|1|20150413231809316300|200|CRAWLED|
https://news.ycombinator.com/news?p=3|f804e8cd8ff236bb0777220fb241fcbad6bf0145|2|20150413231810321708|200|CRAWLED|
https://news.ycombinator.com/news?p=4|5dfeb8168e126c5b497dfa48032760ad30189454|3|20150413231811333822|200|CRAWLED|
https://news.ycombinator.com/news?p=5|2ea8685c1863fca3075c4f5d451aa286f4af4261|4|20150413231812425024|200|CRAWLED|
https://news.ycombinator.com/news?p=6|b7ca907cc8b5d1f783325d99bc3a8d5ae7dcec58|5|20150413231813312731|200|CRAWLED|
https://news.ycombinator.com/news?p=7|81f45c4153cc8f2a291157b10bdce682563362f1|6|20150413231814324002|200|CRAWLED|
https://news.ycombinator.com/news?p=8|5fbe397d005c2f79829169f2ec7858b2a7d0097d|7|20150413231815443002|200|CRAWLED|
https://news.ycombinator.com/news?p=9|14ee3557a2920b62be3fd521893241c43864c728|8|20150413231816426616|200|CRAWLED|

As shown above, the database has a single pages table, which stores each URL together with its fingerprint, crawl timestamp, and response code. This schema is specific to the SQLAlchemy backend; other backends may use different schemas, and some don't persist crawled pages at all.

Frontera backends aren't limited to storing crawled pages; they're the core component of Frontera and hold all of the crawl frontier logic you want to apply, so the backend you choose is heavily tied to what you want to achieve with Frontera.

In many cases, you will want to create your own backend. This is a lot easier than it sounds, and you can find all the information you need in the documentation.
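As a rough illustration of the kind of policy a custom backend encapsulates, the sketch below orders candidate requests with a priority queue that prefers shallow pages. It deliberately avoids Frontera's real base classes and method signatures (see the documentation for those); the DepthPriorityPolicy class and its methods are hypothetical.

import heapq
from itertools import count

class DepthPriorityPolicy(object):
    # Toy ordering policy: crawl shallower pages first. Not Frontera's API.

    def __init__(self):
        self._heap = []
        self._tie_breaker = count()  # keeps ordering stable for equal depths

    def schedule(self, url, depth):
        # Lower depth -> higher priority.
        heapq.heappush(self._heap, (depth, next(self._tie_breaker), url))

    def next_requests(self, max_requests):
        batch = []
        while self._heap and len(batch) < max_requests:
            depth, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch

policy = DepthPriorityPolicy()
policy.schedule('https://news.ycombinator.com/news?p=2', depth=1)
policy.schedule('https://news.ycombinator.com/', depth=0)
print(policy.next_requests(max_requests=10))
# ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

A real backend would wire logic like this into Frontera's storage and request lifecycle; the documentation describes the exact interface to implement.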

Hopefully, this tutorial has given you a good insight into Frontera and how you can use it to improve the way you manage your crawling logic. Feel free to check out the code and docs. If you run into a problem please report it at the issue tracker.