
Scrapy Cloud Secrets: Hub Crawl Frontier And How To Use It

Our Scrapy Cloud secrets help you deal with real cases that put your data extraction pipeline at risk, so you can be prepared for every scenario.


Julio Cesar Batista

6 min read · August 6, 2020


Imagine a long crawling process, like extracting data from a website for a whole month. We can start it and leave it running until we get the results.

Still, a whole month is plenty of time for something to go wrong. The target website can go down for a few minutes or hours, your crawling server can suffer a power outage, or the internet connection can fail.

All of these are real-world scenarios that can happen at any moment, putting your data extraction pipeline at risk.

If something like that happens, you may need to restart your crawling process and wait even longer to get access to that precious data. But you don't need to panic: this is where Hub Crawl Frontier (HCF) and our Scrapy Cloud secrets come to the rescue.

What is Hub Crawl Frontier (HCF)?

HCF is an API for storing request data, available through Scrapy Cloud projects. It is a bit similar to Collections, but it is intended specifically for request data rather than being a generic key-value store. At this point, if you are familiar with Scrapy, you may be wondering why one would use HCF when Scrapy can store and recover the crawling state by itself.

The difference is that Scrapy requires you to manage this state yourself by saving it to disk (which needs disk quota), and if you are running inside a container, as in Scrapy Cloud, local files are lost once the process finishes. External storage for requests takes this burden off your shoulders, leaving you to think about the extraction logic rather than how to proceed if the crawl crashes and you need to restart.
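For comparison, Scrapy's built-in persistence is switched on with the JOBDIR setting, which points to a local directory where the crawl state is written. A minimal sketch for settings.py follows; the directory name is an arbitrary example, not something taken from this project.

```python
# settings.py (sketch): Scrapy's built-in pause/resume support.
# The crawl state is written to this local directory, so it only survives
# if the same disk is still available when the job is restarted.
JOBDIR = "crawls/books-run-1"  # example path, pick your own
```

This is the state that disappears when a container is torn down, which is the gap HCF is meant to fill.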

Structure of Hub Crawl Frontier

Before digging into an example of how to use HCF, I'll go over how it is structured. We can create many Frontiers per project, and each one needs a name. Usually, the name will be the name of the spider, to avoid any confusion. Each Frontier is then broken into slots, something similar to sharding, which can be useful in a producer-consumer scenario (the topic of one of our upcoming blog posts). The catch is that we shouldn't change the number of slots after a Frontier has been created, so keep that in mind when creating it.
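To make the slot idea concrete, here is a small sketch of hash-based sharding in plain Python. It is an illustration only: the slot_for helper and the choice of SHA-1 are assumptions of this example, not necessarily the algorithm HCF or hcf-backend uses. It hints at why the number of slots should stay fixed: the same URL can land in a different slot once the count changes.

```python
import hashlib


def slot_for(url: str, number_of_slots: int) -> str:
    """Map a URL to a slot name by hashing it (hypothetical helper)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return str(int(digest, 16) % number_of_slots)


urls = [f"http://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 51)]

# With a fixed slot count, a URL always maps to the same slot.
assert all(slot_for(u, 4) == slot_for(u, 4) for u in urls)

# After changing the slot count, some URLs map to a different slot.
moved = [u for u in urls if slot_for(u, 4) != slot_for(u, 5)]
print(f"{len(moved)} of {len(urls)} URLs would change slot")
```

With a single slot, as used in the rest of this article, every request simply goes to slot 0.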

Using HCF

Now that we know what HCF is and how we could make use of it, it is time to see it working. For this purpose, we'll build a simple Scrapy spider to extract book information from http://books.toscrape.com. To get started, we'll create a new Scrapy project and install the required dependencies as shown below (type the commands in your terminal).

    # setup
    mkdir hcf_example
    cd hcf_example
    python3 -m venv .venv  # or your favorite virtual env
    source .venv/bin/activate

    # project
    pip install scrapy scrapy-frontera hcf-backend
    scrapy startproject hcf_example .
    scrapy genspider books.toscrape.com books.toscrape.com

The commands above create a new directory for our project and a new virtual environment, to avoid messing up our operating system. Then they install Scrapy and the libraries needed to use HCF. Finally, they create a new Scrapy project and a spider. A side note on the extra libraries for HCF: there are a couple of libraries we could use, like scrapy-hcf, but it seems to have been unmaintained for a while. So, we'll be using scrapy-frontera with HCF as a backend through hcf-backend.

Given that our project was successfully created and the dependencies were installed, we can write a minimal spider to extract the book data as shown in the following code snippet.

    import scrapy


    class BooksToscrapeComSpider(scrapy.Spider):
        name = 'books.toscrape.com'
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            # Follow every book link on the listing page.
            for href in response.css('.product_pod h3 a::attr(href)').getall():
                yield response.follow(href, self.parse_book)

            # Follow the pagination link, if there is one.
            next_page_href = response.css('.pager .next a::attr(href)').get()
            if next_page_href:
                yield response.follow(next_page_href, self.parse)

        def parse_book(self, response):
            # Extract the title and price from a book page.
            return {
                'title': response.css('.product_main h1::text').get().strip(),
                'price': response.css('.product_main .price_color::text').get().strip()
            }

If you are familiar with Scrapy, there's nothing fancy in the code above. It is just a simple spider that navigates the listing pages and follows the book links to extract each title and price.

We can run this spider from the terminal by typing scrapy crawl books.toscrape.com, and we should see the result there (no errors and 1,000 items extracted). So far, we're not interacting with HCF; the following changes configure it. First, we'll need to update our project's settings.py file with the following.

    HCF_AUTH = 'YOUR API KEY HERE'
    HCF_PROJECT_ID = 'YOUR SCRAPY CLOUD PROJECT ID'
    SCHEDULER = 'scrapy_frontera.scheduler.FronteraScheduler'
    BACKEND = 'hcf_backend.HCFBackend'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_frontera.middlewares.SchedulerDownloaderMiddleware': 0,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_frontera.middlewares.SchedulerSpiderMiddleware': 0,
    }
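The placeholder values for the API key and project ID don't have to be hard-coded. As a minimal sketch, assuming you export them as environment variables, settings.py could read them at startup. The variable names SH_APIKEY and PROJECT_ID are this example's choice; they happen to match the ones used with the hcfpal commands later in this article.

```python
import os

# settings.py (sketch): read credentials from the environment instead of
# committing them to the repository. The variable names are an assumption
# of this example; an unset variable falls back to an empty string.
HCF_AUTH = os.environ.get("SH_APIKEY", "")
HCF_PROJECT_ID = os.environ.get("PROJECT_ID", "")
```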

The SCHEDULER, SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES settings are set so scrapy-frontera works. Then, we set HCF as the BACKEND and add the proper Scrapy Cloud API key (HCF_AUTH) and the project in which we're creating the Frontier (HCF_PROJECT_ID). With these settings in place, we can update our spider so it starts interacting with HCF. If you run the spider now, you'll see some new logs, but it won't be storing the requests in HCF yet. The following changes should be applied in the books_toscrape_com.py file.

    # Add these as class attributes of BooksToscrapeComSpider.
    frontera_settings = {
        'HCF_PRODUCER_FRONTIER': 'books_toscrape_com',
        'HCF_PRODUCER_NUMBER_OF_SLOTS': 1,
        'HCF_CONSUMER_FRONTIER': 'books_toscrape_com',
        'HCF_CONSUMER_SLOT': '0'
    }

    custom_settings = {
        'FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER': ['parse', 'parse_book'],
    }

Recall that we are using scrapy-frontera to interact with HCF, which is why we need to set frontera_settings. Basically, we're setting the name of the Frontier where we store the requests (HCF_PRODUCER_FRONTIER) and the one we consume them from (HCF_CONSUMER_FRONTIER). The HCF_PRODUCER_NUMBER_OF_SLOTS setting is the number of slots to create for this producer, in this case only one, and HCF_CONSUMER_SLOT is the slot we consume from, which is slot 0 (there is only one slot, and numbering starts at 0). Finally, we need to tell scrapy-frontera which requests it should send to the backend, which it decides by looking at the request callback. If the callback's name is listed in FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER, the request is sent to the backend; otherwise it is processed as a local Scrapy request.
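The routing rule just described can be pictured with a few lines of plain Python. This is only an illustration of the decision, not scrapy-frontera's actual code; the destination helper and the parse_login callback are hypothetical names introduced for the example.

```python
# Callback names that should be routed to the frontier (as in custom_settings).
CALLBACKS_TO_FRONTIER = {"parse", "parse_book"}


def destination(callback_name: str) -> str:
    """Return where a request with this callback would go (illustrative only)."""
    return "frontier" if callback_name in CALLBACKS_TO_FRONTIER else "local"


for name in ("parse", "parse_book", "parse_login"):
    print(f"{name} -> {destination(name)}")
# parse -> frontier
# parse_book -> frontier
# parse_login -> local
```

In our spider both callbacks are listed, so every request it yields ends up in HCF.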

That's it: we can now run our spider and it will store the requests in HCF. Just run the spider as we did before and it should work! But how can we tell that the requests were sent to HCF? For that, hcf-backend comes with a handy tool called hcfpal. From your terminal, run the command below and you should see the Frontier name.

    PROJECT_ID="<YOUR PROJECT ID>" SH_APIKEY="<YOUR API KEY>" python -m hcf_backend.utils.hcfpal list

There are some other commands available in hcfpal, like counting the requests in a given Frontier.

    PROJECT_ID="<YOUR PROJECT ID>" SH_APIKEY="<YOUR API KEY>" python -m hcf_backend.utils.hcfpal count books_toscrape_com

It will show you the request count per slot and total count (in case you have more than one slot).

Incremental crawl

Since we are storing the requests in HCF so the crawl can be restarted later, the same setup also works as an example of incremental crawling. There is no need for special logic: just run the spider and it should fetch only new content. As in Scrapy, requests are identified by their fingerprint. There is one catch when working with multiple slots: a request is only unique within a given slot (but we won't bother with that for now and will leave it for a future article). To get started, let's clean our Frontier by typing the following in our terminal.

    PROJECT_ID="<YOUR PROJECT ID>" SH_APIKEY="<YOUR API KEY>" python -m hcf_backend.utils.hcfpal delete books_toscrape_com 0

Once that's done, run the spider but stop it before it finishes (simulating an early stop). To do so, press CTRL + C once in the terminal. This sends Scrapy a signal to shut down gracefully. Then wait a bit for the crawling process to finish. When it does, it logs the stats to the terminal, and we can use them to understand what happened.

For example, by looking at item_scraped_count I can see that 80 items were extracted. Also, pay attention to the stats starting with hcf/consumer and hcf/producer. These relate to the URLs found in our run: how many were processed/extracted (consumed) and how many were discovered/stored (produced). In my case, it consumed 84 requests and found 105 links (all new, as we had cleaned the Frontier before running).

After inspecting the stats, run the spider once again, without deleting the Frontier, and wait for it to finish. You should see that item_scraped_count covers only what the previous crawl left behind (in my case, 920 items: the 1,000 total minus the 80 already extracted). This happened because the duplicate requests were filtered by HCF and so weren't processed again.

You should also see similar behavior in the hcf/consumer and hcf/producer stats, showing that some links were extracted but not all of them are new.
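The duplicate filtering described above, including the per-slot catch, can be sketched with an in-memory stand-in. Everything here is hypothetical: ToyFrontier is not part of hcf-backend, and the fingerprint function is a simplified substitute for Scrapy's real request fingerprint.

```python
import hashlib
from collections import defaultdict


def fingerprint(method: str, url: str) -> str:
    """Simplified stand-in for a request fingerprint (not Scrapy's exact algorithm)."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class ToyFrontier:
    """In-memory frontier that remembers fingerprints per slot (hypothetical)."""

    def __init__(self) -> None:
        self._seen = defaultdict(set)

    def add(self, slot: str, method: str, url: str) -> bool:
        """Store the request; return False if this slot has already seen it."""
        fp = fingerprint(method, url)
        if fp in self._seen[slot]:
            return False
        self._seen[slot].add(fp)
        return True


frontier = ToyFrontier()
url = "http://books.toscrape.com/catalogue/page-2.html"
print(frontier.add("0", "GET", url))  # True: new request in slot 0
print(frontier.add("0", "GET", url))  # False: duplicate within slot 0
print(frontier.add("1", "GET", url))  # True: slot 1 has not seen it yet
```

The last call shows why uniqueness is per slot: the same request is accepted again when it is written to a different slot.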

Finally, you can run the spider once more and it will just stop, logging no items scraped, because all the links it extracts were already processed in the previous runs. So, there is no new data to be processed and it finishes.
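The three runs can be summarised with a toy model in which the pending queue outlives each run, playing the role of the Frontier. This is a sketch in plain Python with no HCF involved; the run_crawl helper is hypothetical, and the numbers simply mirror my run (1,000 books, first run stopped after 80 items).

```python
from collections import deque


def run_crawl(pending: deque, done: set, stop_after=None) -> int:
    """Process pending requests, optionally stopping early; return items scraped."""
    scraped = 0
    while pending and (stop_after is None or scraped < stop_after):
        url = pending.popleft()
        done.add(url)
        scraped += 1
    return scraped


# State that outlives a single run, playing the role of the frontier.
pending = deque(f"book-{n}" for n in range(1000))
done = set()

first = run_crawl(pending, done, stop_after=80)  # interrupted run
second = run_crawl(pending, done)                # resumes where the first stopped
third = run_crawl(pending, done)                 # nothing left to do
print(first, second, third)  # 80 920 0
```

Because the queue lives outside the process, a restart picks up the remaining work instead of starting from zero.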

Wrapping up

HCF is a kind of external storage for requests that is available in Scrapy Cloud projects and can be used by Scrapy spiders. There are many use cases for it, and we've covered two of them: recovering a crawling process and incremental crawling. In a future article, we'll explore in more depth how to configure HCF in our projects and how to use it in a producer-consumer architecture. If you are interested, I invite you to check the Shub Workflow basic tutorial (which has some information similar to this tutorial) and the Frontera docs.

Learn more

If you want to learn more about web data extraction and how it can serve your business you can check out our solutions to see how others are making use of web data. Also, if you’re considering outsourcing web scraping, you can watch our on-demand webinar to help you decide between in-house vs outsourced web data extraction.
