PINGDOM_CHECK

#ExtractSummit2026 The world's largest web scraping conference returns. Austin Oct 7–8 · Dublin Nov 10–11.

Register now
Data Services
Pricing
Login
Try Zyte APIContact Sales
  • Unblocking and Extraction

    Zyte API

    The ultimate API for web scraping. Avoid website bans and access a headless browser or AI Parsing

    Ban Handling

    Headless Browser

    AI Extraction

    SERP

    Enterprise

    DocumentationSupport

    Hosting and Deployment

    Scrapy Cloud

    Run, monitor, and control your Scrapy spiders however you want to.

    Coding Agent Add-Ons

    Agentic Web Data

    Plugins that give coding agents the context to build production Scrapy projects. Starts with Claude Code.

  • Data Services
  • Pricing
  • Browse

    • BlogArticles, podcasts, videos
    • Case studiesCustomer outcomes
    • White papersIn-depth reports
    • EventsConferences, webinars, recordings

    Subscribe

    • NewsletterSwiftly delivered
    • Discord communityExtract Data community
  • Product and E-commerce

    From e-commerce and online marketplaces

    Data for AI

    Collect and structure web data to feed AI

    Job Posting

    From job boards and recruitment websites

    Real Estate

    From Listings portals and specialist websites

    News and Article

    From online publishers and news websites

    Search

    Search engine results page data (SERP)

    Social Media

    From social media platforms online

  • Meet Zyte

    Our story, people and values

    Contact us

    Get in touch

    Support

    Knowledge base and raise support tickets

    Terms and Policies

    Accept our terms and policies

    Open Source

    Our open source projects and contributions

    Web Data Compliance

    Guidelines and resources for compliant web data collection

    Join the team building the future of web data
    We're Hiring
    Trust Center
    Security, compliance & certifications
Login
Try Zyte APIContact Sales

Zyte Developers

Coding tools & hacks straight to your inbox

Become part of the community and receive a bi-weekly dosage of all things code.

Join us
    • Zyte Data
    • News & Articles
    • Search
    • Social Media
    • Product
    • Data for AI
    • Job Posting
    • Real Estate
    • Zyte API - Ban Handling
    • Zyte API - Headless Browser
    • Zyte API - AI Extraction
    • Web Scraping Copilot
    • Zyte API Enterprise
    • Scrapy Cloud
    • Solution Overview
    • Blog
    • Webinars
    • Case Studies
    • White Papers
    • Documentation
    • Web Scraping Maturity Self-Assesment
    • Web Data compliance
    • Meet Zyte
    • Jobs
    • Terms and Policies
    • Trust Center
    • Support
    • Contact us
    • Pricing
    • Do not sell
    • Cookie settings
    • Sign up
    • Talk to us
    • Cost estimator
All articles
AI60, 60 articles
Data quality13, 13 articles
Developer interest57, 57 articles
Integration2, 2 articles
Open-source40, 40 articles
Proxies29, 29 articles
Scraping practice17, 17 articles
Scraping strategy26, 26 articles
Web data60, 60 articles
Web scraping APIs33, 33 articles
Zyte API59, 59 articles
Scrapy48, 48 articles
Scrapy Cloud10, 10 articles
Web Scraping Copilot12, 12 articles
AI & Machine Learning1, 1 articles
Automotive2, 2 articles
E-commerce & retail26, 26 articles
Entertainment & Streaming2, 2 articles
Financial Services8, 8 articles
Government2, 2 articles
Market Research & Intelligence3, 3 articles
Media & publishing8, 8 articles
Real Estate2, 2 articles
Recruitment & HR3, 3 articles
Transportation & Logistics2, 2 articles
Travel & hospitality2, 2 articles
Extract Summit25, 25 articles
PyCon1, 1 articles

Appearance

Discord Community
BlogOpen-sourceHow to crawl the web with Scrapy
ArticleTutorial / How-toOpen-source

How to crawl the web with Scrapy

How to crawl the web with Scrapy without harming websites. Here are a few tips for netizens who want to build polite and considerate web crawlers.

V

Valdir Stumm Junior

6 min read · August 25, 2016

How to crawl the web with Scrapy

How to crawl the web politely with Scrapy

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners.

In this post, we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers.

Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxs, WALL-Es, and R2-D2s rather than an apocalyptic wasteland of HAL 9000s, T-1000s, and Megatrons.

Robots

What makes a crawler polite?

A polite crawler respects robots.txt
A polite crawler never degrades a website’s performance
A polite crawler identifies its creator with contact information
A polite crawler is not a pain in the buttocks of system administrators

robots.txt

Always make sure that your crawler follows the rules defined in the website's robots.txt file. This file is usually available at the root of a website (www.example.com/robots.txt) and it describes what a crawler should or shouldn't crawl according to the Robots Exclusion Standard. Some websites even use the crawlers’ user agent to specify separate rules for different web crawlers:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

User-agent: Some_Annoying_Bot

Disallow: /

User-Agent: *

Disallow: /*.json

Disallow: /api

Disallow: /post

Disallow: /submit

Allow: /

User-agent: Some_Annoying_Bot Disallow: / User-Agent: * Disallow: /*.json Disallow: /api Disallow: /post Disallow: /submit Allow: /

1User-agent: Some\_Annoying\_Bot
2Disallow: /
3
4User-Agent: \*
5Disallow: /\*.json
6Disallow: /api
7Disallow: /post
8Disallow: /submit
9Allow: /
Copy

Crawl-delay

Mission-critical to having a polite crawler is making sure your crawler doesn't hit a website too hard. Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.

When a website gets overloaded with more requests that the web server can handle, it might become unresponsive. Don’t be that guy or girl that causes a headache for the website administrators.

User-agent

However, if you have ignored the cardinal rules above (or your crawler has achieved aggressive sentience), there needs to be a way for the website owners to contact you. You can do this by including your company name and an email address or website in the request's User-Agent header. For example, Google's crawler user agent is "Googlebot".

Zyte abuse report form

Hey, all the folks using our Scrapy Cloud platform! We trust you will crawl responsibly, but to support website administrators, we provide an abuse report form where they can report any misbehavior from crawlers running on our platform. We’ll kindly pass the message along so that you can modify your crawls and avoid ruining a sysadmin’s day. If your crawler’s are turning into Skynet and running roughshod over human law, we reserve the right to halt their crawling activities and thus avert the robot apocalypse.

How to be polite using Scrapy

scrapy

Robots.txt

Crawlers created using Scrapy 1.1+ already respect robots.txt by default. If your crawlers have been generated using a previous version of Scrapy, you can enable this feature by adding this in the project's settings.py:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

ROBOTSTXT_OBEY = True

ROBOTSTXT_OBEY = True

1ROBOTSTXT\_OBEY = True
Copy

Then, every time your crawler tries to download a page from a disallowed URL, you'll see a message like this:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>

12016-08-19 16:12:56 \[scrapy\] DEBUG: Forbidden by robots.txt: &lt;GET http://website.com/login&gt;
Copy

Identifying your crawler

It’s important to provide a way for sysadmins to easily contact you if they have any trouble with your crawler. If you don’t, they'll have to dig into their logs and look for the offending IPs.

Be nice to the friendly sysadmins in your life and identify your crawler via the Scrapy USER_AGENT setting. Share your crawler name, company name, and a contact email:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'

1USER\_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
Copy

Introducing delays

Scrapy spiders are blazingly fast. They can handle many concurrent requests and they make the most of your bandwidth and computing power. However, with great power comes great responsibility.

To avoid hitting the web servers too frequently, you need to use the DOWNLOAD_DELAY setting in your project (or in your spiders). Scrapy will then introduce a random delay ranging from 0.5 * DOWNLOAD_DELAY to 1.5 * DOWNLOAD_DELAY seconds between consecutive requests to the same domain. If you want to stick to the exact DOWNLOAD_DELAY that you defined, you have to disable RANDOMIZE_DOWNLOAD_DELAY.

By default, DOWNLOAD_DELAY is set to 0. To introduce a 5-second delay between requests from your crawler, add this to your settings.py:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

DOWNLOAD_DELAY = 5.0

DOWNLOAD_DELAY = 5.0

1DOWNLOAD\_DELAY = 5.0
Copy

If you have a multi-spider project crawling multiple sites, you can define a different delay for each spider with the download_delay (yes, it's lowercase) spider attribute:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

class MySpider(scrapy.Spider):

name = 'myspider'

download_delay = 5.0

...

class MySpider(scrapy.Spider): name = 'myspider' download_delay = 5.0 ...

1class MySpider(scrapy.Spider):
2    name = 'myspider'
3    download\_delay = 5.0
4    ...
Copy

Concurrent requests per domain

Another setting you might want to tweak to make your spider more polite is the number of concurrent requests it will do for each domain. By default, Scrapy will dispatch at most 8 requests simultaneously to any given domain, but you can change this value by updating the CONCURRENT_REQUESTS_PER_DOMAIN setting.

Heads up, the CONCURRENT_REQUESTS setting defines the maximum amount of simultaneous requests that Scrapy's downloader will do for all your spiders. Tweaking this setting is more about your own server performance/bandwidth than your target's when you're crawling multiple domains at the same time.

AutoThrottle to save the day

Websites vary drastically in the number of requests they can handle. Adjusting this manually for every website that you are crawling is about as much fun as watching paint dry. To save your sanity, Scrapy provides an extension called AutoThrottle.

AutoThrottle automatically adjusts the delays between requests according to the current web server load. It first calculates the latency from one request. Then it will adjust the delay between requests for the same domain in a way that no more than AUTOTHROTTLE_TARGET_CONCURRENCY requests will be simultaneously active. It also ensures that requests are evenly distributed in a given time span.

To enable AutoThrottle, just include this in your project's settings.py:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

AUTOTHROTTLE_ENABLED = True

AUTOTHROTTLE_ENABLED = True

1AUTOTHROTTLE\_ENABLED = True
Copy

Scrapy Cloud users don't have to worry about enabling it, because it's already enabled by default.

There’s a wide range of settings to help you tweak the throttle mechanism, so have fun playing around!

Use an HTTP cache for development

Developing a web crawler is an iterative process. However, running a crawler to check if it’s working means hitting the server multiple times for each test. To help you to avoid this impolite activity, Scrapy provides a built-in middleware called HttpCacheMiddleware. You can enable it by including this in your project's settings.py:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

HTTPCACHE_ENABLED = True

HTTPCACHE_ENABLED = True

1HTTPCACHE\_ENABLED = True
Copy

Once enabled, it caches every request made by your spider along with the related response. So the next time you run your spider, it will not hit the server for requests already done. It's a win-win: your tests will run much faster and the website will save resources.

Don't crawl, use the API

Many websites provide HTTP APIs so that third parties can consume their data without having to crawl their web pages. Before building a web scraper, check if the target website already provides an HTTP API that you can use. If it does, go with the API. Again, it's a win-win: you avoid digging into the page’s HTML and your crawler gets more robust because it doesn’t need to depend on the website’s layout.

Wrap up

Let’s all do our part to keep the peace between sysadmins, website owners, and developers by making sure that our web crawling projects are as noninvasive as possible. Remember, we need to band together to delay the rise of our robot overlords, so let’s keep our crawlers, spiders, and bots polite.

image03

To all website owners, help a crawler out and ensure your site has an HTTP API. And remember, if someone using our platform is overstepping their bounds, contact us and we’ll take care of the issue.

For those new to our platform, Scrapy Cloud is the peanut butter to Scrapy’s jelly. For our existing Scrapy and Scrapy Cloud users, hopefully, you learned a few tips for how to both speed up your crawls and prevent abuse complaints. Let us know if you have any further suggestions in the comment section below!

Try Zyte API

Build your first scraper in minutes

Free trial, no credit card. From a single request to production in an afternoon.

Get started
Open-source
V

Valdir Stumm Junior

More from this author

In this article

  • What makes a crawler polite?
  • robots.txt
  • Crawl-delay
  • User-agent
  • Zyte abuse report form
  • How to be polite using Scrapy
  • Robots.txt
  • Identifying your crawler
  • Introducing delays
  • Concurrent requests per domain
  • AutoThrottle to save the day
  • Use an HTTP cache for development
  • Don't crawl, use the API
  • Wrap up

Follow

Get the latest

Zyte and the data web in your inbox — or wherever you already are.

Subscribe

Or follow elsewhere

Continue reading

Scrapy in 2026: New release brings modern async crawling standards
Open Source

Scrapy in 2026: New release brings modern async crawling standards

Scrapy 2.14.0 is released with a major under-the-hood modernization. Say goodbye to Twisted Deferreds.

Robert Andrews·6 min·January 12, 2026
The new economics of web data: Smaller scraping just got cheaper
Open Source

The new economics of web data: Smaller scraping just got cheaper

Smarter tools and AI-driven automation are rewriting the rules of web scraping. As costs fall and setup barriers vanish, smaller teams can now compete at scale, reshaping how the web’s data economy works.

Theresia Tanzil·2 mins·October 6, 2025
A Deep Dive into Zyte's Open-Source Libraries
Open Source

A Deep Dive into Zyte's Open-Source Libraries

Discover how Zyte’s open-source libraries like ClearHTML, Extruct, Chomp.js, and more simplify web data extraction and processing.

Neha Setia Nagpal·1 mins·December 19, 2024

The Community · Newsletter

The best of Zyte and the data web, in your inbox.

One curated edition — new articles, product updates, and the stories shaping the data web. No noise.

G2.com

Capterra.com

Proxyway.com

EWDCI logoMost loved workplace certificateZyte rewardISO 27001 iconG2 rewardG2 rewardG2 reward

© Zyte Group Limited 2026