Scrapy Tips from the Pros (February 2016 Edition): Continuous Learning

Scrapy Tips from the Pros: February 2016 Edition - Stay ahead in web scraping with our latest tips from the pros. Enhance your scraping skills.

Valdir Stumm Junior

4 min read · February 24, 2016


Welcome to the February Edition of Scrapy Tips from the Pros. Each month we’ll release a few tips and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we’ll show you how to crawl websites more effectively by following sitemaps, and we'll also demonstrate how to add custom settings to individual spiders in a project.

Crawling a Website with Scrapy SitemapSpider

Web crawlers feed on URLs. The more they have, the longer they live. Finding a good source of URLs for any given website is very important as it gives the crawler a strong starting point.

Sitemaps are an excellent source of seed URLs. Website developers use them to indicate which URLs are available for crawling in a machine-readable format. Sitemaps are also a good way to discover web pages that would be otherwise unreachable, since some pages may not be linked to from any other page outside of the sitemap.

Sitemaps are often available at /sitemap.xml or in a different location specified in the robots.txt file.
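
If you want to check by hand where a site advertises its sitemaps, the `Sitemap:` lines of a robots.txt file can be read with Python's standard library. The sketch below is only an illustration: the robots.txt content and the example.com URLs are made up, and `sitemap_urls_from_robots` is a hypothetical helper, not part of Scrapy.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls_from_robots(robots_txt):
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap: line is present.
    return parser.site_maps() or []


# Made-up robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news/sitemap.xml
"""

if __name__ == "__main__":
    for url in sitemap_urls_from_robots(EXAMPLE_ROBOTS_TXT):
        print(url)
```

SitemapSpider can take care of this step as well: according to its documentation, sitemap_urls may point to a robots.txt file, in which case the sitemap URLs are extracted from it.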

With Scrapy you don’t need to worry about parsing XML and making requests. It includes a SitemapSpider class you can inherit to handle all of this for you.

SitemapSpider in Action

Let’s say you want to crawl Apple’s website to price check different products. You would want to visit as many pages as possible so that you can scrape as much data as you can. Fortunately, Apple's website provides a sitemap at apple.com/sitemap.xml, which looks like this:

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.apple.com/</loc></url>
  <url><loc>http://www.apple.com/about/</loc></url>
  ...
  <url><loc>http://www.apple.com/about/workingwithapple.html</loc></url>
  <url><loc>http://www.apple.com/accessibility/</loc></url>
  ...
</urlset>
```
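
To get a feel for the work SitemapSpider saves you, here is a rough sketch of pulling the URLs out of a sitemap like the one above by hand, using only the standard library. It is not how Scrapy implements it; `extract_sitemap_urls` and the shortened `SAMPLE` document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (the xmlns value of <urlset>).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml):
    """Return the text of every <loc> element nested in a <url> element."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Shortened version of the sitemap shown above.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.apple.com/</loc></url>
  <url><loc>http://www.apple.com/about/</loc></url>
</urlset>
"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

You would still have to download the sitemap and schedule a request for each URL yourself, which is the part the spider class handles.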

Scrapy’s generic SitemapSpider class implements all the logic for parsing and dispatching requests necessary to handle sitemaps. It reads and extracts URLs from the sitemap and it will dispatch a single request for each URL it finds. Here is a spider that will scrape Apple's website using the sitemap as its seed:

```python
from scrapy.spiders import SitemapSpider


class AppleSpider(SitemapSpider):
    name = 'apple-spider'
    sitemap_urls = ['http://www.apple.com/sitemap.xml']

    def parse(self, response):
        # Called once for every URL found in the sitemap.
        yield {
            'title': response.css("title ::text").extract_first(),
            'url': response.url
        }
        # ...
```

As you can see, you only need to subclass SitemapSpider and add the sitemap’s URL to the sitemap_urls attribute.

Now, run the spider and check the results:

```
$ scrapy runspider apple_spider.py -o items.jl --nolog
$ head -n 5 items.jl
{"url": "http://www.apple.com/", "title": "Apple"}
{"url": "http://www.apple.com/ae/support/products/iphone.html", "title": "Support - AppleCare+ - iPhone - Apple (AE)"}
{"url": "http://www.apple.com/ae/support/products/ipad.html", "title": "Support - AppleCare+ - iPad - Apple (AE)"}
{"url": "http://www.apple.com/ae/support/products/", "title": "Support - AppleCare - Apple (AE)"}
{"url": "http://www.apple.com/ae/support/ipod/", "title": "iPod - Apple Support"}
```
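
The items.jl file holds one JSON object per line (the JSON Lines format), so it is easy to load back into Python for a quick check. A minimal sketch, assuming nothing beyond the standard library; `read_json_lines` is a hypothetical helper, and an in-memory buffer with two lines from the output above stands in for the real file.

```python
import io
import json


def read_json_lines(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two lines copied from the output above stand in for items.jl.
SAMPLE = io.StringIO(
    '{"url": "http://www.apple.com/", "title": "Apple"}\n'
    '{"url": "http://www.apple.com/ae/support/ipod/", "title": "iPod - Apple Support"}\n'
)

items = list(read_json_lines(SAMPLE))
titles = [item["title"] for item in items]
print(titles)
```

With the real output you would pass an open file object instead of the buffer, since iterating over a file also yields one line at a time.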

Scrapy dispatches a request for each URL found by SitemapSpider in the sitemap and then it calls the parse method to handle each response it gets. However, some pages in a website may vary in structure and so you might want to use multiple callbacks for different types of pages.

For instance, you can define a specific callback to handle the Mac pages, another one for the iTunes pages and the default parse method for all the other pages:

```python
from scrapy.spiders import SitemapSpider


class AppleSpider(SitemapSpider):
    name = 'apple-spider'
    sitemap_urls = ['http://www.apple.com/sitemap.xml']
    # (URL pattern, callback name) pairs
    sitemap_rules = [
        ('/mac/', 'parse_mac'),
        ('/itunes/', 'parse_itunes'),
        ('', 'parse')
    ]

    def parse(self, response):
        self.log("default parsing method for {}".format(response.url))

    def parse_mac(self, response):
        self.log("parse_mac method for {}".format(response.url))

    def parse_itunes(self, response):
        self.log("parse_itunes method for {}".format(response.url))
```

To do this, add a sitemap_rules attribute to your class, mapping URL patterns to callbacks. For instance, URLs matching the '/mac/' pattern will have their responses handled by the parse_mac method.
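
Scrapy's documentation describes the first element of each rule as a regular expression matched against the URLs from the sitemap, with rules applied in order and only the first match used. The following is a simplified model of that dispatch in plain Python, not Scrapy's actual code; `pick_callback` is a hypothetical helper.

```python
import re

# Same rules as in the spider above: (URL pattern, callback name).
SITEMAP_RULES = [
    ('/mac/', 'parse_mac'),
    ('/itunes/', 'parse_itunes'),
    ('', 'parse'),
]


def pick_callback(url, rules=SITEMAP_RULES):
    """Return the callback name of the first rule whose pattern is found in url."""
    for pattern, callback_name in rules:
        if re.search(pattern, url):
            return callback_name
    return None  # no rule matched


if __name__ == "__main__":
    for url in (
        "http://www.apple.com/mac/",
        "http://www.apple.com/itunes/",
        "http://www.apple.com/about/",
    ):
        print(url, "->", pick_callback(url))
```

Because the empty pattern matches every URL, the ('', 'parse') rule acts as a catch-all and belongs at the end of the list. Placed first, it would shadow the more specific rules.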

So, the next time you write a crawler, make sure to use SitemapSpider if you want to have comprehensive crawls of the website.

For more features, check SitemapSpider’s documentation.

Customize Settings for Individual Spiders using custom_settings

The settings for Scrapy projects are typically stored in the project’s settings.py file. However, these are global settings that apply to each of the spiders in your project. If you want to set individual settings for each spider, all you need to do is add an attribute called custom_settings to your spider class.

This is especially useful when you need to enable or disable pipelines or middlewares for certain spiders or to specify different settings for each one. For example, some spiders in your project might need Smart Proxy Manager (formerly Crawlera, a smart proxy service) enabled while others don't. You can achieve this by adding 'CRAWLERA_ENABLED': True to custom_settings in the specific spiders.

custom_settings in Action

Take a look at a simplified version of the spiders from a book catalog project. They need custom_settings to define where the book covers are going to be stored in the filesystem:

```python
import scrapy


class AlibrisSpider(scrapy.Spider):
    name = "alibris-covers"
    allowed_domains = ["alibris.com"]
    start_urls = (
        'http://www.alibris.com/search/books/subject/Fiction-Science-Fiction',
    )
    # Store the covers downloaded by this spider in their own folder.
    custom_settings = {'IMAGES_STORE': 'imgs/alibris/sci-fi'}

    def parse(self, response):
        for book in response.css("div#selected-works ul.primaryList li"):
            yield {
                'image_urls': [book.css("img ::attr(src)").extract_first()],
                # Capture the title words, without surrounding whitespace.
                'title': book.css("p.bookTitle > a ::text").re_first(r"\s*((\w+\s?)+)\s*")
            }
```

```python
import scrapy


class GoodreadsSpider(scrapy.Spider):
    name = "goodreads-covers"
    allowed_domains = ["goodreads.com"]
    start_urls = (
        'http://www.goodreads.com/genres/science-fiction',
    )
    # Store the covers downloaded by this spider in their own folder.
    custom_settings = {'IMAGES_STORE': 'imgs/goodreads/sci-fi'}

    def parse(self, response):
        for book in response.css("img.bookImage"):
            yield {
                'title': book.css("::attr(alt)").extract_first(),
                'image_urls': [book.css("::attr(src)").extract_first()]
            }
```

These spiders scrape metadata from books on a range of websites. Book covers are retrieved using ImagesPipeline, which can be enabled and configured in the project's global settings.py file:

```python
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = '/some/path/to/images'
```

The IMAGES_STORE setting lets you define where the downloaded images will be stored in your filesystem. So if you want to separate the images downloaded by each spider into different folders, you just need to override the global IMAGES_STORE setting via custom_settings in each spider:

```python
custom_settings = {'IMAGES_STORE': '/a/different/path'}
```
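
To make the effect of the override concrete, the sketch below estimates where a cover would end up for each spider. It assumes the default ImagesPipeline naming scheme described in Scrapy's documentation (full/<SHA-1 hash of the image URL>.jpg under IMAGES_STORE), so check the docs for your Scrapy version. `expected_image_path` is a hypothetical helper and the cover URL is made up.

```python
import hashlib
import posixpath


def expected_image_path(images_store, image_url):
    """Estimate the file path of a downloaded image, assuming default naming."""
    image_guid = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return posixpath.join(images_store, "full", image_guid + ".jpg")


# Made-up image URL, for illustration only.
COVER_URL = "http://www.example.com/covers/some-book.jpg"

if __name__ == "__main__":
    for store in ("imgs/alibris/sci-fi", "imgs/goodreads/sci-fi"):
        print(expected_image_path(store, COVER_URL))
```

Under that assumption the file name depends only on the image URL, so the per-spider IMAGES_STORE value is what keeps the two collections in separate folders.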

Alternatively, you can pass the settings as arguments for the spider using the scrapy -s command line option, but that adds the hassle of having to pass all custom settings from the command line:

```
$ scrapy crawl alibris-covers -s IMAGES_STORE=imgs/alibris/
```

So, if you ever need different settings for some spider in your project, include them in the custom_settings attribute in the spider. Heads up, this feature is available for Scrapy >= 1.0.0.
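
As a rough mental model of how these sources combine: Scrapy gives each settings source a priority, and the value from the highest-priority source wins. The sketch below imitates that with plain dictionaries. The priority numbers mirror the project, spider and cmdline entries of Scrapy's documented SETTINGS_PRIORITIES, while the `resolve` helper is hypothetical and much simpler than Scrapy's own Settings class.

```python
# Priority values mirror the project, spider and cmdline entries of Scrapy's
# SETTINGS_PRIORITIES; the resolve() helper itself is hypothetical.
PRIORITIES = {"project": 20, "spider": 30, "cmdline": 40}


def resolve(name, sources):
    """Return the value of `name` from the highest-priority source that sets it.

    `sources` maps a source label ("project", "spider" or "cmdline") to a
    dict of the settings defined at that level.
    """
    candidates = [
        (PRIORITIES[label], settings[name])
        for label, settings in sources.items()
        if name in settings
    ]
    if not candidates:
        raise KeyError(name)
    return max(candidates, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    sources = {
        "project": {"IMAGES_STORE": "/some/path/to/images"},  # settings.py
        "spider": {"IMAGES_STORE": "imgs/alibris/sci-fi"},    # custom_settings
    }
    print(resolve("IMAGES_STORE", sources))

    # A value passed with -s on the command line outranks custom_settings.
    sources["cmdline"] = {"IMAGES_STORE": "imgs/alibris/"}
    print(resolve("IMAGES_STORE", sources))
```

In other words, custom_settings overrides settings.py, and a value passed with -s overrides both.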

If you are interested in learning how to download images using your spiders, check the docs for more information.

Wrap Up

And that’s about it for our February tips. Check back in with us in March, follow us on Twitter, Facebook, and Instagram, and subscribe to our blog to catch our next Scrapy Tips from the Pros.
