Scrapy tips from the pros: February 2016 edition

Read Time
4 Mins
Posted on
February 24, 2016
Open Source
Welcome to the February Edition of Scrapy Tips from the Pros. Each month we’ll release a few tips and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.
By
Valdir Stumm Junior

This month we’ll show you how to crawl websites more effectively by following sitemaps, and we'll also demonstrate how to add custom settings to individual spiders in a project.

Crawling a Website with Scrapy SitemapSpider

Web crawlers feed on URLs. The more they have, the longer they live. Finding a good source of URLs for any given website is very important as it gives the crawler a strong starting point.

Sitemaps are an excellent source of seed URLs. Website developers use them to indicate which URLs are available for crawling in a machine-readable format. Sitemaps are also a good way to discover web pages that would be otherwise unreachable, since some pages may not be linked to from any other page outside of the sitemap.

Sitemaps are often available at /sitemap.xml or in a different location specified in the robots.txt file.
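As a minimal sketch of the robots.txt route, Python's standard library can list the sitemaps a robots.txt file declares. The robots.txt body and the example.com URLs below are made up for illustration, and sitemap_urls_from_robots is a hypothetical helper name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news/sitemap.xml
"""


def sitemap_urls_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap lines exist.
    return parser.site_maps() or []


print(sitemap_urls_from_robots(ROBOTS_TXT))
# ['https://example.com/sitemap.xml', 'https://example.com/news/sitemap.xml']
```

With Scrapy you rarely need to do this by hand: sitemap_urls can also point at a robots.txt file, and the sitemap URLs are extracted from it for you.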

With Scrapy you don’t need to worry about parsing XML and making requests. It includes a SitemapSpider class you can inherit from to handle all of this for you.

SitemapSpider in Action

Let’s say you want to crawl Apple’s website to price check different products. You would want to visit as many pages as possible so that you can scrape as much data as you can. Fortunately, Apple's website provides a sitemap at apple.com/sitemap.xml, which looks like this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url><loc>http://www.apple.com/</loc></url>
 <url><loc>http://www.apple.com/about/</loc></url>
 ...
 <url><loc>http://www.apple.com/about/workingwithapple.html</loc></url>
 <url><loc>http://www.apple.com/accessibility/</loc></url>
 ...
 </urlset>
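To make the structure concrete, here is a rough standard-library sketch that pulls the loc entries out of a sitemap like the one above. It only illustrates the kind of parsing involved and is not how Scrapy implements it; extract_locs and the shortened SAMPLE document are hypothetical:

```python
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace, so lookups must be qualified.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened sample based on the excerpt above.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.apple.com/</loc></url>
  <url><loc>http://www.apple.com/about/</loc></url>
  <url><loc>http://www.apple.com/accessibility/</loc></url>
</urlset>"""


def extract_locs(sitemap_xml):
    """Return every <loc> URL found under <url> entries, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


for url in extract_locs(SAMPLE):
    print(url)
```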

Scrapy’s generic SitemapSpider class implements all the logic for parsing and dispatching requests necessary to handle sitemaps. It reads and extracts URLs from the sitemap and it will dispatch a single request for each URL it finds. Here is a spider that will scrape Apple's website using the sitemap as its seed:

from scrapy.spiders import SitemapSpider


class AppleSpider(SitemapSpider):
    name = 'apple-spider'
    sitemap_urls = ['http://www.apple.com/sitemap.xml']

    def parse(self, response):
        yield {
            'title': response.css("title ::text").extract_first(),
            'url': response.url
        }
        # ...

As you can see, you only need to subclass SitemapSpider and add the sitemap’s URL to the sitemap_urls attribute.

Now, run the spider and check the results:

$ scrapy runspider apple_spider.py -o items.jl --nolog
$ head -n 5 items.jl
{"url": "http://www.apple.com/", "title": "Apple"}
{"url": "http://www.apple.com/ae/support/products/iphone.html", "title": "Support - AppleCare+ - iPhone - Apple (AE)"}
{"url": "http://www.apple.com/ae/support/products/ipad.html", "title": "Support - AppleCare+ - iPad - Apple (AE)"}
{"url": "http://www.apple.com/ae/support/products/", "title": "Support - AppleCare - Apple (AE)"}
{"url": "http://www.apple.com/ae/support/ipod/", "title": "iPod - Apple Support"}

Scrapy dispatches a request for each URL found by SitemapSpider in the sitemap and then it calls the parse method to handle each response it gets. However, some pages in a website may vary in structure and so you might want to use multiple callbacks for different types of pages.

For instance, you can define a specific callback to handle the Mac pages, another one for the iTunes pages and the default parse method for all the other pages:

from scrapy.spiders import SitemapSpider


class AppleSpider(SitemapSpider):
    name = 'apple-spider'
    sitemap_urls = ['http://www.apple.com/sitemap.xml']
    sitemap_rules = [
        ('/mac/', 'parse_mac'),
        ('/itunes/', 'parse_itunes'),
        ('', 'parse')
    ]

    def parse(self, response):
        self.log("default parsing method for {}".format(response.url))

    def parse_mac(self, response):
        self.log("parse_mac method for {}".format(response.url))

    def parse_itunes(self, response):
        self.log("parse_itunes method for {}".format(response.url))

To do this, add a sitemap_rules attribute to your class that maps URL patterns to callbacks. For instance, responses for URLs matching the '/mac/' pattern are handled by the parse_mac method.
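Scrapy treats each pattern in sitemap_rules as a regular expression and uses only the first rule that matches, so the order of the list matters. The following plain-Python model of that dispatch is an illustration of the matching semantics, not Scrapy's own code; pick_callback is a hypothetical helper:

```python
import re

# Same rules as the spider: (URL pattern, callback name), tried in order.
RULES = [
    ("/mac/", "parse_mac"),
    ("/itunes/", "parse_itunes"),
    ("", "parse"),  # the empty pattern matches every URL
]


def pick_callback(url, rules=RULES):
    """Return the callback name of the first rule whose pattern is found in url."""
    for pattern, callback_name in rules:
        if re.search(pattern, url):
            return callback_name
    return None  # no rule matched


print(pick_callback("http://www.apple.com/mac/pro/"))        # parse_mac
print(pick_callback("http://www.apple.com/itunes/charts/"))  # parse_itunes
print(pick_callback("http://www.apple.com/about/"))          # parse

# Putting the catch-all first would shadow the more specific rules.
print(pick_callback("http://www.apple.com/mac/pro/",
                    [("", "parse"), ("/mac/", "parse_mac")]))  # parse
```

This is why the catch-all '' rule sits at the end of the list in the spider.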

So, the next time you write a crawler, consider using SitemapSpider if you want a comprehensive crawl of the website.

For more features, check SitemapSpider’s documentation.

Customize Settings for Individual Spiders using custom_settings

The settings for Scrapy projects are typically stored in the project’s settings.py file. However, these are global settings that apply to each of the spiders in your project. If you want to set individual settings for each spider, all you need to do is add an attribute called custom_settings to your spider class.

This is especially useful when you need to enable or disable pipelines or middlewares for certain spiders, or to specify different settings for each one. For example, some spiders in your project might need Smart Proxy Manager (formerly Crawlera), a smart proxy service, while others don't. You can achieve this by setting CRAWLERA_ENABLED to True in the custom_settings of the spiders that need it.

custom_settings in Action

Take a look at a simplified version of the spiders from a book catalog project. They need custom_settings to define where the book covers are going to be stored in the filesystem:

import scrapy


class AlibrisSpider(scrapy.Spider):
    name = "alibris-covers"
    allowed_domains = ["alibris.com"]
    start_urls = (
        'http://www.alibris.com/search/books/subject/Fiction-Science-Fiction',
    )
    # Store this spider's covers in their own folder.
    custom_settings = {'IMAGES_STORE': 'imgs/alibris/sci-fi'}

    def parse(self, response):
        for book in response.css("div#selected-works ul.primaryList li"):
            yield {
                'image_urls': [book.css("img ::attr(src)").extract_first()],
                'title': book.css("p.bookTitle > a ::text").re_first(r"\s*((\w+\s?)+)\s*")
            }


class GoodreadsSpider(scrapy.Spider):
    name = "goodreads-covers"
    allowed_domains = ["goodreads.com"]
    start_urls = (
        'http://www.goodreads.com/genres/science-fiction',
    )
    # Store this spider's covers in their own folder.
    custom_settings = {'IMAGES_STORE': 'imgs/goodreads/sci-fi'}

    def parse(self, response):
        for book in response.css("img.bookImage"):
            yield {
                'title': book.css("::attr(alt)").extract_first(),
                'image_urls': [book.css("::attr(src)").extract_first()]
            }

These spiders scrape metadata from books on a range of websites. Book covers are retrieved using ImagesPipeline, which can be enabled and configured in the project's global settings.py file:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = '/some/path/to/images'

The IMAGES_STORE setting lets you define where the downloaded images will be stored in your filesystem. So if you want to separate the images downloaded by each spider into different folders, you just need to override the global IMAGES_STORE setting via custom_settings in each spider:

custom_settings = {'IMAGES_STORE': '/a/different/path'}
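To see what the per-spider folders buy you, here is a rough sketch of where a downloaded cover would end up. It assumes the pipeline's default file naming (a full/ subfolder and a file named after the SHA-1 hash of the image URL); expected_cover_path and the example.com URL are hypothetical:

```python
import hashlib
import posixpath


def expected_cover_path(images_store, image_url):
    """Model the default location of a downloaded image under IMAGES_STORE."""
    # Assumption: default naming is full/<sha1 of the image URL>.jpg
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return posixpath.join(images_store, "full", digest + ".jpg")


cover_url = "http://example.com/covers/some-book.jpg"  # made-up URL
print(expected_cover_path("imgs/alibris/sci-fi", cover_url))
print(expected_cover_path("imgs/goodreads/sci-fi", cover_url))
```

Even if both spiders found the same image URL, each copy would land under its own spider's folder.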

Alternatively, you can pass the settings as arguments for the spider using the scrapy -s command line option, but that adds the hassle of having to pass all custom settings from the command line:

$ scrapy crawl alibris-covers -s IMAGES_STORE=imgs/alibris/
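It helps to know which source wins when the same setting is defined in more than one place. Scrapy assigns each source a priority (project settings 20, spider custom_settings 30, command line 40), and the highest one takes effect. The snippet below is a simplified plain-Python model of that rule, not Scrapy's Settings class; resolve is a hypothetical helper:

```python
# Priority numbers mirror Scrapy's documented setting priorities.
PRIORITIES = {"project": 20, "spider": 30, "cmdline": 40}


def resolve(layers):
    """Merge (source, settings_dict) pairs; the highest-priority source wins."""
    resolved = {}  # setting name -> (value, priority)
    for source, values in layers:
        for name, value in values.items():
            current = resolved.get(name)
            if current is None or PRIORITIES[source] >= current[1]:
                resolved[name] = (value, PRIORITIES[source])
    return {name: value for name, (value, _) in resolved.items()}


project = {"IMAGES_STORE": "/some/path/to/images"}  # settings.py
spider = {"IMAGES_STORE": "imgs/alibris/sci-fi"}    # custom_settings
cmdline = {"IMAGES_STORE": "imgs/alibris/"}         # -s option

print(resolve([("project", project), ("spider", spider)])["IMAGES_STORE"])
# imgs/alibris/sci-fi
print(resolve([("project", project), ("spider", spider),
               ("cmdline", cmdline)])["IMAGES_STORE"])
# imgs/alibris/
```

In other words, custom_settings overrides settings.py, and a value passed with -s overrides both.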

So, if you ever need different settings for some spider in your project, include them in the custom_settings attribute in the spider. Heads up, this feature is available for Scrapy >= 1.0.0.

If you are interested in learning how to download images using your spiders, check the docs for more information.

Wrap Up

And that’s about it for our February tips. Check back in with us in March, follow us on Twitter, Facebook, and Instagram, and subscribe to our blog to catch our next Scrapy Tips from the Pros.
