PINGDOM_CHECK

#ExtractSummit2026 The world's largest web scraping conference returns. Austin Oct 7–8 · Dublin Nov 10–11.

Register now
Data Services
Pricing
Login
Try Zyte APIContact Sales
  • Unblocking and Extraction

    Zyte API

    The ultimate API for web scraping. Avoid website bans and access a headless browser or AI Parsing

    Ban Handling

    Headless Browser

    AI Extraction

    Enterprise

    DocumentationSupport

    Hosting and Deployment

    Scrapy Cloud

    Run, monitor, and control your Scrapy spiders however you want to.

    Coding Agent Add-Ons

    Agentic Web Data

    Plugins that give coding agents the context to build production Scrapy projects. Starts with Claude Code.

  • Data Services
  • Pricing
  • Blog

    Learn

    Case Studies

    Webinars

    Videos

    White Papers

    Join our Community
    Web scraping APIs vs proxies: A head-to-head comparison
    Blog Post
    The seven habits of highly effective data teams
    Blog Post
  • Product and E-commerce

    From e-commerce and online marketplaces

    Data for AI

    Collect and structure web data to feed AI

    Job Posting

    From job boards and recruitment websites

    Real Estate

    From Listings portals and specialist websites

    News and Article

    From online publishers and news websites

    Search

    Search engine results page data (SERP)

    Social Media

    From social media platforms online

  • Meet Zyte

    Our story, people and values

    Contact us

    Get in touch

    Support

    Knowledge base and raise support tickets

    Terms and Policies

    Accept our terms and policies

    Open Source

    Our open source projects and contributions

    Web Data Compliance

    Guidelines and resources for compliant web data collection

    Join the team building the future of web data
    We're Hiring
    Trust Center
    Security, compliance & certifications
Login
Try Zyte APIContact Sales

Zyte Developers

Coding tools & hacks straight to your inbox

Become part of the community and receive a bi-weekly dosage of all things code.

Join us
    • Zyte Data
    • News & Articles
    • Search
    • Social Media
    • Product
    • Data for AI
    • Job Posting
    • Real Estate
    • Zyte API - Ban Handling
    • Zyte API - Headless Browser
    • Zyte API - AI Extraction
    • Web Scraping Copilot
    • Zyte API Enterprise
    • Scrapy Cloud
    • Solution Overview
    • Blog
    • Webinars
    • Case Studies
    • White Papers
    • Documentation
    • Web Scraping Maturity Self-Assesment
    • Web Data compliance
    • Meet Zyte
    • Jobs
    • Terms and Policies
    • Trust Center
    • Support
    • Contact us
    • Pricing
    • Do not sell
    • Cookie settings
    • Sign up
    • Talk to us
    • Cost estimator
Home
Blog
Scrapy Tips from the Pros (March 2016 Edition): Mastering the Craft
Light
Dark
Read Time
4 Mins
Posted on
March 23, 2016
Open Source
By
Valdir Stumm Junior
×

Try Zyte API

Zyte proxies and smart browser tech rolled into a single API.
Start FreeFind out more
Subscribe to our Blog

Scrapy tips from the pros: March 2016 edition

Scrapy-Tips-March-2016

Welcome to the March Edition of Scrapy Tips from the Pros! Each month we’ll release a few tips and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we’ll cover how to use a cookiejar with the CookiesMiddleware to get around websites that won’t allow you to crawl multiple pages at the same time using the same cookie. We’ll also share a handy tip on how to use multiple fallback XPath/CSS expressions with item loaders to get data from websites more reliably.

**Students reading this, we are participating in Google Summer of Code 2016 and some of our project ideas involve Scrapy! If you're interested, take a look at our ideas and remember to apply before Friday, March 25!

If you are not a student, please share with your student friends. They could get a summer stipend and we might even hire them at the end.**

Work Around Sites With Weird Session Behavior Using a CookieJar

Websites that store your UI state on their server's sessions are a pain to navigate, let alone scrape. Have you ever run into websites where one tab affects the other tabs open on the same site? Then you’ve probably run into this issue.

While this is frustrating for humans, it’s even worse for web crawlers. It can severely hinder a web crawling session. Unfortunately, this is a common pattern for ASP.Net and J2EE-based websites. And that's where cookiejars come in. While the cookiejar is not a frequent need, you’ll be so glad that you have it for those unexpected cases.

When your spider crawls a website, Scrapy automatically handles the cookie for you, storing and sending it in subsequent requests to the same site. But, as you may know, Scrapy requests are asynchronous. This means that you probably have multiple requests being handled concurrently to the same website while sharing the same cookie. To avoid having requests affect each other when crawling these types of websites, you must set different cookies for different requests.

You can do this by using a cookiejar to store separate cookies for different pages in the same website. The cookiejar is just a key-value collection of cookies that Scrapy keeps during the crawling session. You just have to define a unique identifier for each of the cookies that you want to store and then use that identifier when you want to use that specific cookie.

For example, say you want to crawl multiple categories on a website, but this website stores the data related to the category that you are crawling/browsing in the server session. To crawl the categories concurrently, you would need to create a cookie for each category by passing the category name as the identifier to the cookiejar meta parameter:

 class ExampleSpider(scrapy.Spider): urls = [ 'http://www.example.com/category/photo', 'http://www.example.com/category/videogames', 'http://www.example.com/category/tablets' ] def start_requests(self): for url in urls: category = url.split('/')[-1] yield scrapy.Request(url, meta={'cookiejar': category}) 

Three different cookies will be managed in this case (‘photo’, ‘videogames’ and ‘tablets’). You can create a new cookie whenever you pass a nonexistent key as the cookiejar meta value (like when a category name hasn’t been visited yet). When the key we pass already exists, Scrapy uses the respective cookie for that request.

So, if you want to reuse the cookie that has been used to crawl the 'videogames' page, for example, you just need to pass 'videogames' as the unique key to the cookiejar. Instead of creating a new cookie, it will use the existing one:

 yield scrapy.Request('http://www.example.com/atari2600', meta={'cookiejar': 'videogames'}) 

Adding Fallback CSS/XPath Rules

Item Loaders are useful when you need to accomplish more than simply populating a dictionary or an Item object with the data collected by your spider. For example, you might need to add some post-processing logic to the data that you just collected. You might be interested in something as simple as capitalizing every word in a title to more complex operations. With an ItemLoader, you can decouple this post-processing logic from the spider in order to have a more maintainable design.

This tip shows you how to add extra functionality to an Item Loader. Let’s say that you are crawling Amazon.com and extracting the price for each product. You can use an Item Loader to populate a ProductItem object with the product data:

class ProductItem(scrapy.Item): name = scrapy.Field() url = scrapy.Field() price = scrapy.Field() class AmazonSpider(scrapy.Spider): name = "amazon" allowed_domains = ["amazon.com"] def start_requests(self): ... def parse_product(self, response): loader = ItemLoader(item=ProductItem(), response=response) loader.add_css('price', '#priceblock_ourprice ::text') loader.add_css('name', '#productTitle ::text') loader.add_value('url', response.url) yield loader.load_item()

This method works pretty well, unless the scraped product is a deal. This is because Amazon represents deal prices in a slightly different format than regular prices. While the price of a regular product is represented like this:

<span id="priceblock_ourprice" class="a-size-medium a-color-price"> $699.99 </span>

The price of a deal is shown slightly differently:

<span id="priceblock_dealprice" class="a-size-medium a-color-price"> $649.99 </span>

A good way to handle situations like this is to add a fallback rule for the price field in the Item loader. This is a rule that is applied only if the previous rules for that field have failed. To accomplish this with the Item Loader, you can add a add_fallback_css method:

class AmazonItemLoader(ItemLoader): default_output_processor = TakeFirst() def get_collected_values(self, field_name): return (self._values[field_name] if field_name in self._values else self._values.default_factory()) def add_fallback_css(self, field_name, css, *processors, **kw): if not any(self.get_collected_values(field_name)): self.add_css(field_name, css, *processors, **kw)

As you can see, the add_fallback_css method will use the CSS rule if there are no previously collected values for that field. Now, we can change our spider to use AmazonItemLoader and then add the fallback CSS rule to our loader:

def parse_product(self, response): loader = AmazonItemLoader(item=ProductItem(), response=response) loader.add_css('price', '#priceblock_ourprice ::text') loader.add_fallback_css('price', '#priceblock_dealprice ::text') loader.add_css('name', '#productTitle ::text') loader.add_value('url', response.url) yield loader.load_item()

This tip can save you time and make your spiders much more robust. If one CSS rule fails to get the data, there will be other rules that can be applied which will extract the data you need.

If Item Loaders are new to you, check out the documentation.

Wrap Up

And there you have it! Please share any and all problems that you’ve run into while web scraping and extracting data. We’re always on the lookout for new tips and hacks to share in our Scrapy Tips from the Pros monthly column. Hit us up on Twitter or Facebook and let us know if we’ve helped your workflow.

And if you haven’t yet, give Portia, our open source visual web scraping tool, a try. We know you're attached to Scrapy, but it never hurts to experiment with your stack 😉

Please apply to join us for Google Summer of Code 2016 by Friday, March 25!

×

Try Zyte API

Zyte proxies and smart browser tech rolled into a single API.
Start FreeFind out more

Get the latest posts straight to your inbox

No matter what data type you're looking for, we've got you

G2.com

Capterra.com

Proxyway.com

EWDCI logoMost loved workplace certificateZyte rewardISO 27001 iconG2 rewardG2 rewardG2 reward

© Zyte Group Limited 2026