Scrapy Tips from the Pros (March 2016 Edition): Mastering the Craft

Scrapy Tips from the Pros: March 2016 Edition - Upgrade your web scraping game with the latest expert tips from our Scrapy pros.

Valdir Stumm Junior

4 min read · March 23, 2016

Welcome to the March Edition of Scrapy Tips from the Pros! Each month we’ll release a few tips and hacks that we’ve developed to help make your Scrapy workflow go more smoothly.

This month we’ll cover how to use a cookiejar with the CookiesMiddleware to get around websites that won’t allow you to crawl multiple pages at the same time using the same cookie. We’ll also share a handy tip on how to use multiple fallback XPath/CSS expressions with item loaders to get data from websites more reliably.

**Students reading this, we are participating in Google Summer of Code 2016 and some of our project ideas involve Scrapy! If you're interested, take a look at our ideas and remember to apply before Friday, March 25!

If you are not a student, please share with your student friends. They could get a summer stipend and we might even hire them at the end.**

Work Around Sites With Weird Session Behavior Using a CookieJar

Websites that store your UI state on their server's sessions are a pain to navigate, let alone scrape. Have you ever run into websites where one tab affects the other tabs open on the same site? Then you’ve probably run into this issue.

While this is frustrating for humans, it’s even worse for web crawlers, because it can severely hinder a crawling session. Unfortunately, this is a common pattern for ASP.NET and J2EE-based websites, and that's where cookiejars come in. You won't need a cookiejar often, but you’ll be glad to have it for those unexpected cases.

When your spider crawls a website, Scrapy automatically handles the cookie for you, storing and sending it in subsequent requests to the same site. But, as you may know, Scrapy requests are asynchronous. This means that you probably have multiple requests being handled concurrently to the same website while sharing the same cookie. To avoid having requests affect each other when crawling these types of websites, you must set different cookies for different requests.

You can do this by using a cookiejar to store separate cookies for different pages in the same website. The cookiejar is just a key-value collection of cookies that Scrapy keeps during the crawling session. You just have to define a unique identifier for each of the cookies that you want to store and then use that identifier when you want to use that specific cookie.
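To make the key-to-jar mapping concrete, here is a simplified, library-free model of the behavior described above. It is a sketch rather than Scrapy's actual code: the `JarStore` class and its `jar_for` method are hypothetical names, and plain dicts stand in for real cookie jars.

```python
from collections import defaultdict


class JarStore:
    """Toy stand-in for per-key cookie storage (hypothetical, not Scrapy code)."""

    def __init__(self):
        # A missing key creates a fresh, empty jar on first access.
        self.jars = defaultdict(dict)

    def jar_for(self, meta):
        # Requests without a 'cookiejar' key all share the default jar (key None).
        return self.jars[meta.get('cookiejar')]


store = JarStore()

# Each identifier gets its own session cookie.
store.jar_for({'cookiejar': 'photo'})['session'] = 'abc'
store.jar_for({'cookiejar': 'videogames'})['session'] = 'xyz'

# Reusing an existing key returns the same jar, so the session is preserved.
print(store.jar_for({'cookiejar': 'videogames'}))  # {'session': 'xyz'}

# A request that omits the key lands in the default jar, which is still empty.
print(store.jar_for({}))  # {}
```

The last call hints at a practical consequence: a request that does not carry the identifier does not see the cookies stored under it.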

For example, say you want to crawl multiple categories on a website, but this website stores the data related to the category that you are crawling/browsing in the server session. To crawl the categories concurrently, you would need to create a cookie for each category by passing the category name as the identifier to the cookiejar meta parameter:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    urls = [
        'http://www.example.com/category/photo',
        'http://www.example.com/category/videogames',
        'http://www.example.com/category/tablets',
    ]

    def start_requests(self):
        for url in self.urls:
            # Use the last path segment (the category name) as the cookiejar key.
            category = url.split('/')[-1]
            yield scrapy.Request(url, meta={'cookiejar': category})
```

Three separate cookie jars will be managed in this case (‘photo’, ‘videogames’ and ‘tablets’). A new jar is created whenever you pass a key that does not exist yet as the cookiejar meta value (for example, a category name that hasn’t been visited yet). When the key already exists, Scrapy uses the cookies stored under it for that request.

So, if you want to reuse the cookie that has been used to crawl the 'videogames' page, for example, you just need to pass 'videogames' as the unique key to the cookiejar. Instead of creating a new cookie, it will use the existing one:

```python
yield scrapy.Request('http://www.example.com/atari2600', meta={'cookiejar': 'videogames'})
```

Adding Fallback CSS/XPath Rules

Item Loaders are useful when you need to accomplish more than simply populating a dictionary or an Item object with the data collected by your spider. For example, you might need to add some post-processing logic to the data that you just collected. You might be interested in something as simple as capitalizing every word in a title to more complex operations. With an ItemLoader, you can decouple this post-processing logic from the spider in order to have a more maintainable design.

This tip shows you how to add extra functionality to an Item Loader. Let’s say that you are crawling Amazon.com and extracting the price for each product. You can use an Item Loader to populate a ProductItem object with the product data:

```python
import scrapy
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()


class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        ...

    def parse_product(self, response):
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_css('price', '#priceblock_ourprice ::text')
        loader.add_css('name', '#productTitle ::text')
        loader.add_value('url', response.url)
        yield loader.load_item()
```

This method works pretty well, unless the scraped product is a deal. This is because Amazon represents deal prices in a slightly different format than regular prices. While the price of a regular product is represented like this:

```html
<span id="priceblock_ourprice" class="a-size-medium a-color-price"> $699.99 </span>
```

The price of a deal is shown slightly differently:

```html
<span id="priceblock_dealprice" class="a-size-medium a-color-price"> $649.99 </span>
```

A good way to handle situations like this is to add a fallback rule for the price field in the Item Loader. This is a rule that is applied only if the previous rules for that field have failed. To accomplish this with the Item Loader, you can add an add_fallback_css method:

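```python
# On older Scrapy releases, import TakeFirst from scrapy.loader.processors instead.
from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader


class AmazonItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    def get_collected_values(self, field_name):
        # Return the values collected so far, or an empty list if there are none,
        # without creating an entry for the field.
        return self._values.get(field_name, [])

    def add_fallback_css(self, field_name, css, *processors, **kw):
        # Apply the CSS rule only if no earlier rule collected a value.
        if not any(self.get_collected_values(field_name)):
            self.add_css(field_name, css, *processors, **kw)
```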

As you can see, the add_fallback_css method will use the CSS rule if there are no previously collected values for that field. Now, we can change our spider to use AmazonItemLoader and then add the fallback CSS rule to our loader:

```python
def parse_product(self, response):
    loader = AmazonItemLoader(item=ProductItem(), response=response)
    loader.add_css('price', '#priceblock_ourprice ::text')
    # Applied only if the rule above collected no price.
    loader.add_fallback_css('price', '#priceblock_dealprice ::text')
    loader.add_css('name', '#productTitle ::text')
    loader.add_value('url', response.url)
    yield loader.load_item()
```

This tip can save you time and make your spiders much more robust. If one CSS rule fails to get the data, there will be other rules that can be applied which will extract the data you need.

If Item Loaders are new to you, check out the documentation.

Wrap Up

And there you have it! Please share any and all problems that you’ve run into while web scraping and extracting data. We’re always on the lookout for new tips and hacks to share in our Scrapy Tips from the Pros monthly column. Hit us up on Twitter or Facebook and let us know if we’ve helped your workflow.

And if you haven’t yet, give Portia, our open source visual web scraping tool, a try. We know you're attached to Scrapy, but it never hurts to experiment with your stack 😉

Please apply to join us for Google Summer of Code 2016 by Friday, March 25!
