
The rise of Scrapy: How an open-source scraping framework conquered the web

Read Time: 10 mins
Posted on: May 14, 2025
By Theresia Tanzil

The story of Scrapy reflects the broader evolution of the web itself and the ongoing quest to harness its ever-expanding ocean of information.

It started as a way to gather data for a furniture ecommerce startup. It blossomed into the framework of choice for web scrapers, with over 82 million downloads, fuelling tens of thousands of web crawlers every day to power price comparison engines, market research, AI models and more.


Ten years after the release of version 1.0, Scrapy's scale is undeniable. Almost 11,000 commits later, its GitHub repository reflects a vibrant history, built upon contributions from hundreds of developers, all tackling the complex, often frustrating task of extracting structured information from the web.


Scrapy’s influence has shaped best practices across the industry. But how did a pragmatic tool, born out of necessity, achieve such longevity and dominance in the fast-moving, adversarial landscape of web data?

The spark: A startup's data problem


The story of Scrapy begins not in a research lab or a university project, but amidst the practical pressures of a London startup in 2007.


Mydeco, a furniture aggregation website, faced a common challenge: gathering vast amounts of product information scattered across countless manufacturer and retailer websites. Shane Evans, then head of software development at Mydeco, found existing tools inadequate for the task. They were often brittle, difficult to maintain, and simply couldn't handle the required volume and complexity.


"My first significant experience web scraping with Python was in 2007, when I was building a vertical search engine," Evans recounted in a 2015 interview with DecisionStats. "Initially, my team started writing Python scripts, but that got problematic very quickly." The constant need to adapt to website changes, handle errors gracefully, and manage concurrent requests demanded a more structured approach. Faced with this, Evans made a pivotal decision: Mydeco would build its own scraping framework.


"I wrote a framework to make the job easier, promoting best practices for scraping and avoiding common mistakes," Evans explained. This initial framework laid the groundwork, focusing on robustness and maintainability. However, the sheer scale of the task required more resources.


Enter Pablo Hoffman. Through a client connection, Mydeco engaged Insophia, a small Python development shop in Montevideo, Uruguay, founded by Hoffman. Hoffman, who "immediately fell in love with" Python in college around 2004 and "never looked back", according to a podcast interview he gave in 2016, quickly joined the Mydeco scraping team. Working closely with the framework Evans' team had started, Hoffman agreed that its potential extended far beyond Mydeco's immediate needs. The core components – the downloader, the asynchronous engine – were already showing promise.

Going open source: sharing the solution


Working within Mydeco, Hoffman, like Evans, saw the framework evolving. "When I joined MyDeco, there were a lot of parts of Scrapy already built there," he told Talk Python. "The core framework, the downloader, and the more important internals were already in place... You really noticed that there was an improvement over how you go about writing ad hoc web crawlers."


The framework wasn't just solving Mydeco's problem; it was addressing a common pain point for developers everywhere who were manually stitching together libraries.


Adrian Chaves, a lead maintainer of Scrapy, recalls the pre-Scrapy landscape: "You had a downloader library like requests, and a parser like BeautifulSoup, but you had to write all the glue code yourself. With Scrapy, even in 2009, you would get an all-in-one tool, a way to structure your code in callbacks, performance benefits from concurrent requests and a lot of built-in middlewares. In web scraping, you don’t get to choose when to change your code. Websites force your hand. Scrapy makes it as easy as it can be for us to create maintainable and reusable code.”
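
The structure Chaves describes is easy to picture with a minimal spider. The sketch below is illustrative only (the spider name, URL and selectors are placeholders, not anything from Mydeco or the original codebase), but it shows the callback style: each callback extracts items and yields further requests, while Scrapy handles scheduling, concurrency and the middleware chain around it.

    import scrapy

    class ProductSpider(scrapy.Spider):
        # Illustrative spider: the name, start URL and selectors are placeholders.
        name = "products"
        start_urls = ["https://example.com/catalogue"]

        def parse(self, response):
            # Each callback receives a downloaded response; requests yielded
            # here are scheduled and fetched concurrently by the engine.
            for product in response.css("div.product"):
                yield {
                    "title": product.css("h2::text").get(),
                    "price": product.css("span.price::text").get(),
                }
            # Pagination: the next page is handled by this same callback.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)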


Recognizing this broader potential, Hoffman proposed a bold move to Evans: open-source the framework. It was a strategic decision to foster collaboration and build a community around the tool. Mydeco agreed. After months of refinement, the project, now named Scrapy (a portmanteau of 'Scrape' and 'Python'), was released under the permissive BSD license in August 2008.

Foundational pillars: The early architecture


The Scrapy that emerged in 2008 was built on several key technical pillars, choices made deliberately to handle the scale and complexity of web crawling:


  • Python's Pragmatism: The choice of Python, while perhaps less common for high-performance networking then, prioritized developer productivity and accessibility. "Speed wasn't Python’s strong suit," Adrian Chaves recalls. "But for scraping, the speed of iteration mattered more. The codebase had to be accessible to be adapted constantly."

  • Twisted's Asynchronous Power: At its core, Scrapy leveraged Twisted, Python's event-driven networking engine. This was crucial. Web scraping is inherently I/O-bound (waiting for network responses). Twisted's non-blocking model allowed Scrapy to manage thousands of concurrent requests efficiently, providing a significant performance advantage over synchronous approaches. "Scrapy takes care of a lot of the lower level async programming, which is required to get good performance," noted Shane Evans. "This code is awkward to write, hard to understand, and a nightmare to debug."

  • Robust Parsing with lxml: To handle the often-messy HTML found in the wild, Scrapy relied on the lxml library, known for its speed and resilience in parsing malformed markup.

  • Standard Selectors (XPath/CSS): Instead of inventing a custom way to select data, Scrapy embraced web standards: XPath and CSS selectors. This allowed developers to use existing skills and browser tools, fostering easier adoption.

  • Modularity: Inspired by frameworks like Django, Scrapy was designed with distinct components (Engine, Scheduler, Downloader, Spiders, Pipelines) and crucial extension points (Middlewares). This allowed developers to customize behavior without modifying the core, a key factor in its future adaptability; a minimal middleware sketch follows this list. Hoffman emphasized the goal was to "factor out the common things that you do when you write web scrapers and separate them from the actual extraction rules... that you will type for each website."
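
To make that last point concrete, here is a minimal downloader middleware sketch. It is an assumption-laden illustration rather than anything from Scrapy's history: the class name, header and project path are made up, but the process_request hook and the DOWNLOADER_MIDDLEWARES setting are the standard extension points the framework exposes.

    # middlewares.py: a made-up middleware that tags every outgoing request.
    class CustomHeaderMiddleware:
        def process_request(self, request, spider):
            # Runs for every request on its way to the downloader,
            # without touching Scrapy's core.
            request.headers.setdefault("X-Example-Header", "demo")
            return None  # returning None lets processing continue as normal

    # settings.py: enable it per project; the number controls its position
    # in the middleware chain.
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.CustomHeaderMiddleware": 543,
    }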

Early growth and the rise of Zyte


Following its 2008 release, Scrapy began to gain traction within the Python community. Developers grappling with the limitations of ad-hoc scraping scripts recognized the value of its structured, asynchronous approach. The scrapy-users mailing list became an active hub for questions, solutions, and bug reports.


Crucially, the community wasn't just consuming; it was contributing. Scrapy's modular design, especially its middleware system, made it relatively easy for users to extend its functionality and share their improvements.

However, maintaining a growing open-source project requires resources. Furthermore, Hoffman and Evans saw a burgeoning need for commercial services built around Scrapy. While the framework itself was powerful, businesses often required reliable hosting, large-scale proxy management to avoid blocks, solutions for JavaScript-heavy sites, and dedicated support – features beyond the scope of the core open-source project.


This led to the founding of Scrapinghub in 2010. The company’s mission was clear: "To make it easier to get structured data from the internet." Scrapinghub became the official corporate steward of Scrapy in 2011, providing crucial stability and dedicated developer time for maintenance and evolution. This prevented Scrapy from potentially stagnating, a common fate for unfunded open-source projects.


The relationship was designed to be symbiotic. Scrapinghub, renamed Zyte in 2021, invested heavily in the open-source framework, recognizing its foundational importance. Simultaneously, it developed commercial products that complemented Scrapy. "We realized we had kind of this best framework for writing web spiders," Hoffman explained of the genesis of their cloud platform. "The next obvious thing to do was to give it the friendliest way possible to run in the cloud."


This led to Scrapy Cloud, followed by other tools for proxy management and JavaScript rendering, addressing the complex needs of large-scale commercial scraping operations while ensuring the core framework remained free and open.

The Python 3 challenge


Perhaps the most significant technical hurdle Scrapy faced was the migration from Python 2 to Python 3. This was a major undertaking for the entire Python ecosystem, but Scrapy's deep reliance on Twisted, which itself had a protracted Python 3 migration, made it particularly complex. The core team and community contributors invested considerable effort over several years, gradually refactoring code, updating dependencies, and ensuring compatibility.


"It took years but we knew it was critical," Adrian Chaves recalls. "Twisted had its own Python 3 transition too, and Scrapy's dependencies on it were deep. Staying on Python 2 wasn't an option if we want Scrapy to survive."


Scrapy 1.1 (May 2016) introduced experimental Python 3 support, and full support became standard in subsequent releases, ensuring the framework's relevance for the future of Python development.


Embracing Asyncio


While Twisted remained Scrapy's powerful asynchronous foundation, the rise of asyncio in Python's standard library presented an opportunity. Recognizing the desire for flexibility and alignment with modern Python practices, the Scrapy team undertook another significant effort: integrating asyncio support. Starting with Scrapy 2.0 (March 2020), developers could choose asyncio as the event loop, allowing them to leverage the wider asyncio ecosystem alongside Scrapy's robust crawling capabilities. This wasn't a replacement for Twisted but an alternative, demonstrating Scrapy's adaptability and commitment to developer choice.


"The day you could replace def parse(self, response) with async def parse(self, response) marked the beginning of a new era,” Adrian Chaves remembers. "It was a major leap."


Continuous Improvement


Beyond these major milestones, Scrapy continued to evolve through regular releases, incorporating new features, performance enhancements, and security updates. Community involvement remained vital, with contributions flowing through GitHub issues, pull requests, and participation in programs like Google Summer of Code (GSoC), which brought fresh talent and ideas to the project, including per-spider settings, Crawler API refactoring, HTTP/2 support, better robots.txt parsing, and improved MIME sniffing.


Adrian Chaves highlights the cumulative effect of these efforts: "We have many smaller but important community contributions in every release. To me it feels like the positive version of ‘death by a thousand cuts’, something like ‘success by a thousand patches’." The framework added better feed export options, improved crawl management features, enhanced support for different data types, and countless other refinements, solidifying its position as a comprehensive and versatile web scraping tool.

Legacy and future: The enduring impact of Scrapy


More than fifteen years after its open-source debut and ten years after it hit 1.0, Scrapy remains a dominant force in web scraping. Its journey from an internal startup tool to a globally adopted framework is a testament to its robust design, the strategic vision of its creators, and the power of open-source collaboration. Its impact is multifaceted:


  • Democratizing Data Access: By providing a powerful, free, and open-source tool, Scrapy lowered the barrier for developers, researchers, journalists, and businesses to access and utilize web data.

  • Establishing Best Practices: Scrapy promoted structured approaches to scraping, emphasizing maintainability, error handling, and respect for website resources through features like AutoThrottle (a small settings sketch follows this list). Its conventions influenced how many developers approach web data extraction.

  • Powering an Ecosystem: Scrapy became the foundation for a thriving ecosystem, including the success of Zyte and numerous third-party extensions, tools, and services built upon or integrating with the framework.

  • Training Ground: For many Python developers, Scrapy served as an introduction to asynchronous programming concepts and practical data engineering challenges.
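
As a concrete illustration of that politeness, the settings below enable adaptive throttling and robots.txt compliance. The specific values are examples chosen for this article, not recommendations from the Scrapy maintainers.

    # settings.py: throttle politely and honour robots.txt (example values only).
    AUTOTHROTTLE_ENABLED = True            # adapt delays to server response times
    AUTOTHROTTLE_START_DELAY = 1.0         # initial delay between requests, in seconds
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average concurrent requests per remote site
    ROBOTSTXT_OBEY = True                  # respect robots.txt rules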


Today, Scrapy continues to be actively maintained by Zyte and a dedicated community of contributors. It faces ongoing challenges including the rise of dynamic JavaScript-heavy websites and evolving attitudes toward data collection. Yet, its core principles of modularity, asynchronous efficiency, and community-driven development position it well to adapt.


The story of Scrapy reflects the broader evolution of the web itself and the ongoing quest to harness its ever-expanding ocean of information.


From the individual needs of a London startup to a global standard, Scrapy's journey underscores the enduring power of solving a difficult problem well and sharing that solution openly with the world.
