The Python 3 Challenge
Perhaps the most significant technical hurdle Scrapy faced was the migration from Python 2 to Python 3. This was a major undertaking for the entire Python ecosystem, but Scrapy's deep reliance on Twisted, which itself had a protracted Python 3 migration, made it particularly complex. The core team and community contributors invested considerable effort over several years, gradually refactoring code, updating dependencies, and ensuring compatibility.
"It took years but we knew it was critical," Adrian Chaves recalls. "Twisted had its own Python 3 transition too, and Scrapy's dependencies on it were deep. Staying on Python 2 wasn't an option if we want Scrapy to survive."
Scrapy 1.1 (May 2016) introduced experimental Python 3 support, full support matured over the subsequent 1.x releases, and Scrapy 2.0 dropped Python 2 entirely, ensuring the framework's relevance for the future of Python development.
Embracing Asyncio
While Twisted remained Scrapy's powerful asynchronous foundation, the rise of asyncio in Python's standard library presented an opportunity. Recognizing the desire for flexibility and alignment with modern Python practices, the Scrapy team undertook another significant effort: integrating asyncio support. Starting with Scrapy 2.0 (March 2020), developers could opt into an asyncio-backed reactor, allowing them to leverage the wider asyncio ecosystem alongside Scrapy's robust crawling capabilities. This didn't replace Twisted: the new reactor runs Twisted on top of the asyncio event loop, a design that demonstrates Scrapy's adaptability and commitment to developer choice.
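Opting in takes a single line in a project's settings.py (a minimal sketch; in recent Scrapy versions the project template sets this by default):

```python
# settings.py -- opt a Scrapy 2.0+ project into the asyncio reactor.
# Twisted still drives the crawl; it simply runs on top of the asyncio
# event loop, so asyncio-based libraries can be awaited from callbacks.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```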
"The day you could replace def parse(self, response) with async def parse(self, response) marked the beginning of a new era,” Adrian Chaves remembers. "It was a major leap."
Continuous Improvement
Beyond these major milestones, Scrapy continued to evolve through regular releases, incorporating new features, performance enhancements, and security updates. Community involvement remained vital, with contributions flowing in through GitHub issues and pull requests, and through programs like Google Summer of Code (GSoC), which brought fresh talent and ideas to the project. These efforts delivered, among other things, per-spider settings, a refactored Crawler API, HTTP/2 support, better robots.txt parsing, and improved MIME sniffing.
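One of those contributions, per-spider settings, is simple to picture. A hypothetical spider can override project-wide settings through its custom_settings attribute without affecting any other spider in the project:

```python
import scrapy


class GentleSpider(scrapy.Spider):
    """Hypothetical spider showing per-spider settings via custom_settings."""

    name = "gentle"
    start_urls = ["https://example.com/"]  # placeholder URL

    # These values override the project-wide settings for this spider only.
    custom_settings = {
        "DOWNLOAD_DELAY": 2.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    }

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```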
Adrian Chaves highlights the cumulative effect of these efforts: "We have many smaller but important community contributions in every release. To me it feels like the positive version of ‘death by a thousand cuts’, something like ‘success by a thousand patches’." The framework added better feed export options, improved crawl management features, enhanced support for different data types, and countless other refinements, solidifying its position as a comprehensive and versatile web scraping tool.
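The feed export improvements are a good example of those thousand patches at work. With the FEEDS setting introduced in the 2.x line, a single declaration can write items to several outputs at once (file names and field choices here are illustrative):

```python
# settings.py -- export scraped items to two feeds simultaneously.
FEEDS = {
    "items.json": {"format": "json", "encoding": "utf8", "indent": 2},
    "items.csv": {"format": "csv", "fields": ["url", "status"]},
}
```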