
The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Copilot

Welcome to Part 3 of our Modern Scrapy series.


  • In Part 1, we built a basic crawling spider.
  • In Part 2, we refactored it into a professional, scalable architecture using scrapy-poet.

That refactor was a huge improvement, but it was still a lot of manual work. We had to:


  • Manually create our BookItem and BookListPage schemas.
  • Manually create the bookstoscrape_com.py Page Object file.
  • Manually use scrapy shell to find all the CSS selectors (a sketch of that workflow follows this list).
  • Manually write all the @field parsers.
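
For context, the manual selector-hunting in Part 2 looked roughly like this scrapy shell session. This is illustrative only: output lines are omitted, and the exact values depend on the page.

$ scrapy shell "https://books.toscrape.com/catalogue/the-host_979/index.html"
>>> response.css("h1::text").get()                  # the book's title
>>> response.css("p.price_color::text").get()       # the price string
>>> response.css("p.availability::text").getall()   # stock info, needs cleanup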

What if you could do all of that in about 30 seconds?


In this guide, we'll show you how to use Web Scraping Copilot (our VS Code extension) to automatically write 100% of your Items, Page Objects, and even your unit tests. We'll take our simple spider from Part 1 and upgrade it to the professional scrapy-poet architecture from Part 2, but this time, the AI will do all the heavy lifting.




On This Page (Table of Contents)

  1. Prerequisites (Part 1 & VS Code)
  2. Step 1: Installing the Web Scraping Copilot Extension
  3. Step 2: Auto-Generating our BookItem
  4. Step 3: Running the AI-Generated Tests
  5. Step 4: Refactoring the Spider (The Easy Way)
  6. Step 5: Auto-Generating our BookListPage
  7. Conclusion: The "Hybrid Developer"



Prefer to watch?

Here's a livestream where I talk through the extension and show it working, coding live.


Prerequisites & Setup

This tutorial assumes you have:

  • Completed Part 1: Building Your First Crawling Spider. We will start from this simpler spider.
  • Visual Studio Code installed.
  • The Web Scraping Copilot extension (which we'll install now).



Step 1: Installing the Web Scraping Copilot

Inside VS Code, go to the "Extensions" tab and search for Web Scraping Copilot (published by Zyte).



Once installed, you'll see a new icon in your sidebar. Open it, and it will automatically detect your Scrapy project. It may ask to install a few dependencies like pytest—allow it to do so. This setup process ensures your environment is ready for AI-powered generation.
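
If you'd rather install those dependencies yourself, the equivalent terminal command is roughly the following (package names taken from the tools this tutorial uses; your project may already have some of them):

$ pip install scrapy-poet pytest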




Step 2: Auto-Generating our BookItem

Let's start with the spider from Part 1. Our goal is to create a Page Object for our BookItem and add even more fields than we did in Part 2.
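
If you don't have the Part 1 code open, it looked roughly like this. This is a reconstruction for reference, using the same selectors this series uses; your exact version may differ.

# tutorial/spiders/books.py (roughly the Part 1 version, reconstructed for reference)

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    url = "https://books.toscrape.com/"

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    def parse_listpage(self, response):
        # Follow every book on the listing page.
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_book)
        # Paginate.
        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    def parse_book(self, response):
        # At this stage, all parsing lives in the spider itself.
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }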

In the Copilot chat window:

  1. Select "Web Scraping."
  2. Write a prompt like this:

"Create a page object for the item BookItem using the sample URL https://books.toscrape.com/catalogue/the-host_979/index.html"


The Copilot will now:

  • Check your project: It will confirm you have scrapy-poet and pytest (and will offer to install them if you don't).
  • Add scrapy-poet settings: It will automatically add the ADDONS and SCRAPY_POET_DISCOVER settings to your settings.py file (see the settings sketch just after the page object code below).
  • Create your items.py: It will create a new BookItem class, but this time it will intelligently add all the fields it can find on the page.

# tutorial/items.py (Auto-Generated!)
import attrs

@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """
    name: str
    price: str
    url: str
    availability: str       # <-- New!
    number_of_reviews: int  # <-- New!
    upc: str                # <-- New!

  • Create Fixtures: It creates a fixtures folder with the saved HTML and expected JSON output for testing.
  • Write the Page Object: It creates the tutorial/pages/bookstoscrape_com.py file and writes the entire Page Object, complete with all parsing logic and selectors, for all the new fields.

# tutorial/pages/bookstoscrape_com.py (Auto-Generated!)

from web_poet import WebPage, handle_urls, field, returns
from tutorial.items import BookItem

@handle_urls("books.toscrape.com/catalogue")
@returns(BookItem)
class BookDetailPage(WebPage):
    """
    This Page Object handles parsing data from book detail pages.
    """

    @field
    def name(self) -> str:
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

    # All of this was written for us!
    @field
    def availability(self) -> str:
        return self.response.css("p.availability::text").getall()[1].strip()

    @field
    def number_of_reviews(self) -> int:
        return int(self.response.css("table tr:last-child td::text").get())

    @field
    def upc(self) -> str:
        return self.response.css("table tr:first-child td::text").get()

In 30 seconds, the Copilot has done everything we did manually in Part 2, but better—it even added more fields.




Step 3: Running the AI-Generated Tests

The best part? The Copilot also wrote unit tests for you. It created a tests folder with test_bookstoscrape_com.py.
You can just click "Run Tests" in the Copilot UI (or run pytest in your terminal).


$ pytest
================ test session starts ================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
================ 8 tests passed in 0.10s ================

Your parsing logic is now fully tested, and you didn't write a single line of test code.
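
If you're curious what a fixture-driven test like this could look like if written by hand, here's a rough sketch. It is hypothetical: the fixture layout (a page.html plus expected.json per book) and the test structure are assumptions for illustration, not necessarily what the Copilot generates.

# tests/test_bookstoscrape_com.py (hypothetical hand-written equivalent)

import asyncio
import json
from pathlib import Path

import pytest
from web_poet import HttpResponse

from tutorial.pages.bookstoscrape_com import BookDetailPage

# Assumed layout: tests/fixtures/book_0/page.html + expected.json, book_1/..., etc.
FIXTURE_DIRS = sorted((Path(__file__).parent / "fixtures").glob("book_*"))

@pytest.mark.parametrize("fixture_dir", FIXTURE_DIRS, ids=lambda p: p.name)
def test_book_detail(fixture_dir):
    html = (fixture_dir / "page.html").read_bytes()
    expected = json.loads((fixture_dir / "expected.json").read_text())

    # Build the page object from the saved HTML, no network needed.
    response = HttpResponse(url=expected["url"], body=html)
    page = BookDetailPage(response=response)
    item = asyncio.run(page.to_item())

    assert item.name == expected["name"]
    assert item.price == expected["price"]
    assert item.upc == expected["upc"]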




Step 4: Refactoring the Spider (The Easy Way)

Now, we just update our tutorial/spiders/books.py to use this new architecture, just like in Part 2.


# tutorial/spiders/books.py

import scrapy
# Import our new, auto-generated Item class
from tutorial.items import BookItem

class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy-poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book



Step 5: Auto-Generating our BookListPage

We can repeat the exact same process for our list page to finish the refactor.

Prompt the Copilot:


"Create a page object for the list item BookListPage using the sample URL https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

Result:

  • The Copilot will create the BookListPage item in items.py.
  • It will create the BookListPageObject in bookstoscrape_com.py with the parsers for book_urls and next_page_url (sketched below).
  • It will write and pass the tests.
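
Here's a rough sketch of what that generated code can look like. It's illustrative only: the URL pattern in handle_urls and the field annotations are assumptions based on this series, and the Copilot's actual output may differ.

# tutorial/items.py (sketch of the generated list-page item)

from typing import List, Optional
import attrs

@attrs.define
class BookListPage:
    book_urls: List[str]
    next_page_url: Optional[str]

# tutorial/pages/bookstoscrape_com.py (sketch of the generated list page object)

from typing import List, Optional
from web_poet import WebPage, handle_urls, field, returns
from tutorial.items import BookListPage

# The pattern is illustrative; the real rule may be more specific to category pages.
@handle_urls("books.toscrape.com/catalogue/category")
@returns(BookListPage)
class BookListPageObject(WebPage):

    @field
    def book_urls(self) -> List[str]:
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> Optional[str]:
        return self.response.css("li.next a::attr(href)").get()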

Now we can update our spider one last time to be fully architected.


# tutorial/spiders/books.py (FINAL VERSION)

import scrapy
from tutorial.items import BookItem, BookListPage # Import both

class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):

        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book

Our spider is now just a "crawler." It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Copilot.
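
To try the finished spider, a standard Scrapy run works; the output filename here is just an example:

$ scrapy crawl books -O books.json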




Conclusion: The "Hybrid Developer"

The Web Scraping Copilot doesn't replace you. It accelerates you. It automates the 90% of work that is "grunt work" (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% of work that matters: crawling logic, strategy, and handling complex sites.

This is how we, as the maintainers of Scrapy, build spiders professionally.


What's Next? Join the Community.


💬 TALK: Have questions about the Copilot? Ask the author and 20k+ devs in our Discord.


▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.


📩 READ: Want more advanced Scrapy tips? Get the Extract Community newsletter so you don't miss it.

