The Modern Scrapy Developer's Guide (Part 2): Page Objects with scrapy-poet


Welcome to Part 2 of our Modern Scrapy series. In Part 1, we built a working spider that crawls and scrapes an entire category. But if you look at our code, it's already getting messy. Our parse_listpage and parse_book functions are mixing two different jobs:


  • Crawling Logic: Finding the next page and following links.
  • Parsing Logic: Finding the data (name, price) with CSS selectors.

What happens when a selector changes? Or when you want to test your parsing logic? You have to run the whole spider. This is slow, hard to maintain, and difficult to test.


In this guide, we'll fix this by refactoring our spider to a professional, modern standard using Scrapy Items and Page Objects (via scrapy-poet). We will completely separate our crawling logic from our parsing logic, making our code cleaner, far easier to test, and ready to scale.

What We'll Build

We will refactor our spider from Part 1. The spider itself will only handle crawling (following links). All the parsing logic will be moved into dedicated "Page Object" classes. scrapy-poet will automatically inject the correct, parsed item into our spider.


Look at how clean our spider's parse_book function becomes:


# The NEW parse_book function
# Where did the parsing logic go?! (Hint: scrapy-poet)

    async def parse_book(self, response, book: BookItem):
        # 'book' is a BookItem, magically injected and parsed
        # by scrapy-poet before this function is even called.
        # We just yield it.
        yield book

Prerequisites

This tutorial builds directly on Part 1: Building Your First Crawling Spider. Please complete that guide first, as we will be modifying the spider we built there.

Step 1: The "Why" (Separation of Concerns)

Our current spider is a monolith. The BooksSpider class knows how to crawl (find next page links, find product links) and how to parse (extract h1 tags, extract p.price_color).


This is bad. If we want to reuse our parsing logic, or test it without re-crawling the web, we can't.
The "Page Object" pattern solves this.


  • The Spider's Job: Crawling. Its only job is to navigate from page to page and yield Requests or Items.
  • The Page Object's Job: Parsing. Its only job is to take a response and extract structured data from it.

scrapy-poet is a library that automatically connects our spider to the correct Page Object.
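Before wiring in scrapy-poet, it may help to see the core idea in plain Python. This is a framework-free sketch of the pattern, assuming nothing beyond the standard library; `FakeResponse` and `BookPage` are hypothetical names for illustration, not part of any library:

```python
class FakeResponse:
    """A stub response: just enough surface to exercise parsing logic."""
    def __init__(self, data):
        self._data = data

    def css_text(self, selector):
        # In real Scrapy this would be response.css(selector).get();
        # here we look values up in a dict to keep the sketch stdlib-only.
        return self._data.get(selector)


class BookPage:
    """Parsing lives here, completely separate from any crawling code."""
    def __init__(self, response):
        self.response = response

    @property
    def name(self):
        return self.response.css_text("h1::text")


# The parsing logic can now be unit-tested without any network calls:
page = BookPage(FakeResponse({"h1::text": "A Light in the Attic"}))
print(page.name)  # A Light in the Attic
```

This is exactly the testability win the pattern buys you: the parsing class never needs a live website to be exercised.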

Step 2: Create Our "Schema" (Scrapy Items)

First, let's define the data we're scraping. Instead of passing around loose dictionaries, we'll define typed item classes with attrs, a third-party library that Scrapy supports natively for items (and that web-poet, the library underneath scrapy-poet, is itself built on).
Open tutorial/items.py and add two classes: one for our book data and one for our list page data.


# tutorial/items.py

import attrs

@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """
    name: str
    price: str
    url: str

@attrs.define
class BookListPage:
    """
    The data and links we extract from a *list* page.
    """
    book_urls: list
    next_page_url: str | None

This is our "schema." It gives the scraped data a clear, typed shape, and a typo in a field name fails loudly at construction time instead of slipping through silently as it would with a plain dict.
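If you want to poke at this schema shape in a REPL without installing anything, a stdlib `dataclasses` version behaves the same way. This is an illustrative analogue, not what the tutorial project uses:

```python
# Stdlib analogue of the attrs schema above, for quick experiments.
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BookItem:
    name: str
    price: str
    url: str


@dataclass
class BookListPage:
    book_urls: list
    next_page_url: Optional[str]


book = BookItem(
    name="A Light in the Attic",
    price="£51.77",
    url="https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
)
# Passing an unknown field name raises a TypeError at construction,
# unlike a plain dict where the typo would slip through silently.
print(asdict(book)["price"])  # £51.77
```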

Step 3: Install and Configure scrapy-poet

scrapy-poet is a separate package we need to install.


# Install scrapy-poet
uv add scrapy-poet
# or: pip install scrapy-poet

Now, we must enable it in tutorial/settings.py.


# tutorial/settings.py

# Add this to enable the scrapy-poet add-on
ADDONS = {
    'scrapy_poet.Addon': 300,
}

# Add this to tell scrapy-poet where to find our Page Objects
# 'tutorial.pages' refers to a 'pages' package inside our 'tutorial' package
SCRAPY_POET_DISCOVER = ['tutorial.pages']

Step 4: Create Page Objects for Parsing

Now for the magic. Let's create the tutorial/pages module.


mkdir tutorial/pages
touch tutorial/pages/__init__.py

Inside this new folder, create a file named bookstoscrape_com.py. This file will hold all the parsing logic for bookstoscrape.com.
This is the most complex part, but it's a "set it and forget it" pattern.


# tutorial/pages/bookstoscrape_com.py

from web_poet import WebPage, handle_urls, field, returns

# Import our Item schemas
from tutorial.items import BookItem, BookListPage

# This class handles all book DETAIL pages
@handle_urls("books.toscrape.com/catalogue")
@returns(BookItem)
class BookDetailPage(WebPage):
    """
    This Page Object handles parsing data from book detail pages.
    """

    # The @field decorator tells scrapy-poet: "run this function
    # and put the result into the 'name' field of the BookItem."
    @field
    def name(self) -> str:
        # This is our parsing logic from Part 1
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        # This is our parsing logic from Part 1
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

# This class handles all book LIST pages (categories)
@handle_urls("books.toscrape.com/catalogue/category")
@returns(BookListPage)
class BookListPageObject(WebPage):
    """
    This Page Object handles parsing data from category/list pages.
    """

    @field
    def book_urls(self) -> list:
        # This is our parsing logic from Part 1
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> str | None:
        # This is our parsing logic from Part 1
        return self.response.css("li.next a::attr(href)").get()

Look at that! All our messy response.css() calls are now neatly organized in their own classes, completely separate from our spider. The @handle_urls decorator tells scrapy-poet which Page Object to use for which URL.
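Conceptually, @handle_urls maintains a registry mapping URL patterns to Page Object classes, and the most specific matching pattern wins (which is why the /catalogue/category pattern takes precedence over the broader /catalogue one above). Here is a deliberately simplified stdlib sketch of that routing; the real implementation lives in web-poet and the url-matcher library and is considerably more sophisticated:

```python
# Simplified sketch of @handle_urls-style routing: a registry of
# URL patterns -> Page Object classes, most specific pattern wins.

PAGE_REGISTRY = {}


def handle_urls(pattern):
    def decorator(cls):
        PAGE_REGISTRY[pattern] = cls
        return cls
    return decorator


@handle_urls("books.toscrape.com/catalogue/category")
class ListPage: ...


@handle_urls("books.toscrape.com/catalogue")
class DetailPage: ...


def page_for(url):
    # Longest (most specific) pattern wins, echoing web_poet's matching.
    for pattern in sorted(PAGE_REGISTRY, key=len, reverse=True):
        if pattern in url:
            return PAGE_REGISTRY[pattern]
    return None


print(page_for("https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html").__name__)
# ListPage
print(page_for("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html").__name__)
# DetailPage
```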


Step 5: Refactor the Spider (The Payoff)

Now, let's go back to tutorial/spiders/books.py and refactor it. It becomes much simpler.


# tutorial/spiders/books.py

import scrapy
# Import our new Item classes
from tutorial.items import BookItem, BookListPage

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]
    url: str = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

    async def start(self):
        # We still start the same way
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # The 'page: BookListPage' is new.
    # We ask for the BookListPage item, and scrapy-poet injects it.
    async def parse_listpage(self, response, page: BookListPage):

        # 1. Get the parsed book URLs from the Page Object
        for url in page.book_urls:
            # We follow each URL, but our callback no longer
            # needs to do any work!
            yield response.follow(url, callback=self.parse_book)

        # 2. Get the next page URL from the Page Object
        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    # The 'book: BookItem' is new.
    # We ask for the BookItem, and scrapy-poet injects it.
    async def parse_book(self, response, book: BookItem):
        # Our parsing logic is GONE.
        # The 'book' variable is already a fully-populated
        # BookItem, parsed by our BookDetailPage Page Object.

        # We just yield it.
        yield book

Our spider is now only responsible for crawling. All parsing is handled by scrapy-poet and our Page Objects. This code is clean, testable, and incredibly easy to read.
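One payoff of this split is that the crawl logic itself becomes unit-testable: hand the callback a stub response and a pre-built page item, and no network or Scrapy engine is needed. Here's a rough sketch of that idea; `StubResponse` and the standalone functions are hypothetical test doubles, not real Scrapy API:

```python
import asyncio


class StubResponse:
    """Hypothetical stand-in for scrapy.http.Response, for unit tests."""
    def follow(self, url, callback=None):
        # Record what would have been requested instead of making a Request.
        return ("request", url, callback.__name__)


class BookListPage:
    """Hand-built page item, standing in for the scrapy-poet injection."""
    def __init__(self, book_urls, next_page_url):
        self.book_urls = book_urls
        self.next_page_url = next_page_url


async def parse_book(response, book):
    yield book


async def parse_listpage(response, page):
    # Same shape as the spider method above, minus `self`.
    for url in page.book_urls:
        yield response.follow(url, callback=parse_book)
    if page.next_page_url:
        yield response.follow(page.next_page_url, callback=parse_listpage)


async def main():
    page = BookListPage(["book-1.html", "book-2.html"], "page-2.html")
    results = [r async for r in parse_listpage(StubResponse(), page)]
    print(len(results))       # 3
    print(results[-1][1])     # page-2.html


asyncio.run(main())
```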

When you run scrapy crawl books -o books.json, the output will be identical to Part 1, but your architecture is now 100x better.

The "Hard Part": Why This Still Breaks

We've built a professional, well-architected Scrapy spider. But we've just made a cleaner version of a spider that will still fail on a real-world site.
This architecture is beautiful, but it doesn't solve the "real" problems:


  • ❌ IP Blocks: You're still hitting the site from one IP. You will be blocked.
  • ❌ CAPTCHAs: You have no way to avoid captchas, and your spider will fail.
  • ❌ JavaScript: If the prices were loaded by JS, our response.css() selectors would find nothing.

We've just organized our failing code.

The "Easy Way": Zyte API as a Universal Page Object

scrapy-poet is a great way to organise your Scrapy code, making your projects easier to build, collaborate on, and maintain. But it doesn't change the fact that we're still doing nothing to avoid web scraping bans.


To route our existing project through Zyte API, we install scrapy-zyte-api and add our API key and the add-on to the project settings:


# Install the scrapy-zyte-api library
uv add scrapy-zyte-api
# or: pip install scrapy-zyte-api

# tutorial/settings.py

ZYTE_API_KEY = "YOUR_API_KEY"

ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}

This is the power of combining a great architecture (Scrapy) with a powerful service (Zyte API).


Conclusion & Next Steps

Today you elevated your spider from a simple script to a professional-grade crawler. You learned the "Separation of Concerns" principle, defined data with Items, and separated parsing logic with scrapy-poet's Page Objects.
This is the modern way to build robust, testable, and scalable Scrapy spiders.


What's Next? Join the Community.


💬 TALK: Confused about @handle_urls or attrs? Ask the author and 20k+ devs in our Discord.


▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.


📩 READ: Want more? In Part 3, we'll cover advanced settings and deployment. Get the Extract Community newsletter so you don't miss it.



And if you're ready to skip the "Hard Part" entirely, get your free API key and try the "Easy Way."

Start Your Free Zyte Trial
