The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider

John Rooney · 4 min read · December 16, 2025


Scrapy can feel daunting. It's a massive, powerful framework, and the documentation can be overwhelming for a newcomer. Where do you even begin?

In this guide, we will walk step by step through building a real, multi-page crawling spider. You will go from an empty folder to a clean JSON file of structured data in about 15 minutes. We'll use modern async/await Python and cover project setup, finding selectors, following links (crawling), and saving your data.

What We'll Build

We will build a Scrapy spider that crawls the "Fantasy" category on books.toscrape.com. It follows the "Next" button through every page in that category, follows the link for every book, and scrapes the name, price, and URL of all 48 books, saving the result to a clean books.json file.
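Before touching Scrapy, it can help to see that crawl flow in plain Python. The sketch below is only an illustration: FAKE_SITE, its page names, and the crawl function are hypothetical stand-ins, and the loop runs one page at a time, whereas Scrapy schedules requests concurrently and does not guarantee this order.

```python
from collections import deque

# Hypothetical in-memory stand-in for the site: each list page maps to
# (book links on that page, name of the next page or None).
FAKE_SITE = {
    "page-1": (["book-a", "book-b"], "page-2"),
    "page-2": (["book-c"], None),
}


def crawl(start_page):
    """Return the book links found by walking list pages via their 'next' link."""
    found = []
    pending = deque([start_page])
    while pending:
        page = pending.popleft()
        book_links, next_page = FAKE_SITE[page]
        found.extend(book_links)   # in the real spider, each link goes to parse_book
        if next_page:              # same idea as checking for the "Next" button
            pending.append(next_page)
    return found


books = crawl("page-1")
print(books)
```

The spider we build has the same shape: one callback handles list pages and queues both book pages and the next list page, and a second callback handles each book.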

Here's a preview of our final spider code:

On This Page (Table of Contents)

  • Prerequisites & Setup
  • Step 1: Initialize Your Project
  • Step 2: Configure Your Settings
  • Step 3: Finding Our Selectors (with scrapy shell)
  • Step 4: Building the Spider (Crawling & Parsing)
  • Step 5: Running The Spider & Saving Data
  • Conclusion & Next Steps

Prerequisites & Setup

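Before we start, you'll need a recent Python 3 installed. The async start() method we use later was introduced in Scrapy 2.13, which requires Python 3.9 or newer. We'll also be using a virtual environment to keep our dependencies clean. You can use standard pip or a modern package manager like uv.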

First, let's create a project folder and activate a virtual environment.

Now, let's install Scrapy.

Step 1: Initialize Your Project

With Scrapy installed, we can use its built-in command-line tools to generate our project boilerplate.

First, create the project itself.

You'll see a tutorial folder and a scrapy.cfg file appear. This folder contains all your project's logic.

Next, we'll generate our first spider.

If you look in tutorial/spiders/, you'll now see books.py. This is where we'll write our code.

Step 2: Configure Your Settings

Before we write our spider, let's quickly adjust two settings in tutorial/settings.py.

ROBOTSTXT_OBEY

Projects generated with startproject set ROBOTSTXT_OBEY = True, so Scrapy respects robots.txt files. This is a good practice, but our test site (toscrape.com) doesn't have one, which can cause a 404 error in our logs. We'll turn it off for this tutorial.

Concurrency

Scrapy is polite by default and runs slowly. Since toscrape.com is a test site built for scraping, we can speed it up.

Warning: These settings are for this test site only. When scraping in the wild, you must be mindful of your target site and use respectful DOWNLOAD_DELAY and CONCURRENT_REQUESTS values.
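For contrast, here is a rough sketch of what more conservative settings for a real site might look like. The setting names are standard Scrapy settings, but the values are illustrative assumptions, not recommendations for any particular site; tune them to the target and its terms.

```python
# tutorial/settings.py (illustrative values for a real site, not for this tutorial)

ROBOTSTXT_OBEY = True                 # honor the target site's robots.txt
CONCURRENT_REQUESTS = 8               # overall cap on parallel requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # only a few parallel requests per domain
DOWNLOAD_DELAY = 1.0                  # seconds to wait between requests to a site
AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the delay to server latency
```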

Step 3: Finding Our Selectors (with scrapy shell)

To scrape a site, we need to tell Scrapy what data to get. We do this with CSS selectors. The scrapy shell is the best tool for this.

Let's launch the shell on our target category page:

This will download the page and give you an interactive shell with a response object.

Let's find the data we need:

Find all Book Links:

By inspecting the page, we see each book is in an article.product_pod. The link is inside an h3.

Find the "Next" Page Link:

At the bottom, we find the "Next" button in an li.next.
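Note that the li.next link is a relative href (page-2.html on the first Fantasy page). When we later call response.follow, Scrapy resolves relative links against the page URL for us. As a rough illustration using only the standard library, the resolution looks like this:

```python
from urllib.parse import urljoin

# The category page we opened in the shell
page_url = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

# The relative href extracted from li.next
next_href = "page-2.html"

# Joining replaces the last path segment (index.html) with the relative link
next_url = urljoin(page_url, next_href)
print(next_url)
```

This is why the spider never has to build absolute URLs by hand. Inside the shell, response.urljoin(next_href) gives the same kind of result.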

Find the Book Data (on a product page):

Finally, let's open a shell on a product page to find the selectors for our data.
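The price selector returns text that includes the currency symbol (for example '£25.82'). Our spider keeps that string as-is, but if you want a number later, a small helper along these lines could do it. parse_price is a hypothetical name, and the sketch assumes prices use a dot as the decimal separator with no thousands separators.

```python
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Turn a scraped price string such as '£25.82' into a Decimal, or None."""
    if raw is None:
        return None
    # Keep only digits and the decimal point, dropping the currency symbol
    digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    try:
        return Decimal(digits)
    except InvalidOperation:
        # Nothing numeric was left (for example an empty or placeholder string)
        return None


price = parse_price("£25.82")
print(price)
```

Decimal is used rather than float so that currency amounts are not affected by binary rounding.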

Step 4: Building the Spider (Crawling & Parsing)

Now, let's open tutorial/spiders/books.py and write our spider. We'll use the final version previewed at the start of this guide, with comments added to explain each step.

Delete the boilerplate in books.py and replace it with this:

Step 5: Running The Spider & Saving Data

We're ready to run. Go to your terminal (at the project root) and run:

You'll see Scrapy start up, and in the logs, you'll see all 48 items being scraped!

But we want to save this data. Scrapy has a built-in "Feed Exporter" that makes this easy. We just use the -o (output) flag.

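This will run the spider again, but this time you'll see a new books.json file in your project root containing all 48 items as structured JSON. Note that -o appends to an existing file, which can leave books.json invalid after a second run; use -O (capital letter) to overwrite the file instead.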
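Once the file exists, a quick sanity check is worthwhile. The sketch below flags items with a missing or empty field. find_incomplete is a hypothetical helper, and the inline sample stands in for the real file contents; its second entry is made up to show a failing case.

```python
import json

REQUIRED_KEYS = {"name", "price", "url"}


def find_incomplete(items):
    """Return the items that lack a required key or have an empty value for it."""
    return [
        item for item in items
        if any(not item.get(key) for key in REQUIRED_KEYS)
    ]


# Inline sample standing in for the contents of books.json
sample = json.loads("""
[
  {"name": "The Host", "price": "£25.82",
   "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"},
  {"name": null, "price": "£10.00", "url": "https://example.com/placeholder"}
]
""")

incomplete = find_incomplete(sample)
print(len(sample), "items,", len(incomplete), "incomplete")
```

To check the real output, load it with json.load on the opened books.json file and pass the resulting list to find_incomplete. If the crawl completed, the list should have 48 entries.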

Conclusion & Next Steps

Today you built a powerful, modern, async Scrapy crawler. You learned how to set up a project, find selectors, follow links, and handle pagination.

This is just the starting block.

What's Next? Join the Community.

  • 💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
  • ▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.
  • 📩 READ: Want more? In Part 2, we'll cover Scrapy Items and Pipelines. Get the Extract newsletter so you don't miss it.

```python
# The final spider we'll build
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]

    url: str = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url
        }
```
```bash
# Create a new folder
mkdir scrapy_project
cd scrapy_project

# Option 1: Using standard pip + venv
python -m venv .venv
source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate

# Option 2: Using uv (a fast, modern alternative)
uv init
```

```bash
# Option 1: Using pip
pip install scrapy

# Option 2: Using uv
uv add scrapy
source .venv/bin/activate
```
```bash
# The 'scrapy startproject' command creates the project structure
# The '.' tells it to use the current folder
scrapy startproject tutorial .
```

```bash
# The 'genspider' command creates a new spider file
# Usage: scrapy genspider <spider_name> <allowed_domain>
scrapy genspider books toscrape.com
```
```python
# tutorial/settings.py

# Find this line and change it to False
ROBOTSTXT_OBEY = False
```

```python
# tutorial/settings.py

# Uncomment or add these lines
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0
```
```bash
scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
```
```python
# In scrapy shell:
>>> response.css("article.product_pod h3 a::attr(href)").getall()
[
  '../../../the-host_979/index.html',
  '../../../the-hunted_978/index.html',
  ...
]
```
```python
# In scrapy shell:
>>> response.css("li.next a::attr(href)").get()
'page-2.html'
```

```python
# Exit the shell and open a new one:
# scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html

# In scrapy shell:
>>> response.css("h1::text").get()
'The Host'

>>> response.css("p.price_color::text").get()
'£25.82'
```
```python
# tutorial/spiders/books.py

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]

    # This is our starting URL (the first page of the Fantasy category)
    url: str = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

    # This is the modern, async version of 'start_requests'
    async def start(self):
        # We yield our first request, sending the response to 'parse_listpage'
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # This function handles the *category page*
    async def parse_listpage(self, response):

        # 1. Get all product URLs using the selector we found
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()

        # 2. For each product URL, follow it and send the response to 'parse_book'
        for url in product_urls:
            yield response.follow(url, callback=self.parse_book)

        # 3. Find the 'Next' page URL
        next_page_url = response.css("li.next a::attr(href)").get()

        # 4. If a 'Next' page exists, follow it and send the response
        #    back to this same callback
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # This function handles the *product page*
    async def parse_book(self, response):

        # We yield a dictionary of the data we want
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url
        }
```
```bash
scrapy crawl books
```

```bash
scrapy crawl books -o books.json
```