Building a production-style web scraper with Scrapy, Docker, and PostgreSQL

Web scraping is often taught using scripts that invariably dump data into a JSON or CSV file. That’s fine for learning the basics, but it doesn’t reflect how scraping works at scale in real-world systems.

In practice:

  • Scrapers usually run as jobs, not always-on scripts or daemons.
  • Data needs to be stored reliably.
  • Environments must be reproducible.
  • Scaling and maintenance should be easy.

In this blog post, I'll walk through a demo project called scrape2postgresql, which shows how to:

  1. Use Scrapy to scrape structured data.
  2. Store results in PostgreSQL.
  3. Run everything using docker-compose.
  4. Keep spiders and database in separate containers.

This project uses books.toscrape.com, a safe demo website, to scrape book titles and prices, but the structure applies to almost any scraping use case.


Why Scrapy?

Scrapy is a full-featured web scraping framework, not just a request library.

It gives you:

  • A crawling engine.
  • Request scheduling.
  • Built-in support for pagination.
  • Structured item pipelines.
  • Retry and error handling.
  • Clear project structure.

Instead of writing while loops, requests and BeautifulSoup, Scrapy encourages you to think in terms of spiders, items, and pipelines. It scales much better as projects grow.
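The repo's items.py isn't reproduced in this post. As a sketch, the item could be as simple as a typed dict (Scrapy accepts plain dicts as items), which keeps dict-style access like item["title"] working in the pipeline; the field names here mirror what the spider scrapes:

```python
from typing import TypedDict

# Hypothetical item definition for this project. A TypedDict instance
# is a plain dict at runtime, so Scrapy and the pipeline can treat it
# like any other item while editors still get type checking.
class BookItem(TypedDict):
    title: str
    price: str  # raw scraped text, e.g. "£51.77"
```

The spider would then `yield BookItem(title=..., price=...)` instead of an anonymous dict.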


Why Docker (and docker-compose)?

A very common beginner setup looks like this:

  • Scrapy installed locally.
  • PostgreSQL installed locally.
  • Different Python versions and virtual environments.

This becomes painful fast: hard to scale, manage, and maintain. Enter Docker, which solves this by:

  • Packaging dependencies into a container.
  • Making environments consistent and reproducible, and letting them talk to one another.
  • Isolating concerns cleanly and sandboxing local networking.

Want to switch databases from PostgreSQL to MongoDB? Just fetch the image from Docker Hub and plug it in. Want to chart the data? Add another container, such as Grafana.

docker-compose goes one step further by letting us run multiple containers together as a bundle, taking care of the networking between them:

  • One container for Scrapy.
  • One container for PostgreSQL.

Each container does one thing well, which makes the project easier to maintain and bugs easier to isolate.


High-level architecture

Before we dive into code, let’s understand the architecture.

We’re using docker-compose to fire up two Docker containers:

  1. The first container runs our Scrapy spider, whose sole job is to scrape the page we provide and store the data in the database. It starts only when we need it.
  2. The other container is a PostgreSQL database with a persistent volume mounted on the host. Whatever the spider scrapes ends up in this database.

Because docker-compose sets up the networking between the two containers for us, we only need to supply the authentication credentials.

Scrapy Container (one-shot job) ───────> PostgreSQL Container (persistent service)


Key philosophy:

  • Scrapy is a job
    • starts
    • crawls
    • stores data
    • exits
  • PostgreSQL is a service
    • stays running
    • persists data
    • can be queried anytime

This separation is extremely important for scaling and maintenance.
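In docker-compose terms, this job-versus-service split maps naturally onto restart policies. A sketch (field values are illustrative, not the repo's exact file):

```yaml
services:
  scrapy:
    build: .
    restart: "no"            # a job: starts, crawls, stores data, exits

  postgres:
    image: postgres:15
    restart: unless-stopped  # a service: stays running, persists data
```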


Project Structure

Here’s the structure of scrape2postgresql:


.
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── requirements.txt
├── run_spider.sh
│
└── bookscraper/
    ├── scrapy.cfg
    └── bookscraper/
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── books.py

Let’s go through each part and understand why it exists.


Scrapy project

The spider (/spiders/books.py)

The spider is where the website's crawling logic lives.

At a high level, our spider:

  • Accepts a URL dynamically.
  • Extracts book titles and prices.
  • Follows pagination links.
  • Yields structured data.

Initializing the spider


def __init__(self, url=None, max_pages=None, *args, **kwargs):
    super().__init__(*args, **kwargs)

    if not url:
        raise ValueError("You must pass a URL")

    self.start_urls = [url]
    # Optional crawl limit, passed the same way at runtime: -a max_pages=5
    self.max_pages = int(max_pages) if max_pages else None

Instead of hard-coding URLs, we pass them at runtime. This makes the spider reusable for different categories or sites with similar structure.


CSS selectors

Scrapy supports both XPath and CSS selectors. CSS selectors are usually simpler and more readable.

Example:


for book in response.css("article.product_pod"):
    title = book.css("h3 a::attr(title)").get()
    price = book.css("p.price_color::text").get()

What this means:

  • article.product_pod selects each book card
  • h3 a::attr(title) extracts the book title
  • p.price_color::text extracts the price text

CSS selectors map directly to how the HTML is structured, making them easy to debug in browser DevTools.



Handling pagination

Pagination is one of the most important parts of any crawler: it is the logic that moves the crawl on to the next page whenever one exists.


next_page = response.css("li.next a::attr(href)").get()
if next_page:
    yield response.follow(next_page, callback=self.parse)

Scrapy handles relative URLs automatically with response.follow(), so you don’t have to manually build full URLs.

This approach ensures:

  • All pages in a category are crawled
  • No duplicate requests.
  • No infinite loops.
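The "no duplicates, no infinite loops" guarantees come from Scrapy's scheduler, which fingerprints each request and drops any it has already seen. Conceptually it behaves like the stdlib sketch below (this is an illustration with a fake in-memory link graph, not Scrapy's actual implementation):

```python
# Conceptual sketch of request de-duplication during pagination.
# Scrapy's scheduler uses request fingerprints; a plain set of URLs
# stands in here.
def crawl(start_url, next_page_of):
    seen = set()
    queue = [start_url]
    visited_order = []
    while queue:
        url = queue.pop(0)
        if url in seen:              # duplicate request: dropped
            continue
        seen.add(url)
        visited_order.append(url)
        nxt = next_page_of.get(url)  # like li.next a::attr(href)
        if nxt:
            queue.append(nxt)
    return visited_order

# A fake 3-page category whose last page links back to page 1:
pages = {"page-1": "page-2", "page-2": "page-3", "page-3": "page-1"}
```

Even though page 3 links back to page 1, the crawl terminates because the revisit is filtered out.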

The pipeline (pipelines.py)

Spiders extract data, but pipelines store data. This separation is intentional.

Our pipeline:

  • Opens a PostgreSQL connection.
  • Creates a table if needed.
  • Inserts each scraped item.

import psycopg2

class PostgresPipeline:
    def open_spider(self, spider):
        # Connection parameters elided here; see the repo for details.
        self.conn = psycopg2.connect(...)
        self.cur = self.conn.cursor()

The open_spider() method runs once, when the spider starts.


Inserting data


def process_item(self, item, spider):
    self.cur.execute(
        "INSERT INTO books (title, price) VALUES (%s, %s)",
        (item["title"], item["price"])
    )
    self.conn.commit()
    return item

Each item yielded by the spider passes through the pipeline.

This makes it easy to:

  • Add validation.
  • Normalize data.
  • Store in different backends later.
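As an example of the normalization a pipeline can do: the price arrives as text like "£51.77". A small hypothetical helper (not in the repo) could convert it to a number before the INSERT, and reject malformed values:

```python
def normalize_price(raw):
    """Convert a scraped price string like '£51.77' to a float.

    Raises ValueError on missing or malformed input so the pipeline
    can drop the item instead of inserting bad data.
    """
    if raw is None:
        raise ValueError("missing price")
    cleaned = raw.strip().lstrip("£$€")
    return float(cleaned)
```

Inside process_item() you might then call `item["price"] = normalize_price(item["price"])` before executing the INSERT.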

Dockerizing the Scraper

Dockerfile

The Dockerfile defines how the Scrapy container is built.


FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY bookscraper /app/bookscraper
COPY run_spider.sh /app/run_spider.sh
RUN chmod +x /app/run_spider.sh
ENTRYPOINT ["/app/run_spider.sh"]

Key points:

  • Uses a lightweight Python base image.
  • Installs dependencies once.
  • Copies the Scrapy project into the container.
  • Includes a run script.

The run script (run_spider.sh)

This script is what actually runs when the container starts.


#!/bin/sh
set -e

if [ -z "$URL" ]; then
  echo "ERROR: URL not provided"
  exit 1
fi

# scrapy.cfg lives in the project directory copied into the image
cd /app/bookscraper
scrapy crawl books -a url="$URL"

Why a script?

  • Easier debugging.
  • Clearer error messages.
  • Simpler command invocation.
  • Easier to extend later (cron, retries, etc.).
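For instance, scheduled scraping could be a single crontab entry that reuses the same container (the `scrapy` service name comes from the compose file; the path and URL are illustrative):

```
# Run the scraper every night at 03:00
0 3 * * * cd /path/to/scrape2postgresql && docker compose run --rm -e URL="https://books.toscrape.com/" scrapy
```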

Docker Compose

Docker Compose ties everything together.


services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example  # illustrative; the postgres image requires it
    volumes:
      - pgdata:/var/lib/postgresql/data

  scrapy:
    build: .
    depends_on:
      - postgres

volumes:
  pgdata:  # named volume so data survives container restarts

Important concepts here:

  • Separate containers.
  • Shared network.
  • Persistent volumes.
  • Explicit dependencies.

Scrapy can talk to PostgreSQL using the service name (postgres) as hostname.
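A sketch of how the pipeline's connection settings might be assembled, reading environment variables that docker-compose can inject (the variable names and defaults here are illustrative, not necessarily the repo's exact ones):

```python
import os

def connection_params():
    """Build psycopg2.connect() keyword arguments from the environment.

    'postgres' works as the default hostname because docker-compose puts
    both services on a shared network and resolves each service name
    via its internal DNS.
    """
    return {
        "host": os.environ.get("POSTGRES_HOST", "postgres"),
        "port": int(os.environ.get("POSTGRES_PORT", "5432")),
        "dbname": os.environ.get("POSTGRES_DB", "books"),
        "user": os.environ.get("POSTGRES_USER", "books"),
        "password": os.environ.get("POSTGRES_PASSWORD", ""),
    }
```

open_spider() could then call `psycopg2.connect(**connection_params())`.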


Makefile

Instead of typing long Docker commands, I've created a Makefile.

Clone the project from https://github.com/apscrapes/scrape2postgresql and use the make targets to set it up:

Example:


make db
make scrape url="https://books.toscrape.com/..."
make psql
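These targets are thin wrappers around docker compose. A sketch of what they might contain (the repo's actual recipes, user names, and flags may differ):

```makefile
db:      ## start the PostgreSQL container in the background
	docker compose up -d postgres

scrape:  ## run the spider once against $(url)
	docker compose run --rm -e URL="$(url)" scrapy

psql:    ## open an interactive psql shell inside the db container
	docker compose exec postgres psql -U books
```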

Why this design scales well

This setup scales because each component is isolated and replaceable:

  • Want more spiders? Add more Scrapy spiders.
  • Want scheduled scraping? Trigger make scrape via cron or CI.
  • Want another DB? Swap PostgreSQL for another Docker image (e.g., MongoDB).
  • Want to plot data points? Add a Grafana container.

Final thoughts

scrape2postgresql is intentionally simple, but architecturally solid.

It demonstrates:

  • How Scrapy is meant to be used.
  • How Docker simplifies environments.
  • Why separating spiders and databases matters.
  • How real scraping pipelines are structured.

If you’re new to web scraping, this project gives you a strong foundation. If you’re experienced, it gives you a clean starting template.


Next steps you could explore

  • Add item validation.
  • Integrate Zyte API to avoid bans.
  • Store historical price changes.
  • Add retries and throttling.
  • Expose data via an API.
  • Schedule scraping jobs.

Once you understand this setup, you can build data pipelines at scale.

Happy scraping 🚀.


© Zyte Group Limited 2026