Automate deployment of your web scraper on VPS with Ubuntu 24.04 cloud-init

Your VPS is ready, but now you need to work through the same sequence you have run a dozen times before: apt update, apt install python3-pip, pip install scrapy, playwright install chromium, the Chromium dependency list that never installs cleanly on the first try, Redis, possibly Postgres, whatever else this particular project needs.

Ayan Pahwa · Developer Advocate

9 min read · May 31, 2026

An hour later, the machine is ready. Three months later, when that VM is gone and you need a replacement, the process repeats, and the result is a little different from the original because you are working from memory, not a spec.

Cloud-init solves this by moving provisioning out of your SSH session and into a configuration file that the VM reads during its very first boot. You write the config once, paste it into your VPS provider's user-data field when you create the instance, and everything you need is running before you log in for the first time. I had a similar problem, so I built spawn-cloud-scrapers to generate that config file for you. Add your ZYTE_API_KEY and any other environment variables, optionally paste a GitHub URL for an existing Scrapy project to auto-clone on first boot, select your tools, and copy the ready-to-use #cloud-config YAML from the output panel.

[Screenshot: CNT-1201-screenshot.png]

Why VPS over your local machine

Scrapers that run on a laptop have a fundamental persistence problem: crawl jobs are interrupted every time the machine sleeps, the network switches, or you need to restart for an unrelated reason. The environment on a laptop, shaped by years of installs, upgrades, and one-off fixes, is also difficult to reproduce on a fresh machine, which matters when you need to debug a problem in a clean environment or hand the project to someone else.

A dedicated VPS gives you a process that keeps running after you close your laptop, an environment defined entirely by what you install during provisioning, and a reliable place to run scheduled crawls without involving your development machine.

For the requests themselves, Zyte API handles IP rotation, unblocking, and browser rendering in a single call, so your spider code stays focused on extraction logic rather than infrastructure concerns. What spawn-cloud-scrapers gives you is a clean, reproducible home for that code to run from. The discipline of provisioning via cloud-init reinforces this: when you cannot "just SSH in and fix it quickly," the config file stays accurate, and the next machine you spin up is genuinely identical to the last one.
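To make the "single call" concrete, here is a minimal sketch of such a request built with only the Python standard library. The endpoint, the Basic-auth scheme (API key as username, empty password), and the httpResponseBody field follow Zyte API's public documentation as I understand it; the helper names build_zyte_request and fetch_html are hypothetical and not part of spawn-cloud-scrapers.

```python
import base64
import json
import urllib.request

ZYTE_ENDPOINT = "https://api.zyte.com/v1/extract"


def build_zyte_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a POST request asking Zyte API for the raw response body of `url`.

    Hypothetical helper for illustration only.
    """
    payload = json.dumps({"url": url, "httpResponseBody": True}).encode("utf-8")
    # HTTP Basic auth: the API key is the username, the password is empty.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ZYTE_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_html(url: str, api_key: str) -> str:
    """Send the request and decode the base64-encoded body (needs network access)."""
    with urllib.request.urlopen(build_zyte_request(url, api_key), timeout=60) as resp:
        body = json.load(resp)
    return base64.b64decode(body["httpResponseBody"]).decode("utf-8", errors="replace")
```

On the VM, the key would typically come from the ZYTE_API_KEY variable rather than being hard-coded in the spider.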

What is cloud-init?

Cloud-init is the industry-standard first-boot initialization system for Linux VMs. Every major cloud provider supports it, which means a #cloud-config YAML file you write today will work on AWS, Google Cloud, Hetzner, DigitalOcean, Linode, Vultr, Azure, OVHcloud, Scaleway, and Oracle Cloud without modification. The file is passed to the VM at creation time via a user-data field, and cloud-init processes it once, during the first boot. SSH can become reachable before the final runcmd steps have finished, so wait for cloud-init to report completion before relying on the installed services.

A cloud-config file can install packages via apt, write files to the filesystem, create users, add SSH keys, run arbitrary commands in sequence, and enable or start services. Everything spawn-cloud-scrapers needs to do to build a scraping stack fits within these primitives.
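As a rough illustration of how little structure is involved, the sketch below renders a minimal cloud-config from a list of packages and a list of commands. It is not the tool's actual generator: the function name render_cloud_config is hypothetical, and it covers only the packages and runcmd primitives.

```python
import json


def render_cloud_config(packages: list[str], commands: list[str]) -> str:
    """Render a minimal #cloud-config document (illustrative, not the tool's code)."""
    lines = ["#cloud-config", "package_update: true"]
    # json.dumps yields double-quoted strings, which YAML also accepts, so a
    # command containing ": " or "#" is not misread as YAML syntax.
    if packages:
        lines.append("packages:")
        lines += [f"  - {json.dumps(pkg)}" for pkg in packages]
    if commands:
        lines.append("runcmd:")
        lines += [f"  - {json.dumps(cmd)}" for cmd in commands]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_cloud_config(
        ["python3-pip", "redis-server"],
        ["systemctl enable redis-server", "systemctl start redis-server"],
    ))
```

Quoting every scalar is a deliberately conservative choice; a real generator might emit plain scalars where that is safe.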

What spawn-cloud-scrapers generates

Select any combination of the following services and the tool builds a deduplicated, correctly ordered cloud-config around them:

Service           | How it installs                                                      | Notes
------------------|----------------------------------------------------------------------|-------------------------------
Scrapy            | uv pip install scrapy                                                | Python spider framework
Playwright Python | pip install playwright + playwright install chromium --with-deps     | Browser automation
Puppeteer         | npm install -g puppeteer                                             | Node.js headless Chrome
Redis             | apt: redis-server, systemd enable                                    | Queue / cache (port 6379)
PostgreSQL        | apt: postgresql, systemd enable                                      | Relational DB (port 5432)
Tor Proxy         | apt: tor, systemd enable                                             | Anonymous routing (port 9050)
mitmproxy         | uv pip install mitmproxy                                             | Traffic inspection (port 8080)

Splash is available in Flatcar/Docker mode only and does not appear in Ubuntu mode, because it has no native Ubuntu package and running it as a container requires a separate Docker setup that Ubuntu mode does not provision. For everything else, native installs are stable, start on boot via systemd, and require no container runtime.

[Diagram: Ubuntu pipeline (CNT-1201-diagram_ubuntu_pipeline.png)]

The PEP 668 issue on Ubuntu 24.04

Ubuntu 24.04 enforces PEP 668, which marks the system Python environment as "externally managed" and prevents pip install from modifying it without an explicit override. Run a bare pip3 install scrapy on a fresh Ubuntu 24.04 VM and you get:

error: externally-managed-environment
× This environment is externally managed

One pattern that works on a fresh image is to bootstrap uv first (itself installed with pip's --break-system-packages override), then use uv with the --system and --break-system-packages flags for all subsequent installs:

pip3 install uv --break-system-packages
uv pip install --system --break-system-packages scrapy

spawn-cloud-scrapers generates exactly this sequence in the runcmd section. Every Python package install in the output follows this pattern, so the provisioning script works on a vanilla Ubuntu 24.04 image without needing any pre-configuration. This is one of those details that is obvious in hindsight but costs you time the first time you encounter it on a fresh server, as described in the setup troubleshooting section of Scraping Swiss Army Knife: my personal fix for web setup fatigue.

A look at the generated cloud-config

Here is a representative cloud-config for a Scrapy and Redis stack, showing the structure that spawn-cloud-scrapers produces:

#cloud-config
package_update: true

users:
  - name: ubuntu
    groups: sudo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - 'ssh-ed25519 AAAA... you@host'

write_files:
  - path: /etc/scraper/.env
    content: |
      ZYTE_API_KEY=your_key_here
    permissions: '0644'

packages:
  - python3-pip
  - redis-server
  - git

runcmd:
  - chown ubuntu:ubuntu /etc/scraper/.env
  - pip3 install uv --break-system-packages
  - uv pip install --system --break-system-packages scrapy
  - systemctl enable redis-server
  - systemctl start redis-server

A few details in this structure are worth understanding. The chown command for /etc/scraper/.env is always the first item in runcmd. Cloud-init's write_files phase runs before the user-creation phase, which means the ubuntu user does not exist yet when the file is written, so setting owner: ubuntu:ubuntu in write_files would fail because the user cannot be resolved. The runcmd phase runs after users are created, so the chown there is guaranteed to find the user. Note also that mode 0644 leaves the file readable by every local user; if it holds secrets and the machine is shared, a tighter mode such as 0600 may be preferable.

The packages list is automatically deduplicated across all selected services. If you select both Scrapy and mitmproxy, both of which need python3-pip, it appears in the list only once. The runcmd entries run in selection order, with service-level dependencies respected.
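The deduplication described above can be sketched in a few lines. The SERVICES catalogue below is a made-up example that mirrors the install commands shown in this article, not the tool's real data structure; the point is only that dict.fromkeys removes repeats while keeping first-seen order.

```python
# Hypothetical catalogue: apt packages and runcmd steps for each service.
SERVICES = {
    "scrapy": {
        "packages": ["python3-pip"],
        "runcmd": [
            "pip3 install uv --break-system-packages",
            "uv pip install --system --break-system-packages scrapy",
        ],
    },
    "mitmproxy": {
        "packages": ["python3-pip"],
        "runcmd": [
            "pip3 install uv --break-system-packages",
            "uv pip install --system --break-system-packages mitmproxy",
        ],
    },
    "redis": {
        "packages": ["redis-server"],
        "runcmd": ["systemctl enable redis-server", "systemctl start redis-server"],
    },
}


def merge_services(selected: list[str]) -> tuple[list[str], list[str]]:
    """Combine packages and commands for the selected services, in selection order."""
    packages: list[str] = []
    commands: list[str] = []
    for name in selected:
        packages.extend(SERVICES[name]["packages"])
        commands.extend(SERVICES[name]["runcmd"])
    # dict.fromkeys keeps the first occurrence of each item and preserves order.
    return list(dict.fromkeys(packages)), list(dict.fromkeys(commands))
```

Because shared steps such as the uv bootstrap are listed first in each service, they end up ahead of the installs that depend on them after deduplication.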

Scrapy project auto-clone

If you have an existing Scrapy project in a git repository, tick the Scrapy service and paste the HTTPS URL of your repo into the git URL field that appears. The generated config adds git to the packages list and a clone sequence to runcmd:

git clone https://github.com/your-org/your-scraper.git /home/ubuntu/your-scraper
chown -R ubuntu:ubuntu /home/ubuntu/your-scraper
cd /home/ubuntu/your-scraper && \
  uv pip install --system --break-system-packages -r requirements.txt || \
  uv pip install --system --break-system-packages scrapy

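The || fallback to a bare Scrapy install handles projects that use pyproject.toml rather than requirements.txt: if the requirements install fails because the file is absent, the bare install ensures Scrapy is available regardless. The same fallback also fires on any other failure of the first command, such as a dependency that fails to build, so check the cloud-init log if packages from requirements.txt appear to be missing. Use HTTPS URLs rather than SSH git URLs; SSH would require a deploy key on the VM, which spawn-cloud-scrapers does not provision.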
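For readers adapting the generator, here is a sketch of how the clone sequence could be derived from a repository URL. The function clone_commands is hypothetical; it assumes the project directory is named after the last path segment of the URL and lives in the user's home directory, as in the example above, and it rejects non-HTTPS URLs for the reason just given.

```python
import posixpath
import shlex
from urllib.parse import urlparse

INSTALL = "uv pip install --system --break-system-packages"


def clone_commands(repo_url: str, user: str = "ubuntu") -> list[str]:
    """Return the runcmd lines that clone a project and install its dependencies."""
    parsed = urlparse(repo_url)
    if parsed.scheme != "https":
        raise ValueError("use an HTTPS clone URL; SSH URLs would need a deploy key")
    # Directory name: last path segment without a trailing ".git".
    name = posixpath.basename(parsed.path.rstrip("/"))
    if name.endswith(".git"):
        name = name[: -len(".git")]
    if not name:
        raise ValueError("could not derive a directory name from the URL")
    dest = shlex.quote(f"/home/{user}/{name}")
    return [
        f"git clone {shlex.quote(repo_url)} {dest}",
        f"chown -R {user}:{user} {dest}",
        f"cd {dest} && {INSTALL} -r requirements.txt || {INSTALL} scrapy",
    ]
```

Quoting the URL and destination with shlex.quote is a precaution against unusual characters in a pasted URL; ordinary GitHub URLs pass through unchanged.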

[Screenshot: Scrapy git URL field (CNT-1201-screenshot_scrapy_git.png)]

How to use it

The tool is hosted at spawn-cloud-scrapers and requires no account or installation. If you want to run it locally or adapt it for your team:

git clone https://github.com/zytelabs/spawn-cloud-scraper
cd spawn-cloud-scraper
open index.html   # macOS; use xdg-open on Linux

The workflow:

  1. Enter a hostname (single-quoted in the output, safe for any provider's user-data field).
  2. Paste your SSH public key(s).
  3. Add environment variables: your ZYTE_API_KEY, database credentials, or any other secrets that will land in /etc/scraper/.env on the VM.
  4. Tick the services you need.
  5. Optionally paste a Scrapy project git URL.
  6. Make sure Ubuntu mode is selected (it is the default).
  7. Copy the generated #cloud-config YAML from the output panel.
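Once the VM is up, the variables from step 3 live in /etc/scraper/.env. A spider or helper script might read them with something like the sketch below. It assumes the plain KEY=VALUE format shown in the sample config; the function name load_env_file is hypothetical, and the parsing is deliberately simple (no multi-line values or shell expansion).

```python
from pathlib import Path


def load_env_file(path: str = "/etc/scraper/.env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


if __name__ == "__main__":
    env = load_env_file()
    print("ZYTE_API_KEY set:", "ZYTE_API_KEY" in env)
```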

[Screenshot: Ubuntu mode output (CNT-1201-screenshot_ubuntu_output.png)]

CLI automation across providers

The generated YAML can be passed directly to any provider's CLI. Here are working examples for three common choices:

DigitalOcean (doctl):

doctl auth init

# Join all registered key fingerprints with commas, the format --ssh-keys expects
doctl compute droplet create my-scraper \
  --region nyc3 \
  --size s-2vcpu-4gb \
  --image ubuntu-24-04-x64 \
  --ssh-keys "$(doctl compute ssh-key list --no-header --format FingerPrint | paste -sd, -)" \
  --user-data-file ./cloud-config.yaml \
  --wait

Hetzner Cloud (hcloud):

# Server type names change over time; list current ones with: hcloud server-type list
hcloud server create \
  --name my-scraper \
  --type cx22 \
  --image ubuntu-24.04 \
  --ssh-key your-key-name \
  --user-data-from-file cloud-config.yaml

AWS EC2 (aws-cli):

# AMI IDs are region-specific: replace the placeholder with the
# Ubuntu 24.04 AMI ID for your region
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t3.small \
  --key-name your-key-pair \
  --user-data file://cloud-config.yaml \
  --count 1

After the instance boots, watch cloud-init finish:

ssh ubuntu@<ip> "cloud-init status --wait && echo done"

Or tail the log directly to see each step as it runs:

ssh ubuntu@<ip> "tail -f /var/log/cloud-init-output.log"
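If you provision machines from a script, the same status check can be wrapped in Python. This is a sketch under stated assumptions: it relies on cloud-init status printing a line of the form "status: <state>", it shells out to the local ssh client, and the function names are hypothetical.

```python
import subprocess


def parse_status(output: str) -> str:
    """Extract the state from `cloud-init status` output, e.g. 'running' or 'done'."""
    for line in output.splitlines():
        if line.strip().startswith("status:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


def remote_status(host: str, user: str = "ubuntu") -> str:
    """Run `cloud-init status` on the VM over SSH and return the parsed state."""
    result = subprocess.run(
        ["ssh", f"{user}@{host}", "cloud-init", "status"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return parse_status(result.stdout)
```

A provisioning script could call remote_status in a loop with a short sleep until it stops returning "running", then inspect the log if the final state is anything other than "done".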

Verifying your services

Once cloud-init reports success, verify each service you selected:

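# Stock Ubuntu packages bind Redis, PostgreSQL, and Tor to localhost only,
# so run these checks on the VM (assumes the generated config keeps those defaults).
ssh ubuntu@<ip>

# Redis
redis-cli ping
# Expected: PONG

# PostgreSQL (peer authentication as the postgres system user)
sudo -u postgres psql -c "SELECT version();"

# Scrapy (interactive shell)
scrapy shell https://example.com

# Tor: the JSON response should report "IsTor": true
curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip

# mitmproxy web UI: start mitmweb on the VM, then from your own machine run
#   ssh -L 8081:127.0.0.1:8081 ubuntu@<ip>
# and open http://127.0.0.1:8081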

Redis and PostgreSQL are enabled via systemd and start automatically on every boot. Scrapy and mitmproxy are installed system-wide via uv pip install --system and are available in the PATH for the ubuntu user.
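For a quick scripted health check, you can also test whether each service accepts TCP connections. The sketch below is meant to run on the VM itself; the port numbers come from the service table earlier in this article, while the function names and the choice of 127.0.0.1 are assumptions.

```python
import socket

# Default ports from the service table; adjust if your config differs.
DEFAULT_PORTS = {"redis": 6379, "postgresql": 5432, "tor": 9050}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "127.0.0.1", ports: dict[str, int] = DEFAULT_PORTS) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'up' if ok else 'down'}")
```

An open port only shows that something is listening; the service-specific commands above remain the better test that the service actually works.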

VPS provider support

Cloud-init is supported by virtually every cloud provider that runs Linux, which makes the Ubuntu mode in spawn-cloud-scrapers the most portable option:

Provider      | cloud-init support | Ubuntu 24.04 image
--------------|--------------------|---------------------------
DigitalOcean  | Yes                | ubuntu-24-04-x64
Hetzner Cloud | Yes                | ubuntu-24.04
Vultr         | Yes                | Available
AWS EC2       | Yes                | AMI in all regions
Google Cloud  | Yes                | ubuntu-2404-lts
Azure         | Yes                | Canonical:ubuntu-24_04-lts
Linode/Akamai | Yes                | Available
OVHcloud      | Yes                | Available
Scaleway      | Yes                | Available
Oracle Cloud  | Yes                | Available

If your provider does not offer Flatcar Linux images, Ubuntu mode is the one to use: the config is plain YAML with no cloud-provider-specific extensions.

What this approach does not do

Cloud-init is a first-boot provisioner. It runs once, and changes to the cloud-config file do not automatically propagate to running machines. If you update the config and want the change on an existing VM, the options are to redeploy the VM with the new config or to SSH in and apply the change manually. For infrastructure that changes frequently, container-based approaches like the Flatcar mode handle redeployment more gracefully, since the entire stack is described in the compose file and a systemctl restart scraper brings up the new configuration.

For most scraping workloads, however, the VM is provisioned once, runs for weeks or months, and is replaced rather than updated when its job is done. Cloud-init is well suited to that lifecycle, and the combination of a clean Ubuntu 24.04 base, a stable service set, and a reproducible config file eliminates most of the friction that makes infrastructure management tedious for scraping teams.

The architectural patterns for longer-running production pipelines, where the scraping logic itself needs to evolve independently of the infrastructure, are worth reading about separately: Hybrid scraping: the architecture for the modern web covers how to structure a production scraping stack that separates browser sessions from lightweight HTTP fetching, a pattern that fits cleanly on top of a cloud-init-provisioned VM.

Deploy now

Visit spawn-cloud-scrapers, select Ubuntu mode (the default), choose your services, and copy the generated #cloud-config YAML. The GitHub repository contains the full source if you want to extend it with additional services or adapt the output format for your team's toolchain.

For teams running container-native infrastructure or providers with Flatcar Linux support, the Flatcar mode in the same tool generates an Ignition JSON file that provisions an identical service selection on an immutable, Docker-only OS with no package manager and no configuration drift.

If you would rather skip the infrastructure layer entirely, Scrapy Cloud is worth checking out. It provides fully managed hosting for Scrapy spiders, with a generous free tier, built-in scheduling, job monitoring, and no servers to manage.
