QA: Web scraping at scale, anti-ban and legal compliance

We gathered the questions on web scraping and data extraction at the Extract Summit. Read this blog to get answers on bans, proxies or GDPR in web scraping.

Himanshi Bhatt

5 min read · October 10, 2019

Web scraping questions & answers part I

As you know, we held the first-ever Web Data Extraction Summit last month, and the talks drew a lot of questions from the audience. We have divided the questions into two parts. In this first part, we cover web scraping at scale (proxy and anti-ban best practice) and legal compliance (GDPR in the world of data extraction). Enjoy! You can also check out the full talks on these topics.

*For information on 2022 Extract Summit visit this link*

Please note: The answers outlined below are opinions based on the knowledge and experience of the experts at Zyte. The answers are for informational purposes only and we do not offer any warranties of any kind, either express or implied, regarding the information contained herein.

Web scraping at scale—proxy and anti-ban best practice

Q: Can you imagine a future where antibot companies find a way to block all scrapers and web data extraction will no longer be feasible?

A: No. Because anti-bot vendors and bot developers have access to similar tools, there will be a constant ebb and flow, but never a complete stop to web scraping.

Q: Do you have your own proxies? How many proxies do you have?

A: We handle all types of datacenter proxies from multiple providers to ensure a diverse pool for every use case. Our proxy pools number in the hundreds of thousands. That is an important figure, but our constant focus is on delivering successful responses to our customers.

Q: Will Zyte Smart Proxy Manager deal with browser fingerprinting or offer specific solutions for specific anti-bot companies?

A: While Zyte Smart Proxy Manager does provide browser profiles, going forward there will be more features built under the hood which will help with more sophisticated anti-bots.

Q: How do you avoid captcha requirements?

A: By crawling responsibly and using proxies.

Q: Do you use chrome headless browsers or are they detected too easily?

A: We use headless browsers and all browsers can be detected as bots.

Q: How do you know what sites are using to detect you so you can adapt to it?

A: It requires careful inspection of the response body, headers, and at times the entire network traffic to understand the behavior of the underlying anti-bot. We do use some internal tools to identify the type of detection used. There are also open-source tools like "don't fingerprint me" which allow you to assess the browser fingerprinting used by the website.
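As a rough illustration of inspecting headers and cookies, the sketch below guesses which anti-bot service sits in front of a site. The marker lists are assumptions for illustration only: vendors change their headers and cookie names over time, and the function name `guess_antibot` is hypothetical. It is not one of the internal tools mentioned above.

```python
from typing import Mapping, Sequence, Set

# Heuristic markers only; treat them as illustrative assumptions,
# not as a maintained signature list.
HEADER_MARKERS = {
    "cloudflare": ("cf-ray",),
}
COOKIE_PREFIX_MARKERS = {
    "incapsula": ("visid_incap_", "incap_ses_"),
    "akamai": ("_abck", "ak_bmsc"),
}


def guess_antibot(headers: Mapping[str, str], cookie_names: Sequence[str]) -> Set[str]:
    """Return the vendor names whose markers appear in a response."""
    # Header names are compared in lowercase so "CF-RAY" and "cf-ray" both match.
    lowered = {name.lower(): value for name, value in headers.items()}
    found = set()
    for vendor, names in HEADER_MARKERS.items():
        if any(name in lowered for name in names):
            found.add(vendor)
    if "cloudflare" in lowered.get("server", "").lower():
        found.add("cloudflare")
    for vendor, prefixes in COOKIE_PREFIX_MARKERS.items():
        # str.startswith accepts a tuple of candidate prefixes.
        if any(cookie.startswith(prefixes) for cookie in cookie_names):
            found.add(vendor)
    return found
```

An empty result does not mean the site is unprotected; it only means none of these particular markers were seen.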

Q: Are HTTP headers ordered?

A: Yes, they are, and anti-bot companies keep a signature directory that can identify inconsistencies in the request headers.
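To make the point about ordering concrete, here is a minimal sketch that serializes a request with the headers in exactly the order given, and checks a client's header order against a reference list. The reference order used in the usage note is made up for illustration; it is not a real browser signature, and both function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def build_request(method: str, path: str, headers: Iterable[Tuple[str, str]]) -> bytes:
    """Serialize an HTTP/1.1 request, keeping header order and casing as given."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def same_relative_order(sent: List[str], reference: List[str]) -> bool:
    """True if the header names shared with `reference` appear in the same order."""
    wanted = [name.lower() for name in reference]
    sent_lower = [name.lower() for name in sent]
    shared_sent = [name for name in sent_lower if name in wanted]
    shared_ref = [name for name in wanted if name in sent_lower]
    return shared_sent == shared_ref
```

For example, with the illustrative reference `["Host", "User-Agent", "Accept"]`, a client sending `["Host", "Accept", "User-Agent"]` fails the check, which is the kind of inconsistency a signature directory could pick up.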

Q: Do you know cases of companies feeding fake data instead of blocking requests?

A: While we cannot name them, there are several examples of e-commerce websites faking the data. It is thus important to perform QA and look for these anomalies by checking past data.
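A simple form of this QA check is to compare each newly scraped value against what was collected before. The sketch below flags a value that deviates too far from the median of past values; the function name `is_suspicious` and the 50% default tolerance are assumptions chosen for illustration, and real pipelines will need thresholds tuned per field.

```python
from statistics import median
from typing import Sequence


def is_suspicious(past_values: Sequence[float], new_value: float,
                  tolerance: float = 0.5) -> bool:
    """Flag `new_value` if it deviates from the median of past values
    by more than `tolerance` (a fraction: 0.5 means 50%)."""
    if not past_values:
        return False  # nothing to compare against yet
    baseline = median(past_values)
    if baseline == 0:
        return new_value != 0
    return abs(new_value - baseline) / abs(baseline) > tolerance
```

A flagged value is only a prompt for a closer look: a genuine sale can trip the same check as faked data.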

Q: Any experiences with global anti scrape/website security services like Cloudflare when doing broad crawls?

A: Cloudflare and Akamai are ubiquitous and are encountered frequently on websites. They come in various flavors, so there is a whole host of approaches to scraping such websites, ranging from proxy rotation (to get around geoblocking) to headless browsers.

Q: Does Scrapy handle Cloudflare challenges or integrate with cfscrape well?

A: By default, Scrapy does not handle Cloudflare challenges; you need to write code to do so.

Q: Does uppercase matter in HTTP headers?

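A: For scraping purposes, yes. The HTTP specification treats header field names as case-insensitive, but anti-bot systems can still compare the casing a client sends with what a real browser sends, so unusual capitalization can give a scraper away.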

Q: How do requests/second relate to getting banned? Does slower crawling lead to fewer bans?

A: Yes, it's an important value to consider. Zyte Smart Proxy Manager sets the optimal rate based on the concurrency requested by the client, how the proxies are performing, and how the site is responding.

Q: How do you ensure customers behave responsibly - i.e. don't DDOS sites?

A: Zyte Smart Proxy Manager has a proven throttling algorithm that uses the site's stats and the customer's concurrency to ensure a reasonable rate is used when targeting the site.
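The general idea of throttling on site feedback can be sketched as follows. This is not the algorithm used by the product described above, which is not published here; it is a minimal backoff scheme with assumed parameters (doubling on 429/503, a 10% speed-up on success) and a hypothetical class name.

```python
class AdaptiveDelay:
    """Back off when the site signals trouble, speed up slowly when it does not."""

    def __init__(self, start: float = 1.0, minimum: float = 0.5, maximum: float = 60.0):
        self.delay = start
        self.minimum = minimum
        self.maximum = maximum

    def record(self, status: int) -> float:
        """Update and return the delay (seconds) to wait before the next request."""
        if status in (429, 503):
            # The site is pushing back: double the delay, up to the cap.
            self.delay = min(self.delay * 2, self.maximum)
        elif 200 <= status < 300:
            # Healthy response: shave 10% off, down to the floor.
            self.delay = max(self.delay * 0.9, self.minimum)
        return self.delay
```

A crawler would call `record()` after each response and sleep for the returned number of seconds. Scrapy users get a comparable mechanism from the built-in AutoThrottle extension.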

Q: Is there an automated way to identify and act on captcha and blockage?

A: Yes, there are many default ban rules that can be automated, while others, such as captchas, may need to be customized. The mechanisms for obtaining successful responses are rotating proxies, proper throttling, and making sure requests are well-formed.
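A sketch of what such rules can look like: a few default checks on status code and body text, plus optional per-site markers for the customized cases. The status codes and marker strings below are assumed defaults for illustration, not a list taken from any product, and `classify` is a hypothetical name.

```python
from typing import Iterable

BAN_STATUSES = {403, 429, 503}                     # assumed defaults; tune per site
CAPTCHA_MARKERS = ("captcha", "are you a robot")   # assumed body markers


def classify(status: int, body: str, site_markers: Iterable[str] = ()) -> str:
    """Return 'captcha', 'ban', or 'ok' for a response."""
    text = body.lower()
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"
    # Per-site markers cover block pages that still return HTTP 200.
    if status in BAN_STATUSES or any(m.lower() in text for m in site_markers):
        return "ban"
    return "ok"
```

The caller can then act on the result, for example by retrying through a different proxy on `'ban'` and slowing down on `'captcha'`.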

Q: How do you handle Incapsula and reCAPTCHA v3?

A: Handling different CDNs or anti-bot mechanisms depends mainly on the site. Using proxies is a big part of it, and ensuring the client follows user-like crawling patterns is important as well.

Q: What happens with burned IPs, is there any way to "recycle" them?

A: Yes, recycling is an important part of the process. Usually, web servers unban proxies after a period of time.
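Recycling can be sketched as a rotating pool that rests a banned proxy for a cooldown period before putting it back into rotation. The class name `ProxyPool` and the 600-second default cooldown are assumptions for illustration; the right rest period depends on how long a given site keeps its bans.

```python
from typing import Dict, List, Optional


class ProxyPool:
    """Rotate proxies, resting any that get banned for `cooldown` seconds."""

    def __init__(self, proxies: List[str], cooldown: float = 600.0):
        self.proxies = list(proxies)
        self.cooldown = cooldown
        self.banned_at: Dict[str, float] = {}
        self._next = 0

    def mark_banned(self, proxy: str, now: float) -> None:
        """Record the time at which `proxy` was seen to be banned."""
        self.banned_at[proxy] = now

    def get(self, now: float) -> Optional[str]:
        """Return the next usable proxy, or None if all are resting."""
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self._next % len(self.proxies)]
            self._next += 1
            banned = self.banned_at.get(proxy)
            if banned is None or now - banned >= self.cooldown:
                self.banned_at.pop(proxy, None)  # recycled: back in rotation
                return proxy
        return None
```

In real use, `now` would come from a clock such as `time.monotonic()`; passing it in explicitly keeps the class easy to test.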


Legal compliance, GDPR in the world of web scraping

_Disclaimer: None of the opinions or information below constitute legal advice to you. If you want assistance with your specific situation then you should consult a lawyer._

Q: Will UK companies have to comply with GDPR post Brexit?

A: The UK Information Commissioner's Office provides the following guidance: The government intends to incorporate the GDPR into UK data protection law when the UK exit the EU – so in practice there will be little change to the core data protection principles, rights, and obligations found in the GDPR. The EU version of the GDPR may also still apply directly to a UK company if it operates in Europe, offers goods or services to individuals in Europe, or monitors the behavior of individuals in Europe.

Q: Does GDPR apply if my company is based in the US and is scraping personal data of EU citizens living in the US?

A: If the US company is not established in the EU and is not targeting the EU market then the processing will not come within the GDPR.

Q: Is it legal to scrape information from a company, like an email or phone number?

A: A company email address or phone number will come within GDPR if it is considered to be personal data. GDPR defines personal data as “any information relating to an identified or identifiable natural person.” A generic company contact email such as sales@zyte.com is unlikely to be considered personal data under GDPR. The email address of an individual working at that company will be considered personal data if that individual is identifiable, for example, if the email address contains that individual's name. If you are scraping personal data, you need to have a lawful basis to do so under GDPR.

Q: If the personal information is not enough to be able to reach the subject (e.g. just their name but no contact details) what are the obligations?

A: When personal data is collected from other sources (i.e. not collected directly from the individual by you), there are certain exceptions to your obligation to provide privacy information that may apply, for example, if providing the information is impossible or involves a disproportionate effort. If you intend to rely on one of these exceptions you must carry out a Data Protection Impact Assessment.

Q: If collecting data on behalf of another organization, who has GDPR requirements?

A: A data controller determines why and how personal data is processed. If you are processing personal data on behalf of a data controller then you are a data processor under GDPR. Both data controllers and data processors must comply with GDPR and have obligations under GDPR. However, a data processor has fewer obligations than a data controller.

Q: Does data minimization apply when no personal data is involved?

A: The principle of data minimization under GDPR will not apply if no personal data is being processed. However, the data collection may be subject to other non-GDPR related restrictions.

Q: Are usernames considered personal information?

A: Yes, usernames may be personal data. If an individual is identified or identifiable from a username, then it will be considered personal data under GDPR.

If you have any more questions or queries on the above-discussed topics, feel free to leave a comment below and we will try our best to answer them. In our next post, we will cover questions on The Next Generation of Web Scraping and How Machine Learning can be used in Web Scraping. Stay tuned! Also, you can access the recordings of the Extract Summit talks.
