Balancing innovation and regulation in data scraping

Weighing your options from full control to full service.

Sanaea Daruwalla

10 min read · October 14, 2025

For anyone involved in data gathering, the legal landscape can often feel like a waiting game, as protracted legal cases play out before becoming case law.

Recently, however, several of those cases have finally been decided.

For web data access, the news is positive. Innovators always have to balance what they do against regulation, but these rulings have confirmed a growing scope for innovation.

Public web data

This is the foundational element for so much of the innovation happening today, but it’s also where the regulatory story begins.

Innovation: Public data fuels creativity

The value of public web data is undeniable. On the innovation side of the scale, the arguments are clear:

  • Public web data is the largest data set in the world. The potential is infinite.

  • Web data can be used for countless business intelligence purposes, driving smarter decisions and creating new opportunities.

  • AI isn’t going anywhere, and we need good data to train it. Public data is the fuel for this technological revolution.

  • Fundamentally, we believe that public data should remain public.

Regulation: Logged-out public data capture may be permitted

Historically, the primary legal threat to web scraping came from the Computer Fraud and Abuse Act (CFAA), a US anti-hacking law. This was concerning because violations carried not only civil penalties (monetary damages) but also potential criminal penalties.

However, a few years ago, landmark court rulings in cases like hiQ Labs, Inc. v. LinkedIn Corp. and Van Buren v. United States clarified the landscape. The courts stated that if you have lawful access to the data (meaning anyone can go on a public website and see it), you are not violating the CFAA.

So, the question then became: “Can it nevertheless be a violation of a site’s Terms of Service (ToS)?” In early 2024, we saw a major ruling in the Meta v. Bright Data case that answers this question. The court ruled that Bright Data did not violate Meta's ToS.

However, while many headlines declared that all public data scraping is now okay, that's not quite what the case said. The court's decision was specific to the facts: Bright Data was scraping data that was not behind a login, and its activity did not violate Meta’s ToS.

Following this, we saw X (formerly Twitter) settle its lawsuit against Bright Data. While the terms are confidential, one can make an educated guess that X saw the outcome of the Meta case and decided it wasn't worth pursuing. The courts are favoring innovation.

Takeaway: Not everything is fair game

Just because the courts have been ruling in favor of scraping public data doesn’t mean it’s all fair game. What you do with the data still matters a lot, and what type of public data matters too. We're seeing courts look more closely at data usage, especially when it involves pirated or illegally obtained content, which leads us to our next topic.

Copyright

This is probably the area where we're seeing the most case law and the most litigation, especially with the rise of generative AI.

Innovation: Fueling the next generation of AI

The innovation driven by vast datasets is transformative. Companies are looking to:

  • Obtain diverse data to inform business decisions.

  • Create robust LLMs to build highly effective generative AI.

  • Fine-tune models to fit specific business needs.

  • Build intelligent tools for analytics, social listening, and monitoring.

All of this relies on access to data, much of which is copyrighted.

Regulation: Fair use, piracy, and transformative work

Several recent and ongoing cases are shaping the rules around copyrighted data:

  1. The Anthropic case: In a key ruling, a court determined it is not a copyright violation to train an LLM with legally obtained works. Anthropic had paid for books to use in its training data, and the court found this to be fair use. However, the court also found it was likely a violation to train with pirated scraped works. Anthropic had also scraped websites that hosted stolen books. This distinction is critical.

  2. Thomson Reuters v. Ross Intelligence: This case explored the concept of "transformative use." The court said Ross Intelligence's use of scraped data was not transformative because it was used to create a directly competitive product. This is classic copyright infringement—you can't just copy-paste to build a competing service.

  3. Andersen v. Stability AI: In this and several other ongoing cases involving generative AI systems, the similarity of the AI's output to the original copyrighted works is a key issue. The closer the output is to the input, the weaker the fair use argument becomes.

Takeaways: How to treat copyrighted data

Do

  • Ensure the data is lawfully obtained. It should be public data from a reputable, legally compliant website.

  • Materially transform the data. Create something new, like analytics or insights, rather than just reproducing the original work.

Don’t

  • Don’t scrape pirated or ill-gotten content. If a website obtained content illegally, don't scrape it.

  • Don’t use the data to build a directly competitive product, and don’t simply copy it verbatim and repost it.
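
One way to put the first "Do" into practice is to gate a crawl behind a list of sources that have already been reviewed. The following is a minimal sketch in Python, not a complete compliance control: the `VETTED_DOMAINS` set, the example domains, and the `is_vetted` and `filter_urls` helpers are all hypothetical names introduced here for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: domains that a legal or compliance review has approved.
VETTED_DOMAINS = frozenset({"example.com", "news.example.org"})


def is_vetted(url: str, vetted: frozenset = VETTED_DOMAINS) -> bool:
    """Return True if the URL's host is a vetted domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in vetted)


def filter_urls(urls: list) -> list:
    """Keep only URLs whose source has been reviewed; drop everything else."""
    return [u for u in urls if is_vetted(u)]
```

An allowlist is a deliberate choice here: unknown sources are skipped by default, so a site hosting pirated content is never crawled just because nobody thought to block it.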

Personal data

Scraping personal data is always a hot topic, and while there haven't been massive legal shifts recently, the existing rules are more important than ever, especially with the integration of data into AI.

Innovation: Creating personalized and powerful datasets

The goals here are clear: obtaining vast and diverse data to build out various types of datasets, creating robust LLMs, fine-tuning models, and creating tools for brand monitoring and social listening. Personal data can be a component of this, but it requires extreme care.

Regulation: The US vs. EU divide

There is a huge distinction between how the US and the EU treat personal data.

  • United States: In the US, public personal data is typically okay to scrape. If data is "manifestly made public," then no consent or other type of legitimate interest is generally required.

  • European Union: In the EU, under GDPR, there is no exception for public personal data. You must have a lawful basis, such as legitimate interest or consent, even for data that is publicly accessible. This applies even if you are in the US but are scraping the personal data of people in the EU.

When incorporating data into AI, it's crucial to ensure you are not violating prohibited uses under new regulations like the EU AI Act, which restricts applications like facial recognition and automated decision-making for employment, housing, or loans.

Takeaways: When is public personal data okay?

The rules differ significantly by jurisdiction. In the EU, even with public data, you must consider:

  1. Data retention: How long do you keep the data?

  2. Anonymization: Can you anonymize the data to remove personal identifiers?

  3. Minimization: Are you only taking the data you absolutely need?

  4. Notices: Do you need to provide notice to data subjects?

  5. Opt-outs: Is there a mechanism for individuals to opt out?
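
Several of these considerations can be built into a pipeline's ingestion step. The sketch below, in Python, is an illustration under stated assumptions rather than a compliance recipe: the field names, the 30-day retention window, the `OPTED_OUT` set, the `SECRET_KEY` value, and the `prepare_record` helper are all hypothetical. Note that replacing an identifier with a keyed hash is pseudonymization, which GDPR still treats as personal data; it is not full anonymization.

```python
import hashlib
import hmac
from datetime import datetime, timedelta
from typing import Optional

# All names and values below are hypothetical, chosen for illustration only.
ALLOWED_FIELDS = {"job_title", "city", "collected_at"}  # minimization
RETENTION = timedelta(days=30)                          # retention window
OPTED_OUT = {"user-123"}                                # opt-out registry
SECRET_KEY = b"replace-with-a-real-secret"              # key for pseudonyms


def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()


def prepare_record(record: dict, now: datetime) -> Optional[dict]:
    """Apply opt-out, retention, minimization, and pseudonymization in turn.

    Returns None when the record must not be kept.
    """
    if record["user_id"] in OPTED_OUT:
        return None  # honor opt-outs
    if now - record["collected_at"] > RETENTION:
        return None  # past the retention window
    # Keep only the fields the use case needs.
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    kept["subject"] = pseudonymize(record["user_id"])
    return kept
```

Whether notices are required, and what retention period is appropriate, are legal questions that code cannot answer; the sketch only shows where such decisions could be enforced once they are made.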

Be cautious about the usage of personal data when building an LLM, ensure it's obtained compliantly, and design use cases that do not run afoul of the AI Act or other regulations.

Key takeaways for the road ahead

The legal changes this year have been overwhelmingly positive for the web scraping community. The courts are increasingly ruling that scraping public web data is acceptable and are even recognizing fair use in the context of training AI.

However, this freedom comes with responsibility. Here are the most important principles to guide your data scraping activities:

  • Ensure data comes from reputable, legally compliant websites.

  • Avoid scraping websites with pirated or illegal content. The potential damages are enormous, as seen in the Anthropic case.

  • Do not build directly competitive products unless the data is materially transformed. Add your own analysis and intelligence.

  • Ensure you handle personal data according to jurisdictional requirements, paying close attention to the stringent rules of the EU if you collect data on people located there.

  • Do not use scraped data for AI products prohibited under emerging regulations like the EU AI Act.

The more the web scraping industry unites around ethical standards, the more we can influence regulators to continue making positive decisions that favor innovation. The law is finally catching up, and for those who proceed ethically, the future of data scraping looks bright.
