Balancing innovation and regulation in data scraping

For anyone involved in data gathering, the legal landscape can often feel like a waiting game, as protracted legal cases play out before becoming case law.

Recently, however, we have finally started to see exactly that happen.

For web data access, the changes are positive news. Innovators continually have to balance what they do with regulation. But the legal cases have confirmed growing scope for innovation.

Public web data

This is the foundational element for so much of the innovation happening today, but it’s also where the regulatory story begins.

Innovation: Public data fuels creativity

The value of public web data is undeniable. On the innovation side of the scale, the arguments are clear:

Public web data is arguably the largest data set in the world, and its potential is enormous.
Web data can be used for countless business intelligence purposes, driving smarter decisions and creating new opportunities.
AI isn’t going anywhere, and we need good data to train it. Public data is the fuel for this technological revolution.
Fundamentally, we believe that public data should remain public.

Regulation: Logged-out public data capture may be permitted

Historically, the primary legal threat to web scraping came from the Computer Fraud and Abuse Act (CFAA), a US anti-hacking law. This was concerning because violations carried not only civil liability (monetary damages) but also potential criminal penalties.

However, a few years ago, landmark court rulings in cases like hiQ Labs, Inc. v. LinkedIn Corp. and Van Buren v. United States clarified the landscape. Read together, they indicate that if you have lawful access to the data, meaning anyone can go to a public website and see it, you are unlikely to be violating the CFAA.

So, the question then became: “Can it nevertheless be a violation of a site’s Terms of Service (ToS)?” In early 2024, we saw a major ruling in the Meta v. Bright Data case that addresses this question. The court ruled that Bright Data did not violate Meta's ToS.

However, while many headlines declared that all public data scraping is now okay, that's not quite what the case said. The court's decision was specific to the facts: Bright Data was scraping data that was not behind a login and their activity did not violate Meta’s ToS.
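Because the ruling turned on the scraping being logged-out, it can help to make that property checkable in your own tooling. Below is a minimal sketch, assuming a scraper that builds its request headers as a plain dictionary. The helper names and the list of credential headers are our own illustrative assumptions, and passing this check says nothing about whether a given site's terms allow the activity.

```python
# Hypothetical guard: confirm an outgoing request carries no login or
# session credentials. The header list is an assumption, not a legal standard.
CREDENTIAL_HEADERS = {"authorization", "cookie", "proxy-authorization"}


def credential_headers_present(headers):
    """Return the credential-bearing header names found in the request."""
    return sorted(name for name in headers if name.lower() in CREDENTIAL_HEADERS)


def is_logged_out_request(headers):
    """True when the request carries no session or login credentials."""
    return not credential_headers_present(headers)


if __name__ == "__main__":
    anonymous = {"User-Agent": "example-bot/1.0", "Accept": "text/html"}
    with_session = {"User-Agent": "example-bot/1.0", "Cookie": "session=abc"}
    print(is_logged_out_request(anonymous))
    print(credential_headers_present(with_session))
```

A check like this is only a guardrail against accidentally reusing a logged-in session. Whether a particular collection is permitted still depends on the site's terms and the facts of the case.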

Following this, we saw X (formerly Twitter) settle its lawsuit against Bright Data. While the terms are confidential, one can make an educated guess that X saw the outcome of the Meta case and decided it wasn't worth pursuing. The courts are favoring innovation.

Takeaway: Not everything is fair game

Just because the courts have been ruling in favor of scraping public data doesn’t mean it’s all fair game. What you do with the data still matters a lot, and what type of public data matters too. We're seeing courts look more closely at data usage, especially when it involves pirated or illegally obtained content, which leads us to our next topic.

Copyright

This is probably the area where we're seeing the most case law and the most litigation, especially with the rise of generative AI.

Innovation: Fueling the next generation of AI

The innovation driven by vast datasets is transformative. Companies are looking to:

Obtain diverse data to inform business decisions.
Create robust LLMs to build highly effective generative AI.
Fine-tune models to fit specific business needs.
Build intelligent tools for analytics, social listening, and monitoring.

All of this relies on access to data, much of which is copyrighted.

Regulation: Fair use, piracy, and transformative work

Several recent and ongoing cases are shaping the rules around copyrighted data:

The Anthropic case (Bartz v. Anthropic): In a key ruling, a court determined that training an LLM on legally obtained works can be fair use. Anthropic had purchased books to use in its training data, and the court found this use to be fair. However, the court did not extend that protection to pirated works: Anthropic had also downloaded books from websites that hosted stolen copies. This distinction is critical.
Thomson Reuters v. Ross Intelligence: This case explored the concept of "transformative use." The court said Ross Intelligence's use of Thomson Reuters' Westlaw headnotes was not transformative because it was used to create a directly competitive product. This is classic copyright infringement: you can't just copy and paste to build a competing service.
Andersen v. Stability AI: In this and several other ongoing cases involving gen-AI systems, the similarity of the AI's output to the original copyrighted works is a key issue. The closer the output is to the input, the weaker the fair use argument becomes.

Takeaways: How to treat copyrighted data

Do

Ensure the data is lawfully obtained. It should be public data from a reputable, legally compliant website.
Materially transform the data. Create something new, like analytics or insights, rather than just reproducing the original work.

Don’t

Don’t scrape pirated or ill-gotten content. If a website obtained content illegally, don't scrape it.
Do not use the data to build a directly competitive product or simply copy it verbatim and repost it.
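One rough way to spot "copy it verbatim" problems before they ship is to measure how much of your output repeats the source word for word. The sketch below is an illustrative heuristic under our own assumptions (the function names and the five-word window are hypothetical choices). It is not a legal test for fair use or transformation, only a flag for human review.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(source, output, n=5):
    """Fraction of the output's word n-grams that also occur in the source.

    0.0 means no n-word run is shared; 1.0 means every n-word run in the
    output also appears in the source. Outputs shorter than n words score 0.0.
    """
    output_grams = word_ngrams(output, n)
    if not output_grams:
        return 0.0
    shared = output_grams & word_ngrams(source, n)
    return len(shared) / len(output_grams)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    print(verbatim_overlap(source, source))
    print(verbatim_overlap(source, "a summary written in entirely different words than the original"))
```

A high score suggests the output is mostly reproduction and deserves a closer look. A low score does not by itself mean the use is transformative.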

Personal data

Scraping personal data is always a hot topic, and while there haven't been massive legal shifts recently, the existing rules are more important than ever, especially with the integration of data into AI.

Innovation: Creating personalized and powerful datasets

The goals here are clear: obtaining vast and diverse data to build out various types of datasets, creating robust LLMs, fine-tuning models, and creating tools for brand monitoring and social listening. Personal data can be a component of this, but it requires extreme care.

Regulation: The US vs. EU divide

There is a huge distinction between how the US and the EU treat personal data.

United States: In the US, public personal data is typically okay to scrape. If data is "manifestly made public," then no consent or other type of legitimate interest is generally required.
European Union: In the EU, under GDPR, there is no exception for public personal data. You must have a lawful basis, such as legitimate interest or consent, even for data that is publicly accessible. This applies even if you are in the US but are scraping the personal data of people in the EU.

When incorporating data into AI, it's crucial to ensure you are not engaging in uses that newer regulations like the EU AI Act prohibit or tightly restrict, such as certain facial recognition applications and automated decision-making for employment, housing, or loans.

Takeaways: When is public personal data okay?

The rules differ significantly by jurisdiction. In the EU, even with public data, you must consider:

Data retention: How long do you keep the data?
Anonymization: Can you anonymize the data to remove personal identifiers?
Minimization: Are you only taking the data you absolutely need?
Notices: Do you need to provide notice to data subjects?
Opt-outs: Is there a mechanism for individuals to opt out?

Be cautious about the usage of personal data when building an LLM, ensure it's obtained compliantly, and design use cases that do not run afoul of the AI Act or other regulations.
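Several of the EU considerations above (retention, minimization, opt-outs) can be enforced in the pipeline itself. Here is a minimal sketch, assuming records arrive as dictionaries with a `profile_url` and a timezone-aware `collected_at` timestamp. The field names, the 90-day window, and the function names are all hypothetical. Note that the keyed hash used here is pseudonymization rather than anonymization, so the result should still be treated as personal data.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

# Hypothetical policy values; set these from your own legal review.
ALLOWED_FIELDS = {"job_title", "company", "collected_at"}
RETENTION = timedelta(days=90)


def pseudonymize(identifier, secret_key):
    """Replace a direct identifier with a keyed hash (still personal data)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record, secret_key):
    """Keep only the allowed fields plus a pseudonymous subject id."""
    slim = {key: value for key, value in record.items() if key in ALLOWED_FIELDS}
    slim["subject_id"] = pseudonymize(record["profile_url"], secret_key)
    return slim


def apply_policy(records, secret_key, opted_out_ids, now=None):
    """Minimize each record, then drop opted-out and expired ones."""
    now = now or datetime.now(timezone.utc)
    kept = []
    for record in records:
        slim = minimize(record, secret_key)
        if slim["subject_id"] in opted_out_ids:
            continue  # the individual asked to be excluded
        if now - slim["collected_at"] > RETENTION:
            continue  # past the retention window
        kept.append(slim)
    return kept
```

Running `apply_policy` on every load, not just at collection time, means records age out and new opt-outs take effect without a separate cleanup job. Notices to data subjects are a process question that code like this does not address.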

Key takeaways for the road ahead

The legal changes this year have been overwhelmingly positive for the web scraping community. The courts are increasingly ruling that scraping public web data is acceptable and are even recognizing fair use in the context of training AI.

However, this freedom comes with responsibility. Here are the most important principles to guide your data scraping activities:

Ensure data comes from reputable, legally compliant websites.
Avoid scraping websites with pirated or illegal content. The potential damages are enormous, as seen in the Anthropic case.
Do not build directly competitive products unless the data is materially transformed. Add your own analysis and intelligence.
Ensure you handle personal data according to jurisdictional requirements, paying close attention to the stringent rules of the EU if you collect data on people located there.
Do not use scraped data for AI products prohibited under emerging regulations like the EU AI Act.

The more the web scraping industry unites around ethical standards, the more we can influence regulators to continue making positive decisions that favor innovation. The law is finally catching up, and for those who proceed ethically, the future of data scraping looks bright.