
What AI Builders Need to Know About the Training Data Copyright Debate

Read Time
6 min
Posted on
June 9, 2025
By
Sanaea Daruwalla

The generative AI gold rush is upon us, with astounding new products and capabilities emerging that are fueled by web data.


Yet this wave of innovation also casts long shadows of legal uncertainty. With the legal foundation of data gathering for AI training being tested right now, many businesses are waiting to learn the future legal basis of the practice.


But data users don’t have to wait. Smart AI builders can prepare for the future by carrying out risk assessments, creating a data ingestion audit trail, heeding creator preferences and seeking expert advice.

Scraping and the law: The story so far


When it comes to conventional web scraping, the rules of engagement, while not always simple, have become relatively well delineated. At Zyte, we have driven awareness of this over many years.


There is no law that says someone either can or cannot scrape. Rather, the law is concerned with how you scrape data and what you do with it.


  1. How: Obtaining data is typically permissible when it's public. However, scraping behind a login, or after agreeing to terms of service that prohibit scraping, may violate laws like the US Computer Fraud and Abuse Act (CFAA) or constitute a breach of contract. The "how" question also takes rate limiting and polite scraping into consideration.

  2. What: The intention for that data matters. For example, when the data collected is copyrighted, "fair use" provisions in some countries can allow its re-use for certain purposes.
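The "polite scraping" side of the "how" question can be sketched in code. The snippet below is a minimal illustration, not a production crawler: it checks a site's robots.txt rules and spaces out requests. The class name, user agent, and delay value are all illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Illustrative sketch: respect robots.txt rules and enforce a minimum
# delay between requests. Names and values here are assumptions.
class PoliteFetcher:
    def __init__(self, robots_txt: str, user_agent: str = "example-bot", delay: float = 1.0):
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        self.delay = delay
        self._last_request = 0.0

    def allowed(self, url: str) -> bool:
        # Check the URL against the site's robots.txt rules.
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self) -> None:
        # Sleep so consecutive requests are at least `delay` seconds apart.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_request = time.monotonic()

robots = "User-agent: *\nDisallow: /private/"
fetcher = PoliteFetcher(robots)
print(fetcher.allowed("https://example.com/public/page"))   # True
print(fetcher.allowed("https://example.com/private/page"))  # False
```

Real crawlers layer on much more (crawl-delay directives, backoff on errors, per-domain budgets), but the principle is the same: how you collect data is part of the legal analysis.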


In the U.S., whether a re-use constitutes "fair use" is weighed using four factors:


  1. The purpose and character of the use.

  2. The nature of the copyrighted work.

  3. The amount of the original work used.

  4. The effect on a copyright owner's market.


However, the legal basis afforded to generative AI, which is trained on data created by millions of people, is currently being thrashed out in two arenas:


  • A raft of lawsuits brought by copyright owners alleging infringement.

  • Regulatory reviews, leading to potential legislative amendments.

The litigation landscape: Cases to watch


Since ChatGPT launched in November 2022, dozens of copyright lawsuits have been filed. These cases are pivotal and will help define the future relationship between AI and copyright.


We are tracking 30 to 40 different cases, several of which are discussed below.



Several battlegrounds are emerging.


Fair use and transformative works


In the realms of text and code, lawsuits such as The New York Times v. OpenAI and Microsoft are scrutinizing instances where LLM outputs bear a striking resemblance to, or directly reproduce, copyrighted training data.


While a work may have been copied on the input end, if the output has a very different character, it may be deemed "transformative", a finding that weighs in favor of "fair use" protection.


The New York Times has attempted to demonstrate that prompting ChatGPT can cause it to output material almost identical to the paper's articles.


Similarly, Getty Images has attempted to show that its own watermarks appear in some outputs, while Stability AI argues its image outputs are substantially different from the originals on which its models were trained.


Harming creators' commercial market


A judge may look negatively upon re-uses of copyrighted material that also take away creators' market for their work.


This is a key component in Authors v. Meta, where authors including Junot Díaz and Sarah Silverman claim that the scraping of their books from platforms like LibGen to train LLaMA models hurts their ability to commercialize their work. The authors bear the burden of proving this commercial harm, and it is yet to be determined how they will do so.


These diverse lawsuits underscore the breadth of the challenge.


In some cases, AI services claim training is merely the equivalent of human learning, a claim that will be tested by the sheer scale of that training and the resulting output.

Strategic imperatives for gen-AI builders


In the meantime, companies building AI businesses or features on scraped training data cannot afford to sit on their hands.


So, what can innovative businesses do now, amidst this uncertainty? A proactive, risk-aware strategy is essential:


1. Acknowledge and continuously assess risk


The first step is a candid internal acknowledgment of the legal risks associated with your current and planned data sourcing practices for LLM training. Building a commercially viable generative AI product more than likely requires scraped web data; developers need to be comfortable with that fact.


2. Start an audit trail for data inputs


Robust data governance is no longer a “nice-to-have.” Before the laws get codified, start documenting the provenance of all training data. Where did it come from? How was it sourced? While this won’t cure an underlying infringement, it’s crucial for due diligence, responding to inquiries, and potentially for negotiating licenses.
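The provenance questions above can be made concrete with a minimal record per ingested document. This is a hedged sketch, not a prescribed schema: the field names and helper function are illustrative assumptions, and real pipelines would persist these records to durable storage.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical provenance record for one scraped document. Field names
# are illustrative; the point is to capture where the data came from,
# when, and how it was sourced.
def provenance_record(url: str, content: bytes, method: str, license_note: str) -> dict:
    return {
        "source_url": url,
        # Hash ties the record to the exact bytes that were ingested.
        "sha256": hashlib.sha256(content).hexdigest(),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "collection_method": method,      # e.g. "public crawl", "licensed API"
        "license_note": license_note,     # any known terms or creator signals
    }

record = provenance_record(
    "https://example.com/article",
    b"<html>example page</html>",
    "public crawl",
    "no explicit no-AI signal found",
)
print(json.dumps(record, indent=2))
```

Even a simple log like this makes it far easier to answer "where did this training example come from?" during due diligence or license negotiations.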


3. Respect emerging norms and explicit creator signals


Cultivate an organizational culture that respects the intent of creators, even while the legal lines are still being drawn. For example, a growing band of creators is beginning to inject "no-AI" flags into HTML and image metadata, while Creative Commons is being pushed to adopt a "noAI" license. Consider taking these no-AI markers into account when building out your training datasets, particularly when they are made explicit and upfront by the website or creator.
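Checking for such signals can be automated during ingestion. The sketch below looks for "noai"-style directives in a page's robots meta tag; note these directives are an emerging convention rather than a formal standard, and the exact tag and directive names checked here are assumptions.

```python
from html.parser import HTMLParser

# Sketch: scan a page's <meta name="robots"> tag for "no-AI" directives.
# The "noai"/"noimageai" values are an emerging convention, not a standard.
class NoAIMetaChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.no_ai = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            directives = attrs.get("content", "").lower()
            if "noai" in directives or "noimageai" in directives:
                self.no_ai = True

def has_no_ai_signal(html: str) -> bool:
    checker = NoAIMetaChecker()
    checker.feed(html)
    return checker.no_ai

page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(has_no_ai_signal(page))  # True
```

A fuller implementation might also inspect image EXIF/XMP metadata and any machine-readable license terms a site publishes, then record the result in the provenance trail.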


The precise legal enforceability of these signals is still being debated, at least outside Europe. Even so, ignoring explicit creator preferences is an ethical failing and is likely to be viewed unfavorably by courts and regulators.


4. Stay agile, stay informed, seek expert counsel


The legal and regulatory landscape for AI is dynamic. Invest in staying abreast of developments through industry associations, legal updates, and expert consultations. At Zyte, I regularly talk with customers navigating these same issues. Don’t hesitate to seek specialized legal advice tailored to your specific use cases and risk profile.

Building the future, responsibly


The path forward for LLM development is undeniably complex, fraught with legal questions that are only now beginning to be answered.


However, this complexity should not be a deterrent to innovation. Instead, it should serve as a call to action: a call to build responsibly, ethically, and with a clear-eyed understanding of evolving legal and societal expectations.


Such actions are not an impediment to progress; they are the very foundation upon which a trustworthy and sustainable AI-powered future will be built.
