
What AI Builders Need to Know About the Training Data Copyright Debate

Read Time
6 min
Posted on
June 9, 2025
By
Sanaea Daruwalla

The generative AI gold rush is upon us, with astounding new products and capabilities emerging that are fueled by web data.


Yet this wave of innovation also casts long shadows of legal uncertainty. With the legal foundation of data gathering for AI training being tested right now, many businesses are waiting to learn the future legal basis of the practice.


But data users don’t have to wait. Smart AI builders can prepare for the future by carrying out risk assessments, creating a data ingestion audit trail, heeding creator preferences and seeking expert advice.

Scraping and the law: The story so far


When it comes to conventional web scraping, the rules of engagement, while not always simple, have become relatively well delineated. At Zyte, we have driven awareness of this over many years.


There is no law that says someone either can or cannot scrape. Rather, the law is concerned with how you scrape data and what you do with it.


  1. How: Obtaining data is typically permissible when it's public. However, scraping behind a login, or after agreeing to terms of service that prohibit scraping, may violate laws like the US Computer Fraud and Abuse Act (CFAA) or constitute a breach of contract. The "how" question also takes rate limiting and polite scraping into consideration.

  2. What: The intention for that data matters. For example, when the data collected is copyrighted, "fair use" provisions in some countries can allow its re-use for certain purposes.
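The "polite scraping" side of the "how" question can be sketched in code. The snippet below is a minimal illustration, not a production crawler: it checks a site's robots.txt rules and spaces out requests. The class name, user agent, and delay value are all illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Illustrative sketch: respect robots.txt rules and enforce a minimum
# delay between requests. Names and values here are assumptions.
class PoliteFetcher:
    def __init__(self, robots_txt: str, user_agent: str = "example-bot", delay: float = 1.0):
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        self.delay = delay
        self._last_request = 0.0

    def allowed(self, url: str) -> bool:
        # Check the URL against the site's robots.txt rules.
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self) -> None:
        # Sleep so consecutive requests are at least `delay` seconds apart.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_request = time.monotonic()

robots = "User-agent: *\nDisallow: /private/"
fetcher = PoliteFetcher(robots)
print(fetcher.allowed("https://example.com/public/page"))   # True
print(fetcher.allowed("https://example.com/private/page"))  # False
```

Real crawlers layer on much more (crawl-delay directives, backoff on errors, per-domain budgets), but the principle is the same: how you collect data is part of the legal analysis.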


In the U.S., whether a re-use constitutes "fair use" is weighed using four factors:


  1. The purpose and character of the use.

  2. The nature of the copyrighted work.

  3. The amount of the original work used.

  4. The effect on a copyright owner's market.


However, the legal basis afforded to generative AI, which is trained on data created by millions of people, is currently being thrashed out in two arenas:


  • A raft of lawsuits brought by copyright owners alleging infringement.

  • Regulatory reviews, leading to potential legislative amendments.

The litigation landscape: Cases to watch


Since ChatGPT launched in November 2022, dozens of copyright lawsuits have been filed. These cases are pivotal and will help define the future relationship between AI and copyright.


We are tracking 30 to 40 different cases, several of which are discussed below.



Several battlegrounds are emerging.


Fair use and transformative works


In the realms of text and code, lawsuits such as The New York Times v. OpenAI and Microsoft are scrutinizing instances where LLM outputs bear a striking resemblance to, or directly reproduce, copyrighted training data.


While a work may have been copied on the input end, if the output has a very different character, it may be deemed "transformative", a finding that weighs in favor of "fair use" protection.


The New York Times has attempted to demonstrate that prompting ChatGPT can cause it to output material almost identical to the paper's articles.


Similarly, Getty Images has attempted to show that its own watermarks appear in some outputs, while Stability AI argues its image outputs are substantially different from the originals on which its models were trained.


Harming creators' commercial market


A judge may look negatively upon re-uses of copyrighted material that also take away creators' market for their work.


This is a key component in Authors v. Meta, where authors including Junot Díaz and Sarah Silverman claim that the scraping of their books from platforms like LibGen to train LLaMA models hurts their ability to commercialize their work. The authors bear the burden of proving this commercial harm, and it is yet to be determined how they will do so.


These diverse lawsuits underscore the breadth of the challenge.


In some cases, AI services claim training is merely the equivalent of human learning, a claim that will be tested by the sheer scale of that training and the resulting output.

Strategic imperatives for gen-AI builders


In the meantime, companies building AI businesses or features on scraped training data cannot afford to sit on their hands.


So, what can innovative businesses do now, amidst this uncertainty? A proactive, risk-aware strategy is essential:


1. Acknowledge and continuously assess risk


The first step is a candid internal acknowledgment of the legal risks associated with your current and planned data sourcing practices for LLM training. Building a commercially viable generative AI product more than likely requires scraped web data; developers need to be comfortable with that fact.


2. Start an audit trail for data inputs


Robust data governance is no longer a “nice-to-have.” Before the laws get codified, start documenting the provenance of all training data. Where did it come from? How was it sourced? While this won’t cure an underlying infringement, it’s crucial for due diligence, responding to inquiries, and potentially for negotiating licenses.
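The provenance questions above can be made concrete with a minimal record per ingested document. This is a hedged sketch, not a prescribed schema: the field names and helper function are illustrative assumptions, and real pipelines would persist these records to durable storage.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical provenance record for one scraped document. Field names
# are illustrative; the point is to capture where the data came from,
# when, and how it was sourced.
def provenance_record(url: str, content: bytes, method: str, license_note: str) -> dict:
    return {
        "source_url": url,
        # Hash ties the record to the exact bytes that were ingested.
        "sha256": hashlib.sha256(content).hexdigest(),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "collection_method": method,      # e.g. "public crawl", "licensed API"
        "license_note": license_note,     # any known terms or creator signals
    }

record = provenance_record(
    "https://example.com/article",
    b"<html>example page</html>",
    "public crawl",
    "no explicit no-AI signal found",
)
print(json.dumps(record, indent=2))
```

Even a simple log like this makes it far easier to answer "where did this training example come from?" during due diligence or license negotiations.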


3. Respect emerging norms and explicit creator signals


Cultivate an organizational culture that respects the intent of creators, even while the legal lines are still being drawn. For example, a growing band of creators is beginning to inject "no-AI" flags into HTML and image metadata, while Creative Commons is being pushed to adopt a "noAI" license. Consider taking these no-AI markers into account when building out your training datasets, particularly when they are made explicit and upfront by the website or creator.
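Checking for such signals can be automated during ingestion. The sketch below looks for "noai"-style directives in a page's robots meta tag; note these directives are an emerging convention rather than a formal standard, and the exact tag and directive names checked here are assumptions.

```python
from html.parser import HTMLParser

# Sketch: scan a page's <meta name="robots"> tag for "no-AI" directives.
# The "noai"/"noimageai" values are an emerging convention, not a standard.
class NoAIMetaChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.no_ai = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            directives = attrs.get("content", "").lower()
            if "noai" in directives or "noimageai" in directives:
                self.no_ai = True

def has_no_ai_signal(html: str) -> bool:
    checker = NoAIMetaChecker()
    checker.feed(html)
    return checker.no_ai

page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(has_no_ai_signal(page))  # True
```

A fuller implementation might also inspect image EXIF/XMP metadata and any machine-readable license terms a site publishes, then record the result in the provenance trail.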


The precise legal enforceability of these signals is still being debated, at least outside Europe. Even so, ignoring explicit creator preferences is an ethical failing and is likely to be viewed unfavorably by courts and regulators.


4. Stay agile, stay informed, seek expert counsel


The legal and regulatory landscape for AI is dynamic. Invest in staying abreast of developments through industry associations, legal updates, and expert consultations. At Zyte, I regularly talk with customers navigating these same issues. Don’t hesitate to seek specialized legal advice tailored to your specific use cases and risk profile.

Building the future, responsibly


The path forward for LLM development is undeniably complex, fraught with legal questions that are only now beginning to be answered.


However, this complexity should not be a deterrent to innovation. Instead, it should serve as a call to action: a call to build responsibly, ethically, and with a clear-eyed understanding of evolving legal and societal expectations.


Such actions are not an impediment to progress; they are the very foundation upon which a trustworthy and sustainable AI-powered future will be built.
