The generative AI gold rush is upon us, with astounding new products and capabilities emerging that are fueled by web data.
Yet this wave of innovation also casts long shadows of legal uncertainty. With the legality of gathering web data for AI training being tested right now, many businesses are waiting to learn what legal basis the practice will ultimately rest on.
But data users don’t have to wait. Smart AI builders can prepare for the future by carrying out risk assessments, creating a data ingestion audit trail, heeding creator preferences and seeking expert advice.

Scraping and the law: The story so far
When it comes to conventional web scraping, the rules of engagement, while not always simple, have become relatively well delineated. At Zyte, we have driven awareness of this over many years.
There is no law that says someone either can or cannot scrape. Rather, the law is concerned with how you scrape data and what you do with it.
How: Obtaining data is typically permissible when it's public. However, scraping behind a login, or after agreeing to terms of service that prohibit scraping, may violate laws like the US Computer Fraud and Abuse Act (CFAA) or constitute a breach of contract. The "how" question also takes rate limiting and polite scraping into account (a sketch of what this looks like in practice follows the fair-use factors below).
What: What you intend to do with the data matters. For example, when the collected data is copyrighted, "fair use" provisions in some countries can allow its re-use for certain purposes.
In the U.S., whether a re-use constitutes "fair use" is weighed using four factors:
The purpose and character of the use.
The nature of the copyrighted work.
The amount and substantiality of the original work used.
The effect on a copyright owner's market.
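To make the "how" concrete, here is a minimal sketch of polite scraping using only Python's standard library: it honors robots.txt, identifies itself with a User-Agent, and rate-limits requests with a fixed delay. The URL, user-agent string, and delay are illustrative assumptions, not a recommended production setup.

```python
import time
import urllib.robotparser
from urllib.request import Request, urlopen

USER_AGENT = "example-research-bot/1.0"  # identify your crawler honestly
CRAWL_DELAY_SECONDS = 5.0                # illustrative; tune per target site


def polite_fetch(url: str, robots: urllib.robotparser.RobotFileParser) -> bytes | None:
    """Fetch a public page only if robots.txt allows it, then pause."""
    if not robots.can_fetch(USER_AGENT, url):
        return None  # respect the site's stated crawl rules
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        body = response.read()
    time.sleep(CRAWL_DELAY_SECONDS)  # rate limit: never hammer the server
    return body


robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()  # download and parse the site's crawl rules
page = polite_fetch("https://example.com/articles/1", robots)
```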
However, the legal basis afforded to generative AI, which is trained on data created by millions of people, is currently being thrashed out in two arenas:
A raft of lawsuits brought by copyright owners alleging infringement.
Regulatory reviews, leading to potential legislative amendments.

The litigation landscape: Cases to watch
Since ChatGPT launched in November 2022, dozens of copyright lawsuits have been filed. These pivotal cases will help define the future relationship between AI and copyright.
We are tracking 30 to 40 different cases, and several battlegrounds are emerging.
Fair use and transformative works
In the realms of text and code, lawsuits such as The New York Times v. OpenAI and Microsoft are scrutinizing instances where LLM outputs bear a striking resemblance to, or directly reproduce, copyrighted training data.
While a work may have been copied on the input side, if the output has a very different character, it may be deemed "transformative", giving it "fair use" protection.
The New York Times, for its part, has sought to demonstrate that prompting ChatGPT can cause it to output material almost identical to the newspaper's articles.
Similarly, Getty Images has attempted to show how its own watermarks appear in some outputs, while Stability AI argues its image outputs are substantially different from the originals on which its models were trained.
Harming creators' commercial market
A judge may look unfavorably on re-uses of copyrighted material that erode the market for creators' work.
This is a key component in Authors vs Meta, where authors including Junot Díaz and Sarah Silverman claim that the scraping of their books from platforms like LibGen to train LLaMA models hurts their ability to commercialize their work. The authors bear the burden of proving this commercial harm, and how they will do so remains to be seen.
These diverse lawsuits underscore the breadth of the challenge.
In some cases, AI services claim training is merely the equivalent of human learning, a claim that will be tested against the sheer scale of that training and the resulting output.

Timeline for legal certainty
If courts decide LLMs are engaged in large-scale copyright abuse, the consequences would be catastrophic, sending the builders of the most innovative technology since the web itself back to the drawing board.
I believe we will see outcomes that balance the needs of both parties: creators and technology companies. I am hopeful that we will get legislative guidance from US lawmakers soon, and that by 2026 we will have more definitive answers from the case law.
Strategic imperatives for gen-AI builders
In the meantime, companies building AI businesses or features on scraped training data cannot afford to sit on their hands and wait.
So, what can innovative businesses do now, amidst this uncertainty? A proactive, risk-aware strategy is essential:
1. Acknowledge and continuously assess risk
The first step is a candid internal acknowledgment of the legal risks associated with your current and planned data sourcing practices for LLM training. Building a commercially viable generative AI product more than likely requires scraped web data; developers need to be comfortable with that fact.
2. Start an audit trail for data inputs
Robust data governance is no longer a “nice-to-have.” Before the laws get codified, start documenting the provenance of all training data. Where did it come from? How was it sourced? While this won’t cure an underlying infringement, it’s crucial for due diligence, responding to inquiries, and potentially for negotiating licenses.
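As a starting point, here is a minimal sketch of what such an audit trail could look like: one append-only JSON Lines record per ingested document, tying a content hash to its source and the terms under which it was collected. The schema, file name, and field names are illustrative assumptions, not an established standard.

```python
import hashlib
import json
from datetime import datetime, timezone

AUDIT_LOG = "data_provenance.jsonl"  # hypothetical append-only log


def record_provenance(content: bytes, source_url: str,
                      method: str, license_notes: str) -> None:
    """Append one provenance record per ingested document."""
    entry = {
        "sha256": hashlib.sha256(content).hexdigest(),  # ties the record to the exact bytes
        "source_url": source_url,                       # where the data came from
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "method": method,                               # e.g. "public crawl" or "licensed API"
        "license_notes": license_notes,                 # terms observed at collection time
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")


record_provenance(
    b"<html>...</html>",
    "https://example.com/articles/1",
    method="public crawl",
    license_notes="no login, no ToS accepted; page publicly accessible",
)
```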
3. Respect emerging norms and explicit creator signals
Cultivate an organizational culture that respects the intent of creators, even while the legal lines are still being drawn. For example, a growing band of creators is beginning to inject "no-AI" flags into HTML and image metadata, while Creative Commons is being pushed to adopt a no-AI license. Consider taking these no-AI markers into account when building out your training datasets (see the sketch below), particularly when they are made explicit and upfront by the website or creator.
The precise legal enforceability of these signals is still being debated, at least outside of Europe. But honoring such explicit preferences is a matter of ethics, and ignoring them is likely to be viewed unfavorably by courts and regulators.
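Because these flags are an emerging convention rather than a settled standard, detecting them is straightforward but the exact tokens vary. The sketch below checks two signals some sites already use, "noai"/"noimageai" values in a robots meta tag and in an X-Robots-Tag response header; the token list is an assumption to extend as norms evolve.

```python
from html.parser import HTMLParser

NO_AI_TOKENS = {"noai", "noimageai"}  # emerging, non-standard directives


class NoAIMetaParser(HTMLParser):
    """Scan <meta name="robots"> tags for no-AI directives."""

    def __init__(self) -> None:
        super().__init__()
        self.opted_out = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            tokens = {t.strip() for t in (attrs.get("content") or "").lower().split(",")}
            if tokens & NO_AI_TOKENS:
                self.opted_out = True


def creator_opted_out(html_text: str, headers: dict[str, str]) -> bool:
    """True if the page signals a no-AI preference via meta tag or header."""
    x_robots = headers.get("X-Robots-Tag", "").lower()
    if any(token in x_robots for token in NO_AI_TOKENS):
        return True
    parser = NoAIMetaParser()
    parser.feed(html_text)
    return parser.opted_out
```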
4. Stay agile, stay informed, seek expert counsel
The legal and regulatory landscape for AI is dynamic. Invest in staying abreast of developments through industry associations, legal updates, and expert consultations. At Zyte, I regularly talk with customers navigating these same issues. Don’t hesitate to seek specialized legal advice tailored to your specific use cases and risk profile.
Building the future, responsibly
The path forward for LLM development is undeniably complex, fraught with legal questions that are only now beginning to be answered.
However, this complexity should not be a deterrent to innovation. Instead, it should serve as a call to action: a call to build responsibly, ethically, and with a clear-eyed understanding of evolving legal and societal expectations.
Such actions are not an impediment to progress; they are the very foundation upon which a trustworthy and sustainable AI-powered future will be built.
