In 2026, major new regulations are taking effect across multiple jurisdictions. Organizations using web data for AI will need to adapt to mandatory transparency, copyright respect, and provenance documentation.
2026 marks a turning point. California's Assembly Bill 2013 took effect January 1, 2026, while the EU AI Act's core obligations take effect August 2, 2026. These are binding legal requirements with enforcement mechanisms and significant penalties.
For organizations developing AI systems, compliance infrastructure is no longer optional. By mid-2026, operating without documented data provenance and compliance systems will create material legal risk.
Enterprises will not adopt AI systems without evidence of lawful data sourcing, and regulators will enforce actively. From 2026 onward, organizations that build compliance into their operations will hold a competitive advantage.
California AB 2013 mandates specific disclosures for generative AI data. Developers of publicly available generative AI systems must publish detailed documentation, including their data sources, dataset size, data types, whether data includes copyrighted material, whether datasets were purchased or licensed, whether personal information is included, and data processing methods used.
The EU AI Act imposes transparency and other obligations on AI service operators, based on risk to users’ health, safety, and fundamental rights. All general-purpose AI model providers must publish "sufficiently detailed summaries" of training datasets and respect copyright holders' opt-outs. Providers cannot use copyrighted content if the rights holder has indicated non-consent. Penalties reach €35 million or 7% of global annual turnover.
Copyright litigation clarifies the boundaries. In the US, the 2025 Bartz v. Anthropic ruling established that training on legally obtained works is defensible, while training on pirated content is not. In Kadrey v. Meta, the court emphasized market harm as a decisive factor in whether training amounts to copyright infringement. However, fair use defenses will continue to be aggressively litigated, and organizations cannot assume blanket protection.
Regulators enforce actively, but make room for training with personal data. The French Commission Nationale de l'Informatique et des Libertés (CNIL) fined Kaspr, a B2B lead provider, €200,000 in 2025 for data scraping violations. This is not an isolated incident: regulators across jurisdictions are increasing enforcement activity. Organizations operating in Europe, California, or globally face real enforcement risk if they cannot demonstrate compliance.
However, CNIL also made clear that training AI models on personal data sourced from public content can be lawful under the GDPR’s legitimate interest basis, provided certain conditions are met. So there is a path forward for data scrapers and AI services to obtain public personal data lawfully under the GDPR.
Enterprise buyers demand provenance. Large organizations increasingly require evidence of lawful data sourcing before adopting AI systems. This is a procurement requirement. Enterprises face their own regulatory exposure and will not accept suppliers without documented compliance. This will create a new market signal: provenance is a competitive requirement, not a legal nicety.
Personal data handling remains strictly regulated. Despite some regulatory relaxation proposals, personal data handling remains tightly controlled in 2026. Identification, profiling, and biometric data trigger strict compliance obligations. Personal data handling will become a visible differentiator and enforcement priority. As noted above, a proper legitimate interest assessment (LIA), and in some cases a Data Protection Impact Assessment (DPIA), will satisfy the compliance burden for public personal data.
Compliance becomes a core operational requirement. Because it must be embedded across product, engineering, and data workflows, it adds operational complexity and overhead. For many organizations, this will make partnering with specialized data providers more attractive than building and maintaining compliance infrastructure in-house.
Provenance tracking is a new foundation. Investors, auditors, and enterprise customers will demand evidence of lawful sourcing. By 2026, organizations without provenance tracking will face friction in capital raising and partnerships.
Global divergence forces standardization on strictest rules. Organizations operating globally must comply with the strictest overlapping standards. Building to EU AI Act standards will satisfy most jurisdictions. However, litigation is centered in the US, so tracking and following the results of the US litigation is critical as well.
Licensing markets accelerate. As legal risks of scraping increase, some organizations will increasingly seek formal data access agreements. By 2026, more standardized licensing frameworks and revenue-share models will emerge. The cost could be higher than scraping, but legal and operational risks are lower. This will bring about concerns around open access to public data, and compliant web scraping will emerge as a very important tool to keep open access alive.
Conduct a comprehensive audit of your training data. Identify the source and legality of all data. If you cannot document lawful sourcing, stop using it. Pirated or stolen data is indefensible.
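An audit like this can be sketched as a simple triage pass over a dataset manifest. The field names and the set of acceptable legal bases below are illustrative assumptions, not a standard schema:

```python
# Illustrative triage over a dataset manifest: entries without a documented,
# lawful source are quarantined rather than used. Schema is an assumption.
ALLOWED_BASES = {"licensed", "purchased", "public-lawful"}

def triage(manifest: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split entries into (keep, quarantine) based on documented sourcing."""
    keep, quarantine = [], []
    for entry in manifest:
        if entry.get("legal_basis") in ALLOWED_BASES and entry.get("source"):
            keep.append(entry)
        else:
            # Undocumented, pirated, or sourceless data: stop using it.
            quarantine.append(entry)
    return keep, quarantine

manifest = [
    {"source": "https://example.com/a", "legal_basis": "licensed"},
    {"source": "unknown-torrent", "legal_basis": "pirated"},
]
keep, quarantine = triage(manifest)
```

Anything that lands in the quarantine list should be excluded from training until lawful sourcing can be documented.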
Implement provenance tracking systems. Build infrastructure to document the origin, legality, and usage of all data. This must be auditable and transparent.
Respect copyright signals and opt-outs. Monitor for machine-readable signals indicating "do not use for AI training." Implement systems to honor these signals where possible.
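One such signal is the "noai" / "noimageai" directive that some platforms place in a page's robots meta tag. These directives are not yet standardized, so the sketch below should be treated as one input among several (robots.txt, TDM reservation protocols, licensing terms), not a complete check:

```python
# Hedged sketch: scan a page's <meta name="robots"> content for the
# non-standardized "noai" / "noimageai" opt-out directives.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

def allows_ai_training(html: str) -> bool:
    """Return False if the page carries a 'noai'/'noimageai' opt-out."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noai", "noimageai"} & parser.directives)

page = '<html><head><meta name="robots" content="index, noai"></head></html>'
opted_in = allows_ai_training(page)  # False: this page opts out
```

A production system would also honor robots.txt and any machine-readable rights-reservation signals adopted under the EU AI Act's copyright provisions.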
Ensure lawful basis when accessing personal data. Identify all personal data in your datasets and conduct any required compliance analyses, such as an LIA or DPIA. Additionally, implement required data subject access, security, and data minimization protocols.
Design your systems for auditability. Build your data infrastructure to support disclosure requirements. Maintain detailed records of sources and usage. Implement governance systems that can demonstrate compliance to regulators and customers.