Who Owns the Training Set? The Coming Battles Over AI’s Raw Material

Executive Summary
AI promises transformative productivity and wealth, yet the value of its inputs—the training data that teach models to perceive and generate language, images or music—remains hotly contested. Generative models routinely ingest billions of works: news articles, copyrighted photos, books and songs. Creators argue that this practice “steals from the people who create the content”[1] and undermines livelihoods, while developers counter that the use is fair and transformative[2]. Legal frameworks vary widely: the EU allows text and data mining (TDM) unless rights are reserved[3], the UK proposes a similar opt‑out regime[4], and U.S. courts are only beginning to decide whether training constitutes fair use[5]. Recent litigation—from NYT v. OpenAI & Microsoft to Thomson Reuters v. ROSS and Getty v. Stability AI—has moved the battleground from policy debates to courtrooms. This article maps the landscape, evaluates the cultural and economic stakes for creators, and proposes frameworks to reconcile innovation with respect for human creativity.
I. Introduction: The Hidden Raw Material of AI
Generative AI is often described in terms of dazzling outputs—drafted text, synthetic images, or composed music—but the training sets powering these systems are less visible. These datasets are assembled by crawling the open web and, in some cases, proprietary archives. Their construction raises deep cultural and legal questions: are AI developers simply analyzing works, or are they copying and repurposing expressive content without permission? The LAION‑5B dataset, for example, scraped billions of image–text pairs from the web and became a backbone for models such as Stable Diffusion. A German court later held that creating the dataset fell within Europe’s scientific research TDM exception[6], yet the discovery of child sexual abuse material (CSAM) within it led LAION to remove 2,236 links and reissue a cleaned dataset[7]. Similarly, the Books3 dataset contained almost 200,000 pirated books; a class‑action lawsuit alleges that Apple misrepresented these works as “publicly available” and diluted authors’ markets[8].
Beyond safety, training data influence bias, representativeness and economic power. Cloudflare’s radar analysis reveals that AI crawlers have a vastly different relationship with publishers than search crawlers: Google’s bot crawls roughly 14 pages for every visit it refers back to a publisher, while OpenAI’s GPTBot crawls 1,700 pages per referral and Anthropic’s ClaudeBot 73,000[9]. Such one‑way extraction erodes the traditional exchange in which crawling drives traffic and revenue. In July 2024 Cloudflare responded by offering website owners a single‑click option to block AI scrapers[10], and by July 2025 it had introduced tools to manage robots.txt files automatically and to restrict blocking to monetized sections[11].
This section sets the stage: training data are not a neutral resource. They are cultural artefacts, business assets, and personal information. Understanding who owns them—and who benefits from their use—is the key to assessing AI’s legitimacy.
II. The Legal Fault Lines
United States
In the U.S. there is no explicit statutory exception for AI training. Developers rely on fair‑use jurisprudence, arguing that ingestion of copyrighted works is transformative and necessary to teach models. The U.S. Copyright Office’s Report on Copyright and Artificial Intelligence acknowledges that the legality of unlicensed training is unsettled[12]. Commenters expressed polarised views: some argued that training without permission “undermines entire markets” and destroys incentives for creation[12]; others cautioned that mandatory licensing would stifle innovation and entrench incumbents[2]. In March 2025, Judge Sidney Stein in the Southern District of New York refused to dismiss the New York Times’ copyright claims against OpenAI and Microsoft, noting “many” examples of ChatGPT copying articles and allowing direct and contributory infringement claims to proceed[13]. In May, the same court ordered OpenAI to preserve all ChatGPT user data for discovery—an unprecedented requirement that led OpenAI to appeal on privacy grounds[14].
A separate case, Thomson Reuters v. ROSS Intelligence, addressed AI training in the context of legal research. In February 2025 the U.S. District Court for Delaware granted partial summary judgment for Thomson Reuters, finding that ROSS’s use of Westlaw headnotes to train its AI tool was not transformative and harmed the potential market for Thomson Reuters’s headnotes[15]. The court held that ROSS infringed 2,243 headnotes and rejected its fair‑use defense, signalling that non‑transformative AI uses may face liability[15]. Shortly after, U.S. District Judge Eumi Lee denied Universal Music Group (UMG) and other publishers’ request for a preliminary injunction against Anthropic, ruling that the motion was too broad and that the publishers failed to show irreparable harm[16]. The judge observed that fair use is the “determinative question”[17], leaving final resolution for trial.
European Union
The EU’s 2019 Copyright Directive introduced a text and data mining exception that permits reproductions “for the purposes of text and data mining” unless rights holders have reserved their rights in a machine‑readable manner[3]. This opt‑out mechanism, codified in Article 4, has become the focal point for AI training. However, a 2025 European Parliament study concluded that generative AI training goes beyond the scope of TDM exceptions, which were designed for scientific analysis rather than reproduction of expressive content[18]. The study recommended revising Article 4 to require an opt‑in for commercial AI training and called for mandatory disclosure of training datasets and traceability via watermarking[19].
The LAION case demonstrates the ambiguity of EU exceptions. In October 2024 a German regional court held that creating the LAION‑5B dataset was lawful under the EU’s research TDM exception[6]. Yet the same dataset triggered safety concerns when researchers found CSAM, prompting LAION to work with the Internet Watch Foundation and release “Re‑LAION‑5B” after removing problematic links[7]. This juxtaposition highlights the tension between transparency (public datasets enable audits) and harm (unfiltered data may include illegal content).
United Kingdom
At the time of writing, UK copyright law allows text and data mining only for non‑commercial research. In December 2024 the UK Government launched a consultation proposing an exception for AI training coupled with a rights reservation mechanism similar to the EU opt‑out[20]. The consultation argues that both creators and AI developers suffer from legal uncertainty and states that a new framework must reward creators, ensure lawful access and promote trust[21]. The proposed approach would allow AI developers to train on any material, including for commercial use, unless rights holders have reserved their rights via standardised machine‑readable declarations[4]. The government emphasises transparency—developers would need to disclose training sources and provide summary information upon request[22].
The proposals have faced pushback. The Creative Rights in AI Coalition—a group that includes the Society of Authors, UK Music and the Publishers Association—criticised the rights reservation model, arguing that AI developers should only use copyrighted works with express permission[23]. The coalition welcomes measures to improve transparency but insists that an opt‑out approach would shift the burden onto creators and fail to deter unauthorised use[24]. Parliament’s Culture, Media and Sport Committee echoed these concerns, noting widespread worry among creative industries about unconsented training[25]. Despite this, the government continues to explore technical solutions for machine‑readable opt‑outs and emphasises that any system must enable collective licensing and enforcement[22].
Global Forums
Beyond national regimes, multilateral organisations are shaping principles for AI training. The World Intellectual Property Organization (WIPO) convenes the “Conversation on IP & Frontier Technologies” series, where governments debate whether training constitutes use or analysis and explore infrastructure for rights reservations and compensation. UNESCO has issued recommendations on AI ethics that emphasise respect for human rights and cultural diversity. However, global consensus remains elusive: in many jurisdictions, courts rather than policymakers are making the first determinations.
III. Litigation Heat Map: Courts as the Frontline
The following cases illustrate how courts are addressing AI training disputes. Each case signals how different legal regimes interpret fair use or exceptions and reveals emerging patterns in remedies and procedural orders.
- NYT v. Microsoft & OpenAI (U.S., 2025) – Filed in December 2023, this lawsuit alleges that OpenAI and Microsoft copied millions of New York Times articles to train ChatGPT, harming the publisher by bypassing paywalls and reproducing its stories[26]. In March 2025 Judge Sidney Stein rejected most of the defendants’ dismissal arguments, allowing direct and contributory infringement claims to proceed[26]. In May the court ordered OpenAI to preserve all user logs for discovery[14], raising privacy concerns and signalling that courts may compel disclosure of training data and interactions.
- Thomson Reuters v. ROSS Intelligence (U.S., 2025) – This case involves ROSS’s legal research AI built from Westlaw headnotes. In February 2025 the Delaware district court found ROSS liable for copying 2,243 headnotes and ruled that its training was not transformative, emphasizing that ROSS used the material to build a competing product[15]. The decision signals that AI tools offering non‑transformative functions (e.g., replicating a database) may not benefit from fair‑use defenses.
- Getty Images v. Stability AI (UK, 2025) – Getty claims that Stability AI used millions of its photos and captions to train Stable Diffusion without permission. The case includes copyright, trademark and database right claims. During the trial, Getty dropped its primary input and output copyright claims, leaving its trademark claims and the question of whether the trained model itself is an infringing “article”[27]. Stability argues that training occurred outside the UK and that users, not the company, produce outputs[28]. The outcome could determine whether models themselves can be considered infringing copies.
- UMG/Concord/ABKCO v. Anthropic (U.S., 2025) – Music publishers sued Anthropic for allegedly using lyrics from over 500 songs to train Claude. In March 2025 a California federal judge denied the publishers’ request for a preliminary injunction, ruling that the proposed order was too broad and that the plaintiffs failed to show irreparable harm[16]. The court noted that defining the licensing market for AI training remains unsettled and that fair use will be a central question[17]. The case continues amid broader negotiations; some publishers reportedly reached partial settlements after Anthropic announced a $1.5 billion settlement with authors in a separate class action[29].
- Authors v. Meta/OpenAI (U.S., 2025) – Various authors have filed suits alleging that Meta and OpenAI used pirated books, including “shadow library” datasets like Books3, to train language models. Courts have dismissed some claims but allowed others—especially those alleging removal of copyright management information (DMCA CMI) and unfair competition—to proceed. While not yet generating precedent, these cases illustrate the difficulty of policing training data and the potential liability for using illicit sources[8].
IV. Creators, Journalists and the New Bargaining Map
The litigation spotlight reveals deeper societal tensions. Creators, news organisations and cultural industries fear that AI training will cannibalise their markets. AI developers argue that training is necessary for innovation and emphasise open access to information. This section synthesises stakeholder positions using a Stakeholder Bargaining Map.
Authors and Artists
Many authors and artists see AI training as an existential threat. The Society of Authors (SoA) and the Creative Rights in AI Coalition argue that rights reservation models shift the burden onto creators; instead, they advocate for an opt‑in regime where AI companies must obtain express permission[23]. The SoA has publicly criticised proposals for automatic rights reservations, warning that they require expensive content‑recognition systems that creators cannot implement[30]. Surveys by UK creative unions suggest that a majority of writers believe AI threatens their livelihoods and that any legal framework must ensure compensation and control.
Journalists and Publishers
News publishers have organised under the News/Media Alliance (NMA) to demand compensation, transparency and anti‑monopoly measures. In 2024 the Alliance launched the “Support Responsible AI” campaign, featuring ads like “Keep Watch on AI” and “AI Steals from You Too” to highlight that AI companies scrape publishers’ content without payment[1]. The campaign calls on governments to require licensing deals, force AI companies to disclose training sources, and prevent tech monopolies from dominating the market[31]. Publishers argue that unlicensed AI training undermines their subscription model and that data scraping erodes the advertising‑driven business model that funds journalism.
Musicians and Photographers
Music publishers UMG, Concord and ABKCO view AI training as both a threat and an opportunity. They have sued Anthropic for using lyrics without permission but also seek to negotiate licensing frameworks. Photographers, represented by Getty Images, worry that generative models can output images bearing distorted versions of their watermarks and confuse consumers about provenance. Getty’s case against Stability AI emphasises the investment required to curate a photo library and argues that training on such a database without payment amounts to misappropriation[32]. Conversely, AI companies claim that training constitutes fair dealing or takes place outside the jurisdiction[28].
Advocacy and Civil Society
Civil liberties groups like the Electronic Frontier Foundation (EFF) caution against overly restrictive regimes that could hinder research and access to information. Creative Commons advocates for clear machine‑readable licenses that allow creators to choose permissive or restrictive terms. Communia and Open Future call for open datasets to enable accountability, noting that LAION’s transparency allowed researchers to detect harmful content and push for safety improvements[33]. These groups suggest that rather than banning training, policymakers should mandate transparency and invest in auditing tools to detect misuse.
Stakeholder Bargaining Map
Creators ↔ AI Providers: Creators seek compensation and control; AI providers seek access and legal certainty. The bargaining equilibrium may involve collective licensing schemes, revenue sharing and transparent audit logs.
AI Providers ↔ Intermediaries: Platforms like Cloudflare, search engines and hosting services mediate access to data. Cloudflare’s bot‑blocking tools and robots.txt management illustrate how intermediaries can empower publishers[9]. Conversely, circumvention of such controls (e.g., crawling despite robots.txt) has triggered calls for enforcement under slogans such as “Stop AI Theft.”
Creators ↔ Intermediaries: Creators often rely on intermediaries to enforce rights (e.g., robots.txt or licensing platforms). They advocate for standardised rights reservation mechanisms that intermediaries can implement, while emphasising that enforcement should not fall solely on individuals.
V. The Dataset Dilemma: Provenance, Safety and Governance
Training datasets raise questions beyond copyright. They present challenges of provenance (where did the data come from?), safety (are there illegal or harmful materials?), and governance (can rights holders see and control how their works are used?).
Open Datasets and Transparency
Open datasets like LAION‑5B provide valuable transparency. Because LAION published its dataset, researchers were able to identify CSAM and biased or harmful content[33]. LAION responded by collaborating with the Internet Watch Foundation, temporarily removing the dataset and releasing a cleaned version, Re‑LAION‑5B[7]. The case shows that public datasets, while risky, allow community auditing and improvement. In contrast, proprietary datasets remain opaque; rights holders cannot know whether their works were included unless developers voluntarily disclose.
Shadow Libraries and Pirated Data
The Books3 dataset comprises nearly 200,000 pirated books. A class action alleges that Apple used Books3 to train its language models and misrepresented the works as “publicly available”[8]. Plaintiffs argue this practice dilutes markets for authors’ works and deprives them of control over derivative uses. Similar suits target Meta and OpenAI for using other shadow libraries. These cases underscore that training on illicit data not only raises copyright concerns but also threatens trust and compliance. If companies knowingly use pirated works, they face statutory damages, reputational harm and legislative backlash.
Robots, Crawlers and Consent
Technical governance is emerging as an important layer. Cloudflare found that only about 37% of top websites have a robots.txt file[34], meaning that most sites cannot even signal their preferences to AI crawlers. To remedy this, Cloudflare introduced managed robots.txt services and an option to block AI bots on monetized portions of a site[11]. The company’s analysis shows that AI crawlers seldom reciprocate traffic—OpenAI’s crawl‑to‑refer ratio is 1,700:1 and Anthropic’s 73,000:1[9]—highlighting why publishers view unsanctioned scraping as theft. These metrics provide a basis for a Dataset Provenance & Risk Scorecard.
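Before turning to that scorecard, the short sketch below shows what such a rights‑reservation signal can look like in practice: a robots.txt file that disallows the AI crawlers named above, either site‑wide or only on monetized paths. This is a minimal sketch written for this article, not Cloudflare’s managed service; the user‑agent tokens are those cited in the text, and the path names are placeholders.

```python
# Minimal sketch: assemble a robots.txt that disallows known AI training
# crawlers while leaving ordinary search crawlers untouched. The user-agent
# tokens below are those named in the text; any others should be checked
# against each crawler's own documentation. Serving the file at /robots.txt
# is left to the site's own stack.

AI_CRAWLERS = ["GPTBot", "ClaudeBot"]  # extend as needed


def build_robots_txt(blocked_agents, monetized_paths=None):
    """Return robots.txt content that disallows the given user agents.

    If monetized_paths is given, only those sections are disallowed,
    mirroring the "block only on monetized sections" option described above.
    """
    lines = []
    for agent in blocked_agents:
        lines.append(f"User-agent: {agent}")
        if monetized_paths:
            lines.extend(f"Disallow: {path}" for path in monetized_paths)
        else:
            lines.append("Disallow: /")
        lines.append("")  # blank line separates records
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical site that blocks AI crawlers only on its paid sections.
    print(build_robots_txt(AI_CRAWLERS, monetized_paths=["/articles/", "/archive/"]))
```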
Dataset Provenance & Risk Scorecard
The scorecard below summarizes risks associated with major datasets:
- LAION‑5B / Re‑LAION‑5B – Source: scraped web images; Opt‑out status: supports rights reservation via Spawning’s opt‑out registry; Safety: the initial dataset contained CSAM, and 2,236 links were removed after review[7]; Auditability: high because the dataset is public[33]; Traceability: moderate; no watermarks, but URLs are included.
- Books3 / Shadow Libraries – Source: pirated books from Library Genesis and similar sites; Opt‑out status: none; Safety: includes copyrighted works; Auditability: low because dataset was hosted via torrent; Traceability: low; class actions allege misuse[8].
- Proprietary training sets (e.g., Google, OpenAI) – Source: mixture of licensed data, public web and proprietary corpora; Opt‑out status: unclear; developers propose using robots.txt signals; Safety: unknown; Auditability: low because datasets are confidential; Traceability: low; rights holders cannot confirm inclusion.
These examples illustrate the need for governance that balances transparency with privacy. Public datasets allow scrutiny but may expose harmful material; proprietary datasets protect corporate secrets but raise trust issues.
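One way to make such a scorecard usable beyond prose is to encode each entry as a structured, machine‑readable record that auditors and rights holders can query. The sketch below is illustrative only: the field names and the rating scale are assumptions made for this article, not an established standard.

```python
# Illustrative sketch of a Dataset Provenance & Risk Scorecard entry as a
# structured record. Field names and the rating scale are assumptions made
# for this example, not an established standard.

from dataclasses import dataclass, field
from enum import Enum


class Rating(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"
    UNKNOWN = "unknown"


@dataclass
class DatasetScorecard:
    name: str
    source: str                # e.g. "scraped web images", "licensed corpus"
    opt_out_mechanism: str     # how rights holders can reserve rights, if at all
    safety_review: str         # known safety issues and remediation steps
    auditability: Rating       # can third parties inspect the dataset?
    traceability: Rating       # can works be traced back to their source?
    known_issues: list = field(default_factory=list)


# Example entry drawn from the LAION discussion above.
relaion = DatasetScorecard(
    name="Re-LAION-5B",
    source="scraped web image-text pairs",
    opt_out_mechanism="rights reservation via opt-out registry",
    safety_review="CSAM links removed after third-party review",
    auditability=Rating.HIGH,
    traceability=Rating.MODERATE,
    known_issues=["original release contained 2,236 links later removed"],
)
print(relaion)
```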
VI. Paths Forward: From Conflict to Constructive Frameworks
The controversies outlined above show that AI training sits at the intersection of copyright, privacy, antitrust and cultural policy. To move from ad‑hoc litigation to sustainable governance, stakeholders need workable frameworks. This section proposes several principles.
1. Machine‑Readable Opt‑Outs and Registries
Building on the EU’s opt‑out model, policymakers should establish standardised, machine‑readable reservations that are simple to implement. Developers would be permitted to train on works unless rights holders register an opt‑out. To prevent “gotcha” enforcement, the registry should be publicly accessible, and AI companies must regularly sync their training pipelines to respect updates. This approach avoids placing the entire burden on creators: a central database maintained by a neutral body could handle registrations and maintain technical standards. The UK government’s consultation notes that such a system requires technical and organisational infrastructure[22].
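A rough sketch of how such a registry check might sit inside a dataset‑assembly pipeline appears below. The registry class, its lookup methods and the example domains are hypothetical stand‑ins invented for this article; no central opt‑out registry or API of this kind exists today.

```python
# Hedged sketch of an opt-out registry lookup during dataset assembly. The
# OptOutRegistry class and its methods are hypothetical; a real registry
# would be operated by a neutral body and queried over the network.

from urllib.parse import urlparse


class OptOutRegistry:
    """In-memory stand-in for a central, machine-readable opt-out registry."""

    def __init__(self):
        self._reserved_domains = set()
        self._reserved_urls = set()

    def reserve_domain(self, domain: str) -> None:
        self._reserved_domains.add(domain.lower())

    def reserve_url(self, url: str) -> None:
        self._reserved_urls.add(url)

    def is_reserved(self, url: str) -> bool:
        domain = urlparse(url).netloc.lower()
        return url in self._reserved_urls or domain in self._reserved_domains


def filter_training_candidates(urls, registry):
    """Drop candidate works whose rights holders have registered an opt-out."""
    return [u for u in urls if not registry.is_reserved(u)]


registry = OptOutRegistry()
registry.reserve_domain("example-news-site.com")  # hypothetical opted-out publisher

candidates = [
    "https://example-news-site.com/story-123",
    "https://openlicensed.example.org/essay-7",
]
print(filter_training_candidates(candidates, registry))
# -> ['https://openlicensed.example.org/essay-7']
```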
2. Collective Licensing and Compensation
An opt‑out regime alone does not address compensation. Creators need a mechanism to receive royalties when their works are used for training. Collective management organisations (CMOs) could negotiate licensing terms on behalf of rights holders, similar to how performance rights organisations operate in music. News publishers could negotiate blanket deals that license content for training in exchange for revenue sharing, while authors could license books through existing collecting societies. The News/Media Alliance advocates for licensing frameworks and compensation[31]. Without such schemes, AI companies may continue to rely on fair use arguments and litigation will persist.
3. Transparency and Audit Logs
Courts and policymakers increasingly demand transparency. The U.S. discovery order requiring OpenAI to preserve user logs[14] signals that judges may compel disclosure to assess infringement. Policymakers should require AI developers to maintain audit logs showing what data was used, how it was obtained, and whether rights reservations were honoured. Logs should protect personal data through aggregation but allow independent auditors to verify compliance. The European Parliament study recommends mandatory dataset disclosure and traceability via watermarking[19], while the UK consultation emphasises that transparency is a prerequisite for any rights reservation system[22].
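The sketch below illustrates the kind of per‑work audit‑log record a developer could keep under such a requirement. The field names are assumptions made for this article; no standard schema has been agreed, and a real system would also need aggregation and access controls to protect personal data.

```python
# Illustrative sketch of a per-work audit-log record. Field names are
# assumptions made for this article; no standard schema exists. The record
# stores a content fingerprint rather than the work itself.

import hashlib
import json
from datetime import datetime, timezone


def make_audit_entry(source_url: str, license_basis: str,
                     opt_out_checked: bool, content: bytes) -> dict:
    """Build one audit-log record for a single ingested item."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_url": source_url,
        "license_basis": license_basis,      # e.g. "licensed", "TDM exception"
        "opt_out_checked": opt_out_checked,  # was a rights-reservation lookup performed?
        "content_sha256": hashlib.sha256(content).hexdigest(),
    }


entry = make_audit_entry(
    source_url="https://example.org/article-42",
    license_basis="licensed",
    opt_out_checked=True,
    content=b"...article text...",
)
print(json.dumps(entry, indent=2))
```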
4. Risk‑Based Governance and Safety Reviews
Governance must also address safety. Public datasets should undergo third‑party reviews to detect illegal or harmful content, as in LAION’s collaboration with the Internet Watch Foundation[7]. AI developers should implement content filtering and allow rights holders to report problematic data. Legislators could mandate audits for datasets above a certain size or used for models deployed to the public. Risk‑based governance—already a feature of the EU’s AI Act—can be extended to training data: high‑risk domains (e.g., health or criminal justice) may require stricter scrutiny and licensing than low‑risk creative uses.
5. Harmonisation of Fair‑Use/Exception Standards
The divergence between U.S. fair‑use jurisprudence and EU/UK TDM exceptions creates uncertainty. Companies operating globally face inconsistent obligations. International organisations like WIPO could facilitate dialogue on harmonising exceptions, perhaps by establishing baseline criteria for transformative use, market substitution, and legitimate interests of rights holders. Until then, AI developers may choose to comply with the strictest regime, while lobbying for clarity and limiting liability through settlements and licences.
Conclusion: Towards an Equitable AI Economy
The battles over AI training data are about more than legal technicalities; they reflect cultural values and economic power. Creators fear that their work will be devalued; developers worry that innovation will be stifled; policymakers seek to balance these interests while promoting global competitiveness. Litigation in 2025 has begun to define the contours of legality—allowing some claims to proceed, rejecting others, and imposing unprecedented discovery obligations[35]. Regulatory proposals in the EU and UK experiment with opt‑outs and rights reservations[3][4]. Stakeholders across industries advocate for licensing, transparency and fair compensation[31]. Technology intermediaries like Cloudflare are developing tools that allow website owners to control AI crawlers and gather metrics on scraping behaviour[9], signalling that technical governance will complement legal reforms.
The way forward lies in combining these approaches: implement machine‑readable opt‑outs, facilitate collective licensing, mandate transparency and audits, and align fair‑use standards across jurisdictions. Doing so will not only reward human creativity but also provide legal certainty for AI innovators. As this article has shown, the coming battles over AI’s raw material will shape the legitimacy of the technology itself. A sustainable settlement requires respect for the people whose works underlie AI’s capabilities and a recognition that openness and innovation can coexist with fairness and accountability.
[1] [31] News and Book Publishers Launch Offensive to Stop Tech Giants from Stealing Their Content for A.I.
[2] [12] Copyright and Artificial Intelligence, Part 3: Generative AI Training Pre-Publication Version
[3] Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market (OJ L 130) – EUR‑Lex
https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/
[4] [20] [21] [22] Copyright and Artificial Intelligence - GOV.UK
[5] [15] Court shuts down AI fair use argument in Thomson Reuters Enterprise Centre GMBH v. Ross Intelligence Inc. | Perspectives | Reed Smith LLP
[6] [33] LAION vs Kneschke: Building public datasets is covered by the TDM exception
[7] LAION releases AI dataset Re-LAION-5B purged of links to child abuse images
https://the-decoder.com/laion-releases-ai-dataset-re-laion-5b-purged-of-links-to-child-abuse-images/
[8] Class Action Lawsuit Alleges Apple Illegally Uses Copyrighted Works for AI Training
[9] [10] [11] [34] Control content use for AI training with Cloudflare’s managed robots.txt and blocking for monetized content
https://blog.cloudflare.com/control-content-use-for-ai-training/
[13] [14] [26] [35] The New York Times v. OpenAI and Microsoft - Smith & Hopen
https://smithhopen.com/2025/07/17/nyt-v-openai-microsoft-ai-copyright-lawsuit-update-2025/
[16] [17] Anthropic wins early round in music publishers' AI copyright case | Reuters
[18] [19] European Parliament's New Study on Generative AI and Copyright Calls for Overhaul of Opt-Out Regime | Insights | Jones Day
[23] [24] [25] Impact of AI on intellectual property - House of Commons Library
https://commonslibrary.parliament.uk/research-briefings/cdp-2025-0081/
[27] [28] [32] Getty Images vs. Stability AI: The UK Court Battle That Could Reshape AI and Copyright Law | Articles | Finnegan | Leading IP+ Law Firm
[29] Anthropic’s Landmark Copyright Settlement: Implications for AI Developers and Enterprise Users | Insights | Ropes & Gray LLP
[30] Copyright and artificial intelligence: Impact on creative industries - House of Lords Library