Executive Summary

On March 10, 2026, the Grand Chamber of the Court of Justice of the European Union (CJEU) held a six-hour hearing — its first ever — on generative AI and copyright. The case is C-250/25, brought by the Hungarian news publisher Like Company against Google over Gemini. Three questions are in play, but the one that matters most to data practice is the third. If training an LLM is an act of reproduction, is that reproduction covered by the text-and-data-mining (TDM) exception in the EU Copyright Directive? The answer effectively sets the legal definition of what counts, in Europe, as lawfully trainable data.

It matters because the frame itself differs from the United States. America weighs the legitimacy of training after the fact through fair use. Europe's TDM exception does the opposite: it permits mining by default but removes any work whose rightholder has reserved it in a machine-readable way — an ex-ante rule. So in Europe, legality is already decided at the moment data is scraped, by each work's opt-out status. The share of news sites blocking AI training crawlers jumped from 58% to 79% in two years. The law is catching up to a practice already well underway.

Whichever way the ruling lands, the conclusion converges on one point. A dataset that cannot document and trace opt-out status, licenses, and provenance becomes a liability for European deployment. If the U.S. lawsuits were Act One of the debate over data trust, this hearing is Act Two — the moment it hardens into institutional rule. This report traces the point where provenance shifts from a litigation defense to a deployment requirement.

58% → 79%: News sites blocking training crawlers (Feb 2024 → Jan 2026, opt-out practice surging)

11 states: EU members mandating opt-out form (optional in 10 · unregulated in 6)

3% / €15M: AI Act GPAI fine ceiling (of global turnover · enforced Aug 2, 2026)

$1.5B: Bartz v Anthropic settlement (largest U.S. copyright settlement, the cost of settling after)

1. The Question That Finally Reached Europe's Top Court

On March 10, 2026, in the Grand Courtroom of the CJEU in Luxembourg, a fifteen-judge Grand Chamber took its seat. The Grand Chamber is the court's highest formation, convened only for cases the CJEU considers especially significant or of fundamental principle. The case that came before it, in six hours of oral argument, was Like Company v Google Ireland, C-250/25 — the first time Europe's top court squarely examined generative AI and copyright.

The claimant, Like Company, is a Hungarian media publisher running several news portals. The defendant is Google Ireland, which operates Gemini (formerly Bard). The case reached Luxembourg through a preliminary reference from a Budapest district court — the procedure by which a national court asks the CJEU what a piece of EU law means, then applies the answer to the dispute in front of it. That mechanism is the reason this hearing does not stop at Hungary: its outcome applies with equal weight to courts across the entire EU.

Three core questions — and where the data question sits

The Budapest court referred three core questions (some sources split the TDM sub-issue into two and count four). The first concerns the right of communication to the public: when a chatbot returns output that partly matches a protected article, does that amount to communicating the work to the public? The second concerns the reproduction right: is LLM training, which observes patterns and adjusts to them, an act of reproduction under EU copyright law? And the third: if training does involve reproduction, can that reproduction fall within the TDM exception in Article 4 of the DSM Directive (2019/790)?

This report focuses on the third. Where the first two ask whether training collides with copyright at all, the third asks how wide the door is that could make the collision lawful. That door is the TDM exception. Fixing the width of the door goes beyond deciding one infringement case: it draws the boundary of what qualifies, across all of Europe, as lawful training data.

There is no ruling yet. What happened on March 10 was an oral hearing; the opinion of the Advocate General, an independent adviser to the court whose non-binding opinion often signals where the judgment will land, is scheduled for September 3, 2026, with the final judgment to follow. So what we can read now is not "who won," but "what is at stake and which way the scale is tipping." This report reads the scale.

[Diagram: C-250/25 · Three questions referred by the Budapest court
Question 1 · Communication: Is chatbot output that mirrors a protected article a "communication to the public"? (DSM Art. 2)
Question 2 · Reproduction: Is LLM training itself an act of reproduction under EU copyright law? (DSM Art. 2)
Question 3 · TDM Exception (this report's focus): If training is reproduction, can DSM Art. 4's TDM exception shield it? This defines "lawful training data."]
▲ The three questions referred to the CJEU in C-250/25. Question 3 sets the legal definition of what qualifies, across all of Europe, as lawfully trainable data. | Pebblous original diagram

One more thing. March 10 was not only the day of the CJEU's first hearing; it was also the day the European Parliament adopted a resolution on generative AI and copyright. The judiciary and the legislature seizing the same subject on the same day reads less like coincidence and more like a signal — that the institutional center of gravity on this issue is shifting toward Europe.

2. The European Design Called the 'Data-Mining Exception'

To understand the TDM exception, you first have to see how Europe designed its copyright exceptions. The DSM Directive splits text-and-data mining into two provisions. Article 3 covers mining done by research and cultural-heritage institutions for scientific research, and rightholders cannot exclude it. Article 4 broadly permits mining by anyone, including for commercial purposes, but any work whose rightholder reserves it in a machine-readable form drops out of the exception. Whether commercial LLM training falls inside Article 4 is the crux of C-250/25.

| Dimension | Article 3 TDM exception | Article 4 TDM exception |
| --- | --- | --- |
| Who | Research & cultural-heritage bodies | Anyone, commercial included |
| Purpose | Scientific research (non-profit) | Unrestricted |
| Opt-out | Not available (cannot be excluded) | Available (machine-readable reservation) |
| Commercial LLM training | In principle, outside | The crux of this case |

How a 'machine-readable opt-out' actually works

The weight of Article 4 rests on that one condition: a "machine-readable reservation." To keep their content out of training, rightholders cannot simply say no in prose the way they would to a person; they have to mark it in a form a machine can read. In practice, that reservation is expressed through robots.txt, metadata, terms of use, and standards such as TDMRep, ai.txt, and C2PA. The EUIPO catalogs roughly eight such techniques. The signaling tools themselves are settling in fast: 57.2% of France's top 250 sites have already implemented TDMRep.
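As a minimal sketch of the robots.txt route, assuming Python's standard `urllib.robotparser`, the check a crawler would run before fetching looks roughly like this. The robots.txt content, the bot names `ExampleTrainingBot` and `ExampleSearchBot`, and the URL are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI-training crawler is blocked,
# every other crawler is allowed.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt does not disallow this user agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://news.example/politics/story-123"
blocked = may_fetch(ROBOTS_TXT, "ExampleTrainingBot", article)
other = may_fetch(ROBOTS_TXT, "ExampleSearchBot", article)
print(blocked, other)
```

In this sketch the named training bot is refused while the unnamed bot is allowed, which illustrates why a bot-specific block says nothing about crawlers it does not list.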

Yet the design has several practical fractures. robots.txt can be bypassed, and typos or misconfigurations are common. The standards are fragmented across several tracks, so it is unclear which one to trust. Above all, the boundary of what counts as "machine-readable" is still being hammered out in case law. The Hamburg appellate ruling we'll see below held that a reservation written in natural language alone is not machine-readable. Both those who signal and those who scrape are still feeling their way toward where a valid opt-out ends.
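One conservative way to cope with fragmented signals is to treat a work as reserved if any recognized signal says so. The sketch below assumes the TDMRep conventions as commonly described (a `tdm-reservation` HTTP header, and a `/.well-known/tdmrep.json` file whose entries carry `location` and `tdm-reservation` fields). The wildcard matching via `fnmatchcase`, the first-match-wins rule, the function names, and the sample data are all simplifying assumptions of this example.

```python
import json
from fnmatch import fnmatchcase


def header_reserves(headers: dict) -> bool:
    """True if a tdm-reservation header is present with value 1."""
    lowered = {k.lower(): str(v).strip() for k, v in headers.items()}
    return lowered.get("tdm-reservation") == "1"


def wellknown_reserves(tdmrep_json: str, path: str) -> bool:
    """True if the first rule matching the path sets tdm-reservation to 1.

    Assumption of this sketch: rules are tried in order, first match wins,
    and "*" in a location matches any run of characters.
    """
    for rule in json.loads(tdmrep_json):
        if fnmatchcase(path, rule.get("location", "")):
            return rule.get("tdm-reservation") == 1
    return False


def is_reserved(headers: dict, tdmrep_json: str, path: str) -> bool:
    """Conservative rule: any single signal is enough to treat the work as reserved."""
    return header_reserves(headers) or wellknown_reserves(tdmrep_json, path)


# Hypothetical site policy: news is reserved, press releases are not.
TDMREP = json.dumps([
    {"location": "/news/*", "tdm-reservation": 1},
    {"location": "/press/*", "tdm-reservation": 0},
])

print(is_reserved({}, TDMREP, "/news/2026/story"))
print(is_reserved({}, TDMREP, "/press/release-1"))
print(is_reserved({"TDM-Reservation": "1"}, TDMREP, "/press/release-1"))
```

The "any signal wins" choice errs toward skipping content; a pipeline with licenses in place might layer a license lookup on top instead of relaxing the rule.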

How firmly opt-out is nailed into law varies by member state. That means the very same reservation can be valid or invalid depending on whose national law you apply, and anyone handling data across 27 countries has that many more status values to verify. The breakdown below groups EU member states by how far they have written machine-readable opt-out into law.

Mandated (11 states): Machine-readable opt-out required by law. Germany, Hungary, Ireland, Poland, and others.

Optional (10 states): France, Spain, the Netherlands, and others. Opt-out recognized but no form imposed.

Unregulated (6 states): Denmark, Finland, Italy, and others. No separate rule, relying on the directive's text.
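In a data pipeline, this grouping becomes a per-jurisdiction lookup. The sketch below includes only the states named in this report; the remaining members of each group are not listed here, so they resolve to `None` (unknown) rather than to a guess. The names `OPT_OUT_REGIME` and `regime_for` are hypothetical.

```python
from typing import Optional

# Group sizes as reported above.
GROUP_SIZES = {"mandated": 11, "optional": 10, "unregulated": 6}

# Only the member states this report names explicitly.
OPT_OUT_REGIME = {
    "Germany": "mandated",
    "Hungary": "mandated",
    "Ireland": "mandated",
    "Poland": "mandated",
    "France": "optional",
    "Spain": "optional",
    "Netherlands": "optional",
    "Denmark": "unregulated",
    "Finland": "unregulated",
    "Italy": "unregulated",
}


def regime_for(country: str) -> Optional[str]:
    """Return the opt-out regime for a named state, or None if not listed here."""
    return OPT_OUT_REGIME.get(country)


# Sanity check: the three groups should cover all 27 member states.
total_states = sum(GROUP_SIZES.values())
print(total_states, regime_for("Hungary"), regime_for("Austria"))
```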

[Diagram: How DSM Art. 4 works, with legality decided at collection time. A published work (news article, book, code, etc.) is checked for a machine-readable opt-out. YES: cannot collect, Art. 4 excluded. NO: can collect, Art. 4 applies. Hamburg appeal (Dec 2025): a reservation in natural language only is not "machine-readable" and is void, so a standard form is required: robots.txt · TDMRep · ai.txt · C2PA.]
▲ How DSM Directive Article 4 TDM exception operates. In Europe, legality is decided at the moment of scraping, by each work's opt-out status. | Pebblous original diagram

One more norm sits on top of all this. In Recital 105 and Article 53, the EU AI Act requires that any general-purpose AI (GPAI) provider placing a model on the EU market respect Article 4 TDM opt-outs and publish a summary of its training content. This obligation does not ask where the training happened. Train outside Europe, but commercialize the model inside Europe, and Europe's rules apply. The boundary of the TDM exception does not stop at Europe's borders — it reaches outward.

The crux is timing. In Europe, legality is not something a judge weighs after training ends; it is already decided at the moment data is scraped, by each work's opt-out status. So the record of "which work's reservation status we checked, and when, and how" becomes the evidence of legality itself.
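What such a record might contain can be sketched as a small data structure. This is an illustration under stated assumptions, not a prescribed legal format: the class name `OptOutCheck`, its fields, and the JSON-lines serialization are all hypothetical choices.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class OptOutCheck:
    """One immutable record: which work, when it was checked, how, and the result."""
    url: str
    checked_at: str      # ISO 8601 timestamp in UTC
    method: str          # e.g. "robots.txt" or "tdmrep-header"
    reserved: bool       # True if a machine-readable reservation was found
    content_sha256: str  # fingerprint of the bytes that were actually fetched


def record_check(url: str, content: bytes, method: str, reserved: bool,
                 now: Optional[datetime] = None) -> OptOutCheck:
    """Build a record at collection time; `now` is injectable for testing."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    digest = hashlib.sha256(content).hexdigest()
    return OptOutCheck(url, timestamp, method, reserved, digest)


def to_log_line(check: OptOutCheck) -> str:
    """Serialize one record as a JSON line for an append-only log."""
    return json.dumps(asdict(check), sort_keys=True)


fixed_time = datetime(2026, 3, 10, 9, 0, tzinfo=timezone.utc)
check = record_check("https://news.example/story-123", b"article body",
                     "robots.txt", reserved=False, now=fixed_time)
print(to_log_line(check))
```

Hashing the fetched bytes ties the recorded status to the exact content collected, so the record still means something if the page later changes.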

3. America Asks After the Fact; Europe Decides Beforehand

For most readers, the AI-copyright story is a story of U.S. lawsuits. The New York Times sued OpenAI; authors won a $1.5 billion settlement from Anthropic. The American approach is fair use. Train first, and if it becomes a problem, a court weighs the legitimacy after the fact by balancing four factors: the purpose and character of the use, the nature of the work, the amount used, and the effect on the market. The outcome stays uncertain until the litigation ends.

Europe flips the timing. Not after training, but at the moment of collection, legality is decided in advance by each work's machine-readable opt-out status. It is an ex-ante rule, not an ex-post balancing. Facing the same "AI training and copyright" problem, the two systems place the burden at different moments and on different parties.

| Dimension | U.S. fair use | EU TDM exception |
| --- | --- | --- |
| When decided | After the fact (in litigation) | Beforehand (at data collection) |
| How decided | Four-factor balancing | Permitted by default + machine-readable opt-out |
| Who decides legality | The judge's overall judgment | Each work's opt-out status |
| Burden on data practice | Defense after the fact (litigation-ready) | Proof beforehand (collection records) |
| In one line | Puts a price on it through litigation | Draws the boundary through a rule |

[Diagram: When legality is decided, U.S. vs Europe. United States (fair use): data collection → training → deploy → lawsuit, decided ex-post. Europe (TDM exception): check opt-out → collect + train → deploy, decided ex-ante.]
▲ U.S. fair use balances legitimacy ex-post, in litigation. The EU TDM exception decides legality ex-ante, at the moment of data collection. | Pebblous original diagram

How large the American "settle-it-later" world has grown is a matter of numbers. By the Copyright Alliance's count, U.S. AI-copyright suits have passed 70, and Bartz v Anthropic closed with a settlement covering roughly 500,000 works at $3,000 each — $1.5 billion in total, the largest copyright settlement in U.S. history. That $3,000 figure is now hardening into a benchmark for future damages negotiations. The world that prices things through litigation has, in fact, started to put a price on it.

Europe draws that boundary outside the courtroom, through a rule, and it draws it first. A defense after the fact is only needed once a suit is filed; proof beforehand is needed from the first day you scrape. From a data-practice standpoint the difference is decisive, because the center of gravity moves from "defend if a problem arises" to "prove it from the moment you collect."

4. Where the Rulings Already Handed Down Point

There is no judgment in C-250/25 yet, but Europe's lower courts have already handed down several. Lined up in chronological order, they reveal the terrain the Grand Chamber now faces. Below are the major European rulings on the TDM exception, and this case's timeline.

Sep 2024 — LAION (Hamburg, first instance)

The photographer Kneschke sued the dataset-building organization LAION. The court recognized the Article 3 TDM exception for non-profit, scientific-research purposes and ruled for LAION — an early milestone on the lawfulness of building training datasets in Europe.

Oct 2024 — DPG Media v HowardsHome (Amsterdam)

A news-aggregator case, but it set out the principle of the Article 4 opt-out: blocking only specific bots does not establish a reservation against the rest, and a reservation must be explicit and machine-readable to be valid. The reasoning speaks directly to the issue in C-250/25.

Nov 2025 — GEMA v OpenAI (Munich)

Europe's first ruling to find copyright infringement tied to LLM training. It treated training as reproduction and output as communication to the public. The crux was the line it drew: the TDM exception applies only at the data-collection stage, and a model memorizing a work wholesale and reproducing it verbatim falls outside the exception.

Dec 2025 — LAION appeal (Hamburg)

The appellate court examined the commercial Article 4 as well and again ruled for LAION, but in doing so established the principle that an opt-out written in natural language alone is not machine-readable and is therefore void. The ruling nailed down how the form of a reservation determines its validity.

Mar 2026 — C-250/25 Grand Chamber, first hearing

Europe's top court holds its first oral hearing on generative-AI copyright. The same day, the European Parliament adopted a related resolution — a symbolic moment of the judiciary and legislature seizing the issue at once.

Sep 2026 — Advocate General opinion expected

The Advocate General's opinion, a non-binding advisory opinion that often signals the court's direction, is scheduled for September 3, 2026. The final judgment follows. For now, the task is to read the scale, not the verdict.

[Diagram: Where European lower courts already converge. Kneschke v LAION (Hamburg; Sep 2024 first instance, Dec 2025 appeal): a natural-language opt-out is not machine-readable. DPG Media v HowardsHome (Amsterdam, Oct 2024): an opt-out must be explicit and machine-readable. GEMA v OpenAI (Munich, Nov 2025): the EU's first LLM infringement ruling; memorization falls outside the exception. Principle 1: an opt-out is valid only in machine-readable form. Principle 2: memorization (verbatim reproduction) and commercialization fall outside the exception. Open question: will the Grand Chamber confirm this trend?]
▲ Three European lower-court rulings converging on two principles. C-250/25's Grand Chamber will either confirm or reshape this direction. | Pebblous original diagram

Member states line up; the Commission stays cautious

Several member states filed observations at the hearing. The five confirmed (Hungary, Denmark, Greece, Spain, and France) largely backed a broad, extraterritorial reading: that training and deployment are one integrated process, and that even training done outside the EU triggers EU copyright law once the model is commercialized inside the EU. The European Commission, by contrast, suggested the reference might be partly or wholly inadmissible, on the ground that the questions focused only on Gemini's functionality without identifying a specific infringing act.

So the lower-court trend and the member-state positions converge on a single direction: the exception's boundary runs up to the data-collection stage, and memorization and commercialization sit outside it. There is, of course, no guarantee the Grand Chamber will confirm that trend as is. But whichever way the ruling lands, the fallout splits. If Like Company wins, obtaining licenses becomes the de facto default for training inside Europe; if Google wins, the duty to respect opt-outs still remains — because AI Act Article 53 demands a published training-content summary and a copyright policy independent of the ruling, with enforcement beginning August 2, 2026.

"Memorization escapes the exception" (Munich) and "a natural-language opt-out is void" (Hamburg appeal). Set the two principles side by side and the conclusion appears: how you curated the data and how you verified its reservation status decides legality on its own. It is a world where the way you handle data becomes the legal outcome.

5. So How Do You Prove Data Is 'Trainable'?

Translate the hearing and the case law into the language of data practice, and one sentence remains. For data to be lawfully trainable in Europe, you have to be able to prove its provenance, its licenses, and its opt-out status. A dataset you cannot prove becomes a liability for European deployment regardless of which way the ruling goes. In a world of ex-ante rules, failing to answer "when, and in what status, did we obtain this data" is itself a defect.

The Munich ruling's "memorization equals leaving the exception" principle adds a second implication here. If a model that memorizes training data wholesale and regurgitates it verbatim leaves the exception, then lowering the probability of memorization is the same as lowering legal risk. Data that is heavily duplicated and dense with copyrighted material raises the probability of verbatim reproduction. So deduplication, source filtering, and license tagging (data curation) become compliance tools at the same time as quality work. The characteristics of the training data run through the model's interior and out into the courtroom as evidence.
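The curation steps named here can be sketched in a few lines. This is a minimal illustration that assumes exact-duplicate removal by normalized hash and a simple license allowlist; real pipelines typically add near-duplicate detection as well. The function names, the record layout (`text`, `license`), and the sample documents are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: list) -> list:
    """Keep the first document for each normalized text; drop exact repeats."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


def filter_licensed(docs: list, allowed_licenses: set) -> list:
    """Keep only documents whose license tag is on the allowlist."""
    return [doc for doc in docs if doc.get("license") in allowed_licenses]


corpus = [
    {"text": "Breaking  news: the court sat.", "license": "CC-BY-4.0"},
    {"text": "breaking news: the court sat.", "license": "CC-BY-4.0"},
    {"text": "Exclusive report.", "license": None},
]

unique_docs = deduplicate(corpus)
trainable = filter_licensed(unique_docs, {"CC-BY-4.0"})
print(len(corpus), len(unique_docs), len(trainable))
```

Documents with no license tag are dropped rather than assumed safe, which mirrors the point that an unprovable status is itself the defect.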

[Diagram: The shifting role of data provenance. Before (U.S.-litigation world): once a lawsuit is filed, provenance is something you then pull out, a defense card that is optional and reactive. After (European deployment world, under the EU TDM exception and AI Act Art. 53): before deployment, provenance is something you must already have, a deployment requirement that is mandatory and proactive. A dataset that cannot prove provenance, opt-out status, and licenses is a European deployment liability.]
▲ Data provenance moves from a defense card pulled out when a lawsuit arrives to infrastructure that must be in place before deployment. | Pebblous original diagram

This demand is not Europe's alone. Because Article 4 and AI Act Article 53 apply extraterritorially to models placed on the EU market, a Korean or American company faces the same question the moment it ships an AI product into the EU. A GPAI provider must publish a summary of its training content, and a violation can draw a fine of up to 3% of global turnover or €15 million, whichever is higher. Enforcement begins August 2, 2026. "Prove the opt-out, license, and provenance status of your training data" enters as a precondition of European deployment.
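The "whichever is higher" ceiling is easy to misread, so a short worked example may help. The function name `gpai_fine_ceiling_eur` and the turnover figures are hypothetical; amounts are whole euros to keep the arithmetic exact.

```python
FIXED_CEILING_EUR = 15_000_000   # the flat figure cited above
TURNOVER_PERCENT = 3             # the percentage of global turnover cited above


def gpai_fine_ceiling_eur(global_turnover_eur: int) -> int:
    """Return the maximum fine: 3% of global turnover or EUR 15M, whichever is higher."""
    percent_based = global_turnover_eur * TURNOVER_PERCENT // 100
    return max(percent_based, FIXED_CEILING_EUR)


# Hypothetical turnovers: below, at, and above the crossover point.
small = gpai_fine_ceiling_eur(100_000_000)       # 3% would be EUR 3M, so the flat figure applies
crossover = gpai_fine_ceiling_eur(500_000_000)   # 3% equals EUR 15M exactly
large = gpai_fine_ceiling_eur(2_000_000_000)     # 3% is EUR 60M, so the percentage applies
print(small, crossover, large)
```

Under these figures the flat €15 million governs for any provider with global turnover below €500 million, and the percentage governs above it.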

If the U.S. lawsuits were Act One of the debate over data trust, this CJEU hearing is Act Two — the moment it hardens into institutional rule. In Europe, the "Ready" in "AI-Ready" now includes "legally deployable." And the means of proving it is data provenance. Provenance is moving from a defense card you play when a suit is filed to infrastructure you must have in place before you deploy.

Editor's Note

The problem Pebblous has worked on — tracing the provenance and rights of data and diagnosing and refining its quality (DataClinic) — sits in the same place as the European demand this report describes. We read the shift, from a provenance debate that began as a response to U.S. litigation to a deployment requirement inside Europe's ex-ante rules, through the lens of data quality.

References

Official rulings & policy

  • 1. Court of Justice of the European Union. Like Company v Google Ireland Ltd., C-250/25. curia.europa.eu / InfoCuria.
  • 2. EU IP Helpdesk. (2026). "First CJEU hearing on generative AI and copyright." European Commission.
  • 3. Landgericht München I. (2025, November 11). GEMA v OpenAI.
  • 4. Landgericht / Oberlandesgericht Hamburg. (2024–2025). Kneschke v LAION (310 O 227/23 · 5 U 104/24).
  • 5. Rechtbank Amsterdam. (2024, October 30). DPG Media v HowardsHome.
  • 6. European Union. EU AI Act, Article 53 & Recital 105; DSM Directive (EU) 2019/790, Articles 3–4.

※ C-250/25 is at the oral-hearing stage as of March 10, 2026; no judgment has been issued (Advocate General opinion expected September 3, 2026). Some figures, such as market sizing and blocking rates, are estimates based on commercial research and industry tallies and vary with sample and definition.