Executive Summary
We have only just begun auditing the provenance of pretraining data. The Pile, C4, a run of copyright suits — for the past few years, the question "what did the AI learn from?" pointed squarely at document corpora. Meanwhile, the center of gravity that lifts frontier models' capability had already shifted to reinforcement learning with verifiable rewards (RLVR). The compute poured into reasoning training is growing far faster than pretraining, yet the data that actually shapes that training — verifiable tasks, answer-checkers, reward functions — is traced even more shallowly and in more fragments than any corpus. This piece is about how the frontier of the question has moved from corpus to reward signal, and how nearly empty that new frontier still is.
A recent framework called ATLAS took direct aim at that gap. It traced 1.45 million RLVR instances back to roughly twenty atomic sources, attributing almost all of them. That apparent success reveals a paradox. Most RLVR datasets turned out to be variants of a handful of upstream sources; genuinely new data was rare; and contamination — evaluation benchmarks leaking into training data — showed up throughout. More telling is what even ATLAS could not reach. The provenance of the checkers themselves, the path by which synthetic tasks were generated, the judgment inside human filtering, the design of the reward function — none of these are "datasets," so they fall outside the lineage graph.
The next battleground for data trust is not the document corpus but the provenance of verifiable tasks and reward signals. Regulation, auditing, and reproducibility all rest on this gap. While we chase pretraining lineage late, the real frontier is opening right now.
99.7%
Atom-level attribution
1.45M RLVR instances traced back to 20 atomic sources
70.4%
Top-5 source concentration
Just 5 sources account for over two-thirds of all instances; cn_k12 alone is 23.6%
80% vs 33%
The licensing paradox
80% of source content carries non-commercial terms; under 33% is labeled that way
~10×
RL compute growth
10× every 3–5 months, outpacing pretraining (~5×/year)
The Frontier Moves: From Corpus to Reward Signal
For the past few years the question "what did the AI learn from?" pointed at a single target: the document corpus. How much copyrighted book text was mixed into the web scrape, which sources went into The Pile and C4 — lawsuits and data audits chased that question. The answers came slowly, and they are still incomplete.
In the meantime, the center of gravity for capability quietly moved. A model that built its foundational language ability through pretraining is now refined through reinforcement learning with verifiable rewards (RLVR) to push its reasoning higher. RLVR trains a model by rewarding it on tasks that can be scored automatically: is the math answer correct, does the code pass its tests? Unlike RLHF, which learns human preference, here an answer-checker (a verifier) decides the reward.
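To make "an answer-checker decides the reward" concrete, here is a minimal sketch of a verifiable reward for math answers. The `\boxed{...}` answer convention, the function names, and the binary 0/1 reward are all assumptions chosen for illustration; real RLVR pipelines use their own extraction and scoring rules.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the payload of the last \\boxed{...} in a response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def math_verifier(response: str, reference: str) -> bool:
    """Judge correctness by comparing the extracted answer to the reference."""
    answer = extract_final_answer(response)
    if answer is None:
        return False
    try:
        # Compare as exact rationals so "0.5" and "1/2" count as equal.
        return Fraction(answer) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to a plain string comparison.
        return answer == reference.strip()


def reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the verifier accepts the response, else 0.0."""
    return 1.0 if math_verifier(response, reference) else 0.0


print(reward("So the answer is \\boxed{1/2}.", "0.5"))
print(reward("I think it is \\boxed{3}.", "4"))
```

Every choice in those few lines (what counts as "the answer," which forms are treated as equal, what a missing answer earns) shapes what the model is rewarded for, which is why the verifier has a lineage of its own.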
How fast this stage has grown shows up in the compute. By Epoch AI's accounting, the compute going into reasoning training is expanding roughly 10× every 3–5 months. Set against frontier compute overall, which grows about 5× a year, that pace is far steeper. The figure below shows the gap between the two growth curves.
Figure 1. Training-compute growth compared. Reasoning (RL) compute grows ~10× every few months; pretraining, ~5× a year. The AIME jump is cited from Epoch AI's reporting (cross-check against the original paper recommended). Original Pebblous diagram.
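To put the two growth rates on the same scale, the sketch below converts "10× every N months" into an equivalent yearly factor. It uses the 3–5 month range given for Epoch AI in the reference list and assumes, purely for comparison, that the pace holds for a full year; that is an extrapolation for illustration, not a forecast from the source.

```python
def annualized_factor(factor: float, period_months: float) -> float:
    """Convert 'factor x every period_months' into an equivalent yearly factor."""
    return factor ** (12.0 / period_months)


rl_fast = annualized_factor(10, 3)       # 10x every 3 months
rl_slow = annualized_factor(10, 5)       # 10x every 5 months
pretraining = annualized_factor(5, 12)   # about 5x per year

print(f"RL, 10x every 3 months: {rl_fast:,.0f}x per year")
print(f"RL, 10x every 5 months: {rl_slow:,.0f}x per year")
print(f"Pretraining:            {pretraining:,.0f}x per year")
```

Annualized, the RL pace works out to roughly 250× to 10,000× a year against about 5× for pretraining.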
The emblematic case Epoch AI cites is DeepSeek-R1-Zero. In an RL stage using only about one-fifth of pretraining's compute, its AIME 2024 score jumped from 10% to 71% in 8,000 steps. In other words, the capability gain from a relatively small compute investment is transformative. Capital moved the same way. According to The Information's September 2025 report, Anthropic cited a budget of over $1 billion for exploring the build-out of RL environments.
If the center of gravity moved, so did what needs auditing. When the data that shapes capability shifts from corpus to reward signal, the question "what did it learn from?" has to move too. Yet the tracing infrastructure is still tied to the corpus. Money flows to RL; auditing stays on the corpus. That mismatch is where this piece begins.
RLVR "Data": Not a Corpus, but Five Interlocking Parts
Tracing a corpus is hard, but the question is simple: which documents went in? In RLVR that question doesn't even hold, because RLVR "data" is not a single text corpus but an assembly of several parts.
Building one RLVR training signal takes at least five things meshing together: the task prompt the model has to solve, the verifier (a reference answer or a checker that judges correctness), the reward function that turns that judgment into a score, the rollouts (response samples) the model actually produces, and the filtering rules that decide what to keep and what to discard. The figure below shows how structurally different a corpus and RLVR data really are.
Figure 2. Structural contrast between a corpus and RLVR data. Where a corpus is a single token stream, RLVR data is an assembly of five interlocking parts. Original Pebblous diagram.
The core difference is this. A corpus is a sequence of text tokens; RLVR data is a "device that scores behavior." Each of the five parts has a different origin. Tasks are synthetically generated or derived from existing datasets; verifiers are either code written by a person or another model entirely; reward functions are rules carrying a designer's judgment; filtering is an editorial act deciding what to keep as signal. If tracing a corpus is "check the document list," tracing RLVR means walking back each of these five lineages separately. That is why it is structurally harder.
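One way to picture the five-part assembly is as a record with a separate provenance slot per part. The sketch below is illustrative only: the class name, the field names, and the free-text provenance format are assumptions, not a schema taken from any RLVR dataset.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

PARTS = ("task_prompt", "verifier", "reward_fn", "rollouts", "filter_rule")


@dataclass
class RLVRSignal:
    """One training signal as an assembly of five parts (names are illustrative)."""
    task_prompt: str                            # what the model must solve
    verifier: Callable[[str], bool]             # judges a response correct or not
    reward_fn: Callable[[bool], float]          # turns the judgment into a score
    rollouts: List[str]                         # responses the model produced
    filter_rule: Callable[[List[float]], bool]  # keep or discard this task
    # Part name -> free-text origin; a missing key means "undocumented".
    provenance: Dict[str, str] = field(default_factory=dict)

    def rewards(self) -> List[float]:
        """Score every rollout with the verifier and the reward function."""
        return [self.reward_fn(self.verifier(r)) for r in self.rollouts]

    def untraced_parts(self) -> List[str]:
        """List the parts that have no recorded origin."""
        return [p for p in PARTS if p not in self.provenance]


signal = RLVRSignal(
    task_prompt="What is 6 * 7?",
    verifier=lambda response: response.strip() == "42",
    reward_fn=lambda ok: 1.0 if ok else 0.0,
    rollouts=["42", "36"],
    # Keep tasks the model neither always solves nor always fails.
    filter_rule=lambda rewards: 0.0 < sum(rewards) / len(rewards) < 1.0,
    provenance={"task_prompt": "hypothetical_upstream_dataset"},
)

print(signal.rewards())
print(signal.untraced_parts())
```

In this toy record only the task prompt has a documented origin; the other four parts run fine but leave no trace of where they came from.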
Harder still: these five lineages barely surface even when the dataset is public. You can download the file, but the provenance docs that record where the tasks came from, or who wrote the verifier and how, usually don't ship with it. The data is, in effect, "openly closed." So to learn an RLVR dataset's lineage, there is no choice but to take the public files apart in reverse and reconstruct which upstream they branched from. ATLAS, the subject of the next section, takes on exactly that task.
ATLAS: What Atom-Level Attribution Exposed
Released in May 2026, ATLAS (arXiv:2605.26971) took that hard problem on directly. It traced 1,450,827 instances across 16 representative open RLVR datasets and attributed 99.7% of them to 20 atomic sources (single origins that can't be split further). Only about 0.3% remained of unknown provenance. Proving that tracing is possible is, at the same time, evidence that until now nobody was doing it.
3.1 "Novelty" was an illusion — extreme source concentration
The first fact the attribution surfaced is concentration. 70.4% of the 1.45M instances derive from just 5 atomic sources. Chinese high-school math (cn_k12) alone is 23.6%, followed by olympiad problems (olympiads) at 20.5%. Datasets that look like separate "new" releases turn out, once you dig to the root, to be variants branching from the same small set of upstreams. In the paper's words, "most RLVR datasets are variants of a few upstream sources, and cases that introduce genuinely new data are rare."
Figure 3. RLVR atomic-source concentration. The top 5 sources account for 70.4% of the total. Source: reconstructed by Pebblous from ATLAS (arXiv:2605.26971) Table 7.
Concentration is not, by itself, a risk. The problem is that if a few upstreams carry errors, bias, or contamination, those get amplified into a great many downstream datasets. You may believe you secured diversity by blending ten "new datasets," when in fact you reused the same root ten times. Diversity and independence become an illusion.
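The "same root ten times" effect can be shown with a few lines of arithmetic. The sketch below uses made-up counts (ten hypothetical datasets of 100 instances each, tagged with a hypothetical root) and two summary numbers: the top-k share, and the inverse Simpson index as an "effective number of sources." The second metric is a common concentration measure chosen here for illustration; the text does not say ATLAS uses it.

```python
def top_k_share(source_counts, k):
    """Fraction of all instances that come from the k largest sources."""
    total = sum(source_counts.values())
    largest = sorted(source_counts.values(), reverse=True)[:k]
    return sum(largest) / total


def effective_sources(source_counts):
    """Inverse Simpson index: how many equally sized sources the mix behaves like."""
    total = sum(source_counts.values())
    return 1.0 / sum((n / total) ** 2 for n in source_counts.values())


# Made-up blend: ten "new" datasets of 100 instances each, tagged by root.
dataset_roots = {f"dataset_{i:02d}": "root_a" for i in range(8)}
dataset_roots.update({"dataset_08": "root_b", "dataset_09": "root_b"})

counts = {}
for root in dataset_roots.values():
    counts[root] = counts.get(root, 0) + 100

print(len(dataset_roots), "datasets,", len(counts), "roots")
print(f"top-1 share: {top_k_share(counts, 1):.1%}")
print(f"effective sources: {effective_sources(counts):.2f}")
```

Ten dataset names collapse to two roots, one of which supplies 80% of the instances, so the blend behaves like roughly 1.5 independent sources rather than ten.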
What matters is that this illusion is not one paper's lone observation. A separate lineage study tracking inheritance among post-training datasets (arXiv:2604.10480) drew a graph in which 83 seed datasets spread through 971 inheritance edges into more than 430 derived datasets. Unfold datasets with all their different names and versions along their lineage, and they turn out to be branches that split, again and again, from just a few roots. Two independent studies arrived at the same conclusion.
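Walking a dataset back to its roots is a plain graph traversal once the inheritance edges are known. The sketch below assumes the edges are already available as a child-to-parents mapping; the dataset names are hypothetical, and recovering the edges in the first place is the hard part that the lineage studies address.

```python
def find_roots(dataset, parents):
    """Return the seed datasets reached by following inheritance edges upward."""
    roots, seen, stack = set(), set(), [dataset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against shared ancestors and accidental cycles
        seen.add(node)
        upstream = parents.get(node, [])
        if upstream:
            stack.extend(upstream)
        else:
            roots.add(node)  # no parents recorded: treat as a seed
    return roots


# Hypothetical inheritance edges: child -> list of parents.
parents = {
    "mix_v2": ["mix_v1", "math_clean"],
    "mix_v1": ["seed_a"],
    "math_clean": ["seed_a", "seed_b"],
}

for name in ("mix_v1", "math_clean", "mix_v2"):
    print(name, "->", sorted(find_roots(name, parents)))
```

Here three differently named datasets all trace back to `seed_a`, so any error in that seed is inherited by every one of them.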
3.2 Contamination you can't see unless you trace
The second thing exposed is contamination. At a similarity threshold of 90% or higher, ATLAS found 36,148 benchmark-leakage instances — cases where problems meant for evaluation leaked into training data. Omni-Math and HARP, in particular, directly contained evaluation benchmarks. The post-training lineage study cited above (arXiv:2604.10480) likewise confirmed benchmark leakage in 19 of its 83 seed datasets, and open-instruct-v1 reported a 46.48% duplication rate.
Contamination is frightening because it collapses capability evaluation itself. When evaluation problems leak into training, "the model solved it" quietly becomes "the model memorized it." Every model choice, investment, and regulatory judgment built on top of that then wobbles. And the leakage is invisible unless you trace provenance. Those 36K instances surfaced because ATLAS walked back to the atomic level; looking only at the dataset surface, you'd never catch them.
Figure 3-1. Benchmark contamination path. Evaluation benchmarks (Omni-Math, HARP) flow into RLVR training datasets through derivation and direct inclusion. ATLAS found 36,148 instances (≥90% similarity) across 16 datasets; the lineage study (arXiv:2604.10480) confirmed leakage in 19 of 83 seed datasets. Original Pebblous diagram (reinterpretation of Fig. 3 findings).
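A similarity-threshold screen of this kind can be sketched with Python's standard library. The version below uses `difflib.SequenceMatcher` on whitespace- and case-normalized text with a 0.9 cutoff. The choice of similarity measure and normalization is an assumption for illustration; the text gives ATLAS's threshold but not its matching method, and a pairwise scan like this would need indexing to scale to millions of instances.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def find_leaks(train_items, benchmark_items, threshold=0.9):
    """Return (train_item, benchmark_item, score) for near-duplicate pairs."""
    bench = [(b, normalize(b)) for b in benchmark_items]
    leaks = []
    for item in train_items:
        norm_item = normalize(item)
        for original, norm_bench in bench:
            score = SequenceMatcher(None, norm_item, norm_bench).ratio()
            if score >= threshold:
                leaks.append((item, original, score))
                break  # one hit is enough to flag this training item
    return leaks


# Toy data, invented for this example.
train = [
    "Find all real x such that x^2 - 5x + 6 = 0.",
    "Compute the area of a circle with radius 3.",
]
benchmark = ["Find all real  x such that x^2 - 5x + 6 = 0"]

for item, match, score in find_leaks(train, benchmark):
    print(f"{score:.3f}  {item!r} ~ {match!r}")
```

The first training item differs from the benchmark problem only in spacing and a trailing period, so it is flagged; the second is left alone.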
ATLAS's success is itself the exposure. Being able to attribute 99.7% of 1.45M instances is a technical achievement, but what looking that closely turned up was concentration into a few sources and 36K contaminated instances. Atom-level tracing revealed not how deep RLVR data runs, but how shallow and overlapping it really is.
Where Tracing Stops
There is a hidden caveat in ATLAS's 99.7%. That attribution is limited to "the provenance of task data." The paper itself names the territory it could not trace. Of RLVR data's five parts, effectively only one — the task prompt — makes it into the lineage graph. The rest are not "datasets," so they stay outside the graph.
Figure 4. Inside and outside atom-level attribution. What ATLAS attributed at 99.7% is the task data; verifiers, synthetic generation, human filtering, and reward design stay outside the lineage graph. Source: reconstructed by Pebblous from ATLAS (arXiv:2605.26971) limitations discussion.
4.1 The licensing paradox recurs in RL
What you miss when tracing stops has already been demonstrated by pretraining data. A large-scale audit by Longpre et al. at ICLR 2025 (arXiv:2412.17847) surveyed roughly 4,000 datasets (3,916, to be exact) across 608 languages, 798 sources, and 67 countries. The result was a licensing paradox. Judged by the licenses labeled at the dataset level, under 33% were restrictive; but walk back up the derivation chain to the actual source content, and over 80% carried non-commercial terms.
Figure 5. The licensing paradox. The share of restrictive terms by dataset label diverges sharply from the actual source content. Source: reconstructed by Pebblous from Longpre et al. (arXiv:2412.17847).
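The gap between a dataset's label and its source terms comes down to whether the check stops at the label or walks the derivation chain. The sketch below contrasts the two on a toy chain. The dataset names and the chain are hypothetical, the license strings are SPDX-style identifiers used as plain labels, and treating any non-commercial ancestor as restrictive is a simplifying assumption rather than a legal test.

```python
NON_COMMERCIAL = {"CC-BY-NC-4.0", "CC-BY-NC-SA-4.0"}


def upstream_licenses(dataset, licenses, parents):
    """Collect the license of the dataset and of every ancestor it derives from."""
    found, seen, stack = {}, set(), [dataset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        found[node] = licenses.get(node, "unknown")
        stack.extend(parents.get(node, []))
    return found


def restricted_by_label(dataset, licenses):
    """Check only the license attached to the dataset itself."""
    return licenses.get(dataset) in NON_COMMERCIAL


def restricted_by_source(dataset, licenses, parents):
    """Check the dataset and its whole derivation chain."""
    chain = upstream_licenses(dataset, licenses, parents)
    return any(lic in NON_COMMERCIAL for lic in chain.values())


# Hypothetical chain: a permissively labeled dataset derived from NC content.
licenses = {
    "derived_tasks": "Apache-2.0",
    "intermediate_mix": "MIT",
    "source_forum": "CC-BY-NC-4.0",
}
parents = {
    "derived_tasks": ["intermediate_mix"],
    "intermediate_mix": ["source_forum"],
}

print("by label: ", restricted_by_label("derived_tasks", licenses))
print("by source:", restricted_by_source("derived_tasks", licenses, parents))
```

The label check reports no restriction, while the chain check finds the non-commercial terms two steps upstream.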
RLVR is a world of even more derivation and synthesis. Tasks are derived from other datasets, verifiers are built from yet other models, and synthetic generation layers on top. There is no reason the licensing paradox already confirmed in pretraining wouldn't run deeper in RLVR, where tracing is even shallower. Contamination, reward hacking, irreproducibility — all three grow beyond the point where tracing stops.
The Next Battleground for Data Trust
So what should we be tracing next? The lineage of the pretraining corpus has, belatedly, become an industry topic. From a data-provenance point of view, the gap no one has yet turned into an asset is the lineage of RLVR reward data. A lineage graph that connects where each of the task, the verifier, and the reward signal came from — that is the next battleground.
Figure 6. Roadmap for what to trace next. A lineage graph connecting task, verifier, and reward signal back to their upstream sources. Original Pebblous diagram.
5.1 Regulation doesn't yet cover this gap
Regulation, first of all, rests on this gap. EU AI Act Article 10 requires documenting the provenance, representativeness, and bias of training, validation, and test data. Full application arrives in August 2026. Yet whether the tasks, verifiers, and reward functions for verifiable rewards fall under that same obligation has no settled regulatory reading. The "training data" the text envisions is calibrated to document corpora, while RL reward data, verifiers, and environments sit right on the edge of that definition. That interpretive gap is itself the regulatory risk.
5.2 If you run an RLVR pipeline
This is not abstract. Building a single high-quality RLVR dataset costs real resources: DeepMath-103K took $138,000 and 127,000 GPU hours to produce. Training on data that expensive without knowing what is actually in it wastes the asset. An organization running an RLVR pipeline can check four things right now:
- Confirm the upstream source. Walk the lineage back to find which atomic source your open RLVR datasets actually derive from. This is about checking whether a "new dataset" is in fact the same root.
- Screen for contamination. Filter for evaluation benchmarks leaking into training data at a similarity threshold (≥90%). The trustworthiness of capability evaluation hangs on this.
- Turn your own assets into lineage. Document the provenance and design decisions of the tasks, verifiers, and reward functions you built yourself as a lineage graph. Reproduction, audit, and debugging all come out of this record.
- Verify licensing down to the source. Check licenses not by the dataset label but back up to the actual source content. The real risk is the latent 80%, not the labeled 33%.
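For the third check, a lineage entry for an in-house artifact can be as small as a name, a content fingerprint, its upstream names, and the design note behind it. The sketch below is one possible shape; the record fields and the artifact name are hypothetical, and SHA-256 over the artifact text is an assumption chosen so that any change to a verifier or reward rule shows up as a new fingerprint.

```python
import hashlib
import json


def fingerprint(content: str) -> str:
    """Stable SHA-256 digest of an artifact's text (verifier code, reward rule)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def lineage_record(name, kind, content, upstream, design_notes):
    """Build a JSON-serializable provenance entry for one in-house artifact."""
    return {
        "name": name,
        "kind": kind,                      # "task", "verifier", or "reward_fn"
        "sha256": fingerprint(content),    # changes whenever the artifact changes
        "upstream": sorted(upstream),      # names of artifacts this one derives from
        "design_notes": design_notes,      # the human judgment behind the design
    }


verifier_source = (
    "def check(answer, reference):\n"
    "    return answer.strip() == reference.strip()\n"
)

record = lineage_record(
    name="exact_match_verifier_v1",        # hypothetical artifact name
    kind="verifier",
    content=verifier_source,
    upstream=[],
    design_notes="Exact string match after trimming whitespace.",
)

print(json.dumps(record, indent=2))
```

Stored next to the dataset, entries like this give the verifier and reward function a place in the lineage graph alongside the tasks they score.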
The frontier of the question "what did the AI learn from?" has already moved from document corpus to verifiable tasks and reward signals. The tracing tools and audit practices simply haven't caught up with that move. What we need to trace next is clear. The question is who will be first to have the language and the tools to treat it as an asset.
Editor's Note
Pebblous is a company that has long treated data provenance and lineage as an asset. DataClinic diagnoses data quality, and the AI-Ready Data philosophy takes as its premise that you must know what goes into training. The RLVR reward-data lineage gap this report points to is the next chapter of extending that view beyond the pretraining corpus. And ATLAS's quality score Q correlated strongly with downstream performance (Pearson r=0.96) — a quantitative piece of evidence for the proposition that knowing provenance lets you predict quality. This paragraph is editorial background; please read it as separate from the analysis in the body.
References
Academic papers
- 1. Huang, H.-Y., Liu, W., Tang, C., Lee, S., Yang, K., Chen, Y., Yang, S., & Wu, Y. (2026). "RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data" (ATLAS). arXiv:2605.26971. (published 2026-05-26; primary source)
- 2. Longpre, S., Mahari, R., et al. (2025). "Bridging the Data Provenance Gap Across Text, Speech and Video." ICLR 2025. arXiv:2412.17847. (the licensing paradox: 80% vs 33%)
- 3. "Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs" (2026). arXiv:2604.10480. (83 seeds → 430 datasets, 971 inheritance edges)
- 4. Yu, Q., et al. (2025). "DAPO: An Open-Source LLM Reinforcement Learning System at Scale." arXiv:2503.14476.
- 5. "DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset." (2025). arXiv:2504.11456. (production cost $138K / 127K GPU-hours)
- 6. Lambert, N., et al. (2024). "Tülu 3: Pushing Frontiers in Open Language Model Post-Training" (RLVR). arXiv:2411.15124.
Policy & statistics
- 7. Epoch AI (2025-05-09). "How Far Can Reasoning Models Scale?" epoch.ai. (RL compute 10× every 3–5 months; DeepSeek-R1-Zero AIME 10%→71%)
- 8. Epoch AI (2025). "The State of RL Environments." epoch.ai.
- 9. The Information (2025-09). Report on Anthropic's $1B+ budget for exploring RL-environment build-out. (via Epoch AI re-citation)
- 10. European Union (2024). "Artificial Intelligence Act — Article 10: Data and Data Governance." (documentation obligations for training/validation/test data; full application 2026-08)
Datasets & Pebblous-adjacent
- 11. Data Provenance Initiative. Publications & Multimodal Provenance Audit. dataprovenance.org.
- 12. NuminaMath-1.5. Hugging Face Datasets. huggingface.co/datasets/AI-MO/NuminaMath-1.5.
- 13. Pebblous Data Communication Team (2026-06-08). "What Is AI-Ready Data — Quality, Lineage, Governance." Pebblous Blog.