Executive Summary

Through 2024, the news from the pretraining side went quiet. Making models larger no longer drove the loss curve down as steeply as it once did, and most of the visible gains came from post-training and inference-time compute. As the strategy of adding parameters approached its limit, the variable that began to separate models again was the quality of the data. This article follows public numbers to trace why that shift happened and how data curation returned as the new bottleneck.

The clearest evidence came from Hugging Face's FineWeb-Edu. When a model-based classifier kept only the text with high educational value, a model trained on 38B tokens matched one trained on an unfiltered 350B tokens: the same performance from roughly one-ninth of the data. But this curation is not free. High-quality human text is depleting fast, and the synthetic data that fills the gap can break a model if used carelessly. The moment the performance lever moved to data, choosing the data itself became the next bottleneck.

The story picks up where Chinchilla's 20:1 ratio broke and follows a single causal chain: how curation became the lever that replaced sheer model size, and how, in the same move, it turned into the new bottleneck itself. It works downward from the evidence in pretraining datasets to the labeling pipelines on the ground—so if you work with data day to day, you can read it as a guide to what to check first in your own.

The way curation began to substitute for model size is compressed into four numbers: the efficiency of curation, how far practice overshot Chinchilla's optimum, the performance a small model reached, and the scale of contamination that makes data risky again.

  • 38B = 350B (FineWeb-Edu efficiency): curated 38B tokens matched unfiltered 350B, about 9x more efficient
  • 1,875:1 (Llama 3 tokens/parameter): over-training at about 94x Chinchilla's optimal 20:1
  • 50.6% (Phi-1 HumanEval): a 1.3B small model surpassing far larger ones
  • 74% (new web pages with AI text): as of April 2025; without curation, contamination accelerates

1. Bigger Models, Flatter Curve

For a long time, the starting point for predicting a foundation model's performance was the scaling law. DeepMind's 2022 Chinchilla study showed that, for a fixed compute budget, loss bottoms out when model size and data volume grow together. A 70B model outperformed the 280B Gopher trained on the same budget, and the compute-optimal ratio settled at roughly 20 tokens per parameter. For a while, this 20:1 was the compass for the people designing models.
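The 20:1 rule turns into a quick back-of-the-envelope calculation. The sketch below assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which this article does not state; the function name is ours, chosen for illustration.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla's rough compute-optimal ratio, from the text above


def compute_optimal_split(flops: float, ratio: float = TOKENS_PER_PARAM) -> tuple[float, float]:
    """Split a training budget into (parameters, tokens).

    Assumes C ~= 6 * N * D (a standard approximation, not from this article)
    and D = ratio * N, which gives N = sqrt(C / (6 * ratio)).
    """
    n_params = math.sqrt(flops / (6 * ratio))
    n_tokens = ratio * n_params
    return n_params, n_tokens


# Budget of a 70B model trained at 20 tokens per parameter (1.4T tokens).
budget = 6 * 70e9 * (20 * 70e9)
params, tokens = compute_optimal_split(budget)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")
```

Passing a larger `ratio` shows what over-training does under the same assumption: for a fixed budget, the model shrinks as the tokens per parameter grow.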

But practice drifted away from that optimum. Once you account not just for training cost but also for inference cost after deployment, it pays to reach the same performance with a smaller model. So over-training—pouring far more data into smaller models—became the norm. The shift in ratio is steep.

  • Llama 1 (7B): starting around 142 tokens/parameter
  • Llama 2 (7B): about 284 tokens/parameter—double in a single generation
  • Llama 3 (8B, 15T tokens): 1,875 tokens/parameter, about 94x Chinchilla's optimum
  • Qwen3-0.6B (36T tokens): 60,000:1; across these four models the ratio climbs from three digits to five
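The ratios in the list above are simple divisions, and it helps to see them next to the 20:1 baseline. A minimal sketch: the Llama 3 and Qwen3 ratios are computed from the token and parameter counts quoted above, while the Llama 1 and Llama 2 ratios are taken as stated.

```python
import math

CHINCHILLA_RATIO = 20  # tokens per parameter

# Tokens-per-parameter ratios from the list above.
ratios = {
    "Llama 1 (7B)": 142,
    "Llama 2 (7B)": 284,
    "Llama 3 (8B)": 15_000_000_000_000 / 8_000_000_000,  # 15T tokens / 8B params
    "Qwen3-0.6B": 36_000_000_000_000 / 600_000_000,      # 36T tokens / 0.6B params
}


def overtraining_factor(tokens_per_param: float) -> float:
    """How many times past the 20:1 optimum a model was trained."""
    return tokens_per_param / CHINCHILLA_RATIO


for name, ratio in ratios.items():
    factor = overtraining_factor(ratio)
    print(f"{name:14s} {ratio:>9,.0f}:1  {factor:>8,.1f}x  log10={math.log10(ratio):.2f}")
```

The `log10` column is the quantity a log-scale bar chart of these ratios would plot.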
[Figure: over-training ratio progression on a log scale, relative to the Chinchilla optimum of 20:1. Chinchilla 20:1; Llama 1 (7B) 142:1; Llama 2 (7B) 284:1; Llama 3 (8B) 1,875:1; Qwen3-0.6B 60,000:1. Bar length = log₁₀(tokens per parameter).]
▲ Original Pebblous diagram — generational acceleration of over-training ratios vs. Chinchilla optimum (20:1) | Source: respective model technical reports

Models thus learned longer, on more data. And yet, after 2024, the performance news coming out of pretraining noticeably thinned. The spotlight shifted to post-training and inference-time compute. It is from this silence that observers inferred the pretraining scaling law had hit a wall. If adding parameters and pouring in more data no longer return the gains they once did, the next question naturally turns from the quantity of data to its quality.

2. The Data Wall: High-Quality Text Runs Dry

One reason the curve stops rising even as you pour in more data is simple: the data to pour in is running out. High-quality text written by humans on the web is finite, and frontier models have already used most of it for training. What remains is either low quality or the machine-generated content that keeps multiplying. Researchers call this limit the data wall.

The pace of contamination is not trivial. According to one analysis, as of April 2025 more than 74% of newly created web pages contain AI-generated text. If the next model scrapes the web and trains on it without much filtering, that training data carries an ever-larger share of output produced by earlier models. This is the point where the old assumption that quantity could paper over quality begins to weaken.

The bottleneck has changed location. In the era when GPUs and compute were scarce, a bigger cluster decided performance. Now, even when compute is secured, whether there is enough high-quality data to feed that compute has become the harder question. The battleground for performance has moved from "how much can you compute" to "what do you select from the data that remains to feed it."

3. 38B Beats 350B

"Better data beats more data" can sound like a slogan. But over the past few years, results have accumulated that back the claim with hard numbers. The sharpest case is Hugging Face's FineWeb-Edu.

FineWeb-Edu scored the "educational value" of web text with a classifier trained on annotations from Llama-3-70B-Instruct, then kept only the top tier and discarded roughly the bottom 90%. A model trained on the 38B tokens that survived this filter matched a model trained on an unfiltered 350B tokens. A 1.82B model trained on a 1.3T-token subset built the same way outperformed models trained on the full FineWeb, MassiveText, and Dolma. The side that selected well, not the side that added more, won.
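Stripped to its mechanics, this kind of curation is "score every document, keep what clears a threshold." The sketch below shows only that shape. The keyword scorer, the 0 to 5 scale, and the threshold are hypothetical stand-ins for a trained classifier and are not FineWeb-Edu's actual settings.

```python
from typing import Callable, Iterable

# Hypothetical stand-in for a model-based educational-value classifier.
# A real pipeline would call a trained model here; this keyword count
# exists only so the example runs.
EDU_HINTS = ("theorem", "definition", "example", "because", "therefore")


def toy_edu_score(text: str) -> int:
    """Return an illustrative 0-5 score: number of hint words present."""
    lowered = text.lower()
    return min(5, sum(hint in lowered for hint in EDU_HINTS))


def filter_by_score(docs: Iterable[str], scorer: Callable[[str], int], threshold: int) -> list[str]:
    """Keep only documents whose score meets the threshold."""
    return [doc for doc in docs if scorer(doc) >= threshold]


docs = [
    "Definition: a prime has exactly two divisors. Example: 7. Therefore 9 is not prime.",
    "Click here for the best deals of the week!!!",
    "The theorem holds because both sides are equal.",
]
kept = filter_by_score(docs, toy_edu_score, threshold=2)
print(f"kept {len(kept)} of {len(docs)} documents")
```

Because the scorer is passed in as a function, swapping the toy for a real classifier leaves the filtering step unchanged; the threshold then becomes the knob that trades volume against quality.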

[Figure: FineWeb-Edu, 9× less data for the same performance. Unfiltered 350B tokens (baseline performance) versus curated 38B tokens (same performance); both reach equivalent benchmark scores, so curation delivers about 9× data efficiency (HuggingFace FineWeb-Edu, 2024).]
▲ Original Pebblous diagram (FineWeb-Edu results reinterpreted) | Source: Penedo et al., HuggingFace, 2024

3.1 The Moment a Small Model Surpassed a Large One

Microsoft's Phi family pushed this logic further. Phi-1, the model from "Textbooks Are All You Need," had only 1.3B parameters, yet after training on refined textbook-quality data it scored 50.6% on HumanEval and 55.5% on MBPP, numbers that beat far larger models at code generation. The follow-up Phi-3-mini (3.8B) reached the level of GPT-3.5 or Mixtral 8x7B, a model with roughly 47B total parameters. The Phi line deliberately departed from the scaling-law curve and pulled its weight class up through the quality of its data.

Phi's trick for punching above its weight came down to two decisions about choosing data. First it filtered web text by whether it "carried knowledge" and whether it "could build reasoning ability," then added textbook-quality synthetic data the model generated itself. Phi-3 split this into two stages, learning general knowledge from the web in stage one and concentrating reasoning training on refined synthetic data in stage two. What let the small model edge out larger ones was not parameter count but the design of what to feed it. And the fact that it pulled model-generated data into training connects directly to the double edge of synthetic data we take up later.

3.2 Filtering Evolved From Heuristics to Models

A standard for comparing curation strategies fairly also emerged. DCLM (DataComp-LM) is a benchmark that pits different curation strategies against each other on the same pool, and it showed that even a simple binary fastText classifier can rival sophisticated strategies. The direction in which datasets evolved is consistent too: from perplexity filtering (Dolma) to heuristics (FineWeb), then to model- and classifier-based filtering (FineWeb-Edu, DCLM). Scale grew from RefinedWeb's 600B to FineWeb's 15T and DCLM's 240T tokens, but the lesson confirmed again and again is the same: precise filtering overwhelms a mass of unfiltered data.

4. Frontier Models Filter Data With Models

That curation is no fringe technique becomes obvious when you look at the recipe of a top-tier model. Llama 3 was trained on 15T multilingual tokens—more than eight times the 1.8T of Llama 2. Yet what the report explains with the most care is not the volume but the filtering.

The pipeline is built in several stages. Heuristic and NSFW filters strip out the obvious garbage, semantic deduplication weeds out repeated content, and a text-quality classifier scores what to keep. The part worth noting is that quality classifier. It combined a fastText model that predicts whether a piece of text is the kind that gets cited on Wikipedia with a RoBERTa-family classifier trained on Llama 2's judgments. In other words, Llama 2 curated the training data for its own successor. It is a recursive structure—filtering data with a model, then building a better model from that data.
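The staged structure described above can be sketched in a few lines. This is a toy under stated assumptions, not Meta's pipeline: the heuristic rule and the quality score are hypothetical placeholders, and exact deduplication on normalized text is swapped in for semantic deduplication to keep the example self-contained.

```python
import hashlib
import re


def passes_heuristics(text: str, min_words: int = 5) -> bool:
    """Crude heuristic filter: drop very short or mostly non-alphabetic text."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) >= 0.6


def normalized_hash(text: str) -> str:
    """Hash of lowercased, whitespace-collapsed text.

    This is exact deduplication after normalization, a simpler technique
    than the semantic deduplication the text describes.
    """
    canonical = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def toy_quality_score(text: str) -> float:
    """Hypothetical quality score in [0, 1]: share of distinct words."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def curate(docs: list[str], min_quality: float = 0.7) -> list[str]:
    """Run the stages in order: heuristics, deduplication, quality threshold."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = normalized_hash(doc)
        if digest in seen:
            continue
        seen.add(digest)
        if toy_quality_score(doc) >= min_quality:
            kept.append(doc)
    return kept


raw = [
    "Gradient descent updates the weights in the direction that lowers the loss.",
    "gradient descent   updates the weights in the direction that lowers the loss.",
    "buy buy buy buy buy buy now now now now",
    "$$$ !!! ### 123",
]
print(curate(raw))
```

The ordering reflects cost: cheap checks run first so that the expensive scorer, which in a real pipeline would be a model, sees as few documents as possible.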

[Figure: Llama 3 data curation pipeline. Raw web crawl (tens of trillions of tokens) → heuristic and NSFW filters (remove obvious noise) → semantic deduplication (remove duplicates) → quality classifier trained on Llama 2 judgments (the recursive core) → domain mix → final 15T tokens (50% general / 25% reasoning / 17% code). Llama 2 curates Llama 3's training data.]
▲ Original Pebblous diagram — Llama 3 data pipeline reinterpreted | Source: Meta AI, Llama 3 Technical Report, 2024

4.1 The Data Mix Is a Product of Design

How much of what to mix is also a decision, not chance. Llama 3's final data mix was set at roughly 50% general knowledge, 25% math and reasoning, 17% code, and 8% multilingual. Domains over-represented on the web, like art or entertainment, were deliberately down-sampled, and code and reasoning data were gathered through separate extraction pipelines. It shows that data quality is not only a matter of the individual document but also of how you design the ratios between domains.
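To make the mix concrete, the shares above can be converted into per-domain token budgets. A minimal sketch, assuming the rounded percentages quoted here apply to the full 15T-token budget; the function name is ours.

```python
# Approximate Llama 3 mix as quoted above (shares of the final token budget).
DATA_MIX = {
    "general knowledge": 0.50,
    "math and reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}
TOTAL_TOKENS = 15_000_000_000_000  # 15T


def token_budget(mix: dict[str, float], total_tokens: int) -> dict[str, int]:
    """Convert domain shares into per-domain token budgets."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("domain shares must sum to 1")
    return {domain: round(share * total_tokens) for domain, share in mix.items()}


budgets = token_budget(DATA_MIX, TOTAL_TOKENS)
for domain, tokens in budgets.items():
    print(f"{domain:20s} {tokens / 1e12:5.2f}T tokens")
```

Seen this way, down-sampling an over-represented domain is just capping its budget below what the raw crawl would supply.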

Here the definition of data quality widens a step. Good data is not merely a clean sentence; it is data whose domain ratios are tuned to aim at what you want the model to become good at. Curation moves past cleaning and closer to design.

5. Synthetic Data: Escape Hatch or Trap?

If high-quality human text is running out, why not have the model make the data itself? Synthetic data has emerged as the most plausible detour around the data wall. Much of Phi's textbook-quality data was, in fact, model-generated. Yet the same synthetic data can also work to break a model.

The 2024 model collapse study published in Nature laid out that risk plainly. When you repeatedly train AI on data made by AI, the rare patterns in the tails of the distribution vanish first. Generation after generation, the model recycles only average output, and some analyses see measurable degradation within five generations in a purely recursive setting. The phenomenon was not confined to a particular architecture; it appeared across VAEs, GMMs, and LLMs.
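The tail-loss mechanism is easy to reproduce in miniature. In the toy below, the "model" is nothing more than the empirical token distribution of its training corpus, and each generation is sampled from the previous one. That is a deliberate simplification for illustration and not the experimental setup of the study.

```python
import random
from collections import Counter


def next_generation(corpus: list[str], rng: random.Random) -> list[str]:
    """'Train' on a corpus and generate a same-sized corpus from it.

    The model here is just the empirical token distribution, so generation
    is sampling with replacement (a toy assumption).
    """
    return rng.choices(corpus, k=len(corpus))


def support_sizes(generations: int, seed: int = 0) -> list[int]:
    """Track how many distinct tokens survive each generation."""
    rng = random.Random(seed)
    # Generation 0: one very common token plus a long tail of rare ones.
    corpus = ["common"] * 900 + [f"rare_{i}" for i in range(100)]
    sizes = [len(Counter(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        sizes.append(len(Counter(corpus)))
    return sizes


print(support_sizes(generations=5))
```

A token that fails to be sampled once can never come back, so the count of distinct tokens can only stay flat or fall. The common token survives while the rare ones drain away, which is the pattern the study describes for the tails of the distribution.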

[Figure: distribution collapse under repeated synthetic training (model collapse). Gen 1: human-data based, rich tail distribution. Gen 2: tail patterns begin to erode. Gen 3: rare patterns drop sharply. Gen 4: only average outputs repeat. Gen 5: measurable degradation. The tails are marked as the loss zone.]
▲ Original Pebblous diagram — distribution collapse under indiscriminate synthetic training reinterpreted | Source: Shumailov et al., Nature, 2024

5.1 The Problem Is Not Synthesis but Carelessness

Reducing the conclusion to "synthetic data is dangerous" misses the point. The keyword the collapse study singled out is "indiscriminate" use. Collapse happens when you wholesale replace human data with synthetic, without tracing provenance and after losing diversity. Conversely, synthetic data that raised diversity, tracked provenance, and kept curation discipline actually lifted small models' benchmark performance. Phi's phrase "textbook quality" reads almost like a public statement that it will use synthetic data while vouching for its quality.

So in the age of synthetic data, curation does not matter less—it matters more. What to generate, what to keep among what was generated, and how to trace where it came from are what separate collapse from a leap forward. The discipline of curation is the safety mechanism for synthetic data.

6. The Real Reason Curation Is the Bottleneck

So far this has been a story about pretraining datasets. But the more direct reason curation is called a "bottleneck" lies in field pipelines. The work of selecting and labeling data is still expensive, slow, and error-prone.

The scale alone is daunting. The data labeling and curation market is projected to grow from about $3.7 billion in 2024 to more than $17 billion by 2030, at a rate above 25% a year. A unit price of about four cents per bounding box looks small, but even a mid-sized project easily pushes the cost into six figures. Gartner estimates that poor-quality data drains $12.9 million per organization per year.
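Both figures in this paragraph can be checked with a few lines of arithmetic. The sketch assumes the market figures span six years (2024 to 2030) and that labeling is billed at a flat four cents per box, which ignores review passes and rework.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


def boxes_for_budget(budget_usd: float, price_per_box_usd: float = 0.04) -> float:
    """How many bounding boxes a budget buys at a flat unit price."""
    return budget_usd / price_per_box_usd


# $3.7B (2024) to $17B (2030) spans six years.
print(f"implied growth: {cagr(3.7, 17.0, 6):.1%} per year")
# Six figures starts at $100,000.
print(f"boxes for $100k: {boxes_for_budget(100_000):,.0f}")
```

The implied growth rate lands a little under 29% a year, consistent with "above 25%," and a six-figure bill corresponds to about 2.5 million boxes.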

Quality is the trickier part. Even benchmarks considered well-curated carry 3–6% label errors, and field pipelines are usually worse. When errors surface only weeks into training, they lead to costly rework. The bottleneck is not one place but several.

  • Scalability: it is hard to keep consistency while managing thousands of labelers at once
  • Guideline drift: labeling standards wobble little by little over time
  • The speed–accuracy trade-off, a shortage of domain-expert labelers, and the limits of automation
  • Tool silos: collection, review, and training tools are disconnected, so data cannot flow smoothly

6.1 The Direction of the Fix: Curate Before Labeling

The answer the field found is to flip the order. Instead of labeling all the collected data and then weeding it out, curate first to cut the excess before labeling. In a case introduced by Voxel51, Automotus used this approach to shrink its dataset by 35% and cut labeling cost by more than 33%. Voxel51's own VAL matched expert labels at about 95% agreement while sharply lowering cost. Of course, rare long-tail classes still need a human hand. Even so, the direction is clear: the earlier you move the work of selecting well, the more downstream cost and error you remove.
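One simple form of "curate before labeling" is to drop near-duplicates so that annotators never see them. The sketch below is a hypothetical illustration using word-set Jaccard similarity on short text descriptions; it is not the method Automotus or Voxel51 used, which this article does not describe, and the sample items and unit price are invented.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(items: list[str], max_similarity: float = 0.8) -> list[str]:
    """Greedy pass: keep an item only if it is not too similar to one already kept."""
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for item in items:
        tokens = set(item.lower().split())
        if all(jaccard(tokens, seen) <= max_similarity for seen in kept_tokens):
            kept.append(item)
            kept_tokens.append(tokens)
    return kept


def labeling_cost(n_items: int, price_per_item: float) -> float:
    """Labeling cost at a flat unit price."""
    return n_items * price_per_item


# Invented sample items standing in for frames awaiting labels.
items = [
    "truck parked in loading zone camera 3",
    "truck parked in loading zone camera 3 again",
    "cyclist passing delivery van at night",
    "truck parked in loading zone camera 3",
]
selected = drop_near_duplicates(items)
before = labeling_cost(len(items), 0.5)
after = labeling_cost(len(selected), 0.5)
print(f"labeling {len(selected)} of {len(items)} items: cost {before:.2f} -> {after:.2f}")
```

The greedy comparison is quadratic in the number of kept items, so it only suits small batches. The point is the ordering: every item removed here is an item nobody pays to label or review.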

7. Data and Models Cannot Be Separated

The dichotomy of "bigger model or better data" is convenient but inaccurate. Recent theory says the two have to be handled together. The traditional scaling law assumed data was uniformly high quality and mutually substitutable, whereas real data has duplication, imbalance, and gaps in concept coverage.

So the 2025 quality-aware scaling law introduced a dimensionless parameter Q for data quality, extending Chinchilla's loss function into a joint function of model size, data volume, and data quality. In the same vein, "Data curation cannot be compute-agnostic" showed that the optimal filtering strategy changes with the compute budget. Optimizing curation and scale separately gives you the wrong answer.
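To see what "a joint function of model size, data volume, and data quality" can mean, start from Chinchilla's parametric loss and let Q enter it. The constants below are the fitted values commonly quoted from the Chinchilla paper and should be treated as assumptions here. The way Q enters, as a multiplier on the effective token count, is our illustrative guess and not the published quality-aware law.

```python
# Parametric loss from the Chinchilla paper: L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are the commonly quoted fitted values; this article does not list
# them, so treat them as assumptions.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Loss as a function of model size N and data volume D."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def quality_aware_loss(n_params: float, n_tokens: float, quality: float, gamma: float = 1.0) -> float:
    """Hypothetical extension: quality Q rescales the effective token count.

    D_eff = D * Q**gamma is an illustrative assumption, not the published form.
    """
    effective_tokens = n_tokens * quality**gamma
    return chinchilla_loss(n_params, effective_tokens)


n, d = 1e9, 100e9
for q in (0.5, 1.0, 2.0):
    print(f"Q={q:.1f}: loss {quality_aware_loss(n, d, q):.4f}")
```

With Q = 1 the function reduces to the original Chinchilla form, and under this assumed form raising Q has the same effect as adding tokens, which is one way to read the claim that quality and quantity are partly interchangeable.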

There is evidence in the opposite direction too. Studies like LIMO and s1 showed that selecting only small but valid and challenging examples can raise reasoning performance more than pouring in data in bulk—a result that runs head-on into the "more is better" intuition. When curation wins and when training on everything is optimal has now become a core research question in itself.

Put together, the conclusion is simple. As the race for model size nears its limit, the variable that decides performance has moved to data quality, and the curation that produces that quality has become a first-class engineering problem inseparable from compute and scale. Data curation capability is, in effect, the competitiveness of model performance.

Editor's Note

Pebblous works on diagnosing and refining the quality of data before it enters training. Putting the evidence this article followed into our own words: data curation is not the cleanup after a model is finished, but the up-front input that decides what to feed it. The place this article arrives at is that making better data before chasing a bigger model—AI-Ready Data—is the next lever for performance.
