Synthetic Data Market Failure & Provenance Subsidies

Executive Summary

When an AI is trained on the output of an earlier generation of AI, data quality degrades with each round. This is usually called "model collapse." A May 2026 economics paper on arXiv reframes it. This is not an engineering story where bad data goes in and a bad model comes out. It is a market failure in which the quality of a commodity, data, deteriorates endogenously as a function of its own market share. Change the diagnosis and you change the prescription. The answer is not censorship. It is a price.

The heart of the paper is a subsidy formula that fixes exactly how much a producer of authentic data should be paid. The optimal subsidy is $s^* = \mathrm{KL}(q_\rho \,\|\, p) / (2\kappa)$: how far the current distribution has drifted from the original (its KL divergence) sets the price you owe. The intuition that authentic data grows more valuable the worse the contamination gets is written here not as an observation but as an equation. On the C4 benchmark the estimated collapse coefficient landed within 1σ of the theoretical value of 0.183 ($R^2 = 0.951$), and PMIR, the algorithm that applies the formula iteratively in a market, pulled the contamination rate down from 78% to 41%.

This article follows that logic through the eyes of someone who buys and sells data. It takes up three questions in order: why the paper says "market failure" rather than "contamination," how subsidies and watermarks shift from words in a regulatory document to variables in an economic model, and why the ability to measure authenticity is the same as the ability to charge for it.

• +23.1%: Model quality gain (PMIR vs. unregulated benchmark)
• 78%→41%: Contamination rate drop (after iterated subsidy)
• 0.318→0.142: Distribution drift (2-Wasserstein, down 55%)
• R²=0.962: Collapse-law fit (10-generation retraining)

1. Not Contamination, a Market Failure

Most writing about model collapse treats it as an engineering problem. Synthetic data piles up across the internet, the next generation of models feeds on it, and the tails of the distribution get clipped away. With each round the diversity of the original fades and outputs converge on the mean. Bad input makes bad output. The remedy is engineering too: filter out the synthetic data and pour in more human-made data.

This paper names the same phenomenon differently. The quality of a commodity, data, degrades on its own as a function of the market share that commodity holds. The more synthetic data you use, the more contaminated the next generation's training data becomes, and contaminated data in turn breeds still more synthetic data. Quality is determined endogenously, inside the market structure itself. This is a new kind of market failure that classical information economics never anticipated, and the paper gives this equilibrium a name: SDCE.

Why call it a market failure? A company that produces and sells synthetic data reasons only about its own profit. It does not factor in the cost its data imposes when it mixes into the shared data pool and lowers the quality for every other learner. Economists call this structure an externality. If a factory dumps wastewater into a river without paying for cleanup, the wastewater keeps growing. The recursive contamination of synthetic data has the same shape. A choice that is rational for each individual participant destroys Pareto efficiency for the whole.

The shift in framing is not trivial. Call it "contamination" and the remedy becomes the language of cleanup and blocking — the language of censorship. Call it "market failure" and the remedy becomes the work of pricing the externality — the language of taxes and subsidies. Just as you levy a charge on wastewater, you attach a subsidy to authentic data. What this paper does is calculate exactly how large that subsidy should be.

▲ Pebblous original diagram — self-reinforcing synthetic-data externality cycle (Fig. 1 reinterpretation)

2. Pricing What We Lose

To calculate a subsidy you first have to put a number on what is being lost. The paper decomposes social welfare into four terms: producer surplus and consumer surplus are added, and two kinds of loss are subtracted.

$$W = W_{\text{prod}} + W_{\text{cons}} - L_{\text{coll}} - L_{\text{info}}$$

Social welfare decomposition. $L_{\text{coll}}$ is the collapse loss, $L_{\text{info}}$ the information-asymmetry loss.

The two losses are the protagonists of this article. First, the collapse loss $L_{\text{coll}}$ measures, in KL divergence, how far synthetic contamination has pushed the data distribution away from the original. The worse the contamination, the farther the distribution drifts and the larger the loss. Second, the information-asymmetry loss $L_{\text{info}}$ is a lemons-market penalty. When a buyer cannot verify the authenticity of data, there is no reason to pay a premium for authentic data. Good and bad data sell at the same price, and in the end the producers of good data leave the market.

If the decomposition feels abstract, the paper's empirical work brings it down to earth. The log of model quality falls linearly with each generation, at a rate proportional to the square of the contamination rate.

$$\log Q_t = \log Q_0 - 0.183\, t\, \rho^2$$

The collapse law. $Q_t$ is the quality of the $t$-th generation model, $\rho$ the contamination rate. In a 10-generation retraining experiment, $R^2 = 0.962$.

On the C4 benchmark, a reduced-form regression estimated the collapse coefficient at 0.181 with a standard error of 0.024. That falls within 1σ of the 0.183 the theory predicted, and the coefficient of determination was 0.951. The authors read this as evidence that the collapse rate is not an accident of a particular corpus or architecture but a structural constant. The fact that the contamination rate $\rho$ enters as a square is especially consequential. Double the share of synthetic data and the loss in log quality quadruples. Because the law is linear in the log, quality itself decays exponentially with each generation, and the loss keeps compounding across 10 generations, roughly three years of retraining cycles.

▲ Pebblous original diagram — collapse law: double the contamination rate, quadruple the quality loss (10-generation simulation, Fig. 2 reinterpretation)
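To make the quadratic dependence concrete, here is a minimal sketch that evaluates the collapse law as stated above. The function names and the sample contamination rates are illustrative assumptions, not taken from the paper; only the coefficient 0.183 and the functional form come from the article.

```python
import math

# Theoretical collapse coefficient quoted in the article.
COLLAPSE_COEFF = 0.183


def log_quality_loss(generations: int, contamination: float,
                     coeff: float = COLLAPSE_COEFF) -> float:
    """Return log Q_0 - log Q_t = coeff * t * rho^2."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must lie in [0, 1]")
    return coeff * generations * contamination ** 2


def quality_ratio(generations: int, contamination: float,
                  coeff: float = COLLAPSE_COEFF) -> float:
    """Return Q_t / Q_0, the share of original quality that survives."""
    return math.exp(-log_quality_loss(generations, contamination, coeff))


if __name__ == "__main__":
    # Illustrative contamination rates, not values from the paper.
    for rho in (0.2, 0.4, 0.8):
        loss = log_quality_loss(10, rho)
        ratio = quality_ratio(10, rho)
        print(f"rho={rho:.1f}  log-loss={loss:.4f}  Q_10/Q_0={ratio:.3f}")
```

Each doubling of `rho` multiplies the log-loss column by four, while the surviving quality ratio shrinks as the exponential of that loss.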

3. The Provenance Subsidy: When KL Divergence Becomes a Price Tag

Once the loss is written as a number, the optimal subsidy that halts that loss comes out as a number too. The optimal subsidy the paper's Corollary 1 derives for a producer of authentic data is this.

$$s^* = \frac{\mathrm{KL}(q_\rho \,\|\, p)}{2\kappa}$$

The optimal provenance subsidy. $\mathrm{KL}(q_\rho \,\|\, p)$ is the KL divergence between the contaminated and the original distribution; $\kappa$ is the marginal collapse weight.

The formula is built from two quantities. The numerator is how distorted the data is right now. The denominator $\kappa$ is a weight that captures how sensitive society is to collapse. The intuition is simple: the worse the contamination, the larger the KL divergence, the more you must pay for authentic data. The value of authenticity becomes a floating price that moves with the state of the market. Authentic data yesterday and authentic data today carry different prices, by exactly the amount contamination has advanced.
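The floating price can be sketched in a few lines. The example below computes $\mathrm{KL}(q_\rho \,\|\, p)$ for discrete distributions and divides by $2\kappa$. Several things here are assumptions for illustration only: the four-bin distributions, the value of `kappa`, and above all the modelling of $q_\rho$ as a simple mixture of the original and a synthetic distribution, which the article does not specify.

```python
import math
from typing import List, Sequence


def kl_divergence(q: Sequence[float], p: Sequence[float]) -> float:
    """KL(q || p) in nats for two discrete distributions on the same bins."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same number of bins")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # 0 * log(0 / p) is taken as 0 by convention
        if pi == 0.0:
            return math.inf  # q puts mass where p has none
        total += qi * math.log(qi / pi)
    return total


def contaminate(p: Sequence[float], g: Sequence[float], rho: float) -> List[float]:
    """ASSUMPTION: model q_rho as the mixture (1 - rho) * p + rho * g."""
    return [(1.0 - rho) * pi + rho * gi for pi, gi in zip(p, g)]


def optimal_subsidy(q_rho: Sequence[float], p: Sequence[float], kappa: float) -> float:
    """s* = KL(q_rho || p) / (2 * kappa)."""
    if kappa <= 0.0:
        raise ValueError("kappa must be positive")
    return kl_divergence(q_rho, p) / (2.0 * kappa)


if __name__ == "__main__":
    original = [0.25, 0.25, 0.25, 0.25]   # hypothetical authentic distribution p
    synthetic = [0.70, 0.20, 0.10, 0.00]  # hypothetical synthetic distribution
    kappa = 0.5                           # hypothetical marginal collapse weight
    for rho in (0.0, 0.4, 0.8):
        s_star = optimal_subsidy(contaminate(original, synthetic, rho), original, kappa)
        print(f"rho={rho:.1f}  s*={s_star:.4f}")
```

Under these assumptions the subsidy is zero when nothing has drifted and rises as the mixture moves away from the original, which is the "floating price" behaviour the formula describes.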

This is where the paper parts ways with the regulatory conversation. Regulation usually imposes duties: "label your synthetic data," "disclose your sources." This formula instead says, "attach this much value to authentic data." The paper puts three policies on the same scale.

Policy	Quality loss	Welfare gain
Optimal subsidy s*	−1.1%	+0.031
Mandatory disclosure	−0.6%	+0.024
Statutory royalty cap	−1.9%	+0.012

The optimal subsidy leads on welfare gain. What is worth noticing is the runner-up: mandatory disclosure. It is cheap to implement yet delivers a gain close to the subsidy's. Where authenticity is hard to buy directly, simply making sources transparent already produces a substantial gain. A royalty cap, by contrast (a form of price control), has the largest quality loss and the smallest gain. Intervening to suppress the price only chokes off the supply of authentic data.

4. When Watermarks Become Economic Variables

Watermarks usually belong to the language of regulation. Like the machine-readable marks the EU AI Act requires, they are understood as tags stuck onto synthetic content. This paper recasts the watermark as an economic variable. The optimal watermark strength comes out as a function of detectability and the degree of contamination.

$$w^* = \frac{(1-\psi)\,\mathrm{KL}(q_\rho \,\|\, p)}{2\kappa\psi}$$

The optimal watermark strength. $\psi$ is watermark detectability (0–1).

Detectability $\psi$ enters through the factor $(1-\psi)/\psi$, so $w^*$ is the subsidy $s^*$ scaled by that factor. The better detection technology gets, that is, the larger $\psi$ grows, the smaller the required watermark strength $w^*$. As detection approaches perfect ($\psi \to 1$), the required strength falls toward zero and watermarking becomes economically equivalent to a cash subsidy. Conversely, when detection is poor, you have to crank the watermark up to compensate for the gap. Investment in detection technology comes back as a reduction in watermark cost. Technology investment and regulatory burden trade off inside a single equation.
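The trade-off is easy to tabulate. The sketch below evaluates the watermark formula over a range of detectability values. The KL value and `kappa` are hypothetical placeholders, and the function name is ours, not the paper's.

```python
def optimal_watermark_strength(kl: float, kappa: float, psi: float) -> float:
    """w* = (1 - psi) * KL / (2 * kappa * psi), for detectability psi in (0, 1]."""
    if not 0.0 < psi <= 1.0:
        raise ValueError("psi must lie in (0, 1]")
    if kappa <= 0.0:
        raise ValueError("kappa must be positive")
    return (1.0 - psi) * kl / (2.0 * kappa * psi)


if __name__ == "__main__":
    kl = 0.2     # hypothetical KL(q_rho || p)
    kappa = 0.5  # hypothetical marginal collapse weight
    subsidy = kl / (2.0 * kappa)  # s* from the subsidy formula
    for psi in (0.1, 0.5, 0.9, 1.0):
        w_star = optimal_watermark_strength(kl, kappa, psi)
        print(f"psi={psi:.1f}  w*={w_star:.4f}  w*/s*={w_star / subsidy:.2f}")
```

Since $w^* = s^*\,(1-\psi)/\psi$, the two quantities coincide at $\psi = 1/2$; below that the required strength exceeds the subsidy, and above it the strength shrinks toward zero.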

The paper does not stop at optimism. Theorem 4 proves an impossibility result: "under information constraints, complete verification of authenticity is unimplementable." Producer-side observation alone can never fully separate authentic data from the rest. The implication is sharp. Provenance certification based on blockchain or smart contracts does not escape this limit either, because a source recorded in a ledger does not fully guarantee actual authenticity. That is why the paper concludes that a cash transfer, a direct subsidy, is more effective than an elaborate certification apparatus.

This is where data provenance and watermarking move from words in a regulatory document to variables in economics. Provenance becomes the input for computing KL divergence, and the watermark detection rate $\psi$ becomes a parameter traded off against the subsidy. Authenticity is no longer a principle that is merely nice to uphold; it becomes an economic good that can be priced, bought, and sold.

5. PMIR: Running the Theory in a Live Market

A formula being correct on paper and working in a market are two different things. The paper translates the theory into a runnable algorithm. PMIR (Provenance-Market Iterative Retraining) alternates between an authenticity market and retraining, finding the optimal subsidy by iterated computation. Rather than solving in one shot which data to buy and at what price, it nudges the price step by step until the market approaches equilibrium.

The results reduce to three numbers. Against an unregulated benchmark, model quality rose 23.1%. The contamination rate fell from 78% to 41%. The 2-Wasserstein drift, which measures how far things have been pushed from the original, dropped from 0.318 to 0.142 — a cut of more than half. The market actually moved in the direction the theory predicted.
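As a quick check on the relative sizes of these changes, using only the figures quoted above:

$$\frac{0.318 - 0.142}{0.318} \approx 0.553, \qquad \frac{78 - 41}{78} \approx 0.474$$

The drift falls by about 55% in relative terms. The contamination rate falls by 37 percentage points, which is roughly 47% of its starting level.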

▲ Pebblous original diagram — PMIR iterative equilibrium-finding algorithm (Fig. 3 reinterpretation)

The convergence rate was worked out too. PMIR reaches an approximate equilibrium within $O(\varepsilon^{-2} \log T)$ iterations and attains the information-theoretic lower bound up to a constant factor. Put in practitioner's terms, the algorithm narrows in on the answer to "given the current state of the market, how much is it optimal to pay for authentic data?" through repeated learning. It does not fix a price once and hold it; it re-prices as contamination advances.
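The re-pricing idea can be illustrated with a toy loop. This is not the paper's PMIR algorithm, whose details the article does not give; it is a minimal sketch of "price, observe, re-price" under loudly stated assumptions. In particular, the mixture model for $q_\rho$, the market response rule inside the loop, and the names `reprice_until_stable`, `response`, and `step` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(q: Sequence[float], p: Sequence[float]) -> float:
    """KL(q || p) in nats for discrete distributions on the same bins."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # 0 * log(0 / p) is taken as 0
        if pi == 0.0:
            return math.inf
        total += qi * math.log(qi / pi)
    return total


def mix(p: Sequence[float], g: Sequence[float], rho: float) -> List[float]:
    """ASSUMPTION: contaminated distribution is (1 - rho) * p + rho * g."""
    return [(1.0 - rho) * pi + rho * gi for pi, gi in zip(p, g)]


def reprice_until_stable(p: Sequence[float], g: Sequence[float], rho0: float,
                         kappa: float, response: float, step: float = 0.5,
                         tol: float = 1e-6, max_rounds: int = 200
                         ) -> Tuple[float, List[Tuple[float, float]]]:
    """Alternate between pricing the subsidy and letting contamination react."""
    rho = rho0
    history: List[Tuple[float, float]] = []
    for _ in range(max_rounds):
        # Price step: subsidy from the current drift, s = KL / (2 * kappa).
        subsidy = kl_divergence(mix(p, g, rho), p) / (2.0 * kappa)
        history.append((rho, subsidy))
        # HYPOTHETICAL market response: a larger subsidy draws in authentic
        # data, pulling contamination toward rho0 / (1 + response * subsidy).
        target = rho0 / (1.0 + response * subsidy)
        new_rho = rho + step * (target - rho)  # damped update
        if abs(new_rho - rho) < tol:
            rho = new_rho
            break
        rho = new_rho
    return rho, history


if __name__ == "__main__":
    original = [0.25, 0.25, 0.25, 0.25]   # hypothetical authentic distribution
    synthetic = [0.70, 0.20, 0.10, 0.00]  # hypothetical synthetic distribution
    final_rho, trace = reprice_until_stable(original, synthetic, rho0=0.78,
                                            kappa=0.5, response=5.0)
    print(f"rounds={len(trace)}  final contamination={final_rho:.3f}")
```

The starting rate of 0.78 echoes the article's figure, but the end point of this toy depends entirely on the invented response rule and should not be read as a reproduction of the paper's 41%. The design point is the loop structure: the price is never fixed once, it is recomputed from the drift each round until the market stops moving.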

6. So What Should You Buy?

For a practitioner who buys and sells data, this paper leaves four practical grounds for judgment.

• The premium on authentic data rises with the contamination rate. The higher the share of synthetic data, the larger, mathematically, the value of the authentic data you should buy. That gives you grounds to see buying authentic data not as a cost but as a hedge against contamination.
• Investing in detection technology lowers watermark cost. The better you get at spotting synthetic data, the lighter the burden of regulatory compliance. Detection and regulatory cost stand in a trade-off relationship.
• Blockchain provenance certification alone is not enough. Theorem 4 spells out the limits of certification. A direct purchase contract for authentic data is more effective than a ledger record. That is the paper's conclusion.
• A long-term supply of fresh human data is a strategic asset. To head off the exponential collapse roughly three years (ten generations) out, you have to secure a supply line of authentic data in advance.

Editor's Note

This paper names no company's product. But where its diagnosis points overlaps with ground Pebblous has long worked. The input to the subsidy formula is KL divergence, and to compute KL divergence you need to know how far the data has drifted from the original: its provenance. Data without provenance has no value to feed into the KL term; it cannot even be put into the formula. The ability to measure and prove authenticity is, in effect, the ability to charge $s^*$. By treating authenticity as a variable in an economic model, the paper gives theoretical support for why data quality can carry a price tag at all.

Pebblous Data Communication Team
July 3, 2026

References

1. Lundström-Imanov, G. O. Y. L.-F. (2026). "The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets." arXiv:2605.20279.
2. Borji, A. (2024). "A Note on Shumailov et al. (2024): 'AI Models Collapse When Trained on Recursively Generated Data'." arXiv:2410.12954.
3. Shumailov, I., Shumaylov, Z., Zhao, Y., et al. (2024). "AI Models Collapse When Trained on Recursively Generated Data." Nature, 631, 755–759.