You Can Carve a Right Into Data. Can You Make It Follow the Data Into Inference? — The Traceability Gap RSL Exposes

Pebblous Data Communication Team

Executive Summary

For decades, content owners had exactly one thing they could say to a web crawler: "Stay out." The Disallow directive in robots.txt cannot express anything more nuanced than that. RSL (Really Simple Licensing), announced in September 2025 with its 1.0 spec finalized that December, is the most concrete attempt yet to replace that message with "Come in, on these terms." Led by a co-creator of RSS, this open XML standard attaches a license directly to content across five channels (robots.txt, HTTP headers, HTML, RSS, and file metadata), and it lets owners distinguish crawling, training, and inference inputs, each with its own usage terms and pricing model. More than 1,500 publishers have voiced support. This report argues that the standard is a signal not of success, but of asymmetry.

Declaring a right is easy; making that right follow the data all the way into inference is something no one has solved. Charging for the act of crawling (pay-per-crawl) works because a crawl is a single, observable HTTP event on the network: a CDN can gate it at the door or bill for it. Charging for how that data is actually used to produce an answer (pay-per-inference) is a different animal. It requires attributing, inside the model, which training documents contributed to a given output. Even the academic state of the art, influence functions, costs roughly as much compute as pretraining, is only approximate, and has no standard runtime infrastructure. The more fundamental problem sits at the starting line: more than 70% of major training datasets are missing license information entirely, so the trail is already broken before tracing can begin.

That is why the real question RSL raises is not a copyright headline but a data-infrastructure one: can you trace where data came from and which answer it ended up shaping, all the way through? Carving a license into metadata is the easy part. Making that license travel through training and inference is a different order of problem, and that empty layer is the subject of this report.

Editor's note. For the hard half of RSL (pay-per-inference) to work at all, you have to be able to trace which data shaped which output. That is fundamentally a problem of data provenance, lineage, and authenticity, and it overlaps directly with the territory Pebblous has been working on through DataClinic and AI-Ready Data. Beneath the layer that declares a license, the layer that makes data traceable enough for the right to follow is still empty. Pebblous wrote this report because data quality is the variable sitting at the center of that empty layer.

Key Figures

The four numbers below trace the backbone of this report. The easy half, charging for crawls, already runs at a scale of a billion responses a day, while 70% of training datasets flow into models with their license information already lost, severing the trail at its origin. Not a single major AI company has yet formally pledged to honor these declarations, and in the meantime a $1.5 billion settlement has nailed down that the value of this gap is anything but abstract.

Sources: Data Provenance Initiative (Longpre et al., 2023; Nature Machine Intelligence, 2024), Cloudflare Radar (2025), Bartz v. Anthropic settlement (Courthouse News, 2025), industry reporting (as of 2026-06).

Figure	What it measures	Why it matters
70%+	Training datasets missing license info	Where the trail breaks at its origin
1B/day	Cloudflare HTTP 402 responses	The scale at which pay-per-crawl already works
0	Major AI firms formally pledging RSL compliance	As of 2026-06, the enforcement vacuum
$1.5B	Bartz v. Anthropic settlement	Largest copyright settlement in U.S. history

1. How robots.txt Became a Bargaining Table

Created in 1994, robots.txt is one of the longest-surviving conventions on the web. A single text file that tells search-engine crawlers "don't scrape these paths," it has stood in for the handshake between site operators and bots for thirty years. But it can express only two things: let in (Allow) or keep out (Disallow). In a world where the worst case was a search index scraping content meant for human eyes, that binary was enough.

1.1 The Moment the Binary Broke

Generative AI changed what crawling means. Even when the same page is fetched, gathering it for a search index and gathering it to train a large language model are entirely different events for the content owner. The former sends traffic back; the latter absorbs the content's value into the model and then produces answers that route around the original site. Content owners now want to say something more precise than "stay out," something like "search indexing is fine, but if you want to train on this, pay." The robots.txt binary has no way to express that sentence.

1.2 The `License:` Directive — From Blocking to Conditional Consent

What RSL did was add one line to robots.txt. Below the familiar Disallow, it places a License: directive that points to a license document and says, in effect, "follow these terms and you may use this content." The signal shifts from binary (block / allow) to conditional (allow on these terms, at this price).
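As a rough illustration of what reading that extra line could look like on the crawler side, the sketch below scans a robots.txt body for `License:` lines. The function name, the sample file, and the example.com URL are hypothetical; the parsing rules (case-insensitive field name, `#` comments stripped) are assumptions borrowed from common robots.txt handling, not quoted from the RSL spec.

```python
def find_license_urls(robots_txt: str) -> list[str]:
    """Return the value of every `License:` line in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Assumption: '#' starts a comment, as in conventional robots.txt files.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical sample file; the URL is a placeholder.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
License: https://example.com/license.xml
"""

print(find_license_urls(SAMPLE_ROBOTS_TXT))
```

Finding the line is the easy part. Nothing in this code, or in robots.txt itself, obliges a bot to fetch the license document or act on it.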

1.3 The Inherited Original Sin — Voluntary Compliance

The paradigm changed, but RSL inherits robots.txt's birth defect intact. robots.txt is not law; it is a gentlemen's agreement. Whether a bot reads that file and obeys it rests entirely on the goodwill of whoever built the bot. The same is true of RSL's single License: line. It can now express far richer terms, but the power to enforce those terms is not in the standard itself. As expressiveness grew, so did the visible distance between the rights you can declare and the rights you can compel. That distance is the path this report follows.

2. Carving a License Into the Content — What RSL Can Express

RSL's core idea is to stop keeping the license in a separate contract and instead attach it to the content itself, as metadata, making the data carry its own usage terms wherever it goes. The standard defines that attachment across five channels: a directive in robots.txt, a Link header in the HTTP response, a <link> tag in HTML, an <rsl:content> element in an RSS feed, and metadata embedded in the file itself (EPUB, XMP, ID3). Whether the content moves as a web page, a feed, or a downloaded e-book, the license is designed to travel with it.
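To make one of the five channels concrete, here is a minimal sketch that pulls license links out of an HTML page using only the Python standard library. It covers the HTML channel only. The `rel="license"` convention, the `type` value in the sample markup, and the example.com URL are illustrative assumptions and should be checked against the RSL spec before use.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values from <link> tags whose rel includes 'license'."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rels and href:
            self.hrefs.append(href)


# Hypothetical page; attribute values are placeholders, not spec quotations.
SAMPLE_HTML = """
<html><head>
  <link rel="stylesheet" href="/site.css">
  <link rel="license" type="application/rsl+xml" href="https://example.com/license.xml">
</head><body>Article text</body></html>
"""

parser = LicenseLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.hrefs)
```

The same lookup would have to be repeated per channel (HTTP header, feed element, embedded file metadata), which is the price of having the license travel with the content in whatever form it takes.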

2.1 A Grammar of Usage Types and Pricing Models

Where RSL gains its expressive power is in splitting AI usage into distinct types. Using permits and prohibits, it can allow or forbid training (ai-train), inference input (ai-input — pulling content in at answer time, as in RAG), indexing (ai-index), and ordinary search (search) independently. To each it can attach a pricing model (payment): per-crawl (crawl), per-use or per-inference (use=inference), per-training-run (training), subscription, or free in exchange for attribution. The most common pairings of usage type and pricing model line up like this:

Usage type	Meaning	Typical pricing model
ai-train	Used as model training data	training / subscription
ai-input	Used as input for answer generation at inference time (e.g. RAG)	use=inference
ai-index	Included in an AI search index	crawl / attribution
search	Traditional search indexing	free / attribution
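The grammar in the table can be modeled as a small in-memory policy object. This is a sketch of the idea, not the RSL XML schema: the class name, the `decide` method, and the "undeclared" fallback are assumptions made for illustration, and the pricing labels are simply copied from the table above.

```python
from dataclasses import dataclass, field

# Usage types named in the text above.
USAGE_TYPES = {"ai-train", "ai-input", "ai-index", "search"}


@dataclass
class LicenseTerms:
    """Hypothetical in-memory view of one content license."""

    permits: dict[str, str] = field(default_factory=dict)  # usage type -> pricing model
    prohibits: set[str] = field(default_factory=set)

    def decide(self, usage: str) -> str:
        if usage not in USAGE_TYPES:
            raise ValueError(f"unknown usage type: {usage}")
        if usage in self.prohibits:
            return "prohibited"
        if usage in self.permits:
            return f"permitted ({self.permits[usage]})"
        # Assumption: a usage type that is neither permitted nor prohibited
        # is reported as undeclared rather than silently allowed.
        return "undeclared"


terms = LicenseTerms(
    permits={"search": "free", "ai-input": "use=inference"},
    prohibits={"ai-train"},
)

for usage in sorted(USAGE_TYPES):
    print(usage, "->", terms.decide(usage))
```

The point of the sketch is that each usage type gets its own answer, which is exactly what the Allow/Disallow binary could not express.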

2.2 The RSL Collective — From Individual Bargaining to Collective Rights

The standard alone does not give a small publisher the leverage to negotiate one-on-one with a giant AI company. The RSL camp solves this with a model borrowed from the music industry: collective rights organizations like ASCAP and BMI, which pool the rights of hundreds of thousands of songwriters to bargain en masse with broadcasters and streaming services. The RSL Collective bundles the rights of scattered publishers into negotiating power against AI firms. This is precisely why Reddit backs RSL even though it already has individual deals with Google and OpenAI: an individual deal ends with a single negotiation, while a collective-rights model leaves structural leverage behind.

▲ RSL Collective structure — pooling scattered publisher rights into the ASCAP/BMI collective-bargaining model for AI licensing | Pebblous original diagram

This is where RSL's strength ends. Splitting usage into types, attaching pricing models, and bundling rights collectively all belong to the expressiveness of the declaration. But the rights you can express and the rights you can compel are different orders of problem. How far that declaration actually gets enforced in the real world is where a right is truly tested.

3. The Easy Half and the Hard Half — pay-per-crawl vs pay-per-inference

The pricing models RSL expresses split into two worlds: charging for the act of taking the data (pay-per-crawl), and charging every time that data is actually used to produce an answer (pay-per-inference). They sit one line apart in the spec, but their real-world difficulty is not on the same scale. One already works; the other no one has solved. This asymmetry runs through the entire report. Declaring is easy; tracing is hard.

3.1 The Easy Half — A Crawl Is an Observable Event

A crawl is, at bottom, a single HTTP request on the network. Who fetched what URL, and when, is right there in the log. Observable means gateable. A CDN standing in front of the content can check the requester's identity and, if the terms aren't met, close the door or bill for access. In practice, Cloudflare returns more than a billion HTTP 402 (Payment Required) responses a day, responses that tell unpaid AI crawlers "pay up." Per-crawl is not hard. The infrastructure already exists.
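Because a crawl is one request, the gate reduces to a small decision function at the edge. The sketch below is a toy model under stated assumptions: `CrawlRequest`, `gate`, the `verified` flag, and the rule ordering are all hypothetical and do not describe Cloudflare's or Fastly's actual logic.

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass(frozen=True)
class CrawlRequest:
    bot_id: str
    path: str
    verified: bool  # assumption: identity verification happened upstream


def gate(request: CrawlRequest, paid_bots: set[str], blocked_bots: set[str]) -> HTTPStatus:
    """Decide how to answer a single crawl request."""
    if request.bot_id in blocked_bots or not request.verified:
        return HTTPStatus.FORBIDDEN          # 403: door stays closed
    if request.bot_id in paid_bots:
        return HTTPStatus.OK                 # 200: terms met, serve the content
    return HTTPStatus.PAYMENT_REQUIRED       # 402: known bot, no payment on file


request = CrawlRequest(bot_id="examplebot", path="/article", verified=True)
print(int(gate(request, paid_bots=set(), blocked_bots=set())))
```

Every input the function needs (who asked, for what, and whether they paid) is visible at the network boundary. That is the property pay-per-inference lacks.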

3.2 The Hard Half — Inference Happens Inside the Model

Pay-per-inference asks for something else entirely. Not "this document was used in training, so pay," but "this document contributed to the answer that was just generated, so pay in proportion." For that, every time the model produces an output, you would have to attribute, inside the model, which training documents contributed, and how much. Unlike a crawl, that contribution is not observable at the network boundary. The trace of training, dispersed and absorbed across billions of parameters, cannot be read off the output. The difference between the two worlds comes down to where a gate can stand.

3.3 The Wall of Training Data Attribution

Academia has studied this attribution problem for years under the name Training Data Attribution (TDA). Methods like influence functions, TracIn, and Data Shapley try to estimate "how much did this training sample influence this prediction?" The trouble is that all of them collapse the moment they leave laboratory scale. Three walls go up at once.

· Cost. Applying influence functions accurately to a large model costs about as much compute as pretraining. You cannot rerun something on the order of retraining the model just to bill for a single answer.
· Approximation. What gets used in practice is not the exact value but an approximation, and how trustworthy that approximation is on large models has not been adequately validated. The error is too large to serve as a basis for billing.
· Instability under multi-stage training. When alignment stages like RLHF layer on top of pretraining, the contribution signal for a single answer gets mixed and blurred at each stage. No clean line connects "this answer came from that document."
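To give a feel for what these methods compute and why the cost wall appears, here is a toy TracIn-style score in pure Python. It follows the checkpoint form of the idea (sum over saved checkpoints of learning rate times the dot product of the training-example gradient and the test-example gradient) on a two-parameter linear model. The data, learning rate, and function names are made up for illustration; this is not a billing-grade or production attribution method.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def train(data, lr, epochs):
    """Plain SGD; saves the weights at the start of every epoch."""
    w = [0.0] * len(data[0][0])
    checkpoints = []
    for _ in range(epochs):
        checkpoints.append(list(w))
        for x, y in data:
            g = grad(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w, checkpoints


def tracin_scores(checkpoints, lr, train_data, test_point):
    """One score per training example: sum over checkpoints of lr * <g_train, g_test>."""
    x_test, y_test = test_point
    scores = []
    for x, y in train_data:
        total = 0.0
        for w in checkpoints:
            g_train = grad(w, x, y)
            g_test = grad(w, x_test, y_test)
            total += lr * sum(a * b for a, b in zip(g_train, g_test))
        scores.append(total)
    return scores


# Toy data: the two training examples touch different parameters.
train_data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
test_point = ([1.0, 0.0], 2.0)

lr = 0.1
_, checkpoints = train(train_data, lr=lr, epochs=5)
scores = tracin_scores(checkpoints, lr, train_data, test_point)
print(scores)
```

In this toy case the first training example gets a positive score and the second gets zero, because the second never touches the parameter the test point depends on. The loop structure shows the scaling problem: each query needs a gradient per training example per checkpoint, which is manageable for two parameters and two examples and not for billions of each.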

The conclusion is sober. Pay-per-inference is not impossible in principle, but it remains an open problem with no standardized runtime infrastructure. RSL can express this pricing model, but no tracing mechanism yet exists that can enforce it. Almost no one has built the apparatus that would make a right follow the data into inference.

4. Who Does the Enforcing — Fastly, Cloudflare, and the Freedom to Ignore

For a declaration to turn into enforcement, someone has to stand at the door. RSL's weakness is that the gatekeeper role lives outside the standard. The license is carved into the content, but whether its terms are actually compelled depends on which infrastructure the content sits behind. Two enforcement models collide here. One is open; the other is closed.

4.1 Open Enforcement — The RSL Collective and Fastly

The RSL camp has partnered with the CDN provider Fastly to offer an enforcement layer that inspects license terms. A "bouncer at the door," standing in front of the content, checks an incoming crawler's identity and payment status against the RSL declaration. The point is openness: you can attach an RSL declaration to content hosted on any platform, and the standard itself is not bound to any single company. The catch is that for an actual gate to function, the content has to sit behind an enforcement partner like Fastly. Anyone can declare; only those with the infrastructure can enforce.

4.2 Closed Enforcement — Cloudflare Is a Competitor

Setting the record straight. The framing that "Reddit, the AP, and Cloudflare adopted RSL" is not accurate. Cloudflare is not an RSL adopter but the operator of a competing method. It runs its own HTTP 402–based Pay-Per-Crawl inside its CDN, and its CEO has publicly criticized RSL, to the effect that it is "good at press releases." Reddit is a confirmed founding supporter, and the AP later joined a list of more than 1,500 supporters, but "founding adoption" and "expressed support" must not be conflated.

Cloudflare's model is closed: enforcement is strong, but only for content that sits behind its own network, where it reliably blocks or charges unpaid bots on the spot. The trade-off is clear: RSL chose the openness of attaching to any platform; Cloudflare chose strong enforcement within its own CDN. If you want openness, RSL is your side; if you want enforcement that works today, Cloudflare is. The two are not marching under the same flag.

▲ Positioning of AI content rights standards on openness and enforcement axes | Pebblous original diagram

4.3 The Freedom to Ignore — Quantifying the Enforcement Vacuum

The majority of publishers, who sit behind neither Fastly nor Cloudflare, remain in a state of "able to ask, unable to compel." As long as RSL stands on top of robots.txt, there is nothing the standard can do if an AI company simply ignores the declaration. The numbers show the vacuum. AI bots' robots.txt non-compliance rate surged from 3.3% in Q4 2024 to 13.26% in Q2 2025, and as of June 2026 not a single major AI company has formally pledged to comply with RSL. More than 1,500 publishers have voiced support; zero AI companies have promised to honor the declaration.

▲ AI bot robots.txt non-compliance rate: surged from 3.3% (Q4 2024) to 13.26% (Q2 2025) — a 4× increase | Source: TechnologyChecker.io
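The "4×" in the caption is a ratio between two rates, which is easy to confuse with the percentage-point gap. A quick check of both, using only the two figures quoted above (the function name is hypothetical):

```python
def relative_change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (multiple of the earlier rate, percentage-point difference)."""
    return after_pct / before_pct, after_pct - before_pct


multiple, points = relative_change(3.3, 13.26)
print(f"{multiple:.2f}x the earlier rate, +{points:.2f} percentage points")
```

So the rate roughly quadrupled while rising by just under ten percentage points; both readings describe the same two data points.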

4.4 Competing Standards

RSL and Cloudflare are not the only ones competing over the declaration layer. Creative Commons is preparing CC Signals to mark AI-usage preferences on content, and the IETF is advancing AIPREF, an effort to standardize this on top of robots.txt. The proliferation of standards is, paradoxically, a sign that this space has no winner yet. And all of this competition plays out at the same layer: "how do we declare a right?" The layer beneath it, "how do we make a right follow the data?", is still empty.

5. For a Right to Follow Data Into Inference — The Missing Data Layer

Stack the story so far into a single picture and three layers come into view. At the top is the declaration layer: RSL, CC Signals, and AIPREF carve a license into content. Beneath it is the enforcement layer: Fastly and Cloudflare gate crawling at the network boundary. And at the very bottom, the traceability layer, the one that would make a right follow through training and inference, is nearly empty.

5.1 The Trail Is Broken at Its Origin

Saying the traceability layer is empty is not an abstraction. When the Data Provenance Initiative audited roughly 1,800 training datasets, the result put numbers on that blank space: the license-omission rate exceeded 70%, and the license-misclassification rate exceeded 50%. For a right to follow, an unbroken chain has to connect where the data came from (provenance), what terms were attached (license), and how it was absorbed into the model (attribution), and that very first link is already severed.

▲ License information disappears as original content moves through dataset collection and model training | Pebblous original diagram (based on Data Provenance Initiative audit)

The thornier fact is that dataset labels hide reality. In the same audit, more than 80% of the original source content actually used carried non-commercial restrictions, yet at the dataset level those restrictions were labeled less than 33% of the time. The rights attached to the originals evaporated in the process of bundling them into datasets. Few numbers show more clearly how different two things are: carving a license into content (declaration) and preserving that license through the data pipeline (traceability).
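The provenance, license, attribution chain described in this section can be pictured as three fields that must all survive on every record. The sketch below is an assumed data model for illustration only: the field names, the helper functions, and the three sample records are invented and are not drawn from the Data Provenance Initiative's data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    source_url: Optional[str]       # provenance: where the data came from
    license_id: Optional[str]       # license: what terms were attached
    attribution_ref: Optional[str]  # attribution: how use in the model is tracked


def first_broken_link(record: DatasetRecord) -> Optional[str]:
    """Name the first missing link in the chain, or None if it is unbroken."""
    for name in ("source_url", "license_id", "attribution_ref"):
        if not getattr(record, name):
            return name
    return None


def license_omission_rate(records: list[DatasetRecord]) -> float:
    """Share of records with no license information at all."""
    missing = sum(1 for r in records if not r.license_id)
    return missing / len(records)


# Invented toy records, not real audit data.
records = [
    DatasetRecord("https://example.com/a", "CC-BY-NC-4.0", "attr-001"),
    DatasetRecord("https://example.com/b", None, None),
    DatasetRecord(None, None, None),
]

print([first_broken_link(r) for r in records])
print(license_omission_rate(records))
```

A check like this can only report that a link is missing. It cannot restore a license that was dropped when the dataset was assembled, which is why the break at the origin matters so much.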

5.2 The Legal Stakes Are Already Real

This gap is not an abstract debate, because money is already moving. The $1.5 billion settlement in Bartz v. Anthropic, the largest copyright settlement in U.S. history, established that the legal risk of training data is no hypothesis. It works out to roughly $3,000 per work across an estimated 500,000 covered works. On the other side, Reddit, which has already signed deals with Google and OpenAI worth tens of millions of dollars a year, is pushing to renegotiate from a flat fee toward dynamic pricing tied to how much its content is relied upon. The distance between a right that stalls at declaration and a right that settles only when it can be traced is, in effect, the size of the market at stake.

5.3 Filling the Empty Layer

So the real task is not to build a better declaration grammar, but to make data traceable enough that the declaration can follow. Only when an infrastructure connecting data lineage, content authenticity (C2PA), dataset specifications (datasheets), and Training Data Attribution (TDA) is in place can RSL's idea, "make the data carry its own usage terms," extend all the way into the inference stage. For Physical AI data, where provenance is harder to state for manufacturing, sensor, and robot streams, the gap widens further. Carving a license is the starting line; making the right traceable all the way to the end is the finish line. What is empty right now is the road between them.

Frequently Asked Questions (FAQ)

We gathered the nine questions readers most often ask about this topic: what RSL is, how it differs from robots.txt, the distinction between pay-per-crawl and pay-per-inference, whether per-inference tracing is truly impossible, what happens if AI companies ignore it, its relationship to Cloudflare, how to apply it, why Reddit supports it, and why attaching a license is not enough on its own. In the end, every question converges on a single sentence: declaration and traceability are different layers.

References

RSL Standard · Industry Reporting

1. RSL Standard. (2025-12-10). "RSL AI Licensing 1.0 Now an Official Industry Standard." rslstandard.org/rsl
2. The Register. (2025-12-10). "Really Simple Licensing spec takes aim at AI scrapers." theregister.com
3. Search Engine Land. "Really Simple Licensing (RSL) explained." searchengineland.com/really-simple-licensing-461834
4. Digiday. "Arena Group, BuzzFeed, USA Today Co, Vox Media join RSL's AI content licensing efforts." Digiday.

Enforcement · Crawler Data

5. Cloudflare. "From Googlebot to GPTBot: who's crawling your site in 2025." blog.cloudflare.com
6. Cloudflare. "Introducing pay per crawl: Enabling content owners to charge AI crawlers for access." Cloudflare Blog.
7. Cloudflare. "AI crawler traffic by purpose and industry." blog.cloudflare.com
8. TechnologyChecker.io. "We Analyzed robots.txt Across Cloudflare's Network." (Analysis of AI-bot robots.txt non-compliance rates)

Academic — Limits of Traceability and Attribution

9. Longpre, S., Mahari, R., Chen, A., et al. (2023). "The Data Provenance Initiative: A Large-Scale Audit of Dataset Licensing & Attribution in AI." arXiv preprint. arxiv.org/abs/2310.16787
10. Longpre, S., et al. (2024). "A large-scale audit of dataset licensing and attribution in AI." Nature Machine Intelligence. nature.com/articles/s42256-024-00878-8
11. Alignment Forum. "Training Data Attribution: Examining Its Adoption & Use." alignmentforum.org

Legal Cost · Licensing Deals

12. Courthouse News. (2025). "Authors, publishers near final approval of $1.5 billion Anthropic copyright settlement." courthousenews.com
13. Adweek / Reddit IR. "AI licensing deals with Google and OpenAI make up ~10% of Reddit's revenue." Adweek; Reddit earnings disclosure.