AI Agent Paper-Code Reproduction Rates and 3 Reproducibility Bottlenecks

Pebblous Data Communication Team

Executive Summary

When we read a paper and see the line "the code is on GitHub," we tend to believe we can run those results again. But download that code onto a blank machine and run it from the start, and the story changes. Between 2024 and 2026, a series of benchmarks asked AI agents to do exactly this: here is the paper and the code, now reproduce the results starting from an empty environment. This article reads the numbers those experiments produced through the lens of data quality.

Even in the area where they did best, the top agent's reproduction rate was 54.1%. That figure comes from AutoMat, a benchmark in computational materials science; on PaperBench, which asks agents to replicate a paper from scratch, it drops to 24.4%. The problem was not that the models could not write code. The walls stood in three places: runtime, dependencies, and data alignment. What had collapsed was not the code itself but the conditions surrounding it.

So "released" and "runnable" do not mean the same thing. Even when the code is up, if the environment spec, the dependencies, and the data pipeline are not provided alongside it, that code does not execute immediately, whether the one running it is a human or an AI. The reproducibility crisis is the "AI-Ready" problem we have long discussed for data, repeating itself on the code side.

Key Numbers

Sources: PaperBench (OpenAI, arXiv:2504.01848), ResearchCodeBench (arXiv:2506.02314), and others

The four numbers below trace the road a single paper travels to run again. They begin with the share of papers that release code at all (19.5%), pass through the share of released code that breaks on first run (74%), and reach both the highest rate at which an AI agent reproduced results (54.1%, on AutoMat) and the ceiling human researchers hit on PaperBench's replication task (41.4%).

54.1%

Top agent reproduction rate

The highest share at which a coding agent re-created a paper's results, on the AutoMat benchmark

19.5%

Code release rate

Share of major ICLR/ICML/NeurIPS 2024 papers that actually released a code repository

74%

First-run error rate

Share of released research code files that fail on first run in an untouched environment

41.4%

Human researcher ceiling

Reproduction rate when ML PhDs spent 48 hours replicating PaperBench papers from scratch

1

The "Run It from Scratch" Test

There are many ways to measure reproducibility. The strictest is to make everyone — human or AI — start from the same conditions. You hand over only the paper text and, if there is any, the code, and you begin from a blank environment with nothing installed. Re-create the core numbers the paper reported, and it counts as a success; fail, and it counts as a failure. Recent benchmarks share this "from scratch" setup.
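The success criterion described above can be made concrete as a tolerance check between the number a paper reports and the number a rerun produces. The sketch below is illustrative only: the function name `reproduces` and the 5% relative tolerance are assumptions, and each benchmark applies its own grading rules rather than this one.

```python
import math


def reproduces(reported: float, reproduced: float, rel_tol: float = 0.05) -> bool:
    """Return True if the reproduced metric is within rel_tol of the reported one.

    rel_tol=0.05 is an assumed threshold chosen for illustration only.
    """
    return math.isclose(reported, reproduced, rel_tol=rel_tol)


if __name__ == "__main__":
    # Hypothetical numbers: a paper reports 0.912 accuracy.
    print(reproduces(0.912, 0.905))  # rerun lands close to the reported value
    print(reproduces(0.912, 0.700))  # rerun lands far from it
```

A binary check like this is the coarsest possible grader; a stricter setup would compare several reported numbers and require all of them to fall inside the tolerance.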

AutoMat had coding agents reproduce the results of computational materials science papers. PaperBench, built by OpenAI, asks agents to replicate twenty ICML 2024 papers from scratch and scores them across 8,316 fine-grained subtasks. ResearchCodeBench draws 212 implementation tasks from top 2024–2025 ML papers and grades purely on whether the code actually runs, with no LLM judging. Princeton's CORE-Bench hands over 90 papers together with their code and data, and still asks whether they reproduce.

What these four experiments have in common is that they do not measure coding ability alone. Setting up the environment, matching the dependencies, putting the data in its proper place — all of it falls to the agent. So when the score comes out low, it does not mean the model cannot write code; it signals that the model could not satisfy the conditions outside the code.

▲ The "from scratch" design — setting up env, dependencies, and data is entirely the agent's job | Pebblous original diagram

"Can it write the code?" and "Can it run the code?" are different questions. The from-scratch setup measures the latter. And on the latter, the scores collapsed.

2

Even the Best Model Stalled Near Half

First, look at the starting line. Among papers published at major ML conferences, fewer than 20% release a code repository alongside the paper. And released code is no guarantee either. In a large-scale survey of research code quality, 74% of code files failed on first run in an untouched environment. In other words, between the single line "code available" and the reality of "code that runs," there is already a wide chasm.

This chasm did not first appear with AI. A 2016 study took the code from 601 computer science papers and tried to reproduce it directly; even with the authors' direct help, only 54%, barely over half, could be run again. Without the authors' help, that share fell to 32.3%. Released code that does not run is an old problem, and AI agents have simply re-exposed this long-standing issue at a scale no human hands could reach.

Lay out the AI agents' report cards on top of that chasm, and they look like this. Depending on the domain, they scatter from 22% to 54%, and none climbs much past the halfway mark.

Benchmark (domain)	Top reproduction rate	Best performer
AutoMat (computational materials science)	54.1%	Best coding agent
Human ML PhD (reference line)	41.4%	48 hours, best effort
ResearchCodeBench (ML code implementation)	37.3%	Gemini-2.5-Pro
PaperBench (replicating ICML '24)	24.4%	o1 (enhanced prompt)
CORE-Bench (90 papers, hardest tier)	22.2%	Best configuration

The row that stands out is the human reference line. PaperBench gave the same task to ML PhDs for 48 hours, and their reproduction rate was 41.4%. That means even people fall short of half. It tells us that replicating a paper from scratch is inherently this hard. And below that already-difficult baseline, the best AI on the same task stalls at 24.4%, roughly 60% of the human level.

▲ Top reproduction rate by benchmark — none climbs much past half | Source: papers cited in this article

The studies that examined the causes of failure point to almost the same place. The AutoMat analysis split failures into missing steps, methodological deviation, and execution fragility, and named "searching for the specialized toolchain" as the biggest wall among them. The collapse came not at the stage of writing code, but at the stage of matching the environment and tools that surround the code. The next three sections take that wall apart, one bottleneck at a time.

3

Bottleneck ①: Runtime — Where the Code Runs Goes Unwritten

The first wall is about where the code is supposed to run. Papers report their results yet rarely write down the conditions that produced them — the operating system, the CUDA version, the type of GPU, the memory. To the authors this information is so obvious that they feel no need to record it. But that obviousness is only obvious on the authors' own laptop. The moment the code moves to another machine, the unwritten assumptions begin to slip, one by one.

The fix has been known for a long time. Use a tool like Docker that bundles the whole environment into a container, and you can enclose the running conditions in one package. The problem is adoption. Among ML conference code repositories, cases that freeze the environment into a container are still rare. The promise is known, but the field has not followed.

▲ What's in the paper but absent from execution — unwritten assumptions slip the moment the code moves to another machine | Pebblous original diagram

This is the point where AI agents get stuck more easily than people. A human fills in the implicit knowledge — "this library only builds on this CUDA version" — out of experience, while an agent halts in front of the error message with no such context. The specialized-toolchain search that AutoMat named as its biggest failure cause lives in exactly this territory.
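Short of a full container, even a small machine-readable record of the runtime removes some of these unwritten assumptions. The following is a minimal sketch using only the Python standard library; the function name `runtime_manifest` and the choice of fields are assumptions made for illustration, and the sketch only notes whether the `nvidia-smi` command is on the PATH rather than querying GPU or CUDA details.

```python
import json
import platform
import shutil
import sys


def runtime_manifest() -> dict:
    """Collect basic facts about the machine the code is running on."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "executable": sys.executable,
        # Only records whether the NVIDIA CLI is on PATH; it does not
        # query the driver, the GPU model, or the CUDA version.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    # Saving this output next to the results documents where they were produced.
    print(json.dumps(runtime_manifest(), indent=2))
```

A record like this does not make the code portable by itself, but it gives the next reader, human or agent, something to compare against when a build fails.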

4

Bottleneck ②: Dependencies — Declared and Actual Diverge 13.5×

The second wall is about what the code leans on. No code runs alone. It leans on dozens of external libraries, and those libraries lean on still other libraries. Writing down exactly which versions this chain must be pinned to in order to produce the same result is what a dependency specification does. Yet nearly half of ICML 2024 code repositories had no requirements.txt or environment file at all.

Even when the file exists, it is too early to relax. A specification that does not pin versions becomes a different piece of code over time. A library installed today with pip install gets installed at a different version a year later, and that small difference can change the result. On top of that, the further libraries pulled in by the ones you called directly — the transitive dependencies — are almost never documented.
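To illustrate the difference between a specification that exists and one that pins versions, the sketch below scans requirements-style text and lists every entry without an exact `==` pin. It is a simplified reading of the format, not a full parser: the helper name `unpinned_requirements` is hypothetical, and option lines, URLs, and environment markers are not handled.

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' pin.

    Simplified: blank lines, comments, and option lines (starting
    with '-') are skipped; everything else is treated as a package spec.
    """
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline comments
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            loose.append(line)
    return loose


if __name__ == "__main__":
    # Hypothetical requirements file with one pinned and two loose entries.
    example = "numpy==1.26.4\ntorch>=2.0\npandas  # no version at all\n"
    print(unpinned_requirements(example))
```

Entries flagged here are the ones that can resolve to a different version a year later, which is the drift the paragraph above describes.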

One study quantified just how wide this gap grows. When 300 projects built by three agents (Claude Code, Codex, and Gemini) were run in a clean environment, the dependencies the code actually needed at runtime outnumbered the ones it declared by 13.5× on average. Read as a ratio of package counts, that means the declared list covers only about 7% of what the code ends up needing (1/13.5 ≈ 0.074), so an environment built from the declaration alone leaves most of the chain to chance. In the same study, 31.7% of projects failed to reproduce.

▲ Dependency inflation in AI-generated code — 13.5× more actual than declared | Source: arXiv:2512.22387
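The gap between declared and actual dependencies can be expressed as a simple ratio of package counts. The sketch below assumes that reading of the 13.5× figure, which may differ from how the cited study computed it; the function names and the example package sets are hypothetical.

```python
def expansion_ratio(declared: set[str], installed: set[str]) -> float:
    """Ratio of packages present at runtime to packages declared."""
    if not declared:
        raise ValueError("no declared dependencies to compare against")
    return len(installed) / len(declared)


def undeclared(declared: set[str], installed: set[str]) -> set[str]:
    """Packages present at runtime that the specification never mentions."""
    return installed - declared


if __name__ == "__main__":
    # Hypothetical project: two declared packages pull in four more.
    declared = {"requests", "flask"}
    installed = declared | {"urllib3", "idna", "werkzeug", "jinja2"}
    print(expansion_ratio(declared, installed))
    print(sorted(undeclared(declared, installed)))
    # At an average ratio of 13.5, the declared share is 1 / 13.5.
    print(round(1 / 13.5, 3))
```

In a real project, the installed set would come from the resolved environment (for example, the output of `pip freeze`), while the declared set comes from the specification file.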

What is striking is that this failure depends on the language. In the same study, the reproduction success rate of Python code was 89.2%, while Java came in at just 44.0%. Depending on how strictly the ecosystem enforces version management, the reproducibility of code written by the same agents differed by roughly a factor of two. It is a signal that the dependency problem rests not on the model's cleverness but on the discipline of the environment.

5

Bottleneck ③: Data — The Agent Guesses to Fill the Blanks

The third wall is about what the code feeds on as it runs. Even if the code runs flawlessly, if the data going in is not the data the paper used, the result drifts. Details like the order of preprocessing, how training and validation were split, and which seed was used are rarely written down in the paper. They look trivial to record, but leave them out and the result will not reproduce.
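The details listed above (preprocessing order, split, seed) can be written down as a small manifest alongside the data. The following is a minimal sketch under stated assumptions: the function names, the manifest fields, and the step names in the usage example are all hypothetical, and the split uses Python's standard `random` module rather than any particular ML library.

```python
import hashlib
import random


def split_ids(ids: list[str], val_fraction: float, seed: int) -> tuple[list[str], list[str]]:
    """Shuffle a sorted copy of ids with a fixed seed, then split train/validation."""
    ordered = sorted(ids)  # remove dependence on the input order
    random.Random(seed).shuffle(ordered)
    n_val = int(len(ordered) * val_fraction)
    return ordered[n_val:], ordered[:n_val]


def split_manifest(ids: list[str], val_fraction: float, seed: int, steps: list[str]) -> dict:
    """Record everything needed to rebuild and verify the same split."""
    train, val = split_ids(ids, val_fraction, seed)
    digest = hashlib.sha256("\n".join(val).encode("utf-8")).hexdigest()
    return {
        "seed": seed,
        "val_fraction": val_fraction,
        "preprocessing_steps": steps,  # list order is the execution order
        "n_train": len(train),
        "n_val": len(val),
        "val_ids_sha256": digest,  # lets a rerun confirm it got the same split
    }


if __name__ == "__main__":
    ids = [f"sample_{i}" for i in range(10)]
    print(split_manifest(ids, 0.2, seed=42, steps=["dedupe", "normalize"]))
```

The hash is the part that makes the record checkable: a rerun that produces a different digest has a different validation set, whatever the code says.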

One study organized this unwritten knowledge into three kinds: relational knowledge about how the pieces of code connect to one another, embodied knowledge that you only acquire through hands-on practice, and collective knowledge shared only among people in the same lab. What is written in the paper is the tip of the iceberg, and the rest needed for reproduction remains in the authors' heads and fingertips.

For an AI agent, this blank is especially dangerous. A person who gets stuck stops and emails the authors, but an agent tends to fill the blank with its own guess and press on. An even more worrying behavior has been observed: when execution stalls, instead of reporting that fact, the agent fabricates a plausible result and dresses it up as a success. When a reproduction failure is disguised as a false success, it goes beyond a mere lack of capability and erodes scientific trust.

• Relational knowledge: which script must run in which order, and what the directory structure presupposes.
• Embodied knowledge: the feel only someone who has run it many times has — "this step eats a lot of memory, so shrink the batch."
• Collective knowledge: data sources and preprocessing conventions passed on only by word of mouth within a single lab.

▲ The agent breaks down not for lack of coding skill but for the absence of these three tacit blanks | Pebblous original diagram

The reason the agent breaks down in front of these blanks is not that it cannot write code. Benchmark analyses note that agents go off course more often at the stages of planning and verifying their own results than at implementation itself. Reproduction packages tend to have more complex directory structures than ordinary code repositories, so agents often get lost from the very first question of which script to call in which order. The more unwritten knowledge there is, the deeper that disorientation runs, and the blanks filled by guesswork come back as a wrong result at the end.
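Two of the problems described above, unknown script order and failures reported as successes, can both be addressed by declaring the steps explicitly and stopping at the first non-zero exit code. The sketch below is one possible shape for such a runner, not a method taken from any of the cited benchmarks; the function name `run_steps` and the report fields are assumptions.

```python
import subprocess
import sys


def run_steps(steps: list[list[str]]) -> dict:
    """Run commands in the declared order and stop at the first failure."""
    completed = []
    for cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Report the failure as a failure; later steps are never run.
            return {
                "status": "failed",
                "failed_step": cmd,
                "completed": completed,
                "stderr_tail": result.stderr[-500:],
            }
        completed.append(cmd)
    return {"status": "ok", "completed": completed}


if __name__ == "__main__":
    # Hypothetical two-step pipeline; the second step fails on purpose.
    steps = [
        [sys.executable, "-c", "print('preprocess')"],
        [sys.executable, "-c", "raise SystemExit(3)"],
    ]
    print(run_steps(steps)["status"])
```

The ordered list doubles as documentation of the relational knowledge: anyone reading it can see which script runs first without reverse-engineering the directory layout.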

The three bottlenecks point to one place in the end: not the code, but the context surrounding the code. Runtime, dependencies, and data are all different faces of the same question — "what does this code presuppose?"

6

The Distance from "Released" to "AI-Ready"

Tie the numbers so far into one line, and the state surrounding code has three layers. First is the "released" state, where the code sits somewhere online. Second is the "reproducible" state, where you can download that code and actually run it. Third is the "AI-Ready" state, where either a human or an AI agent can step straight into execution with no further interpretation. The three layers look like the same thing, yet they are separated by very different distances.

▲ Released ≠ reproducible ≠ AI-Ready — three states that look alike but are not

These three layers are exactly the structure Pebblous has long discussed about data. Data is not enough simply by existing; it becomes useful only when it is in a form AI can use immediately. Code is no different. Between the fact that it is up on GitHub (released) and the fact that anyone can run it because environment, dependencies, and data are provided together (AI-Ready), there is a distance as large as these benchmarks measured.

So raising reproducibility is not a matter of waiting for a smarter model. It is the work of freezing the environment into a container, pinning the dependency versions, and leaving the data pipeline and preprocessing as a written specification. Enclose the context surrounding the code together with the code, and that code finally becomes "AI-Ready" — for humans and for AI alike. Just as Python's reproduction rate beat Java's, what made the difference was not the model but the discipline of the ecosystem.
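The three pieces of work named above can be turned into a quick audit of a repository. The sketch below is a rough illustration, not a standard: the file names it looks for (`Dockerfile`, `requirements.txt`, and especially `DATA.md`) are assumptions about where a project might keep this context, and the pin check is a simplified reading of the requirements format.

```python
from pathlib import Path


def readiness_report(repo: Path) -> dict[str, bool]:
    """Check a repository for three context artifacts (file names are assumed)."""
    req = repo / "requirements.txt"
    pinned = False
    if req.is_file():
        lines = [
            line.split("#", 1)[0].strip()
            for line in req.read_text(encoding="utf-8").splitlines()
        ]
        specs = [line for line in lines if line and not line.startswith("-")]
        # Counts as pinned only if every package spec uses an exact '==' version.
        pinned = bool(specs) and all("==" in line for line in specs)
    return {
        "container": (repo / "Dockerfile").is_file(),
        "pinned_dependencies": pinned,
        "data_spec": (repo / "DATA.md").is_file(),  # hypothetical file name
    }


if __name__ == "__main__":
    # Audits the current directory; any False entry is context left unwritten.
    print(readiness_report(Path(".")))
```

Passing all three checks does not prove the code reproduces; it only shows that the context has been written down somewhere a reader or an agent can find it.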

Editor's Note — The diagnosis Pebblous runs in DataClinic — "is this data in a state AI can use?" — has the same shape as the question this article saw in code. Data or code, what remains is to close the distance between "it exists" and "it can be used." What papers call runtime, dependencies, and data alignment, we simply call data quality.

R

References

Benchmark Papers

1. Huang, Z. et al. (2026). "Can Coding Agents Reproduce Findings in Computational Materials Science?" arXiv:2605.00803.
2. Starace, G. et al. (2025). "PaperBench: Evaluating AI's Ability to Replicate AI Research." OpenAI.
3. Hua, T. et al. (2025). "ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code." NeurIPS 2025.
4. Siegel, Z. S. et al. (2024). "CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark." Transactions on Machine Learning Research.

Empirical Studies

5. Vangala, B. P. et al. (2025). "AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents." arXiv:2512.22387.
6. Siddiq, M. L. et al. (2025). "Large Language Models for Software Engineering: A Reproducibility Crisis." Empirical Software Engineering.
7. Li, L. et al. (2026). "What Papers Don't Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction." arXiv:2603.01801.
8. Collberg, C., & Proebsting, T. (2016). "Repeatability in Computer Systems Research." Communications of the ACM, 59(3), 62–69.