What AI Changed Wasn't the Hypothesis — It Was the Code That Tests It

Pebblous Data Communication Team

Executive Summary

For the past decade, "AI for Science" has mostly tried to automate the hypothesis. It predicted protein structures (AlphaFold), churned out candidate materials (GNoME), and generated research ideas (AI Scientist). ERA (Empirical Research Assistance), which Google Research published in Nature in 2026, moved the target one notch over. What ERA rewrites is not the hypothesis but the analysis code itself—the code that tests that hypothesis against data. This piece looks at what that one-notch shift means for every team that works with data.

The method is tree search. An LLM proposes a variant of the analysis code, ERA actually runs it and scores it on a benchmark, keeps only the strong branches, and mutates again. Running this loop produced 40 single-cell RNA analysis methods that surpassed the human-built top entry on a public leaderboard, and on COVID-19 hospitalization forecasting it recorded lower error than the U.S. CDC's official ensemble. The point is not a flash of genius but that a repeatable search beat the best records experts had set.

But the moment the reward function is a benchmark score, one question lingers. If the benchmark's data is biased, isn't ERA simply finding the code that reproduces that bias most faithfully? In an era where analysis code is generated automatically, the center of gravity for trust shifts from the model to the data and the validation process. The definition of "good data" moves one notch over to "good analysis"—and that notch is exactly where Pebblous has been working.

ERA, by the numbers

Every figure below is explained in the body. Sources: the ERA paper (Nature 654, 2026; arXiv:2509.06503v3) and GNoME (Nature, 2023).

40 / 87

Methods beating the leaderboard top

Of 87 single-cell methods ERA tried, 40 exceeded the prior leaderboard top on overall score

+14%

Gain over the prior best

In batch integration, ERA's BBKNN variant scored +14% on overall score versus the prior top, ComBat

26 vs 29

COVID forecast WIS

Average hospitalization-forecast error (WIS, lower is better). ERA 26, CDC ensemble 29, but in a retrospective evaluation

736 / 2.2M

Prediction glut vs validation scarcity

Of the 2.2M crystal structures GNoME predicted, 736 were verified by independent experiment. Validation is always the bottleneck

1

The AI Became the Examiner — Rewriting the Analysis Code, Not the Hypothesis

Think of an exam. The "AI scientist" we're used to is closer to a student solving the problems. It predicts how a protein will fold (AlphaFold), pours out stable crystal-structure candidates (GNoME), and writes up new research ideas (AI Scientist). All of these produce answers—hypotheses and outputs. Where ERA operates is different. ERA is less like the student and more like an examiner who rewrites the grading scheme hundreds of times. It automatically rewrites the analysis code itself—the code that decides how data is read, cleaned, and compared.

The distinction looks minor but is in fact large. In science, the same data can yield different conclusions depending on which analysis pipeline you run it through. Dozens of methods compete for how to align single-cell RNA data without batch effects, or how to forecast hospitalization counts as a time series, and which ones you pick and how you string them together is precisely the analyst's expertise. ERA stepped into that very space of selection and assembly. Humans still pose the hypothesis, but the code that tests it against data is searched by the machine.

Genealogically, ERA stands on the path opened by Google DeepMind's FunSearch (2023) and AlphaEvolve. Both systems had an LLM propose code, then executed and evaluated it to search for mathematical functions and algorithms. ERA widened the same idea into an end-to-end analysis pipeline that handles experimental data. The object of the search expanded from "a better mathematical function" to "a better method of data analysis."

In one sentence: ERA rewrites analysis code, not hypotheses, through tree search. "AI that accelerates discovery" and "AI that invents analysis methods" resemble each other but stand in different places. And the moment a machine starts producing analysis methods, a new question opens: who decides whether those methods are right, and how?

▲ The AI-for-Science automation genealogy. AlphaFold, GNoME, and AI Scientist automate the hypothesis/discovery layer. ERA occupies a new position at the analysis-code layer. Pebblous original diagram (ERA Fig. 1 reinterpreted)

2

Rewriting Analysis Through Tree Search

ERA's operation boils down to a four-step loop. First, an LLM proposes a variant of the current analysis code. Rather than reviewing it in words, ERA runs the code on real data. Once results come back, it scores them on a fixed benchmark and selects only the high-scoring branches as the starting point for the next mutation. Good code gets explored more deeply, bad code gets pruned early—a search in the shape of a tree.
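The four-step loop can be sketched in a few lines. This is a minimal illustration, not ERA's implementation: `propose`, `run_and_score`, and `select` are hypothetical callables standing in for the LLM, the sandboxed benchmark run, and the branch-selection rule, and the toy functions at the bottom replace "analysis code" with a single number so the example runs on its own.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """One candidate analysis program plus the benchmark score it earned."""
    code: str
    score: float
    parent: Optional["Node"] = None


def tree_search(
    seed_code: str,
    propose: Callable[[str], str],          # hypothetical: LLM rewrites the code
    run_and_score: Callable[[str], float],  # hypothetical: execute, then benchmark
    select: Callable[[List[Node]], Node],   # hypothetical: choose a branch to expand
    budget: int,
) -> Node:
    """Run propose -> run -> score -> select for `budget` candidates."""
    root = Node(seed_code, run_and_score(seed_code))
    nodes: List[Node] = [root]
    for _ in range(budget):
        parent = select(nodes)             # select a branch worth extending
        child_code = propose(parent.code)  # propose a variant of its code
        score = run_and_score(child_code)  # run it and score it on the benchmark
        nodes.append(Node(child_code, score, parent))
    # Weak branches are never deleted here; they simply stop being selected.
    return max(nodes, key=lambda n: n.score)


# Toy stand-ins (assumptions for illustration only): the "code" is a number
# written as text, and the "benchmark" rewards values close to 3.
_rng = random.Random(0)


def toy_propose(code: str) -> str:
    return repr(float(code) + _rng.uniform(-1.0, 1.0))


def toy_score(code: str) -> float:
    return -(float(code) - 3.0) ** 2


def greedy_select(nodes: List[Node]) -> Node:
    return max(nodes, key=lambda n: n.score)


best = tree_search("0.0", toy_propose, toy_score, greedy_select, budget=200)
```

Because the only signal that reaches `select` is the number returned by `run_and_score`, whatever that number rewards is what the search converges on.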

The search strategy is a PUCT-based Flat UCB Tree Search. A relative of the Monte Carlo Tree Search (MCTS) used by Go and chess engines, it applies a balancing rule (c_puct = 1) that favors promising branches while still giving under-explored ones a chance. But instead of deep recursion, it picks the next node flatly across the whole tree, so cost scales almost linearly with the number of nodes—meaning resource use is easy to predict.
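As a rough sketch of what a flat PUCT-style choice looks like, the function below scores every node in one pass and returns the index of the one to expand next. The source gives only the rule's name and c_puct = 1; the uniform prior, the min-max normalization of scores, and the visit-count bookkeeping are assumptions made for this example.

```python
import math
from typing import Sequence


def puct_flat_select(
    scores: Sequence[float],
    visits: Sequence[int],
    c_puct: float = 1.0,
) -> int:
    """Pick the node to expand by scanning all nodes once (no recursion)."""
    if not scores or len(scores) != len(visits):
        raise ValueError("scores and visits must be non-empty and equal length")

    lo, hi = min(scores), max(scores)
    span = hi - lo
    prior = 1.0 / len(scores)            # assumption: uniform prior over nodes
    sqrt_total = math.sqrt(sum(visits))  # grows as the whole tree is explored

    best_index, best_value = 0, -math.inf
    for i, (score, n) in enumerate(zip(scores, visits)):
        # Exploitation: benchmark score rescaled to [0, 1] (assumed scaling).
        exploit = (score - lo) / span if span > 0 else 0.5
        # Exploration: shrinks as this node is expanded more often.
        explore = c_puct * prior * sqrt_total / (1 + n)
        value = exploit + explore
        if value > best_value:
            best_index, best_value = i, value
    return best_index


# With no visits yet, the best-scoring node wins.
first_pick = puct_flat_select([0.2, 0.9, 0.5], [0, 0, 0])
# After the leader has been expanded 50 times, an untried node gets its turn.
later_pick = puct_flat_select([0.2, 0.9, 0.5], [0, 50, 0])
```

The second call shows the balancing behavior described above: a heavily expanded leader gives way to a less-explored branch with a decent score.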

One common misconception is worth correcting here. Early coverage described ERA as searching "tens of thousands of times," but by the paper's own numbers the actual scale is on the order of 500–2,000 code candidates per task, with scores typically saturating (no longer rising) somewhere between 300 and 1,000 candidates. The accurate phrasing is not "tens of thousands" but "hundreds to thousands of code-candidate evaluations." You don't need to inflate the numbers; the results are impressive enough.

▲ ERA's core loop. Propose → run → score → select completes one cycle, and the next proposal starts from the highest-scoring branch. The paper's figure reinterpreted in the Pebblous color system.

Take the structure apart and ERA's strength and weakness turn out to come from the same place. The strength is that it doesn't evaluate code in words—it actually runs it and measures. A plausible explanation doesn't save a branch; the score does. The weakness sits in exactly the same spot. If the only criterion for keeping or killing a branch is the benchmark score, then wherever that score points is where ERA arrives. If the score rewards a bias in the data, ERA runs straight toward the bias. Section 4 confronts this problem head-on.

3

It Actually Beat the Single-Cell Leaderboard Top with 40 Methods, and the CDC Ensemble Too

The sharpest result came in single-cell analysis. The stage was the batch integration benchmark from OpenProblems (v2.0.0). When you pool data from roughly 1.75 million cells (1,747,937, to be exact) measured across different experiments and platforms, technical noise—batch effects—creeps in; the benchmark scores how well a method strips that noise out while preserving the real biological signal, across 13 metrics and 6 datasets. ERA tried 87 methods here, and 40 of them surpassed the human-built top method on overall score.

The "40 methods" figure needs to be read carefully. It does not mean all 40 stand alone in first place; it means that 40 of the 87 attempts cleared the level of the prior best record. The strongest variant, in the BBKNN family, raised the overall score about 14% above the prior top, ComBat. What's interesting is the strategy ERA favored. Of the 87 methods, 55 were built by recombining two existing methods, and 24 of those beat both of their parent methods. Rather than inventing new principles, ERA persistently recombined good pieces that already existed.

▲ ERA's recombination strategy. 55 of 87 attempts combined two existing methods; 24 of those beat both parent methods. Pebblous original diagram (ERA paper reinterpreted)

The second stage was COVID-19 hospitalization forecasting. The 14 strategies ERA generated surpassed the U.S. CDC's official ensemble (CovidHub). Across 52 jurisdictions and a 4-week forecast horizon, the full-season average WIS was 26 for ERA versus 29 for the CDC ensemble—about 10% lower error. WIS (Weighted Interval Score) is an error metric that reflects both forecast accuracy and the calibration of uncertainty, so lower is better.
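To make the metric concrete, here is a small sketch of WIS for a single forecast, using the commonly cited definition for a median plus K central prediction intervals. The interval levels and numbers in the usage lines are made up for illustration; the quantile set ERA and CovidHub actually scored is not given here.

```python
from typing import Dict, Tuple


def interval_score(y: float, lower: float, upper: float, alpha: float) -> float:
    """Score one central (1 - alpha) prediction interval against outcome y."""
    width = upper - lower                          # wide intervals cost more
    under = (2.0 / alpha) * max(lower - y, 0.0)    # penalty if y falls below
    over = (2.0 / alpha) * max(y - upper, 0.0)     # penalty if y falls above
    return width + under + over


def weighted_interval_score(
    y: float,
    median: float,
    intervals: Dict[float, Tuple[float, float]],
) -> float:
    """WIS for one forecast; `intervals` maps alpha to (lower, upper)."""
    total = 0.5 * abs(y - median)                  # median term, weight 1/2
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (len(intervals) + 0.5)


# Illustrative forecast: median 26, a 50% interval and a 90% interval.
forecast = {0.5: (20.0, 32.0), 0.1: (10.0, 45.0)}
wis_inside = weighted_interval_score(30.0, 26.0, forecast)   # outcome covered
wis_outside = weighted_interval_score(50.0, 26.0, forecast)  # outcome missed

# The "about 10% lower error" figure is a relative reduction: (29 - 26) / 29.
relative_reduction = (29 - 26) / 29
```

An outcome that lands outside the intervals is penalized in proportion to how far it misses, which is why WIS rewards calibrated uncertainty as well as an accurate median.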

Task	Metric	Human best	ERA	Meaning
Single-cell batch integration	Overall score (13 metrics)	ComBat (prior top)	BBKNN variant, +14%	Higher is better
Single-cell batch integration	Methods beating the top	—	40 of 87	Recombination was the workhorse
COVID-19 hospitalization forecast	Average WIS	CDC ensemble 29	26 (~10% lower)	Lower is better; retrospective

This COVID result, though, comes with a caveat worth stating honestly. It is a retrospective evaluation. The CDC's forecast hub stopped accepting new submissions on May 1, 2024, and ERA scored its results by analyzing the same period after the fact. ERA did not beat the CDC in a real-time race to predict the future; it found better code when re-analyzing a period whose outcomes were already known. And what beat the CDC was not a single model but 14 strategies—in a real deployment you would pick only one or two of them. On GIFT-Eval, which scores time-series and tabular forecasting, ERA also led on 17 of 19 integration-related tasks, but this report keeps its focus on the two cases of single-cell and COVID.

4

The Validation Paradox: What Does Code Optimized to Win a Benchmark Actually Validate?

ERA's reward function is the benchmark score. The design is powerful, but it carries one built-in risk. If the validation data has a bias baked in—a particular patient cohort, sequencing platform, or batch effect—then the "optimal code" becomes the code that reproduces that bias most faithfully. It learns the shadow of the data, not the analyst's intent. This is the validation paradox: the more you automate analysis to push the score up, the heavier the responsibility for what the score actually measures.

▲ The validation paradox — benchmark bias feedback loop. When validation data carries a bias, ERA finds the code that reproduces that bias most faithfully. Pebblous original diagram (ERA §validation discussion reinterpreted)

The concern is not abstract. Single-cell foundation models (scGPT, Geneformer) have been reported to lag behind simple classical methods on some zero-shot tasks, and one critique (SC-ARENA) flagged suspected data leakage where the CELLxGENE data used in pretraining overlapped with the evaluation set. Overfitting to a static leaderboard is also an old trap across machine learning. One survey found that only 63.5% of ML research was reproducible; when AI components are added, the non-reproducible share climbs to roughly 70%, and the annual cost of reproducibility failures is estimated at about $28 billion.
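The overlap concern can be checked mechanically when sample identifiers are available. The sketch below is a generic set-intersection check, not the procedure used in the SC-ARENA critique; the identifiers are hypothetical.

```python
from typing import Iterable, Set, Tuple


def overlap_report(
    pretrain_ids: Iterable[str],
    eval_ids: Iterable[str],
) -> Tuple[Set[str], float]:
    """Return the shared IDs and the share of the evaluation set they cover."""
    pretrain = set(pretrain_ids)
    evaluation = set(eval_ids)
    shared = pretrain & evaluation
    share = len(shared) / len(evaluation) if evaluation else 0.0
    return shared, share


# Hypothetical dataset identifiers, for illustration only.
shared, share = overlap_report(
    ["ds-001", "ds-002", "ds-003"],
    ["ds-003", "ds-004"],
)
```

A nonzero share does not prove the score is inflated, but it marks which evaluation samples cannot count as independent evidence.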

In truth, the asymmetry of "abundant prediction, scarce validation" is a structural problem that predates ERA. DeepMind's GNoME predicted 2.2 million new crystal structures, but only 736 were independently verified by outside experiment, or 0.033% of the predictions. AI multiplies candidates exponentially, while the capacity to validate them grows only linearly. ERA's automated analysis moves the same asymmetry up one layer, to the layer of "analysis code." Code candidates pour out by the thousands, but the work of independently verifying whether that code is right still falls to people and data.

Prediction (abundant)

2.2M

Crystal structures GNoME predicted. AI generates candidates almost without limit.

Validation (scarce)

736

Structures verified by independent experiment. Validation, bound to data, time, and cost, grows only linearly.

In the end the question moves from the model to the data. Topping a benchmark is not the starting line but the final checkpoint. Unless you first ask which data did the scoring, where it came from, and what it left out, the automatically generated "optimal code" can become the code that copies the bias most precisely. This is the substance of the claim that the definition of "good data" moves to "good analysis."

5

So Who Validates Data Quality?

In an environment where analysis code is generated automatically, the first thing a team needs to change is the order of validation. It's tempting to ask "where does this method rank on the benchmark" first, but that question should come last. What you should ask first is "what is the data behind this score, and where did it come from?" The four steps below give an order that data teams reviewing or adopting ERA-style automated analysis can follow.

① Data provenance transparency

Trace which cohort, platform, and period the validation data came from. If you don't know the source, you can't know what the score means.

② Bias mapping

Map the skews baked into the data—batch effects, sequencing platforms, cohort composition—in advance, so you can anticipate what the optimal code will copy.

③ Reproducibility check

Confirm that the same code delivers the same performance on independent data and holdouts not used in training or validation. A score on a single leaderboard is a hypothesis, not evidence.

④ Benchmark

Only after passing the three steps above do you look at the benchmark score. The score is not the conclusion but the final stamp of confirmation.
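The ordering above can be expressed as a simple gate that refuses to report the benchmark score until the first three checks pass. Everything in this sketch is an assumption made for illustration: the `Candidate` record, the required provenance fields, and the two thresholds are hypothetical and would need to be set per project.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed provenance fields, taken from step 1's cohort, platform, and period.
REQUIRED_PROVENANCE = ("cohort", "platform", "period")


@dataclass
class Candidate:
    """Hypothetical record for one auto-generated analysis method."""
    provenance: Dict[str, str]      # where the validation data came from
    batch_labels: List[str]         # batch label of each validation sample
    benchmark_score: float          # score on the public leaderboard
    holdout_score: Optional[float]  # score on independent data, if measured


def review(
    c: Candidate,
    max_batch_share: float = 0.5,   # illustrative threshold
    max_score_drop: float = 0.05,   # illustrative threshold
) -> str:
    # Step 1: data provenance transparency.
    missing = [k for k in REQUIRED_PROVENANCE if not c.provenance.get(k)]
    if missing:
        return f"stop at step 1: missing provenance {missing}"

    # Step 2: bias mapping, reduced here to one skew (a dominant batch).
    counts = Counter(c.batch_labels)
    if not counts:
        return "stop at step 2: no batch labels to map"
    top_batch, top_n = counts.most_common(1)[0]
    share = top_n / len(c.batch_labels)
    if share > max_batch_share:
        return f"stop at step 2: batch {top_batch!r} is {share:.0%} of the data"

    # Step 3: reproducibility on independent data.
    if c.holdout_score is None:
        return "stop at step 3: no independent holdout score"
    if c.benchmark_score - c.holdout_score > max_score_drop:
        return "stop at step 3: score does not hold up on the holdout"

    # Step 4: only now is the benchmark score worth reading.
    return f"step 4: benchmark score {c.benchmark_score:.3f} can be reported"


example = Candidate(
    provenance={"cohort": "A", "platform": "10x", "period": "2023"},
    batch_labels=["b1", "b2", "b3", "b1"],
    benchmark_score=0.80,
    holdout_score=0.78,
)
verdict = review(example)
```

A real bias map would cover more than batch share, but the point of the sketch is the order: the function cannot reach the benchmark score without passing through provenance, bias, and reproducibility first.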

Implications for Korea's Bio and Health Data

Korea is rapidly accumulating large-scale clinical data. The National Bio Big Data Project (NIBDCP) is a roughly $420M (KRW 606.5B) effort targeting data from 770,000 people, moving in step with a 2025 plan to produce 98,000 whole-genome sequences (WGS) and the petabyte-scale (3 PB) infrastructure of the K-Health Data Platform (KHDP). Yet single-cell-level public data and the analysts to handle it remain limited relative to the global frontier, and a sizable share of domestic scRNA-seq projects (an estimated 30–40%) remain insufficiently analyzed.

In this landscape, the first bottleneck when adopting ERA-style automated analysis is not the model. Batch-effect correction and provenance transparency come first. Systematically validating biobank data quality becomes a precondition for automated analysis; one example is the kind of checking in which Europe's DQ4HEALTH reported an error rate of about 0.74%. Even when a machine finds good analysis code, shoring up the data floor that code stands on is still a human job.

In the age of automated analysis, competitive advantage turns less on "a better model" than on "more trustworthy validation." A validation infrastructure that traces data provenance, maps bias, and guarantees reproducibility rises to become the precondition for automating analysis. The premise Pebblous has worked on through DataClinic and AI-Ready Data, that "data determines the trust we can place in analysis," finds one more confirmation in ERA, an external case.

For the same current viewed from other angles, see "How AI Is Reshaping Scientific Discovery" on AI that accelerates discovery overall, "The Age of AI Writing Science Papers" on the reality and hype of hypothesis-generating AI, and "NVIDIA's Virtual Cell Challenge" on single-cell models and data curation. This piece focuses on one point among them: data quality at the validation stage.

R

References

Primary sources — ERA

1. Google Research. (2026). "An AI system to help scientists write expert-level empirical software." Nature 654, 909–916 (2026-05-19). nature.com/articles/s41586-026-10658-6
2. "Empirical Research Assistance (ERA)." arXiv preprint. arXiv:2509.06503v3 (key source for mechanism and quantitative detail). arxiv.org/abs/2509.06503
3. Google Research Blog. (2026). "Empirical Research Assistance (ERA): From Nature publication to catalyzing computational discovery." research.google
4. google-research/era. Official GitHub code repository. github.com/google-research/era

Benchmarks and evaluation standards

5. OpenProblems. "Batch Integration Benchmark (v2.0.0)." openproblems.bio
6. Luecken, M. D. et al. (2025). "Defining and benchmarking open problems in single-cell analysis." Nature Biotechnology. nature.com/articles/s41587-025-02694-w
7. CDC COVID-19 Forecast Hub / CovidHub Ensemble (WIS methodology). covid19forecasthub.org

Lineage and reproducibility

8. Merchant, A. et al. (2023). "Scaling deep learning for materials discovery (GNoME)." Nature. deepmind.google
9. Semmelrock, H. et al. (2025). "Reproducibility in Machine Learning-based research." AI Magazine. onlinelibrary.wiley.com

Korea and Pebblous-adjacent

10. "Korea's Bio Big Data Project: Governance and Data Utilization (NIBDCP)." Healthcare Informatics Research 31(3), 226 (2025). e-hir.org
11. Pebblous. "Korea Research Data Act 2026 — A Data Governance Report." blog.pebblous.ai