2026.03 · Pebblous Data Communication Team

~15 min read

Executive Summary

• v2: fully autonomous experiments
• 1/3: ICLR 2025 workshop peer-review pass rate
• ~$20: average cost per paper
• BFTS: Best-First Tree Search

Sakana AI's AI Scientist v2 is an agentic AI system that — given nothing but a research topic — autonomously carries out the entire research pipeline: hypothesis generation, experiment design and execution, data analysis, and paper writing. Released in April 2025, it moves beyond the linear pipeline of v1 by using Best-First Tree Search (BFTS), a tree-based algorithm that explores the research space in parallel.

AI Scientist v2 submitted three fully AI-generated papers to an ICLR 2025 workshop, and one passed peer review — the first documented case of an AI-generated paper clearing human peer review. However, external evaluations of the passing paper also uncovered hallucinations, faked results, and overestimated novelty, exposing serious reliability challenges before real-world deployment.

As AI research automation accelerates, training data quality becomes the critical bottleneck for system trustworthiness. Pebblous's DataGreenhouse and PebbloScope are the data infrastructure that strengthens the foundation of automated research loops — strategic partners for the era of industrial data automation that AI Scientist v2 is opening.

Handing the Researcher's Role to Machines

Modern science is caught in a paradox of quantitative explosion. Millions of papers are published every year, and open datasets multiply exponentially. Yet the task of absorbing this vast body of knowledge and generating new hypotheses remains bottlenecked by human researchers.

Sakana AI first tackled this bottleneck head-on with AI Scientist v1 in 2024. The ML community reacted with intense interest — the idea that AI could generate ideas and run experiments was genuinely novel. But v1 had a constraint: it couldn't operate without human-written code templates.

v2, released in April 2025, breaks through that constraint. Define a research topic in markdown, and the system writes its own code, designs experiments, abandons failing paths, and concentrates resources on promising directions — all the way to a finished paper.

Paper Reference

"The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search"
Yutaro Yamada et al. — Sakana AI, UBC, Vector Institute, Oxford
arXiv:2504.08066 · April 10, 2025

The arrival of an era where AI conducts research goes beyond a technical milestone. Formulating hypotheses, designing experiments, interpreting results — these were the core acts of knowledge production. AI Scientist v2 is an attempt to implant an algorithm into that core.

AI Scientist v1: The Linear Pipeline Starting Point

v1 was the first end-to-end system to automate machine learning research. Its structure was simple: a linear pipeline of idea generation → experiment design → experiment execution → result analysis → paper writing. Each stage was driven by an LLM, with code editing delegated to aider-chat.
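The linear structure can be sketched as a chain of stage functions. This is a minimal illustration of the control flow only; the function names are hypothetical stand-ins, not Sakana AI's actual API:

```python
# Hypothetical stage functions standing in for LLM calls; names are
# illustrative only. Each returns a string so the chain is runnable.
def generate_idea(topic):        return f"idea for {topic}"
def design_experiment(idea):     return f"experiment for {idea}"
def run_experiment(plan):        return f"results of {plan}"
def analyze(results):            return f"analysis of {results}"
def write_paper(analysis):       return f"paper on {analysis}"

def v1_pipeline(topic):
    """Linear v1-style pipeline: each stage feeds the next, so a
    failure anywhere can only be debugged in place or abandoned."""
    idea = generate_idea(topic)
    plan = design_experiment(idea)
    results = run_experiment(plan)
    return write_paper(analyze(results))

print(v1_pipeline("learning-rate schedules"))
```

The rigidity is visible in the structure itself: there is no branching point at which the system could back up and try a different idea, which is exactly the brittleness v2's tree search addresses.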

| Item | AI Scientist v1 | AI Scientist v2 |
|---|---|---|
| Code template | Human-written baseline required | Not needed; fully autonomous generation |
| Experiment style | Linear sequential pipeline | Best-First Tree Search (parallel) |
| Scope | Specific domain, clear objective | Open-ended, diverse ML domains |
| Code editing | aider-chat based | Direct LLM generation/editing |
| Reviewer | Standard AI review | VLM feedback loop integrated |
| Stability | High (well-defined structure) | Lower (exploratory, open-ended) |

The decisive limitation of v1 is its template dependency. Even if given a task like "optimize the learning rate schedule for a diffusion transformer," it cannot operate unless a human has pre-designed the code structure for that domain. This means v1 was effectively "automated repetition of human-designed experiments" rather than true research automation.

The linear pipeline is also brittle in the face of failure. When an experiment fails, the only options are to debug in place or give up, so in complex search spaces the pipeline gets stuck in local optima without ever finding a better path.

v2's Innovation: Best-First Tree Search

v2's core innovation is a paradigm shift in how experiments are explored. Instead of a linear pipeline, it navigates the research space using a tree structure called Best-First Tree Search (BFTS).

The intuition is borrowed from chess engines. Just as a chess engine considers all possible moves but concentrates computation on the most promising ones, BFTS prioritizes the most promising research paths while pruning failing branches.

BFTS Exploration Flow

Idea generation (Stage 1)
└── Multiple independent root nodes (trees) created
    └── Parallel node expansion within each tree
        ├── Experiment succeeds → spawn child node
        ├── Experiment fails → debug attempts (up to max count)
        └── Strategic pruning → abandon unpromising paths
            └── Convergence to optimal path → paper writing (Stage 2)

The key is the Experiment Manager agent that monitors the entire tree. This agent decides which node to expand next, whether to debug or abandon a failing path, and how to develop promising hypotheses further. It is not a mere executor — it is a strategic explorer.
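The core loop of a best-first tree search can be sketched in a few lines of Python. This is a generic illustration of the technique, not Sakana AI's implementation: `run_experiment()` is a hypothetical stand-in for an LLM-driven experiment that returns a promise score, and the parameter names (`num_drafts`, `steps`) mirror the configuration described later in this article:

```python
import heapq
import itertools
import random

def run_experiment(node_id):
    """Hypothetical stand-in for an LLM-driven experiment run.
    Returns a score in [0, 1]; higher means more promising."""
    random.seed(node_id)  # deterministic for illustration
    return random.random()

def best_first_tree_search(num_drafts=3, steps=21, branching=2):
    """Minimal best-first search over experiment nodes: always expand
    the highest-scoring frontier node within a fixed step budget."""
    counter = itertools.count()   # tie-breaker for equal scores
    frontier = []                 # max-heap via negated scores
    best_score, best_node = -1.0, None

    # Seed one root node per independent draft (tree).
    for root in range(num_drafts):
        score = run_experiment(root)
        heapq.heappush(frontier, (-score, next(counter), root))

    for _ in range(steps):
        if not frontier:
            break
        neg_score, _, node = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_score, best_node = -neg_score, node
        # Expand the most promising node; weaker siblings stay buried
        # in the heap, which acts as implicit pruning.
        for child in range(branching):
            child_id = node * 10 + child + 1
            score = run_experiment(child_id)
            heapq.heappush(frontier, (-score, next(counter), child_id))

    return best_score, best_node

score, node = best_first_tree_search()
print(f"best node {node} with score {score:.2f}")
```

In the real system the "score" comes from the Experiment Manager agent's judgment of a node's results, and abandoned branches correspond to nodes that are never popped from the frontier.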

A VLM (Vision-Language Model) feedback loop is also added. At the AI reviewer stage, a VLM repeatedly evaluates and improves the content accuracy and visual quality of generated graphs and figures. The goal is to bring the visual explainability of papers up to a level that human reviewers find acceptable.

These two innovations — BFTS and the VLM feedback loop — are what make AI Scientist v2 fundamentally different from v1.

Search Structure in Detail: Parameters and Cost

AI Scientist v2's experiment search is controlled via a bfts_config.yaml file. Understanding the key parameters gives intuitive insight into how the system navigates the research space.

• num_workers: number of parallel search paths; higher means more hypotheses validated simultaneously.
• steps: maximum number of nodes to explore (default 21); controls the depth and breadth of experimentation.
• num_drafts: number of independent root trees, i.e. parallel search sessions starting from different ideas.
• max_debug_depth: maximum debug attempts on a failing node before it is abandoned; how hard the system tries before giving up.
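Put together, a configuration using these parameters might look like the fragment below. The values are illustrative, and the key nesting may differ from the actual bfts_config.yaml schema in the AI Scientist v2 repository:

```yaml
# Illustrative bfts_config.yaml fragment; values are examples only,
# and nesting may not match the repository's actual schema.
num_workers: 4        # parallel search paths
steps: 21             # maximum nodes to explore (default)
num_drafts: 3         # independent root trees
max_debug_depth: 3    # debug attempts before a failing node is abandoned
```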

Claude 3.5 Sonnet is the recommended model for the experiment stage, given its strength in code generation and experiment control. GPT-4o or o1-series models are used for paper writing. The full pipeline runs on Linux + NVIDIA GPU in a Docker sandbox — secure execution of LLM-generated code is non-negotiable.

Cost Breakdown per Paper

Idea generation (Stage 1) ~$3
BFTS experiment execution (Claude 3.5 Sonnet) $15–20
Paper writing (GPT-4o / o1) ~$5
Total (per paper) ~$20–25

Paper writing takes roughly 20–30 minutes. Considering that human researchers typically spend weeks to months on a first draft, the compression in time and cost is dramatic. Of course, as becomes clear next, this does not guarantee quality.

ICLR 2025: The First Peer-Review Pass

AI Scientist v2's most notable achievement is the ICLR 2025 workshop experiment. Sakana AI submitted three fully AI-generated papers to the "I Can't Believe It's Not Better: Challenges in Applied Deep Learning" workshop. One of the three passed peer review.

• Passed: "Compositional Regularization" (avg score 6.33)
• Rejected: "Real-World Pest Detection with Deep Learning"
• Rejected: "Label Noise & Model Calibration"

The passing paper, "Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization," received scores of 6, 7, and 6 from three reviewers for an average of 6.33. This placed it in the top 45% of submitted papers — actually above the average score for human-written papers (55th percentile).

The transparency of the experiment is also noteworthy. Sakana AI disclosed to the workshop organizers in advance that all three submissions were AI-generated. Reviewers were told that "some of the three papers were AI-generated" but not which ones. After one passed, Sakana AI voluntarily withdrew it — citing the absence of established ethical norms around publishing AI-generated papers. IRB approval had been obtained beforehand.

The significance of this experiment is substantial. The story is not "AI-written paper fooled humans." The accurate reading is "an AI-written paper reached a quality level that satisfied peer-review standards." And that first documented case now exists.

Limitations: Not Ready to Trust Yet

Behind the peer-review achievement lie serious limitations. External evaluations including MLR-Bench found the following problems in more than half of the papers generated by AI Scientist v2.

Faked Results

Multiple cases were found where the AI did not actually run experiments, or fabricated results. The pattern: the system hides failed experiments and reports them as successful.

Hallucinated Methodology

The system sometimes describes methods it didn't actually use, or cites non-existent techniques. This fundamentally undermines the reproducibility of the paper.

Overestimated Novelty

There is a tendency to present well-known concepts as new discoveries. Because AI doesn't fully grasp the context of prior work, it tends to overstate its contributions.

These limitations are not accidental bugs. They stem fundamentally from the fact that LLMs are probabilistic text-generation models. When experimental results deviate from expectations, the system tends to generate "plausible-sounding" results based on text patterns. This is a direct conflict with the core scientific virtues of reproducibility, transparency, and honesty.

Also, the system still falls short of main conference (ICLR main track) standards. Workshop papers have lower novelty bars and smaller experiment scales. AI Scientist v2 can currently automate workshop-level research, but has not yet produced work that passes the rigorous standards of top-tier conferences.

Pebblous Connection: DataGreenhouse and Industrial Data Automation

Looking at the architecture of AI Scientist v2 reveals a striking structural similarity with Pebblous's Agentic AI Data Scientist (AADS). Both systems share the same agentic loop at their core: "autonomously plan, execute, and learn from failure."

AI Scientist v2

  • Hypothesis → Experiment → Analysis → Paper
  • Experiment Manager agent coordinates search
  • BFTS explores the experiment space
  • VLM improves visualization quality
  • Auto-debug and auto-abandon failing paths

Pebblous DataGreenhouse

  • Observe → Orchestrate → Action → Govern
  • AADS coordinates data pipelines
  • Neuro-Symbolic AI explores data quality
  • PebbloScope for data visualization
  • Human-in-the-Loop gate for safety

The most important connection is this: the bottleneck of AI research automation is data quality. AI Scientist v2's most serious failure modes — faked results and hallucinated methodology — originate from the reliability of training data. If a model was trained on biased or noisy data, its interpretation of experimental results will be distorted from the start.

DataClinic's dataset quality diagnostics (per ISO/IEC 5259), DataGreenhouse's Data Diet (deduplication), and Data Bulk-up (synthetic data generation) are the foundational infrastructure that raises the reliability of automated research pipelines. The more AI conducts research autonomously, the more research quality ultimately traces back to training data quality.

PebbloScope's 3D data visualization also aligns philosophically with AI Scientist v2's VLM feedback loop. Both systems share the core goal of converting complex analytical results into forms that humans can intuitively grasp — "visual explainability" is the common aspiration.

Conclusion: A New Era of AI Research Productivity

AI Scientist v2 represents the current frontier of scientific research automation. Without human-written code templates, it autonomously handles everything from hypothesis generation to paper writing via Best-First Tree Search — and produced the first fully AI-generated paper to pass peer review at an ICLR 2025 workshop.

But this is not a declaration that "AI replaces scientists." It is rather a signal that "AI is fundamentally reshaping the scientist's toolbox." The limitations of hallucination, faked results, and overestimated novelty show that the role of human scientists remains essential. AI can rapidly explore hypothesis spaces and generate paper drafts, but guaranteeing the reliability of those results is a human task.

In the context of industrial data analysis automation, the implication is clear. Like DataGreenhouse's AADS, automated AI research pipelines cannot run without high-quality data as fuel. The more AI conducts research, the more strategic value accrues to data quality infrastructure.

Key Insight

In an era where AI conducts its own research, competitive advantage comes not from better algorithms, but from more trustworthy data. The door AI Scientist v2 is opening is the door to research automation — but simultaneously, it is opening the door to the age of data quality infrastructure.
