Executive Summary

Researchers took 68 data-analysis tasks drawn from real Kaggle notebooks and handed them to five frontier models. The tasks were long: about 33 turns each on average, 2,225 turns in total. The results were sobering. The best model averaged 48.45% accuracy, never crossing half. Models that handled a single short question just fine began to quietly get things wrong as the analysis stretched on. This piece looks, through a data team's eyes, at where and why that collapse begins.

The most painful number is the gap between early and late turns. Within a single task, accuracy on later turns fell nearly 47pp below the early ones. And it wasn't because the models forgot. It was because they failed to hold, revise, and restore the "analytical state" (the definitions, filters, and assumptions set earlier) in time, so that one wrong intermediate result contaminated every calculation that followed.

More steps weren't the answer. The model that used the most steps actually scored lower than the top model. The bottleneck wasn't how many times it tried, but how correctly it held the analytical state. If your team has already handed data work to an agent, this is the moment to ask whether that work is drifting wrong in ways too subtle to catch the longer it runs.

Key figures

Source: LongDS-Bench (arXiv:2605.30434)

The four numbers below capture both the size and the cause of the collapse. The first two show how large it is; the last two point to why it's a problem of state management, not memory.

48.45% | Best-model accuracy | Even top-ranked Gemini-3.1-Pro stayed below half
47pp | Late-turn drop | Accuracy gap from early to late turns in the same task
52–69% | Long-horizon errors | Share of all failures that arise only in long tasks (Cascade, State Management, Context Memory)
11.3 turns | Avg. state-dependency distance | A request relied on state from 11.3 turns earlier on average

1. 68 Tasks That Look Like Real Analysis

Released in May 2026 by a joint team from Zhejiang University and Ant Group, LongDS-Bench has a different grain than the data-analysis benchmarks that came before it. Most existing benchmarks pose a single, self-contained question and score the answer. But real data analysis doesn't flow that way. You define a cohort, apply a filter, build an intermediate metric, quietly change that definition a few days later, then roll back to a version you froze yesterday to answer a question, all tangled across many turns.

To reproduce that flow, the researchers drew 68 tasks from real Kaggle notebooks and rebuilt them as multi-turn conversations. They span six domains (earth science, business, education, sports, social good, and community), averaging 33 turns per task, 2,225 turns in all. What matters is that these turns aren't independent. A single request depended on state from 11.3 turns earlier on average, and a single turn had to reference 2.9 prior states at once. Miss what was defined earlier, and how, and everything downstream shakes loose.
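The two dependency figures can be made concrete with a small sketch. It assumes a hypothetical `Turn` record in which each turn lists the earlier turns whose state it reuses, and it takes "dependency distance" to mean the gap between a turn and each state it references. The benchmark's own schema and exact averaging may differ.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Turn:
    """One conversational turn (hypothetical schema, not the benchmark's)."""
    index: int                                            # 1-based position in the task
    depends_on: list[int] = field(default_factory=list)   # earlier turns whose state is reused


def dependency_stats(turns: list[Turn]) -> tuple[float, float]:
    """Return (mean dependency distance, mean number of referenced states).

    Only turns that reference at least one earlier state are counted.
    """
    dependent = [t for t in turns if t.depends_on]
    distances = [t.index - d for t in dependent for d in t.depends_on]
    states_per_turn = [len(t.depends_on) for t in dependent]
    return mean(distances), mean(states_per_turn)


# Toy task: turn 12 reuses the cohort from turn 1 and the filter from turn 5.
toy = [
    Turn(1),
    Turn(5, depends_on=[1]),
    Turn(12, depends_on=[1, 5]),
]
avg_distance, avg_states = dependency_stats(toy)
```

Running the same calculation over your own agent logs gives a rough sense of how far back each request reaches, and therefore how much state the agent is being asked to carry.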

▲ Figure: analytical state dependency chain, showing how state builds and loops across turns T1 to T33 (Initial, Update, Rollback). Average dependency 11.3 turns, 2.9 states per turn, 33 turns per task on average | Pebblous original diagram (LongDS-Bench reinterpretation)

The researchers call the thing that holds this chained work together "analytical state." It isn't just the numbers produced earlier. It's the entire context of which scope, definitions, and assumptions a calculation was run under. When conditions like "the LTV of new customers last quarter, North America only, excluding repeat buyers" pile up turn after turn, the agent has to keep tracking which conditions are currently in force. The tasks capture five patterns for handling this state.

State-evolution pattern | What the agent must do
Initial | Establish a reusable analytical object (cohort, metric, definition) for the first time
Update | Revise a prior definition or filter, and make the revision the new default
Counterfactual | Introduce a temporary assumption that applies only to this turn
Rollback | Return to an earlier frozen analytical version to answer
Composition | Explicitly combine several state operations

A single task held, on average, 5.8 Rollback turns and 8.6 Composition turns. In other words, going back and stacking things on top of each other is the default, not the exception. The context a human analyst leaves naturally in notes and code cells, the agent had to hold on its own every turn.
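To make the five patterns tangible, here is a minimal sketch of a versioned state store. The class `AnalysisState` and its method names are hypothetical, invented for this illustration, and are not part of LongDS-Bench. It also assumes that Rollback reads an earlier snapshot without changing the current default, which is one reasonable reading of the pattern.

```python
from copy import deepcopy


class AnalysisState:
    """Versioned store for definitions, filters, and assumptions (illustrative)."""

    def __init__(self):
        self._versions = []   # frozen snapshots, oldest first
        self._current = {}    # definitions currently in force

    def _commit(self):
        self._versions.append(deepcopy(self._current))
        return len(self._versions) - 1   # version id of the new snapshot

    def initial(self, name, value):
        """Initial: establish a reusable analytical object for the first time."""
        if name in self._current:
            raise KeyError(f"{name!r} already defined; use update()")
        self._current[name] = value
        return self._commit()

    def update(self, name, value):
        """Update: revise a definition and make it the new default."""
        if name not in self._current:
            raise KeyError(f"{name!r} not defined; use initial()")
        self._current[name] = value
        return self._commit()

    def counterfactual(self, **overrides):
        """Counterfactual: a temporary view for one turn; nothing is committed."""
        return {**deepcopy(self._current), **overrides}

    def rollback(self, version):
        """Rollback: read an earlier frozen version; the default is unchanged."""
        return deepcopy(self._versions[version])

    def current(self):
        return deepcopy(self._current)


state = AnalysisState()
v0 = state.initial("region", "North America")
v1 = state.initial("exclude_repeat_buyers", False)
v2 = state.update("exclude_repeat_buyers", True)

what_if = state.counterfactual(region="Europe")         # this turn only
frozen = state.rollback(v1)                             # answer under the earlier version
composed = {**state.rollback(v1), "region": "Europe"}   # Composition: rollback + counterfactual
```

The point of the sketch is that every pattern becomes an explicit operation on data held outside the model, so "which conditions are in force right now" is something you can inspect rather than infer.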

Why this benchmark matters: LongDS-Bench doesn't ask "can AI do data analysis?" It asks "can it hold state across a long, continuing analysis?" Because that is exactly the shape of the work we actually hand to agents.

2. Even the Best Model Fell Short of Half

When five frontier models were run on the same 68 tasks, even first place stalled just short of the halfway mark. Gemini-3.1-Pro led at 48.45% on average, followed by GPT-5.4, Claude-4.6-Sonnet, Kimi-K2.6, and DeepSeek-V4-Pro. For reference, the benchmark's scoring agreed with human verification 93.11% of the time (Cohen's κ = 0.86), so the low scores are unlikely to be a grading artifact.
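For readers less familiar with Cohen's κ, the two agreement figures can be related with the standard definition. The worked step below assumes the 93.11% is the raw observed agreement and that κ was computed on the same judgments. The chance-agreement value it yields is implied by those two rounded numbers, not reported in the source.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
\quad\Longrightarrow\quad
p_e = \frac{p_o - \kappa}{1 - \kappa}
    = \frac{0.9311 - 0.86}{1 - 0.86}
    = \frac{0.0711}{0.14}
    \approx 0.51
```

Here $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. Under these assumptions, roughly half of the raw agreement could arise by chance, which is why κ is the more informative of the two figures.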

Model | Average accuracy
Gemini-3.1-Pro | 48.45%
GPT-5.4 | 43.50%
Claude-4.6-Sonnet | 41.56%
Kimi-K2.6 | 39.72%
DeepSeek-V4-Pro | 31.97%

Scarier than the averages is the change within a task. Models that got things roughly right on the early turns went sharply wrong as the same task moved toward its later turns. The accuracy gap between early and late reached nearly 47pp. The very model someone might have judged "good enough" after watching only the opening was quietly collapsing at the end.

▲ Conceptual diagram: accuracy within a single task falls from about 70% on early turns to about 23% on late turns, a 47pp drop | Pebblous original diagram (LongDS-Bench reinterpretation)

This is precisely where things get dangerous in practice. Data-analysis results usually come from the output of the last few turns: the final aggregation and conclusion. Yet the point where accuracy collapses is exactly that late stage. A pattern where the early exploration looks plausible but the final answer is wrong is the hardest kind of failure for a reviewer to catch.
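A team can measure the same early-versus-late gap on its own agent logs. The sketch below assumes a per-turn list of correct/incorrect flags and compares the first third of turns with the last third. That split is an assumption made here for illustration, and the benchmark may define "early" and "late" differently.

```python
def early_late_gap(correct: list[bool], parts: int = 3) -> float:
    """Accuracy on the first 1/parts of turns minus the last 1/parts, in pp."""
    if not correct:
        raise ValueError("need at least one scored turn")
    k = max(1, len(correct) // parts)   # number of turns in each slice

    def accuracy(flags: list[bool]) -> float:
        return 100.0 * sum(flags) / len(flags)

    return accuracy(correct[:k]) - accuracy(correct[-k:])


# Hypothetical 9-turn log: strong start, weak finish.
log = [True, True, True, True, False, True, False, False, True]
gap = early_late_gap(log)   # positive means late turns are worse
```

A single overall accuracy for this log would hide the pattern. Reporting the gap alongside the average is what exposes a late-stage collapse.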

The core finding: The problem isn't "AI can't do data analysis." Over short spans, it does fine. The real signal is that accuracy collapses the longer it runs, especially in the late stage where conclusions are drawn.

3. What Breaks Is State, Not Memory

The researchers dissected the failures and split the causes broadly into two camps. Traditional errors (coding, planning, domain reasoning) accounted for 31–48%, and the remaining 52–69% were "long-horizon errors" that arise only in long tasks. In other words, more than half of the failures happened not because the model lacked intelligence, but because it couldn't handle the state that carries forward. These long-horizon errors split again into three kinds.

Error type | What fails | Scale
Cascade Error | A wrong intermediate state propagates through all downstream calculations | Largest
State Management Error | Fails to select, update, and restore the correct state | Middle
Context Memory Error | Fails simply to recall earlier information | Smallest

▲ All failure causes and long-horizon error breakdown: traditional errors about 40%, long-horizon errors about 60% (proportions estimated from the midpoint of the 52–69% range); within long-horizon errors, Cascade is largest, State Management is in the middle, and Context Memory is smallest | Source: LongDS-Bench (arXiv:2605.30434) | Pebblous original diagram (LongDS-Bench reinterpretation)

Here an interesting paradox surfaces. What we usually worry about is that "AI forgets what came earlier." Yet that simple memory failure (Context Memory) was the smallest category. Pulling information back up was relatively easy. Where things truly broke was in applying the information correctly: judging which state is currently valid (State Management) and passing it correctly into the next calculation (Cascade).

Why the largest category, Cascade Error, is so dangerous is right there in the name. Get one definition wrong early, and every filter, aggregation, and conclusion stacked on top of it is contaminated. That's why late-stage accuracy sagged by 47pp. Not because the later parts were inherently harder, but because a state that went wrong early flowed downstream and ate away at the answer. A single subtle state drift quietly colors every turn that follows.
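The reach of a single bad definition can be traced mechanically once dependencies are recorded. The sketch below assumes a hypothetical mapping from each turn to the earlier turns whose state it reuses, and walks that graph to list every turn that inherits a wrong result. It illustrates the cascade idea and is not the paper's error-attribution method.

```python
from collections import deque


def contaminated_turns(depends_on: dict[int, list[int]], bad_turn: int) -> set[int]:
    """Turns that directly or transitively reuse state from `bad_turn`."""
    # Invert the mapping: for each turn, which later turns consume its state?
    consumers: dict[int, list[int]] = {}
    for turn, sources in depends_on.items():
        for src in sources:
            consumers.setdefault(src, []).append(turn)

    # Breadth-first walk downstream from the bad turn.
    seen: set[int] = set()
    queue = deque([bad_turn])
    while queue:
        for nxt in consumers.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical task: turn 2 defines a cohort that later turns build on.
deps = {2: [], 3: [2], 4: [], 5: [3], 6: [4], 7: [5, 6]}
tainted = contaminated_turns(deps, bad_turn=2)
```

In this toy task, an error at turn 2 reaches the final turn even though turn 7 never references turn 2 directly, which is the shape of failure a reviewer is least likely to spot.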

A shift in perspective: The diagnosis that "AI gets it wrong because it can't remember" is only half right. In data analysis, the real bottleneck isn't retrieval. It's state. What counts is which definition is valid right now, and how it's passed into the next calculation.

4. More Steps Won't Buy Accuracy

It's easy to think intuitively like this: if the later stage is hard, just give the agent room to think more and try more. LongDS-Bench's results rebut that expectation head-on. The researchers state flatly that "additional steps do not necessarily improve performance."

The clearest evidence is inside the leaderboard. Among the models tested, the one that used the most steps was Claude-4.6-Sonnet. Yet its score came in below Gemini-3.1-Pro, which moved far less. The model that tried harder and more often lost. When the number of attempts doesn't rise alongside accuracy, it's a signal that the bottleneck lies somewhere other than the amount of effort.

▲ The steps paradox: Claude-4.6-Sonnet used the most steps but ranked 3rd (41.56%), while Gemini-3.1-Pro used relatively fewer steps and ranked 1st (48.45%) | Source: LongDS-Bench (arXiv:2605.30434) | Pebblous original diagram (LongDS-Bench reinterpretation)

The researchers' reset experiment points to that "somewhere else." Agents tended to take fewer exploration steps as they moved into the later stage. And since a state that went wrong early had already contaminated the whole late stage, granting more steps just leaves the agent spinning in place on a broken foundation. That's why the answer is to build a point where state gets corrected, rather than to inflate the budget.

In one line: The accuracy bottleneck isn't "how many times it tries," it's "how correctly it holds analytical state." A bigger budget and more steps become waste when they sit on top of a wrong state.

5. A Third Warning for Data Teams

This blog has covered the failure of long-horizon agents twice before. Once it was an infrastructure problem: a weak harness kills the agent. Another time it was a memory problem, where the secret of the agent that lasted a year wasn't cleverness but a notepad. LongDS-Bench is the third scene in that arc. This time, in the concrete task of data analysis, it nails down across 68 tasks and 2,225 turns that the cause of failure is neither too few steps nor too little memory, but the holding of state.

Read from a data point of view, the conclusion is clear. The analytical state an agent must hold (definitions and scope, assumptions and intermediate results) is ultimately data that has to be managed explicitly, outside the model. Unless that state is structured so you can tell which definition each result was computed under, when it was last updated, and which version is valid now, even the smartest model loses its way in the late stage. In the age of autonomy, the bottleneck isn't greater reasoning power. It's how well you've designed the state the agent has to hold.

So the question shifts to this. In the multi-step data work your team has handed to an agent, are you missing contamination in the later parts just because the earlier parts look plausible? There are three things to check. Is there a structure that explicitly stores and re-references intermediate definitions and assumptions? Have you tested Rollback, returning to a previously frozen version? And are you separately verifying the later turns, in case accuracy is only good early on? Before switching to a bigger model, put the state you need to hold in order first.

▲ Agent analytical-state checklist, 3 verification points: explicit state storage, rollback test, and late-turn verification. Warning: early success masks late-turn contamination, the most common failure mode | Original Pebblous diagram (LongDS-Bench reinterpreted)

To close: In long data analysis, it's not that AI can't do it. It quietly gets more wrong the longer it goes. What breaks that quiet isn't a bigger budget, but a design that holds analytical state as explicit data. The habit of looking separately at the later stage is where that starts.

References

1. Xu, K., Lu, X., Qiao, S., Ding, Z., Xu, H., Liang, L., & Zhang, N. (2026). "LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis." arXiv:2605.30434. A multi-turn data-analysis benchmark of 68 tasks and 2,225 turns built from real Kaggle notebooks. The top model, Gemini-3.1-Pro, averaged 48.45%, with late-turn accuracy dropping about 47pp below early turns. Of all failures, 52–69% were long-horizon errors (Cascade > State Management > Context Memory), and additional steps were found not to guarantee performance.