Executive Summary

The center of gravity in the LLM race is shifting from "one bigger model" to "a trained coordinator that directs several models." The clearest example is Fugu Ultra, which Sakana AI moved to general availability in June 2026. Fugu is not a new foundation model. It is a model trained to call other LLMs: behind a single endpoint it breaks a task apart, decides which sub-model should handle what, then verifies and synthesizes the results. What sets it apart from if-else rules or embedding-similarity prompt routing is that the routing policy itself is learned from a reward signal. This report looks at what that shift really is, and why it is, at bottom, a data problem.

Sakana announced that Fugu Ultra led or tied the field on 8 of 10 benchmarks against Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. Read closely, the claim is thinner than it sounds. The outright wins number seven, and the two clear losses both land on long-context tasks, where orchestration adds coordination noise rather than removing it. Every baseline score is vendor-reported, and the open-versus-closed makeup of the model pool is left undisclosed, so the interesting question is less the leaderboard than its auditability.

Fugu's marketing wrapper is "AI sovereignty": the claim that it stays competitive without the frontier models that export controls keep out of its pool. But a coordinator can only route well if its picture of which model is good at what is accurate, and that picture is only as good as the benchmark and evaluation data behind it. The ceiling on orchestration performance is the quality of model-evaluation data. Seen from Pebblous's vantage, that is the natural extension of "data over models": now the model itself has to be diagnosed, selected, and managed like data.

The four numbers below mark the corners of that argument: Fugu's reported benchmark standing, the cost environment that makes routing urgent, the savings a learned router has already reported, and the measured edge a good combination wins over any single model.

  • 8 / 10: Benchmarks led or tied. Sakana's claim; seven are outright wins (vendor-reported).
  • ~10× / yr: Inference cost decline. Equal-quality $/token, a16z "LLMflation" (estimate).
  • 75%: Routing cost savings. RouteLLM at 95% of GPT-4 quality.
  • +7.6%p: Combination's accuracy edge. Mixture-of-Agents over GPT-4 Omni on AlpacaEval.

1. The Model That Conducts Models: What Fugu Ultra Actually Did

The easiest way to misread Fugu Ultra is to file it next to GPT-5.5 or Gemini 3.1 Pro as one more frontier model. It is a different kind of object. Behind a single API endpoint, Fugu takes an incoming task, decomposes it, decides which underlying LLM should handle each piece, calls those models (recursively, when a sub-task needs further breaking down), and then verifies and synthesizes their outputs into one answer. Sakana's own shorthand for it is blunt: an LLM trained to call other LLMs. The work that a human systems engineer would do by hand when stitching several models into a pipeline is folded into the model itself.

[Figure: User Task → Fugu Ultra (Learned Coordinator: ① decompose, ② route (RL policy), ③ verify + synthesize) → Model A (coding / SWE), Model B (science / reasoning), Model C (long-context / agentic) → Verified Answer]
▲ Fugu Ultra's orchestration flow: the learned coordinator decomposes a task, routes sub-tasks to specialized models via an RL-trained policy, then verifies and synthesizes the result. | Pebblous original diagram (Fig. 1 reinterpretation)
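The decompose, route, verify, and synthesize loop described above can be sketched in a few lines. This is a minimal illustration, not Sakana's implementation: the `decompose`, `route`, `verify`, and `orchestrate` functions, the "kind: prompt" task format, and the `default` pool entry are all hypothetical, and a fixed lookup stands in for the learned RL policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a downstream LLM: maps a prompt to text.
Model = Callable[[str], str]


@dataclass
class SubTask:
    kind: str    # e.g. "coding", "science"
    prompt: str


def decompose(task: str) -> List[SubTask]:
    """Toy decomposition: one sub-task per 'kind: prompt' line."""
    subtasks = []
    for line in task.splitlines():
        kind, _, prompt = line.partition(":")
        subtasks.append(SubTask(kind.strip(), prompt.strip()))
    return subtasks


def route(sub: SubTask, pool: Dict[str, Model]) -> Model:
    """A fixed lookup stands in for the learned routing policy."""
    return pool.get(sub.kind, pool["default"])


def verify(output: str) -> bool:
    """Placeholder check; a real verifier might run tests or a judge model."""
    return bool(output.strip())


def orchestrate(task: str, pool: Dict[str, Model]) -> str:
    results = []
    for sub in decompose(task):
        output = route(sub, pool)(sub.prompt)
        if not verify(output):
            # Fall back once to the default model if verification fails.
            output = pool["default"](sub.prompt)
        results.append(output)
    return "\n".join(results)  # naive synthesis: concatenate the pieces
```

In a learned coordinator, `route` (and possibly `decompose`) would be the trained component; everything else here is scaffolding around that decision.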

Sakana ships the product in two tiers. Fugu Mini is tuned for latency, doing lighter coordination with fewer downstream calls; Fugu Ultra is tuned for peak performance, willing to spend more calls and more verification to push quality up. The target workloads are the ones where decomposition pays off: software engineering, scientific reasoning, and agentic tasks that span multiple steps. In an internal AutoResearch run, Fugu Ultra spent roughly 14 hours on a single H100 conducting 123 experiments and reached a lowest bits-per-byte of 0.9748, better than every single-model baseline in that setup (on this metric, lower is better). The point of the demo is not the precise number but the shape of it: the coordinator earns its keep by trying many configurations and keeping the best.

The pricing tells you who Sakana thinks the buyer is. Subscriptions run at $20, $100, and $200 per month; enterprise usage is metered at roughly $5 per million input tokens and $30 per million output tokens, with long-context calls (beyond 272K tokens) billed at double, and a launch promotion waiving the second month for anyone who subscribes before the end of July 2026. On paper the per-token output rate sits near Opus 4.8's, but a true comparison is harder than it looks: a single Fugu call may fan out into several downstream model calls, so the headline rate and the real cost are not the same thing.
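To make the gap between headline rate and real cost concrete, here is a small cost sketch using the metered rates quoted above. Two things are assumptions rather than anything Sakana has published: that the long-context doubling applies to the whole call once input exceeds 272K tokens, and that fan-out can be modeled as a simple multiplier (`fan_out`) on billed tokens.

```python
INPUT_USD_PER_M = 5.0             # quoted enterprise rate, input tokens
OUTPUT_USD_PER_M = 30.0           # quoted enterprise rate, output tokens
LONG_CONTEXT_THRESHOLD = 272_000  # tokens
LONG_CONTEXT_MULTIPLIER = 2.0     # "billed at double"


def call_cost_usd(input_tokens: int, output_tokens: int, fan_out: float = 1.0) -> float:
    """Estimated cost of one call under the assumptions stated above."""
    base = (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        # Assumption: the surcharge applies to the entire call.
        base *= LONG_CONTEXT_MULTIPLIER
    # Assumption: fan-out scales billed tokens linearly (hypothetical).
    return base * fan_out


print(call_cost_usd(100_000, 10_000))             # headline-rate estimate
print(call_cost_usd(100_000, 10_000, fan_out=3))  # same call with 3x fan-out
print(call_cost_usd(300_000, 10_000))             # long-context surcharge applies
```

Under these assumptions the same request costs three times as much at a fan-out of 3, which is why a per-token comparison against a single-model API needs measured token counts, not list prices.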

The category to watch is not "another model" but "a product whose job is orchestration." Fugu's bet is that as capable models pile up, the scarce skill is no longer building one of them — it is deciding, automatically and well, which of them to use for what. That decision is what Fugu sells.

2. The Lineage of the Learned Orchestrator

Fugu did not appear from nowhere. It sits at the end of a line of techniques that all answer the same question — how do you get more out of a set of models than the best single one can give you — and the line has been moving steadily toward letting the system learn the answer instead of hand-coding it. The earliest step is plain prompt routing: an if-else rule or an embedding-similarity match sends each query to a chosen model. Useful, but the policy is fixed by a human.
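For reference, the two fixed-policy baselines look roughly like this. The model names, keywords, and exemplar sentences are invented for illustration, and a bag-of-words cosine stands in for a real embedding model so the sketch stays self-contained.

```python
import math
from collections import Counter


def rule_route(query: str) -> str:
    """If-else routing: a human writes the policy by hand."""
    q = query.lower()
    if "def " in q or "traceback" in q:
        return "code-model"
    if len(q.split()) > 2000:
        return "long-context-model"
    return "general-model"


def bow(text: str) -> Counter:
    """Toy 'embedding': word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical exemplar text per model; a human still chooses these.
EXEMPLARS = {
    "code-model": "fix the bug in this python function",
    "science-model": "explain the physics of the reaction rate",
}


def similarity_route(query: str) -> str:
    """Send the query to the model whose exemplar it most resembles."""
    q = bow(query)
    return max(EXEMPLARS, key=lambda name: cosine(q, bow(EXEMPLARS[name])))
```

In both functions the policy is frozen at authoring time: nothing in the code updates when a model in the pool gets better or worse.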

From there the ideas stack up. Mixture-of-Agents (MoA) layers several models so each refines the others' drafts; its most striking finding is "collaborativeness," where even a low-quality helper output lifts the main model's answer. Model cascades, as in FrugalGPT, run models in sequence from cheap to expensive and let a quality judge decide when to escalate. Learned routers, as in RouteLLM, train a router on preference data to predict which model will clear a quality bar at the lowest cost. The latest step, and Fugu's direct ancestor, is the learned coordinator: Sakana's own Conductor and TRINITY papers train the routing policy itself with reinforcement learning, shaping the reward around both correctness and cost.
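The cascade idea is the easiest of these to show in code. The sketch below follows the cheap-to-expensive pattern described for FrugalGPT but is not that paper's implementation: the `cascade` function, the tier tuples, the judge signature, and the 0.8 threshold are all hypothetical.

```python
from typing import Callable, List, Tuple

# (name, cost per call in arbitrary units, model function); all hypothetical.
Tier = Tuple[str, float, Callable[[str], str]]


def cascade(
    query: str,
    tiers: List[Tier],
    judge: Callable[[str, str], float],
    threshold: float = 0.8,
) -> Tuple[str, str, float]:
    """Try tiers in order (cheapest first); stop when the judge is satisfied.

    Returns (name of the model that answered, its answer, total cost spent).
    If no tier clears the threshold, the last tier's answer is returned.
    """
    name, answer, spent = "", "", 0.0
    for name, cost, model in tiers:
        answer = model(query)
        spent += cost
        if judge(query, answer) >= threshold:
            break  # good enough: do not escalate further
    return name, answer, spent
```

Note that escalation is additive: a query that reaches the expensive tier pays for every cheaper attempt before it, so the savings depend on how often the cheap tier clears the bar.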

The quantitative case that a combination can beat the best single model is already on the record, and it is not marginal.

| Approach | How it picks | Representative work | Reported gain |
|---|---|---|---|
| Prompt routing | if-else rules / embedding similarity | OpenRouter-style infra | Cost & latency control (infra level) |
| Mixture-of-Agents | layered aggregation of several models | Wang et al., 2024 | +7.6%p on AlpacaEval vs GPT-4 Omni (open-source only) |
| Model cascade | cheap → expensive, judge gates escalation | FrugalGPT, 2023 | 73% cost cut, +1%p accuracy vs GPT-4 |
| Learned router | preference-trained router predicts best model | RouteLLM, 2024 | 75% cost cut at 95% of GPT-4 quality |
| Learned coordinator | RL on the routing policy (correctness + cost reward) | Conductor / TRINITY (Sakana, ICLR 2026) | 7B coordinator scores GPQA Diamond 87.5% (> Gemini 2.5 Pro 84.8%) |

Lineage of orchestration techniques. Gains are reported by the cited work; cross-system comparison is approximate. Sources: arXiv:2406.04692 (MoA), arXiv:2406.18665 (RouteLLM), FrugalGPT (TMLR 2024), Sakana ICLR 2026.

What is genuinely new about Fugu is not that it routes, since routing is old, but that the routing policy is learned as a first-class objective. Conductor's 7B coordinator, built on Qwen2.5-7B, beats a frontier model on GPQA Diamond not by being smarter than the models it calls but by knowing, from training, when to defer to which. The coordinator's intelligence is, in the most literal sense, a function of the data it learned to route on.
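What "learning the routing policy from a reward" means can be shown with a deliberately tiny example. This is not Conductor or TRINITY: it is a tabular epsilon-greedy bandit, a much simpler technique, with a reward of correctness minus a weighted cost. The environment, model names, accuracies, and costs are invented for the illustration.

```python
import random
from collections import defaultdict


def reward(correct: bool, cost: float, cost_weight: float = 0.1) -> float:
    """Reward shaped around both correctness and cost (weights are assumed)."""
    return (1.0 if correct else 0.0) - cost_weight * cost


def train_router(env, kinds, models, steps=4000, epsilon=0.1, seed=0):
    """Learn, per task kind, which model maximizes average reward."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (kind, model) -> running mean reward
    n = defaultdict(int)    # (kind, model) -> number of trials
    for _ in range(steps):
        kind = rng.choice(kinds)
        if rng.random() < epsilon:
            model = rng.choice(models)                        # explore
        else:
            model = max(models, key=lambda m: q[(kind, m)])   # exploit
        correct, cost = env(kind, model, rng)
        n[(kind, model)] += 1
        q[(kind, model)] += (reward(correct, cost) - q[(kind, model)]) / n[(kind, model)]
    return {k: max(models, key=lambda m: q[(k, m)]) for k in kinds}


# Hypothetical simulated pool: model A is strong at coding but expensive,
# model B is strong at science and cheap.
ACCURACY = {("coding", "A"): 0.95, ("coding", "B"): 0.30,
            ("science", "A"): 0.30, ("science", "B"): 0.95}
COST = {"A": 1.0, "B": 0.2}


def toy_env(kind, model, rng):
    return rng.random() < ACCURACY[(kind, model)], COST[model]


policy = train_router(toy_env, ["coding", "science"], ["A", "B"])
print(policy)
```

The learned policy is only as good as the outcomes fed into `env`: if the correctness signal is biased or contaminated, the router faithfully learns the wrong preference.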

3. The Truth Behind "8 of 10": What the Lead Says and What It Hides

Sakana's headline is that Fugu Ultra "led or tied on 8 of 10 benchmarks" against Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The claim is true in the narrow sense and misleading in the way it is usually heard. Counting outright wins, Fugu leads on seven: SWE-Bench Pro, TerminalBench 2.1, LiveCodeBench, LiveCodeBench Pro, Humanity's Last Exam, CharXiv Reasoning, and GPQA-D. The eighth is a statistical tie on SciCode, where Gemini edges ahead by 0.2 points. The two losses, Long Context Reasoning and MRCRv2, both go to GPT-5.5, and both are long-context tasks.

The table below reproduces Sakana's published figures in full. Every number is vendor-reported, and that caveat is not a footnote — it is the main story of this section.

| Benchmark | Domain | Fugu Ultra | Opus 4.8 | Gemini 3.1 Pro | GPT-5.5 | Leader |
|---|---|---|---|---|---|---|
| SWE-Bench Pro | Coding / SWE | 73.7 | 69.2 | 54.2 | 58.6 | Fugu |
| TerminalBench 2.1 | Agentic | 82.1 | 74.6 | 70.3 | 78.2 | Fugu |
| LiveCodeBench | Coding | 93.2 | 87.8 | 88.5 | 85.3 | Fugu |
| LiveCodeBench Pro | Coding | 90.8 | 84.8 | 82.9 | 88.4 | Fugu |
| Humanity's Last Exam | General knowledge | 50.0 | 49.8 | 44.4 | 41.4 | Fugu ≈ Opus |
| CharXiv Reasoning | Scientific reasoning | 86.6 | 84.2 | 83.3 | 84.1 | Fugu |
| GPQA-D | Science | 95.5 | 92.0 | 94.3 | 93.6 | Fugu |
| SciCode | Scientific coding | 58.7 | 53.5 | 58.9 | 56.1 | Gemini (by 0.2) |
| Long Context Reasoning | Long context | 73.3 | 67.7 | 72.7 | 74.3 | GPT-5.5 |
| MRCRv2 | Long context | 93.6 | 87.9 | 84.9 | 94.8 | GPT-5.5 |

⚠ All scores are Sakana self-reported (vendor-reported) and have not been independently verified. Baseline figures are each provider's own numbers, gathered under differing scaffolds. Some third-party reports conflict (e.g., Fugu's SWE-Bench Pro has appeared as both 73.7 and 54.2). Source: Sakana AI official table, transcribed by officechai / explainx, June 2026.
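The "8 of 10" tally can be recomputed directly from the table, which is a useful habit for any vendor leaderboard. The scores below are transcribed from the table above; the 0.2-point tie margin is an assumption chosen to match the gap the text calls a statistical tie on SciCode, and the `tally` function is illustrative.

```python
# Column order: Fugu Ultra, Opus 4.8, Gemini 3.1 Pro, GPT-5.5 (vendor-reported).
SCORES = {
    "SWE-Bench Pro": (73.7, 69.2, 54.2, 58.6),
    "TerminalBench 2.1": (82.1, 74.6, 70.3, 78.2),
    "LiveCodeBench": (93.2, 87.8, 88.5, 85.3),
    "LiveCodeBench Pro": (90.8, 84.8, 82.9, 88.4),
    "Humanity's Last Exam": (50.0, 49.8, 44.4, 41.4),
    "CharXiv Reasoning": (86.6, 84.2, 83.3, 84.1),
    "GPQA-D": (95.5, 92.0, 94.3, 93.6),
    "SciCode": (58.7, 53.5, 58.9, 56.1),
    "Long Context Reasoning": (73.3, 67.7, 72.7, 74.3),
    "MRCRv2": (93.6, 87.9, 84.9, 94.8),
}


def tally(scores, tie_margin=0.2):
    """Classify each benchmark by Fugu's gap to the best competing score."""
    result = {"lead": [], "narrow_lead": [], "near_tie": [], "loss": []}
    for bench, row in scores.items():
        gap = round(row[0] - max(row[1:]), 1)  # scores have one decimal place
        if gap > 0:
            result["lead"].append(bench)
            if gap <= tie_margin:
                result["narrow_lead"].append(bench)  # a lead within the tie margin
        elif -gap <= tie_margin:
            result["near_tie"].append(bench)
        else:
            result["loss"].append(bench)
    return result


for label, benches in tally(SCORES).items():
    print(f"{label}: {len(benches)} {benches}")
```

Run as written, this reproduces seven leads, one near-tie, and two losses. It also flags that the Humanity's Last Exam lead (50.0 vs 49.8) sits inside the same 0.2-point margin that is treated as a tie on SciCode, which is consistent with the table's own "Fugu ≈ Opus" label.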

Why Both Losses Land on Long Context

The two defeats are not random. Sakana itself concedes that the lighter tier, Fugu Mini, does better on document-heavy work because it introduces less "over-coordination" noise. That is an honest and revealing admission: orchestration helps when a task can be split and recombined, and hurts when a task is really one long single pass that wants a model to hold the whole context in mind at once. The losses are a quantitative signal of orchestration's natural boundary, not a rounding error.

The Real Weakness Is Transparency, Not Score

Step back from the individual cells and the deeper problem is that none of this is independently reproducible. Every baseline is provider-reported. The ratio of closed to open models in Fugu's agent pool is not disclosed. Anthropic's published Opus scores come from Anthropic's own scaffold, while Sakana ran a different one, so the comparison is not apples to apples. And the cross-report conflicts on a single number — a SWE-Bench Pro that swings between 54 and 74 depending on who transcribed the table — are exactly what you would expect when no neutral party has rerun the suite. A benchmark you cannot audit is a marketing artifact, however good the numbers look.

Take the "8 of 10" at face value and you learn that Fugu is strong on coding and science and weak on long context. Take it apart and you learn something more useful: in a world where orchestrators will increasingly be judged on benchmarks, the scarce and valuable thing is an evaluation you can trust — neutral scaffolds, disclosed pools, reproducible runs. That is a data-governance problem wearing a leaderboard's clothes.

4. The Veneer of AI Sovereignty: Export Controls and the Global Contest

Fugu's most quoted talking point is "AI sovereignty," and to read it fairly you have to start with the rules it is responding to. In January 2025 the US Bureau of Industry and Security issued the Framework for Artificial Intelligence Diffusion, sorting the world into three tiers. Tier 1, with effectively unrestricted access, covers the US plus eighteen allies, Korea, Japan, and Taiwan among them. Tier 2 countries face volume caps and verified-end-user licensing. Tier 3 (China, Russia, Iran, North Korea, and others) is export-banned in all but name. The same regime is why certain frontier models are simply unavailable in parts of the world.

[Figure: AI Diffusion Rule (January 2025), export tier structure]
  • Tier 1, Unrestricted Access: US + 18 allies (Korea, Japan, Taiwan…)
  • Tier 2, Volume Caps + Verified End-User: most countries; individual model caps, licensing required
  • Tier 3, Effectively Export-Banned: China, Russia, Iran, North Korea; frontier models inaccessible
▲ The AI Diffusion Rule's three-tier pyramid: Tier 1 allies enjoy unrestricted access; Tier 2 faces volume controls; Tier 3 adversary nations are effectively cut off from frontier models. Fugu's "AI sovereignty" strategy targets the gap between Tier 1 and restricted zones. | Pebblous original diagram

That is the gap Fugu's narrative aims at. Its claim is that you can reach competitive performance without the export-restricted frontier models — without, say, Fable 5 or Mythos in the agent pool — by combining the models you can legally reach. Stated generally, this is a circumvention strategy for any non-US ecosystem: stop trying to own the single best model, and get good at orchestrating the accessible ones instead. Sakana rolled Fugu out everywhere except the EU and EEA, where GDPR and EU AI Act compliance work is still underway, which means Korea and Japan — both Tier 1 — can use it immediately.

There is a geopolitical undertone worth naming. Among Sakana's investors is In-Q-Tel, the venture arm associated with the US intelligence community, which reads as a signal of interest from the defense and intelligence sector in exactly this kind of orchestration capability. Sovereignty here is not a purely commercial story.

But the sovereignty claim carries a paradox it cannot shake. Fugu still calls closed, external APIs to do its work, so it is not independent in any complete sense. What it offers is a partial sovereignty: not being locked to one vendor or one model, and being able to swap pieces in and out. The honest version of the argument is that the foundation of independence is no longer owning a model — it is the capability to orchestrate and evaluate, and beneath that capability sits data.

5. Treating Models Like Data: Practice for the Orchestration Era

Pull the threads together and they meet at one place. A learned coordinator routes well only if it holds an accurate capability profile of its pool — which model is good at what, at what cost, with what latency. That profile is not an intuition; it is data, produced by running evaluations and logging results. Whatever bias or contamination lives in those evaluation sets passes straight through into the routing decisions. So the ceiling on orchestration quality is set by the quality of the evaluation data underneath it. Data-quality problems do not disappear in the orchestration era; they move up a layer, into model selection.
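A capability profile of this kind is just a table, and routing against it is a filter plus a sort. The sketch below assumes a simple policy (cheapest model that clears an accuracy bar and a latency budget); the field names and the `pick_model` helper are hypothetical, and the rows in the usage example are made-up placeholders, not measurements of any real model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProfileRow:
    model: str
    task_type: str
    accuracy: float       # measured on your own evaluation set
    usd_per_call: float   # measured average cost per call
    p50_latency_s: float  # measured median latency in seconds


def pick_model(
    profile: List[ProfileRow],
    task_type: str,
    min_accuracy: float,
    max_latency_s: float,
) -> Optional[str]:
    """Return the cheapest model that meets the quality and latency bars."""
    candidates = [
        row for row in profile
        if row.task_type == task_type
        and row.accuracy >= min_accuracy
        and row.p50_latency_s <= max_latency_s
    ]
    if not candidates:
        return None  # nothing qualifies; the caller decides how to degrade
    return min(candidates, key=lambda row: row.usd_per_call).model


# Placeholder rows for illustration only.
profile = [
    ProfileRow("model-a", "coding", 0.90, 0.40, 6.0),
    ProfileRow("model-b", "coding", 0.82, 0.05, 2.0),
    ProfileRow("model-a", "long_context", 0.88, 0.60, 9.0),
    ProfileRow("model-b", "long_context", 0.61, 0.08, 3.0),
]
print(pick_model(profile, "coding", min_accuracy=0.80, max_latency_s=5.0))
```

Every routing decision here is a pure function of the `accuracy`, `usd_per_call`, and `p50_latency_s` columns, so an error in the evaluation data becomes an error in model selection with nothing in between to catch it.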

There is also an economic reason this layer keeps gaining value rather than losing it. The price per token for equal-quality output is falling roughly tenfold a year, which sounds like it should make routing matter less. It does the opposite. As unit prices fall, call volume explodes: Hugging Face now hosts more than two million public models, and OpenRouter alone moved about 8.4 trillion tokens in a recent month, up fourfold year over year. The result is that total enterprise inference spend rises even as each token gets cheaper. This is Jevons' paradox applied to inference: the cheaper a resource gets, the more of it we use, and the more it matters to spend it wisely. The layer that decides what to call, and when, grows more valuable precisely as the underlying calls get cheaper.

[Figure: Jevons Paradox: Cheaper Tokens, Higher Total Spend. Y-axis: relative level; x-axis: 2022 to 2026. Two curves: $/token falling ~10×/yr and total spend rising. Annotation: cheaper per-token → more volume → routing value grows]
▲ As inference cost per token falls ~10× per year, call volume surges (OpenRouter: 4× YoY), so total enterprise spend rises — Jevons Paradox at work. The routing layer that decides when and what to call gains value as each individual call gets cheaper. | Pebblous original diagram
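The accounting behind the Jevons argument is a single ratio, and it is worth writing down because it states the condition under which total spend actually rises. The function below is only that identity. The two headline figures are measured on different bases (an equal-quality price index versus one platform's token volume), so plugging them in together is illustrative, not an estimate of anyone's bill.

```python
def spend_ratio(price_decline_factor: float, volume_growth_factor: float) -> float:
    """New spend divided by old spend, given spend = price x volume.

    price_decline_factor=10 means the blended price paid per token fell 10x.
    volume_growth_factor=4 means tokens consumed grew 4x.
    """
    return volume_growth_factor / price_decline_factor


print(spend_ratio(10, 10))  # break-even: volume grew exactly as fast as price fell
print(spend_ratio(10, 25))  # spend rises: volume outran the price decline
print(spend_ratio(10, 4))   # spend falls if both headline figures applied to one workload
```

Spend rises only when volume grows faster than the blended price a team actually pays falls. That blended price can fall more slowly than the equal-quality index if workloads migrate to stronger models or fan out into more calls per task, which is an assumption here, but it is the mechanism the routing layer is positioned to manage.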

For a team building its own LLM pipeline, that turns into three concrete disciplines. They are unglamorous, and they are where the leverage is.

  • Measure capability as data. Build a per-model profile — task type × accuracy × cost × latency — on your own evaluation sets, not on vendor leaderboards. Routing is only as good as this profile, and a profile you did not measure is a profile you cannot trust.
  • Record decisions for reproducibility. Log which model handled which sub-task, why, and how it scored. Without that trail, a routing system is unauditable, and you cannot explain or reproduce why an answer came out the way it did.
  • Govern the dependencies. Treat closed-API reliance as a managed risk: keep models swappable, watch for drift as their capabilities change underneath you, and price the lock-in. The teams already applying intelligent routing report cutting their bills by 40–85%, which is the ROI case for doing this deliberately rather than by accident.
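The second discipline, recording decisions, needs very little machinery to start. A minimal sketch, assuming an append-only JSON Lines file is an acceptable audit trail; the `RoutingDecision` fields and the helper names are hypothetical and would be adapted to whatever a real pipeline tracks.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RoutingDecision:
    subtask_id: str
    task_type: str
    model: str        # which model handled the sub-task
    reason: str       # why it was chosen (e.g. the profile row that qualified it)
    score: float      # how the output scored under your own evaluation
    timestamp: float  # seconds since the epoch


def append_decision(path: str, decision: RoutingDecision) -> None:
    """Append one decision as a JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(decision)) + "\n")


def load_decisions(path: str) -> List[RoutingDecision]:
    """Read the full trail back for audit or replay."""
    with open(path, encoding="utf-8") as f:
        return [RoutingDecision(**json.loads(line)) for line in f if line.strip()]


def mean_score_by_model(decisions: List[RoutingDecision]) -> dict:
    """Average logged score per model: a first, crude drift signal."""
    totals, counts = {}, {}
    for d in decisions:
        totals[d.model] = totals.get(d.model, 0.0) + d.score
        counts[d.model] = counts.get(d.model, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}
```

A typical call would be `append_decision("routing.jsonl", RoutingDecision("t1", "coding", "model-b", "cheapest above 0.80", 0.9, time.time()))`. Comparing `mean_score_by_model` across time windows gives an early warning when a model's capability shifts underneath the profile.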

The shape of this work should feel familiar. Diagnosing what a model is good at, selecting against a measured profile, and managing the result over time is the same loop that data teams already run on datasets. The phrase "AI-Ready Data" extends cleanly to "AI-Ready Model": in an orchestration world, the model is one more asset you have to keep clean, current, and accounted for.

If "AI sovereignty" is really moving from owning a model to orchestrating models well, then the bedrock under that capability is data — capability profiles, evaluation sets, decision logs. The orchestration era does not retire the data question. It promotes it.

Editor's Note

The work Pebblous has focused on — diagnosing and cleaning data quality (DataClinic) and producing data in a usable form (AI-Ready Data) — lands in the same place as the baseline demand this report describes. We read the orchestration shift less as a contest between models than as a reason to treat models, too, with the discipline we already bring to data: measured, recorded, and governed.

References

Academic (arXiv / Conference)

Industry & Primary Sources

Policy, Statistics & Market

  • 11. US Bureau of Industry and Security. (2025, January 13). "Framework for Artificial Intelligence Diffusion" (the AI Diffusion Rule; three-tier structure).
  • 12. Appenzeller, G. (a16z). (2024). "LLMflation"; Epoch AI, LLM inference price trends (≈10×/year decline).
  • 13. MarketIntelo. (2025). "AI Multi-Agent Orchestration Market" ($5.8B in 2025 → $37.4B by 2034, CAGR 23.7%; estimates vary by firm).

※ Fugu Ultra benchmark figures are vendor-reported and not independently verified; market sizes vary by source definition and are presented as estimates. Where reports conflict (e.g., SWE-Bench Pro 54.2 vs 73.7), the more widely transcribed figure is used in the table and the conflict is noted.