Executive Summary

In June 2026, Hugging Face co-founder Thomas Wolf posted a short tweet. More than 100 AI agents, he wrote, had spent a week in open collaboration raising Gemma 4's vLLM inference speed 5×. The headline everyone latched onto was "AI collaborated." But the part worth pausing on lies elsewhere. The agents weren't discovering drugs or analyzing data — they were fixing the infrastructure that runs AI itself: the kernels of the inference engine. AI, once a tool for solving external problems, had moved into the seat where it works directly on the system that runs it.

The 5× is no magic trick; it is the product of diagnosable engineering. Gemma 4 is built in a way that disables the standard attention kernel, so vLLM fell back to a slow detour path — and on some GPUs that pushed throughput down to 9 tok/s, nearly 14× slower than comparable models. Simply restoring an abnormally low starting line to normal already explains the 5× on technical grounds. The tweet, though, was cut off at "but," and after that "but" there are almost certainly conditions and limits.

What we focus on is exactly what comes after that "but." When 100 agents pour out patches at once, the verification and curation that decide what to keep and what to discard become the invisible infrastructure holding up the 5×. In an era where AI fixes AI, the real bottleneck is not compute but data quality. This piece takes the 5× apart and follows the real bottleneck beyond the truncated tweet.

Editor's note. What actually produces performance at the boundary between AI and data is the question Pebblous keeps returning to. This piece does not indulge the "AI replaced humans" hype. Wolf's experiment was an open collaboration that humans and agents took part in together, and because the tweet was cut off, the measurement conditions behind the 5× were never published. So every figure here was re-checked against primary sources (the vLLM issue tracker, official blogs, papers), and we chose the phrasing "technically explainable" over flat assertion.

The Numbers That Matter

Four numbers form the spine of this piece — the size of the gain and the cost hiding behind it, side by side.

  • 5×: final inference speedup. Result of a week of open collaboration (measurement conditions undisclosed).
  • 9 → 60–100: tok/s recovery range. The span recoverable by normalizing the fallback kernel.
  • 100+ / 1 wk: agents / collaboration window. Collective work on a public message board.
  • +58–285%: multi-agent token overhead. Hidden cost that grows with the collaboration structure.

1. What Happened: a Week of 100 Agents

It started with a one-line tweet. Thomas Wolf, co-founder and chief science officer of Hugging Face, wrote that "multi-agent collaboration is one of the most interesting agent behaviors right now," and that more than 100 agents had improved Gemma 4's inference speed in vLLM 5× over a week of open collaboration. Then the sentence broke off at "but." What the limits were, under what conditions the 5× held — all of that would have followed, but the published tweet stopped there.

The stage for this experiment was the Fast Gemma Challenge, run jointly by Google and Hugging Face. Participants watch, copy, and build on each other's optimization attempts on a shared message board — a format that is part competition, part collaboration. It wasn't a humans-only contest; coding agents took part in force, and the process and results piled up on a public dashboard exactly as they happened. "Open collaboration" means humans and agents moving together on one board. It was not an event where AI pushed humans aside.

That is why you have to look one layer beneath the surface story ("100 AIs collaborated"). Almost every famous multi-agent case so far has had agents performing an external task — searching for drug candidates, handling customer inquiries, operating industrial data. This experiment runs the other way. The agents fixed the ground they themselves run on: the kernels of the inference engine. It is a glimpse of an AI system stepping into the meta level of iteratively improving itself.

▲ Fig. 1. The meta-level shift in agent roles: from the previous paradigm of external task execution (drug discovery, customer support, data analysis, code generation) to this experiment, in which 100+ AI agents directly fix AI infrastructure (vLLM inference kernel optimization, the system that runs them). Original Pebblous diagram (reinterpreted).

Compressed to a line: an open, community-style swarm of agents directly improved AI infrastructure itself — for the first time, at scale, and in a way a machine could score instantly. The question "is it faster?" does not wait on human interpretation. That is what made this task an unusually good fit for agent collaboration.

A note on sources: the original tweet is cut off at "5x ... but," so this piece does not assert the measurement conditions of the 5×. The human-plus-agent structure of the Fast Gemma Challenge, the shared message board, and the 100+ TPS record on an A10G environment were cross-checked against the Hugging Face dashboard and official Google / Hugging Face materials.

2. How the 5× Was Possible: Anatomy of a Bottleneck

A number like 5× makes it easy to imagine a new algorithm. In this case it was the opposite. The starting line was abnormally low, so simply returning that starting line to normal produced a large multiple. The key lies in the architecture peculiar to Gemma 4.

2.1 Why It Was So Slow

Gemma 4 mixes local and global attention in a 5-to-1 hybrid, and its attention head dimensions (head_dim) of 256 and 512 fall outside the norm. The problem is that the fast FlashAttention kernel vLLM uses by default does not support those non-standard dimensions. When it isn't supported, the engine drops to a slow detour path (a Triton fallback). The result was roughly 9 tok/s reported on an RTX 4090 — nearly 14× slower than a comparable Llama 3.2 3B, which cleared 100 tok/s (vLLM Issue #38887).
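The fallback behavior described here can be pictured with a toy dispatcher. This is a minimal sketch, not vLLM's actual backend-selection code: the function name `select_attention_backend`, the backend labels, and the `FAST_KERNEL_HEAD_DIMS` set are all hypothetical, and the contents of that set are illustrative rather than the real support list of any kernel.

```python
# Toy model of attention-backend selection. Every name and the supported
# set below are illustrative assumptions, not vLLM internals.
FAST_KERNEL_HEAD_DIMS = frozenset({64, 128})  # hypothetical support list


def select_attention_backend(head_dims, supported=FAST_KERNEL_HEAD_DIMS):
    """Return ("fast", []) only if every head_dim is supported.

    A single unsupported dimension sends the whole model to the slower
    generic path, which is the failure mode described in the text.
    """
    unsupported = sorted(set(head_dims) - set(supported))
    if unsupported:
        return "fallback", unsupported
    return "fast", []


# Gemma 4's head dimensions as reported in the article: 256 and 512.
print(select_attention_backend([256, 512]))

# A kernel patch amounts to widening the supported set.
patched = set(FAST_KERNEL_HEAD_DIMS) | {256, 512}
print(select_attention_backend([256, 512], supported=patched))
```

The point of the sketch is that the slowdown is a dispatch decision, not a property of the model's math: once the fast path accepts the model's dimensions, nothing else has to change.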

▲ Fig. 2. Gemma 4 kernel path before and after optimization. Before: head_dim 256 / 512 is unsupported by FlashAttention, forcing a fallback to the Triton detour path at ~9 tok/s (~14× slower than comparable models, RTX 4090). After the kernel patch: an attention path supporting the head_dim is restored, the fast GPU kernel path is active, and speed recovers to 60–100 tok/s, the key lever for the 5×. Original Pebblous diagram (based on vLLM Issue #38887).

In other words, what the agents found was not new mathematics but "the work of closing a mismatch between model and kernel." Clearing this one bottleneck alone recovers 9 tok/s to the 60–100 tok/s range, so a single lever already explains the 5×. Of course, the actual contest stacked several optimizations together. Below is a cumulative view of the representative levers that lift inference speed.

  • Baseline (9 tok/s): stuck on the Triton detour path.
  • Kernel normalization (×many): restoring an attention path that supports the non-standard head_dim. The single largest lever.
  • PagedAttention · continuous batching (×3–5): cuts memory fragmentation and bundles requests without gaps to grow throughput.
  • FP8 · NVFP4 quantization (×~2): lowers precision to save compute and bandwidth (with quality verification as a precondition).
  • MTP, multi-token prediction (×1.7–2.66): drafts tokens ahead to accelerate. Measured 1.7–2.66× (Google officially claims up to 3×), with an acceptance rate around 96.5%.
  • torch.compile (+): graph compilation that trims kernel-call overhead to finish things off.
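For the MTP lever, a standard back-of-the-envelope model from the speculative-decoding literature relates acceptance rate to tokens produced per verification pass. The sketch below assumes each drafted token is accepted independently with probability `alpha` and that `k` tokens are drafted per pass. The article quotes the acceptance rate (about 96.5%) but not `k`, so the draft lengths used here are assumptions. The result is an idealized ceiling that ignores drafting and verification overhead, which is one reason a measured 1.7–2.66× can sit below it.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k drafted tokens.

    Assumes i.i.d. acceptance with probability alpha. The sum
    1 + alpha + ... + alpha**k has the closed form used below.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


ALPHA = 0.965  # acceptance rate quoted in the article
for k in (1, 2, 3):  # draft lengths are assumed, not taken from the source
    print(k, round(expected_tokens_per_pass(ALPHA, k), 2))
```

With a high acceptance rate the ceiling grows almost linearly in `k`, so the practical limit comes from the cost of drafting and verifying, not from rejected tokens.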

One caveat, stated plainly. The multiples above were each measured under different conditions, so they must not be combined by simple multiplication. The levers overlap and have ceilings. But the direction is clear: kernel normalization did the heaviest lifting, and the rest sat on top of it. The 5× is less a stack of small miracles than the result of properly closing one diagnosable mismatch.
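The arithmetic behind that caveat can be checked directly. The sketch below uses only figures quoted in this piece. Reading kernel normalization's gain as the 9 → 60–100 tok/s recovery is an interpretation, and the naive product is computed purely to show why it should not be trusted.

```python
BASELINE_TOK_S = 9.0
RECOVERED_TOK_S = (60.0, 100.0)

# Speedup from kernel normalization alone, from the tok/s figures in the text.
kernel_gain = tuple(v / BASELINE_TOK_S for v in RECOVERED_TOK_S)

# The other multiples quoted in the list, each measured under different
# conditions: x3-5, x~2, x1.7-2.66.
quoted_multiples = [(3.0, 5.0), (2.0, 2.0), (1.7, 2.66)]

naive_low, naive_high = kernel_gain
for low, high in quoted_multiples:
    naive_low *= low
    naive_high *= high

print(f"kernel normalization alone: {kernel_gain[0]:.1f}x to {kernel_gain[1]:.1f}x")
print(f"naive product of all multiples: {naive_low:.0f}x to {naive_high:.0f}x")
```

Kernel normalization alone already exceeds the reported 5×, while the naive product lands more than an order of magnitude above it. That gap is the overlap-and-ceiling effect at work.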

3. "AI Fixing AI" Is Already Here

Treat Wolf's experiment as a one-off news item and you miss the essence. The trend of agents autonomously optimizing GPU kernels has been accumulating across research and products for a while. Wolf's experiment is simply the largest-scale version of that paradigm — more than 100 agents going at it in the open for a week. The table below gathers the representative precedents of the past year or two.

Case | Result | Characteristics
KernelSkill | KernelBench L1 5.44× | Multi-agent GPU kernel optimization
MiniMax M3 | Autonomous FP8 GEMM 9.4× | 24 hours · 147 submissions · zero human intervention
Cursor × NVIDIA | B200, 235 kernels, geomean 1.38× | Targeting real production kernels
ISO-Bench | 46.2% autonomous improvement on real vLLM tasks | Coding-agent benchmark

▲ Fig. 3. Submissions vs. performance (x-axis: submission count; y-axis: performance, higher inference speed is better). Generic models plateau at about 30 attempts, while MiniMax M3 reaches peak performance at submission #147. Original Pebblous diagram (based on MiniMax M3 2026).

One shared lesson surfaces among the numbers. What separated the results was not model size but the verification loop and sheer persistence. ISO-Bench's dominant failure mode was "Good Intent, Bad Execution." Unlike other models that stalled within 30 attempts, MiniMax M3 hit its best performance on its 147th submission. The side that doggedly repeated the cycle of try, score, and fix again is the side that won.

So the real asset is not a single block of model weights but the history of try-verify-retry. Which change raised speed, and which one quietly degraded quality — that entire trajectory becomes the signal that teaches the next optimization.
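The try-verify-retry cycle can be sketched as a small loop that records every attempt, not just the winner. This is an illustrative skeleton under stated assumptions: `propose`, `score`, and `verify` are hypothetical callables standing in for an agent, a benchmark, and a correctness check, and the toy landscape at the bottom is invented for the demo (only the numbers 30 and 147 echo the text).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Every attempt is recorded, including failures: the history is the asset."""
    attempts: List[Tuple[int, float, bool]] = field(default_factory=list)
    best_score: float = float("-inf")
    best_attempt: int = 0


def try_verify_retry(
    propose: Callable[[int], object],
    score: Callable[[object], float],
    verify: Callable[[object], bool],
    budget: int,
) -> Trajectory:
    """Run `budget` attempts; only verified improvements replace the best."""
    traj = Trajectory()
    for attempt in range(1, budget + 1):
        candidate = propose(attempt)   # try
        passed = verify(candidate)     # verify (correctness gate)
        value = score(candidate) if passed else float("-inf")
        traj.attempts.append((attempt, value, passed))
        if passed and value > traj.best_score:
            traj.best_score, traj.best_attempt = value, attempt
    return traj


# Toy landscape (not real benchmark data): the score peaks at attempt 147,
# and every tenth candidate fails verification.
toy = dict(
    propose=lambda i: i,
    score=lambda c: -abs(c - 147),
    verify=lambda c: c % 10 != 0,
)
print(try_verify_retry(budget=30, **toy).best_attempt)   # stops short of the peak
print(try_verify_retry(budget=200, **toy).best_attempt)  # reaches the peak
```

A run that stops at 30 attempts never sees the peak, and in both runs the full `attempts` list survives as a record of what was tried and what failed verification.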

4. What Hid After the 'but': the Real Bottleneck

After the truncated tweet's "but," there were almost certainly limits and conditions. Those limits split into two layers. One is the reproducibility of the result itself; the other is the cost of the multi-agent approach as such.

4.1 How Far Did the Result Reproduce

The peak performance was reproduced on the newest Blackwell GPUs. In other environments, the 14× slowdown sometimes remained intact. On top of that, lowering precision brought quiet quality degradation with it. FP8 block quantization came with a logit-saturation bug that skewed the output distribution abnormally (Issue #39407); large prefills came with a hang where the engine simply froze (Issue #39914). In short, the speed was sometimes bought at the price of accuracy or stability somewhere else.
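One way to catch this kind of quiet degradation is to compare the optimized engine's output distribution against a reference. The sketch below computes the KL divergence between next-token distributions derived from two logit vectors, in plain Python. The function names and the example logits are illustrative, and this is not the check used in the cited issues.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits: Sequence[float], test_logits: Sequence[float],
                  eps: float = 1e-300) -> float:
    """KL(ref || test) over next-token distributions, in nats."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("logit vectors must have the same length")
    p = softmax(ref_logits)
    q = softmax(test_logits)
    # eps guards against a test probability that underflows to zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


reference = [2.0, 1.0, 0.5, -1.0]
saturated = [30.0, 30.0, 30.0, -1.0]  # toy stand-in for clipped logits
print(kl_divergence(reference, reference))  # identical inputs give 0.0
print(kl_divergence(reference, saturated))  # a clearly positive drift
```

A throughput benchmark would rate the saturated variant as a success. Only a distribution-level comparison like this one exposes that its ranking of tokens has collapsed.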

4.2 Does Adding More Agents Help

Multi-agent collaboration itself isn't free. As the tokens exchanged for collaboration grow, overhead has been reported at +58% in independent configurations and up to +285% in centralized ones. Google Research noted that a poorly configured multi-agent setup can perform 39–70% worse than a single agent, and industry tallies put the production success rate at around 23% with a recommended team size of just 3–5. The intuition that "more is always better" is often wrong.
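To make the overhead figures concrete, here is a small worked calculation. The overhead percentages come from the text. The one-million-token baseline is an assumed number chosen only for illustration.

```python
def tokens_with_overhead(single_agent_tokens: int, overhead_pct: float) -> float:
    """Total tokens when collaboration adds overhead_pct percent on top."""
    return single_agent_tokens * (1.0 + overhead_pct / 100.0)


BASELINE_TOKENS = 1_000_000  # assumed single-agent budget, for illustration

for label, pct in (("independent", 58.0), ("centralized", 285.0)):
    total = tokens_with_overhead(BASELINE_TOKENS, pct)
    print(f"{label}: {total:,.0f} tokens")
```

A job that costs one million tokens with a single agent costs roughly 1.58 to 3.85 million once coordination traffic is included, so the collaboration has to buy at least that much extra value to break even.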

Yet that same multi-agent setup clearly shines under specific conditions: when the task can be finely decomposed, run in parallel, and scored instantly by a machine. Kernel optimization is exactly that sweet spot. The fork in the road is not the number of agents but how the collaboration is structured. Design the structure well and overhead can flip into savings: collaboration frameworks like MARS have been reported to cut tokens and time in half while holding accuracy. On decomposable tasks, Finance-Agent posted +80.8%, and a five-agent ensemble hit 89% on HumanEval. The contrast table below shows the divide.

Dimension | When multi-agent hurts | When multi-agent helps
Task nature | General reasoning, hard to decompose or verify | Decomposable, parallel, machine-verifiable
Representative figures | Overhead +58–285%, performance ↓39–70% | Finance-Agent +80.8%, HumanEval 89%
Recommended team size | 20+ stays persistently underperforming | 3–5, with ~23% success rate

▲ Fig. 4. The data quality layer as the hidden infrastructure of the 5×. 100+ agents submit patches, experiments, and commits (hundreds of concurrent submissions in the Fast Gemma Challenge's open collaboration format). A validation, curation, and diagnostics layer scores performance automatically, detects quality bugs, and makes accept / reject decisions: patches that pass the performance and quality check are accepted, and quietly degrading patches are filtered out. Original Pebblous diagram (reinterpreted).

So the real bottleneck behind the 5× was not compute. It was verification and curation: deciding which of the patches 100 agents threw out at once to keep and which to drop, and catching the changes that, like logit saturation, shave quality while looking perfectly fine on the surface. In other words, data quality.
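The accept-or-reject decision can be sketched as a filter over benchmark results. This is a hypothetical illustration, not the challenge's actual pipeline: `PatchResult`, `curate`, the `quality_drift` metric, the thresholds, and the sample patches are all invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PatchResult:
    patch_id: str
    tok_per_s: float       # measured throughput with the patch applied
    quality_drift: float   # divergence from a reference output; lower is better
    crashed: bool = False  # hang or crash during the benchmark run


def curate(results: List[PatchResult], baseline_tok_per_s: float,
           max_drift: float) -> List[PatchResult]:
    """Keep patches that are faster AND within the quality budget, fastest first."""
    accepted = [
        r for r in results
        if not r.crashed
        and r.tok_per_s > baseline_tok_per_s
        and r.quality_drift <= max_drift
    ]
    return sorted(accepted, key=lambda r: r.tok_per_s, reverse=True)


candidates = [
    PatchResult("kernel-fix", 62.0, 0.001),
    PatchResult("fp8-aggressive", 95.0, 0.9),       # fastest, but degraded
    PatchResult("big-prefill", 70.0, 0.002, True),  # hangs under load
    PatchResult("no-op", 9.0, 0.0),                 # no speedup
]
kept = curate(candidates, baseline_tok_per_s=9.0, max_drift=0.01)
print([r.patch_id for r in kept])  # ['kernel-fix']
```

Ranked by speed alone, the degraded patch would win. The quality and stability gates are what turn a flood of submissions into a usable result.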

5. What the 5× Changes: Inference Economics and Physical AI

What changes when inference gets 5× faster? A threshold moves in two places. One is money, the other is physics.

5.1 Inference Economics

The unit cost of inference has been falling steeply. The price of a token delivering the same performance dropped from about $20 per million tokens in 2022 to roughly $0.40 in 2026, about 50× lower (per Epoch AI, around 10× a year). Inference now accounts for roughly 67% of AI compute cost, and estimates of the 2026 inference market range from $5.9B to $22.7B depending on the source. Drop the unit cost by another 5× and real-time, high-volume inference services that don't pencil out today cross into profitability.
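The price arithmetic in this paragraph is simple enough to verify in a few lines. The two prices are the figures quoted in the text. The step from a 5× throughput gain to a 5× lower unit cost is an assumption (fixed hardware and fully utilized capacity), not something the source measures.

```python
PRICE_2022 = 20.00  # USD per million tokens, figure quoted in the text
PRICE_2026 = 0.40   # USD per million tokens, figure quoted in the text

decline = PRICE_2022 / PRICE_2026

# Assumption: on fixed hardware, 5x the throughput means 1/5 the unit cost.
price_after_5x = PRICE_2026 / 5.0

print(f"decline 2022 to 2026: {decline:.0f}x")
print(f"price after a further 5x: ${price_after_5x:.2f} per million tokens")
```

Under that assumption the unit price falls to about eight cents per million tokens, which is the kind of shift that changes which services are viable.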

5.2 The Threshold for Edge and Robots

The shift on the physical side is more direct. Real-time control of a robot or drone demands 100–1,000 tokens per second, while today's edge chips sit at 10–30. That is a 10× to 100× gap. A 5× in inference is one of the key levers for closing it, and the combination of a small-active-parameter Gemma 4 MoE (≈3.8B active) plus multi-token prediction is being floated as a strong candidate for crossing into the edge.

Robot real-time control demand: 100–1,000 t/s
Current edge-chip throughput: 10–30 t/s

▲ The edge-inference threshold gap. The bars are a simplified schematic of relative scale (not a log scale).
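The size of the gap depends on which ends of the two ranges are compared, and a short calculation makes that explicit. The demand and throughput ranges are the ones quoted in the text. That a 5× gain carries over unchanged to edge hardware is an assumption made for the sake of the sketch.

```python
DEMAND_TOK_S = (100.0, 1000.0)  # real-time control demand quoted in the text
EDGE_TOK_S = (10.0, 30.0)       # current edge-chip throughput quoted in the text
SPEEDUP = 5.0                   # assumed to carry over to edge hardware


def gap(demand: float, supply: float) -> float:
    """How many times faster the chip must get to meet the demand."""
    return demand / supply


# Against the slowest chip: the 10x and 100x figures in the text.
gap_vs_slowest = [gap(d, EDGE_TOK_S[0]) for d in DEMAND_TOK_S]
# Against the fastest chip the gap is smaller.
gap_vs_fastest = [gap(d, EDGE_TOK_S[1]) for d in DEMAND_TOK_S]

edge_after = [v * SPEEDUP for v in EDGE_TOK_S]
meets_low_end = edge_after[1] >= DEMAND_TOK_S[0]

print("gap vs slowest chip:", gap_vs_slowest)
print("gap vs fastest chip:", [round(g, 1) for g in gap_vs_fastest])
print("edge throughput after 5x:", edge_after)
print("fastest chip meets the 100 tok/s floor:", meets_low_end)
```

Under these assumptions a 5× lifts the fast end of edge hardware past the 100 tok/s floor but leaves the 1,000 tok/s end well out of reach, which is why it counts as one lever among several.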

To sum up, a 5× in inference is not a mere benchmark boast. It is a lever that lowers cost to make more services viable, and raises physical throughput to lower the bar for robots and the edge. And to pull that lever safely, you have to be able to trace — at the data level — what was traded away for speed.

6. Why Pebblous Is Watching

This experiment meets the three axes Pebblous works on head-on. The buzz was about "100 agents," but the place we look is who sorts the quality of what those agents produced.

6.1 Business and Technical Connection

The optimization patches, experiment logs, and failed attempts that 100 agents poured out are themselves the next generation's training signal for "which change raises speed." Deciding what to adopt and what to discard is data curation. A 5× in inference is a direct lever bringing the edge closer to the threshold for real-time control of robots and drones (Physical AI), and automatically diagnosing and filtering which of the agents' commits quietly degrade quality is a natural extension of DataClinic.

6.2 The Data-Quality Lens

The success or failure of multi-agent collaboration is decided not by model size but by quality control over the output. ISO-Bench's "good intent, bad execution" failures, the FP8 logit-saturation bug, and the performance paradox (a 39–70% drop) are all problems of output that looks plausible but is actually degraded. The quality of the data entering the training and optimization pipeline (the patches, checkpoints, and synthetic signals) directly governs the model's internal representations and its final performance.

6.3 Practical Implications for Customers and Partners

When an enterprise adopts multi-agent systems, the variables that decide ROI are clear. First, is the task decomposable and verifiable? Second, is the team size kept within 3–5? Third, is the try-verify-retry history preserved as an asset? The vague hope that "adding more agents makes it better" comes back as token overhead and the performance paradox.

In an era where AI fixes AI, Pebblous's place is the data-quality layer of collective agent collaboration: verifying and curating the code, patches, and experiment data agents produce, diagnosing quiet degradation, and separating adoptable signal from noise. Compute and models commoditize fast (unit cost down 50×), but the judgment of "what counts as good data" does not commoditize. This experiment shows that exactly that judgment was the hidden engine behind the 5×.

References

The academic papers, vLLM and Google technical docs and issues, and industry and benchmark materials cited here are grouped by source. Because the tweet was cut off and the 5× measurement conditions were never published, every figure in this piece was re-checked against primary sources.

Academic (arXiv)

Technical Docs · Issues · Official Blogs

  • 6. vLLM GitHub Issue #38887: Gemma 4 E4B ~9 tok/s, Triton fallback.
  • 7. vLLM GitHub Issue #39407: Gemma 4 31B FP8_BLOCK logit saturation bug.
  • 8. vLLM GitHub Issue #39914: Gemma 4 engine hang during large prefill.
  • 9. vLLM GitHub Issue #39749: Q2 2026 Roadmap.
  • 10. Google (2026). Accelerating Gemma 4: faster inference with MTP drafters (blog).
  • 11. Red Hat Developer (2026-04). Speculative decoding in vLLM.
  • 12. Hugging Face. Fast Gemma Challenge dashboard.
  • 13. vLLM Blog. 2024 Retrospective and 2025 Vision · Performance update / v0.6.0 (vllm.ai, 2024-09-05).

Industry · Benchmarks · Data

  • 14. Latent Space AINews (2026-06-25~27). Thomas Wolf, 100+ agents top tweet (primary source).
  • 15. Epoch AI. LLM inference price decline (10–50×/year since 2022).
  • 16. Allen Kuo, Medium (2026). Gemma 4 on vLLM vs Ollama: 96 GB Blackwell benchmarks.
  • 17. Spheron Blog (2026). vLLM vs TensorRT-LLM vs SGLang H100 benchmarks / GPU cost per token.
  • 18. Cursor × NVIDIA (2026). Blackwell B200 multi-agent kernel optimization · MiniMax M3 (2026-06). Autonomous FP8 GEMM kernel optimization on Hopper.