Executive Summary

In June 2026, SpaceX went public at $135 a share, crossed a $2 trillion market cap on its first day, and recorded the largest IPO in history. Around the same time, a researcher ran a different experiment on that same registration filing (the S-1). He handed LLM financial analysts 1,000 due-diligence questions and graded which of them got closer to the truth. This article is about who wrote the answer key for that grading.

No human wrote the answer key. Several LLMs answered the same questions, candidate facts were pulled from that ensemble of answers, and the grading rubric was generated automatically after auditing for omissions, hallucinations, and duplicates. A person only checked the quality at the final gate. Of the 1,000 questions built that way, 930 were never released — because the moment they go public, models learn the answers and the grading data is contaminated.

One question follows the instant agents start doing real work: who, and with what, grades the result. The answer this study points to is clear. To trust AI, you first have to trust the data that grades it — and that grading data is itself an asset that has to be managed.

Key Figures

Four numbers sum up the study: who was most accurate, what was locked away, how much one round of grading cost, and how many people wrote the answer key. That last number is where this article begins.

Source: Benhenda, arXiv:2606.23032 (2026)

79.4%: Qwen 3.7 Max accuracy (prior ceiling 57.9%, +21.5pp)

930: questions kept locked (only 70 released)

$0.05: MiMo-2.5 Pro cost per query (~1/50 of Gemini's $2.51)

0: people who drafted the answer key (written by an LLM ensemble; humans reviewed)

1. SpaceX's S-1: A 1,000-Question Exam

SpaceX filed its S-1 confidentially on April 1, 2026, made it public on May 20, and listed on June 12. The offering priced at $135 per share, with 555.6 million shares issued and a target raise of $75 billion. The first-day market cap opened at $1.77 trillion and crossed $2 trillion shortly after the open. By size alone, it is the largest IPO ever.
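
The offering figures above can be cross-checked with one multiplication. This is only a sanity check on the numbers quoted in this article; the variable names are illustrative.

```python
# Figures quoted in the text above.
price_per_share = 135.0      # USD per share
shares_issued = 555.6e6      # shares offered

# Gross raise implied by price times share count.
gross_raise = price_per_share * shares_issued
print(f"Implied raise: ${gross_raise / 1e9:.2f} billion")
```

The product comes to about $75.01 billion, which matches the stated $75 billion target.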

The filing itself was tricky. The S-1 folded in the full acquisition of xAI, completed in February 2026, so on a consolidated basis it bundled 2025 revenue of $18.67 billion with a net loss of $4.94 billion. Starlink was profitable, with $11.4 billion in revenue and $4.4 billion in operating income, while xAI spent $14 billion the same year and earned $3.2 billion. A profitable business and a loss-making one, acquisition accounting and governance — all overlapping inside one document.

An S-1 for an IPO is a different animal from the periodic filings of a public company: the annual 10-K and the quarterly 10-Q. On top of historical financial statements, it carries pro forma accounting, capital-raising structure, governance, and acquisition-risk disclosures, and it runs far longer. The existing financial-analysis benchmark, Finance Agent v2, covered only those periodic disclosures. Its retrieval was simple chunk search with no added context, so on a long document like an IPO filing it was easy to miss what mattered.

So this study built a separate exam tailored to IPO due diligence. Working from the SpaceX S-1, it generated 1,000 questions on financial-statement analysis, pro forma accounting, governance, capital structure, and risk disclosure. Of those, only 70 were released and 930 were kept private. The reason for locking them comes later.

The thing to watch here is not the exam questions but the grading criteria. Each of the 1,000 questions needs a correct answer before it can be graded. Who built that answer key, and how, is the study's real contribution.

2. An AI Built the Answer Key

Usually the grading criteria for a benchmark are written by hand, by experts. In a field like financial analysis, an accountant or analyst defines, question by question, the facts that an answer must contain. Finance Agent v2 had 537 questions written directly by experts. It is an expensive, slow way to work.

This study automated that work. It had AI generate the grading rubric itself — the criteria that define what each correct answer must include. Instead of writing the answers, people only review the machine-made criteria at the end.

The pipeline runs in five stages. First, several LLMs answer the same question on their own. Candidate facts are extracted from that ensemble of answers, and overlapping criteria are merged into one. Then comes a three-way audit.

[Figure: (1) LLM ensemble: multi-model answers; (2) candidate facts: extract and merge; (3) three-way audit: omission, hallucination, deduplication; (4) human review: quality check; (5) rubric: criteria finalized]
▲ Automated rubric generation — 5-stage pipeline (Pebblous original diagram, Fig. re-interpretation) | Source: Benhenda, arXiv:2606.23032 (2026)

Omission detection

Checks that no key information is missing. If a fact that must appear in the answer drops out of the rubric, grading goes slack.

Hallucination check

Filters out facts that don't match the source. If a plausible falsehood a model invented slips into the criteria, the grading itself is contaminated.

Deduplication

Merges criteria that say the same thing in different words. Leftover duplicates score one fact multiple times and skew the result.
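
The merge and audit steps described above can be sketched in a few lines. This is a toy stand-in, not the study's implementation: the article does not say how the audits are performed, so plain string matching replaces whatever the real pipeline uses (presumably LLM calls). The function names, the `min_support` threshold, and the sample answers are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercased words and numbers, so near-identical wordings compare equal."""
    return tuple(re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower()))

def merge_candidates(answers):
    """Deduplication: merge facts with identical tokens, count supporting models."""
    support, wording = Counter(), {}
    for facts in answers.values():
        # The dict comprehension counts each model at most once per fact.
        for key, fact in {tokens(f): f for f in facts}.items():
            wording.setdefault(key, fact)
            support[key] += 1
    return wording, support

def is_grounded(key, source_tokens):
    """Hallucination check (toy): every number in the fact must appear in the source.
    Facts without numbers pass vacuously, a known weakness of this sketch."""
    return all(t in source_tokens for t in key if t[0].isdigit())

def build_rubric(answers, source, min_support=2):
    wording, support = merge_candidates(answers)
    source_tokens = set(tokens(source))
    kept = [k for k in wording if is_grounded(k, source_tokens)]
    # Omission check (toy): facts several models agree on that the rubric lacks.
    flagged = [wording[k] for k, n in support.items()
               if n >= min_support and k not in kept]
    return [wording[k] for k in kept], flagged

# Hypothetical source sentence and model answers, using figures from this article.
source = "Consolidated 2025 revenue was $18.67 billion with a net loss of $4.94 billion."
answers = {
    "model_a": ["2025 revenue was $18.67 billion", "Net loss was $4.94 billion"],
    "model_b": ["2025 Revenue was $18.67 billion.", "Net loss was $5.10 billion"],
    "model_c": ["Net loss was $4.94 billion"],
}
rubric, flagged = build_rubric(answers, source)
print(rubric)
print(flagged)
```

In this example the two wordings of the revenue fact collapse into one criterion, the invented $5.10 billion figure is dropped because it never appears in the source, and nothing is flagged for omission. Anything in `flagged` would be a candidate for the human review stage.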

Only criteria that pass the audit move on to human review, where the final rubric is locked in. Grading does not end in a single pass either. A stage that grades model answers with the generated rubric (the evaluator) and a stage that re-tunes the criteria from the grading results (the optimizer) interlock and repeat. The grading criteria are refined as they cycle through.

Retrieval was reworked too. In place of the existing benchmark's simple chunk search, the study applied contextual retrieval, which raises search accuracy by attaching surrounding context to short chunks. That difference mattered a lot for finding scattered evidence across a long IPO filing.
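
A minimal sketch of why attaching context helps, assuming the simplest possible form of it: prefixing each chunk with its section heading and ranking by word overlap. The study's actual retrieval setup is not described here, and the chunk sentences, section names, and `best_chunk` helper are hypothetical.

```python
import re

def words(text):
    """Set of lowercased words and numbers in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower()))

def best_chunk(query, chunks):
    """Return the chunk sharing the most words with the query (ties: first wins)."""
    q = words(query)
    return max(chunks, key=lambda c: len(q & words(c)))

# Hypothetical chunks built from figures quoted in this article.
sections = {
    "Starlink segment": ["Revenue was $11.4 billion.",
                         "Operating income was $4.4 billion."],
    "xAI segment": ["Spending reached $14 billion.",
                    "Revenue was $3.2 billion."],
}

# Plain chunk search: each chunk is indexed with no surrounding context.
plain = [c for chunks in sections.values() for c in chunks]
# Contextual variant: each chunk carries its section heading as a prefix.
contextual = [f"{h}: {c}" for h, chunks in sections.items() for c in chunks]

query = "xAI revenue"
print(best_chunk(query, plain))
print(best_chunk(query, contextual))
```

Without context, both revenue chunks match the query equally and the Starlink figure is returned simply because it comes first. With the heading attached, the xAI chunk is the only one that matches both query words.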

Mixing answers from several models to build the answer key is deliberate. When the same model both produces the data and grades it, a bias creeps in: the grader tends to score output that resembles its own generously. Using an ensemble of diverse models scatters that bias. The fairness of grading comes from the diversity of the data the grading criteria were built from.

3. Why 930 Questions Stay Hidden

Good benchmarks carry a paradox. The better the criteria, the more you want to share them, and the moment you share them they break. Once questions and answers go public, the next generation of models sucks them in as training data. Then a model earns its score by replaying a memorized answer rather than by actually analyzing. This is benchmark contamination.

There is a close precedent. On SWE-bench, which measures code-fixing ability, OpenAI found signs that some models were copying the reference patch verbatim. It ended up halting those score reports. The benchmark was measuring memorization instead of the ability it set out to measure. The scores on a public benchmark lose trust as time passes.
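
One simple way to look for verbatim replay is to measure how many of the reference's word n-grams reappear in a model's output. The sketch below is a generic illustration of that idea, not the method OpenAI used; the function names, the choice of 5-grams, and the sample strings are assumptions.

```python
def ngrams(text, n=5):
    """All runs of n consecutive whitespace-separated tokens."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def verbatim_overlap(output, reference, n=5):
    """Share of the reference's n-grams that also occur in the output."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    return len(ngrams(output, n) & ref) / len(ref)

# Hypothetical strings for illustration.
reference = "return the cached value if the key is present else compute and store it"
copied = reference
independent = "look up the key first and fall back to computing the value when missing"

print(verbatim_overlap(copied, reference))
print(verbatim_overlap(independent, reference))
```

A score near 1.0 suggests the output reproduces the reference almost word for word, while independently written answers tend to score near 0.0. A high score is a signal worth investigating, not proof of contamination.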

This study's fix is simple. Only 70 of the 1,000 questions are released as examples, and the official ranking is tallied solely on the private 930. Because a model cannot see the answers in advance, its score moves closer to the result of actually reading and analyzing the filing, not to a memorized answer. The locked 930 hold up the trustworthiness of this benchmark.
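
The split can be sketched as follows. How the study chose which 70 questions to release is not stated, so the seeded random draw here is an assumption, as are the function names and the integer question IDs.

```python
import random

def split_benchmark(question_ids, n_public=70, seed=0):
    """Draw a reproducible public sample; everything else stays private."""
    rng = random.Random(seed)
    public = set(rng.sample(question_ids, n_public))
    private = [q for q in question_ids if q not in public]
    return sorted(public), private

def official_score(results, private_ids):
    """Accuracy over the private questions only; public ones never count."""
    return sum(results[q] for q in private_ids) / len(private_ids)

question_ids = list(range(1, 1001))
public_ids, private_ids = split_benchmark(question_ids)
print(len(public_ids), len(private_ids))
```

Because `official_score` reads only the private IDs, a model that has memorized the 70 public answers gains nothing on the leaderboard.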

[Figure: SpaceX S-1, 1,000 total questions for IPO due diligence. Public: 70. Private: 930 (locked), used for the official ranking to prevent model training.]
▲ 70 released, 930 locked — Pebblous original diagram | Source: Benhenda, arXiv:2606.23032 (2026)

One thing becomes clear here. Well-made grading data is a scarce asset whose value evaporates the moment it goes public. The very act of locking away the 930 is evidence that the data has worth. Evaluation criteria are not a public good to be released for free, but an asset to be defended from contamination.

4. Qwen Beat Gemini

The grading results cut against expectations. The top score did not go to the most expensive model. Alibaba's Qwen 3.7 Max came first at 79.4%, at a cost of $0.30 per query. Xiaomi's MiMo-2.5 Pro followed close behind at 76.8%, for just $0.05 per query.

The benchmark it was compared against, Finance Agent v2, topped out with Gemini 3.5 Flash at 57.9% and a cost of $2.51 per query. Qwen 3.7 Max cleared that ceiling by 21.5 percentage points. MiMo-2.5 Pro cleared it by 18.9 points at roughly one-fiftieth of Gemini's cost per query.
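
The gaps follow directly from the three accuracy and cost figures quoted above. The short script below just redoes that arithmetic; the dictionary layout is illustrative.

```python
# (accuracy in %, cost per query in USD), as quoted in this article.
models = {
    "Qwen 3.7 Max": (79.4, 0.30),
    "MiMo-2.5 Pro": (76.8, 0.05),
    "Gemini 3.5 Flash": (57.9, 2.51),
}

base_acc, base_cost = models["Gemini 3.5 Flash"]

# For each model: accuracy gain in percentage points, and how many times
# more the baseline costs per query.
comparison = {
    name: (round(acc - base_acc, 1), round(base_cost / cost, 1))
    for name, (acc, cost) in models.items()
}

for name, (gain_pp, cost_ratio) in comparison.items():
    print(f"{name}: {gain_pp:+.1f} pp, baseline costs {cost_ratio:.1f}x as much per query")
```

Qwen 3.7 Max comes out 21.5 points ahead with the baseline costing about 8.4 times as much per query. MiMo-2.5 Pro comes out 18.9 points ahead with the baseline costing about 50.2 times as much.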

[Figure: Model accuracy on IPO Finance Agent, with the Finance Agent prior ceiling marked]
Qwen 3.7 Max (Alibaba): 79.4%, $0.30 per query
MiMo-2.5 Pro (Xiaomi, best value): 76.8%, $0.05 per query
Gemini 3.5 Flash (Google, prior best): 57.9%, $2.51 per query
MiMo-2.5 Pro: higher accuracy at ~1/50 the cost of Gemini
▲ Model accuracy and cost per query on IPO Finance Agent — Pebblous original diagram | Source: Benhenda, arXiv:2606.23032 (2026)

What this reversal signals is clear. Skill as a financial analyst was decided not by a model's size or price, but by the retrieval architecture that finds evidence in long documents and the quality of the criteria that grade the answers. The same model can score differently depending on how it searches and what it is graded against.

What flipped the scoreboard was not a bigger model but better data design. Context-preserving inputs at the retrieval stage and criteria refined through automated generation worked together. The arena of model competition is shifting from parameter count to data and evaluation design.

5. Trust the Grading Data to Trust the AI

This story is not only about financial analysis. The same question follows everywhere agents start standing in for real work. Who, and with what, grades whether the model's result is right? If the grading criteria are slack, every score and ranking built on top of them stands on sand.

So the trustworthiness of evaluation starts not with the model but with the grading data. If a key fact is missing from the criteria (omission), if a plausible falsehood is mixed in (hallucination), if the same thing is measured several times (duplication), the score fails to point at ability. That is exactly why this study audited all three explicitly. Without managing the quality of the answer key, the grading results cannot be trusted either.
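
A small worked example shows how a single duplicate criterion moves a score. The scoring rule here (fraction of criteria met, with a criterion counted as met when all its words and numbers appear in the answer) is an assumption chosen for simplicity, not the study's evaluator. The rubric entries and the answer are hypothetical.

```python
import re

def tokens(text):
    """Set of lowercased words and numbers in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower()))

def rubric_score(answer, rubric):
    """Fraction of criteria whose every token appears in the answer (toy matcher)."""
    have = tokens(answer)
    met = sum(tokens(criterion) <= have for criterion in rubric)
    return met / len(rubric)

answer = "Consolidated revenue was $18.67 billion in 2025."

clean = ["revenue $18.67 billion", "net loss $4.94 billion"]
# Same rubric plus a reworded copy of the first criterion.
with_duplicate = clean + ["revenue was $18.67 billion"]

print(rubric_score(answer, clean))
print(rubric_score(answer, with_duplicate))
```

The answer covers one of two distinct facts, so the clean rubric gives 0.5. With the duplicate left in, the same answer scores about 0.67, because the revenue fact is counted twice.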

The "publish it and it gets contaminated" paradox runs along the same line. The evaluation criteria are a data asset, and an unmanaged asset rots. Auditing for omissions, hallucinations, and duplicates, and locking part of it away to prevent contamination, are all matters of data quality. The question of whether you can trust an AI loops back, in the end, to whether you can trust the data that graded it.

Editor's Note. The challenge Pebblous meets in working on data quality touches the same point: before you grow a model, you have to fix the data that grades it. Securing the reliability of the evaluation criteria themselves is the next data problem AI governance has to solve.

Thank you for reading. Each time you meet a result from an AI, the habit of also asking "what graded that score?" is what will let you tell a good-looking report card apart from a trustworthy one. If you have thoughts or counterarguments on this topic, we would love to hear them.

Pebblous Data Communication Team
June 24, 2026