Executive Summary

A meta-analysis gathers hundreds of scattered studies and compresses them into a single conclusion. The first step is sifting through an enormous pile of literature to find the papers that fit the question, and for a long time that work needed human hands. A benchmark released in June 2026 tested whether AI can take over that step, using 442 expert meta-analyses as the answer key. This article reads the results through the lens of data quality.

The headline first: AI pulled 90.9% of the correct papers out of a pool of 140,000. Search is all but solved. Yet of the papers that actually belonged in the meta-analysis, it selected only 52.7%. The failure was not in finding them. It was in filtering out the "plausible" papers — the ones close in topic but off on the criteria.

So AI's bottleneck turns out to be selection, not search. The hard part is judging which data fits the criteria, a problem of labeling and standards. Put another way: where search ends, data quality begins.

Key Figures

Source: Xie et al., MetaSyn benchmark, arXiv:2606.17041 (2026)

The four numbers below trace the same pipeline from its entrance to its exit. Between the ceiling search reaches (90.9%) and the floor screening actually recovers (52.7%) sits a 38-point gap, and the cause of that gap is compressed into the ratio of traps buried in a single query (16 to 184).

90.9%

Search recall

Share of correct papers the fine-tuned MA-Retriever pulled into its top 200

52.7%

Screening recall

Best system's rate of selecting the truly included papers from the retrieved pool

38pp

Search-to-screening gap

Distance between the ceiling search reaches and what screening actually recovers

16 vs 184

Traps per query

~16 papers to include, sitting beside ~184 look-alikes that miss only on the criteria

1

What MetaSyn Tested

Built by researchers at Tsinghua University, MetaSyn is a benchmark that measures how far an LLM agent can take over the work of a meta-analysis. Its raw material is 442 expert-curated meta-analyses published in Nature Portfolio journals. Each one's set of included papers becomes the answer key, and the test asks whether AI can find and pick those same papers again.

The corpus to search is a body of 140,585 PubMed papers. Only 8,674 of them actually belong; the remaining 131,911 are "hard negatives" — papers that look alike in topic but miss on the criteria, the tricky traps. The problem AI has to solve is to pull only the real matches out of that vast pile.
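The corpus figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this article; the constant names are illustrative and do not come from the paper.

```python
# Corpus figures as quoted in this article (names are illustrative).
CORPUS_SIZE = 140_585   # PubMed papers in the search corpus
INCLUDED = 8_674        # papers that belong to some meta-analysis

# Everything that is not an included paper is a hard negative.
hard_negatives = CORPUS_SIZE - INCLUDED

# Share of the corpus that is actually a correct answer.
included_share = INCLUDED / CORPUS_SIZE

print(f"hard negatives: {hard_negatives:,}")
print(f"included share: {included_share:.1%}")
```

Only about one paper in sixteen across the corpus is a real match, which is the imbalance every later stage inherits.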

What sets MetaSyn apart is that it measures the pipeline stage by stage. Because retrieval and screening are scored separately, the bottleneck that a single overall score would have hidden becomes visible. Nine RAG variants plus one protocol agent — twelve configurations in all — were compared on the same yardstick.
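To make stage-by-stage scoring concrete, here is a minimal sketch of how one query could be scored at both stages. It is not MetaSyn's evaluation code: the function name `score_stages`, the toy paper IDs, and the choice to measure screening recall against all gold papers are assumptions made for illustration.

```python
from typing import Dict, Sequence, Set


def score_stages(
    ranked_ids: Sequence[str],
    selected_ids: Set[str],
    gold_ids: Set[str],
    k: int,
) -> Dict[str, float]:
    """Score retrieval and screening separately for a single query.

    ranked_ids:   retriever output, best match first
    selected_ids: papers the screener decided to include
    gold_ids:     papers the expert meta-analysis actually included
    k:            size of the pool handed from search to screening
    """
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")

    pool = set(ranked_ids[:k])
    found = gold_ids & pool      # gold papers that search surfaced
    kept = selected_ids & pool   # screening can only keep what search handed over
    correct = kept & gold_ids    # kept papers that are truly included

    return {
        "search_recall": len(found) / len(gold_ids),
        "screening_recall": len(correct) / len(gold_ids),
        "screening_precision": len(correct) / len(kept) if kept else 0.0,
    }


# Toy query: four gold papers, one of them ranked below the cutoff.
toy_scores = score_stages(
    ranked_ids=["a", "x", "b", "y", "c", "z", "d"],
    selected_ids={"a", "x", "b"},
    gold_ids={"a", "b", "c", "d"},
    k=6,
)
print(toy_scores)
```

In the toy query, search surfaces three of four gold papers while screening keeps only two, so the two stages report different numbers that a single combined score would blur together.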

▲ Figure: MetaSyn two-stage pipeline. PubMed corpus (140,585 papers) → Search (90.9% recall) → Screen (52.7% recall); search-to-screening gap: 38pp | Original diagram (MetaSyn pipeline reinterpretation) | Source: Xie et al., arXiv:2606.17041

Fold search and screening into one score and you get "AI does about half of a meta-analysis," and there it ends. Split the two stages and the story changes. One was nearly finished; the other was the thing holding everything back.

2

Search Reached 90.9%

Start with the search stage. The job is to narrow a 140,000-paper pile down to a few hundred candidates, and here AI did well. Performance varied sharply by search method, and MA-Retriever — a dense retriever fine-tuned on meta-analysis data — went the furthest.

Retriever | Recall @ top 100 | Recall @ top 200
BM25 (keyword) | 65.4% | 77.0%
Dense (BGE) | 78.2% | 86.8%
MA-Retriever (fine-tuned) | 83.7% | 90.9%

Widen the net to the top 200 and 90.9% of the correct papers fall inside it, 13.9 points higher than keyword search (BM25), with even larger gains on big meta-analyses that include more than 50 papers. In other words, AI can already locate nearly all of the right answers.
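The point gaps quoted here follow directly from the retriever table. The short sketch below recomputes them; the dictionary layout and helper names are illustrative, and the values are copied from the table rather than re-measured.

```python
# Recall figures (in percent) copied from the retriever table above.
recall_table = {
    "BM25": {100: 65.4, 200: 77.0},
    "Dense (BGE)": {100: 78.2, 200: 86.8},
    "MA-Retriever": {100: 83.7, 200: 90.9},
}


def gain_over(baseline: str, system: str, k: int) -> float:
    """Percentage-point gain of `system` over `baseline` at cutoff k."""
    return round(recall_table[system][k] - recall_table[baseline][k], 1)


def gain_from_widening(system: str) -> float:
    """Percentage-point gain from widening the cutoff from 100 to 200."""
    return round(recall_table[system][200] - recall_table[system][100], 1)


print(gain_over("BM25", "MA-Retriever", 200))
for name in recall_table:
    print(name, gain_from_widening(name))
```

Widening the cutoff helps every retriever, and it helps the weakest one most, which is why the lead over BM25 is smaller at the top 200 than at the top 100.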

That 90.9% matters for a separate reason. Any paper that search misses is never recovered by a later stage. Search recall is the ceiling for the whole pipeline, and if that ceiling sits at 90.9%, then no amount of perfect screening downstream can rise above it. The trouble is that screening did not even come close to this ceiling.
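The ceiling argument can be written as a simple product. The sketch below assumes, as the 38-point comparison in this article implies, that the 52.7% screening recall is measured against all included papers rather than only those retrieved; `end_to_end_recall` and `implied_kept_fraction` are illustrative names, not quantities reported in the paper.

```python
SEARCH_RECALL = 0.909     # share of included papers that reach the pool
SCREENING_RECALL = 0.527  # assumed to be measured against all included papers


def end_to_end_recall(search_recall: float, kept_fraction: float) -> float:
    """Recall after both stages.

    kept_fraction is the share of retrieved, truly included papers that
    screening keeps. A paper that search missed can never be kept.
    """
    if not (0.0 <= search_recall <= 1.0 and 0.0 <= kept_fraction <= 1.0):
        raise ValueError("both inputs must lie in [0, 1]")
    return search_recall * kept_fraction


# Even a perfect screener cannot exceed the search ceiling.
perfect_screening = end_to_end_recall(SEARCH_RECALL, 1.0)

# Under the assumption above, the share of retrieved gold papers that survived screening.
implied_kept_fraction = SCREENING_RECALL / SEARCH_RECALL

print(f"ceiling with perfect screening: {perfect_screening:.1%}")
print(f"implied kept fraction: {implied_kept_fraction:.1%}")
```

Read this way, screening discarded a large share of the correct papers that search had already placed in front of it.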

3

Screening Stalled at 52.7%

Screening is the stage that decides, within the pool search hands over, whether "this paper truly belongs." Here the numbers collapse. Even the best-performing system selected only 52.7% of the papers that should have been included, a drop of 38 points from the 90.9% search ceiling.

System | Screening recall | Screening precision
RAG (GLM-5) + MA-Retriever | 52.7% | 26.6%
RAG (GPT-5) + BM25 | 42.5% | 36.1%
ProtoMA + MA-Retriever | 35.6% | 55.5%

GLM-5, which pushed screening recall highest, has the lowest precision at 26.6%. It caught more of the right answers by including more, but at that precision roughly three of every four papers it selected were wrong. ProtoMA sits at the other end: the highest precision at 55.5%, but recall stops at 35.6%. It chooses carefully and accurately, yet misses many papers. No system managed to hold both recall and precision high at once.
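One way to see the trade-off is to turn each precision figure into "wrong papers per right paper" and to fold recall and precision into a single F1 score. Neither derived number is quoted in this article; they are computed here from the table values purely as an illustration, and the helper names are hypothetical.

```python
# Recall and precision copied from the screening table above.
systems = {
    "RAG (GLM-5) + MA-Retriever": {"recall": 0.527, "precision": 0.266},
    "RAG (GPT-5) + BM25": {"recall": 0.425, "precision": 0.361},
    "ProtoMA + MA-Retriever": {"recall": 0.356, "precision": 0.555},
}


def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (derived here, not taken from the article)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def wrong_per_right(precision: float) -> float:
    """How many wrongly selected papers accompany each correct one."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    return (1 - precision) / precision


for name, s in systems.items():
    print(
        f"{name}: F1={f1(s['recall'], s['precision']):.3f}, "
        f"wrong per right={wrong_per_right(s['precision']):.2f}"
    )
```

The harmonic mean punishes a lopsided pair, so a system cannot score well on it by maximizing one number at the expense of the other.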

▲ Figure: Recall and precision across three screening systems (GLM-5, GPT-5, ProtoMA) | Original diagram (Table 2 reinterpretation) | Source: Xie et al., arXiv:2606.17041

The most paradoxical result came from GPT-5. Swapping in a better retriever actually lowered its screening recall (42.5% → 31.7%). The sharper search becomes, the more tightly the pool fills with surface-plausible look-alikes, and that density made screening harder. Search's success came back as screening's burden.

This wall is not unique to MetaSyn. Earlier work on AI literature screening reported the same asymmetry. One systematic review found that even as AI screening tools push recall above 90%, precision stays around 20%. The more you try to catch every correct paper, the more wrong ones flood in — a structure that looks less like one model's weakness and more like a property of the screening task itself.

The better the search, the harder the screening. That one line names what the gap really is. The problem is not failing to find the right answers, but telling them apart from the wrong ones lined up right beside them.

4

PI/ECO: Four Judgments at Once

Why is screening this hard? Whether a paper goes into a meta-analysis is decided by four criteria, known together as PI/ECO, that must all be satisfied at once: the study Population, the Intervention or Exposure, the Comparison group, and the Outcome measure. If even one of the four gates fails, the paper is out.

  • P — Population: the same drug studied in children rather than adults is a different study.
  • I/E — Intervention/Exposure: for the same condition, a different dose or method of treatment is not a comparable case.
  • C — Comparison: whether the comparison is against a placebo or against another drug changes whether it is included.
  • O — Outcome: if the measured outcome differs from the one the meta-analysis is examining, the paper is excluded.
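The four gates above amount to a logical AND. Here is a minimal sketch of that rule, assuming each criterion has already been judged as pass or fail; the class `PiecoJudgment` and its field names are hypothetical, and the hard part in practice is producing those four booleans, not combining them.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class PiecoJudgment:
    """Pass/fail verdicts for one candidate paper (illustrative structure)."""

    population: bool
    intervention_or_exposure: bool
    comparison: bool
    outcome: bool

    def failed_criteria(self) -> List[str]:
        """Names of the criteria this paper misses."""
        return [name for name, ok in asdict(self).items() if not ok]

    @property
    def is_eligible(self) -> bool:
        """A paper is included only if all four gates pass."""
        return not self.failed_criteria()

    @property
    def is_near_miss(self) -> bool:
        """A look-alike that fails exactly one criterion."""
        return len(self.failed_criteria()) == 1


# Same drug, same comparison, same outcome, but studied in children.
pediatric_trial = PiecoJudgment(
    population=False,
    intervention_or_exposure=True,
    comparison=True,
    outcome=True,
)
print(pediatric_trial.is_eligible, pediatric_trial.failed_criteria())
```

A near miss passes three of four gates, which is exactly why it sits so close to the real answers in topic space.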

In one antidepressant meta-analysis, 33 of the 40 relevant papers were retrievable from the PubMed corpus. The trouble was that around them sat a heap of "near-miss" papers that differed on just one of population, comparison, or study design. Identical by topic, wrong on a single criterion. Today's LLMs are weak at applying these four criteria together in a single judgment.

Put the per-query environment in numbers and it becomes plain. The papers to include average about 16, but beside them lie roughly 184 look-alikes that miss on one of the PI/ECO criteria. The traps outnumber the real papers roughly eleven to one. The 52.7% wall grows out of that asymmetry.
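The per-query asymmetry can be spelled out as a base rate. The sketch below uses the approximate averages quoted above; the "accept everything" baseline is an illustrative thought experiment, not a system evaluated in the benchmark.

```python
# Approximate per-query averages quoted in this article.
ELIGIBLE = 16      # papers that should be included
LOOKALIKES = 184   # papers that miss on one of the PI/ECO criteria

pool_size = ELIGIBLE + LOOKALIKES
base_rate = ELIGIBLE / pool_size          # chance a random pool paper is eligible
traps_per_eligible = LOOKALIKES / ELIGIBLE

# A screener that accepts the whole pool keeps every eligible paper in it,
# but its precision collapses to the base rate.
accept_all_precision = base_rate

print(f"pool size: {pool_size}")
print(f"base rate: {base_rate:.0%}")
print(f"traps per eligible paper: {traps_per_eligible:.1f}")
```

Any screener's precision has to be read against that 8% floor: selecting papers blindly from the pool would already land there.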

▲ Figure: Per-query average composition. ~16 eligible papers (8%) vs ~184 PI/ECO-failing look-alikes (92%), 11× more traps | Original diagram (MetaSyn query composition) | Source: Xie et al., arXiv:2606.17041
5

Where Search Ends

Step back from the result and the picture is clear: AI's limit was not in finding information. Search, at 90.9%, is already impressive. What stalled was the stage that judges which of the retrieved candidates fit the criteria. The problem lies not in the search engine but in judging against a standard.

And judging against a standard is, in the end, a data-quality problem. The question "does this paper fit PI/ECO?" has the same shape as the question "does this record fit the schema, the validity, the context?" in a data pipeline. Only the names differ; both are the labeling work of telling what counts as "data that meets the criteria."

Gathering data keeps getting easier, and AI is good at it. The hard part is the judgment of which of the gathered items meet the criteria. The 38-point gap MetaSyn showed is exactly the gap that opens at that point of judgment. The place where humans must stay in the loop also shifts — not in front of search, but at the seat where the screening criteria are defined.

Editor's Note — The "data quality" Pebblous works on in DataClinic has the same structure as this screening-criteria problem. Collection is something AI already does well. What remains is the labeling and the standards that judge which data meets the criteria. What this paper calls PI/ECO, we simply call schema, validity, and contextual fit.

R

References

Academic

Industry & Press