Executive Summary

Drug discovery AI tends to rate a candidate's activity higher than it really is. The cause is not the noise or errors that usually get blamed, but the things that never made it into the data at all. Experiments that went poorly, compounds that failed to bind, attempts that came to nothing — most of these never become papers. AI grew up seeing only the filtered successes, and it inherited a map on which promising candidates look far more common than they actually are. Let's follow, through the lens of data, how those blank spaces distort what the model predicts.

One number shows the scale. By one analysis, roughly 60% of negative experimental results in drug research disappear from the public record. When an AI pipeline learns from a literature tilted this way, the bias is not washed out but amplified. In a three-stage system that automatically links retrieval, generation, and evaluation, one estimate puts the growth of that distortion at about 2.18 times.

What matters most is that this flaw is not fixed by cleaning the data. Data that does not exist cannot be scrubbed or polished into being. So the question we ask of AI-Ready data has to change as well. Not "is this data clean," but "does this data include the failures too."

Key Figures

The size of the problem compresses into four numbers: the share of negative results that vanish, the multiplier by which AI pipelines inflate the bias, the active-to-inactive imbalance in screening data, and how prediction accuracy changes once failure data is added.

Sources: arXiv 2606.04220 (2026), ChemDiv Datasets

~60% | Negative results lost | Share of failed experiments missing from the public record

2.18× | Bias amplification | Multiplier by which a three-stage AI pipeline inflates literature bias

50:1–1000:1 | Active:inactive imbalance | Degree to which successes are over-represented in screening data

0.35 → 0.80 | Accuracy change | Toxicity prediction accuracy after adding failure data (hERG case)

1. The Weight of Unpublished Experiments

Most of what gets tried in a lab ends in failure. A candidate does not bind to its target protein, the expected reaction never appears, toxicity gets in the way. These outcomes are a normal part of science. And yet those failures rarely become papers. Journals favor new and positive findings, and a researcher's time and budget go toward chasing what works rather than recording what does not.

For an individual lab, this choice is rational. The problem surfaces when tens of thousands of such choices pile up into a single body of literature. What remains in the world is a record of successes, while failures stay quietly in the drawer. This phenomenon is called publication bias. One analysis estimates that about 60% of negative results in drug research never make it onto the public record. More than half of the failures are invisible from the start.

▲ Figure: How publication bias distorts AI training data (experiment results, success ~40% and failure ~60% → published papers and literature → AI training data biased toward success; unpublished results stay hidden in the file drawer). Pebblous original | Source: arXiv 2606.04220

When a human reads papers, they account for this bias to some degree. Experience teaches them that what gets published is not everything. AI, however, takes what it is given as the whole world. Train a model on a corpus tilted toward successes, and it mistakes that tilt for the actual shape of reality. It inherits, intact, a map on which promising candidates look more common than they are.
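The size of that tilt can be worked out with a little arithmetic. The sketch below is illustrative only: it assumes every success is published and that a fixed share of failures is lost, and the true success rates it loops over are hypothetical values, not figures from the cited analysis. The function name `observed_success_rate` is likewise made up for this example.

```python
def observed_success_rate(true_rate: float, negatives_lost: float) -> float:
    """Share of successes in the published record when some failures vanish.

    Assumption: every success is published; a fraction `negatives_lost`
    of failures never reaches the record.
    """
    if not 0.0 <= true_rate <= 1.0 or not 0.0 <= negatives_lost <= 1.0:
        raise ValueError("both arguments must be between 0 and 1")
    published_failures = (1.0 - true_rate) * (1.0 - negatives_lost)
    total_published = true_rate + published_failures
    if total_published == 0.0:
        raise ValueError("nothing is published under these settings")
    return true_rate / total_published


# Hypothetical true success rates; 0.60 is the loss share quoted in the text.
for true_rate in (0.01, 0.10, 0.40):
    seen = observed_success_rate(true_rate, negatives_lost=0.60)
    print(f"true rate {true_rate:.0%} -> published record shows {seen:.1%}")
```

Under these assumptions, a true success rate of 10% shows up in the record as roughly 22%. A model trained only on the record has no way to recover the original 10%.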

Publication bias is not a problem of the data being wrong. Each surviving record of a success is, on the whole, accurate. The problem is that the record of failure, which should sit right beside it, is missing entirely. It is like a student who studies from a workbook that keeps only the correct answers and comes away underestimating how hard the exam will be.

2. The Illusion of an AI That Only Saw Success

How much the missing failures shake prediction becomes clear when you look at the balance of the data. In high-throughput screening data, the ratio of active to inactive compounds stretches from roughly 50:1 to 1000:1. In reality, most of what was tried is inactive, yet the record keeps the actives thickly stacked and the inactives thin. When the imbalance is this severe, a model can assign every candidate to the majority class and still score high on accuracy. Accuracy stops carrying any signal about whether the model can tell the two classes apart.
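A quick calculation shows why accuracy breaks down at these ratios. The sketch below scores a trivial model that always predicts the more common class, using the 50:1 and 1000:1 ratios from the text. The function name `majority_baseline` is hypothetical, and balanced accuracy (the mean of per-class recall) is brought in here only as a contrast; it is not a metric the source reports.

```python
def majority_baseline(n_majority: int, n_minority: int) -> dict:
    """Score a model that always predicts the majority class."""
    if n_majority <= 0 or n_minority <= 0:
        raise ValueError("both class counts must be positive")
    total = n_majority + n_minority
    # Every majority example is right, every minority example is wrong.
    accuracy = n_majority / total
    # Recall is 1.0 on the majority class and 0.0 on the minority class,
    # so the mean per-class recall is 0.5 regardless of the ratio.
    balanced_accuracy = (1.0 + 0.0) / 2
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy}


for ratio in (50, 1000):
    scores = majority_baseline(n_majority=ratio, n_minority=1)
    print(f"{ratio}:1 -> accuracy {scores['accuracy']:.3f}, "
          f"balanced accuracy {scores['balanced_accuracy']:.1f}")
```

The model has learned nothing, yet its accuracy is about 0.98 at 50:1 and above 0.99 at 1000:1. Balanced accuracy stays at 0.5, which is what a coin flip would earn.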

▲ Figure: Active/inactive imbalance in screening training data (published active compounds over-represented, unpublished inactive compounds missing; 50:1 to 1,000:1 active:inactive). Pebblous original | Source: ChemDiv Datasets

The distortion also runs the other way. AI draws the density of active compounds higher than it should across chemical space, while underestimating activity in the unfamiliar regions of molecules that do not resemble known drugs. The irony lands exactly where generative drug discovery needs it least: in the chemical space no one has explored, the model misses by the widest margin. In toxicity prediction, positive cases are so rare that the model leans toward calling things safe, and it lets rare but lethal toxicity slip through as false negatives.

There is a concrete case of what changes when failure data is added. The compound supplier ChemDiv, noting that public databases overwhelmingly hold only active compounds while unsuccessful experiments go unpublished, assembled thirty years of accumulated in-house experimental data that includes both successes and failures. In a case predicting hERG inhibition toxicity, accuracy rose from 0.35 to 0.80, and Cohen's kappa, which measures how far predictions agree with the true labels beyond what chance alone would produce, jumped from 0.044 to 0.565. What changed was not the size of the model but the single fact that failures had entered the data.
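For readers unfamiliar with Cohen's kappa, the sketch below computes it next to accuracy from a binary confusion matrix, using the standard definition: observed agreement minus chance agreement, divided by one minus chance agreement. The counts in the examples are hypothetical and are not ChemDiv's data; the function name `accuracy_and_kappa` is also an assumption made for illustration.

```python
def accuracy_and_kappa(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for a binary confusion matrix."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    observed = (tp + tn) / total
    # Chance agreement: for each class, the probability that prediction
    # and truth land on it independently, given the marginal frequencies.
    chance_pos = ((tp + fp) / total) * ((tp + fn) / total)
    chance_neg = ((fn + tn) / total) * ((fp + tn) / total)
    expected = chance_pos + chance_neg
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical case 1: the model calls everything negative on rare positives.
print(accuracy_and_kappa(tp=0, fp=0, fn=10, tn=990))
# Hypothetical case 2: a balanced set where the model discriminates.
print(accuracy_and_kappa(tp=40, fp=10, fn=10, tn=40))
```

In the first case accuracy is 0.99 while kappa is 0, because all of the agreement is what chance would give. In the second, accuracy is 0.8 and kappa is 0.6. This is why a kappa near zero, such as the 0.044 reported before failure data was added, signals a model that is barely better than guessing.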

▲ Figure: hERG toxicity prediction before and after adding failure data (accuracy 0.35 → 0.80; Cohen's κ 0.044 → 0.565). Pebblous original | Source: ChemDiv Datasets

The problem grows when an automated pipeline inherits the same bias. A three-stage AI system that gathers literature through retrieval, generates hypotheses on top of it, and then evaluates them automatically adds a little bias at each step. One analysis estimates that this accumulation grows the original bias by about 2.18 times. An AI scientist trained on biased literature accelerates science, and accelerates science's blind spots right along with it.
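To get a feel for what "a little bias at each step" could mean, the sketch below splits the 2.18 figure evenly across the three stages. This is a simplification introduced here, not the method of the cited analysis: it assumes the bias compounds multiplicatively and that each stage contributes the same factor.

```python
TOTAL_AMPLIFICATION = 2.18  # overall figure quoted in the text
STAGES = ("retrieval", "generation", "evaluation")

# Assumption: equal multiplicative contribution from each stage.
per_stage = TOTAL_AMPLIFICATION ** (1 / len(STAGES))
print(f"per-stage factor under this assumption: {per_stage:.3f}")

bias = 1.0  # bias of the source literature, normalized to 1
for stage in STAGES:
    bias *= per_stage
    print(f"after {stage}: {bias:.2f}x the original bias")
```

Under that assumption each stage needs to inflate the bias by only about 30% for the pipeline as a whole to more than double it. The real per-stage contributions may well be uneven.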

3. A Flaw That Cleaning Cannot Fix

Here lies a common misunderstanding. Told there is a problem with the data, many teams assume the answer is to scrub the data cleaner: remove duplicates, standardize formats, filter out outliers. But this flaw sits beyond the reach of such cleaning. Cleaning refines data that already exists, and the failed experiments were never in the dataset to begin with. What is absent cannot be washed into existence.

So it helps to name the problem precisely. This is not a problem of dirtiness but of representation. If data is contaminated, you can purify it; but if data holds only one side of the world, purification cannot summon the other side. The absence of failure is itself a structural flaw, and this flaw is made not at the stage of handling data but at the stage of collecting it.

▲ Figure: Collection design, traditional vs representative. Traditional: record successes only → success-only training data → overestimation and clinical failure risk. Representative (OpenBind): record successes and failures → training data that includes failures → model learns real boundaries. Pebblous original | Ref: Oxford OpenBind (2026)

If the flaw arises in collection design, the fix has to be found in collection design too. OpenBind, released in May 2026 by a consortium led by the University of Oxford, shows that direction. Backed by £8 million in UK government funding, the project aims to release, over five years, more than 500,000 structures of protein–drug complexes — roughly twenty times the volume of data available today. Its first release paired 699 X-ray structures of a single viral target protein with 601 binding-strength measurements.

The point is not the scale but the method. OpenBind records not only which compounds bound but how strongly each one bound, under a standardized experimental design. Capture what binds and how strongly, systematically from the start, and the cases that bind poorly remain in the data as well. The researchers leading the project say the biggest bottleneck in AI drug discovery is the shortage of reliable, large-scale experimental data showing how molecules bind to proteins. Instead of making the model bigger, OpenBind re-lays the very ground the model learns from.

Just as AlphaFold2 leapt forward on top of decades of accumulated protein structure data, a model's ceiling is ultimately set by the design of its data. Deciding at collection time to include failures, and cleaning up finished data afterward, are tasks on entirely different layers. It is the former decision that draws the boundary of what a model can ever reach.

4. Rethinking What AI-Ready Means

This story is not confined to drug discovery. The structure in which successes are recorded and failures quietly vanish exists in every field. In hiring, only the people who got the job remain as data; in credit assessment, only approved cases are observed; in equipment maintenance, only the normal readings before a breakdown accumulate. Models trained on data gathered this way lean toward optimism, each in its own way. The absence of failure is not an accident of any one domain but a general trap in how data captures the world.

This is why, when Pebblous talks about data quality, it puts representativeness first. A biased sample builds a biased model, and that bias hides well behind a number called accuracy. So when you inspect AI-Ready data, the question to ask shifts. Beyond "are the values non-empty" and "does the format fit," you have to ask whether the data holds even the opposite side of the world it is meant to capture.

In concrete terms, this means checking whether failure cases are explicitly present in the training data, whether the ratio of success to failure reflects the real-world ratio, and whether negative results were excluded at the collection-design stage. This check is not the cleaning you do after all the data is gathered; it begins at the moment you decide what to gather. Just as, in drug discovery, predicted data and measured data were not the same, here too clean data and representative data are not the same.
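The first two checks can be roughed out in a few lines. The sketch below is a minimal illustration, not a Pebblous tool: the function name, the `"success"`/`"failure"` label values, and the tolerance are all hypothetical, and the expected real-world failure share has to come from domain knowledge rather than from the dataset itself. The third check, whether negatives were excluded at collection design, cannot be computed from the data and has to be answered by reviewing the collection protocol.

```python
from collections import Counter


def representativeness_report(labels, expected_failure_share, tolerance=0.10):
    """Check whether failures are present and roughly in proportion.

    labels: iterable of "success" / "failure" strings (hypothetical schema).
    expected_failure_share: real-world share of failures, supplied by the user.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels to inspect")
    failure_share = counts["failure"] / total
    return {
        "has_failures": counts["failure"] > 0,
        "failure_share": failure_share,
        "ratio_plausible": abs(failure_share - expected_failure_share) <= tolerance,
    }


# Hypothetical dataset: 9 successes and 1 failure, where about 60% of
# real-world attempts are expected to fail.
labels = ["success"] * 9 + ["failure"]
print(representativeness_report(labels, expected_failure_share=0.60))
```

In this example the dataset does contain a failure, so the first check passes, but its 10% failure share sits far from the expected 60% and the second check fails. A dataset can pass every format and completeness check and still fail this one.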

Editor's Note. In addressing the conditions for AI-Ready data, Pebblous has particularly emphasized representativeness among the many dimensions of quality. This drug discovery case is one scene showing how that abstract principle turns into concrete loss on the ground of science. The question of whether data includes the failures is, before it is a matter of regulation or ethics, a matter of how honestly a model predicts.

An AI that learned only success draws the world as a place where things work out more often than they do. That optimism is not erased by scrubbing the data cleaner; it is erased by bringing the missing failures back into the data. The next time a model confidently points to a candidate as promising, it is worth asking again whether that confidence is a real signal or an illusion born of never having seen failure. Thank you for reading to the end.

Pebblous Data Communication Team
July 5, 2026

R. References

R.1 Academic Papers

R.2 Industry & Press

R.3 Datasets & Official Documents