Executive Summary
"AI-Ready Data" has long been, in Pebblous's telling, a case to be argued rather than a thing to be measured. Whether, and how far, a dataset was actually ready was usually judged after the fact, by a person with a checklist. A study released in April 2026 moves that judgment seat wholesale. The party doing the scoring shifts from a human to a multi-agent system; the object being scored shifts from a finished, trained model to the dataset about to enter training; and the moment of scoring shifts from after the fact to before the data ever enters the pipeline. For the first time, readiness becomes something you can put a number on.
The scoring runs along a four-dimensional rubric called Sci-TQA²: governance trustworthiness, data quality, AI compatibility, and scientific adaptability. The system reads roughly 80 heterogeneous datasets across six fields — from astronomy to socio-economics — and writes, on its own, whatever analysis tool each dataset needs to be scored. Overall evaluation success reached 89.0%, and human experts rated the accuracy of that scoring at 4.15 out of 5. On the surface, reassuring numbers.
But the moment an agent does the grading, the trust question climbs one level — from the data to the rubric itself. Agreement between the system and human experts came in at ICC 0.742: a hair (0.008) short of the 0.75 threshold usually called "good," useful but not yet fully trusted. And when the loop that lets the system re-check its own scores is removed, success collapses from 89.0% to 33.0%. This report dissects the shift that turned readiness into a scorable object — and then asks the next question it forces: who vouches for the rubric?
| Metric | Value | Note |
|---|---|---|
| Evaluation success rate | 89.0% | Share of datasets the agents actually produced a score for |
| Tool auto-generation success | 97.4% | Writes a bespoke analysis tool per dataset (1.19 attempts on average) |
| Human–system agreement | ICC 0.742 | 0.008 short of the 0.75 "good" line; a ceiling on rubric trust |
| Without the verification loop | 89.0% → 33.0% | Remove Self-Correction and success drops to a third |
Readiness: From Slogan to Score
Any data team has fielded the question at least once: "Is our data actually usable for AI?" Until now, the way to answer it was more or less fixed. An experienced hand opens a checklist, scans for missing values and label errors, and rules the data roughly "usable" or "needs more work." The verdict leaned on human eyes and instinct, and it usually came after the data had already been handled a great deal.
This is precisely the spot Pebblous has returned to on the blog more than once. We wrote that "cleaning your data is only where AI-readiness begins," and we asked, "the model is ready — is your table?" We made the case for why readiness matters. What we left blank was who scores it, and against what standard. A study named SciHorizon-DataEVA (arXiv:2604.26645) fills exactly that blank.
1.1 Three Things Moved at Once
What makes this study interesting is that it changes not one thing but three at the same time. In the act of judging readiness, the who, the what, and the when all move together.
| Axis | The old way | This study's way |
|---|---|---|
| Who — scores it | A person with a checklist | A multi-agent system with a rubric |
| What — is scored | The performance of a trained model | The dataset itself, before training |
| When — it happens | After-the-fact diagnosis | A gate before data ingestion |
The shift in timing carries the most practical weight. Discovering "the data was the problem" only after a model is fully trained is a different order of cost from filtering the data out before it ever enters the pipeline. When readiness scoring moves upstream, the losses bad data would have caused get blocked before training even begins.
Figure 1. Readiness scoring timing compared. Legacy after-the-fact diagnosis (top) vs. SciHorizon-DataEVA's pre-ingestion AI gate (bottom). Original Pebblous diagram.
1.2 The Scale of the Problem Is Already Documented
The numbers already explain why automating readiness scoring feels urgent. In a 2025 Gartner survey, 57% of organizations rated their own data as "not AI-ready," and a separate Gartner survey (Q3 2024, 248 data-management leaders) found 63% still lacked a data-management practice built for AI. The share of a data scientist's time spent preparing data runs, depending on the survey, from 45% up to 80%. Gartner also projected that 60% of AI projects unsupported by AI-ready data would be abandoned by 2026.
Data problems also top the list of reasons AI projects fail. Because every survey measures "failure" differently, the reported rate swings roughly between 70% and 85%.* Whichever figure you take, the direction converges on one point: it is the data, not the model, that trips projects up. Automated readiness scoring reads as an attempt to catch that bottleneck upstream.
* Gartner puts failure attributable to data quality at 85%; RAND (2024, based on interviews with 65 data scientists and engineers) puts outright project abandonment at 80%; McKinsey puts falling short of targets at 70%. The three are not directly comparable, because each defines "failure" differently.
Anatomy of the Sci-TQA² Rubric
To score, you need a scorecard. This system's scorecard is Sci-TQA². The name is an acronym for its four axes, splitting data readiness into governance trustworthiness (T), data quality (Q), AI compatibility (Ac), and scientific adaptability (As). Each dimension then breaks down further into sub-indicators. The diagram below lays out all four dimensions and their sub-indicators at a glance.
Figure 2. The four-dimensional Sci-TQA² rubric and its sub-indicators. Source: SciHorizon-DataEVA (arXiv:2604.26645), redrawn by Pebblous.
2.1 From "Clean" to "Safe to Rely On"
Of the four dimensions, Q (data quality) is familiar. Completeness, accuracy, uniqueness, and consistency are line items from the old data-quality textbook. What's new is the other three — especially T and As. T asks where the data came from, under what license it can be used, and whether it clears ethical review. As looks at whether a model trained on this data generalizes to unfamiliar tasks, how well it holds up in data-scarce regimes, and whether the causal structure is intact.
This expansion is the crux. Traditional data quality asked whether data was "clean." Sci-TQA² adds "can it be trained on" (Ac) and "can it be trusted" (T and As). The very definition of readiness has widened. Its significance lies in bundling into a single scorable rubric the axes that prior work (ML Data Readiness Levels, Datasheets for Datasets, the FAIR principles) had each covered only in part.
The moment T (governance trustworthiness) and As (scientific adaptability) enter the scorecard, readiness widens from a data-engineering problem into a data-governance one. License and provenance, causal completeness — these don't get filled in by scrubbing the data "cleaner." They are items you can only score by looking at the context the data was born in and the purpose it's put to.
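To make the shape of such a scorecard concrete, here is a minimal sketch of the four axes as a data structure. The T, Q, and As sub-indicator names follow the description above; the Ac entries are borrowed from the DataClinic mapping later in this report, and the equal weighting and 0 to 100 scale are assumptions for illustration, not the paper's published aggregation.

```python
# Sub-indicators per axis. Names are illustrative; the Ac entries and the
# equal-weight averaging are assumptions, not taken from the paper.
RUBRIC = {
    "T": ["provenance", "license", "ethical_review"],
    "Q": ["completeness", "accuracy", "uniqueness", "consistency"],
    "Ac": ["class_balance", "feature_importance"],
    "As": ["task_generalization", "scarcity_robustness", "causal_completeness"],
}


def axis_scores(indicator_scores: dict[str, float]) -> dict[str, float]:
    """Average the sub-indicator scores (0-100) within each axis."""
    result = {}
    for axis, indicators in RUBRIC.items():
        missing = [name for name in indicators if name not in indicator_scores]
        if missing:
            raise KeyError(f"axis {axis} is missing indicators: {missing}")
        total = sum(indicator_scores[name] for name in indicators)
        result[axis] = total / len(indicators)
    return result


def readiness_score(indicator_scores: dict[str, float]) -> float:
    """Unweighted mean of the four axis scores."""
    per_axis = axis_scores(indicator_scores)
    return sum(per_axis.values()) / len(per_axis)


# A dataset that is clean and trainable but weak on governance.
example = {name: 100.0 for names in RUBRIC.values() for name in names}
for name in RUBRIC["T"]:
    example[name] = 50.0

print(axis_scores(example))
print(readiness_score(example))
```

The example shows why the wider definition matters: a dataset with perfect Q scores still loses readiness points when its provenance, license, or ethical review is weak.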
How the Agents Do the Scoring
Even with a scorecard in hand, applying it consistently across datasets from six different fields is no small feat. Astronomical observations and socio-economic statistics differ completely in format and structure. No single fixed analysis tool can score both. The system's answer is not to build tools in advance but to generate, on the fly, whatever analysis tool each dataset needs.
3.1 The Ability to Build Its Own Tools
Tool auto-generation succeeded 97.4% of the time, taking on average 1.19 attempts to finish a single tool. In other words, it produces workable analysis code on nearly the first try. This capacity is the backbone of heterogeneous-data scalability: when a dataset from a new field arrives, no human has to hand-write a fresh tool. The system processed roughly 80 datasets spanning six fields (astronomy, biomedicine, earth science, materials, physics, socio-economics) this way, and posted an overall evaluation success rate of 89.0%.
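A generate, validate, retry loop is one plausible way to read the "attempts per tool" figure. The sketch below assumes such a loop; `build_tool`, `fake_generate`, and `fake_validate` are hypothetical stand-ins (the real system would call an LLM and execute the generated code), and averaging attempts over successful tools only is also an assumption.

```python
from typing import Callable, Optional


def build_tool(
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    dataset_id: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Return (tool_source, attempts); tool_source is None if all attempts fail."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = generate(dataset_id, feedback)
        feedback = validate(source)  # None means the tool passed validation
        if feedback is None:
            return source, attempt
    return None, max_attempts


def summarize(results: list[tuple[Optional[str], int]]) -> tuple[float, float]:
    """Success rate, and mean attempts over the tools that were built."""
    successes = [attempts for source, attempts in results if source is not None]
    success_rate = len(successes) / len(results)
    mean_attempts = sum(successes) / len(successes) if successes else float("nan")
    return success_rate, mean_attempts


def fake_generate(dataset_id: str, feedback: Optional[str]) -> str:
    # Stand-in for an LLM call: dataset "b" needs one round of error feedback.
    if dataset_id == "b" and feedback is None:
        return "broken"
    return "ok"


def fake_validate(source: str) -> Optional[str]:
    # Stand-in for running the tool; returns an error message on failure.
    return None if source == "ok" else "tool raised an exception"


results = [build_tool(fake_generate, fake_validate, d) for d in ["a", "b", "c"]]
success_rate, mean_attempts = summarize(results)
print(success_rate, round(mean_attempts, 2))
```

In this toy run, one of three tools needs a second attempt, so the mean is 1.33 attempts. A reported mean of 1.19 would correspond to far fewer retries than that.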
There's a reason for deliberately gathering six fields in one place: scientific data is the harshest test bed for readiness scoring. Awareness of the FAIR principles doubled over a decade, from 40% to 80%, yet actual compliance ranges from single digits to the 40s depending on the field. Reproducibility is in a similar state — in a 2016 Nature survey, 70% of scientists said they had failed to reproduce another lab's experiment. It is a world where data varies wildly in format and in governance maturity alike.
Paradoxically, that unevenness is what raises the experiment's value. If a single rubric can consistently score data this uneven, there is little reason it would fail on tidier enterprise data. Indeed, as an open-data culture took hold in AI and ML, reproduction success climbed from 28% in 2014 to 64% in 2024. Scoring readiness upstream in the pipeline reads as the next step in that same trajectory.
The scores actually assigned run fairly high and fairly even across fields. Below are example readiness scores for representative fields.
| Field | Readiness score |
|---|---|
| Socio-economic | 95.3 |
| Physics | 93.2 |
| Biomedical | 91.4 |
| Earth science | 90.2 |
| Astronomy | 88.6 |
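As a quick check on "fairly high and even," the mean and spread of the five example scores can be computed directly. The values are copied from the table above; nothing else is assumed.

```python
# Example readiness scores copied from the table above (five fields listed).
scores = {
    "Socio-economic": 95.3,
    "Physics": 93.2,
    "Biomedical": 91.4,
    "Earth science": 90.2,
    "Astronomy": 88.6,
}

mean_score = sum(scores.values()) / len(scores)
spread = max(scores.values()) - min(scores.values())

print(f"mean = {mean_score:.2f}, spread = {spread:.1f} points")
# prints: mean = 91.74, spread = 6.7 points
```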
3.2 What the Ablation Exposed: Three Pillars
Take the system apart and three elements hold the performance up: Knowledge Planning, which works out what is needed where; Tool Memory, which remembers and reuses tools it has built; and Self-Correction, which re-checks and fixes its own scores. The ablation study removed these one at a time and measured how performance changed.
The results pinpoint exactly where this architecture is fragile. As the table below shows, removing Self-Correction sinks the success rate from 89.0% to 33.0% — the single most decisive of the three.
| Configuration | Evaluation success rate | Drop vs. full |
|---|---|---|
| Full system | 89.0% | — |
| − Knowledge Planning | 51.7% | −37.3 pp |
| − Tool Memory | 82.6% | −6.4 pp |
| − Self-Correction | 33.0% | −56.0 pp |
Table 1. Component ablation results. Source: SciHorizon-DataEVA (arXiv:2604.26645).
Figure 3. Component ablation — evaluation success rate when each element is removed. Source: SciHorizon-DataEVA (arXiv:2604.26645), redrawn by Pebblous.
This number opens the door to the next section. More than half of that impressive 89.0% comes from a single layer — self-verification. However smoothly the scoring seems to run, its trust rests heavily on one thin verification loop. Which raises the question: who verifies that verification layer itself?
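The "more than half" claim can be verified from Table 1. The short calculation below expresses each ablation as a drop in percentage points and as a share of the full system's 89.0%; the function name is illustrative.

```python
FULL_SYSTEM = 89.0  # evaluation success rate of the full system, in percent

# Success rates with one component removed, from Table 1.
ablated = {
    "Knowledge Planning": 51.7,
    "Tool Memory": 82.6,
    "Self-Correction": 33.0,
}


def ablation_effect(rate_without: float, full: float = FULL_SYSTEM) -> tuple[float, float]:
    """Return (drop in percentage points, drop as a share of the full rate)."""
    drop_pp = full - rate_without
    return drop_pp, drop_pp / full


for component, rate in ablated.items():
    drop_pp, share = ablation_effect(rate)
    print(f"{component}: -{drop_pp:.1f} pp, {share:.1%} of the full rate")
# prints:
# Knowledge Planning: -37.3 pp, 41.9% of the full rate
# Tool Memory: -6.4 pp, 7.2% of the full rate
# Self-Correction: -56.0 pp, 62.9% of the full rate
```

The three shares add up to more than 100%, which is expected: the components interact, so single-component ablations are not additive contributions.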
Who Vouches for the Rubric
When a person does the grading, we take their expertise as the ground of our trust. When an agent does it, the ground shifts. The question to ask is no longer "is this data good?" but "is this rubric — and this scoring system — trustworthy?" The trust question climbs one level, from the data to the scorecard.
4.1 ICC 0.742 — A Number on the Border
The researchers set the system's scoring against human experts' and measured the agreement. The result was an intraclass correlation coefficient (ICC) of 0.742. Human experts rated the scoring's accuracy at 4.15 out of 5, and a related assessment at 4.11. On the numbers alone, respectable. The catch is where 0.742 sits.
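For readers who want to see what an ICC measures, here is a minimal pure-Python sketch of one common form, ICC(2,1) (two-way random effects, absolute agreement, single rater). This report does not state which ICC form the study used, so the choice of ICC(2,1) is an assumption. The sample table is the classic six-subject, four-judge example from Shrout and Fleiss (1979), not data from the study.

```python
def icc2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings holds one row per rated item and one column per rater.
    """
    n = len(ratings)      # number of items (for example, datasets)
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Six subjects rated by four judges (Shrout and Fleiss, 1979).
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(table), 2))  # 0.29
```

Because this form penalizes systematic offsets between raters, a system that ranks datasets like the experts but scores them consistently higher would still lose agreement.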
Two conventions for reading ICC dominate. Koo & Li (2016) treat 0.5–0.75 as "moderate" and 0.75–0.9 as "good." By that standard, 0.742 is the top rung of "moderate." Cicchetti (1994) is more lenient: 0.60–0.74 counts as "good" and 0.75 and above as "excellent," so 0.742 sits at the top of "good." Either way, the two readings share one thing: the value falls 0.008 short of the 0.75 line that separates one band from the next. The diagram below shows where it lands.
Figure 4. Where 0.742 lands among the ICC reliability bands. Band boundaries follow Koo & Li (2016). For reference, the agreement of some recent agentic evaluation systems clusters near this same threshold (e.g., GPT-5-class κ=0.754).
How should we read this spot? For early-filtering purposes — research or screening — 0.742 is perfectly usable. For high-stakes contexts like clinical decisions or legal judgments (which typically demand ICC 0.80–0.90 or higher), it falls short. "Usable, but on the border of fully trustworthy" is the most accurate description. And this borderline value happens to be one that a good many of today's agent-based evaluation systems share.
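The two reading conventions can be written down as lookup tables. The band edges below follow Koo & Li (2016) and Cicchetti (1994); treating each lower edge as inclusive is an assumption, since the guidelines do not specify behavior exactly at a boundary.

```python
# (lower bound, label) pairs, checked from the highest band down.
KOO_LI_2016 = [(0.90, "excellent"), (0.75, "good"), (0.50, "moderate"), (0.0, "poor")]
CICCHETTI_1994 = [(0.75, "excellent"), (0.60, "good"), (0.40, "fair"), (0.0, "poor")]


def band(icc: float, scheme: list[tuple[float, str]]) -> str:
    """Return the reliability label for an ICC value under the given scheme."""
    for lower, label in scheme:
        if icc >= lower:
            return label
    return "poor"  # negative ICC values fall through to the lowest band


icc = 0.742
print(band(icc, KOO_LI_2016))     # moderate
print(band(icc, CICCHETTI_1994))  # good
print(round(0.75 - icc, 3))       # 0.008
```

A gain of 0.008 would move the label up one band under both schemes, which is why the value is best read as borderline rather than as firmly in either category.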
4.2 Three Risks the Rubric Carries
Automate the scoring and a new class of risk — one the data layer can't catch — appears at the rubric layer. There are three, broadly.
- Rubric bias. If the scorecard is designed to favor a particular domain or data shape, that bias gets stamped straight into the scores. It can look like a well-built scorecard while the standard itself is tilted.
- Gaming. If the party building a dataset optimizes it "to score well," the score rises independent of actual quality. It's a vulnerability that shows up whenever the grader is an LLM.
- Over-reliance on the verification loop. As we saw, more than half the trust rests on a single self-verification layer. If that layer wobbles, the whole scoring wobbles.
None of these three gets filtered out no matter how meticulously you verify the data, because the object of verification is not the data but the scoring standard itself. What's needed is an audit one level up — an "evaluation of evaluation" layer that independently interrogates whether the scoring criteria are justified.
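One simple form such an audit could take is a per-field comparison of agent scores against human spot-checks. The sketch below flags fields where the agent scores systematically higher or lower than the human reviewers. The function names, the 5-point tolerance, and the sample numbers are all hypothetical; this is one possible check, not a procedure described in the study.

```python
from collections import defaultdict


def signed_gap_by_field(records):
    """Mean of (agent_score - human_score) per field.

    records is an iterable of (field, agent_score, human_score) tuples.
    """
    gaps = defaultdict(list)
    for field, agent_score, human_score in records:
        gaps[field].append(agent_score - human_score)
    return {field: sum(values) / len(values) for field, values in gaps.items()}


def flag_biased_fields(records, tolerance=5.0):
    """Fields whose mean signed gap exceeds the tolerance in either direction."""
    gaps = signed_gap_by_field(records)
    return sorted(field for field, gap in gaps.items() if abs(gap) > tolerance)


# Made-up spot-check data for illustration only.
spot_checks = [
    ("astronomy", 90.0, 88.0),
    ("astronomy", 86.0, 87.0),
    ("physics", 95.0, 84.0),
    ("physics", 93.0, 86.0),
]
print(signed_gap_by_field(spot_checks))
print(flag_biased_fields(spot_checks))
```

A signed gap is used rather than an absolute one because rubric bias shows up as a consistent tilt in one direction, while random disagreement tends to cancel out.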
The definition of data governance widens by one notch here. Until now, governance was "the layer that verifies data." The moment an agent scores readiness, governance has to encompass a meta-layer too — one that verifies the verification criteria. Who vouches for the rubric? Only an organization with an answer to that question can plant automated scoring into its pipeline with any real peace of mind.
The Pebblous View: From Ingestion to Inference
Read from the Pebblous vantage point, this study reveals an intriguing overlap. The layered structure Pebblous arrived at empirically while diagnosing data with DataClinic maps, in good part, onto the four dimensions of Sci-TQA². Two different starting points arrived at much the same map.
5.1 Where the DataClinic Layers Meet Sci-TQA²
DataClinic diagnoses data by splitting it into surface (L1) and interior (L2·L3). Overlay those layers onto the four Sci-TQA² axes and they line up like this.
| DataClinic diagnostic layer | Key indicators | Corresponding Sci-TQA² axis |
|---|---|---|
| L1 · Surface | Completeness · accuracy · uniqueness · consistency | Q (Data Quality) |
| L2 · Interior | Class balance · feature importance | Ac (AI Compatibility) |
| L3 · Interior | Task generalization · scarcity · causal completeness | As (Scientific Adaptability) |
| Governance | Trustworthiness · provenance · license | T (Governance Trustworthiness) |
This convergence is closer to an intellectual finding than a marketing line. If four axes defined by academic research and four layers accumulated in field diagnosis independently drew the same shape, that shape is likely close to the real structure of readiness. Q and Ac are axes that transplant across domains, so they've already been validated in enterprise data diagnosis. T and As carry a strong science-specific flavor and, in an enterprise context, their indicators have to be redefined before they transfer. The rubric's "frame" ports; its "indicators" get customized.
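The correspondence in the table can be expressed as a small lookup, which is also where a team would start when re-filing existing diagnostic findings under the four axes. The mapping is copied from the table above; `group_by_axis` and the sample findings are hypothetical and do not represent DataClinic's actual API.

```python
# Layer-to-axis mapping copied from the table above.
LAYER_TO_AXIS = {
    "L1": "Q",          # surface: completeness, accuracy, uniqueness, consistency
    "L2": "Ac",         # interior: class balance, feature importance
    "L3": "As",         # interior: task generalization, scarcity, causal completeness
    "Governance": "T",  # trustworthiness, provenance, license
}


def group_by_axis(findings):
    """Group (layer, message) findings under the matching Sci-TQA² axis."""
    grouped = {axis: [] for axis in LAYER_TO_AXIS.values()}
    for layer, message in findings:
        grouped[LAYER_TO_AXIS[layer]].append(message)
    return grouped


# Made-up findings for illustration only.
findings = [
    ("L1", "3% duplicate rows"),
    ("Governance", "license field missing"),
    ("L1", "inconsistent date formats"),
]
report = group_by_axis(findings)
print(report)
```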
5.2 An Adoption Path for Data Teams
For a team looking to bring automated scoring into practice, the order runs roughly like this.
- Customize the rubric starting with the axis where your organization's risk is greatest. For a regulated industry, that's T; where scarce data abounds, As comes first.
- Decide how far to trust the agent's scores by setting a confidence threshold, and place human-in-the-loop checkpoints below it. The ICC 0.742 borderline is a realistic starting point for designing that threshold.
- Plant the scoring at an ingestion gate so risk gets filtered upstream in the pipeline.
- Finally, stand up a meta-procedure that audits the rubric itself on a regular cadence. Bias and gaming get caught here, not in the data.
Figure 5. Four-step adoption path for a data team. Steps 1–3 apply the scoring; step 4 is the meta-governance layer that audits the rubric itself. Original Pebblous diagram.
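Steps 2 and 3 can be sketched as a single gate function: low-confidence scores go to a human, and confident scores are admitted or rejected on the readiness value. Everything here is an assumption for illustration, including the idea that the scoring agent reports a confidence value alongside the score, the threshold values, and the names `Decision` and `ingestion_gate`.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"


def ingestion_gate(
    readiness: float,
    confidence: float,
    min_readiness: float = 85.0,   # hypothetical cut-off on a 0-100 scale
    min_confidence: float = 0.75,  # hypothetical trust threshold
) -> Decision:
    """Route a scored dataset before it enters the training pipeline."""
    if confidence < min_confidence:
        # The agent's score is not trusted enough to act on automatically.
        return Decision.HUMAN_REVIEW
    if readiness >= min_readiness:
        return Decision.ADMIT
    return Decision.REJECT


print(ingestion_gate(92.0, 0.90))  # Decision.ADMIT
print(ingestion_gate(92.0, 0.60))  # Decision.HUMAN_REVIEW
print(ingestion_gate(70.0, 0.90))  # Decision.REJECT
```

Checking confidence before readiness is a deliberate choice in this sketch: a high score the system is unsure about is treated as a review case, not as a pass.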
That readiness has become a scorable object also means a data team's questions have doubled. On top of "is our data AI-ready?" comes "can we trust the rubric that scored that readiness?" The team that plants both questions in the pipeline together is the one that captures the speed and the trust of automation at once.
Editor's Note
The data-centric AI platform market is projected to grow from about $2.5 billion in 2024 to $22.3 billion by 2033, a 28.7% CAGR (Growth Market Reports, 2025). The "evaluation of evaluation" meta-governance layer is closer to an unclaimed gap with no clear owner yet. Pebblous is exploring a direction that joins the field-diagnosis experience built up through DataClinic with the academic frame this study offers — treating both the layer that scores data readiness and the layer that audits the trust of that scorecard. This paragraph is editorial background and should be read apart from the analytical argument of the main text.
References
Academic
1. SciHorizon-DataEVA Research Team (2026). "SciHorizon-DataEVA: Evaluating the AI-Readiness of Scientific Data via Sci-TQA² and Multi-Agent Systems." arXiv:2604.26645. (submitted 2026-04-29 / revised 2026-05-28; primary source)
2. SciHorizon Research Team (2025). "SciHorizon: Benchmarking AI-Readiness for Science." arXiv:2503.13503. (original framework)
3. Koo, T. K., & Li, M. Y. (2016). "A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research." Journal of Chiropractic Medicine, 15(2), 155–163. PMC4913118.
4. Cicchetti, D. V. (1994). "Guidelines, Criteria, and Rules of Thumb for Evaluating Normed and Standardized Assessment Instruments in Psychology." Psychological Assessment, 6(4).
5. Baker, M. (2016). "1,500 scientists lift the lid on reproducibility." Nature, 533, 452–454. doi:10.1038/533452a
6. Semmelrock, L. et al. (2025). "Reproducibility in Machine Learning." AI Magazine. doi:10.1002/aaai.70002
7. "The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation." (2026). arXiv:2606.13685.
Policy, Statistics & Market
8. Gartner (2025-02-26). "Lack of AI-Ready Data Puts AI Projects at Risk." Gartner Newsroom. (57% of organizations AI-unready; 60% of projects projected to be abandoned by 2026)
9. Gartner (Q3 2024). Data Management Leader Survey. (248 data-management leaders; 63% of organizations lack a data-management practice for AI)
10. RAND Corporation (2024). "The Root Causes of Failure for Artificial Intelligence Projects." (80% of enterprise AI projects abandoned; data quality identified as top failure factor)
11. Anaconda, Inc. (2020). "State of Data Science 2020." (Data scientists spend 45% of time on data preparation)
12. Digital Science; Springer Nature; Figshare (2025). FAIR Data Awareness Report. (4,700 respondents across 151 countries; average FAIR compliance score 9.4/22)
13. Growth Market Reports (2025). Data-Centric AI Platform Market: Size, Share & Forecast 2025–2034. (Market projected from $2.54B to $22.31B, CAGR 28.7%)
Pebblous-Related
14. Pebblous Data Communication Team (2026-05-26). "5 Signals of AI-Ready Data — DataClinic Report." Pebblous Blog. (134-dataset diagnostic; five AI-readiness signals)
15. Pebblous Data Communication Team (2026-06-08). "What Is AI-Ready Data? Quality, Lineage & Governance Guide." Pebblous Blog.
16. Pebblous Data Communication Team (2026-07-01). "Claude Science: AI Workbench for Reproducible Research." Pebblous Report.