A Data-Defect Taxonomy of Enterprise AI Agent Pilot Failure

Pebblous Data Communication Team

Executive Summary

"Most enterprise AI pilots never reach production, and the culprit is the data, not the model." By now that sentence is conventional wisdom. What still has no answer is the sentence after it. Which data defects actually stall an agent, how do they do it, and who owns the exceptions when they surface? Instead of restating the top-line diagnosis one more time, this report answers those three questions one defect at a time.

The answer runs along two axes. The first is a defect taxonomy. The data flaws that genuinely stop agents fall into six kinds — semantic drift, duplicate and ghost records, misread nulls, silent type coercion, silent schema change, and staleness. They share one property: every one of them passes schema validation. Formally the data is valid; to an agent it is not usable. The second axis is the ownerless exception. The edge cases a human analyst used to catch by hand in the rules-based era are now swallowed in silence by autonomous agents. The exceptions didn't disappear. They lost their owner.

The real lever that moves a pilot into production isn't model tuning. It is naming the defects and assigning ownership of the exceptions. Not "clean data" but "agent-ready data" — measuring that gap in concrete numbers is where this report is trying to land.

The gap already shows up in the numbers. Only a sliver of enterprises say their data is fully ready for AI; poor data quality costs about $12.9 million per company every year; and when scaling fails, the blame usually can't find an owner. Flip that around, and the companies that did name an owner were markedly more likely to carry a pilot into production.

Figure	What it measures	Source
7%	Enterprises that say their data is completely AI-ready	Cloudera/HBR 2026, n=230+
$12.9M	Annual cost of poor data quality per company	Gartner
89%	Scaling failures traced to unclear ownership	AgentMarketCap 2026
2.7×	Higher pilot-to-production rate with a named agent owner	Forrester

1. Pilots Aren't Dead: They Just Stall at the Exception

Enterprise AI discourse already has its settled truth: the pilot works, it can't cross into production, and the reason is tangled data rather than the model's intelligence. That diagnosis is correct. So this report starts there but refuses to stay there. It folds "the data is the problem" into the premise. The question practitioners actually want answered comes next: exactly which defect, along exactly which path, brings an agent to a halt.

Peel back the "can't reach production" cliché and the picture is more complicated than it looks. RAND found that 80.3% of enterprise AI projects failed to deliver the business value they promised, but it didn't lump those failures together. It sorted projects into four outcomes: 33.8% scrapped before ever reaching production, 28.4% that reached production yet fell short of expected value, 18.1% that run but never recoup the investment, and 19.7% that met the business case. The first three buckets add up to the 80.3% failure figure. The second bucket is the one to watch. Failing after reaching production is nearly as common as being scrapped before it. The risk to a pilot doesn't end at the moment of deployment.

The point where projects fall out is specific, too. AgentMarketCap's 2026 maturity model maps the enterprise agent journey in four stages. 78% reach the prototype stage (Stage 1, fewer than five agents), but the moment they try to scale to 5–20 agents, 60% get trapped in what the report calls the "Stall Zone." Only 31% of companies reach stable production (Stage 3) — 21% in finance, just 8% in healthcare. The dropout isn't caused by insufficient model performance; it happens at the scaling threshold, where defects multiply.

▲ 78% enter the pilot stage but only 31% reach stable production (Stage 3) — 60% stall at the scaling threshold | Source: AgentMarketCap (2026)

Maturity stage	Definition	Company distribution
Stage 1 · Pilot	Fewer than 5 agents, single-function prototype	78% reach it
Stage 2 · Stall Zone	Attempting to scale to 5–20 agents	60% stall here
Stage 3 · Production	Stable operation + governance	31% (finance 21% · healthcare 8%)
Stage 4 · Scale	20+ agents running org-wide	A tiny minority (12% on a central platform)

Source: AgentMarketCap, The Enterprise Agent Deployment Maturity Model 2026 (2026-04-11)

One sentence runs through this entire report. Valid and usable are different axes. Passing schema validation does not make data something an agent can consume without misreading it. The six defects dissected below share this property without exception. The parser lets them through, the pipeline throws no error, the dashboard is green — and the agent confidently returns the wrong answer.

The table below gathers the six defects in one place: how widely each occurs, why it slips past validation, and the specific mistake it drives an agent to make. The sections that follow take each row in turn.

Defect	Scale	Why it passes validation	How the agent misfires
① Semantic drift	73% have seen inconsistent AI output	Field is populated and correctly typed	Picks one of several definitions and answers wrong
② Duplicate / ghost records	15–25% duplication rate in databases	Each record is individually valid	Counts one customer as many, skewing metrics
③ Misread nulls	41–86.7% multi-agent failure	null is a legal value	Reads null as 0 / default, flipping the decision
④ Type coercion	<40% compliance without enforced schema	Serialization format is perfectly valid	Quietly changes types or drops fields
⑤ Silent schema change	39% of teams name it their top risk	No schema validation at ingest	No error, just an empty dashboard
⑥ Staleness	37.6% error rate without detection	Data exists and the format is fine	Keeps deciding on stale context

2. Defects ① and ②: Fields Whose Meaning Drifts, Records That Turn to Ghosts

The first two defects aren't about the "content" of the data but its "meaning." The values are fine. What's blurred is what those values point to. Because these defects never throw an exception, no red light comes on anywhere in the pipeline.

2.1 Semantic Drift: Same Name, Different Meaning

Definition. Semantic field drift is the phenomenon where a field with the same name means something different from system to system. The textbook case is "active customer." In one system it means a customer who purchased within 90 days; in another, one who logged in within 30. When Finance, Marketing, and Sales all send the same prompt — "total customer value in Q3" — they get three different answers. No department is wrong, yet the answers disagree.

Why it passes validation. The field is populated, the type is right, it isn't null. What a schema checks is "is there a value," not "what does that value mean." Meaning lives outside the schema, so no formal validation catches it.

How the agent fails. In the rules-based BI era this ambiguity wasn't a problem. Earley Information Science puts the diagnosis precisely: "BI tools and operational systems have tolerated semantic ambiguity for decades because the analyst acted as an implicit reconciliation layer. Agentic AI removes that buffer." Where a person would have asked back, "which definition do you mean?", the LLM picks one of the conflicting definitions with no flag at all and returns a confident wrong answer. Already, 73% of enterprises say they've seen inconsistency in AI output.

▲ One field name "active_customer" carries three different meanings across systems — the agent picks one without noticing the conflict and returns a confident wrong answer | Source: Earley Information Science (2026)

Detection and repair. The starting point is pinning definitions down as a contract that lives outside the code. When the Open Semantic Interchange v1.0 launched in January 2026 with 30-plus partners including Snowflake, dbt Labs, and Salesforce, it marked the industry's first move to elevate the semantic layer into a shared standard. Nail the definition of "active customer" into a data contract, and the moment the definition changes it surfaces as a contract violation.
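
A minimal Python sketch of the idea, under stated assumptions. The two definitions (purchase within 90 days, login within 30 days) come from the example above; the record layout, the function names, and the `ACTIVE_CUSTOMER_CONTRACT` dictionary are hypothetical and are not the Open Semantic Interchange format. It shows one field name producing two different counts, and a pinned definition turning that silent disagreement into an explicit violation.

```python
from datetime import date, timedelta

TODAY = date(2026, 7, 1)  # fixed date so the example is reproducible

# Hypothetical customer records: (id, last_purchase, last_login)
customers = [
    ("c1", TODAY - timedelta(days=10), TODAY - timedelta(days=5)),
    ("c2", TODAY - timedelta(days=60), TODAY - timedelta(days=45)),
    ("c3", TODAY - timedelta(days=200), TODAY - timedelta(days=2)),
    ("c4", TODAY - timedelta(days=80), TODAY - timedelta(days=70)),
]

def active_by_purchase(rec, days=90):
    """Definition A: purchased within the last `days` days."""
    return (TODAY - rec[1]).days <= days

def active_by_login(rec, days=30):
    """Definition B: logged in within the last `days` days."""
    return (TODAY - rec[2]).days <= days

# Hypothetical contract: one pinned definition for the field name.
ACTIVE_CUSTOMER_CONTRACT = {
    "field": "active_customer",
    "basis": "purchase",
    "window_days": 90,
}

def check_definition(basis, window_days, contract=ACTIVE_CUSTOMER_CONTRACT):
    """Raise if a system's definition differs from the pinned contract."""
    if (basis, window_days) != (contract["basis"], contract["window_days"]):
        raise ValueError(
            f"contract violation for {contract['field']}: "
            f"got {basis}/{window_days}d, expected "
            f"{contract['basis']}/{contract['window_days']}d"
        )

# Same field name, two definitions, two different answers.
count_a = sum(active_by_purchase(c) for c in customers)
count_b = sum(active_by_login(c) for c in customers)
print(count_a, count_b)
```

Both counts are well-formed integers, which is why nothing downstream complains. Only the comparison against a pinned definition makes the conflict visible.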

2.2 Duplicate and Ghost Records: When One Person Becomes Four

Definition. A duplicate record is a state in which a single entity exists as several copies inside the data. The textbook case is the same customer holding four IDs across four systems after an M&A. By Landbase's 2026 figures, the duplication rate in an average enterprise database runs 15–25%. In a base of 10 million customer records, that's 1.5–2.5 million ghosts.

Why it passes validation. Each record is perfectly valid on its own. It has a name, an ID, the right format. Duplication is never caught by validation that looks at a single record. It only shows up when you look at the relationships between records — and most schema validation stops at the row level.

How the agent fails. When an agent counts "number of customers" or sums "total purchases," ghost records quietly inflate the number. Deterministic rules resolve only about 80% of matches; the remaining 20% contaminate every downstream model, campaign, and compliance report. The agent then executes actions — "auto-reorder," "extract the high-value segment" — on top of that contaminated aggregate.

Detection and repair. Entity resolution and deduplication pipelines are the answer. Landbase reports that introducing algorithmic matching cuts duplicates by 30–40% within the first few months. The key is profiling the duplication rate quantitatively at every join and handoff point. Wherever you don't measure, the ghosts keep multiplying.
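
As a rough illustration of profiling a duplication rate, the sketch below groups rows by a deterministic match key. The record fields, the sample rows, and the choice of a normalized email as the key are all assumptions made for the example. Real entity resolution typically adds fuzzy or probabilistic matching on top, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical customer rows from merged systems; field names are assumptions.
records = [
    {"id": "A-1", "name": "Jane Doe", "email": "Jane.Doe@example.com"},
    {"id": "B-7", "name": "jane doe ", "email": "jane.doe@example.com"},
    {"id": "C-3", "name": "John Roe", "email": "john.roe@example.com"},
    {"id": "D-9", "name": "JANE DOE", "email": " jane.doe@example.com"},
]

def match_key(rec):
    """Deterministic match key: whitespace-trimmed, case-folded email."""
    return rec["email"].strip().casefold()

def duplication_rate(rows):
    """Return (share of rows that are surplus copies, groups of ids per key)."""
    if not rows:
        return 0.0, {}
    groups = defaultdict(list)
    for row in rows:
        groups[match_key(row)].append(row["id"])
    # Every id beyond the first in a group is a surplus copy of one entity.
    surplus = sum(len(ids) - 1 for ids in groups.values())
    return surplus / len(rows), dict(groups)

rate, groups = duplication_rate(records)
print(f"duplication rate: {rate:.0%}")
```

Each of the four rows would pass a row-level check on its own. The duplication only appears once the rows are compared with each other, which is the point made above about relationship-level validation.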

Semantic drift and duplication look like different defects, but they fail the same way. Neither throws an exception, and both make the agent "confidently wrong." With the human analyst — the buffer — gone, these two defects translate directly into flawed autonomous action.

3. Defects ③ and ④: Agents That Read null as 0, and Type Coercion in Silence

The third and fourth defects are the purest specimens of the "valid but not usable" state. The data passes the schema, the JSON parses perfectly — and inside it, meaning quietly warps.

3.1 Misread Nulls: When "Absent" and "Zero" Blur Together

Definition. null means "no value" or "not yet known" — it does not mean "0." Yet as an agent serializes data into a prompt or feeds it into a calculation, null is easily flattened into 0 or a default. Read a null in an inventory field (= not verified) as 0 (= out of stock), and the agent cancels the auto-reorder. Read a null balance as 0, and a credit decision flips wholesale.

Why it passes validation. null is a legal value in most schemas. The validator only checks "is null allowed" and lets it through. The problem is how the downstream interprets null — and that interpretation is not what validation covers.

How the agent fails. The "lossy serializer" phenomenon documented by Microsoft's ISE team lays bare what this defect really is. JSON produced by an LLM is formally 100% valid, yet its contract reliability is effectively near zero. Required sub-fields go missing, arrays come back empty — and the parser lets it through. An operator opens the summary and finds no event log and no open tasks. Not because there were no incidents, but because the model silently dropped them. This kind of silent omission is part of why production failure rates in multi-agent systems reach 41–86.7% (MAST, NeurIPS 2025, tracing 1,600+ runs).
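
The inventory example can be sketched in a few lines of Python. The function names, the decision labels, and the reorder threshold are hypothetical; the only thing taken from the text is the policy that a stock level of 0 is treated as out of stock and cancels the auto-reorder. The sketch contrasts a lossy handler with one that keeps "unknown" separate from "zero".

```python
from typing import Optional

def naive_decision(stock: Optional[int], reorder_below: int = 10) -> str:
    """Lossy handling: `stock or 0` flattens None (not verified) into 0."""
    level = stock or 0
    if level == 0:
        return "cancel_reorder"  # treated as out of stock
    return "reorder" if level < reorder_below else "hold"

def null_aware_decision(stock: Optional[int], reorder_below: int = 10) -> str:
    """Keeps 'unknown' distinct from 'zero' and escalates instead of guessing."""
    if stock is None:
        return "needs_review"  # absent value: hand off to a human or owner
    if stock == 0:
        return "cancel_reorder"
    return "reorder" if stock < reorder_below else "hold"

for value in (None, 0, 5, 50):
    print(value, naive_decision(value), null_aware_decision(value))
```

The two functions agree on every numeric input and differ only on `None`, which is exactly the case a schema that allows nulls will never flag.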

3.2 Type Coercion: The Format Is Right, the Value Is Wrong

Definition. Type coercion is the phenomenon where an LLM arbitrarily changes a value's type when it calls a tool or emits structured output. It returns customer_id as an integer instead of a string, drops a boolean field entirely, or wraps the whole JSON in an escaped string. According to OpenAI's internal research, without schema enforcement GPT-4's output-schema compliance runs below 40%. Flip that around, and more than 60% of tool calls carry the risk of a type mismatch or a dropped field.

Why it passes validation. The serialization format itself is often perfectly valid. The braces match, the syntax is correct. It passes format validation — but because the customer_id inside was turned into an integer, the follow-up lookup quietly returns an empty result. It's a pattern reported over and over in real open-source projects, like deepset-ai/haystack issue #6098.

Detection and repair. Apply schema enforcement (structured output / constrained decoding) and the non-compliance rate drops to near zero. But enforced schemas alone won't stop a misread null. Pin type and meaning down together as a contract, and profile at the point where data enters the agent.
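
A sketch of the `customer_id` failure and a strict type check that would catch it, using only the Python standard library. `TOOL_OUTPUT_CONTRACT`, `contract_violations`, and the sample payload are hypothetical names and data. The check is a hand-rolled illustration, not a substitute for constrained decoding or a schema library.

```python
import json

# Hypothetical tool-output contract: field name -> required Python type.
TOOL_OUTPUT_CONTRACT = {"customer_id": str, "is_active": bool, "order_count": int}

def contract_violations(payload: dict, contract: dict = TOOL_OUTPUT_CONTRACT) -> list:
    """Return human-readable violations; an empty list means compliant."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"{field}: missing")
            continue
        value = payload[field]
        # Exact type comparison rather than isinstance(): bool is a subclass
        # of int in Python, so isinstance(True, int) would accept a bool.
        if type(value) is not expected:
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems

raw = '{"customer_id": 10482, "order_count": 3}'  # syntactically valid JSON
payload = json.loads(raw)

customers_by_id = {"10482": "Jane Doe"}
# The integer key never matches the string key, so the lookup quietly misses.
lookup = customers_by_id.get(payload["customer_id"])
problems = contract_violations(payload)
print(lookup, problems)
```

The parse succeeds and the lookup raises nothing; it simply returns `None`. The contract check is the only step that reports both the coerced `customer_id` and the dropped `is_active` field.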

When these two defects combine, you get the quantitative identity of the familiar complaint, "it worked in the pilot, so why not in production?" In a pilot (a single run over curated test data) the agent scores 60% accuracy. In production (multiple runs over live data laced with nulls, duplicates, and type mismatches) it drops to 25% consistency. That is a 2.4× decline, with the caveat that the two figures measure slightly different things: single-run accuracy versus consistency across runs. The gap's chief cause isn't that the model suddenly got dumber; it's that the meaning of a null, handled consistently in the pilot data, varied from system to system in the production data.

Environment	Conditions	Performance
Pilot	Single run, curated test data	60% accuracy
Production	Multiple runs, live data with nulls and type mismatches	25% consistency

Source: synthesis of multiple LLM reliability studies. Even the best model on data-agent benchmarks reaches only 38% pass@1 (arXiv 2025).

4. Defects ⑤ and ⑥: When Yesterday's Good Data Quietly Goes Wrong Today

The last two defects play out on the axis of time. Data that was fine yesterday quietly goes wrong today. It isn't the pipeline that notices the drift — it's the agent, after the fact.

4.1 Silent Schema Change: The Column Nobody Told You About

Definition. A silent schema change is when a data provider alters the schema without notifying the consumer. The typical scenario: a vendor inserts a column into the middle of a JSON payload. The ingest process has no schema validation. So with no error and no alert, processing goes wrong, and all that's left is an empty dashboard.

Why it passes validation. Because there was no schema contract to validate against in the first place. The data still flows in, and the format looks plausible. To catch the change you'd have to compare "is the schema now the same as the schema before" — and most pipelines never make that comparison.

Scale. In the 2026 State of Database Change Governance report, 39% of teams named schema drift their top AI risk. Liquibase claims 64% of AI risk actually lives at the schema layer, not the model (a single-vendor estimate, so read it only as directional evidence). It's why the data-engineering community has nicknamed it the "silent killer."

Detection and repair. Put a schema contract and automated schema validation in front of ingest, and a change surfaces immediately as a contract violation. 2026 frameworks like RIVA have demonstrated an LLM-agent-based approach that reliably detects configuration drift.
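
The "is the schema now the same as the schema before" comparison can be sketched as a fingerprint check at ingest. Everything named here (`schema_of`, `fingerprint`, `schema_diff`, and the sample payloads) is hypothetical, and the sketch only inspects top-level fields. It is not how RIVA or any specific framework works.

```python
import hashlib
import json

def schema_of(record: dict) -> dict:
    """Map each top-level field to its Python type name (no nesting)."""
    return {key: type(value).__name__ for key, value in record.items()}

def fingerprint(schema: dict) -> str:
    """Stable hash of a schema; key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def schema_diff(expected: dict, observed: dict) -> dict:
    """List fields that were added, removed, or changed type."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(k for k in shared if expected[k] != observed[k]),
    }

# Hypothetical vendor payloads from two consecutive days.
yesterday = {"order_id": "o-1", "amount": 19.99, "currency": "USD"}
today = {"order_id": "o-2", "region": "EU", "amount": "19.99", "currency": "USD"}

expected, observed = schema_of(yesterday), schema_of(today)
changed = fingerprint(expected) != fingerprint(observed)
diff = schema_diff(expected, observed)
print(changed, diff)
```

Storing the fingerprint of the last accepted schema and comparing it on every batch is cheap, and the diff gives whoever owns the feed something concrete to act on.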

4.2 Staleness: Decisions on Stale Context

Definition. Staleness is a state where the data exists and the format is fine, but it's already past its expiry. If an agent makes a decision once per second while the data refreshes once per hour, then for that hour 3,600 decisions run on the same stale context.

How the agent fails, and the prescription. The NANDini multi-agent study measured a decision error rate of 37.6% when there was no staleness-detection mechanism. In the same study, applying the Data Facts metadata schema dropped that rate to 8.6%. A single detection mechanism cut the errors to less than a quarter. This is why the report doesn't stop at pessimism. Name a defect and attach a detector, and the improvement shows up as a number.

Condition	Agent decision error rate
No staleness-detection mechanism	37.6%
Data Facts schema applied	8.6%

Source: arXiv 2606.26211, Data Facts: A Metadata Schema ... in the NANDini Multi-Agent Ecosystem (2026)
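
The staleness arithmetic and a simple freshness guard can be sketched as follows. The `as_of` field, the 15-minute limit, and the function names are assumptions made for illustration. This is not the Data Facts schema from the NANDini study, only the general idea of checking a timestamp before acting.

```python
from datetime import datetime, timedelta, timezone

def decisions_per_refresh(decisions_per_second: float, refresh_interval: timedelta) -> int:
    """Number of decisions made between two refreshes of the same snapshot."""
    return int(decisions_per_second * refresh_interval.total_seconds())

def decide(context: dict, now: datetime, max_age: timedelta) -> str:
    """Refuse to act on expired context instead of deciding silently."""
    age = now - context["as_of"]
    if age > max_age:
        return "defer"  # context is older than the freshness limit
    return "act"

NOW = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)  # fixed for reproducibility
MAX_AGE = timedelta(minutes=15)                          # hypothetical freshness SLA

fresh = {"price": 101.5, "as_of": NOW - timedelta(minutes=5)}
stale = {"price": 99.0, "as_of": NOW - timedelta(hours=2)}

print(decisions_per_refresh(1, timedelta(hours=1)))  # one decision per second, hourly refresh
print(decide(fresh, NOW, MAX_AGE), decide(stale, NOW, MAX_AGE))
```

The guard needs nothing more than a timestamp carried with the data, which is why the absence of such metadata is the real defect rather than the age of the data itself.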

5. Exceptions Nobody Owns: The Governance Vacuum of the Agent Era

Once you've dissected the six defects one by one, you arrive naturally at the second axis. Because these defects never raise an error, nothing automatic catches the edge cases they produce; someone has to notice them. In the era of rules-based ETL and BI, analysts and engineers were the implicit reconciliation layer. When a strange value showed up, a person caught it by hand. Autonomous agents remove that buffer. Exceptions still occur, but with no explicit owner they're now swallowed in silence.

This vacuum isn't an abstract worry — it's measured. AgentMarketCap attributes 89% of agent scaling failures, among traceable causes, to "unclear ownership." Only 21% of companies have a mature governance model for autonomous AI agents. In the EY/AIUC-1 consortium survey, just 38% of companies monitor AI traffic end to end, and only 17% continuously monitor agent-to-agent interactions. 64% of companies with over $1 billion in revenue reported losses exceeding $1 million from an AI incident in 2025.

▲ AI agent security responsibility split across three functions — without a clear owner, exceptions have no owner either | Source: AgentMarketCap (2026), EY/AIUC-1 (2026)

Why did ownership scatter so badly? Surveys show responsibility for agent security split across the security team (39%), the IT department (32%), and a dedicated AI-security function (13%). A structure where everyone owns a little is, in the end, one where no one owns anything. Blackstraw's Atul Arya sums up the dynamic well: "ROI shows up somewhere between six and twelve months, but executive sponsorship disappears at the same point — because success metrics were never defined up front."

The prescription is optimistic. Companies with a "named agent owner" who holds both budget authority and measurement goals convert pilots to production at 2.7× the rate. Among companies that made the transition, 94% had a named owner, and 87% run automated evaluations before changing a prompt, model, or tool. The practical instruments for reclaiming ownership are exactly data contracts and observability. A data contract makes an explicit promise between producer and consumer — on schema, meaning, and freshness SLAs — and alerts the exception's owner the instant that promise is broken.
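
One way to picture a contract that names its owner is the sketch below. The `DataContract` and `ExceptionTicket` classes, the team name, and the field list are hypothetical. The sketch covers only the schema and freshness parts of the promise and leaves meaning out, and it returns a ticket rather than calling any real alerting system.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional

@dataclass
class DataContract:
    dataset: str
    owner: str               # named person or team accountable for exceptions
    required_fields: dict    # field name -> expected Python type
    max_age: timedelta       # freshness SLA

@dataclass
class ExceptionTicket:
    dataset: str
    owner: str
    reasons: list = field(default_factory=list)

def evaluate(contract: DataContract, record: dict, age: timedelta) -> Optional[ExceptionTicket]:
    """Return None if the record honours the contract, else a ticket for the owner."""
    reasons = []
    for name, expected in contract.required_fields.items():
        if name not in record:
            reasons.append(f"missing field: {name}")
        elif type(record[name]) is not expected:
            reasons.append(f"wrong type for {name}: {type(record[name]).__name__}")
    if age > contract.max_age:
        reasons.append(f"stale by {age - contract.max_age}")
    if not reasons:
        return None
    return ExceptionTicket(contract.dataset, contract.owner, reasons)

contract = DataContract(
    dataset="customer_balances",
    owner="finance-data-team",  # hypothetical owner
    required_fields={"customer_id": str, "balance": float},
    max_age=timedelta(hours=1),
)

ticket = evaluate(contract, {"customer_id": 10482}, age=timedelta(minutes=90))
ok = evaluate(contract, {"customer_id": "10482", "balance": 12.5}, age=timedelta(minutes=5))
print(ticket)
```

The design choice worth noting is that the owner is a required field of the contract itself, so a violation cannot be raised without also saying who receives it.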

Exceptions didn't disappear in the agent era. They just lost their owner. If naming the defect is the first axis, deciding who catches that named defect when it appears is the second. The two axes move together. You can't assign an owner to a nameless defect, and a defect with no owner stays neglected even after you name it.

6. From "Clean Data" to "Usable Data"

The six defects and the ownerless exception all converge on one distinction. Validity and usability are different axes. Clean data is data that passes schema validation. Usable data is data an agent can consume without misreading it. However perfect the format, if a null gets read as 0 or "active customer" means something different in every system, that data is valid but not usable.

▲ Valid (passes schema) and Usable (agent can consume without misreading) are different axes. Read against the 7% who call their data completely AI-ready, the remaining 93% of enterprises are placed in the bottom-right, where the format is correct but agents misread | Source: Pebblous reinterpretation, Cloudera/HBR 2026

The gap is startlingly wide. In the 2026 survey by Cloudera and Harvard Business Review Analytic Services, only 7% of companies said their own data was "completely ready" for AI. Deloitte puts the share of companies with a data architecture suited to agentic AI deployment at 14%. The two figures use different definitions but point the same way: whatever its formal validity, the data at most companies has not reached "usable."

The most dangerous thing is the misconception about how to close that gap. In the same survey, 47% of companies said they believe "agentic AI will fix data quality problems on its own." The picture the previous five sections drew is the exact opposite. Far from fixing data quality, agents amplify defects by removing the human reconciliation buffer. Data quality is not an output of adopting agents; it's a precondition for it.

If so, the conditions for "usable data" are clear: none of the six defects, and an owner for whatever exceptions remain. Gartner's decision to add "agent-ready data" to its glossary in 2026, as a criterion more specific than "clean data," reflects the same shift in thinking. The question is how to measure that state. And yet 59% of companies don't measure data quality at all. Measurement is the first step.

7. Why Pebblous Pays Attention to This Problem

Pebblous DataClinic is a product that diagnoses a dataset quantitatively to judge whether it's in a "usable state." The six defects this report dissects (semantic drift, duplicates, nulls, types, schema, freshness) are the defect axes DataClinic catches (integrity, distribution, duplication, missingness), rewritten in the language of agent operations. So this piece is less abstract discourse than a document that translates, defect by defect, why the practice of data diagnosis is needed at all.

The 2.4× decline from pilot to production (60% → 25%) and the before/after of staleness detection (37.6% → 8.6%) are the cleanest evidence for a proposition Pebblous has long held: defects in training and input data propagate into the behavior of models and agents. That this proposition can be quantified not as an abstract number like "85% quality" but defect by defect (a null read as 0, fields that differ across systems, stale context) is where this report lands.

The prescription for an organization that can't carry a pilot into production isn't "tune the model more." It's "diagnose these six defects first, and assign an owner to the exceptions that remain." The table below simply carries the six defects over into a diagnosis-and-ownership checklist. Profiling these items at every join and handoff point in your own pipeline is the first step.

Defect	Diagnostic metric	Ownership question
Semantic drift	Source and version match of field definitions	Who owns the definition of "active customer"?
Duplicate / ghost records	Duplication rate, entity-match confidence	Who manages the deduplication rules?
Misread nulls	null ratio, whether null semantics are documented	Who notices when a null is read as 0?
Type coercion	Schema compliance rate, whether enforced schema is applied	Who validates the tool-output contract?
Silent schema change	Schema version control, change-alert system	Who approves a provider's schema change?
Staleness	Freshness lag, refresh cadence vs. SLA	Who is accountable for the freshness SLA?

Editor's Note. This section is analysis written from the Pebblous point of view. The six defects and the governance vacuum above hold as industry phenomena independent of any particular product, and DataClinic is one approach to diagnosing those phenomena quantitatively. Defining and measuring not just "clean data" but "data an agent can use" — that mostly-empty space is where we mean to stand.

References

Industry Reports & Surveys

1. AgentMarketCap. (2026-04-11). The Enterprise Agent Deployment Maturity Model 2026: Why 86% of Companies Are Stuck in Pilot Purgatory.
2. Cloudera / HBR Analytic Services. (2026-03-05). Only 7% of Enterprises Say Their Data Is Completely Ready for AI. (n=230+)
3. Writer.com & Workplace Intelligence. (2026-04-07). Enterprise AI Adoption in 2026. (n=2,400)
4. Informatica. (2025). CDO Insights 2025. (n=600 data leaders)
5. Deloitte. (2026). State of AI in the Enterprise 2026. (n=3,235)
6. Landbase. (2026). Duplicate Record Rate Statistics: 32 Key Facts for 2026.
7. Liquibase. (2026). The Real AI Failure Mode: Data Quality at the Schema Layer, Not the Model.
8. Earley Information Science. (2026). Why Enterprise AI Stalls: Semantic Infrastructure.
9. EY / AIUC-1 Consortium. (2026-03). AI Ownership and Accountability Report. Includes Forrester root-cause analysis (named agent owner, 2.7×).
10. Business Standard. (2026-06-30). Experts explain why enterprise AI projects struggle to move beyond pilots.
11. Gartner. (2025-06-25). Poor Data Quality Costs Organizations $12.9M Annually; Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027.

Academic Papers & Technical Reports

12. MAST Research Team. (2025). MAST: A Multi-Agent System Failure Taxonomy. NeurIPS 2025. (1,600+ traces, 14 failure modes)
13. NANDini Research Team. (2026). Data Facts: A Metadata Schema for Structured Data Exchange in the NANDini Multi-Agent Ecosystem. arXiv: 2606.26211.
14. Agentic AI Fault Research Team. (2026). Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes. arXiv: 2603.06847.
15. Microsoft ISE Developer Blog. (2026). Separating Deterministic Extraction from AI Inference. (lossy serializer)
16. RAND Corporation. (2024). Enterprise AI Project Failure Analysis. (80.3% failure, MECE breakdown)
17. Lanham, M. (2026). LLM Output Compliance Without Schema Enforcement. Medium/@Micheal-Lanham. (schema enforcement absent: compliance <40%)

Some figures (64% schema-layer risk, 65% context drift, and others) are single-vendor or informal synthesized estimates, flagged as "estimate" in the body.