Executive Summary

Companies have poured billions into AI for four years running, and the returns are hard to find. Gartner expects 60% of AI projects to be abandoned this year because of poor data quality. Clario, a startup co-founded by Yousuf Khan, a five-time CIO, takes that diagnosis head-on: instead of building a bigger model, it says, clear out the garbage data that has piled up across the enterprise first. On that thesis it raised a $6M seed. This article looks at how far that bet reaches and where it stops.

What Clario targets is ROT: data that is Redundant, Obsolete, or Trivial. In partner analyses, as much as 60% of enterprise data qualified. Clario connects to repositories like Google Drive, SharePoint, and Confluence, surfaces garbage candidates with metadata heuristics, and routes keep-or-delete decisions through Slack and Teams workflows. Semantic analysis based on AI embeddings, which would judge a file by its meaning, is still on the roadmap; today's technology focuses on clearing away the "obvious garbage."

That is where the question for Pebblous readers begins. Once you clear the garbage, does the data that remains become data you can use for AI? Cleanup is a necessary condition, not a sufficient one. Clean data with no structure, no context, and no labels still sits in front of a RAG pipeline or an LLM as "clean garbage."

  • $6M: Clario seed round, led by Preface Ventures with 11 investors (Jun 2026)
  • 60%: AI projects to be abandoned due to data quality within this year (Gartner)
  • Up to 60%: share of enterprise data that is ROT, measured in partner analyses (Clario)
  • 95%: generative AI pilots with no measurable P&L impact (MIT NANDA)

1. What Breaks Isn't the Model

When an AI project fails, the first suspect is usually the model. The belief is that scaling it up or swapping in a better one will fix things. But the statistics of the past few years point somewhere else. Gartner projects that 60% of AI projects will be abandoned this year over data quality problems. Back in 2024, RAND already found that more than 80% of AI projects fail to deliver the business value they were meant to, and in 2025 MIT's Project NANDA reported that 95% of generative AI pilots produced no measurable profit-and-loss impact.

Abandonment rates are climbing fast. According to S&P Global Market Intelligence, the share of companies scrapping most of their AI initiatives jumped from 17% in 2024 to 42% in 2025. What these numbers say in common is simple: the bottleneck isn't the algorithm, it's the data going into the algorithm.

[Figure: Where AI Projects Break Down. Pipeline stages: Data Input (60–95% fail here), Preprocessing, AI Model, Business Value. Based on Gartner, RAND, MIT Project NANDA.]
▲ The AI pipeline: failures concentrate at the data input stage, not in the model. | Pebblous Original Diagram

Clario's co-founder and CEO, Yousuf Khan, saw this wall from the field. A five-time CIO who has talked with hundreds of fellow CIOs, he says every AI project stalls in the same place: data that has never once been cleaned up. "'Garbage in, garbage out' isn't a cliché," he puts it, "it's an enormously expensive mistake." Companies are burning real money, he argues, by feeding terabytes of garbage data into AI that promises revolutionary results.

2. What $6M Buys Clario

Clario came out of stealth in June 2026, disclosing a $6M seed round. Preface Ventures led, with Ridge Ventures, Rain Capital, Transform VC, and others making 11 investors in total. CTO Madhu Vohra, who founded the company with Khan, is an infrastructure veteran who led engineering at Oracle OCI Storage, NetApp's clustered SAN, and Nutanix. The company introduces itself as "the first dedicated platform built to remove enterprise data ROT."

2.1 The ROT Diagnosis

ROT refers to data that is Redundant, Obsolete, or Trivial. Clario's diagnosis is that 78% of the data enterprises store is unstructured, and that conservatively more than a third of it is effectively garbage. When it actually analyzed partner companies, the garbage share rose as high as 60%. We're talking about an MP3 someone downloaded long ago, the manual for a discontinued product, a legacy format that no longer even opens, a movie a former employee uploaded to the company drive.

[Figure: ROT Data in Enterprise, by scale. Clario partner analysis: ROT up to 60%, active 40%. Veritas Databerg Report: ROT plus dark data 85%, remainder 15%. ROT breakdown: Redundant (duplicate files and copies), Obsolete (discontinued and outdated docs), Trivial (off-topic files).]
▲ A significant share of enterprise data qualifies as ROT. Clario partner analyses measured up to 60%; Veritas's Global Databerg Report puts ROT or dark data at 85%. | Pebblous Original Diagram

2.2 How It Works

The product moves in four steps. First, it connects to the content systems already in place, such as Google Drive, SharePoint, OneDrive, Box, and Confluence. Second, it scans metadata (file checksums, naming patterns, last-accessed time, and whether the format is still supported) to narrow down ROT candidates. Third, it sends those results to Slack or Teams to collect keep, archive, or delete decisions, billing on a pay-per-decision model that charges only when a decision is made. Fourth, as user decisions accumulate, the system learns to clean up automatically on a regular cadence.

One point is worth flagging. The basis of this detection is metadata-driven heuristics. The AI embedding analysis that reads a file's meaning to judge it is still on the roadmap. In other words, what Clario does well right now is clear away "obvious garbage" quickly and safely. That alone is plainly valuable. Saad Siddiqui of investor Preface Ventures called Clario "the only company actually doing the work to let enterprises start from an AI-ready foundation."
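To make "metadata-driven heuristics" concrete, here is a minimal Python sketch of how ROT candidates might be flagged from checksums, last-accessed time, and file format alone. It is an illustration, not Clario's implementation: the `FileMeta` record, the `rot_candidates` function, the three-year staleness threshold, and the format lists are all assumptions made for this example, and naming-pattern checks are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds and format lists, chosen only for illustration.
STALE_AFTER = timedelta(days=3 * 365)
UNSUPPORTED_FORMATS = {".wpd", ".lwp", ".mdi"}
OFF_TOPIC_FORMATS = {".mp3", ".mp4", ".mov"}


@dataclass(frozen=True)
class FileMeta:
    path: str
    checksum: str            # content hash reported by the repository
    last_accessed: datetime


def extension(path: str) -> str:
    """Return the lowercase file extension including the dot, or ''."""
    name = path.rsplit("/", 1)[-1]
    return "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""


def rot_candidates(files: list[FileMeta], now: datetime) -> dict[str, list[str]]:
    """Return {path: [reasons]} for files that look like ROT candidates."""
    by_checksum = defaultdict(list)
    for f in files:
        by_checksum[f.checksum].append(f)

    reasons = defaultdict(list)
    for group in by_checksum.values():
        # Keep the most recently accessed copy; flag the rest as redundant.
        ordered = sorted(group, key=lambda f: f.last_accessed, reverse=True)
        for dup in ordered[1:]:
            reasons[dup.path].append("redundant: duplicate checksum")

    for f in files:
        ext = extension(f.path)
        if now - f.last_accessed > STALE_AFTER:
            reasons[f.path].append("obsolete: not accessed recently")
        if ext in UNSUPPORTED_FORMATS:
            reasons[f.path].append("obsolete: unsupported format")
        if ext in OFF_TOPIC_FORMATS:
            reasons[f.path].append("trivial: off-topic media")
    return dict(reasons)
```

Nothing in this sketch opens a file, which is why it can be fast and safe, and also why it cannot tell a valuable document from a worthless one when both look alike from the outside. The output is a list of candidates with reasons, suited to a human keep, archive, or delete decision rather than automatic deletion.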

3. What's Left After the Cleanup

Say you've cleared out all the garbage. The drive is clean, storage costs drop, search gets faster. But whether AI gives good answers off the data that remains is a separate question. After the cleanup, what's usually left looks like this.

  • Thousands of Word documents with no labels and no classification
  • PDF reports with no metadata beyond a title
  • Confluence pages scattered with no links between them
  • Policy documents where no one can tell which version is current

These are not ROT. They are clearly valuable data, which is exactly why Clario's metadata heuristics leave them untouched. The problem comes when you push this perfectly good data straight into a RAG pipeline or an LLM. With no structure, no context, and no labels, the model has no way to know which document is the authoritative current version, or in what context a given sentence was written. The result drifts back toward garbage in, or more precisely, clean garbage.

[Figure: ROT Removal ≠ AI-Ready, the clean garbage problem. All enterprise data, including ROT (outdated, duplicate, trivial), passes through Clario. With ROT removed, the data looks clean but has no labels, no context, no structure, and unknown versions. Fed to RAG / LLM, it becomes "clean garbage" because the model can't determine the correct context.]
▲ Removing ROT leaves data that looks clean but still fails in front of a RAG pipeline: "clean garbage." | Pebblous Original Diagram

Khan knows this himself. In one interview he said that when the agents and RAG systems he built in-house ran on top of outdated policies, discontinued-product docs, and retired support documents, "the LLM burned its compute budget filtering out the noise." Clario reduces that noise. But reducing noise and making the signal usable are different jobs.

4. The Distance From Deletion to Readiness

Making data usable for AI splits into two stages. The first is cleanup, the work of subtracting garbage. This is where Clario excels. The second is preparation, the work of turning the data that remains into a form AI can read. The two stages face the same direction, but they are not the same job. Here is how that distance breaks down.

Cleanup (Subtraction) · Clario's domain

  • Remove duplicate files
  • Delete discontinued and obsolete documents
  • Clear out legacy formats
  • Handle trivial and never-accessed files

Preparation (Making) · The next stage

  • Structuring — assigning schemas and taxonomies
  • Contextualizing — linking source and origin
  • Labeling — semantic tags and annotations
  • Retrieval optimization — shaping data into embeddable form
  • Versioning & governance — managing currency, ownership, lineage

[Figure: From Cleanup to AI-Ready, the gap. Stage 1, Cleanup (delete), Clario's domain: remove ROT data. Stage 2, Preparation (build), the longer distance: structuring, contextualizing, labeling, search optimization, and versioning, leading to AI-Ready. Cleanup tools can't solve preparation problems, and preparation tools can't replace cleanup.]
▲ The cleanup stage is the starting point of the AI-readiness journey. Preparation is the longer road ahead. | Pebblous Original Diagram
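As a rough illustration of what the preparation stage adds, the Python sketch below attaches structure, source, ownership, labels, and version information to each document, then uses those fields to keep only the current version before anything reaches retrieval. The `PreparedDoc` schema, its field names, and the `current_versions` and `retrievable` helpers are hypothetical, invented for this example; they do not describe any specific product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PreparedDoc:
    doc_id: str
    family: str        # structuring: groups versions of the same document
    version: int       # versioning: higher means newer
    effective: date    # versioning: when this version took effect
    source: str        # contextualizing: the system the document came from
    owner: str         # governance: who is accountable for it
    labels: set[str] = field(default_factory=set)  # labeling: semantic tags
    text: str = ""


def current_versions(docs: list[PreparedDoc]) -> list[PreparedDoc]:
    """Keep only the highest version within each document family."""
    latest: dict[str, PreparedDoc] = {}
    for d in docs:
        best = latest.get(d.family)
        if best is None or d.version > best.version:
            latest[d.family] = d
    return sorted(latest.values(), key=lambda d: d.doc_id)


def retrievable(docs: list[PreparedDoc], required_label: str) -> list[PreparedDoc]:
    """Current documents carrying a given label, ready to embed or index."""
    return [d for d in current_versions(docs) if required_label in d.labels]
```

None of these fields can be recovered by deleting files. Someone, or some system, has to assign them, which is why preparation is a separate job from cleanup.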

What Clario calls a "foundational level of AI-ready" is, precisely, the starting point of this journey. Investors use the same language, but what the product actually does is "remove what gets in AI's way," not "make data AI can use." The gap between the two is exactly the distance between preventing garbage in and creating readiness.

So the question an organization asks itself converges on one thing: where is our data right now? Are we still clearing out garbage, or have we cleared it and still can't get answers from AI, meaning we need the next stage? The two questions call for entirely different investments and tools. You can't solve a preparation problem with a cleanup tool, and you can't replace cleanup with a preparation tool.

The Pebblous View

When Pebblous talks about AI-Ready Data, what it points to is the second stage, preparation. The starting premise is that cleaned data does not automatically become ready for AI. If a tool like Clario pushes stage one along quickly, the work that remains is filling in structure, context, labels, and lineage to turn data into signal.

So Clario's arrival reads less as a competitive signal than as a sign that the market is moving toward the same problem. The very fact that $6M has been attached to the thesis of "data over models" is evidence that data quality is no longer something to put off. And if you push that thesis all the way through, preparation always comes after cleanup.

References

Statistics & Market

  • 4. Gartner. (2026). "60% of AI projects projected to be abandoned due to data quality issues."
  • 5. RAND Corporation. (2024). "More than 80% of AI projects fail to reach intended business value."
  • 6. MIT Project NANDA. (2025). "95% of generative AI pilots show no measurable P&L impact."
  • 7. S&P Global Market Intelligence. (2025). "Share of companies abandoning most AI initiatives rose from 17% to 42%."
  • 8. Veritas. "Global Databerg Report: roughly 85% of enterprise data is ROT or dark data."

※ Figures such as Clario's adoption and garbage share (up to 60%) are based on company statements and partner analyses, with no independent third-party verification. The Gartner, RAND, MIT, and S&P statistics cite each organization's published projections and reports.