If It's Not on the List, You Can't Govern It: The AI Bill of Materials and Data Lineage

Pebblous Data Communication Team

Executive Summary

A software bill of materials (SBOM) is a list of which code components went into a product. But an AI system does not run on code alone. Model weights, training datasets, system prompts, external APIs, guardrails, and increasingly agents and MCP servers are all tangled together at runtime. Look only at the code list, and those components stay in a blind spot. The name for that blind spot is "shadow AI." This report sets out what an AI bill of materials (AI BOM) records once it reaches beyond code, and why it is hardening into a regulatory requirement.

The size of the blind spot shows up in the numbers. The average company runs 14 AI tools, but IT knows about only four or five of them, and one in four organizations cannot say which AI services are running inside the company right now. What is not on the list cannot be governed or audited. It is no surprise, then, that a breach involving shadow AI costs an average of $670,000 more than a standard one and takes 267 days to detect and contain.

Yet "the dataset is on the list" and "the dataset can be trusted" are two different claims. Security tools are good at finding AI components, but they do not judge whether the underlying data carries contamination, duplication, label errors, or license violations. This is where Pebblous starts from a single premise: data you cannot audit is data you cannot regulate. If an AI BOM is the parts list, then data quality and provenance monitoring are the certificate of quality stapled to each part.

63%

No AI governance policy

Share of orgs running AI with no list and no control baseline (IBM 2025)

+$670K

Added cost of a shadow-AI breach

Average premium over a standard breach (IBM 2025)

73%

Exposed to prompt injection

Production AI found vulnerable in security audits

€35M / 7%

Top EU AI Act penalty

7% of revenue or €35M for prohibited practices (Article 99)

1

The Limits of a Code-Only Manifest

A software bill of materials works much like the ingredient label on a food package. Record which library went in, at which version, and under which license; when a vulnerability later surfaces in one of them, you can quickly find every product it touched. After the 2021 Log4Shell crisis, SBOMs became table stakes for supply-chain security, and Gartner expects SBOM adoption among large enterprises to climb from 56% in 2025 to 85% by 2028.

The SBOM rests on a single assumption: the parts are fixed at build time. Once you ship, the list of libraries inside does not change, so one static list is enough. AI systems break exactly that assumption.

Models keep evolving through fine-tuning and retraining. At runtime they call external APIs and pull in fresh data through retrieval. The same code behaves differently the moment you swap a system prompt or a guardrail. The system stays alive long after the build has finished. A static parts list cannot keep pace with a system that flows like this.

So an AI bill of materials does not stop at listing parts. It tracks the relationships between them. Which application uses which model (USES_LLM), which tools it calls (USES_TOOL), which retrievers and memory stores it connects to (USES_RETRIEVER, USES_MEMORY) — all drawn as a graph. It is not a list but a dependency graph.
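One way to picture the difference between a list and a graph is to store each relationship as a typed edge. The sketch below is only an illustration, not the schema of any particular scanner: the relation names come from the paragraph above, while the component names (`support-bot`, `llm-a`, and the rest) and both helper functions are made up for the example.

```python
from collections import defaultdict

# Each edge is (source component, relation, target component).
# Relation names follow the text; component names are hypothetical.
EDGES = [
    ("support-bot", "USES_LLM", "llm-a"),
    ("support-bot", "USES_TOOL", "ticket-search"),
    ("support-bot", "USES_RETRIEVER", "kb-index"),
    ("support-bot", "USES_MEMORY", "session-store"),
    ("report-bot", "USES_LLM", "llm-a"),
]


def dependencies_of(app, edges):
    """Group the direct dependencies of one application by relation type."""
    grouped = defaultdict(list)
    for source, relation, target in edges:
        if source == app:
            grouped[relation].append(target)
    return dict(grouped)


def users_of(component, edges):
    """Return every application with a direct edge to the given component."""
    return sorted({source for source, _, target in edges if target == component})


print(dependencies_of("support-bot", EDGES))
print(users_of("llm-a", EDGES))
```

A flat list could answer the first question. The second one, which applications share a given model, only falls out once the relationships themselves are recorded.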

Laying the two manifests side by side, item by item, makes the difference plain.

Dimension	SBOM (software)	AI BOM (AI bill of materials)
What it tracks	Code, libraries, dependencies	Models, datasets, prompts, agents, even MCP servers
Time assumption	Fixed at build (static)	Keeps evolving in training and runtime (dynamic)
Structure	A list of parts	A dependency graph that captures relationships
Core risk	Known CVEs, vulnerable libraries	Data poisoning, prompt injection, model backdoors
Quality question	Is the version current?	Can this data be trusted?

▲ An SBOM lists only code parts. An AI BOM tracks models, datasets, prompts, and MCP servers as a dynamic dependency graph that captures their relationships. | Pebblous original diagram

An SBOM asks "what went in." An AI BOM adds "how are these things connected, and are they still what they were?" Where a single static list used to be enough, an inventory that follows a living system now takes its place.
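The question "are they still what they were?" can be approximated by fingerprinting each component and comparing snapshots over time. The following is a minimal sketch under stated assumptions: the snapshot layout, the component names, and the choice of SHA-256 over canonical JSON are all illustrative, and a real inventory would fingerprint model weights and datasets by file content rather than by small in-memory values.

```python
import hashlib
import json


def fingerprint(components):
    """Map each component name to a SHA-256 digest of its canonical JSON form."""
    return {
        name: hashlib.sha256(
            json.dumps(content, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for name, content in components.items()
    }


def changed_components(before, after):
    """Names that were added, removed, or altered between two snapshots."""
    old, new = fingerprint(before), fingerprint(after)
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )


# Hypothetical snapshots of the same application at build time and later.
snapshot_build = {
    "system_prompt": "You are a support assistant.",
    "guardrail": {"block_pii": True},
    "model": {"name": "llm-a", "version": "1.0"},
}
snapshot_now = {
    "system_prompt": "You are a support assistant. Offer refunds freely.",
    "guardrail": {"block_pii": True},
    "model": {"name": "llm-a", "version": "1.0"},
}

print(changed_components(snapshot_build, snapshot_now))
```

Here the code and the model are unchanged, yet the comparison reports the system prompt as altered, which is exactly the kind of change a build-time list never sees.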

2

What an AI BOM Records: From Models to MCP Servers

So what, concretely, goes into an AI bill of materials? The open-source AI BOM scanner that Cisco released recognizes 30 component types, which boil down to eight kinds of part. Set against an SBOM that recorded a single category — code dependencies — the list of things you have to write down has grown by whole columns.

Models

Weights, version, origin, license

Datasets

Training and fine-tuning data and its lineage

Prompts

System prompts and templates

Guardrails

Safety filters and policy rules

Secrets

API keys, tokens, credentials

Agents

Autonomous logic and tool calls

MCP servers

External context and tools the model connects to

Identity

Service accounts, permissions, access scope

Listing these eight separately is not enough. The heart of an AI BOM is that it records how they connect. Only by graphing which agent calls which model, which dataset that model was trained on, and which MCP server it pulls tools from can you trace how far the blast radius reaches when one part fails.

▲ The core of an AI BOM is not listing eight parts separately but connecting them as a dependency graph — USES_LLM, USES_TOOL, USES_RETRIEVER relationships and all. | Pebblous original diagram
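Tracing a blast radius amounts to walking the dependency graph backwards from the failed part. The sketch below assumes a simple edge list; the component names are hypothetical, and `TRAINED_ON` is an illustrative relation name that does not appear in the source's list of relations.

```python
from collections import defaultdict, deque

# (dependent, relation, dependency); all names are hypothetical.
EDGES = [
    ("support-bot", "USES_LLM", "llm-a"),
    ("report-bot", "USES_LLM", "llm-a"),
    ("llm-a", "TRAINED_ON", "tickets-2024"),  # assumed relation name
    ("support-bot", "USES_TOOL", "mcp-search"),
    ("audit-bot", "USES_LLM", "llm-b"),
]


def blast_radius(failed, edges):
    """Return every component that depends, directly or transitively, on `failed`."""
    dependents = defaultdict(set)
    for dependent, _, dependency in edges:
        dependents[dependency].add(dependent)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


print(blast_radius("tickets-2024", EDGES))
```

In this toy graph a contaminated `tickets-2024` dataset reaches `llm-a` and both applications built on it, while `audit-bot` stays outside the radius because it uses a different model.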

The formats are already here

Fortunately, the standards for what to record and how are already in place. OWASP's CycloneDX has supported an ML-BOM for machine-learning components since v1.5 in June 2023, and it has since reached v1.7, standardized as ECMA-424 2nd Edition. The Linux Foundation's SPDX 3.0 adds an AI Profile, and OWASP runs a separate AIBOM Project. ISO/IEC 5259 underpins data quality, and ISO/IEC 42001 underpins governance of AI management systems.
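To give a sense of what such a record looks like, the sketch below builds a very small CycloneDX-style document in Python. It is an illustration under stated assumptions rather than a validated ML-BOM: the field names reflect a reading of the CycloneDX 1.5+ JSON schema (component types `machine-learning-model` and `data`, plus a `dependencies` section), the identifiers and versions are hypothetical, and any real document should be checked against the official schema.

```python
import json

# Field names are an assumed reading of the CycloneDX 1.5+ JSON schema;
# validate against the official schema before relying on this layout.
bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "version": 1,
    "components": [
        {
            "type": "machine-learning-model",
            "bom-ref": "model-llm-a",  # hypothetical identifier
            "name": "llm-a",
            "version": "1.0",
        },
        {
            "type": "data",
            "bom-ref": "data-tickets-2024",  # hypothetical identifier
            "name": "tickets-2024",
            "version": "2024.12",
        },
    ],
    # The model is recorded as depending on the dataset it was trained on.
    "dependencies": [
        {"ref": "model-llm-a", "dependsOn": ["data-tickets-2024"]},
    ],
}

print(json.dumps(bom, indent=2))
```

Even in this small form the document says nothing about whether `tickets-2024` is clean, only that it exists and that the model depends on it.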

The academic roots of the idea run deeper still. "Datasheets for Datasets," proposed by Timnit Gebru and colleagues in 2018, argued for attaching a spec sheet to every dataset recording its origin, composition, and intended use; "Model Cards," from Margaret Mitchell and colleagues the same year, asked for the same thing for models. The idea of attaching documentation to each part predates the name "AI BOM" by years.

Given the scale, this is not work you can do by hand. Hugging Face alone hosts over 2 million public models and more than 1.5 million datasets. So the inventory falls to scanners that analyze source code and harvest dependencies automatically. The problem, then, is not a missing format. What remains is what fills those fields and how reliably.

3

Shadow AI: What's Off the List Becomes the Attack Surface

The case for an AI BOM is not abstract. Right now most organizations cannot fully say what AI is running inside them. By Productiv's 2026 tally, the average company uses 14 AI tools but IT accounts for only four or five. The rest runs out of sight.

The chart below shows that visibility gap directly. One bar is the number of AI tools employees actually use; the other is the number IT knows about.

Average AI tools per company versus the number IT can account for. Source: Productiv, 2026.
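The size of that gap follows directly from the two figures in the survey. As a small worked calculation (the function name is illustrative), the share of tools outside IT's view is the difference divided by the total in use:

```python
def unseen_share(tools_in_use, tools_known_to_it):
    """Fraction of AI tools in use that IT cannot account for."""
    if tools_in_use <= 0:
        raise ValueError("tools_in_use must be positive")
    return (tools_in_use - tools_known_to_it) / tools_in_use


# Figures from the text: 14 tools in use, four or five known to IT.
for known in (4, 5):
    print(f"IT knows {known} of 14: {unseen_share(14, known):.0%} unseen")
```

With the reported numbers, roughly 64% to 71% of the AI tools in an average company sit outside the inventory.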

Other surveys point at the same picture. In Wiz's cloud security report, 25% of responding companies said they could not account for the AI services running right now; in IBM's breach-cost report, 63% of organizations had no AI governance policy at all. A Gartner survey found 68% of employees using AI tools their company had not approved. That visibility gap is shadow AI.

If it is not on the list, it cannot be controlled or audited, so shadow AI becomes the attack surface itself. According to IBM, a breach involving shadow AI cost an average of $670,000 more than a standard breach and took 267 days to detect and contain (against 241 for a standard one). Among the organizations breached, 97% lacked access controls for AI.

The new attacks grow in data and prompts, not code

The harder problem is that this new risk does not register with traditional code-dependency scanners. Prompt injection sits at the top of OWASP's risk list for LLM applications, and security audits found 73% of production AI vulnerable to it. The attacks grow not in the code layer but in the data and prompt layers. In its 2025 threat report, ENISA newly classified three emerging supply-chain attacks.

Rules File Backdoor

Malicious instructions are planted in an AI coding assistant's configuration file. The backdoor lives in the "rules," not the code.

Slopsquatting

Attackers pre-register package names an LLM hallucinated. More than 20% of Python code suggestions point to packages that do not exist.

Malicious LoRA

A backdoor is hidden in a lightweight adapter of a few tens of megabytes, outwardly indistinguishable from normal fine-tuning.

▲ SBOM scanners only see the code layer. Rules Backdoor, Slopsquatting, and Malicious LoRA grow in the data and prompt layers — without an AI BOM, that entire surface stays invisible. | Pebblous original diagram
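Of the three, slopsquatting is the easiest to illustrate with an inventory check: before installing anything, compare what generated code imports against an approved list. The sketch below is a simplified illustration, not a production defense. The approved set and the package names are hypothetical, import names do not always match distribution names, and standard-library modules would need their own allowlist.

```python
import ast

# Hypothetical approved inventory, e.g. exported from a lockfile or an AI BOM.
APPROVED_PACKAGES = {"requests", "numpy", "pandas"}


def imported_modules(source):
    """Collect the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def unapproved_imports(source, approved):
    """Imports in generated code that are not on the approved list."""
    return sorted(imported_modules(source) - approved)


# A made-up LLM suggestion that mixes real and invented package names.
suggestion = """
import requests
import fastjsonx
from numpy import array
from cloud_helpers_pro import upload
"""

print(unapproved_imports(suggestion, APPROVED_PACKAGES))
```

Anything the check flags is a name to verify by hand before it reaches a package installer, since a name that looks plausible is exactly what an attacker would pre-register.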

Add to these the zero-days that slip past malware detection in model files (for example, CVE-2025-10155). The common thread is clear: none of this leaves a trace on a code-dependency list. Fail to inventory the origins of models, data, and prompts separately, and the fastest-growing attack surface stays entirely invisible.

4

When Inventory Becomes a Regulatory Requirement

AI inventory is moving from recommendation to requirement. But it is not moving in a single direction. Regulation is splitting into two tracks: an EU-style tightening of substantive duties and a US-style turn toward discretion and disclosure. Reading the trend correctly means looking at both.

The EU is tightening its obligations. The EU AI Act requires technical documentation and data-governance evidence for high-risk systems, and mandates a public summary of training data for general-purpose AI (GPAI). Penalties for prohibited practices run up to €35M or 7% of worldwide annual turnover, whichever is higher, exceeding GDPR's top fine (€20M or 4%). Full enforcement for high-risk systems is scheduled for 2 August 2026 (a delay remains possible). California's AB 2013 requires disclosure of generative-AI training data from 1 January 2026. In academia, the Leiden Declaration, led by mathematicians, called for consent, attribution, and transparency in AI training data, and its signatories had grown to 2,821 as of 30 June 2026.
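To make the penalty ceilings in this paragraph concrete: both the AI Act's top tier and GDPR's top tier take the higher of a fixed amount and a percentage of worldwide annual turnover. The sketch below works through that arithmetic in whole euros. The function names are illustrative, and it deliberately ignores special cases such as the separate treatment of smaller companies.

```python
def fine_cap(turnover_eur, fixed_cap_eur, percent):
    """Upper bound on a fine: the higher of a fixed amount or a share of turnover."""
    return max(fixed_cap_eur, turnover_eur * percent // 100)


def ai_act_cap(turnover_eur):
    """Top-tier ceiling for prohibited practices: EUR 35M or 7% of turnover."""
    return fine_cap(turnover_eur, 35_000_000, 7)


def gdpr_cap(turnover_eur):
    """GDPR top-tier ceiling: EUR 20M or 4% of turnover."""
    return fine_cap(turnover_eur, 20_000_000, 4)


# Hypothetical turnovers below, at, and above the crossover point.
for turnover in (100_000_000, 500_000_000, 2_000_000_000):
    print(f"turnover {turnover:>13,}: AI Act {ai_act_cap(turnover):>11,}  GDPR {gdpr_cap(turnover):>11,}")
```

Both regimes cross over at a turnover of €500M: below it the fixed amount sets the ceiling, above it the percentage does, and the AI Act ceiling is 1.75 times the GDPR one at every turnover.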

At the US federal and state level, the movement runs the other way. Colorado repealed the AI law it had enacted in 2024 (SB 24-205), replacing it on 14 May 2026 with SB 26-189 and pivoting from governance duties toward disclosure and process. The White House Office of Management and Budget memo M-26-05 (23 January 2026) rolled SBOM requirements back from mandatory to agency discretion in software and hardware supply-chain security. Note that this is not a memo mandating an AI BOM — it is, if anything, an example of loosening manifest requirements.

Direction	Representative cases	Character
EU-style substantive duty	EU AI Act, California AB 2013, Leiden Declaration	Mandated documentation and data-governance evidence, strong penalties
US-style discretion / disclosure	Colorado SB 26-189 (repeal), OMB M-26-05	Eased duties, disclosure-centric, SBOM made discretionary

However the direction splits, one premise holds firm: you cannot map a risk you have not inventoried. Whether the duty tightens or shifts to disclosure, if you do not know what is inside your own system you cannot meet either side's requirements.

Industry is strong at "discovery" but blank on "trust"

Security vendors have already moved into the AI BOM market. Yet what they do well and what they leave empty separate sharply. The table below lays out three vendors' strengths and the spot they all leave blank.

Vendor	Strength	Data-trust gap
Cisco AI Defense	Auto-inventories agent dependencies from source scans (30 component types)	Finds the parts, but does not judge data quality or lineage
Wiz AI-SPM	Links identity and cloud permissions	13% real-world usage; data-origin verification out of scope
Palo Alto Prisma AIRS	Runtime defense and threat detection	Detection-focused; says nothing about whether the data parts are trustworthy

All three are good at finding what is there. None of them judges whether that data can be trusted. The fact that 96% of scanner alerts are false positives captures the limit well. Discovery has been automated; trust remains an empty field.

5

An Auditable Inventory: Data Lineage Secures Trust

Here we return to where this report began. "The dataset is on the list" and "the dataset can be trusted" are entirely different problems. Writing a dataset's name into the parts list tells you nothing about whether that data is contaminated, piled with duplicates, mislabeled, or in license violation. If inventoried data carries those flaws, the AI BOM becomes a box-ticking false reassurance.

The numbers show how large this gap looms in practice. In Stanford HAI's 2026 report, 74% of respondents named "inaccuracy" as AI's top risk (up 14 points year over year), and 52% pointed to "data quality" as the biggest obstacle to deploying AI agents. The bottleneck of trust lies not in the model but in the data.

So an auditable inventory is completed where two lineages meet. One is model lineage: the record of which data a model was trained on and how it has evolved. The other is data lineage: the record of where that data came from, what transformations it went through, and what its quality is now. When both lineages attach to the parts list, the AI BOM moves past a checklist to become an auditable foundation of trust.

So what does an "inventory with lineage attached" look like in practice? For each data part, the results of checking for outliers, duplicates, label errors, and license violations follow along as a quality certificate, and where the data came from and what transformations it passed through stay on record as a provenance trail. A single line in the list then changes from "dataset name and version" into "evidence you can trace back to when, with what, and how it was validated." The standards seen earlier have already framed this blank: ISO/IEC 5259 provides the measures of data quality, and ISO/IEC 42001 provides the governance framework that records the validation process. The one question an auditor asks — can I trust this data? — comes to be answered by the list itself.

▲ Data lineage completes the auditable AI BOM entry in three steps that end in an audit trail: origin record → transformation log → quality check → audit trail. | Pebblous original diagram

An AI BOM that is only a list

The dataset name and version are written down. But the grounds for trusting that data are blank. The form is filled in, yet it cannot be audited.

An AI BOM with lineage attached

Each data part carries a quality certificate (checks for outliers, duplicates, label errors) along with its origin and transformation history. The list becomes auditable evidence.
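The contrast between the two pictures can be expressed as a record type plus a check for what an auditor would still be missing. This is a minimal sketch under stated assumptions, not a Pebblous or standards-defined schema: the field names, the set of required checks, and the example dataset are all hypothetical.

```python
from dataclasses import dataclass, field

# Checks named in the text; treating all four as mandatory is an assumption.
REQUIRED_CHECKS = {"outliers", "duplicates", "label_errors", "license"}


@dataclass
class DatasetEntry:
    name: str
    version: str
    origin: str = ""  # where the data came from
    transformations: list = field(default_factory=list)  # ordered processing steps
    checks: dict = field(default_factory=dict)  # check name -> passed?


def audit_gaps(entry):
    """List what an auditor would still be missing for this entry."""
    gaps = []
    if not entry.origin:
        gaps.append("origin")
    if not entry.transformations:
        gaps.append("transformations")
    for check in sorted(REQUIRED_CHECKS):
        if check not in entry.checks:
            gaps.append(f"check:{check} missing")
        elif not entry.checks[check]:
            gaps.append(f"check:{check} failed")
    return gaps


list_only = DatasetEntry("tickets-2024", "2024.12")
with_lineage = DatasetEntry(
    "tickets-2024",
    "2024.12",
    origin="helpdesk export",
    transformations=["deduplicate", "scrub-pii"],
    checks={name: True for name in REQUIRED_CHECKS},
)

print(audit_gaps(list_only))
print(audit_gaps(with_lineage))
```

Both entries name the same dataset and version. Only the second returns an empty gap list, which is the difference between a form that is filled in and one that can be audited.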

That second picture is the one Pebblous has long argued for in its work on data quality and lineage. An AI BOM's "data-part inventory" is in effect a data-quality report, and the "AI Act documentation requirement" is met by a provenance trail that automatically records the origin and transformation history of data. For an inventory to secure trust, the record has to include where the listed data came from and how it was validated. At that point the manifest moves past false reassurance.

On the ground, this difference also shows up as speed. With an inventoriable data pipeline in place, data preparation itself gets faster. In one internal Pebblous case, the time to collect and clean data on a manufacturing floor fell from between three and five years to two weeks (internal data). Traceable data is grounds for safety and certification, and it is also a source of business velocity.

Data you cannot audit is data you cannot regulate. If an AI BOM is the parts list, then data quality and lineage monitoring are the quality certificate stapled to each part. Building the list and making the list trustworthy are different jobs — and the second is the first gate of trust.

Editor's Note

The problems Pebblous has worked on, namely diagnosing and cleaning data quality (DataClinic), delivering data in a validated form (AI-Ready Data), and automatically recording origin and transformation history, map onto the data-trust layer of the "auditable inventory" this report describes. If security vendors handle discovery and detection, Pebblous sees its role in this shift as the complement: establishing whether each part can be trusted.

R

References

Primary sources (policy & industry)

1. FedTech Magazine. (2026, June). "How Federal Agencies Can Inventory and Govern AI Systems with AI BOMs." FedTech Magazine.
2. TechInformed. (2026). "Shadow AI now needs a bill of materials." TechInformed.
3. Cisco. (2026). "Know Your AI Stack: Introducing AI BOM in Cisco AI Defense." Cisco Blogs.

Standards & tools

4. OWASP. (2025). "CycloneDX: ML-BOM / AI BOM (ECMA-424 2nd Edition, v1.7)." OWASP Foundation.
5. Cisco AI Defense. (2026). "AI BOM open-source scanner." GitHub.

Academic (origins of part-level documentation)

6. Gebru, T., et al. (2018/2021). "Datasheets for Datasets." Communications of the ACM / arXiv:1803.09010.
7. Mitchell, M., et al. (2019). "Model Cards for Model Reporting." FAT* 2019 / arXiv:1810.03993.

Data & statistics

8. IBM Security / Ponemon Institute. (2025). "Cost of a Data Breach Report 2025." IBM.
9. Wiz Research. (2026). "State of AI in the Cloud 2026." Wiz.
10. ENISA. (2025). "Threat Landscape 2025." European Union Agency for Cybersecurity.
11. Stanford HAI. (2026). "AI Index Report 2026." Stanford University.
12. European Commission. (2024). "EU AI Act, Article 99 (Penalties)." EUR-Lex.
13. Leiden Declaration on Artificial Intelligence and Mathematics. (2026). "Signatories." (2,821 signatories, confirmed 2026-06-30)

※ Some market-size figures are not stated as single values because of definitional differences across research firms. The case of shortened data-preparation time on a manufacturing floor is based on internal Pebblous data.