In the Data-Centric AI era, quality determines success
Executive Summary
The advancement of artificial intelligence (AI) has been driven primarily by innovations in model architecture. However, as state-of-the-art models become commoditized and more accessible, the key factor determining the success of AI systems is shifting from models to data.
🎯 Key Message
Data quality, richness, and integrity have emerged as the core differentiators of technological competitiveness. Biases embedded in data, inaccurate labeling, data drift, unclear provenance, and ethical blind spots can lead not only to degraded AI system performance but also to serious societal consequences.
This report provides a comprehensive analysis of six major frameworks designed to assess and manage AI data quality: Datasheets for Datasets, Google Dataset Cards, IBM's Data Quality for AI (DQAI), NVIDIA NeMo Curator, MLCommons DataPerf, and the OECD.AI Principles.
By analyzing these frameworks through complementary lenses of documentation, quantification, automation, governance, benchmarking, and ethics, this report presents an integrated data quality strategy for organizations seeking to build trustworthy and effective AI systems.
1. The Dawn of the Data-Centric AI Era
The AI development paradigm is shifting from 'Model-Centric' to 'Data-Centric'. As state-of-the-art models become increasingly commoditized, the key to competitiveness now lies in data quality, richness, and integrity.
Why Data Quality Matters
- Social Bias: Latent biases embedded in data lead to discriminatory outcomes
- Labeling Errors: Inaccurate annotations degrade model performance
- Data Drift: Changes in data distribution over time reduce performance
- Ethical Blind Spots: Lack of ethical considerations in data collection and usage
Systemic Risk: These issues are not merely technical flaws but can lead to model failures, reputational damage, and regulatory violations.
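Of the risks above, data drift is the most amenable to automated monitoring. Below is a minimal sketch, assuming a numeric feature, of the Population Stability Index (PSI), one common binning-based drift score; the 10-bin default and the 0.2 alert threshold are rules of thumb, not standards:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a
    live sample of one numeric feature, binned on the baseline's range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = int((x - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1  # clamp out-of-range values
        n = len(sample)
        # small epsilon keeps log() finite for empty bins
        return [max(c / n, 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Scores near 0 mean the live distribution matches the baseline; values above roughly 0.2 are commonly treated as a drift alert worth investigating.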
2. Part I: Standards for Data Transparency and Documentation
The journey of data quality management begins with transparent and comprehensive documentation. Without clear information about how a dataset was created, its characteristics, and its limitations, quality cannot be meaningfully discussed.
Datasheets for Datasets
An Ethical Framework from Academia
Proposed by Gebru et al. in 2018, this concept draws inspiration from electronic component datasheets to present a standardized documentation framework for ML datasets.
Key Question Areas:
- Motivation: Who created it and why?
- Composition: What data is included?
- Collection: How and where was it collected?
- Preprocessing: What cleaning operations were performed?
- Uses: What are the intended/prohibited use cases?
Philosophical Shift: Redefining datasets not as objective raw materials, but as socio-technical constructs involving human judgment
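In practice, a datasheet can be kept as structured metadata that travels with the dataset. A minimal sketch follows, covering an illustrative subset of the question areas above; the field names and example values are hypothetical, not part of the published framework:

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    """A machine-readable stub of the Datasheets for Datasets questions."""
    motivation: str                  # who created it and why
    composition: str                 # what the data contains
    collection: str                  # how and where it was gathered
    preprocessing: str               # cleaning steps applied
    intended_uses: list[str] = field(default_factory=list)
    prohibited_uses: list[str] = field(default_factory=list)

sheet = Datasheet(
    motivation="Internal team; benchmark sentiment models",
    composition="50k English product reviews with star ratings",
    collection="Gathered from public review pages, 2023-2024",
    preprocessing="HTML stripped, near-duplicates removed",
    intended_uses=["sentiment classification"],
    prohibited_uses=["inferring individual identity"],
)
```

Storing the answers as data rather than free text lets a pipeline reject datasets whose datasheet is missing or whose prohibited uses conflict with the task at hand.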
Google Dataset Cards
A Practical Implementation from Industry
A structured and flexible toolkit that evolves the academic Datasheets concept for large-scale technology organizations. The Data Cards Playbook embeds transparency into organizational culture.
4 Core Modules:
Living Document: Review and update recommended every 6 months or upon significant changes
3. Part II: Quantification and Automation of Data Quality
To efficiently process data at scale, quantitative and automated methodologies beyond qualitative documentation are essential.
IBM's 7 Data Quality Dimensions
Data Quality for AI (DQAI)
A measurable reliability framework that adapts traditional enterprise data quality management principles to the AI lifecycle.
- Accuracy: Alignment with the real world
- Completeness: Whether required data is missing
- Consistency: No conflicts between data points
- Timeliness: Up-to-date when needed
- Validity: Compliance with format/type/range
- Uniqueness: No duplicate records
- Bias/Fairness (AI-specific): Preventing adverse outcomes for specific groups
Limitation: Data that scores perfectly on these technical metrics can still encode historical biases. Metrics alone therefore set only a technical floor; an additional layer of ethical review is required on top of them.
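Several of the dimensions can be scored directly from the records themselves. Here is a minimal sketch over a list of record dicts; the `required` field list and the e-mail validity rule are illustrative, and accuracy, timeliness, consistency, and bias/fairness are omitted because they need external ground truth or group labels:

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+$")  # illustrative validity rule

def quality_report(rows, required):
    """Score the structural dimensions on a list of record dicts."""
    n = len(rows)
    # completeness: every required field is present and non-empty
    complete = sum(
        all(r.get(f) not in (None, "") for f in required) for r in rows
    )
    # uniqueness: share of fully distinct records
    unique = len({tuple(sorted(r.items())) for r in rows})
    # validity: the email field matches the expected format
    valid = sum(bool(EMAIL_RE.match(r.get("email", ""))) for r in rows)
    return {
        "completeness": complete / n,
        "uniqueness": unique / n,
        "validity": valid / n,
    }
```

A report like this gives each dataset a numeric baseline to track over time, which is exactly where documentation-only approaches stop short.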
NVIDIA's Pipeline-Centric Approach
NeMo Curator - Large-Scale Data Curation
This approach treats data quality not as a one-time verification but as a continuous, automated pipeline challenge. It is optimized specifically for processing the vast volumes of unstructured data required for deep learning.
Core Capabilities
- ⚡ Automated data downloading, cleaning, and quality filtering
- 🎬 Multi-modality support (text/image/video)
- 🔄 Semantic deduplication and data blending
- 🎨 Synthetic data generation - addressing identified weaknesses
Data Flywheel
A virtuous cycle: Model feedback → Data improvement → Enhanced model performance
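The pipeline idea (clean, filter, deduplicate as chained, repeatable stages) can be sketched in plain Python. This is a toy stand-in, not the NeMo Curator API: the cleaning rules, the word-count and alphabetic-ratio filter, and the hash-based exact dedup are all illustrative, and NeMo Curator's semantic deduplication works on embeddings rather than content hashes:

```python
import hashlib
import re

def clean(doc: str) -> str:
    """Strip markup remnants and collapse whitespace."""
    doc = re.sub(r"<[^>]+>", " ", doc)
    return re.sub(r"\s+", " ", doc).strip()

def quality_filter(doc: str, min_words: int = 5) -> bool:
    """Heuristic filter: drop very short or mostly non-alphabetic docs."""
    words = doc.split()
    alpha = sum(w.isalpha() for w in words)
    return len(words) >= min_words and alpha / max(len(words), 1) > 0.6

def dedupe(docs):
    """Exact dedup via content hashing."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def curate(raw_docs):
    """Chain the stages: clean -> filter -> dedupe."""
    docs = [clean(d) for d in raw_docs]
    docs = [d for d in docs if quality_filter(d)]
    return dedupe(docs)
```

In a production pipeline each stage runs distributed over data shards, and the flywheel closes when error analysis on the trained model feeds new filters or synthetic data back into curation.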
4. Part III: Benchmarking and Governance
Data quality must be addressed beyond individual organizations, at the level of industry standardization and international governance.
DataPerf
Competitive Benchmarking by MLCommons
An initiative that shifts the competitive focus of the ML community from model-centric to data-centric. Public leaderboards drive innovation in data-centric algorithms.
Key Challenges:
- Selecting the optimal data subset
- Prioritizing noise/error identification
- Strategic data procurement
- Discovering model failure modes
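The noise/error-identification challenge, for example, reduces to ranking examples for human review. A minimal sketch: score each example by how little probability a trained model assigns to its recorded label, so the most suspicious labels surface first (the tuple layout and the example values are illustrative, not the DataPerf submission format):

```python
def rank_suspect_labels(examples):
    """Rank example ids for label review, least-confident first.

    `examples` is a list of (example_id, recorded_label, prob_by_class)
    tuples, where prob_by_class comes from an already-trained classifier.
    """
    scored = [
        (probs.get(label, 0.0), ex_id)   # model's confidence in the label
        for ex_id, label, probs in examples
    ]
    scored.sort()                         # lowest confidence first
    return [ex_id for _, ex_id in scored]
```

Submissions to a cleaning-style benchmark are then compared by how much model accuracy improves after relabeling only the top of each ranked list, which is what makes the competition data-centric rather than model-centric.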
OECD.AI Principles
Trustworthy Data Governance
A policy framework that sets the highest-level international standards for trustworthy AI, serving as an 'ethical and legal API' that bridges technology and societal expectations.
5 Value-Based Principles:
- Inclusive growth, sustainable development, and well-being
- Human-centred values and fairness
- Transparency and explainability
- Robustness, security, and safety
- Accountability
5. Comparative Framework Analysis
Each of the six frameworks has its own unique philosophy and approach, and a robust data quality management system can be built when they are utilized in an integrated manner.
| Framework | Core Focus | Key Outputs | Approach |
|---|---|---|---|
| Datasheets | Ethics Theory | Conceptual Framework | Socio-technical Analysis |
| Google Cards | Transparency Documentation | Templates & Playbook | Qualitative, Manual |
| IBM DQAI | Quantitative Metrics | Software & API | Quantitative, Automated |
| NVIDIA NeMo | Automated Pipeline | Curation Library | Pipeline-centric, Scalable |
| DataPerf | Competitive Benchmarking | Leaderboards & Challenges | Competition-based, Bottom-up |
| OECD.AI | Policy Governance | Policy Guidelines | Principle-based, Top-down |
Integration Strategy Example
1. Top-Level Governance: Establish an AI ethics charter based on OECD principles
2. Ensuring Transparency: Mandate documentation with Google Dataset Cards
3. Quantitative Measurement: Set structured-data baselines with IBM tools
4. Automation & Scalability: Process large-scale unstructured data with NVIDIA pipelines
5. Performance Measurement & Innovation: Run internal challenges in the DataPerf style
6. Building a Practical Organizational Strategy
Data Quality Maturity Model
- Level 1 (Ad-Hoc): Inconsistent management at individual team levels, without standardized procedures
- Level 2 (Standardized): Data card documentation standards established; regular technical inspections performed
- Level 3 (Optimized): Automated curation pipelines built; internal benchmarking in operation
- Level 4 (Ethically Aware): Proactive assessment along socio-technical pillars; integrated ethics reviews
Multi-Layered Data Quality Strategy Model
- Stage 1, the "Why": Establish governance by defining principles and charters
- Stage 2, the "What": Mandate documentation by creating standard templates
- Stage 3, the "How": Automate processes by adopting tools and building pipelines
- Stage 4, the "How Well": Measure performance and improve by running benchmarks
Conclusion: High-Quality Data as an Essential Asset for Trustworthy AI
The six frameworks analyzed in this report demonstrate that the understanding of data quality is evolving beyond simple technical preprocessing into a core strategic function for building effective, trustworthy, and responsible AI.
- 📋 Documentation: Google Dataset Cards, the foundation for transparency and accountability
- 📊 Quantification: IBM DQAI, measuring technical soundness
- ⚡ Automation: NVIDIA NeMo, efficient management at scale
- 🏆 Benchmarking: DataPerf, driving data-centric innovation
- 🌐 Governance: OECD.AI, connecting to societal context
- ⚖️ Ethics: Datasheets, the cornerstone of responsible AI
🚀 Future Outlook
In the future AI landscape, these approaches will converge within a unified data governance framework. Successful organizations will manage data quality through multidisciplinary teams that combine technical expertise, ethical insight, and policy understanding. Securing and managing high-quality data will become the most important driver of sustainable competitive advantage.
References
- mlcommons/dataperf: Data Benchmarking - GitHub. https://github.com/mlcommons/dataperf
- AI Ethics at IBM. IBM Data Ethics PDF
- Beyond Accuracy: Redefining Data Quality Metrics for Ethical AI - ResearchGate. ResearchGate
- Datasheets for Datasets - Morgan Klaus Scheuerman. morgan-klaus.com
- Datasheets for Datasets - Microsoft Research. Microsoft PDF
- Datasheets for Datasets - arXiv. arXiv:1803.09010
- Datasheets for Datasets - ResearchGate. ResearchGate
- User Guide - Data Cards Playbook - Google Research. Google Research
- The Data Cards Playbook - Google Research. Google Research
- Data Cards Playbook: Transparent documentation for responsible AI - Google for Developers. Google Developers
- Data Quality in AI - IBM Research. IBM Research
- Data Quality Tools & Solutions - IBM. IBM Solutions
- What Is Data Quality Management? - IBM. IBM Think
- What Is Data Quality? - IBM. IBM Think
- Data quality dimensions - IBM. IBM Docs
- The Six Primary Dimensions for Data Quality Assessment. SBCTC PDF
- Data Quality for AI Tool: Exploratory Data Analysis on IBM API - ResearchGate. ResearchGate
- NVIDIA AI Enterprise - Cloud-native Software Platform. NVIDIA
- NeMo Curator - NVIDIA Developer. NVIDIA Developer
- NeMo - Build, monitor, and optimize AI agents - NVIDIA. NVIDIA
- Chat With Your Enterprise Data Through Open-Source AI-Q NVIDIA Blueprint. NVIDIA Blog
- Benchmark Work - Benchmarks MLCommons. MLCommons
- DataPerf. dataperf.org
- AI Principles Overview - OECD.AI. OECD.AI
- OECD AI Principles. OECD.AI
- OECD AI Principles: Guardrails to Responsible AI Adoption - code4thought. code4thought
- Working Group on Data Governance - OECD.AI. OECD.AI
- Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation - arXiv. arXiv
- What are the key metrics used to evaluate Vision-Language Models? - Milvus. Milvus
- DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark - MDPI. MDPI
- A Survey of State of the Art Large Vision Language Models - arXiv. arXiv
Download Full Report
PDF version with detailed analysis and references
Download the PDF version of this report, which includes the complete content, detailed references, and additional analysis materials. Use it for sharing and learning within your organization.
Download: AI Data QA Framework.pdf (PDF format | ~2.5MB | Published: September 25, 2025)