In the Data-Centric AI era, quality determines success

Reading time: ~15 min

Executive Summary

The advancement of artificial intelligence (AI) has been driven primarily by innovations in model architecture. However, as state-of-the-art models become commercialized and more accessible, the key factor determining the success of AI systems is shifting from models to data.

🎯 Key Message

Data quality, richness, and integrity have emerged as the core differentiators of technological competitiveness. Biases embedded in data, inaccurate labeling, data drift, unclear provenance, and ethical blind spots can lead not only to degraded AI system performance but also to serious societal consequences.

This report provides a comprehensive analysis of six major frameworks designed to assess and manage AI data quality:

  • 📋 Datasheets (Academia): Ethics Theory
  • 🔍 Google Dataset Cards: Transparency Documentation
  • 📊 IBM DQAI: Quantitative Measurement
  • ⚙️ NVIDIA NeMo: Automated Pipeline
  • 🏆 DataPerf: Competitive Benchmarking
  • 🌐 OECD.AI: Policy Governance

By analyzing these frameworks through complementary lenses of documentation, quantification, automation, governance, benchmarking, and ethics, this report presents an integrated data quality strategy for organizations seeking to build trustworthy and effective AI systems.

1. The Dawn of the Data-Centric AI Era

The AI development paradigm is shifting from 'Model-Centric' to 'Data-Centric'. As state-of-the-art models become increasingly commoditized, the key to competitiveness now lies in data quality, richness, and integrity.

Why Data Quality Matters

  • ⚠️ Social Bias: Latent biases embedded in data lead to discriminatory outcomes
  • 🎯 Labeling Errors: Inaccurate annotations degrade model performance
  • 📉 Data Drift: Changes in data distribution over time reduce performance
  • 🔒 Ethical Blind Spots: Lack of ethical considerations in data collection and usage

Systemic Risk: These issues are not merely technical flaws but can lead to model failures, reputational damage, and regulatory violations.
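Of the four risks above, data drift is the most amenable to automated monitoring. A minimal sketch of drift detection using a two-sample Kolmogorov-Smirnov test (the feature values, sample sizes, and significance threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_feature, live_feature, alpha=0.05):
    """Flag drift when the two samples are unlikely to share a distribution."""
    statistic, p_value = ks_2samp(train_feature, live_feature)
    return p_value < alpha, statistic

# Simulated feature: production data has drifted (mean shifted by 0.5)
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)
live = rng.normal(loc=0.5, scale=1.0, size=5000)

drifted, stat = detect_drift(train, live)
print(drifted)  # True: the distribution has shifted
```

In practice such a check would run per feature on a schedule, with alerts feeding back into data collection and retraining decisions.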

2. Part I: Standards for Data Transparency and Documentation

The journey of data quality management begins with transparent and comprehensive documentation. Without clear information about how a dataset was created, its characteristics, and its limitations, quality cannot be meaningfully discussed.

Framework 1: Datasheets for Datasets

An Ethical Framework from Academia

Proposed by Gebru et al. in 2018, this concept draws inspiration from electronic component datasheets to present a standardized documentation framework for ML datasets.

Key Question Areas:

  • Motivation: Who created it and why?
  • Composition: What data is included?
  • Collection: How and where was it collected?
  • Preprocessing: What cleaning operations were performed?
  • Uses: What are the intended/prohibited use cases?

Philosophical Shift: Redefining datasets not as objective raw materials, but as socio-technical constructs involving human judgment
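The five question areas can also be captured as a machine-readable skeleton, so that the completeness of the documentation itself can be checked. A minimal sketch (the field names, example answers, and `missing_sections` helper are illustrative, not part of the Datasheets specification):

```python
# Minimal datasheet skeleton following the question areas of
# "Datasheets for Datasets" (Gebru et al., 2018). All values are illustrative.
datasheet = {
    "motivation": {
        "created_by": "Example Research Team",
        "purpose": "Benchmark sentiment classification on product reviews",
    },
    "composition": {
        "instances": "50,000 English-language review texts with star ratings",
        "sensitive_data": "None identified",
    },
    "collection": {
        "method": "Public API crawl, 2023-01 to 2023-06",
        "consent": "Terms-of-service permitted research use",
    },
    "preprocessing": {
        "steps": ["HTML stripping", "language filtering", "near-duplicate removal"],
    },
    "uses": {
        "intended": ["sentiment analysis research"],
        "prohibited": ["individual profiling"],
    },
}

def missing_sections(sheet, required=("motivation", "composition", "collection",
                                      "preprocessing", "uses")):
    """Return the question areas the datasheet has not yet answered."""
    return [s for s in required if not sheet.get(s)]

print(missing_sections(datasheet))  # []
```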

Framework 2: Google Dataset Cards

A Practical Implementation from Industry

A structured and flexible toolkit that evolves the academic Datasheets concept for large-scale technology organizations. The Data Cards Playbook embeds transparency into organizational culture.

4 Core Modules:

  • Ask: Define transparency
  • Inspect: Generate metadata
  • Answer: Fill out templates
  • Audit: Impact assessment

Living Document: Review and update recommended every 6 months or upon significant changes
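The Ask → Inspect → Answer → Audit flow can be supported with simple tooling that renders collected answers into a reviewable card and flags stale documentation. A hypothetical sketch (the function and card format are illustrative, not Google's actual toolkit):

```python
from datetime import date

def render_data_card(answers: dict, last_review: date) -> str:
    """Render collected transparency answers as a reviewable card (hypothetical format)."""
    lines = [f"# Data Card: {answers.get('name', 'Unnamed dataset')}"]
    for section, text in answers.items():
        if section == "name":
            continue
        lines.append(f"## {section.title()}")
        lines.append(str(text))
    # Living-document check: the playbook recommends review every 6 months.
    months_old = (date.today() - last_review).days / 30
    if months_old > 6:
        lines.append("> ⚠️ Review overdue (last reviewed more than 6 months ago).")
    return "\n".join(lines)

card = render_data_card(
    {"name": "reviews-v2", "provenance": "Crawled 2023", "limitations": "English only"},
    last_review=date(2023, 1, 1),
)
print(card)
```

Automating the staleness check is one way to turn the "living document" recommendation into an enforceable process rather than a convention.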

3. Part II: Quantification and Automation of Data Quality

To efficiently process data at scale, quantitative and automated methodologies beyond qualitative documentation are essential.

Framework 3: IBM's 7 Data Quality Dimensions

Data Quality for AI (DQAI)

A measurable reliability framework that adapts traditional enterprise data quality management principles to the AI lifecycle.

  • 🎯 Accuracy: Alignment with the real world
  • 📝 Completeness: Whether required data is missing
  • 🔄 Consistency: No conflicts between data points
  • ⏱️ Timeliness: Up-to-date when needed
  • ✅ Validity: Compliance with format/type/range
  • 🎲 Uniqueness: No duplicate records
  • ⚖️ Bias/Fairness (AI-specific): Preventing adverse outcomes for specific groups

Limitation: Even data that is technically perfect by metrics can still contain historical biases. An additional ethical "ceiling" must therefore be established.
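Several of the technical dimensions can be scored directly on tabular data. A minimal pandas sketch (the column names, value ranges, and sample data are illustrative; IBM's DQAI tooling exposes comparable checks through its own API):

```python
import pandas as pd

# Toy table with deliberate quality defects
df = pd.DataFrame({
    "user_id": [1, 2, 2, 4, 5],              # duplicate id -> uniqueness issue
    "age": [34, -1, 29, None, 41],           # -1 out of range, None missing
    "country": ["KR", "US", "US", "DE", "JP"],
})

completeness = 1 - df["age"].isna().mean()       # share of non-missing values
uniqueness = df["user_id"].nunique() / len(df)   # duplicates lower this
validity = df["age"].between(0, 120).mean()      # share within a plausible range

print(f"completeness={completeness:.2f}, uniqueness={uniqueness:.2f}, validity={validity:.2f}")
```

Each score lands in [0, 1], so per-dimension thresholds can gate a dataset before it enters training; the bias/fairness dimension, by contrast, requires group-level analysis that simple column checks cannot provide.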

Framework 4: NVIDIA's Pipeline-Centric Approach

NeMo Curator - Large-Scale Data Curation

This approach treats data quality not as a one-time verification but as a continuous, automated pipeline challenge. It is optimized specifically for processing the vast volumes of unstructured data required for deep learning.

Core Capabilities

  • Automated data downloading, cleaning, and quality filtering
  • 🎬 Multi-modality support (text/image/video)
  • 🔄 Semantic deduplication and data blending
  • 🎨 Synthetic data generation - addressing identified weaknesses

Data Flywheel

A virtuous cycle: Model feedback → Data improvement → Enhanced model performance

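The filtering and deduplication stages of such a pipeline can be shown in miniature. A toy sketch with simple heuristics (illustrative only; NeMo Curator's actual implementation is GPU-accelerated and uses embedding-based semantic deduplication rather than exact hashing):

```python
import hashlib

def quality_filter(doc: str, min_words: int = 5) -> bool:
    """Keep documents that pass simple heuristics (length, alphabetic ratio)."""
    if len(doc.split()) < min_words:
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6

def dedup_exact(docs):
    """Drop exact duplicates by content hash (semantic dedup would use embeddings)."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog.",  # duplicate after normalization
    "1234 5678 9012",                                # fails alphabetic-ratio filter
    "Data quality determines model quality at scale.",
]
cleaned = [d for d in dedup_exact(corpus) if quality_filter(d)]
print(len(cleaned))  # 2
```

At production scale these stages run as streaming pipeline steps over billions of documents, which is why treating quality as a pipeline property rather than a one-time check matters.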

4. Part III: Benchmarking and Governance

Data quality must be addressed beyond individual organizations, at the level of industry standardization and international governance.

Framework 5: DataPerf

Competitive Benchmarking by MLCommons

An initiative that shifts the competitive focus of the ML community from model-centric to data-centric. Public leaderboards drive innovation in data-centric algorithms.

Key Challenges:

  • 🎯 Dataset Selection: Selecting the optimal data subset
  • 🔧 Dataset Cleaning: Prioritizing noise/error identification
  • 💰 Dataset Acquisition: Strategic data procurement
  • ⚔️ Adversarial Examples: Discovering model failure modes
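The dataset-selection challenge, for instance, asks for the training subset that maximizes downstream accuracy under a size budget. A toy baseline that ranks examples by a per-example utility score (names and scores are illustrative; real DataPerf submissions use far stronger selection algorithms):

```python
def select_subset(examples, scores, budget):
    """Keep the `budget` highest-scoring examples.

    `scores` can be any per-example utility estimate, e.g. influence
    scores or margin-based uncertainty (illustrative here).
    """
    ranked = sorted(zip(scores, examples), key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in ranked[:budget]]

examples = ["ex_a", "ex_b", "ex_c", "ex_d", "ex_e"]
scores = [0.9, 0.1, 0.7, 0.4, 0.8]
print(select_subset(examples, scores, budget=3))  # ['ex_a', 'ex_e', 'ex_c']
```

The leaderboard then scores submissions by training a fixed model on the chosen subset, so competition pressure lands entirely on the data side rather than the model side.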

Framework 6: OECD.AI Principles

Trustworthy Data Governance

A policy framework that sets the highest-level international standards for trustworthy AI, serving as an 'ethical and legal API' that bridges technology and societal expectations.

5 Value-Based Principles:

1. Inclusive Growth - Benefits for all members of society
2. Human-Centered Values - Respect for human rights, bias prevention
3. Transparency - Understandable provenance and processing
4. Robustness/Security - Defense against malicious attacks
5. Accountability - Clear responsibility assignment

5. Comparative Framework Analysis

Each of the six frameworks has its own unique philosophy and approach, and a robust data quality management system can be built when they are utilized in an integrated manner.

| Framework | Core Focus | Key Outputs | Approach |
|---|---|---|---|
| Datasheets | Ethics Theory | Conceptual Framework | Socio-technical Analysis |
| Google Cards | Transparency Documentation | Templates & Playbook | Qualitative, Manual |
| IBM DQAI | Quantitative Metrics | Software & API | Quantitative, Automated |
| NVIDIA NeMo | Automated Pipeline | Curation Library | Pipeline-centric, Scalable |
| DataPerf | Competitive Benchmarking | Leaderboards & Challenges | Competition-based, Bottom-up |
| OECD.AI | Policy Governance | Policy Guidelines | Principle-based, Top-down |

Integration Strategy Example

1. Top-Level Governance: Establish an AI ethics charter based on OECD principles
2. Ensuring Transparency: Mandatory documentation with Google Dataset Cards
3. Quantitative Measurement: Set structured data baselines with IBM tools
4. Automation & Scalability: Process large-scale unstructured data with NVIDIA pipelines
5. Performance Measurement & Innovation: Run internal challenges in the DataPerf style

6. Building a Practical Organizational Strategy

Data Quality Maturity Model

Level 1: Ad-Hoc

Inconsistent management at individual team levels without standardized procedures

Level 2: Standardized

Data card documentation standards established, regular technical inspections performed

Level 3: Optimized

Automated curation pipelines built, internal benchmarking in operation

Level 4: Ethically Aware

Proactive assessment along socio-technical pillars, integrated ethics reviews

Multi-Layered Data Quality Strategy Model

  • 🎯 Stage 1 (The "Why"): Establish governance - define principles and charters
  • 📋 Stage 2 (The "What"): Mandate documentation - create standard templates
  • ⚙️ Stage 3 (The "How"): Automate processes - adopt tools and build pipelines
  • 📊 Stage 4 (The "How Well"): Measure performance and improve - run benchmarks

Conclusion: High-Quality Data as an Essential Asset for Trustworthy AI

The six frameworks analyzed in this report demonstrate that the understanding of data quality is evolving beyond simple technical preprocessing into a core strategic function for building effective, trustworthy, and responsible AI.

  • 📋 Documentation - Google Dataset Cards: foundation for transparency and accountability
  • 📊 Quantification - IBM DQAI: measuring technical soundness
  • ⚡ Automation - NVIDIA NeMo: efficient management at scale
  • 🏆 Benchmarking - DataPerf: driving data-centric innovation
  • 🌐 Governance - OECD.AI: connecting to societal context
  • ⚖️ Ethics - Datasheets: cornerstone of responsible AI

🚀 Future Outlook

In the future AI landscape, these approaches will converge within a unified data governance framework. Successful organizations will manage data quality through multidisciplinary teams that combine technical expertise, ethical insight, and policy understanding. Securing and managing high-quality data will become the most important driver of sustainable competitive advantage.

References


  1. mlcommons/dataperf: Data Benchmarking - GitHub. https://github.com/mlcommons/dataperf
  2. AI Ethics at IBM - IBM Data Ethics (PDF).
  3. Beyond Accuracy: Redefining Data Quality Metrics for Ethical AI - ResearchGate.
  4. Datasheets for Datasets - Morgan Klaus Scheuerman (morgan-klaus.com).
  5. Datasheets for Datasets - Microsoft Research (PDF).
  6. Datasheets for Datasets - arXiv:1803.09010.
  7. Datasheets for Datasets - ResearchGate.
  8. User Guide - Data Cards Playbook - Google Research.
  9. The Data Cards Playbook - Google Research.
  10. Data Cards Playbook: Transparent documentation for responsible AI - Google for Developers.
  11. Data Quality in AI - IBM Research.
  12. Data Quality Tools & Solutions - IBM.
  13. What Is Data Quality Management? - IBM Think.
  14. What Is Data Quality? - IBM Think.
  15. Data quality dimensions - IBM Docs.
  16. The Six Primary Dimensions for Data Quality Assessment - SBCTC (PDF).
  17. Data Quality for AI Tool: Exploratory Data Analysis on IBM API - ResearchGate.
  18. NVIDIA AI Enterprise - Cloud-native Software Platform - NVIDIA.
  19. NeMo Curator - NVIDIA Developer.
  20. NeMo - Build, monitor, and optimize AI agents - NVIDIA.
  21. Chat With Your Enterprise Data Through Open-Source AI-Q NVIDIA Blueprint - NVIDIA Blog.
  22. Benchmark Work - Benchmarks MLCommons.
  23. DataPerf - dataperf.org.
  24. AI Principles Overview - OECD.AI.
  25. OECD AI Principles - OECD.AI.
  26. OECD AI Principles: Guardrails to Responsible AI Adoption - code4thought.
  27. Working Group on Data Governance - OECD.AI.
  28. Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation - arXiv.
  29. What are the key metrics used to evaluate Vision-Language Models? - Milvus.
  30. DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark - MDPI.
  31. A Survey of State of the Art Large Vision Language Models - arXiv.

Full report: AI Data QA Framework.pdf (published September 25, 2025).