
Abstract: The Need for AI Data Quality Management

The performance of AI models depends directly on the quality of their training data, so data quality management is no longer optional. This analysis examines in depth how the international AI data quality standards, represented by the ISO/IEC 5259 series, correspond technically to Pebblous DataClinic.

ISO/IEC 5259-2 defines over 65 quantitative Quality Measures (QMs) and, beyond the traditional quality characteristics, adds 9 data quality characteristics that are essential for analytics and ML. Pebblous DataClinic implements these through DNN-based DataLens and Data Imaging technologies.

Through this analysis, we demonstrate that DataClinic's diagnostic framework of Level I (Basic EDA), Level II (General Lens), and Level III (Data-Specific Lens) comprehensively measures and addresses key QM groups in ISO/IEC 5259-2, including completeness, similarity, representativeness, and diversity. Pebblous's DataLens technique can be interpreted as providing concrete measurement functions for effectively applying the abstract quality characteristics of ISO standards to high-dimensional training data.

1. Background: The Importance of AI Data Quality

1.1. Regulatory Landscape

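The reliability and fairness of AI systems depend heavily on the quality of their training data. The EU AI Act (2024) sets data governance and quality requirements for the training data of high-risk AI systems, and U.S. Executive Order 14110 (2023, rescinded in January 2025) called for standards and testing to make AI systems safe and trustworthy.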

1.2. Technical Challenges

| Problem | Definition | Impact | Response |
| --- | --- | --- | --- |
| Biased Data | Training data skewed toward specific demographics or situations | Generates discriminatory AI outcomes | Representativeness and balance verification |
| Incomplete Data | Data with missing or insufficient required classes or attributes | Learning failure for specific classes | Completeness measurement and remediation |
| Excessive Similar Data | Excessive inclusion of duplicate or overly similar data samples | Causes overfitting | Similarity measurement, Data Diet |

Key Message: AI data quality management is essential technical infrastructure not only for regulatory compliance but also for ensuring model performance, fairness, and reliability. Pebblous DataClinic is an international standards-based solution that addresses these requirements.

2. ISO/IEC 5259 Series Overview

The ISO/IEC 5259 series is the international standard for data quality management for AI/ML systems, with 5 parts currently published and 1 part forthcoming (6 parts total).

| Part | Title | Content |
| --- | --- | --- |
| Part 1 | Overview, Terminology, and Examples | Defines core concepts including Data Quality Characteristics (DQC), Quality Measures (QM), and assessment methodologies, with practical examples |
| Part 2 | Data Quality Measures | Presents 24 DQCs and over 65 quantitative QMs. The core analysis subject of this report |
| Part 3 | Data Quality Management Requirements and Guidelines | Provides requirements and guidelines for establishing, implementing, maintaining, and improving organizational data quality management systems |
| Part 4 | Data Quality Process Framework | Provides an operational process framework to meet the management requirements of Part 3 |
| Part 5 | Data Quality Governance Framework | A governance-level framework for ensuring high-quality data throughout the data lifecycle (published Feb. 2025) |
| Part 6 | Visualization Framework (Technical Report) | Provides methodology for visualizing data quality measurement results (forthcoming) |

2.1. AI/ML Data Quality Characteristics: Inherent and Additional

This report focuses on inherent quality characteristics and additional quality characteristics (AI/ML-specific) directly related to AI/ML training data quality.

💡 Note: ISO/IEC 5259-2 classifies data quality characteristics into 4 groups:

  • Inherent Quality Characteristics: intrinsic properties of the data itself
  • Inherent and System-dependent: data-system interactions
  • System-dependent: depends on IT infrastructure performance
  • Additional Quality Characteristics (for AI/ML): AI/ML-specific quality characteristics (9 total)

For the difference between inherent and AI/ML additional characteristics, see Section 4.3 of the ISO 5259 Cheatsheet.

| Category | Quality Characteristic | Description | AI/ML Relevance |
| --- | --- | --- | --- |
| Inherent (intrinsic properties of the data) | Accuracy | How closely data values and labels match actual values | Inaccurate labels distort model training |
| Inherent | Completeness | Whether all required data and labels exist | Missing values are a major cause of model performance degradation |
| Inherent | Consistency | No contradictions between data, and identical labels for similar data | Label inconsistencies cause model confusion |
| Inherent | Credibility | Trustworthiness of data sources and values | Unreliable data reduces trust in AI results |
| Inherent | Currentness | Whether data is within an acceptable time range | Outdated data causes misalignment with current conditions |
| Additional (AI/ML-specific, 9 total) | Auditability | The degree to which data has been or can be audited | Required for regulatory compliance and data provenance tracking |
| Additional | Balance | The degree to which sample distribution across categories is uniform | Imbalanced data produces biased models |
| Additional | Diversity | The extent to which the dataset covers a wide range of features and values | Lack of diversity creates models that only work in specific situations |
| Additional | Effectiveness | The degree to which the dataset meets requirements for specific ML tasks | Ineffective data degrades training performance |
| Additional | Identifiability | The degree to which individuals can be identified through PII | Requires privacy protection and risk management |
| Additional | Relevance | The degree to which the dataset is suitable for its given context/purpose | Irrelevant data reduces training efficiency |
| Additional | Representativeness | The degree to which the dataset reflects the target population | Lack of representativeness causes degraded real-world performance |
| Additional | Similarity | The degree of similarity among samples within the dataset | Excessive similar data causes overfitting |
| Additional | Timeliness | The delay between event occurrence and data recording | Time delays reduce data reliability and applicability |

3. ISO/IEC 5259-2 Key Quality Measures (QM) Analysis

ISO/IEC 5259-2 presents over 65 Quality Measures (QMs) to quantify its 24 data quality characteristics (DQCs), including those introduced above. This section highlights the QMs that matter most for the mapping to Pebblous DataClinic.

💡 Note: The complete QM list of ISO/IEC 5259-2 can be found in the ISO/IEC 5259-2 Cheatsheet.

3.1. Completeness QMs

| QM ID | QM Item | Description | AI Model Risk |
| --- | --- | --- | --- |
| Com-ML-1 | Value completeness | Ratio of data items without null values | Training failure due to missing values |
| Com-ML-3 | Feature completeness | Ratio of data items related to specific features without null values | Failure to learn specific characteristics |
| Com-ML-5 | Label completeness | Ratio of samples with missing or incomplete labels | Degraded classification performance for specific classes |
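As a rough illustration of how the three completeness measures above could be computed on tabular samples, the following sketch counts non-missing cells. The function names, the `is_missing` rule, and the sample records are assumptions made for this example; the normative formulas are those in ISO/IEC 5259-2, and this is not DataClinic's implementation. Scores are expressed as the share of values present, so the missing ratio is one minus the score.

```python
from typing import Any, Mapping, Sequence


def is_missing(value: Any) -> bool:
    """Assumed rule: None, blank strings, and float NaN count as missing."""
    if value is None:
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return isinstance(value, float) and value != value  # NaN is not equal to itself


def value_completeness(rows: Sequence[Mapping[str, Any]], features: Sequence[str]) -> float:
    """Share of all (row, feature) cells that hold a non-missing value."""
    total = len(rows) * len(features)
    if total == 0:
        return 1.0
    filled = sum(not is_missing(row.get(f)) for row in rows for f in features)
    return filled / total


def feature_completeness(rows: Sequence[Mapping[str, Any]], feature: str) -> float:
    """Share of rows with a non-missing value for one feature."""
    if not rows:
        return 1.0
    return sum(not is_missing(row.get(feature)) for row in rows) / len(rows)


def label_completeness(rows: Sequence[Mapping[str, Any]], label_key: str = "label") -> float:
    """Share of rows that carry a non-missing label."""
    return feature_completeness(rows, label_key)


# Hypothetical image metadata records used only for illustration.
samples = [
    {"width": 640, "height": 480, "label": "cat"},
    {"width": None, "height": 480, "label": "dog"},
    {"width": 320, "height": 240, "label": ""},
    {"width": 800, "height": 600, "label": "cat"},
]

overall = value_completeness(samples, ["width", "height"])
width_score = feature_completeness(samples, "width")
label_score = label_completeness(samples)
print(overall, width_score, label_score)
```

A per-class breakdown of `label_completeness` would be the natural next step when the concern is degraded performance for specific classes.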

3.2. Similarity QMs

| QM ID | QM Item | Description | AI Model Risk |
| --- | --- | --- | --- |
| Sim-ML-1 | Ratio of similar instances in dataset | Measures the proportion of similar samples within the dataset | Causes overfitting |
| Sim-ML-2 | Average intra-class similarity | Average similarity between samples within the same class | Degraded generalization performance |
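The two similarity measures above can be approximated on embedding vectors as in the sketch below. Cosine similarity, the 0.95 threshold, and the function names are assumptions chosen for illustration; ISO/IEC 5259-2 and DataClinic may define similarity differently. The pairwise loop is quadratic in the number of samples, so it only suits small datasets.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_instance_ratio(vectors: Sequence[Sequence[float]], threshold: float = 0.95) -> float:
    """Share of samples that have at least one other sample at or above the threshold."""
    n = len(vectors)
    if n == 0:
        return 0.0
    flagged = set()
    for i, j in combinations(range(n), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            flagged.update((i, j))
    return len(flagged) / n


def average_intra_class_similarity(
    vectors: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, float]:
    """Mean pairwise cosine similarity within each class that has at least two samples."""
    by_class: Dict[str, List[Sequence[float]]] = {}
    for vector, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vector)
    result: Dict[str, float] = {}
    for label, members in by_class.items():
        pairs = list(combinations(members, 2))
        if pairs:
            result[label] = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return result


# Toy 2-D embeddings; the first two samples are exact duplicates.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = ["a", "a", "b", "b"]

ratio = similar_instance_ratio(embeddings, threshold=0.95)
intra = average_intra_class_similarity(embeddings, labels)
print(ratio, intra)
```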

3.3. Representativeness QMs

| QM ID | QM Item | Description | AI Model Risk |
| --- | --- | --- | --- |
| Rep-ML-1 | Target domain coverage | How comprehensively the data covers various real-world application scenarios | Degraded real-world performance |
| Rep-ML-3 | Distribution distance (KL-divergence) | Difference between training data distribution and actual data distribution | Reduced prediction reliability after deployment |
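For a categorical attribute, the distribution distance above can be sketched with the standard KL divergence, D_KL(P || Q) = Σ P(c) · ln(P(c) / Q(c)). The direction used here (target distribution as P, training distribution as Q), the additive smoothing, and the "day"/"night" attribute are assumptions for illustration; the standard's exact definition should be taken from ISO/IEC 5259-2 itself.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def category_distribution(
    values: Sequence[str], categories: Sequence[str], smoothing: float = 1e-9
) -> Dict[str, float]:
    """Relative frequency per category, with additive smoothing to avoid zero probabilities."""
    counts = Counter(values)
    total = len(values) + smoothing * len(categories)
    return {c: (counts.get(c, 0) + smoothing) / total for c in categories}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """D_KL(P || Q) in nats. Assumes q[c] > 0 wherever p[c] > 0 (ensured by smoothing)."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)


# Hypothetical capture-condition attribute: training data is skewed toward daytime.
categories = ["day", "night"]
train_values = ["day"] * 8 + ["night"] * 2
target_values = ["day"] * 5 + ["night"] * 5

p_target = category_distribution(target_values, categories)
q_train = category_distribution(train_values, categories)

gap = kl_divergence(p_target, q_train)
print(round(gap, 4))
```

A value of zero means the two distributions match on this attribute; larger values indicate a wider gap. KL divergence is asymmetric, so the chosen direction should be stated whenever results are reported.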

3.4. Balance QMs

| QM ID | QM Item | Description | AI Model Risk |
| --- | --- | --- | --- |
| Bal-ML-1 | Class balance | Degree of balance in sample counts per class | Minority class ignored, biased predictions |
| Bal-ML-2 | Feature balance | Balance of feature distributions within the dataset | Excessive dependence on specific features |
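Two common ways to put a number on class balance are the majority-to-minority count ratio and the Shannon entropy of the class distribution normalized to the range 0 to 1. Both are shown below as a sketch; they are generic statistics picked for illustration and are not claimed to be the formulas behind Bal-ML-1 or DataClinic's measurement.

```python
import math
from collections import Counter
from typing import Sequence


def imbalance_ratio(labels: Sequence[str]) -> float:
    """Largest class count divided by smallest class count (1.0 means perfectly balanced)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def normalized_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy of the class distribution divided by its maximum, ln(number of classes)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    if len(counts) == 1:
        return 1.0  # assumed convention: a single observed class is trivially balanced
    total = len(labels)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical inspection labels with a 15:1 skew.
skewed = ["ok"] * 15 + ["defect"] * 1
balanced = ["ok"] * 8 + ["defect"] * 8

skewed_ratio = imbalance_ratio(skewed)
skewed_score = normalized_entropy(skewed)
balanced_score = normalized_entropy(balanced)
print(skewed_ratio, round(skewed_score, 3), balanced_score)
```

Note that both functions only see classes that actually occur in `labels`; a class with zero samples has to be detected separately against the expected class list.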

4. Core Analysis: Quantitative Mapping of DataClinic to ISO/IEC 5259-2

Pebblous DataClinic's 3-level diagnostic framework maps directly to the key QMs of ISO/IEC 5259-2. The tables below illustrate these 1:1 correspondences.

Legend: Supported (currently supported by DataClinic) | AADS Expansion (planned) | Roadmap (future)

4.1. Inherent Quality Characteristics Mapping

| ISO 5259-2 Characteristic | QM ID | DataClinic Measurement | Status |
| --- | --- | --- | --- |
| Completeness | Com-ML-5 | Level I: Missing value measurement, label completeness analysis | Supported |
| Consistency | Con-ML-2 | Level II/III: Label consistency analysis (comparing labels of similar samples) | Supported |
| Accuracy | Acc-ML-7 | Level II/III: Label accuracy verification (anomaly detection) | Supported |

4.2. AI/ML Additional Quality Characteristics Mapping (9 Characteristics)

| ISO 5259-2 Characteristic | Representative QM ID | DataClinic/AADS Measurement | Status |
| --- | --- | --- | --- |
| Balance | Bal-ML-3, Bal-ML-8 | Level I: Class distribution analysis, label imbalance measurement | Supported |
| Diversity | Div-ML-1, Div-ML-2 | Level II/III: Intrinsic dimension analysis, feature diversity measurement | Supported |
| Representativeness | Rep-ML-1 | Level II/III: Manifold gap analysis, population coverage measurement | Supported |
| Similarity | Sim-ML-1, Sim-ML-2 | Level II/III: Sample density measurement, duplicate data detection | Supported |
| Relevance | Rel-ML-1, Rel-ML-2 | Level II/III: Contextual relevance analysis (outlier detection) | AADS Expansion |
| Effectiveness | Eft-ML-1, Eft-ML-3 | Level I/II: Valid sample ratio, quality threshold verification | AADS Expansion |
| Auditability | Aud-ML-1, Aud-ML-2 | AADS: Data lineage tracking, quality audit logs | AADS Expansion |
| Identifiability | Idn-ML-1 | AADS: PII detection and anonymization level assessment | Roadmap |
| Timeliness | Tml-ML-1 | AADS: Data freshness measurement, latency analysis | Roadmap |

Key Insights:

  • Current DataClinic can directly measure and improve 3 inherent quality characteristics and 4 out of 9 AI/ML additional characteristics of ISO/IEC 5259-2
  • Through the ongoing 2025 AADS expansion, 3 additional characteristics including auditability and effectiveness are being added
  • The post-2025 roadmap includes development of identifiability (PII protection) and timeliness (data freshness) capabilities
  • Diagnostic-driven Data Diet (duplicate removal) and Data Bulk-up (deficient area augmentation) align precisely with quality improvement activities required by the standard

5. Pebblous DataClinic: Technical Implementation and DNN-Based Approach

5.1. 3-Level Diagnostic Framework

| Level | Name | Measurement Capabilities | Corresponding ISO QM |
| --- | --- | --- | --- |
| Level I | Basic EDA | Missing value analysis; class distribution; basic statistics; outlier detection | Com-ML (Completeness), Bal-ML (Balance) |
| Level II | General Lens | General-purpose embeddings; density measurement; distance distribution analysis; manifold shape | Sim-ML (Similarity), Rep-ML (Representativeness), Div-ML (Diversity) |
| Level III | Data-Specific Lens | Custom embeddings; intrinsic dimension analysis; precision quality measurement; domain-specific diagnostics | Sim-ML, Rep-ML, Div-ML (precision measurement) |

5.2. DataLens: DNN-Based Data Analysis

DataLens leverages the embedding layers of deep learning models to project data into high-dimensional vector spaces, enabling quantitative measurement of ISO/IEC 5259-2 QMs.

Core Functions

  • Data Imaging: Raw data → Feature vectors → Embedding space
  • Density Measurement: k-NN distance-based density quantification
  • Manifold Analysis: Understanding geometric structure of data distributions

Measurement Functions

  • Density(x): Density around sample x
  • Distance(x, C): Minimum distance to class C
  • ManifoldShape(D): Manifold shape of dataset D
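To make `Density(x)` and `Distance(x, C)` concrete, the sketch below gives one plausible reading of each on already-embedded vectors: density as the inverse of the mean distance to the k nearest neighbors, and class distance as the minimum Euclidean distance to any member of the class. These definitions, the function names, and the toy points are assumptions, not Pebblous's actual formulas. `ManifoldShape(D)` is left out because the text does not define it.

```python
import math
from typing import Sequence


def knn_density(index: int, points: Sequence[Sequence[float]], k: int = 2, eps: float = 1e-12) -> float:
    """Assumed density estimate: inverse of the mean distance from points[index] to its k nearest neighbors."""
    x = points[index]
    distances = sorted(math.dist(x, p) for i, p in enumerate(points) if i != index)
    nearest = distances[:k]
    if not nearest:
        raise ValueError("need at least two points to estimate density")
    return 1.0 / (sum(nearest) / len(nearest) + eps)


def distance_to_class(x: Sequence[float], class_points: Sequence[Sequence[float]]) -> float:
    """Minimum Euclidean distance from x to any sample of the class."""
    if not class_points:
        raise ValueError("class_points must not be empty")
    return min(math.dist(x, p) for p in class_points)


# Toy embedding space: three tightly packed samples and one isolated sample.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]

dense = knn_density(0, points, k=2)
sparse = knn_density(3, points, k=2)
nearest_class_distance = distance_to_class([3.0, 4.0], [[0.0, 0.0], [3.0, 0.0]])
print(dense, sparse, nearest_class_distance)
```

High density values point at candidates for thinning, while low values mark sparse regions of the embedding space. For large datasets an approximate nearest-neighbor index would replace the brute-force distance scan used here.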

5.3. Data Prescription System

| Prescription | Purpose | Method | Effect |
| --- | --- | --- | --- |
| Data Diet | Resolve excessive Similarity | Remove duplicate samples; sampling from dense regions | Reduced overfitting risk |
| Data Bulk-up | Resolve insufficient Representativeness | Manifold gap augmentation; adding data to sparse regions | Improved generalization performance |
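The two prescriptions can be illustrated with a simple distance-based sketch: a greedy thinning pass that drops samples lying too close to one already kept (a stand-in for Data Diet), and a scan that flags samples whose nearest neighbors are far away as candidates for augmentation (a stand-in for locating Data Bulk-up targets). The thresholds, function names, and greedy strategy are assumptions for illustration and do not describe DataClinic's actual procedure.

```python
import math
from typing import List, Sequence


def data_diet_keep_indices(points: Sequence[Sequence[float]], min_distance: float) -> List[int]:
    """Greedy thinning: keep a sample only if it is at least min_distance from every kept sample."""
    kept: List[int] = []
    for i, p in enumerate(points):
        if all(math.dist(p, points[j]) >= min_distance for j in kept):
            kept.append(i)
    return kept


def sparse_indices(points: Sequence[Sequence[float]], k: int, max_mean_distance: float) -> List[int]:
    """Indices whose mean distance to their k nearest neighbors exceeds max_mean_distance."""
    flagged: List[int] = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = distances[:k]
        if nearest and sum(nearest) / len(nearest) > max_mean_distance:
            flagged.append(i)
    return flagged


# Toy embeddings: a near-duplicate cluster, one ordinary sample, one isolated sample.
points = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [1.0, 1.0], [5.0, 5.0]]

kept = data_diet_keep_indices(points, min_distance=0.05)
needs_bulk_up = sparse_indices(points, k=1, max_mean_distance=2.0)
print(kept, needs_bulk_up)
```

The greedy pass depends on sample order, so shuffling or sorting by a quality score beforehand changes which representative of each cluster survives.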

6. Case Studies: Applying ISO/IEC 5259-2 with DataClinic

6.1. Image Dataset Diagnostics

| Phase | ISO QM | Findings | Prescription & Results |
| --- | --- | --- | --- |
| Issue Found | Sim-ML-1 | Level III density measurement revealed 40% of samples concentrated in specific regions | Data Diet: Removed 25% from dense regions → 30% reduction in training time |
| Issue Found | Rep-ML-1 | Manifold gap analysis discovered 5 sparse regions | Data Bulk-up: Augmented sparse regions by 15% → 7% improvement in test accuracy |

6.2. Text Dataset Quality Verification

| Phase | ISO QM | Findings | Prescription & Results |
| --- | --- | --- | --- |
| Issue Found | Com-ML-5 | Level I missing value analysis found 20% missing in a specific class | Auto-labeling: Supplemented missing class → Achieved 95% completeness |
| Issue Found | Bal-ML-1 | Discovered class imbalance ratio of 1:15 | Class Resampling: SMOTE-based synthesis → 18% improvement in F1-score |
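The SMOTE-based synthesis mentioned above rests on one core step: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below shows only that step on numeric feature vectors. It is a simplified illustration, not a library implementation, and it assumes that text samples have already been converted to embeddings; the function name and toy data are hypothetical.

```python
import math
import random
from typing import List, Sequence


def smote_like(
    minority: Sequence[Sequence[float]], n_new: int, k: int = 2, seed: int = 0
) -> List[List[float]]:
    """Generate n_new synthetic samples by interpolating between minority samples and their neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of the base sample.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority-class embeddings.
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote_like(minority_samples, n_new=5)
print(new_samples)
```

Because every synthetic point is a convex combination of two real samples, it stays inside the region already spanned by the minority class; this reduces the count imbalance but adds no genuinely new coverage.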

7. Policy Recommendations and Conclusion

7.1. Policy Recommendations

1. Accelerate Domestic Adoption of ISO/IEC 5259

Adopt the ISO/IEC 5259 series promptly as Korean Standards (KS), treat it as a core element of the national AI strategy, and make compliance mandatory for public AI projects

2. Foster a Data Quality Verification Tool Ecosystem

Support development of ISO/IEC 5259-compliant tools like DataClinic and introduce public dataset quality certification programs

3. Integrate Data Quality into AI Governance Frameworks

In alignment with the EU AI Act and U.S. EO 14110, mandate data quality audits for high-risk AI systems

4. Cultivate Data-Centric AI Talent

Develop ISO/IEC 5259-based data quality training curricula and establish data quality professional certification frameworks

7.2. Conclusion

This report demonstrates, through the technical mapping of ISO/IEC 5259-2 Quality Measures (QMs) to Pebblous DataClinic, that international standards-based AI data quality management is practically achievable.

DataClinic's DNN-based DataLens and Data Imaging technologies quantitatively measure key DQCs including completeness, similarity, and representativeness. The diagnostic-driven Data Diet and Data Bulk-up prescriptions align precisely with the quality improvement activities required by ISO standards.

In an era where AI is deeply integrated across society, data quality management transcends technical excellence to become a matter of social trust and ethical responsibility. Pebblous DataClinic is a standards-based data quality solution that addresses these contemporary demands, contributing to strengthening the international competitiveness of the Korean AI ecosystem.

References

[1] ISO/IEC JTC 1/SC 42. (2024). ISO/IEC 5259-1:2024 - Artificial intelligence — Data quality for analytics and machine learning (ML) — Part 1: Overview, terminology, and examples.

[2] ISO/IEC JTC 1/SC 42. (2024). ISO/IEC 5259-2:2024 - Part 2: Data quality measures.

[3] ISO/IEC JTC 1/SC 42. (2024). ISO/IEC 5259-3:2024 - Part 3: Data quality management requirements and guidelines.

[4] European Parliament. (2024). Regulation (EU) 2024/1689 on Artificial Intelligence (AI Act).

[5] The White House. (2023). Executive Order 14110 on Safe, Secure, and Trustworthy Artificial Intelligence.

[6] Ministry of Science and ICT, Korea. (2024). AI Ethics Standards and Reliability Assurance Guidelines.

[7] National Information Society Agency (NIA), Korea. (2023). AI Data Quality Management Guideline v2.0.

[8] Sambasivan, N., et al. (2021). "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. CHI 2021.

[9] Gebru, T., et al. (2021). Datasheets for Datasets. Communications of the ACM, 64(12).

[10] Mitchell, M., et al. (2019). Model Cards for Model Reporting. FAT* 2019.