Abstract: The Need for AI Data Quality Management
The performance of AI models directly depends on the quality of training data, and data quality management is no longer optional but a mandatory requirement. This analysis provides an in-depth examination of the technical correlation between international AI data quality standards represented by the ISO/IEC 5259 series and Pebblous DataClinic.
ISO/IEC 5259-2 defines over 65 quantitative Quality Measures (QMs) and, beyond the traditional quality characteristics, adds 9 data quality characteristics that are essential for analytics and ML. Pebblous DataClinic implements these through DNN-based DataLens and Data Imaging technologies.
Through this analysis, we demonstrate that DataClinic's diagnostic framework of Level I (Basic EDA), Level II (General Lens), and Level III (Data-Specific Lens) comprehensively measures and addresses key QM groups in ISO/IEC 5259-2, including completeness, similarity, representativeness, and diversity. Pebblous's DataLens technique can be interpreted as providing concrete measurement functions for effectively applying the abstract quality characteristics of ISO standards to high-dimensional training data.
1. Background: The Importance of AI Data Quality
1.1. Regulatory Landscape
The reliability and fairness of AI systems are determined by the quality of training data. The EU AI Act (2024) and the U.S. AI Executive Order (EO 14110, 2023) mandate data quality verification for high-risk AI systems.
- ▸ EU AI Act: Mandatory data quality for high-risk AI systems
- ▸ U.S. EO 14110: AI safety standards and data governance
- ▸ Korea Intelligent Informatization Act: AI ethics standards and data management
1.2. Technical Challenges
| Problem | Definition | Impact | Response |
|---|---|---|---|
| Biased Data | Training data skewed toward specific demographics or situations | Generates discriminatory AI outcomes | Representativeness, balance verification |
| Incomplete Data | Data with missing or insufficient required classes or attributes | Learning failure for specific classes | Completeness measurement and remediation |
| Excessive Similar Data | Excessive inclusion of duplicate or overly similar data samples | Causes overfitting | Similarity measurement, Data Diet |
Key Message: AI data quality management is essential technical infrastructure not only for regulatory compliance but also for ensuring model performance, fairness, and reliability. Pebblous DataClinic is an international standards-based solution that addresses these requirements.
2. ISO/IEC 5259 Series Overview
The ISO/IEC 5259 series is the international standard for data quality management for AI/ML systems, with 5 parts currently published and 1 part forthcoming (6 parts total).
| Part | Title | Content |
|---|---|---|
| Part 1 | Overview, Terminology, and Examples | Defines core concepts including Data Quality Characteristics (DQC), Quality Measures (QM), and assessment methodologies with practical examples |
| Part 2 | Data Quality Measures | Presents 24 DQCs and over 65 quantitative QMs. The core analysis subject of this report |
| Part 3 | Data Quality Management Requirements and Guidelines | Provides requirements and guidelines for establishing, implementing, maintaining, and improving organizational data quality management systems |
| Part 4 | Data Quality Process Framework | Provides an operational process framework to meet the management requirements of Part 3 |
| Part 5 | Data Quality Governance Framework | A governance-level framework for ensuring high-quality data throughout the data lifecycle (published Feb. 2025) |
| Part 6 | Visualization Framework (Technical Report) | Provides methodology for visualizing data quality measurement results (forthcoming) |
2.1. AI/ML Data Quality Characteristics: Inherent and Additional
This report focuses on inherent quality characteristics and additional quality characteristics (AI/ML-specific) directly related to AI/ML training data quality.
💡 Note: ISO/IEC 5259-2 classifies data quality characteristics into 4 groups: (1) Inherent Quality Characteristics, the intrinsic properties of the data itself; (2) Inherent and System-dependent characteristics, which arise from data-system interactions; (3) System-dependent characteristics, which depend on IT infrastructure performance; and (4) Additional Quality Characteristics for AI/ML, the 9 AI/ML-specific characteristics. For the difference between inherent and AI/ML additional characteristics, see Section 4.3 of the ISO 5259 Cheatsheet.
| Category | Quality Characteristic | Description | AI/ML Relevance |
|---|---|---|---|
| Inherent Quality Characteristics (intrinsic properties of the data) | Accuracy | How closely data values and labels match actual values | Inaccurate labels distort model training |
| | Completeness | Whether all required data and labels exist | Missing values are a major cause of model performance degradation |
| | Consistency | No contradictions between data, and identical labels for similar data | Label inconsistencies cause model confusion |
| | Credibility | Trustworthiness of data sources and values | Unreliable data reduces trust in AI results |
| | Currentness | Whether data is within an acceptable time range | Outdated data causes misalignment with current conditions |
| Additional Quality Characteristics (AI/ML-specific, 9 total) | Auditability | The degree to which data has been or can be audited | Required for regulatory compliance and data provenance tracking |
| | Balance | The degree to which sample distribution across categories is uniform | Imbalanced data produces biased models |
| | Diversity | The extent to which the dataset covers a wide range of features and values | Lack of diversity creates models that only work in specific situations |
| | Effectiveness | The degree to which the dataset meets requirements for specific ML tasks | Ineffective data degrades training performance |
| | Identifiability | The degree to which individuals can be identified through PII | Requires privacy protection and risk management |
| | Relevance | The degree to which the dataset is suitable for its given context/purpose | Irrelevant data reduces training efficiency |
| | Representativeness | The degree to which the dataset reflects the target population | Lack of representativeness causes degraded real-world performance |
| | Similarity | The degree of similarity among samples within the dataset | Excessive similar data causes overfitting |
| | Timeliness | The delay between event occurrence and data recording | Time delays reduce data reliability and applicability |
3. ISO/IEC 5259-2 Key Quality Measures (QM) Analysis
ISO/IEC 5259-2 presents over 65 Quality Measures (QMs) to quantitatively measure the 24 quality characteristics (DQCs) introduced above. This section highlights the QMs that are particularly important for mapping with Pebblous DataClinic.
💡 Note: The complete QM list of ISO/IEC 5259-2 can be found in the ISO/IEC 5259-2 Cheatsheet.
3.1. Completeness QMs
| QM ID | QM Item | Description | AI Model Risk |
|---|---|---|---|
| Com-ML-1 | Value completeness | Ratio of data items without null values | Training failure due to missing values |
| Com-ML-3 | Feature completeness | Ratio of data items related to specific features without null values | Failure to learn specific characteristics |
| Com-ML-5 | Label completeness | Ratio of samples with missing or incomplete labels | Degraded classification performance for specific classes |
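The three completeness ratios in the table can be illustrated with a minimal sketch. It assumes records are plain dictionaries and that a missing value is represented by `None` (or an empty string for labels); the function names are hypothetical and are not taken from ISO/IEC 5259-2 or DataClinic. Label completeness is computed here as the share of samples that do carry a label, i.e. one minus the missing-label ratio, which is an assumed convention.

```python
from typing import Any, Optional, Sequence

Record = dict[str, Any]

def value_completeness(records: Sequence[Record], fields: Sequence[str]) -> float:
    """Share of (record, field) cells that hold a non-null value."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # assumption: an empty dataset is treated as complete
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total

def feature_completeness(records: Sequence[Record], feature: str) -> float:
    """Value completeness restricted to a single feature."""
    return value_completeness(records, [feature])

def label_completeness(labels: Sequence[Optional[str]]) -> float:
    """Share of samples whose label is present and non-empty."""
    if not labels:
        return 1.0
    present = sum(1 for y in labels if y is not None and y != "")
    return present / len(labels)

# Illustrative data only.
rows = [
    {"age": 31, "city": "Seoul"},
    {"age": None, "city": "Busan"},
    {"age": 45, "city": None},
    {"age": 52, "city": "Daejeon"},
]
labels = ["cat", None, "dog", "cat", ""]

print(value_completeness(rows, ["age", "city"]))  # 6 of 8 cells filled: 0.75
print(feature_completeness(rows, "age"))          # 3 of 4 values filled: 0.75
print(label_completeness(labels))                 # 3 of 5 labels present
```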
3.2. Similarity QMs
| QM ID | QM Item | Description | AI Model Risk |
|---|---|---|---|
| Sim-ML-1 | Ratio of similar instances in dataset | Measures the proportion of similar samples within the dataset | Causes overfitting |
| Sim-ML-2 | Average intra-class similarity | Average similarity between samples within the same class | Degraded generalization performance |
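As a rough illustration of the two similarity measures, the sketch below uses cosine similarity on small embedding vectors. The similarity function, the 0.95 threshold, and the function names are assumptions made for this example; the standard and DataClinic may define "similar" differently.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similar_instance_ratio(vectors: Sequence[Vector], threshold: float = 0.95) -> float:
    """Share of samples that have at least one other sample above the threshold."""
    flagged = set()
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            flagged.update((i, j))
    return len(flagged) / len(vectors) if vectors else 0.0

def mean_intra_class_similarity(vectors: Sequence[Vector], labels: Sequence[str]) -> float:
    """Average cosine similarity over all pairs that share a class label."""
    sims = [
        cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
        if labels[i] == labels[j]
    ]
    return sum(sims) / len(sims) if sims else 0.0

# Illustrative data only: the first two vectors are near-duplicates.
vectors = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (1.0, 1.0)]
labels = ["a", "a", "b", "b"]

print(similar_instance_ratio(vectors))               # 2 of 4 samples flagged: 0.5
print(mean_intra_class_similarity(vectors, labels))
```

The pairwise loop is quadratic in the number of samples, so a real pipeline would likely rely on approximate nearest-neighbour search instead.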
3.3. Representativeness QMs
| QM ID | QM Item | Description | AI Model Risk |
|---|---|---|---|
| Rep-ML-1 | Target domain coverage | How comprehensively the data covers various real-world application scenarios | Degraded real-world performance |
| Rep-ML-3 | Distribution distance (KL-divergence) | Difference between training data distribution and actual data distribution | Reduced prediction reliability after deployment |
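The distribution-distance idea can be sketched for a single categorical attribute using the KL divergence D_KL(P ‖ Q) = Σ P(c) · log(P(c) / Q(c)). In this example P is the target (deployment) distribution and Q is the training distribution; that direction, the additive smoothing constant `eps`, and the function names are assumptions of the sketch rather than requirements of the standard.

```python
import math
from collections import Counter
from typing import Sequence

def distribution(samples: Sequence[str], categories: Sequence[str],
                 eps: float = 1e-9) -> dict[str, float]:
    """Empirical category probabilities with a tiny additive smoothing term,
    so that categories absent from the samples do not produce log(0)."""
    counts = Counter(samples)
    total = len(samples) + eps * len(categories)
    return {c: (counts[c] + eps) / total for c in categories}

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """D_KL(P || Q) in nats, summed over the categories of P."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

# Illustrative data only: training images are mostly daytime,
# while the deployment environment is half day, half night.
categories = ["day", "night"]
train = ["day"] * 90 + ["night"] * 10
target = ["day"] * 50 + ["night"] * 50

p_target = distribution(target, categories)
q_train = distribution(train, categories)
gap = kl_divergence(p_target, q_train)
print(round(gap, 4))
```

A value of zero means the two distributions match on this attribute; larger values indicate that the training data under-represents conditions found in deployment.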
3.4. Balance QMs
| QM ID | QM Item | Description | AI Model Risk |
|---|---|---|---|
| Bal-ML-1 | Class balance | Degree of balance in sample counts per class | Minority class ignored, biased predictions |
| Bal-ML-2 | Feature balance | Balance of feature distributions within the dataset | Excessive dependence on specific features |
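Class balance can be quantified in several ways. The sketch below shows two common ones, the majority-to-minority imbalance ratio and the normalized Shannon entropy of the class distribution; both are illustrative choices and are not claimed to be the exact formulas of Bal-ML-1. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence

def imbalance_ratio(labels: Sequence[str]) -> float:
    """Largest class count divided by smallest class count (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def normalized_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy of the class distribution divided by log(k),
    giving 1.0 for perfectly balanced classes and values near 0 for skew."""
    counts = Counter(labels)
    k = len(counts)
    if k < 2:
        return 1.0  # assumption: a single class is treated as trivially balanced
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)

# Illustrative data only: a 1:15 minority-to-majority split.
skewed = ["neg"] * 150 + ["pos"] * 10
balanced = ["neg"] * 80 + ["pos"] * 80

print(imbalance_ratio(skewed))               # 15.0
print(round(normalized_entropy(skewed), 3))
print(normalized_entropy(balanced))
```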
4. Core Analysis: Quantitative Mapping of DataClinic to ISO/IEC 5259-2
Pebblous DataClinic's 3-level diagnostic framework maps directly to the key QMs of ISO/IEC 5259-2. The tables below show, for each quality characteristic, its representative QM IDs and the DataClinic measurement that corresponds to them.
Legend: Supported = currently supported by DataClinic · AADS Expansion = expansion planned · Roadmap = future roadmap
4.1. Inherent Quality Characteristics Mapping
| ISO 5259-2 Characteristic | QM ID | DataClinic Measurement | Status |
|---|---|---|---|
| Completeness | Com-ML-5 | Level I: Missing value measurement, label completeness analysis | Supported |
| Consistency | Con-ML-2 | Level II/III: Label consistency analysis (comparing labels of similar samples) | Supported |
| Accuracy | Acc-ML-7 | Level II/III: Label accuracy verification (anomaly detection) | Supported |
4.2. AI/ML Additional Quality Characteristics Mapping (9 Characteristics)
| ISO 5259-2 Characteristic | Representative QM ID | DataClinic/AADS Measurement | Status |
|---|---|---|---|
| Balance | Bal-ML-3, Bal-ML-8 | Level I: Class distribution analysis, label imbalance measurement | Supported |
| Diversity | Div-ML-1, Div-ML-2 | Level II/III: Intrinsic dimension analysis, feature diversity measurement | Supported |
| Representativeness | Rep-ML-1 | Level II/III: Manifold gap analysis, population coverage measurement | Supported |
| Similarity | Sim-ML-1, Sim-ML-2 | Level II/III: Sample density measurement, duplicate data detection | Supported |
| Relevance | Rel-ML-1, Rel-ML-2 | Level II/III: Contextual relevance analysis (outlier detection) | AADS Expansion |
| Effectiveness | Eft-ML-1, Eft-ML-3 | Level I/II: Valid sample ratio, quality threshold verification | AADS Expansion |
| Auditability | Aud-ML-1, Aud-ML-2 | AADS: Data lineage tracking, quality audit logs | AADS Expansion |
| Identifiability | Idn-ML-1 | AADS: PII detection and anonymization level assessment | Roadmap |
| Timeliness | Tml-ML-1 | AADS: Data freshness measurement, latency analysis | Roadmap |
Key Insights:
- Current DataClinic can directly measure and improve 3 inherent quality characteristics and 4 out of 9 AI/ML additional characteristics of ISO/IEC 5259-2
- Through the ongoing 2025 AADS expansion, 3 additional characteristics including auditability and effectiveness are being added
- The post-2025 roadmap includes development of identifiability (PII protection) and timeliness (data freshness) capabilities
- Diagnostic-driven Data Diet (duplicate removal) and Data Bulk-up (deficient area augmentation) align precisely with quality improvement activities required by the standard
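The coverage figures in the first three insights follow directly from the Status column of the table in Section 4.2. The short tally below reproduces that arithmetic; the dictionary simply transcribes the table, and the variable names are illustrative.

```python
from collections import Counter

# Status of each of the 9 AI/ML additional characteristics,
# transcribed from the mapping table in Section 4.2.
STATUS_BY_CHARACTERISTIC = {
    "Balance": "Supported",
    "Diversity": "Supported",
    "Representativeness": "Supported",
    "Similarity": "Supported",
    "Relevance": "AADS Expansion",
    "Effectiveness": "AADS Expansion",
    "Auditability": "AADS Expansion",
    "Identifiability": "Roadmap",
    "Timeliness": "Roadmap",
}

tally = Counter(STATUS_BY_CHARACTERISTIC.values())
total = len(STATUS_BY_CHARACTERISTIC)

for status in ("Supported", "AADS Expansion", "Roadmap"):
    print(f"{status}: {tally[status]} of {total}")
```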
5. Pebblous DataClinic: Technical Implementation and DNN-Based Approach
5.1. 3-Level Diagnostic Framework
| Level | Name | Measurement Capabilities | Corresponding ISO QM |
|---|---|---|---|
| Level I | Basic EDA | Missing value analysis; class distribution; basic statistics; outlier detection | Com-ML (Completeness), Bal-ML (Balance) |
| Level II | General Lens | General-purpose embeddings; density measurement; distance distribution analysis; manifold shape | Sim-ML (Similarity), Rep-ML (Representativeness), Div-ML (Diversity) |
| Level III | Data-Specific Lens | Custom embeddings; intrinsic dimension analysis; precision quality measurement; domain-specific diagnostics | Sim-ML, Rep-ML, Div-ML (precision measurement) |
5.2. DataLens: DNN-Based Data Analysis
DataLens leverages the embedding layers of deep learning models to project data into high-dimensional vector spaces, enabling quantitative measurement of ISO/IEC 5259-2 QMs.
Core Functions
- ▸ Data Imaging: Raw data → Feature vectors → Embedding space
- ▸ Density Measurement: k-NN distance-based density quantification
- ▸ Manifold Analysis: Understanding geometric structure of data distributions
Measurement Functions
- ▸ Density(x): Density around sample x
- ▸ Distance(x, C): Minimum distance to class C
- ▸ ManifoldShape(D): Manifold shape of dataset D
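Two of the measurement functions listed above can be sketched in a few lines once samples are represented as embedding vectors. The sketch assumes Euclidean distance and defines density as the inverse of the mean distance to the k nearest neighbours; both choices, and the function names, are assumptions for illustration and do not describe DataLens internals. ManifoldShape(D) is not covered here.

```python
import math
from typing import Sequence

Point = Sequence[float]

def knn_density(index: int, points: Sequence[Point], k: int = 3) -> float:
    """Density around points[index]: inverse of the mean Euclidean distance
    to its k nearest neighbours (infinite if the neighbours are duplicates)."""
    x = points[index]
    dists = sorted(math.dist(x, p) for j, p in enumerate(points) if j != index)[:k]
    if not dists:
        return 0.0  # no neighbours to measure against
    mean_dist = sum(dists) / len(dists)
    return math.inf if mean_dist == 0 else 1.0 / mean_dist

def distance_to_class(x: Point, points: Sequence[Point],
                      labels: Sequence[str], target: str) -> float:
    """Minimum Euclidean distance from x to any sample labelled `target`."""
    return min(math.dist(x, p) for p, y in zip(points, labels) if y == target)

# Illustrative data only: a tight cluster of class "a" and one distant "b" sample.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
labels = ["a", "a", "a", "a", "b"]

print(knn_density(0, points, k=2))  # two nearest neighbours at distance 1: 1.0
print(knn_density(4, points, k=2))  # isolated sample, much lower density
print(distance_to_class((0.0, 0.0), points, labels, "b"))
```

High density values point to candidates for Data Diet, while low values mark sparse regions of the embedding space.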
5.3. Data Prescription System
| Prescription | Purpose | Method | Effect |
|---|---|---|---|
| Data Diet | Resolve excessive Similarity | Remove duplicate samples; sampling from dense regions | Reduced overfitting risk |
| Data Bulk-up | Resolve insufficient Representativeness | Manifold gap augmentation; adding data to sparse regions | Improved generalization performance |
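The two prescriptions can be approximated with simple distance-based rules. In the sketch below, Data Diet is a greedy filter that drops any sample lying closer than `min_dist` to an already kept sample, and Data Bulk-up candidates are samples whose mean distance to their k nearest neighbours exceeds `max_mean_dist`. These rules, thresholds, and function names are assumptions for illustration, not the DataClinic algorithms.

```python
import math
from typing import Sequence

Point = Sequence[float]

def data_diet(points: Sequence[Point], min_dist: float) -> list[int]:
    """Greedy thinning: keep a sample only if it is at least `min_dist`
    away from every sample kept so far. Returns the kept indices."""
    kept: list[int] = []
    for i, p in enumerate(points):
        if all(math.dist(p, points[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept

def sparse_candidates(points: Sequence[Point], k: int,
                      max_mean_dist: float) -> list[int]:
    """Indices of samples in sparse regions, i.e. whose mean distance to
    their k nearest neighbours exceeds `max_mean_dist`."""
    sparse: list[int] = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        if dists and sum(dists) / len(dists) > max_mean_dist:
            sparse.append(i)
    return sparse

# Illustrative data only: three near-duplicates at the origin and one outlier.
points = [(0.0, 0.0), (0.0, 0.0), (0.05, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

print(data_diet(points, min_dist=0.1))                    # [0, 3, 4, 5]
print(sparse_candidates(points, k=2, max_mean_dist=2.0))  # [5]
```

Samples flagged as sparse would then be the starting points for augmentation or targeted data collection.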
6. Case Studies: Applying ISO/IEC 5259-2 with DataClinic
6.1. Image Dataset Diagnostics
| Phase | ISO QM | Findings | Prescription & Results |
|---|---|---|---|
| Issue Found | Sim-ML-1 | Level III density measurement revealed 40% of samples concentrated in specific regions | Data Diet: Removed 25% from dense regions → 30% reduction in training time |
| Issue Found | Rep-ML-1 | Manifold gap analysis discovered 5 sparse regions | Data Bulk-up: Augmented sparse regions by 15% → 7% improvement in test accuracy |
6.2. Text Dataset Quality Verification
| Phase | ISO QM | Findings | Prescription & Results |
|---|---|---|---|
| Issue Found | Com-ML-5 | Level I missing value analysis found 20% missing in a specific class | Auto-labeling: Supplemented missing class → Achieved 95% completeness |
| Issue Found | Bal-ML-1 | Discovered class imbalance ratio of 1:15 | Class Resampling: SMOTE-based synthesis → 18% improvement in F1-score |
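The SMOTE-based synthesis named in the table creates new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that core interpolation step on numeric feature vectors; applying it to text would require an embedding step that is not shown, and the function name, parameters, and data are hypothetical. It does not reproduce the case study's pipeline or results.

```python
import math
import random
from typing import Sequence

Point = tuple[float, ...]

def smote(minority: Sequence[Point], n_new: int, k: int = 2, seed: int = 0) -> list[Point]:
    """Generate `n_new` synthetic minority samples. Each one lies on the line
    segment between a randomly chosen minority sample and one of its k
    nearest minority neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic: list[Point] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        neighbours = sorted(
            (q for j, q in enumerate(minority) if j != i),
            key=lambda q: math.dist(x, q),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Illustrative data only.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote(minority, n_new=5)
print(len(new_samples))
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class.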
7. Policy Recommendations and Conclusion
7.1. Policy Recommendations
1. Accelerate Domestic Adoption of ISO/IEC 5259
Rapidly adopt the ISO/IEC 5259 series as KS (Korean Standards) as a core element of national AI strategy, and designate it as a mandatory compliance requirement for public AI projects
2. Foster a Data Quality Verification Tool Ecosystem
Support development of ISO/IEC 5259-compliant tools like DataClinic and introduce public dataset quality certification programs
3. Integrate Data Quality into AI Governance Frameworks
In alignment with the EU AI Act and U.S. EO 14110, mandate data quality audits for high-risk AI systems
4. Cultivate Data-Centric AI Talent
Develop ISO/IEC 5259-based data quality training curricula and establish data quality professional certification frameworks
7.2. Conclusion
This report demonstrates, through the technical mapping of ISO/IEC 5259-2 Quality Measures (QMs) to Pebblous DataClinic, that international standards-based AI data quality management is practically achievable.
DataClinic's DNN-based DataLens and Data Imaging technologies quantitatively measure key DQCs including completeness, similarity, and representativeness. The diagnostic-driven Data Diet and Data Bulk-up prescriptions align precisely with the quality improvement activities required by ISO standards.
In an era where AI is deeply integrated across society, data quality management transcends technical excellence to become a matter of social trust and ethical responsibility. Pebblous DataClinic is a standards-based data quality solution that addresses these contemporary demands, contributing to strengthening the international competitiveness of the Korean AI ecosystem.
References
[1] ISO/IEC JTC 1/SC 42. (2024). ISO/IEC 5259-1:2024 - Artificial intelligence — Data quality for analytics and machine learning (ML) — Part 1: Overview, terminology, and examples.
[2] ISO/IEC JTC 1/SC 42. (2024). ISO/IEC 5259-2:2024 - Part 2: Data quality measures.
[3] ISO/IEC JTC 1/SC 42. (2024). ISO/IEC 5259-3:2024 - Part 3: Data quality management requirements and guidelines.
[4] European Parliament. (2024). Regulation (EU) 2024/1689 on Artificial Intelligence (AI Act).
[5] The White House. (2023). Executive Order 14110 on Safe, Secure, and Trustworthy Artificial Intelligence.
[6] Ministry of Science and ICT, Korea. (2024). AI Ethics Standards and Reliability Assurance Guidelines.
[7] National Information Society Agency (NIA), Korea. (2023). AI Data Quality Management Guideline v2.0.
[8] Sambasivan, N., et al. (2021). "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. CHI 2021.
[9] Gebru, T., et al. (2021). Datasheets for Datasets. Communications of the ACM, 64(12).
[10] Mitchell, M., et al. (2019). Model Cards for Model Reporting. FAT* 2019.