
Summary: ISO/IEC 5259 is an international standard that presents a new paradigm for data quality assessment and management specialized for artificial intelligence (AI) and machine learning (ML) environments. This guide covers methodologies and practical cases for evaluating the quality of LLM (Large Language Model) text data using this standard.

1. Overview of ISO/IEC 5259 Standard and LLM Data Quality

The ISO/IEC 5259 series is the first international standard on "Data quality for analytics and machine learning (ML)." While existing data quality standards (e.g., ISO/IEC 25012) focused on the data producer's perspective, ISO/IEC 5259 redefines data quality from the perspective of AI/ML data consumers who search, collect, process, and use external data.

The quality of LLM text data directly impacts model performance and reliability, and is a decisive factor in the model's bias, generalization, and explainability.

The ISO/IEC 5259 standard provides the following comprehensive framework for assessing and managing the quality of LLM training datasets.

  • Part 1 (Overview, Terminology): Presents the data quality conceptual framework and the Data Life Cycle (DLC) model.
  • Part 2 (Measurement): Defines AI/ML-specific data quality characteristics and Quality Measures.
  • Part 3 (Management Requirements): Defines the Data Quality Management Life Cycle (DQMLC) and organizational requirements.
  • Part 4 (Process Framework): Presents a cyclical process (DQPF: Plan-Assess-Improve-Verify) for executing quality activities.
  • Part 5 (Governance Framework): Presents top-level decision-making systems and accountability measures for organizations.

1.1. ISO/IEC 5259-2 Data Quality Measurement Items (LLM-Neutral)

ISO/IEC 5259-2 defines 24 Data Quality Characteristics for measuring AI/ML data quality. These characteristics are categorized into inherent properties, system-dependent properties, and additional characteristics, and serve as LLM-neutral assessment criteria applicable to all types of AI/ML data (text, image, tabular, etc.).

Inherent
  • Accuracy: The degree to which data values correctly represent actual values
  • Completeness: The degree to which required data exists without omission
  • Consistency: The degree to which data is consistent and free of contradictions
  • Credibility: The degree to which data is considered trustworthy
  • Currentness: The degree to which data is from an appropriate time period for its purpose

Inherent & System-Dependent
  • Accessibility: The degree to which data can be accessed
  • Compliance: The degree to which data adheres to regulations, standards, and rules
  • Efficiency: The degree to which data can be processed with appropriate resources
  • Precision: The degree to which data is exact or discriminable
  • Traceability: The degree to which audit trails for data access and changes are available
  • Understandability: The degree to which users can read and interpret data
  • Confidentiality: The degree to which data is accessible only to authorized users (ISO/IEC 25012)

System-Dependent
  • Availability: The degree to which datasets can be retrieved
  • Portability: The degree to which data can be moved between systems while maintaining quality
  • Recoverability: The degree to which data can be maintained and recovered in case of failure

Additional (AI/ML-Specific)
  • Auditability: The degree to which data has been audited or is auditable
  • Balance: The degree to which sample distribution across categories is uniform
  • Diversity: The degree to which a dataset contains a diverse range of features and values
  • Effectiveness: The degree to which a dataset meets the requirements of a specific ML task
  • Identifiability: The degree to which individuals can be identified through PII
  • Relevance: The degree to which a dataset is appropriate for a given context
  • Representativeness: The degree to which a dataset reflects the target population
  • Similarity: The degree of similarity between samples within a dataset
  • Timeliness: The delay between phenomenon occurrence and data recording

* The 24 quality characteristics above are universal assessment criteria applicable to training data for all AI/ML models, not just LLMs. Each characteristic has specific Quality Measures defined for quantitative measurement.

2. LLM Text Data Quality Assessment Methodology (Based on ISO/IEC 5259-2)

The key to evaluating LLM text datasets lies in measuring the AI/ML-specific quality characteristics presented in ISO/IEC 5259-2. This standard inherits the 15 characteristics from ISO/IEC 25012 and includes additional characteristics directly linked to AI model performance.

2.1. Essential Quality Characteristics for LLM Text Data (DQC)

  • Accuracy: The degree to which data values (tokens, named entities) correctly represent real-world facts. In particular, data label accuracy for labeled text data is critical. Measures: Semantic data accuracy; Data label accuracy (Acc-ML-7: number of labels providing appropriate information / total labels defined in the dataset).
  • Completeness: The degree to which essential information (entities, context) exists without omission. In particular, label completeness (whether labels are missing) is critical. Measures: Value completeness (Com-ML-1); Label completeness (Com-ML-5: proportion of samples with missing or incomplete labels).
  • Consistency: Whether data is free of contradictions and identical labels are consistently assigned to similar data items (e.g., terminology uniformity in technical documents). Measures: Data label consistency (Con-ML-2: number of similar item pairs with identical labels / total similar item pair comparisons); Data record consistency (duplicate record ratio).
  • Balance: How uniformly samples are distributed across categories (classes) within a dataset. This is critical for diagnosing LLM fairness and bias issues. Measures: Label proportion balance (Bal-ML-7: difference in specific label value proportions between two categories); Label distribution balance (Bal-ML-8: divergence between the label distribution and a uniform label distribution).
  • Representativeness: How well a dataset reflects the target population (e.g., prompt distribution in operational environments). Measure: Representativeness ratio (Rep-ML-1: number of target attributes in the dataset / number of relevant attributes in a specific context).
  • Diversity: How broad a range of features and values the dataset samples contain. Lack of diversity increases overfitting risk. Measure: Label richness (Div-ML-1: number of unique labels in the dataset / total data items).
  • Relevance: How suitable the dataset's features are for solving a given AI task. Unnecessary features increase model complexity. Measure: Feature relevance (Rel-ML-1: number of relevant features / total features in the dataset).
  • Auditability: Whether data is prepared to be reviewable for auditing or regulatory compliance. Measure: Audited records (Aud-ML-1: number of audited records / total records).
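To make the measures above concrete, here is a minimal sketch of three of them in Python. The function names are my own; the formulas are paraphrased from the measure descriptions above (Com-ML-5-style label completeness, Bal-ML-8-style label distribution balance as KL divergence to the uniform distribution, and Div-ML-1-style label richness), so treat this as an illustration rather than the normative ISO/IEC 5259-2 computation.

```python
from collections import Counter
import math

def label_completeness(labels):
    """Com-ML-5-style measure: share of samples whose label is present.
    None or empty string counts as missing; 1.0 means no missing labels."""
    if not labels:
        return 0.0
    present = sum(1 for lab in labels if lab)
    return present / len(labels)

def label_distribution_balance(labels):
    """Bal-ML-8-style measure: KL divergence between the observed label
    distribution and the uniform distribution. 0.0 means perfectly balanced."""
    counts = Counter(lab for lab in labels if lab)
    n = sum(counts.values())
    uniform = 1.0 / len(counts)
    return sum((c / n) * math.log((c / n) / uniform) for c in counts.values())

def label_richness(labels):
    """Div-ML-1-style measure: unique labels / total data items."""
    if not labels:
        return 0.0
    return len({lab for lab in labels if lab}) / len(labels)

labels = ["pos", "neg", "pos", "pos", None, "neg"]
print(label_completeness(labels))              # 5 of 6 labels present
print(label_distribution_balance(labels))      # small, but > 0: pos outweighs neg
print(label_richness(labels))                  # 2 unique labels over 6 items
```

In practice these raw values would be normalized and compared against the targets set in the data requirements stage (Section 3.1).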

2.2. Quantitative Metric Application Example: Text Summarization (Completeness)

When evaluating the training data quality for text summarization tasks using LLMs, the ROUGE-L (Recall) score can be used as a specific metric for measuring Completeness.

  • Background: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) evaluates how similar a machine-generated summary is to a human-written reference summary.
  • Measurement: ROUGE-L Recall measures how completely the key information from the reference summary (based on the Longest Common Subsequence) is included in the generated summary without omission, which directly connects to the concept of completeness in evaluating information gaps.
  • Assessment Application: By comparing the reference summaries (Ground Truth) in the training dataset with model-generated summaries, one can indirectly verify whether the dataset itself provides complete information to the model.
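The ROUGE-L recall described above reduces to a longest-common-subsequence (LCS) computation over tokens. The following is a minimal, dependency-free sketch (whitespace tokenization is a simplification; production evaluations typically use a library such as rouge-score with proper tokenization and stemming):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    via the standard dynamic-programming table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by reference length, i.e. how much
    of the reference summary's content the candidate preserves in order."""
    ref, cand = reference.split(), candidate.split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)

print(rouge_l_recall("the cat sat on the mat", "the cat lay on the mat"))  # 5/6
```

A low average recall over a dataset's (reference, generated) pairs flags potential information gaps, which is how the metric maps onto the completeness characteristic.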

3. Assessment and Improvement Process for LLM Training Datasets

ISO/IEC 5259 defines data quality assessment not as a one-time activity, but as a cyclical process (DQPF) that is continuously repeated throughout the Data Life Cycle (DLC).

3.1. Quality Management Along the Data Life Cycle (DLC)

Quality management of LLM training datasets should follow the 6-stage DLC model below.

  1. Data requirements: Define data characteristics needed for LLM projects (e.g., science and technology QA LLM), required data volume, acceptable bias levels, and set relevant DQC targets.
  2. Data planning: Design resources, timelines, and architecture (data models) for data acquisition and preparation, and establish DQ measurement execution plans.
  3. Data acquisition: Collect text data according to plans, considering data provenance, bias, and reliability.
  4. Data preparation: The critical stage where substantial quality assessment and improvement activities are performed. Data cleaning, transformation, labeling/annotation, and data quality assessment occur at this stage.
  5. Data provisioning: Apply prepared data to LLM training and evaluation, and provide feedback to previous stages (preparation/acquisition) to improve data quality based on model performance evaluation results.
  6. Data decommissioning: Manage archiving, transfer, or disposal of data no longer in use, and verify PII (Personally Identifiable Information) handling and regulatory compliance.
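The DQC targets set in stage 1 can be recorded as machine-checkable thresholds that later stages evaluate against. The sketch below is hypothetical: the characteristic names follow ISO/IEC 5259-2, but the threshold values and the `meets_target` helper are my own illustration, not part of the standard.

```python
# Hypothetical DQC targets fixed in the "Data requirements" stage.
# Threshold values are illustrative only.
dqc_targets = {
    "label_completeness":         {"min": 0.98},  # Com-ML-5 style
    "label_distribution_balance": {"max": 0.05},  # Bal-ML-8 style (KL to uniform)
    "representativeness_ratio":   {"min": 0.90},  # Rep-ML-1 style
}

def meets_target(name, measured, targets=dqc_targets):
    """Check a measured value against the declared min/max bounds."""
    t = targets[name]
    return t.get("min", float("-inf")) <= measured <= t.get("max", float("inf"))

print(meets_target("label_completeness", 0.99))          # True
print(meets_target("label_distribution_balance", 0.12))  # False: too imbalanced
```

Keeping targets in one declarative structure also supports the provisioning-stage feedback loop: when model evaluation suggests stricter data requirements, only the thresholds change.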

3.2. Data Quality Improvement Activities (ISO/IEC 5259-4)

When quality assessment (Evaluation) results fall short of established targets, the data quality improvement process specified in ISO/IEC 5259-4 should be applied. For LLM text data, the following methodologies can be used:

  • Data Cleaning: Remove or correct incomplete, inaccurate, or irrelevant text data. This includes removing duplicate records from combined datasets or correcting improperly formatted text data items.
  • Data Augmentation: Increase dataset volume or diversity to address imbalance issues and improve model generalization performance.
    • Text data augmentation methods: Synonym replacement, entity replacement, back translation, disrupting sentence order, or generating sentences using generative models.
  • Data Imputation: Fill missing values (null data items) in text data with appropriate values using statistical methods (mean, median, mode) or iterative multivariate imputation (Iterative Imputer).
  • Data De-identification: When training data contains PII (e.g., names, IP addresses), apply anonymization, pseudonymization, or aggregation methods to protect data subjects' privacy.
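As a minimal sketch of the data cleaning activity above, the following pass normalizes Unicode and whitespace, drops empty items, and removes exact duplicates. Real pipelines go further (near-duplicate detection, language filtering, PII scrubbing); this only illustrates the first step, and the function name is my own.

```python
import unicodedata

def clean_corpus(records):
    """Minimal cleaning pass: NFC-normalize, collapse whitespace, drop empty
    items, and remove exact duplicates (first occurrence wins)."""
    seen, cleaned = set(), []
    for text in records:
        text = unicodedata.normalize("NFC", text or "")
        text = " ".join(text.split())  # collapse runs of whitespace/newlines
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

corpus = ["Hello  world", "Hello world", "", "Second  line"]
print(clean_corpus(corpus))  # duplicates and empty items removed
```

A before/after duplicate-record ratio from such a pass feeds directly into the Consistency measure (Con-ML-2 family) discussed in Section 2.1.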

4. Practical Case: ISO/IEC 5259-Based LLM Training Dataset Assessment and Management

The referenced material presents the development of an 'Agentic AI Data Scientist (AADS)' platform as a concrete case of applying the ISO/IEC 5259 standard to LLM training datasets. This case provides a practical blueprint for how to assess and manage LLM text data.

4.1. Project Goals and Assessment Items

The AADS platform aims to automate the entire process from data collection, quality diagnosis, improvement, to regulatory reporting using autonomous agents. The LLM training dataset assessment for this project includes the following items.

  • Multimodal Data Quality Index (QI). Goal: achieve comprehensive score targets integrating text/image/table data (e.g., Phase 1: 88, Final: 95). Method: measure text and multimodal quality scores based on ISO/IEC 25012 data quality characteristics, applying weights to convert them into a single comprehensive score (QI).
  • Quality Diagnosis Text LLM Accuracy. Goal: verify the performance of the LLM specialized for data quality diagnosis (KONI-4B based AADS-LLM; target accuracy: 95%). Method: measure API accuracy using proprietary standard API test sets, referencing the AgentBench evaluation framework.
  • Autonomous Agent Task Success Rate. Goal: measure how autonomously agents complete complex data management tasks (quality diagnosis, improvement, governance). Method: develop a proprietary standard Task Suite (AADS-DQ-Bench) referencing AgentBench to verify the autonomous execution success rate.

4.2. ISO/IEC 5259 Process Application Scenario

In this case, ISO/IEC 5259 is applied as follows.

  1. Planning and requirements definition: AADS defines AI-specific metrics from ISO/IEC 5259-2 such as Balance, Representativeness, and Diversity as core diagnostic functions based on AI project requirements (e.g., robot/manufacturing/public domain-specific LLM) and sets target QI.
  2. Assessment and diagnosis: The developed text quality diagnosis LLM (AADS-LLM) analyzes text training datasets and measures ISO/IEC 5259-2-based quality indicators (QI). In particular, it identifies issues such as dataset bias, lack of representativeness, and data drift potential through quantitative metrics and visualization.
  3. Improvement activity automation: When quality issues are discovered through assessment, AADS autonomously performs improvement activities specified in ISO/IEC 5259-4, including data cleaning, data augmentation, and imputation. For example, it can automatically recommend specific strategies such as data augmentation or resampling to resolve imbalance issues.
  4. Governance and reporting: AADS automatically records logs, decision-making processes, and final results of all data quality activities, internalizes the governance framework of ISO/IEC 5259-5, and generates audit-ready compliance reports. This ensures AI model transparency and accountability.

4.3. Quantifying LLM Training Dataset Quality: QI Index

The ISO/IEC 5259 standard does not define specific grade ranges for data quality measure values; criteria vary depending on the use case and context.

In the AADS case, after calculating scores for each quality metric, a weighted-sum model is applied to derive a single comprehensive quality score (QI).

$$ \text{Score} = \sum_{i=1}^{n} w_i \cdot s_i $$

Here, $s_i$ is the normalized score (between 0 and 1) for each quality metric, and $w_i$ is the weight representing the importance of that metric ($\sum w_i = 1$). For LLM training datasets, the weights ($w_i$) vary depending on the intended use of the text data.

  • Weight setting example: For robot work instruction text data, higher weights can be assigned to command Consistency and Completeness (e.g., $w_{\text{consistency}}=0.4$, $w_{\text{completeness}}=0.3$).
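The weighted-sum QI formula above is simple enough to sketch directly. The weight and score values below are illustrative only; the consistency/completeness weights follow the robot work-instruction example, while the remaining weight and all scores are my own placeholders.

```python
def quality_index(scores, weights):
    """QI = sum_i w_i * s_i, with each normalized score s_i in [0, 1]
    and the weights w_i summing to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[k] * scores[k] for k in weights)

# Weights per the robot work-instruction example in the text; the
# remaining 0.3 on accuracy (and all scores) are placeholders.
weights = {"consistency": 0.4, "completeness": 0.3, "accuracy": 0.3}
scores  = {"consistency": 0.95, "completeness": 0.90, "accuracy": 0.85}
print(100 * quality_index(scores, weights))  # QI on a 100-point scale
```

Because the weights are the only use-case-specific input, the same function serves different text-data purposes by swapping in a different weight profile.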

5. Building Trust Through Governance and Regulatory Compliance

ISO/IEC 5259 goes beyond simple technical assessment, emphasizing a governance framework that ensures data quality activities are aligned with the organization's overall strategic direction.

5.1. Governance Framework (ISO/IEC 5259-5)

ISO/IEC 5259-5 clearly defines the roles and responsibilities of the organization's governing body and management, ensuring that data quality strategy aligns with business objectives.

  • Role of the Governing Body: Establish data quality strategy, direct and oversee alignment of organizational business objectives (ML-supported business objectives) with data quality objectives.
  • Role of Management: Implement data quality strategy, establish and enforce comprehensive data quality policies, implement data quality management processes (ISO/IEC 5259-3), and build risk management systems.

5.2. Regulatory Compliance and Auditability

In the AI era, auditable data quality becomes an essential business requirement. LLM data assessment solutions compliant with the ISO/IEC 5259 standard have the following competitive advantages.

  • Ensuring Auditability: ISO/IEC 5259-2 defines Auditability and Traceability as important quality characteristics. This requires maintaining records of where LLM data came from and how it was processed (Data provenance).
  • Automated Reporting: As demonstrated in the AADS case, the ability to automatically generate evidence materials (logs, reports) for key control items of ISO 42001 (AI Management System International Standard) is a core element providing 'trust' and 'accountability' to enterprise customers in heavily regulated industries (finance, healthcare).

The ISO/IEC 5259 standard goes beyond simply measuring accuracy or completeness of LLM text data quality, providing a comprehensive blueprint for systematically addressing the unique challenges of the AI era: bias, generalization performance, and regulatory compliance.