Image dataset quality divides into two distinct layers: the 'pixel level' and the 'task level'. ISO/IEC 5259-2 addresses both layers through a framework of 23 top-level QM categories. This guide maps every applicable QM item to image datasets and provides measurement methods alongside DataClinic automation support levels for each.
Image datasets fall into three types — pure images, classification/detection annotations, and image-text pairs — and the relevant QM items differ by type. Common QMs cover pixel-level fundamentals such as file integrity, deduplication, and brightness/resolution distributions. Type-specific QMs measure task-oriented quality like label accuracy, bounding box IoU, and CLIP similarity.
This guide provides a five-step evaluation workflow with Pass/Warn/Fail decision criteria, and clearly distinguishes items that DataClinic measures automatically from those requiring external tools. Practitioners can use this matrix to build a quality evaluation plan tailored to their dataset type.
1. Why Image Datasets Need Their Own Quality Standards
Image data carries fundamentally different quality problems from text. Pixel-level quality (brightness, resolution, corruption) and annotation quality (label accuracy, bounding box IoU) require completely separate measurement systems. In text, "consistency" means uniform vocabulary; in images, "consistency" means uniformity across RGB channel distributions or the absence of duplicate frames. Even the same ISO 5259 QM item demands a different measurement approach depending on the data type.
Failure cases: numbers alone cannot guarantee quality
ImageNet — 1,431,167 images: The per-class image count ranged from 700 to 1,300, so the numbers looked balanced. Yet 120 dog breeds accounted for 12% of the entire dataset — a semantic imbalance that caused models trained on it to overfit dog breed classification and exhibit serious bias in real-world deployments.
WikiArt — 81,444 images: DataClinic reported "RGB consistent," yet the actual Red channel followed a bimodal distribution. The warm reds of Impressionist paintings and the dark tones of Classicist works formed two separate peaks. Automated diagnosis alone could not catch domain-specific patterns like this.
The conclusion is clear: separating pixel-level diagnosis from task-level diagnosis is not optional — conflating them leads to failure.
The framework ISO 5259-2 provides for images
ISO/IEC 5259-2 structures image dataset quality into three groups of characteristics.
- Common quality characteristics (Accuracy, Completeness, Consistency, Credibility, Currentness, etc.) — fundamental quality independent of data type. Measures whether files open, whether duplicates exist, and whether metadata is complete.
- AI/ML additional quality characteristics (Balance, Diversity, Effectiveness, Similarity, Representativeness) — distributional quality specific to image ML. Measures class balance, representativeness within feature space, and sample independence.
- Task-specific extensions — measures label and annotation quality matched to the task type, such as IoU for detection datasets and CLIP similarity for VLP datasets.
2. Three Types of Image Datasets
Applying the same QMs to every image dataset is inefficient. Classifying datasets into three types based on the presence and form of annotations lets you set the right QM priorities for each.
Type A Pure Images (no annotations)
Generative AI training images and unsupervised learning datasets belong here. Because there are no labels, label-related QMs do not apply.
Key concerns: pixel quality, deduplication, distributional balance, representativeness
Applicable QMs: Com-ML-1, Con-ML-1, Cre-ML-1, Bal-ML-1/2, Sim-ML-1/2/3, Rep-ML-1
Type B Classification / Detection / Segmentation Annotations
Supervised learning datasets such as ImageNet (classification), COCO (detection), and Cityscapes (segmentation). All Type A QMs apply, plus label quality QMs.
Key concerns: label accuracy, class balance, bounding box quality
Additional QMs: Acc-ML-6 (IoU), Acc-ML-7 (label accuracy), Bal-ML-3/4/5/6/7/8, Div-ML-1/2/3
Type C Image-Text Pairs (VLP / CLIP / Captioning)
Datasets where images are paired with text, such as LAION-400M, COCO Captions, and Conceptual Captions. Semantic alignment between image and text is the central concern.
Key concerns: image-text semantic alignment, caption completeness
Additional QMs: Acc-ML-2 (CLIP similarity), Com-ML-2 (object presence check), Con-ML-2 (label consistency)
QM priorities by dataset type
The table below shows the priority of key QM items across the three dataset types.
| QM Item | Type A | Type B | Type C |
|---|---|---|---|
| Com-ML-1 File integrity | Required | Required | Required |
| Con-ML-1 Deduplication | Required | Required | Required |
| Cre-ML-1 Pixel quality | Required | Recommended | Recommended |
| Bal-ML-1 Brightness balance | Recommended | Recommended | Recommended |
| Bal-ML-2 Resolution balance | Recommended | Recommended | Recommended |
| Sim-ML-1/2/3 Similarity / independence | Required | Recommended | Recommended |
| Rep-ML-1 Representativeness | Required | Required | Required |
| Acc-ML-7 Label accuracy | — | Required | Recommended |
| Bal-ML-3 Class balance | — | Required | Recommended |
| Acc-ML-6 IoU | — | Required (detection) | — |
| Bal-ML-4/5/6 Bbox balance | — | Required (detection) | — |
| Acc-ML-2 CLIP similarity | — | — | Required |
| Acc-ML-4 RPN risk | Recommended | Required | Recommended |
3. Intrinsic Image Quality (Pixel-Level Layer)
Some QMs apply to every image dataset regardless of type — from whether files open correctly to whether pixel distributions are skewed. This section covers the common QMs that measure the foundational health of image data.
3.1. Completeness
Com-ML-1 Value Completeness — File Integrity ✅ Auto (L1)
Definition: The proportion of image files that can be opened and read successfully.
Measurement: Successful file header parses / total file count.
Image application: Attempt to read files with PIL/OpenCV. Files that raise IOError are treated as null values.
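The check above can be sketched without any imaging library by validating file signatures; a production version would instead call PIL's `Image.open(path).verify()` inside a try/except. The helper names `header_ok` and `completeness` are illustrative, not part of DataClinic or the standard.

```python
# Com-ML-1 sketch: proportion of files whose image header parses.
# Stdlib-only stand-in for PIL/OpenCV header reads (assumption: a
# signature check approximates "file opens successfully").
MAGIC = (
    b"\xff\xd8\xff",        # JPEG SOI marker
    b"\x89PNG\r\n\x1a\n",   # PNG signature (8 bytes)
    b"GIF8",                # GIF87a / GIF89a
)

def header_ok(path: str) -> bool:
    """Return True if the file starts with a known image signature."""
    try:
        with open(path, "rb") as f:
            head = f.read(8)
    except OSError:          # unreadable file counts as a null value
        return False
    return any(head.startswith(sig) for sig in MAGIC)

def completeness(paths) -> float:
    """Com-ML-1: successfully parsed headers / total file count."""
    if not paths:
        return 0.0
    return sum(header_ok(p) for p in paths) / len(paths)
```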
Com-ML-2 Value Occurrence Completeness — Object Presence Check ❌ External Tool
Definition: The proportion of images where the object specified in the annotation is actually present.
Measurement: Images confirmed by an object detection model / total annotated images.
Image application: Verify annotated class objects using YOLO/Faster-RCNN. Applies to Type B and C only.
Com-ML-3 Feature Completeness — Annotation Completeness ❌ External Tool
Definition: The proportion of images where specific features (masks, bboxes, keypoints) are annotated without omissions.
Annotation schema validation is required.
Com-ML-4 Record Completeness — Metadata Completeness ⚠️ Partial
Definition: The proportion of records with all metadata fields present (capture timestamp, resolution, camera info, etc.).
DataClinic offers partial support for reading file metadata.
Com-ML-5 Label Completeness — Label Coverage ✅ Auto (L1)
Definition: The proportion of images that have a label assigned (supervised learning only).
Measurement: Labeled images / total images.
3.2. Consistency
Con-ML-1 Data Record Consistency — Duplicate Images ✅ Auto (L1)
Exact duplicates: Byte-identical files detected via SHA-256 hash.
Near-duplicates: pHash (perceptual hash) and dHash. A pHash Hamming distance below 10 is treated as a duplicate.
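The exact-duplicate stage can be sketched with the stdlib alone; the near-duplicate stage would sit on top of it using the `imagehash` package (pHash/dHash) and is omitted here. `exact_duplicates` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict

# Con-ML-1 sketch, exact-duplicate stage: group files by SHA-256 digest.
def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # stream, don't load whole file
            h.update(chunk)
    return h.hexdigest()

def exact_duplicates(paths):
    """Return groups of byte-identical files (each group has >= 2 members)."""
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```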
Con-ML-2 Data Label Consistency — Label-Image Consistency ⚠️ L2/L3
Detects cases where the same visual pattern receives different labels. Checks label mismatches among pHash near-duplicate image pairs.
DataClinic visualizes label inconsistencies among similar samples at L2/L3.
Con-ML-3 Data Format Consistency — Format Consistency ✅ Auto (L1)
Measures the mix rate of RGB vs. grayscale images and channel count uniformity (1ch vs. 3ch vs. 4ch).
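A minimal sketch of the uniformity measure, assuming each image's mode string has already been read (in practice via PIL's `Image.open(p).mode`). The function name and the "dominant-mode share" formulation are illustrative choices, not DataClinic's exact metric.

```python
from collections import Counter

# Con-ML-3 sketch: channel-count uniformity across a dataset.
CHANNELS = {"L": 1, "RGB": 3, "RGBA": 4}  # PIL mode -> channel count

def format_consistency(modes) -> float:
    """Share of images in the dominant channel format (1.0 = fully uniform)."""
    counts = Counter(CHANNELS.get(m, 0) for m in modes)
    dominant = counts.most_common(1)[0][1]
    return dominant / len(modes)
```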
Con-ML-4 Semantic Consistency — Semantic Anomaly Detection ❌ External Tool
Detects logical inconsistencies within images — for example, a scene labeled as summer that contains snow in the background.
Requires multimodal LLM-based verification using tools such as Qwen-VL or LLaVA.
3.3. Credibility
Cre-ML-1 Values Credibility — Pixel Quality ⚠️ Partial (L1)
BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator): lower is better (range 0–100).
Laplacian Variance: used for blur detection. Low values indicate blur.
Watermark detection: performed via template matching or CLIP.
DataClinic supports brightness/saturation distributions, but BRISQUE requires an external tool.
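The blur metric above is usually a one-liner with OpenCV (`cv2.Laplacian(img, cv2.CV_64F).var()`); the same quantity can be computed by hand on a grayscale 2-D list, as in this dependency-free sketch. The threshold for "blurry" is dataset-dependent and not fixed by the standard.

```python
# Cre-ML-1 sketch: variance of the 4-neighbour Laplacian.
# Low values indicate little local contrast, i.e. likely blur.
def laplacian_variance(img):
    """img: 2-D list of grayscale values, at least 3x3."""
    h, w = len(img), len(img[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```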
Cre-ML-2 Source Credibility — Data Provenance 〰️ Manual
Requires verification of C2PA (Content Credentials) digital signatures and confirmed supplier metadata.
Cre-ML-3 Data Dictionary Credibility — Schema Alignment 〰️ Manual
Verifies that header metadata maps correctly to annotation file properties.
Cre-ML-4 Data Model Credibility — Standard Schema Compliance ❌ External Tool
Validates conformance to COCO JSON, Pascal VOC XML, and YOLO TXT formats.
3.4. Accuracy — Common Items
Acc-ML-3 Data Accuracy Assurance — Quality Assurance 〰️ Manual
Measures the proportion of data collected from verified sources and the proportion that has undergone expert dual review.
Acc-ML-4 Risk of Dataset Inaccuracy — Inaccuracy Risk (RPN) ⚠️ L2/L3
Calculates risk priority using an FMEA-based approach.
\[ \text{RPN} = S \times O \times D \]
Where \(S\) is Severity, \(O\) is Occurrence, and \(D\) is Detectability.
Image risk types: label errors, missing classes, blur/noise, metadata errors, class bias. DataClinic can estimate Occurrence (\(O\)) through outlier detection.
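A worked RPN example under the FMEA convention that each factor is scored 1-10. The risk names and scores below are illustrative, not values prescribed by ISO 5259-2.

```python
# Acc-ML-4 sketch: RPN = S (Severity) x O (Occurrence) x D (Detectability).
def rpn(severity: int, occurrence: int, detectability: int) -> int:
    for v in (severity, occurrence, detectability):
        assert 1 <= v <= 10, "FMEA factors are conventionally scored 1-10"
    return severity * occurrence * detectability

# Hypothetical scores for the image risk types listed above.
risks = {
    "label error":   rpn(8, 4, 5),  # severe, common, hard to detect
    "missing class": rpn(9, 2, 7),
    "blur/noise":    rpn(4, 6, 2),  # frequent but easy to catch automatically
}
ranked = sorted(risks, key=risks.get, reverse=True)  # highest RPN first
```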
Acc-ML-5 Data Model Accuracy — Ontology Alignment ❌ External Tool
Checks whether the class hierarchy aligns with domain ontologies such as WordNet or the ImageNet hierarchy.
3.5. AI/ML Distributional Quality
The core of image ML quality lies in distributions. Balance, Similarity, Representativeness, Diversity, and Effectiveness — these five characteristics govern a model's ability to generalize.
Balance — Common
Bal-ML-1 Brightness Balance — Coefficient of variation of the brightness (mean pixel value) distribution ✅ L1
\[ CV_{\text{brightness}} = \frac{\sigma_{\text{brightness}}}{\mu_{\text{brightness}}} \]
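The formula above reduces to a few lines once per-image mean brightness values have been collected (by averaging pixel values per image). `brightness_cv` is an illustrative helper name.

```python
from statistics import mean, pstdev

# Bal-ML-1 sketch: coefficient of variation of the brightness distribution.
def brightness_cv(per_image_means) -> float:
    """CV = sigma / mu over per-image mean brightness values."""
    mu = mean(per_image_means)
    return pstdev(per_image_means) / mu if mu else float("inf")
```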
Bal-ML-2 Resolution Balance — Uniformity of the resolution distribution ✅ L1
Similarity
Sim-ML-1 Sample Similarity — Mean cosine similarity between samples in feature space ✅ L2/L3
Sim-ML-2 Samples Tightness — Cluster cohesion (concentration in high-density regions) ✅ L2/L3
Sim-ML-3 Samples Independency — Independence measured via nearest-neighbor distance distribution ✅ L2/L3
Representativeness
Rep-ML-1 Representativeness — Measures whether high-density clusters represent the full dataset ✅ L2/L3
Detects bias within feature space. Examples include the Antoine Blanchard effect (over-representation of a specific visual pattern) and the peacock effect (excessive concentration of visually striking samples).
Effectiveness
Eft-ML-1 Feature Effectiveness — Feature validity (separability between classes) ✅ L2/L3
Eft-ML-2 Class Size Effectiveness — Whether class sizes are effective for training ✅ L2/L3
Eft-ML-3 Label Effectiveness — Whether labels are effectively separated in feature space ✅ L2/L3
3.6. Governance Quality
Governance quality belongs to the domain of processes and policies rather than technical measurement. It applies to all dataset types, and most items require manual review or external tooling.
| QM Item | Description | DataClinic |
|---|---|---|
| Idn-ML-1 | Identifiability: proportion of images containing PII (faces, license plates) | ❌ |
| Tra-ML-1~3 | Traceability: records of image collection routes and processing history | ❌ |
| Aud-ML-1~2 | Auditability: quality inspection records and audit trail availability | ❌ |
| Acs-ML-1~3 | Accessibility: data access permission management | ❌ |
| Cmp-ML-1 | Compliance: copyright, privacy law, and license adherence | ❌ |
| Eff-ML-1~3 | Efficiency: file size optimization, format efficiency | ⚠️ |
| Cur-ML-1~2 | Currentness: timeliness of data collection | ❌ |
4. Task-Oriented Quality (Task-Level Layer)
If common QMs measure the "foundational fitness" of an image dataset, type-specific QMs measure its "operational capability" for a given task. The same dataset may require entirely different quality criteria depending on whether it is used to train a classification model or a detection model.
4.1. Classification Datasets
Acc-ML-7 Label Accuracy ✅ L2/L3
Computes the distance from each sample to its class centroid in ViT/ResNet embedding space. Samples closer to a different class centroid are label error candidates.
ImageNet case study: Northcutt et al. 2021 — a 6% error rate, with approximately 85,870 mislabeled images. DataClinic automatically identifies these error candidates at L2/L3 through low-density sample detection.
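The centroid-distance idea can be sketched in a few lines. Real embeddings would come from a ViT/ResNet encoder; toy 2-D points stand in here, and `label_error_candidates` is a hypothetical helper, not DataClinic's actual algorithm.

```python
import math

# Acc-ML-7 sketch: flag samples whose embedding is closer to another
# class centroid than to their own class centroid.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def label_error_candidates(samples):
    """samples: list of (embedding, label). Returns indices of suspects."""
    by_class = {}
    for emb, lab in samples:
        by_class.setdefault(lab, []).append(emb)
    cents = {lab: centroid(pts) for lab, pts in by_class.items()}
    suspects = []
    for i, (emb, lab) in enumerate(samples):
        nearest = min(cents, key=lambda l: dist(emb, cents[l]))
        if nearest != lab:
            suspects.append(i)
    return suspects
```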
Bal-ML-3 Inter-Class Balance ✅ L1
Imbalance ratio is calculated as max class count / min class count.
Real examples: SpectralWaste at 19.6:1, WikiArt at 133:1. Higher imbalance ratios make learning minority classes increasingly difficult.
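The imbalance ratio is a one-liner over a label list:

```python
from collections import Counter

# Bal-ML-3 sketch: imbalance ratio = max class count / min class count.
def imbalance_ratio(labels) -> float:
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```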
Diversity — Div-ML-1/2/3 ✅ L1
Div-ML-1 Label Richness — Total number of unique classes in the dataset
Div-ML-2 Relative Label Abundance — Mean number of samples per class
Div-ML-3 Category Size Diversity — Diversity of the sample count distribution across classes
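The three diversity items above can be computed directly from the label list. Using the coefficient of variation for Div-ML-3 is one reasonable reading of "diversity of the sample count distribution", not the standard's mandated formula.

```python
from collections import Counter
from statistics import mean, pstdev

# Div-ML-1/2/3 sketch on a flat list of class labels.
def diversity_metrics(labels):
    sizes = list(Counter(labels).values())
    return {
        "label_richness": len(sizes),             # Div-ML-1: unique classes
        "relative_abundance": mean(sizes),        # Div-ML-2: mean samples/class
        "size_cv": pstdev(sizes) / mean(sizes),   # Div-ML-3 (as CV, assumption)
    }
```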
Label distribution — Bal-ML-7/8 ❌ External Tool
Bal-ML-7 Label Proportion Balance — Evenness of each class's share of total labels
Bal-ML-8 Label Distribution Balance — Evenness of per-image label count distribution (for multi-label classification)
4.2. Object Detection Datasets
Acc-ML-6 Bounding Box Accuracy (IoU) ❌ External Tool
\[ \text{IoU} = \frac{|\text{Pred} \cap \text{GT}|}{|\text{Pred} \cup \text{GT}|} \geq \text{threshold} \]
The threshold is typically 0.5 (AP50) or 0.75 (AP75). Validated through dual annotation plus expert review.
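The IoU formula translates directly to code for axis-aligned boxes in `(x1, y1, x2, y2)` form:

```python
# Acc-ML-6 sketch: intersection over union of two axis-aligned boxes.
def iou(a, b) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```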
Bounding Box Balance — Bal-ML-4/5/6 ❌ External Tool
Bal-ML-4 Bbox H/W Ratio Balance — Evenness of bounding box height/width ratio distribution. Detects imbalance between portrait and landscape objects.
Bal-ML-5 Bbox Area Balance by Category — Evenness of mean bbox area across classes. Measures imbalance between small-object and large-object classes.
Bal-ML-6 Bbox Area Balance by Sample — Evenness of total bbox area distribution per sample.
4.3. Image-Text Pairs (VLP / CLIP)
Acc-ML-1 Syntactic Accuracy — Caption Grammar Accuracy ❌ External Tool
Detects grammatical errors and special-character contamination in caption text. Requires text processing tools.
Acc-ML-2 Semantic Accuracy — CLIP Semantic Alignment ✅ L3
\[ \text{CLIP cosine similarity}(\text{image}, \text{text}) \geq \text{threshold} \]
The threshold is typically 0.25–0.30. DataClinic L3 supports BLIP image-text matching.
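Once image and text embeddings have been produced by a CLIP model (e.g. via `open_clip` or Hugging Face `transformers`), the alignment check reduces to a cosine similarity plus the thresholds above. The embedding step is omitted here; `clip_verdict` is an illustrative helper.

```python
import math

# Acc-ML-2 sketch: threshold CLIP cosine similarity into Pass/Warn/Fail.
def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_verdict(sim: float, warn: float = 0.25, passing: float = 0.30) -> str:
    if sim >= passing:
        return "Pass"
    return "Warn" if sim >= warn else "Fail"
```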
Con-ML-2 Label-Image Consistency ⚠️ Partial
Detects cases where the same caption is used with different images. DataClinic provides partial support.
5. QM Evaluation Workflow
Knowing the QM items and actually applying them are two different things. The five-step workflow below describes the practical procedure from classifying the dataset type all the way to a final Pass/Fail determination.
Step 1: Classify the dataset type
No annotations → Type A
Classification / detection / segmentation annotations → Type B
Image-text pairs → Type C
Step 2: Measure common QMs (automatable)
Com-ML-1: File integrity check
Con-ML-1: Deduplication (SHA-256 + pHash)
Con-ML-3: Format consistency
Bal-ML-1/2: Brightness/resolution distributions — measurable automatically via DataClinic L1
Step 3: Measure ML-specific distributional quality
Sim-ML-1/2/3, Rep-ML-1, Eft-ML-1~3 — measured automatically via DataClinic L2/L3
Step 4: Measure type-specific task quality
Type A No additional items
Type B Acc-ML-7, Bal-ML-3~8, Div-ML-1~3
Type C Acc-ML-2, Com-ML-2, Con-ML-2
Step 5: Pass / Fail / Warn determination
Set quantitative thresholds for each QM item and determine the verdict.
Example decision criteria
| QM Item | Pass | Warn | Fail |
|---|---|---|---|
| Com-ML-1 | ≥ 99% | 97–99% | < 97% |
| Con-ML-1 | < 1% duplicates | 1–3% | > 3% |
| Bal-ML-3 | ≤ 5:1 | 5–20:1 | > 20:1 |
| Acc-ML-6 | IoU ≥ 0.75 | 0.5–0.75 | < 0.5 |
| Acc-ML-2 | CLIP ≥ 0.30 | 0.25–0.30 | < 0.25 |
Mapping to DataClinic L1 scores: 80–100 = Pass, 60–79 = Warn, below 60 = Fail
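The decision table above can be encoded as per-QM threshold functions. Note that the comparison direction flips per metric: higher is better for file integrity, lower is better for duplicates and imbalance. The structure here is a sketch, not DataClinic's internal rule format.

```python
# Step 5 sketch: map measured QM values to Pass/Warn/Fail verdicts
# using the example criteria from the table (fractions, not percent).
CRITERIA = {
    "Com-ML-1": lambda v: "Pass" if v >= 0.99 else ("Warn" if v >= 0.97 else "Fail"),
    "Con-ML-1": lambda v: "Pass" if v < 0.01 else ("Warn" if v <= 0.03 else "Fail"),
    "Bal-ML-3": lambda v: "Pass" if v <= 5 else ("Warn" if v <= 20 else "Fail"),
}

def evaluate(measurements):
    """measurements: {qm_code: measured value} -> {qm_code: verdict}."""
    return {qm: CRITERIA[qm](v) for qm, v in measurements.items() if qm in CRITERIA}
```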
6. Complete QM Matrix
The table below is a comprehensive matrix that maps all ISO/IEC 5259-2 QM items to the three image dataset types and shows DataClinic's automation support level for each.
| QM Code | QM Item | A | B | C | DataClinic |
|---|---|---|---|---|---|
| Com-ML-1 | File integrity | Required | Required | Required | ✅ L1 |
| Com-ML-2 | Object presence check | - | Recommended | Recommended | ❌ |
| Com-ML-3 | Annotation completeness | - | Recommended | - | ❌ |
| Com-ML-4 | Metadata completeness | Recommended | Recommended | Recommended | ⚠️ |
| Com-ML-5 | Label completeness | - | Required | Recommended | ✅ L1 |
| Con-ML-1 | Deduplication | Required | Required | Required | ✅ L1 |
| Con-ML-2 | Label consistency | - | Recommended | Recommended | ⚠️ L2/L3 |
| Con-ML-3 | Format consistency | Required | Required | Required | ✅ L1 |
| Con-ML-4 | Semantic anomaly detection | Recommended | Recommended | Recommended | ❌ |
| Cre-ML-1 | Pixel quality | Required | Recommended | Recommended | ⚠️ L1 |
| Cre-ML-2 | Source credibility | Recommended | Recommended | Recommended | 〰️ |
| Cre-ML-3 | Schema alignment | Recommended | Recommended | Recommended | 〰️ |
| Cre-ML-4 | Standard schema compliance | - | Recommended | Recommended | ❌ |
| Acc-ML-1 | Syntactic accuracy | - | - | Recommended | ❌ |
| Acc-ML-2 | Semantic accuracy (CLIP) | - | - | Required | ✅ L3 |
| Acc-ML-3 | Quality assurance | Recommended | Recommended | Recommended | 〰️ |
| Acc-ML-4 | Inaccuracy risk (RPN) | Recommended | Required | Recommended | ⚠️ L2/L3 |
| Acc-ML-5 | Ontology alignment | - | Recommended | - | ❌ |
| Acc-ML-6 | Bbox IoU accuracy | - | Required (detection) | - | ❌ |
| Acc-ML-7 | Label accuracy | - | Required | Recommended | ✅ L2/L3 |
| Bal-ML-1 | Brightness balance | Recommended | Recommended | Recommended | ✅ L1 |
| Bal-ML-2 | Resolution balance | Recommended | Recommended | Recommended | ✅ L1 |
| Bal-ML-3 | Inter-class balance | - | Required | Recommended | ✅ L1 |
| Bal-ML-4 | Bbox H/W balance | - | Recommended (detection) | - | ❌ |
| Bal-ML-5 | Bbox area balance (by class) | - | Recommended (detection) | - | ❌ |
| Bal-ML-6 | Bbox area balance (by sample) | - | Recommended (detection) | - | ❌ |
| Bal-ML-7 | Label proportion balance | - | Recommended | Recommended | ❌ |
| Bal-ML-8 | Label distribution balance | - | Recommended | - | ❌ |
| Div-ML-1 | Label richness | - | Recommended | Recommended | ✅ L1 |
| Div-ML-2 | Relative label abundance | - | Recommended | Recommended | ✅ L1 |
| Div-ML-3 | Category size diversity | - | Recommended | Recommended | ✅ L1 |
| Eft-ML-1 | Feature effectiveness | Recommended | Recommended | Recommended | ✅ L2/L3 |
| Eft-ML-2 | Class size effectiveness | - | Recommended | - | ✅ L2/L3 |
| Eft-ML-3 | Label effectiveness | - | Recommended | Recommended | ✅ L2/L3 |
| Sim-ML-1 | Sample similarity | Required | Recommended | Recommended | ✅ L2/L3 |
| Sim-ML-2 | Sample tightness | Required | Recommended | Recommended | ✅ L2/L3 |
| Sim-ML-3 | Sample independency | Recommended | Recommended | Recommended | ✅ L2/L3 |
| Rep-ML-1 | Representativeness | Required | Required | Required | ✅ L2/L3 |
| Idn-ML-1 | Identifiability (PII) | Recommended | Recommended | Recommended | ❌ |
| Cur-ML-1 | Feature currentness | Recommended | Recommended | Recommended | ❌ |
| Cur-ML-2 | Record currentness | Recommended | Recommended | Recommended | ❌ |
| Rel-ML-1 | Feature relevance | Recommended | Recommended | Recommended | ❌ |
| Rel-ML-2 | Record relevance | Recommended | Recommended | Recommended | ❌ |
| Tra-ML-1~3 | Traceability | Recommended | Recommended | Recommended | ❌ |
| Aud-ML-1~2 | Auditability | Recommended | Recommended | Recommended | ❌ |
| Acs-ML-1~3 | Accessibility | Recommended | Recommended | Recommended | ❌ |
| Cmp-ML-1 | Compliance | Recommended | Recommended | Recommended | ❌ |
| Eff-ML-1~3 | Efficiency | Recommended | Recommended | Recommended | ⚠️ |
| Por-ML-1~2 | Portability | Recommended | Recommended | Recommended | ❌ |
| Tml-ML-1 | Timeliness | Recommended | Recommended | Recommended | ❌ |
Items not yet supported by DataClinic do not represent gaps in data quality — they simply fall outside the current scope of automation. These items can be addressed through specialized tools (BRISQUE, IoU validators, C2PA toolkit, etc.) or through structured manual review processes.