Image dataset quality divides into two distinct layers: the 'pixel level' and the 'task level'. ISO/IEC 5259-2 addresses both layers through a framework of 23 top-level QM categories. This guide maps every applicable QM item to image datasets and provides measurement methods alongside DataClinic automation support levels for each.
Image datasets fall into three types — pure images, classification/detection annotations, and image-text pairs — and the relevant QM items differ by type. Common QMs cover pixel-level fundamentals such as file integrity, deduplication, and brightness/resolution distributions. Type-specific QMs measure task-oriented quality like label accuracy, bounding box IoU, and CLIP similarity.
This guide provides a five-step evaluation workflow with Pass/Warn/Fail decision criteria, and clearly distinguishes items that DataClinic measures automatically from those requiring external tools. Practitioners can use this matrix to build a quality evaluation plan tailored to their dataset type.
1. Why Image Datasets Need Their Own Quality Standards
Image data carries fundamentally different quality problems from text. Pixel-level quality (brightness, resolution, corruption) and annotation quality (label accuracy, bounding box IoU) require completely separate measurement systems. In text, "consistency" means uniform vocabulary; in images, "consistency" means uniformity across RGB channel distributions or the absence of duplicate frames. Even the same ISO 5259 QM item demands a different measurement approach depending on the data type.
Failure cases: numbers alone cannot guarantee quality
ImageNet — 1,431,167 images: The per-class image count ranged from 700 to 1,300, so the numbers looked balanced. Yet 120 dog breeds accounted for 12% of the entire dataset — a semantic imbalance that caused models trained on it to overfit dog breed classification and exhibit serious bias in real-world deployments.
WikiArt — 81,444 images: DataClinic reported "RGB consistent," yet the actual Red channel followed a bimodal distribution. The warm reds of Impressionist paintings and the dark tones of Classicist works formed two separate peaks. Automated diagnosis alone could not catch domain-specific patterns like this.
The conclusion is clear: separating pixel-level diagnosis from task-level diagnosis is not optional — conflating them leads to failure.
The framework ISO 5259-2 provides for images
ISO/IEC 5259-2 structures image dataset quality into three groups of characteristics.
- Common quality characteristics (Accuracy, Completeness, Consistency, Credibility, Currentness, etc.) — fundamental quality independent of data type. Measures whether files open, whether duplicates exist, and whether metadata is complete.
- AI/ML additional quality characteristics (Balance, Diversity, Effectiveness, Similarity, Representativeness) — distributional quality specific to image ML. Measures class balance, representativeness within feature space, and sample independence.
- Task-specific extensions — measures label and annotation quality matched to the task type, such as IoU for detection datasets and CLIP similarity for VLP datasets.
2. Three Types of Image Datasets
Applying the same QMs to every image dataset is inefficient. Classifying datasets into three types based on the presence and form of annotations lets you set the right QM priorities for each.
Type A Pure Images (no annotations)
Generative AI training images and unsupervised learning datasets belong here. Because there are no labels, label-related QMs do not apply.
Key concerns: pixel quality, deduplication, distributional balance, representativeness
Applicable QMs: Com-ML-1, Con-ML-1, Cre-ML-1, Bal-ML-1/2, Sim-ML-1/2/3, Rep-ML-1
Type B Classification / Detection / Segmentation Annotations
Supervised learning datasets such as ImageNet (classification), COCO (detection), and Cityscapes (segmentation). All Type A QMs apply, plus label quality QMs.
Key concerns: label accuracy, class balance, bounding box quality
Additional QMs: Acc-ML-6 (IoU), Acc-ML-7 (label accuracy), Bal-ML-3/4/5/6/7/8, Div-ML-1/2/3
Type C Image-Text Pairs (VLP / CLIP / Captioning)
Datasets where images are paired with text, such as LAION-400M, COCO Captions, and Conceptual Captions. Semantic alignment between image and text is the central concern.
Key concerns: image-text semantic alignment, caption completeness
Additional QMs: Acc-ML-2 (CLIP similarity), Com-ML-2 (object presence check), Con-ML-2 (label consistency)
QM priorities by dataset type
The table below shows the priority of key QM items across the three dataset types.
| QM Item | Type A | Type B | Type C |
|---|---|---|---|
| Com-ML-1 File integrity | Required | Required | Required |
| Con-ML-1 Deduplication | Required | Required | Required |
| Cre-ML-1 Pixel quality | Required | Recommended | Recommended |
| Bal-ML-1 Brightness balance | Recommended | Recommended | Recommended |
| Bal-ML-2 Resolution balance | Recommended | Recommended | Recommended |
| Sim-ML-1/2/3 Similarity / independence | Required | Recommended | Recommended |
| Rep-ML-1 Representativeness | Required | Required | Required |
| Acc-ML-7 Label accuracy | — | Required | Recommended |
| Bal-ML-3 Class balance | — | Required | Recommended |
| Acc-ML-6 IoU | — | Required (detection) | — |
| Bal-ML-4/5/6 Bbox balance | — | Required (detection) | — |
| Acc-ML-2 CLIP similarity | — | — | Required |
| Acc-ML-4 RPN risk | Recommended | Required | Recommended |
3. Intrinsic Image Quality (Pixel-Level Layer)
Some QMs apply to every image dataset regardless of type — from whether files open correctly to whether pixel distributions are skewed. This section covers the common QMs that measure the foundational health of image data.
3.1. Completeness
Com-ML-1 Value Completeness — File Integrity ✅ Auto (L1)
Definition: The proportion of image files that can be opened and read successfully.
Measurement: Successful file header parses / total file count.
Image application: Attempt to read files with PIL/OpenCV. Files that raise IOError are treated as null values.
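The check above can be sketched without any imaging library by validating file signatures; a production version would instead call PIL's `Image.open(path).verify()` inside a try/except. The helper names `header_ok` and `completeness` are illustrative, not part of DataClinic or the standard.

```python
# Com-ML-1 sketch: proportion of files whose image header parses.
# Stdlib-only stand-in for PIL/OpenCV header reads (assumption: a
# signature check approximates "file opens successfully").
MAGIC = (
    b"\xff\xd8\xff",        # JPEG SOI marker
    b"\x89PNG\r\n\x1a\n",   # PNG signature (8 bytes)
    b"GIF8",                # GIF87a / GIF89a
)

def header_ok(path: str) -> bool:
    """Return True if the file starts with a known image signature."""
    try:
        with open(path, "rb") as f:
            head = f.read(8)
    except OSError:          # unreadable file counts as a null value
        return False
    return any(head.startswith(sig) for sig in MAGIC)

def completeness(paths) -> float:
    """Com-ML-1: successfully parsed headers / total file count."""
    if not paths:
        return 0.0
    return sum(header_ok(p) for p in paths) / len(paths)
```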
Com-ML-2 Value Occurrence Completeness — Object Presence Check ❌ External Tool
Definition: The proportion of images where the object specified in the annotation is actually present.
Measurement: Images confirmed by an object detection model / total annotated images.
Image application: Verify annotated class objects using YOLO/Faster-RCNN. Applies to Type B and C only.
Com-ML-3 Feature Completeness — Annotation Completeness ❌ External Tool
Definition: The proportion of images where specific features (masks, bboxes, keypoints) are annotated without omissions.
Annotation schema validation is required.
Com-ML-4 Record Completeness — Metadata Completeness ⚠️ Partial
Definition: The proportion of records with all metadata fields present (capture timestamp, resolution, camera info, etc.).
DataClinic offers partial support for reading file metadata.
Com-ML-5 Label Completeness — Label Coverage ✅ Auto (L1)
Definition: The proportion of images that have a label assigned (supervised learning only).
Measurement: Labeled images / total images.
3.2. Consistency
Con-ML-1 Data Record Consistency — Duplicate Images ✅ Auto (L1)
Exact duplicates: Byte-identical files detected via SHA-256 hash.
Near-duplicates: pHash (perceptual hash) and dHash. A pHash Hamming distance below 10 is treated as a duplicate.
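The exact-duplicate stage can be sketched with the stdlib alone; the near-duplicate stage would sit on top of it using the `imagehash` package (pHash/dHash) and is omitted here. `exact_duplicates` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict

# Con-ML-1 sketch, exact-duplicate stage: group files by SHA-256 digest.
def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # stream, don't load whole file
            h.update(chunk)
    return h.hexdigest()

def exact_duplicates(paths):
    """Return groups of byte-identical files (each group has >= 2 members)."""
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```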
Con-ML-2 Data Label Consistency — Label-Image Consistency ⚠️ L2/L3
Detects cases where the same visual pattern receives different labels. Checks label mismatches among pHash near-duplicate image pairs.
DataClinic visualizes label inconsistencies among similar samples at L2/L3.
Con-ML-3 Data Format Consistency — Format Consistency ✅ Auto (L1)
Measures the mix rate of RGB vs. grayscale images and channel count uniformity (1ch vs. 3ch vs. 4ch).
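A minimal sketch of the uniformity measure, assuming each image's mode string has already been read (in practice via PIL's `Image.open(p).mode`). The function name and the "dominant-mode share" formulation are illustrative choices, not DataClinic's exact metric.

```python
from collections import Counter

# Con-ML-3 sketch: channel-count uniformity across a dataset.
CHANNELS = {"L": 1, "RGB": 3, "RGBA": 4}  # PIL mode -> channel count

def format_consistency(modes) -> float:
    """Share of images in the dominant channel format (1.0 = fully uniform)."""
    counts = Counter(CHANNELS.get(m, 0) for m in modes)
    dominant = counts.most_common(1)[0][1]
    return dominant / len(modes)
```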
Con-ML-4 Semantic Consistency — Semantic Anomaly Detection ❌ External Tool
Detects logical inconsistencies within images — for example, a scene labeled as summer that contains snow in the background.
Requires multimodal LLM-based verification using tools such as Qwen-VL or LLaVA.
3.3. Credibility
Cre-ML-1 Values Credibility — Pixel Quality ⚠️ Partial (L1)
BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator): lower is better (range 0–100).
Laplacian Variance: used for blur detection. Low values indicate blur.
Watermark detection: performed via template matching or CLIP.
DataClinic supports brightness/saturation distributions, but BRISQUE requires an external tool.
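The blur metric above is usually a one-liner with OpenCV (`cv2.Laplacian(img, cv2.CV_64F).var()`); the same quantity can be computed by hand on a grayscale 2-D list, as in this dependency-free sketch. The threshold for "blurry" is dataset-dependent and not fixed by the standard.

```python
# Cre-ML-1 sketch: variance of the 4-neighbour Laplacian.
# Low values indicate little local contrast, i.e. likely blur.
def laplacian_variance(img):
    """img: 2-D list of grayscale values, at least 3x3."""
    h, w = len(img), len(img[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```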
Cre-ML-2 Source Credibility — Data Provenance 〰️ Manual
Requires verification of C2PA (Content Credentials) digital signatures and confirmed supplier metadata.
Cre-ML-3 Data Dictionary Credibility — Schema Alignment 〰️ Manual
Verifies that header metadata maps correctly to annotation file properties.
Cre-ML-4 Data Model Credibility — Standard Schema Compliance ❌ External Tool
Validates conformance to COCO JSON, Pascal VOC XML, and YOLO TXT formats.
3.4. Accuracy — Common Items
Acc-ML-3 Data Accuracy Assurance — Quality Assurance 〰️ Manual
Measures the proportion of data collected from verified sources and the proportion that has undergone expert dual review.
Acc-ML-4 Risk of Dataset Inaccuracy — Inaccuracy Risk (RPN) ⚠️ L2/L3
Calculates risk priority using an FMEA-based approach.
\[ \text{RPN} = S \times O \times D \]
Where \(S\) is Severity, \(O\) is Occurrence, and \(D\) is Detectability.
Image risk types: label errors, missing classes, blur/noise, metadata errors, class bias. DataClinic can estimate Occurrence (\(O\)) through outlier detection.
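A worked RPN example under the FMEA convention that each factor is scored 1-10. The risk names and scores below are illustrative, not values prescribed by ISO 5259-2.

```python
# Acc-ML-4 sketch: RPN = S (Severity) x O (Occurrence) x D (Detectability).
def rpn(severity: int, occurrence: int, detectability: int) -> int:
    for v in (severity, occurrence, detectability):
        assert 1 <= v <= 10, "FMEA factors are conventionally scored 1-10"
    return severity * occurrence * detectability

# Hypothetical scores for the image risk types listed above.
risks = {
    "label error":   rpn(8, 4, 5),  # severe, common, hard to detect
    "missing class": rpn(9, 2, 7),
    "blur/noise":    rpn(4, 6, 2),  # frequent but easy to catch automatically
}
ranked = sorted(risks, key=risks.get, reverse=True)  # highest RPN first
```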
Acc-ML-5 Data Model Accuracy — Ontology Alignment ❌ External Tool
Checks whether the class hierarchy aligns with domain ontologies such as WordNet or the ImageNet hierarchy.
3.5. AI/ML Distributional Quality
The core of image ML quality lies in distributions. Balance, Similarity, Representativeness, Diversity, and Effectiveness — these five characteristics govern a model's ability to generalize.
Balance — Common
Bal-ML-1 Brightness Balance — Coefficient of variation of the brightness (mean pixel value) distribution ✅ L1
\[ CV_{\text{brightness}} = \frac{\sigma_{\text{brightness}}}{\mu_{\text{brightness}}} \]
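The formula above reduces to a few lines once per-image mean brightness values have been collected (by averaging pixel values per image). `brightness_cv` is an illustrative helper name.

```python
from statistics import mean, pstdev

# Bal-ML-1 sketch: coefficient of variation of the brightness distribution.
def brightness_cv(per_image_means) -> float:
    """CV = sigma / mu over per-image mean brightness values."""
    mu = mean(per_image_means)
    return pstdev(per_image_means) / mu if mu else float("inf")
```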
Bal-ML-2 Resolution Balance — Uniformity of the resolution distribution ✅ L1
Similarity
Sim-ML-1 Sample Similarity — Mean cosine similarity between samples in feature space ✅ L2/L3
Sim-ML-2 Samples Tightness — Cluster cohesion (concentration in high-density regions) ✅ L2/L3
Sim-ML-3 Samples Independency — Independence measured via nearest-neighbor distance distribution ✅ L2/L3
Representativeness
Rep-ML-1 Representativeness — Measures whether high-density clusters represent the full dataset ✅ L2/L3
Detects bias within feature space. Examples include the Antoine Blanchard effect (over-representation of a specific visual pattern) and the peacock effect (excessive concentration of visually striking samples).
Effectiveness
Eft-ML-1 Feature Effectiveness — Feature validity (separability between classes) ✅ L2/L3
Eft-ML-2 Class Size Effectiveness — Whether class sizes are effective for training ✅ L2/L3
Eft-ML-3 Label Effectiveness — Whether labels are effectively separated in feature space ✅ L2/L3
3.6. Governance Quality
Governance quality belongs to the domain of processes and policies rather than technical measurement. It applies to all dataset types, and most items require manual review or external tooling.
| QM Item | Description | DataClinic |
|---|---|---|
| Idn-ML-1 | Identifiability: proportion of images containing PII (faces, license plates) | ❌ |
| Tra-ML-1~3 | Traceability: records of image collection routes and processing history | ❌ |
| Aud-ML-1~2 | Auditability: quality inspection records and audit trail availability | ❌ |
| Acs-ML-1~3 | Accessibility: data access permission management | ❌ |
| Cmp-ML-1 | Compliance: copyright, privacy law, and license adherence | ❌ |
| Eff-ML-1~3 | Efficiency: file size optimization, format efficiency | ⚠️ |
| Cur-ML-1~2 | Currentness: timeliness of data collection | ❌ |
4. Task-Oriented Quality (Task-Level Layer)
If common QMs measure the "foundational fitness" of an image dataset, type-specific QMs measure its "operational capability" for a given task. The same dataset may require entirely different quality criteria depending on whether it is used to train a classification model or a detection model.
4.1. Classification Datasets
Acc-ML-7 Label Accuracy ✅ L2/L3
Computes the distance from each sample to its class centroid in ViT/ResNet embedding space. Samples closer to a different class centroid are label error candidates.
ImageNet case study: Northcutt et al. 2021 — a 6% error rate, with approximately 85,870 mislabeled images. DataClinic automatically identifies these error candidates at L2/L3 through low-density sample detection.
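The centroid-distance idea can be sketched in a few lines. Real embeddings would come from a ViT/ResNet encoder; toy 2-D points stand in here, and `label_error_candidates` is a hypothetical helper, not DataClinic's actual algorithm.

```python
import math

# Acc-ML-7 sketch: flag samples whose embedding is closer to another
# class centroid than to their own class centroid.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def label_error_candidates(samples):
    """samples: list of (embedding, label). Returns indices of suspects."""
    by_class = {}
    for emb, lab in samples:
        by_class.setdefault(lab, []).append(emb)
    cents = {lab: centroid(pts) for lab, pts in by_class.items()}
    suspects = []
    for i, (emb, lab) in enumerate(samples):
        nearest = min(cents, key=lambda l: dist(emb, cents[l]))
        if nearest != lab:
            suspects.append(i)
    return suspects
```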
Bal-ML-3 Inter-Class Balance ✅ L1
Imbalance ratio is calculated as max class count / min class count.
Real examples: SpectralWaste at 19.6:1, WikiArt at 133:1. Higher imbalance ratios make learning minority classes increasingly difficult.
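The imbalance ratio is a one-liner over a label list:

```python
from collections import Counter

# Bal-ML-3 sketch: imbalance ratio = max class count / min class count.
def imbalance_ratio(labels) -> float:
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```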
Diversity — Div-ML-1/2/3 ✅ L1
Div-ML-1 Label Richness — Total number of unique classes in the dataset
Div-ML-2 Relative Label Abundance — Mean number of samples per class
Div-ML-3 Category Size Diversity — Diversity of the sample count distribution across classes
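The three diversity items above can be computed directly from the label list. Using the coefficient of variation for Div-ML-3 is one reasonable reading of "diversity of the sample count distribution", not the standard's mandated formula.

```python
from collections import Counter
from statistics import mean, pstdev

# Div-ML-1/2/3 sketch on a flat list of class labels.
def diversity_metrics(labels):
    sizes = list(Counter(labels).values())
    return {
        "label_richness": len(sizes),             # Div-ML-1: unique classes
        "relative_abundance": mean(sizes),        # Div-ML-2: mean samples/class
        "size_cv": pstdev(sizes) / mean(sizes),   # Div-ML-3 (as CV, assumption)
    }
```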
Label distribution — Bal-ML-7/8 ❌ External Tool
Bal-ML-7 Label Proportion Balance — Evenness of each class's share of total labels
Bal-ML-8 Label Distribution Balance — Evenness of per-image label count distribution (for multi-label classification)
4.2. Object Detection Datasets
Acc-ML-6 Bounding Box Accuracy (IoU) ❌ External Tool
\[ \text{IoU} = \frac{|\text{Pred} \cap \text{GT}|}{|\text{Pred} \cup \text{GT}|} \geq \text{threshold} \]
The threshold is typically 0.5 (AP50) or 0.75 (AP75). Validated through dual annotation plus expert review.
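The IoU formula translates directly to code for axis-aligned boxes in `(x1, y1, x2, y2)` form:

```python
# Acc-ML-6 sketch: intersection over union of two axis-aligned boxes.
def iou(a, b) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # 0 if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```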
Bounding Box Balance — Bal-ML-4/5/6 ❌ External Tool
Bal-ML-4 Bbox H/W Ratio Balance — Evenness of bounding box height/width ratio distribution. Detects imbalance between portrait and landscape objects.
Bal-ML-5 Bbox Area Balance by Category — Evenness of mean bbox area across classes. Measures imbalance between small-object and large-object classes.
Bal-ML-6 Bbox Area Balance by Sample — Evenness of total bbox area distribution per sample.
4.3. Image-Text Pairs (VLP / CLIP)
Acc-ML-1 Syntactic Accuracy — Caption Grammar Accuracy ❌ External Tool
Detects grammatical errors and special-character contamination in caption text. Requires text processing tools.
Acc-ML-2 Semantic Accuracy — CLIP Semantic Alignment ✅ L3
\[ \text{CLIP cosine similarity}(\text{image}, \text{text}) \geq \text{threshold} \]
The threshold is typically 0.25–0.30. DataClinic L3 supports BLIP image-text matching.
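Once image and text embeddings have been produced by a CLIP model (e.g. via `open_clip` or Hugging Face `transformers`), the alignment check reduces to a cosine similarity plus the thresholds above. The embedding step is omitted here; `clip_verdict` is an illustrative helper.

```python
import math

# Acc-ML-2 sketch: threshold CLIP cosine similarity into Pass/Warn/Fail.
def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_verdict(sim: float, warn: float = 0.25, passing: float = 0.30) -> str:
    if sim >= passing:
        return "Pass"
    return "Warn" if sim >= warn else "Fail"
```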
Con-ML-2 Label-Image Consistency ⚠️ Partial
Detects cases where the same caption is used with different images. DataClinic provides partial support.
5. QM Evaluation Workflow
Knowing the QM items and actually applying them are two different things. The five-step workflow below describes the practical procedure from classifying the dataset type all the way to a final Pass/Fail determination.
Step 1: Classify the dataset type
No annotations → Type A
Classification / detection / segmentation annotations → Type B
Image-text pairs → Type C
Step 2: Measure common QMs (automatable)
Com-ML-1: File integrity check
Con-ML-1: Deduplication (SHA-256 + pHash)
Con-ML-3: Format consistency
Bal-ML-1/2: Brightness/resolution distributions — measurable automatically via DataClinic L1
Step 3: Measure ML-specific distributional quality
Sim-ML-1/2/3, Rep-ML-1, Eft-ML-1~3 — measured automatically via DataClinic L2/L3
Step 4: Measure type-specific task quality
Type A No additional items
Type B Acc-ML-7, Bal-ML-3~8, Div-ML-1~3
Type C Acc-ML-2, Com-ML-2, Con-ML-2
Step 5: Pass / Fail / Warn determination
Set quantitative thresholds for each QM item and determine the verdict.
Example decision criteria
| QM Item | Pass | Warn | Fail |
|---|---|---|---|
| Com-ML-1 | ≥ 99% | 97–99% | < 97% |
| Con-ML-1 | < 1% duplicates | 1–3% | > 3% |
| Bal-ML-3 | ≤ 5:1 | 5–20:1 | > 20:1 |
| Acc-ML-6 | IoU ≥ 0.75 | 0.5–0.75 | < 0.5 |
| Acc-ML-2 | CLIP ≥ 0.30 | 0.25–0.30 | < 0.25 |
Mapping to DataClinic L1 scores: 80–100 = Pass, 60–79 = Warn, below 60 = Fail
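The decision table above can be encoded as per-QM threshold functions. Note that the comparison direction flips per metric: higher is better for file integrity, lower is better for duplicates and imbalance. The structure here is a sketch, not DataClinic's internal rule format.

```python
# Step 5 sketch: map measured QM values to Pass/Warn/Fail verdicts
# using the example criteria from the table (fractions, not percent).
CRITERIA = {
    "Com-ML-1": lambda v: "Pass" if v >= 0.99 else ("Warn" if v >= 0.97 else "Fail"),
    "Con-ML-1": lambda v: "Pass" if v < 0.01 else ("Warn" if v <= 0.03 else "Fail"),
    "Bal-ML-3": lambda v: "Pass" if v <= 5 else ("Warn" if v <= 20 else "Fail"),
}

def evaluate(measurements):
    """measurements: {qm_code: measured value} -> {qm_code: verdict}."""
    return {qm: CRITERIA[qm](v) for qm, v in measurements.items() if qm in CRITERIA}
```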
6. Complete QM Matrix
The table below is a comprehensive matrix that maps all ISO/IEC 5259-2 QM items to the three image dataset types and shows DataClinic's automation support level for each.
| QM Code | QM Item | A | B | C | DataClinic |
|---|---|---|---|---|---|
| Com-ML-1 | File integrity | Required | Required | Required | ✅ L1 |
| Com-ML-2 | Object presence check | - | Recommended | Recommended | ❌ |
| Com-ML-3 | Annotation completeness | - | Recommended | - | ❌ |
| Com-ML-4 | Metadata completeness | Recommended | Recommended | Recommended | ⚠️ |
| Com-ML-5 | Label completeness | - | Required | Recommended | ✅ L1 |
| Con-ML-1 | Deduplication | Required | Required | Required | ✅ L1 |
| Con-ML-2 | Label consistency | - | Recommended | Recommended | ⚠️ L2/L3 |
| Con-ML-3 | Format consistency | Required | Required | Required | ✅ L1 |
| Con-ML-4 | Semantic anomaly detection | Recommended | Recommended | Recommended | ❌ |
| Cre-ML-1 | Pixel quality | Required | Recommended | Recommended | ⚠️ L1 |
| Cre-ML-2 | Source credibility | Recommended | Recommended | Recommended | 〰️ |
| Cre-ML-3 | Schema alignment | Recommended | Recommended | Recommended | 〰️ |
| Cre-ML-4 | Standard schema compliance | - | Recommended | Recommended | ❌ |
| Acc-ML-1 | Syntactic accuracy | - | - | Recommended | ❌ |
| Acc-ML-2 | Semantic accuracy (CLIP) | - | - | Required | ✅ L3 |
| Acc-ML-3 | Quality assurance | Recommended | Recommended | Recommended | 〰️ |
| Acc-ML-4 | Inaccuracy risk (RPN) | Recommended | Required | Recommended | ⚠️ L2/L3 |
| Acc-ML-5 | Ontology alignment | - | Recommended | - | ❌ |
| Acc-ML-6 | Bbox IoU accuracy | - | Required (detection) | - | ❌ |
| Acc-ML-7 | Label accuracy | - | Required | Recommended | ✅ L2/L3 |
| Bal-ML-1 | Brightness balance | Recommended | Recommended | Recommended | ✅ L1 |
| Bal-ML-2 | Resolution balance | Recommended | Recommended | Recommended | ✅ L1 |
| Bal-ML-3 | Inter-class balance | - | Required | Recommended | ✅ L1 |
| Bal-ML-4 | Bbox H/W balance | - | Recommended (detection) | - | ❌ |
| Bal-ML-5 | Bbox area balance (by class) | - | Recommended (detection) | - | ❌ |
| Bal-ML-6 | Bbox area balance (by sample) | - | Recommended (detection) | - | ❌ |
| Bal-ML-7 | Label proportion balance | - | Recommended | Recommended | ❌ |
| Bal-ML-8 | Label distribution balance | - | Recommended | - | ❌ |
| Div-ML-1 | Label richness | - | Recommended | Recommended | ✅ L1 |
| Div-ML-2 | Relative label abundance | - | Recommended | Recommended | ✅ L1 |
| Div-ML-3 | Category size diversity | - | Recommended | Recommended | ✅ L1 |
| Eft-ML-1 | Feature effectiveness | Recommended | Recommended | Recommended | ✅ L2/L3 |
| Eft-ML-2 | Class size effectiveness | - | Recommended | - | ✅ L2/L3 |
| Eft-ML-3 | Label effectiveness | - | Recommended | Recommended | ✅ L2/L3 |
| Sim-ML-1 | Sample similarity | Required | Recommended | Recommended | ✅ L2/L3 |
| Sim-ML-2 | Sample tightness | Required | Recommended | Recommended | ✅ L2/L3 |
| Sim-ML-3 | Sample independency | Recommended | Recommended | Recommended | ✅ L2/L3 |
| Rep-ML-1 | Representativeness | Required | Required | Required | ✅ L2/L3 |
| Idn-ML-1 | Identifiability (PII) | Recommended | Recommended | Recommended | ❌ |
| Cur-ML-1 | Feature currentness | Recommended | Recommended | Recommended | ❌ |
| Cur-ML-2 | Record currentness | Recommended | Recommended | Recommended | ❌ |
| Rel-ML-1 | Feature relevance | Recommended | Recommended | Recommended | ❌ |
| Rel-ML-2 | Record relevance | Recommended | Recommended | Recommended | ❌ |
| Tra-ML-1~3 | Traceability | Recommended | Recommended | Recommended | ❌ |
| Aud-ML-1~2 | Auditability | Recommended | Recommended | Recommended | ❌ |
| Acs-ML-1~3 | Accessibility | Recommended | Recommended | Recommended | ❌ |
| Cmp-ML-1 | Compliance | Recommended | Recommended | Recommended | ❌ |
| Eff-ML-1~3 | Efficiency | Recommended | Recommended | Recommended | ⚠️ |
| Por-ML-1~2 | Portability | Recommended | Recommended | Recommended | ❌ |
| Tml-ML-1 | Timeliness | Recommended | Recommended | Recommended | ❌ |
Items not yet supported by DataClinic do not represent gaps in data quality — they simply fall outside the current scope of automation. These items can be addressed through specialized tools (BRISQUE, IoU validators, C2PA toolkit, etc.) or through structured manual review processes.