Evaluation Methodology

This report re-interprets DataClinic's three-level diagnostic results (Level I / II / III) through the ISO/IEC 5259-2:2024 Quality Measure (QM) framework as an independent evaluation. The metrics, charts, and outlier samples measured by DataClinic are mapped to each ISO QM's formal definition, with Pass / Fail / Caution verdicts rendered independently.

DataClinic L1 Diagnostics DataClinic L2/L3 Diagnostics ISO 5259-2 QM Interpretation

Summary: We independently evaluated the SpectralWaste recycling waste image dataset (2,794 images, 6 classes) against ISO/IEC 5259-2:2024 Quality Measures (QMs). DataClinic's three-level diagnostic metrics and charts were mapped to ISO QM definitions. Of the 13 QM items assessed (counting Com-ML-1/3/5 as one item, as in the summary table in Section 10), 3 passed, 6 failed, and 4 received caution flags. The core issues are severe class imbalance (19.6:1 max ratio) and a lack of representativeness and diversity caused by a single capture environment. DataClinic's "Bulk-up" recommendation aligns with the ISO Bal-ML-1 and Eft-ML-1 Fail verdicts.

Measured QM items passed: 3 / 13
Failed items: 6
Caution items: 4
Unmeasured (roadmap): 3

1 Dataset Overview

Basic Information

Dataset: SpectralWaste
Source: Kaggle
Diagnosed Images: 1,709 (out of 2,794 total)
Classes: 6
Image Size: 276 x 256 px (RGB)
DataClinic Score: 68 / 100 (Moderate)

Class Distribution (L1 Diagnosis)

Class | Samples | Ratio
video_tape | 646 | 37.8%
basket | 384 | 22.5%
film | 248 | 14.5%
cardboard | 199 | 11.6%
bag | 199 | 11.6%
filament | 33 | 1.9%

Max/Min class ratio: 19.6 : 1 (video_tape vs filament)
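
The shares and the max/min ratio above follow directly from the per-class counts. A minimal sketch that recomputes them, assuming only the counts listed in the table (the helper names are illustrative and not part of DataClinic):

```python
# Per-class sample counts from the L1 class distribution table.
counts = {
    "video_tape": 646,
    "basket": 384,
    "film": 248,
    "cardboard": 199,
    "bag": 199,
    "filament": 33,
}


def class_shares(counts):
    """Return each class's share of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def max_min_ratio(counts):
    """Return the ratio of the largest class to the smallest class."""
    return max(counts.values()) / min(counts.values())


total = sum(counts.values())
shares = class_shares(counts)
ratio = max_min_ratio(counts)

for name, share in shares.items():
    print(f"{name:<11}{counts[name]:>5}{share:>8.1%}")
print(f"total={total}, max/min ratio={ratio:.1f}:1")
```

The counts sum to the 1,709 diagnosed images, and 646 / 33 rounds to the reported 19.6:1.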

Figure: Representative image collage from the SpectralWaste dataset -- six types of recycling waste on a conveyor belt

SpectralWaste is a recycling waste dataset collected by synchronizing RGB and hyperspectral imaging on a prototype conveyor belt. Each image includes a composite bar chart summarizing the spectral signature of each object. While the dataset was designed for training automated recycling classification models, class imbalance and a homogeneous capture environment may limit model performance.

2 ISO/IEC 5259-2 Evaluation Framework

This report applies the Quality Measures (QMs) defined in ISO/IEC 5259-2:2024 to the SpectralWaste image dataset. DataClinic's three-level diagnostic outputs are mapped to ISO QM definitions and interpreted independently, connecting what DataClinic measured with how ISO evaluates it.

DataClinic Level | What Is Measured | Mapped ISO 5259-2 QMs
Level I | Class counts, missing values, pixel statistics, mean images | Com-ML-1/3/5, Bal-ML-1, Eft-ML-1
Level II | General-purpose embedding (1280-dim) density, outliers, similarity | Sim-ML-1/2, Rep-ML-1/3, Div-ML-1, Con-ML-2, Acc-ML-7
Level III | Domain-specific lens (32-dim) density and cluster analysis | Rep-ML-1, Div-ML-1/2, Bal-ML-2

Intrinsic DQCs (3): Accuracy, Completeness, Consistency (DataClinic Level I)

AI/ML DQCs (9): Balance, Diversity, Representativeness, Similarity, Relevance, Effectiveness, Auditability, Identifiability, Timeliness (DataClinic Level II/III)

Verdict Criteria

Pass: Meets criteria
Fail: Below threshold
Caution: Needs further review
N/A: Not measured

3 Intrinsic Quality Assessment

QM ID | Item | ISO Definition | Verdict
Com-ML-1 | Value Completeness | Proportion of data items without null values | Pass
Com-ML-3 | Feature Completeness | Proportion of feature-related items without null values | Pass
Com-ML-5 | Label Completeness | Proportion of samples with complete labels | Pass
Con-ML-2 | Label Consistency | Proportion of similar samples with consistent labels | Caution
Acc-ML-7 | Label Accuracy | Estimated mislabel rate via outlier detection | Caution

Com-ML-1/3/5 -- Completeness Pass Rationale

DataClinic Level I diagnosis confirmed zero missing values across the dataset. All 1,709 images have three RGB channels intact, and labels for all six classes are correctly assigned. This satisfies ISO 5259-2 completeness criteria (value, feature, and label).

Figure: Mean images per class (bag, basket, cardboard, filament, film, video_tape) -- labels are correctly assigned and mean images render normally for all six classes
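
The completeness measures are all proportions of non-missing items. A minimal sketch of that calculation, using a handful of hypothetical records (the records and the deliberately missing label are invented for illustration; the actual diagnosis reported no missing values):

```python
def completeness(items):
    """Proportion of items that are not None (i.e., not missing)."""
    if not items:
        raise ValueError("items must be non-empty")
    return sum(item is not None for item in items) / len(items)


# The six class names from the dataset overview.
CLASSES = {"video_tape", "basket", "film", "cardboard", "bag", "filament"}

# Hypothetical records: (pixel payload, label). None marks a missing value.
records = [
    ("pixels", "bag"),
    ("pixels", "film"),
    ("pixels", "filament"),
    ("pixels", None),  # deliberately missing label, for illustration only
]

# Value completeness: is the pixel payload present?
value_completeness = completeness([pixels for pixels, _ in records])

# Label completeness: a label counts only if it is one of the known classes.
label_completeness = completeness(
    [label if label in CLASSES else None for _, label in records]
)

print(value_completeness, label_completeness)
```

A dataset with no missing values, as reported here, would score 1.0 on each proportion.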

Con-ML-2 / Acc-ML-7 -- Caution Rationale

Con-ML-2 (Label Consistency): ISO 5259-2 requires that similar instances in embedding space share the same label. The Level II low-density distribution reveals multi-modal clusters with ambiguous class boundaries in certain regions. Potential label cross-contamination between similar samples cannot be ruled out and warrants further review.

Acc-ML-7 (Label Accuracy): Twenty low-density outliers were identified at both Level II and Level III. Among these, the low-density samples in the filament and cardboard classes may stem from the peculiarities of composite spectral bar chart images, but the possibility of labeling errors should also be investigated.

Figure: L2 outlier samples (panels: filament, low density; cardboard, low density; video_tape, high density; video_tape, high density) -- low-density samples (outliers) concentrate in filament and cardboard; high-density (typical) samples in video_tape
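
One common way to turn "similar instances should share the same label" into a number is k-nearest-neighbour label agreement. The sketch below is an assumption about how such a check could look, not DataClinic's or ISO's actual procedure; the 2-D points stand in for real embeddings, and `k` and the 0.5 flagging threshold are illustrative choices.

```python
import math


def knn_label_agreement(embeddings, labels, k=3):
    """For each sample, the fraction of its k nearest neighbours with the same label."""
    scores = []
    for i, (point, label) in enumerate(zip(embeddings, labels)):
        # Distances to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        nearest_labels = [labels[j] for _, j in neighbours[:k]]
        scores.append(sum(lab == label for lab in nearest_labels) / k)
    return scores


# Toy 2-D "embeddings": two tight clusters, with one suspicious label at index 3.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
labels = ["film", "film", "film", "bag", "bag", "bag", "bag", "bag"]

agreement = knn_label_agreement(embeddings, labels, k=3)

# Samples whose neighbours mostly disagree with their label go to manual review.
flagged = [i for i, score in enumerate(agreement) if score < 0.5]
print(flagged)
```

Here only the sample at index 3 is flagged, because it sits inside the film cluster while carrying a bag label. A flag is a prompt for manual review rather than proof of a labeling error.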

4 Balance Assessment -- Bal-ML

QM ID | Item | ISO Definition | Measurement | Verdict
Bal-ML-1 | Class Balance | Degree of balance in class-wise sample counts | Std. dev. 242.7, max ratio 19.6:1 | Fail
Bal-ML-2 | Feature Balance | Balance of feature distributions within the dataset | Color/size skew (confirmed at L3) | Fail

Bal-ML-1 -- Severe Class Imbalance

ISO 5259-2 Bal-ML-1 measures the degree of balance across class-wise sample counts. A max/min class ratio exceeding 10:1 is generally considered to cause severe model bias toward majority classes at the expense of minority classes. SpectralWaste's video_tape (646 images) to filament (33 images) ratio stands at 19.6:1, nearly double that level, qualifying as severe imbalance under Bal-ML-1. The filament class, with only 33 images, falls well below the minimum commonly required for deep learning (typically 100+ images per class). Training under these conditions is highly likely to cause the model to misclassify filament as video_tape or another majority class.

Figure: L2 density box chart -- comparing density distribution spread across classes; video_tape shows the widest range

Figure: L3 domain-specific lens box chart -- inter-class density variance is even more pronounced than at L2

Bal-ML-2 -- Feature Imbalance

ISO 5259-2 Bal-ML-2 measures whether intrinsic features such as color, size, and shape are evenly distributed across the dataset. Level III analysis (domain-specific 32-dimensional lens) repeatedly revealed "identical waste colors and small-size features." This reflects how the single conveyor belt environment has homogenized lighting, background, and viewing angle characteristics. In real-world industrial settings, a variety of lighting conditions, backgrounds, and waste states exist, meaning this feature skew could create a significant domain gap.

Figure: L3 per-class density plots (bag, filament, film, video_tape) -- different distribution shapes and positions across classes indicate feature imbalance

5 Similarity Assessment -- Sim-ML

QM ID | Item | ISO Definition | Measurement | Verdict
Sim-ML-1 | Duplicate Instance Ratio | Proportion of duplicate or near-duplicate samples | L2 low density = low duplication | Pass
Sim-ML-2 | Intra-class Similarity | Average similarity among samples within the same class | High-density concentration in video_tape | Caution

Sim-ML-1 -- Pass: Low Duplication

ISO 5259-2 Sim-ML-1 measures the proportion of samples that are excessively close in embedding space (effectively duplicates). High duplication leads to overfitting. The Level II diagnosis rated overall density as low, which in turn implies few near-duplicate samples; if anything, SpectralWaste falls on the data-scarce side. While this is a Pass from the Sim-ML-1 perspective, it feeds directly into the data sufficiency issue (Eft-ML-1).

Figure: L2 density histogram -- overall low density distribution suggests data scarcity rather than duplication

Figure: L3 density histogram -- the same low-density pattern is confirmed under the domain-specific lens
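
A duplicate-instance ratio can be approximated by counting samples that have at least one very close neighbour in embedding space. The sketch below uses cosine similarity with an assumed threshold of 0.99 on toy vectors; both the threshold and the vectors are illustrative, and DataClinic's density-based measurement may work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def near_duplicate_ratio(embeddings, threshold=0.99):
    """Share of samples with at least one other sample at or above the threshold."""
    flagged = set()
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                flagged.update((i, j))
    return len(flagged) / len(embeddings)


# Toy embeddings: the first two vectors are nearly identical, the others are not.
toy_embeddings = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (-1.0, 0.5)]

duplicate_ratio = near_duplicate_ratio(toy_embeddings)
print(duplicate_ratio)
```

The pairwise loop is quadratic in the number of samples, which is manageable at the scale of 1,709 images but would call for an approximate nearest-neighbour index on much larger datasets.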

Sim-ML-2 -- Caution: High Intra-class Similarity in video_tape

Sim-ML-2 flags the risk that when samples within a class are too similar, the model fails to learn broad decision boundaries for that class. At both Level II and Level III, all of the top-4 high-density samples belong to the video_tape class. These samples share the same capture date, time, and session (e.g., train__20230119_03_*), resulting in low intra-class diversity and high intra-class similarity.

Figure: L2 top-4 high-density samples (ins1: 0.1795; ins3: 0.1701; ins13: 0.1667; ins13: 0.1649) -- all video_tape, all from the same session (20230119_03)
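
Session concentration can be checked by grouping filenames on their date and session prefix. The sketch below assumes names shaped like `train__<date>_<session>_<rest>`, following the `train__20230119_03_*` pattern; the suffixes and the second session (`20230120_01`) are hypothetical placeholders, not real file names from the dataset.

```python
from collections import Counter


def session_of(filename):
    """Extract '<date>_<session>' from names like 'train__<date>_<session>_<rest>'."""
    _split, rest = filename.split("__", 1)
    date, session = rest.split("_")[:2]
    return f"{date}_{session}"


# Hypothetical file names; only the 'train__20230119_03_' prefix comes from the report.
filenames = [
    "train__20230119_03_exampleA",
    "train__20230119_03_exampleB",
    "train__20230119_03_exampleC",
    "train__20230119_03_exampleD",
    "train__20230120_01_exampleE",
]

sessions = Counter(session_of(name) for name in filenames)
top_session, top_count = sessions.most_common(1)[0]
top_share = top_count / len(filenames)
print(top_session, top_share)
```

A high share for a single session within one class is a quick signal that additional capture sessions are needed for that class.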

6 Representativeness Assessment -- Rep-ML

QM ID | Item | ISO Definition | Verdict
Rep-ML-1 | Target Domain Coverage | Degree to which diverse real-world deployment conditions are covered | Fail
Rep-ML-3 | Distribution Distance (KL-divergence) | Divergence between training data distribution and real-world distribution | Fail

Rep-ML-1 -- Insufficient Target Domain Coverage

ISO 5259-2 Rep-ML-1 evaluates whether the training data adequately covers the diverse conditions found in the actual deployment environment. SpectralWaste was collected on a single prototype conveyor belt. Real-world recycling facilities encounter varied lighting conditions (fluorescent, natural, nighttime), belt speeds, overlapping waste, contaminated materials, and diverse viewing angles. The Level III diagnosis confirming "urban setting, single cluster" directly reflects this domain bias. Under Rep-ML-1 criteria, real-world deployment coverage is severely inadequate.

Figure: L2 PCA overall distribution -- six classes overlap or scatter across embedding space

Figure: L3 PCA distribution -- under the domain-specific lens, samples collapse into a single cluster, confirming lack of environmental diversity

Rep-ML-3 -- Distribution Gap (KL-divergence)

Rep-ML-3 measures KL-divergence between the training data distribution and the real-world deployment distribution. While no real-world reference data is available to compute an exact KL-divergence score, the Level II density contour map shows a low-density, fragmented distribution, suggesting that the training data fails to represent the continuous distribution expected in production. Given the constraint of a single conveyor belt capture environment, the risk of distribution shift after deployment is high.

Figure: L2 overall density contour -- low density, fragmented cluster pattern

Figure: L3 overall density contour -- a single concentrated density region under the domain lens
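
Once a reference distribution from the deployment environment exists, the KL-divergence itself is simple to compute over any shared binning (class shares, density-histogram bins, and so on). The sketch below is illustrative only: it uses the training class shares as P and a uniform distribution as a stand-in for Q, which is a placeholder assumption and not a real deployment distribution.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same bins.

    Assumes q[i] > 0 wherever p[i] > 0; bins with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Training class counts from the L1 class distribution table.
counts = [646, 384, 248, 199, 199, 33]
total = sum(counts)
p_train = [n / total for n in counts]

# Placeholder reference: uniform over the six classes (NOT measured deployment data).
q_reference = [1 / len(counts)] * len(counts)

kl_train_vs_reference = kl_divergence(p_train, q_reference)
print(f"{kl_train_vs_reference:.3f}")
```

The divergence is zero only when the two distributions coincide and grows as the training distribution drifts from the reference, so the score is only as meaningful as the reference data behind Q.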

7 Diversity Assessment -- Div-ML

QM ID | Item | ISO Definition | Verdict
Div-ML-1 | Intrinsic Dimensionality | Effective dimensionality of the data -- how many distinct features exist | Caution
Div-ML-2 | Feature Diversity | Diversity in visual features such as color, shape, and size | Fail

Div-ML-1 -- Multi-modal Distribution, but Limited Cluster Count

ISO 5259-2 Div-ML-1 measures diversity through the intrinsic dimensionality of the data. At Level II, a multi-modal distribution is observed, giving an initial impression of diversity. However, at Level III (domain-specific 32-dimensional lens), the data converges into a single cluster. This means that while the general-purpose lens (1,280 dimensions) shows separated clusters, the actual diversity of recycling-domain-relevant features is low. A Caution verdict is warranted under Div-ML-1.

Figure: L2 density contour -- multi-modal distribution with multiple clusters under the general-purpose lens

Figure: L3 density contour -- converges into a single cluster under the domain lens, indicating low effective diversity

Div-ML-2 -- Insufficient Visual Feature Diversity

Div-ML-2 measures diversity across visual features including color, size, shape, background, and lighting. Level III analysis found that "identical waste colors and small-size features" dominate the dataset. The pixel histogram also confirms that RGB distributions are concentrated in a narrow color range. This results from a single conveyor belt, fixed capture distance, and uniform lighting environment. Real-world recycling classifiers must handle crumpled, contaminated, or mixed waste in a wide range of sizes, colors, and backgrounds, making this dataset severely lacking under Div-ML-2 criteria.

Figure: L1 pixel histogram -- RGB channel pixel distributions are concentrated in a narrow brightness and color range

8 Effectiveness & Identifiability Assessment

QM ID | Item | ISO Definition | Measurement | Verdict
Eft-ML-1 | Effective Sample Ratio | Proportion of classes meeting the training threshold | Min. class: 33 images (filament) | Fail
Idn-ML-1 | Identifiability (PII) | Presence of personally identifiable information | Waste images only -- no PII | Pass

Eft-ML-1 -- Insufficient Effective Samples

ISO 5259-2 Eft-ML-1 measures whether each class meets the minimum sample threshold for effective model training. The typical minimum for deep learning classification is 100+ images per class, with 300+ recommended in practice. SpectralWaste's filament class has only 33 images, far short of even the 100-image minimum. The film (248), bag (199), and cardboard (199) classes clear that minimum but remain below the recommended 300. In total, four of the six classes fail to meet the recommended threshold. This Eft-ML-1 Fail verdict aligns with DataClinic's "Data Bulk-up" recommendation.
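
The threshold check is easy to reproduce from the class counts. A minimal sketch, assuming the 100-image minimum and 300-image recommendation quoted above (the function names are illustrative):

```python
# Per-class sample counts from the L1 class distribution table.
counts = {
    "video_tape": 646,
    "basket": 384,
    "film": 248,
    "cardboard": 199,
    "bag": 199,
    "filament": 33,
}


def classes_below(counts, threshold):
    """Names of classes with fewer than `threshold` samples, sorted alphabetically."""
    return sorted(name for name, n in counts.items() if n < threshold)


def effective_sample_ratio(counts, threshold):
    """Proportion of classes with at least `threshold` samples."""
    return sum(n >= threshold for n in counts.values()) / len(counts)


below_minimum = classes_below(counts, 100)
below_recommended = classes_below(counts, 300)
ratio_at_recommended = effective_sample_ratio(counts, 300)

print(below_minimum)
print(below_recommended)
print(f"{ratio_at_recommended:.2f}")
```

Only video_tape and basket reach 300 images, so two of six classes (about 0.33) meet the recommended threshold.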

Idn-ML-1 -- Pass: No PII Risk

ISO 5259-2 Idn-ML-1 requires that datasets contain no personally identifiable information (faces, license plates, names, etc.). SpectralWaste consists entirely of images showing recycling waste on a conveyor belt, with no people, personal data, or any identifiable elements present. The dataset is safe from a PII standpoint, and no personal data processing issues arise for commercial use (separate licensing restrictions notwithstanding).

9 Unmeasured Items (Auditability, Relevance, Timeliness)

QM ID | Item | ISO Definition | Status | Verdict
Aud-ML-1/2 | Auditability | Data lineage tracking, quality audit logs | Planned for AADS extension | N/A
Rel-ML-1/2 | Relevance | Contextual/purpose relevance, outlier detection | Planned for AADS extension | N/A
Tml-ML-1 | Timeliness | Data freshness, appropriateness of collection date | On roadmap | N/A

Tml-ML-1 (Timeliness) note: SpectralWaste was collected between 2022 and 2023. As recycling waste types and packaging trends continue to evolve (e.g., new-material films, biodegradable bags), the dataset may not reflect current recycling conditions. Once timeliness measurement tools are in place, this item can also be assessed.

10 Summary & Recommendations

DQC Group | QM ID | Item | Verdict | Severity
Completeness | Com-ML-1/3/5 | Value, Feature & Label Completeness | Pass | --
Consistency | Con-ML-2 | Label Consistency | Caution | Medium
Accuracy | Acc-ML-7 | Label Accuracy | Caution | Medium
Balance | Bal-ML-1 | Class Balance | Fail | Critical
Balance | Bal-ML-2 | Feature Balance | Fail | High
Similarity | Sim-ML-1 | Duplicate Instance Ratio | Pass | --
Similarity | Sim-ML-2 | Intra-class Similarity | Caution | Medium
Representativeness | Rep-ML-1 | Domain Coverage | Fail | Critical
Representativeness | Rep-ML-3 | KL-divergence | Fail | High
Diversity | Div-ML-1 | Intrinsic Dimensionality | Caution | Medium
Diversity | Div-ML-2 | Feature Diversity | Fail | High
Effectiveness | Eft-ML-1 | Effective Sample Ratio | Fail | Critical
Identifiability | Idn-ML-1 | PII Risk | Pass | --
Auditability, Relevance, Timeliness | Aud/Rel/Tml | -- | N/A | --

Immediate Action

  • Bal-ML-1: Collect or synthesize additional filament data (target: 300+ images minimum)
  • Eft-ML-1: Bulk up all four under-threshold classes via data collection or augmentation

Mid-Term Improvement

  • Rep-ML-1: Expand capture environments with diverse lighting, backgrounds, and viewing angles
  • Div-ML-2: Include contaminated, crumpled, and mixed waste samples
  • Bal-ML-2: Diversify color and size feature distributions

Monitoring

  • Con-ML-2: Cross-verify labels between similar samples
  • Acc-ML-7: Manually review all 20 low-density outlier labels
  • Sim-ML-2: Diversify video_tape capture sessions

DataClinic Recommendation vs. ISO 5259 Verdict Alignment

DataClinic's "Data Bulk-up" recommendation aligns precisely with the ISO 5259-2 Bal-ML-1 (class imbalance) and Eft-ML-1 (insufficient effective samples) Fail verdicts. The fact that two independent frameworks -- using different methodologies -- arrive at the same conclusion validates DataClinic's diagnostic results in the language of ISO international standards. This confirms that DataClinic effectively implements the ISO 5259-2 Quality Measures in practice.
