
Evaluation Methodology

This report is an independent evaluation that reinterprets DataClinic's 3-level diagnostic results (Level I / II / III) through the ISO/IEC 5259-2:2024 Quality Measures (QM) framework. We mapped DataClinic's metrics, charts, and outliers to each ISO QM criterion and independently assigned Pass / Fail / Caution verdicts. In particular, we critically reinterpret discrepancies between DataClinic API descriptions and actual chart data.

DataClinic L1 Diagnosis → DataClinic L2/L3 Diagnosis → ISO 5259-2 QM Interpretation · Verdict

Summary: We independently evaluated the WikiArt art movement image dataset (81,444 images, 27 classes) against ISO/IEC 5259-2:2024 Quality Measures (QM). After mapping DataClinic's 3-level diagnostic metrics and charts to ISO QM criteria, we found 5 Fail, 5 Caution, and 3 N/A out of 13 assessed items — with zero Pass. The key issues are a 133x imbalance between Impressionism and Analytical Cubism, a single-cloud structure in L2, the Blanchard effect (a single painter defining "typical art" in L3), and Pop Art's medium fault line. DataClinic's overall score of 53/100 (Poor) is confirmed from the ISO perspective as well.

QM Items Passed: 0 / 13
Fail Items: 5
Caution Items: 5
N/A (Deferred): 3

1 Dataset Overview

Basic Information

Dataset: WikiArt
Source: HuggingFace (huggan/wikiart)
Total Images: 81,471 (diagnosed: 81,444)
Classes: 27 (art movements)
Image Size: 750×597 ~ 1382×17768 px
DataClinic Score: 53 / 100 (Poor)

Top 10 Classes by Sample Count (L1)

Class (Movement) | Samples
Impressionism | 13,060
Realism | 10,733
Romanticism | 7,019
Expressionism | 6,736
Post_Impressionism | 6,450
Art_Nouveau | 4,334
Baroque | 4,241
Symbolism | 3,421
Abstract_Expressionism | 2,782
Naive_Art | 2,405
... (17 movements omitted)
Analytical_Cubism | 98

Max-to-min class ratio: 133 : 1 (Impressionism vs Analytical_Cubism)

WikiArt dataset collage

▲ WikiArt dataset representative image collage — 27 art movements from Renaissance to Pop Art

WikiArt is a large-scale image dataset for art movement classification and one of the most widely used art AI benchmarks on HuggingFace. It spans 27 movements and approximately 81,000 works, from the Renaissance to contemporary art. However, the inherent disparities in the number of surviving works across historical periods, digitization bias, and Western-centric curation all affect its quality as an ML training dataset. DataClinic's overall score of 53 (Poor) reflects these structural issues.

2 ISO/IEC 5259-2 Evaluation Framework

This report independently applies the Quality Measures (QM) from ISO/IEC 5259-2:2024 to the WikiArt image dataset. We mapped DataClinic's 3-level diagnostic results to ISO QM criteria and independently interpreted and assigned verdicts for each item. Notably, this report critically identifies four discrepancies between DataClinic API text descriptions and actual chart data.

DataClinic Level | What It Measures | Mapped ISO 5259-2 QMs
Level I | Class count, sample count, missing values, pixel statistics (RGB), resolution range | Com-ML, Bal-ML-1, Eft-ML-1
Level II | Wolfram ImageIdentify Net V2 embeddings (1,280-dim) for general-purpose shape recognition | Sim-ML, Rep-ML-1, Div-ML-1, Con-ML-2
Level III | BLIP image-text matching (56-dim) for semantic analysis | Rep-ML-3, Div-ML-2, Acc-ML-7

Intrinsic DQC

Completeness · Consistency
→ DataClinic Level I

AI/ML Additional DQC

Balance · Similarity · Representativeness · Diversity · Effectiveness · Accuracy
→ DataClinic Level II/III

Verdict Criteria

Fail: Below threshold
Caution: Requires further review
N/A: Assessment deferred

3 Intrinsic Quality Assessment

QM ID | Criterion | ISO Definition | Verdict
Com-ML-1 | Class Completeness | Whether the target domain's class taxonomy is sufficiently covered | Caution
Con-ML-2 | Pixel Channel Consistency | Statistical consistency of RGB channel distributions | Caution

Com-ML-1 — Class Completeness: Caution

WikiArt covers 27 art movements, providing broad coverage of major Western art-historical periods. However, several rare movements have sample counts too low for ML training. Action Painting (98 images), Analytical Cubism (98), and Synthetic Cubism (120) fall far below the 300+ images per class that this report treats as a practical minimum for training deep learning models. While all 27 movements are technically "present," some are effectively unlearnable, preventing a full Pass.

Con-ML-2 — Pixel Channel Consistency: Caution

The L1 pixel histogram (below) reveals dramatically different distributions across RGB channels. The Blue channel shows a pronounced peak concentrated at the low end (30-40), the Red channel exhibits a bimodal structure with a spike near 255, and the Green channel follows a relatively smooth mid-range distribution.

These distributions are art-historically explainable. The low Blue values plausibly originate from the brown grounds of traditional oil painting, while the Red 255 spike likely reflects saturated reds such as cadmium red and vermilion clipping in digital captures. Though artistically meaningful, this pattern signals the need for channel-specific normalization strategies in ML pipelines.
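One way to read "channel-specific normalization" is per-channel standardization: estimate a mean and standard deviation for R, G, and B separately and rescale each channel with its own statistics. The sketch below is a minimal illustration using only the Python standard library. The pixel values are made up, and the helper names `channel_stats` and `normalize_pixel` are hypothetical, not part of DataClinic.

```python
from statistics import mean, pstdev

def channel_stats(pixels):
    """Return [(mean, std), ...] for the R, G and B channels of (r, g, b) pixels."""
    channels = list(zip(*pixels))  # three tuples: all R, all G, all B values
    return [(mean(c), pstdev(c)) for c in channels]

def normalize_pixel(pixel, stats, eps=1e-8):
    """Standardize one pixel using a separate mean/std for each channel."""
    return tuple((v - m) / (s + eps) for v, (m, s) in zip(pixel, stats))

# Made-up pixels for illustration only (not sampled from WikiArt).
pixels = [(255, 120, 30), (200, 110, 40), (90, 130, 35), (255, 125, 38)]
stats = channel_stats(pixels)
normalized = [normalize_pixel(p, stats) for p in pixels]
```

In a real pipeline the statistics would be estimated once over the training split and reused at inference time; that choice is an assumption here, not something the diagnosis prescribes.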

L1 Pixel Histogram RGB

▲ L1 Pixel Histogram — Blue (left-skewed 30-40), Red (bimodal + 255 spike), Green (smooth mid-range). Dramatic cross-channel differences

Critical Reinterpretation D1: Challenging the "Consistent" RGB Claim

DataClinic API description: "RGB channels are consistent"
Actual chart data: The L1 pixel histogram above shows Blue (low-end peak at 30-40), Red (bimodal + 255 spike), and Green (smooth mid-range). These are dramatically different distributions across channels. For channels to be "consistent," they should exhibit similar statistical shapes, which is clearly not the case here. While this discrepancy is explainable through paint chemistry, the API's "consistent" verdict is inaccurate.

4 Balance Assessment — Bal-ML

QM ID | Criterion | Measurement | Verdict
Bal-ML-1 | Class Balance | 133x imbalance, stdDev(3,269) > mean(3,016) | Fail
Bal-ML-2 | Feature Space Balance | L3 period-based stratification (classical 1.84-1.87, modern 1.49-1.67) | N/A

Bal-ML-1 — 133x Class Imbalance

ISO 5259-2's Bal-ML-1 measures the degree of balance in per-class sample counts. A max-to-min ratio exceeding 10:1 is generally considered severe imbalance. WikiArt's Impressionism (13,060 images) versus Analytical Cubism (98 images) yields a ratio of 133:1 — roughly 7x more severe than the SpectralWaste recycling dataset's 19.6:1.

The more structural problem is that the standard deviation (3,269) exceeds the mean (3,016). This means the concept of an "average class" is essentially meaningless. The dataset is bifurcated into a few large movements (Impressionism, Realism, Romanticism) and many small ones (Cubism variants, Minimalism).
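The balance figures quoted here can be reproduced from per-class counts with a few lines of Python. The sketch below is illustrative only: it uses just the classes listed in the Top 10 table plus Analytical_Cubism, so its mean and standard deviation will not match the full 27-class values, and the use of the population standard deviation is an assumption about how DataClinic computes `stdDev`. The function name `balance_summary` is hypothetical.

```python
from statistics import mean, pstdev

def balance_summary(counts):
    """Summarize class balance from a {class_name: sample_count} mapping."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std (assumed convention)
    return {
        "max_min_ratio": max(values) / min(values),
        "mean": mu,
        "std": sigma,
        "cv": sigma / mu,  # coefficient of variation; > 1 means std exceeds mean
    }

# Counts from the L1 table; the 17 omitted movements are NOT included.
listed = {
    "Impressionism": 13060, "Realism": 10733, "Romanticism": 7019,
    "Expressionism": 6736, "Post_Impressionism": 6450, "Art_Nouveau": 4334,
    "Baroque": 4241, "Symbolism": 3421, "Abstract_Expressionism": 2782,
    "Naive_Art": 2405, "Analytical_Cubism": 98,
}
summary = balance_summary(listed)
print(round(summary["max_min_ratio"]))  # 133

# The full-dataset mean follows directly from the reported totals.
full_mean = 81_444 / 27
print(round(full_mean))  # 3016
```

A coefficient of variation above 1 on the full 27-class counts is the same statement as "standard deviation exceeds the mean" in the table above.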

This imbalance is art-historically inevitable. Impressionism was a mass movement spanning decades across Europe with thousands of participating artists, while Analytical Cubism was a short-lived experiment (1907-1912) led primarily by two artists, Picasso and Braque. Yet art-historical inevitability does not excuse ML training problems. A model trained on this data without rebalancing will tend to favor Impressionism while barely learning Analytical Cubism at all.

Bal-ML-2 — Feature Space Balance: N/A (Deferred)

The L3 Box Chart shows stratification between classical movements (Baroque, Renaissance, etc.) with median values of 1.84-1.87 and modern movements (Pop Art, Minimalism, etc.) with medians of 1.49-1.67. This separation reflects genuine art-historical differences across periods. While it could be interpreted as "imbalance," it mirrors historical reality, so we defer judgment.

5 Distinguishability and Label Accuracy

QM ID | Criterion | Measurement | Verdict
Eft-ML-1 | Distinguishability | L2 classes inseparable (single cloud) | Caution
Eft-ML-2 | Annotation Completeness | Metadata (artist, year) completeness not diagnosed | N/A
Acc-ML-7 | Label Accuracy | Dali → Abstract_Expressionism misclassification, Pop Art medium contamination | Fail

Eft-ML-1 — Classes Inseparable in L2

ISO 5259-2's Eft-ML assesses whether each class in the dataset can be distinguished through learning. In the L2 general-purpose lens (Wolfram ImageIdentify Net V2, 1,280-dim), the PCA plot and contour map show all 27 classes forming a single connected cloud. No inter-class boundaries are visible, meaning the general-purpose shape recognition AI cannot visually distinguish between art movements.

DataClinic rated L2 Geometry as "Good," which contradicts the chart data (see Discrepancy D3 below).

Critical Reinterpretation D3: Geometry "Good" Is an Overestimate

DataClinic API: L2 Geometry = "Good"
Actual chart: Both the L2 PCA and contour plots show all 27 classes merged into a single cloud. Rating a feature space with no visible class separation as "Good" is an overestimate, because supervised classifiers built on these features are likely to underperform. From the ISO Eft-ML-1 perspective, this warrants a "Poor" rating.
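A visual reading of a PCA plot can be complemented with a number. The sketch below computes a simple Fisher-style ratio of between-class scatter to within-class scatter on embedding vectors; values near zero correspond to the "single cloud" picture. This is an illustrative measure chosen for this example, not how DataClinic derives its Geometry rating, and the toy 2-D points are invented.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def separation_ratio(points_by_class):
    """Between-class scatter divided by within-class scatter (higher = better separated)."""
    all_points = [p for pts in points_by_class.values() for p in pts]
    global_c = centroid(all_points)
    between = 0.0
    within = 0.0
    for pts in points_by_class.values():
        c = centroid(pts)
        between += len(pts) * sq_dist(c, global_c)
        within += sum(sq_dist(p, c) for p in pts)
    return between / within

# Toy 2-D embeddings, invented for illustration.
separated = {"A": [(0, 0), (0, 1), (1, 0)], "B": [(10, 10), (10, 11), (11, 10)]}
overlapping = {"A": [(0, 0), (0, 1), (1, 0)], "B": [(0.5, 0.5), (0.5, 1.5), (1.5, 0.5)]}

ratio_separated = separation_ratio(separated)
ratio_overlapping = separation_ratio(overlapping)
```

Computing such a ratio on the full 1,280-dimensional embeddings rather than on a 2-D projection would also guard against separation that exists but is hidden by PCA.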

Acc-ML-7 — Label Accuracy: Fail

Two types of labeling errors were observed.

1. Systematic movement misclassification: Salvador Dali's works are labeled as Abstract Expressionism. Dali is unambiguously classified as Surrealism in art history, distinct from Abstract Expressionism in period, geography, and technique. This suggests that such errors may be systematic rather than isolated.

2. Medium contamination: The Pop Art class contains not only traditional paintings but also photographs of installations, architectural photography, and other non-painting media. The implicit assumption that "art = painting" breaks down in the Pop Art genre, which cascades into the extreme separation observed in L3 analysis.

6 Similarity Assessment — Sim-ML

QM ID | Criterion | Measurement | Verdict
Sim-ML-1 | Intra-class Similarity | Some classes (Cubism variants) show high cohesion, but full quantification unavailable | N/A
Sim-ML-2 | Cross-class Similarity | Minimalism ≈ Color_Field_Painting (same L2 cluster) | Caution

Sim-ML-2 — Minimalism and Color Field Painting Merge

Sim-ML-2 measures cases where samples from different classes cluster too closely in embedding space. In the L2 analysis, Minimalism and Color Field Painting occupy nearly identical positions. These two movements are closely related in art history as well — both emerged in 1960s New York and share a focus on color planes and geometric simplicity — so the general-purpose AI's inability to separate them is somewhat expected.

From an ML standpoint, however, keeping these as separate classes means the classifier will fail to learn the boundary. Class merging or hierarchical labeling (e.g., Minimalism → "Geometric Abstraction" parent category) should be considered.
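Both options, merging and hierarchical labeling, can be expressed with one small parent mapping. The sketch below is a minimal illustration: the `Geometric_Abstraction` parent comes from the example in the text, while placing Color_Field_Painting under the same parent and the helper names `hierarchical_label` and `merge_labels` are assumptions made for this example.

```python
# Hypothetical parent mapping; classes not listed keep their own label.
PARENT = {
    "Minimalism": "Geometric_Abstraction",
    "Color_Field_Painting": "Geometric_Abstraction",
}

def hierarchical_label(label, parents=PARENT):
    """Return (parent, child); classes without a parent act as their own parent."""
    return (parents.get(label, label), label)

def merge_labels(labels, parents=PARENT):
    """Collapse child labels into their parent class for coarse-grained training."""
    return [parents.get(label, label) for label in labels]

labels = ["Minimalism", "Color_Field_Painting", "Baroque"]
coarse = merge_labels(labels)
pairs = [hierarchical_label(label) for label in labels]
```

Keeping the `(parent, child)` pair preserves the original movement label, so a classifier can be trained on the coarse level first and the fine level later without relabeling the data.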

L2 Contour Density Clusters

▲ L2 Contour — Two density centers within a single continuous mass. All 27 classes remain unseparated in one cloud

Critical Reinterpretation D2: Cluster Count Overstated

DataClinic API: "3 high-density clusters"
Actual chart: The L2 contour shows a single connected mass with 2 density centers. Describing this as "3 high-density clusters" is an overstatement. Separated clusters and density variations within a single mass have entirely different implications for ML.

L2 PCA full distribution

▲ L2 PCA — All 27 classes overlap in a single cloud. No class separation achieved

L2 Density Histogram

▲ L2 Density Histogram — Overall density distribution

7 Representativeness Assessment — Rep-ML

QM ID | Criterion | ISO Definition | Verdict
Rep-ML-1 | L2 Representativeness | Whether the feature-space core represents the full domain | Fail
Rep-ML-3 | L3 Representativeness | Whether "typical" samples in semantic space represent the domain | Fail

Rep-ML-1 — L2 Feature Space: Minimalism/Color Field Bias

The core of the L2 general-purpose lens's feature space (high-density region) is dominated by Minimalism and Color Field Painting. These two movements feature visually simple compositions (monochrome canvases, geometric forms) that the general-purpose shape recognition AI interprets as "the most universal visual patterns."

As a result, the rich visual diversity of 27 movements — Baroque's dramatic chiaroscuro, Ukiyo-e's woodblock textures, Expressionism's distorted forms — is inadequately represented in the feature space. This reflects both the limitations of the general-purpose lens and the dataset's representativeness deficit.

L2 Box Chart Class Density

▲ L2 Box Chart — Per-class density distributions. Minimalism/Color_Field_Painting high-density concentration confirmed

The Antoine Blanchard Effect

Core evidence for Rep-ML-3 Fail: Of the top 12 high-density samples in L3 (BLIP image-text matching), 7 are Parisian boulevard scenes by Antoine Blanchard, and 4 more are Impressionist cityscapes by Pissarro and similar artists.

Blanchard (1910-1988) was a 20th-century commercial painter who repeatedly depicted Belle Époque Parisian scenes: the Champs-Élysées, the Place de la Madeleine, and the streets in front of the Opéra Garnier. Because his works are over-represented in WikiArt, the BLIP lens's definition of "typical art" converges on "a rainy Parisian evening street scene under gaslight."

This is where collection bias (over-collection of one artist's repetitive commercial works) and lens characteristics (BLIP's semantic matching assigns high consistency scores to representational cityscapes) intersect. If a single artist's commercial repetitions define the "core" of an 81,000-image dataset, that dataset cannot claim to represent the diversity of art.
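This kind of concentration can be checked mechanically by counting artists among the k highest-density samples. The sketch below is illustrative: the record layout (`artist`, `density`) and the density values are invented, and only the 7-of-12 Blanchard count mirrors the finding described above. The function name `top_k_artist_share` is hypothetical.

```python
from collections import Counter

def top_k_artist_share(samples, k):
    """Share of each artist among the k highest-density samples."""
    top = sorted(samples, key=lambda s: s["density"], reverse=True)[:k]
    counts = Counter(s["artist"] for s in top)
    return {artist: n / len(top) for artist, n in counts.most_common()}

# Hypothetical records; densities are made up for illustration.
samples = (
    [{"artist": "Antoine Blanchard", "density": 2.0 - 0.01 * i} for i in range(7)]
    + [{"artist": "other", "density": 1.9 - 0.01 * i} for i in range(5)]
    + [{"artist": "other", "density": 1.0} for _ in range(20)]
)
shares = top_k_artist_share(samples, k=12)
```

A single artist holding more than half of the top-k slots, as in this toy input, is the pattern the Rep-ML-3 verdict is reacting to.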

L3 PCA full distribution

▲ L3 PCA — BLIP semantic space. Dramatic Pop Art separation + period-based stratification

L3 Density Histogram

▲ L3 Density Histogram — Density distribution under the BLIP lens

8 Diversity Assessment — Div-ML

QM ID | Criterion | Verdict
Div-ML-1 | L2 Diversity: all 27 classes form a single continuous cloud in L2 | Fail
Div-ML-2 | L3 Diversity: dramatic Pop Art separation, period-based stratification present | Caution

Div-ML-1 — Diversity Failure in L2

ISO 5259-2's Div-ML-1 measures the effective dimensionality and distributional diversity of features. The fact that all 27 art movements collapse into a single continuous cloud under the L2 general-purpose lens means that, from this lens's perspective, "art movement diversity" does not exist in the data.

Wolfram ImageIdentify Net was trained for everyday object classification. To this model, all paintings are essentially one category: "image." The differences between movements — brushwork, palette, composition — register only as minuscule variations in the 1,280-dimensional space, insufficient for class separation.

L3 Contour

▲ L3 Contour — Unlike L2, the BLIP lens reveals structure: dramatic Pop Art separation + period-based stratification

Critical Reinterpretation D4: L3 Clusters "Unclear" Is an Underestimate

DataClinic API: "Cluster distinction still unclear"
Actual chart: The L3 Box Chart (below) shows Pop Art's dramatic separation (median ~1.50 vs. most others at 1.70-1.90) and clear period-based stratification between classical and modern movements. The "unclear" assessment underestimates the actual structure visible in the L3 charts.

The Pop Art Fault Line

In the L3 Box Chart, Pop Art's median density sits at approximately 1.50, dramatically separated from most of the other movements (medians 1.70-1.90). The root cause is a fundamental difference in medium.

Examining the low-density (outlier) samples in the Pop Art class reveals not traditional paintings but installation photographs, architectural photography, and collages. Since the BLIP lens performs image-text semantic matching, it recognizes "oil on canvas" and "photograph of a gallery installation" as entirely different categories.

WikiArt's implicit assumption that "art = painting" breaks down at Pop Art. Pop Art extends beyond painting to encompass printmaking, silkscreen, installation, and collage. This medium diversity is what surfaces as a "fault line" in L3. The issue is not one of movement diversity but of medium diversity.

L3 Box Chart Class Density

▲ L3 Box Chart — Key chart. Pop Art dramatic separation (median ~1.50) + classical movements (Baroque, Renaissance: 1.84-1.87) vs modern movements (Minimalism, Color_Field: 1.49-1.67) period-based stratification

9 Two Lenses Compared: L2 vs L3

The most compelling finding from WikiArt is that the two lenses tell entirely different stories. L2 (general-purpose shape recognition) says "all paintings look alike," while L3 (semantic matching) says "they separate clearly by period and medium." The comparison card below presents both perspectives side by side.

Dimension | L2 Findings (General Shape AI) | L3 Findings (Semantic AI)
High-density Core | Minimalism / Color_Field_Painting. Visual simplicity interpreted as "universal pattern" | Antoine Blanchard's Parisian boulevards. Semantically consistent representational cityscapes
Low-density Outliers | Degas portraits, Ukiyo-e prints, Mabe abstracts. "Unusual" visual patterns for general lens | Pop Art installation photos, contemporary architecture. Non-painting media → semantic space outliers
Cluster Structure | Single cloud (no class separation). All paintings converge to one "image" category | Dramatic Pop Art separation + period stratification. Semantic lens distinguishes period and medium
ISO Implications | Div-ML-1 Fail, Eft-ML-1 Caution. General-purpose lens cannot classify art movements | Rep-ML-3 Fail, Div-ML-2 Caution. Semantic lens finds structure but reveals representation bias

Key takeaway: Lens selection fundamentally shapes data quality assessment outcomes. Relying only on the L2 general-purpose lens would lead to the conclusion that "art movement classification is impossible" — missing the period-based structure that L3 reveals. Conversely, looking only at L3 would overlook the dataset's failure in general-purpose AI applications. This reinforces the necessity of multi-lens analysis in ISO 5259-2 evaluations.

10 Overall Assessment and Recommendations

DQC Group | QM ID | Criterion | Verdict | Severity
Balance | Bal-ML-1 | Class Balance (133x) | Fail | Critical
Representativeness | Rep-ML-1 | L2 Minimalism bias | Fail | Critical
Representativeness | Rep-ML-3 | L3 Blanchard effect | Fail | High
Diversity | Div-ML-1 | L2 single cloud | Fail | Critical
Accuracy | Acc-ML-7 | Dali misclassification, Pop Art medium contamination | Fail | High
Completeness | Com-ML-1 | Rare movements: 98-120 samples | Caution | Medium
Distinguishability | Eft-ML-1 | L2 classes inseparable | Caution | Medium
Similarity | Sim-ML-2 | Minimalism ≈ Color_Field | Caution | Medium
Diversity | Div-ML-2 | Pop Art medium fault line | Caution | Medium
Consistency | Con-ML-2 | RGB channel discrepancy | Caution | Medium
Similarity | Sim-ML-1 | Intra-class quantification unavailable | N/A
Distinguishability | Eft-ML-2 | Metadata completeness not diagnosed | N/A
Balance | Bal-ML-2 | Period-based stratification (historical reality) | N/A
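The headline tally can be re-derived from the 13 verdicts in the table above. The snippet below simply transcribes them and counts; the dictionary name `VERDICTS` is an illustrative choice.

```python
from collections import Counter

# Verdicts transcribed from the overall assessment table.
VERDICTS = {
    "Bal-ML-1": "Fail", "Rep-ML-1": "Fail", "Rep-ML-3": "Fail",
    "Div-ML-1": "Fail", "Acc-ML-7": "Fail",
    "Com-ML-1": "Caution", "Eft-ML-1": "Caution", "Sim-ML-2": "Caution",
    "Div-ML-2": "Caution", "Con-ML-2": "Caution",
    "Sim-ML-1": "N/A", "Eft-ML-2": "N/A", "Bal-ML-2": "N/A",
}

tally = Counter(VERDICTS.values())
# A Counter returns 0 for missing keys, so "Pass" needs no special handling.
print(tally["Pass"], tally["Fail"], tally["Caution"], tally["N/A"])  # 0 5 5 3
```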

Immediate Actions

  • Bal-ML-1: Augment rare movements (Analytical_Cubism, Action_Painting, etc.) to 300+ samples
  • Acc-ML-7: Full label audit. Correct systematic errors such as Dali → Surrealism
  • Div-ML-1: Redesign class taxonomy — evaluate merging or hierarchical restructuring of 27 movements

Medium-term Improvements

  • Rep-ML-1/3: Adjust overrepresentation of commercial repetitions (e.g., Blanchard) via downsampling or weighting
  • Div-ML-2: Split Pop Art class into "painting" and "non-painting (installation/photography)" subclasses
  • Sim-ML-2: Merge Minimalism & Color_Field_Painting or introduce hierarchical labeling
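For the downsampling option under Rep-ML-1/3, one simple approach is a per-artist cap. The sketch below is a minimal illustration under assumed inputs: records are dictionaries with an `artist` key, the cap of 3 is arbitrary, and the function name `cap_per_artist` is hypothetical. Sample weighting would be an alternative that discards no data.

```python
import random
from collections import defaultdict

def cap_per_artist(samples, max_per_artist, seed=0):
    """Keep at most max_per_artist randomly chosen samples for each artist."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_artist = defaultdict(list)
    for s in samples:
        by_artist[s["artist"]].append(s)
    kept = []
    for items in by_artist.values():
        if len(items) > max_per_artist:
            items = rng.sample(items, max_per_artist)
        kept.extend(items)
    return kept

# Hypothetical records for illustration.
records = (
    [{"artist": "Antoine Blanchard", "id": i} for i in range(10)]
    + [{"artist": "Camille Pissarro", "id": 100 + i} for i in range(2)]
)
capped = cap_per_artist(records, max_per_artist=3)
```

Artists already under the cap are left untouched, so only the over-represented ones shrink.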

Monitoring

  • Con-ML-2: Develop RGB channel normalization strategies (painting-domain-specific)
  • Eft-ML-1: Evaluate domain-specific lens-based classification pipelines
  • Com-ML-1: Consider expansion to non-Western movements (East Asian painting, Islamic miniatures, etc.)

What DataClinic's Score of 53 Means

DataClinic's overall score of 53/100 (Poor) aligns with this report's independent ISO 5259-2 assessment. The tally of 5 Fail and 5 Caution items with zero Pass confirms that WikiArt has serious structural quality issues for its intended purpose of "art movement classification." However, this dataset is simultaneously a rich art resource of 80,000+ images. Applying the recommendations above in stages can restore its value as an ML benchmark.

Critical Reinterpretation Summary: DataClinic API vs Actual Charts

# | DataClinic API Claim | Actual Chart Data | ISO Verdict Impact
D1 | "RGB channels consistent" | Blue left-skewed, Red bimodal + 255 spike | Con-ML-2 upgraded to Caution
D2 | "3 high-density clusters" | 1 connected mass with 2 density centers | Div-ML-1 Fail maintained
D3 | L2 Geometry "Good" | 27 classes in single cloud, no separation | Eft-ML-1 Caution
D4 | L3 clusters "unclear" | Dramatic Pop Art separation + period stratification clearly visible | Div-ML-2 Caution (underestimate corrected)

References