2026.03 · Pebblous Data Communication Team

Reading time: ~15 min

Executive Summary

This report presents key insights from the Pebblous DataClinic quality diagnosis of the WikiArt dataset.

WikiArt is an art image dataset of 81,471 images across 27 styles, from Abstract Expressionism to Ukiyo-e. The DataClinic overall diagnosis produced a quality score of 53 (Poor). Class imbalance is extreme: class sizes range from 98 to 13,060 images, a ratio of roughly 133:1, which creates a high risk of training classifiers that are biased toward specific art styles.

The L2 (Feature Space Analysis), using Wolfram ImageIdentify Net V2 (1,280 dimensions), found 3 high-density clusters, but the boundaries between art styles are unclear and classes do not separate cleanly. The L3 (Domain-Specific Analysis), using BLIP Image-Text Matching (56 dimensions), also showed poor cluster separation, although the overall distribution shape was rated Good.

To improve data quality, both minority class augmentation (bulk-up) and majority class deduplication (diet) are needed simultaneously. In particular, label consistency review should be prioritized due to the visual similarity among abstract art styles.

Dataset Overview

WikiArt is a large-scale image dataset spanning centuries of Western art history. Published on Kaggle, this dataset is widely used in computer vision and art style classification research. It comprises 81,471 images across 27 art styles, covering major art movements including Impressionism, Baroque, and Abstract Expressionism.

Item | Details
Dataset Name | WikiArt
Source | Kaggle (WikiArt)
Total Images | 81,471
Number of Classes | 27 Art Styles
Overall Score | 53 (Poor)
Report Date | 2025.01.01

The 27 art style classes include Abstract_Expressionism, Action_painting, Analytical_Cubism, Art_Nouveau, Baroque, Color_Field_Painting, Cubism, Early_Renaissance, Expressionism, Fauvism, High_Renaissance, Impressionism, Mannerism_Late_Renaissance, Minimalism, Naive_Art_Primitivism, New_Realism, Northern_Renaissance, Pointillism, Pop_Art, Post_Impressionism, Realism, Rococo, Romanticism, Symbolism, Synthetic_Cubism, and Ukiyo_e. Art style classification is an inherently difficult task, and that difficulty compounds the dataset's quality issues.

WikiArt Dataset — Collage of Representative Images from 27 Art Styles (DataClinic L1 (Basic Quality) Analysis)

Overall Diagnosis: 53 (Poor)

The DataClinic overall score for the WikiArt dataset is 53 (Poor), which means the dataset needs quality improvement before it is used directly for AI model training. The score is driven primarily by a single factor: extreme class imbalance.

In art history, mainstream styles like Impressionism and Romanticism have vast bodies of work, while niche movements like Analytical_Cubism and Action_painting inherently have fewer pieces. This real-world imbalance is directly reflected in the dataset. Models trained on such data tend to favor the classes with more examples, while minority classes are effectively ignored.

Level 1: Pixel Quality Analysis

Level 1 analyzes basic pixel-level statistics of images. DataClinic computes the mean image for each class and examines RGB channel brightness distributions, missing values, and class imbalance.

RGB Channel Consistency: Good

Pixel-level RGB channel consistency was rated as Good. Channel distributions are evenly balanced, and images with extremely dominant or deficient channels are rare. This indicates that while WikiArt encompasses diverse art styles and eras, the digitization quality has been maintained consistently overall.

Missing Values: None

No missing values were found across all 81,471 images. Every image contains valid pixel data, eliminating the need for any missing value handling during preprocessing.

Class Imbalance: Poor — The Core Issue

The most critical issue at Level 1 is class imbalance. DataClinic analysis revealed:

  • Minimum class: 98 images (Analytical_Cubism)
  • Maximum class: 13,060 images (Impressionism)
  • Mean: 3,016 images
  • Standard deviation: 3,269 (greater than the mean — extreme variance)

The max-to-min ratio is about 133:1 (13,060 / 98 ≈ 133.3). The standard deviation (3,269) exceeds the mean (3,016), giving a coefficient of variation of about 1.08, which shows how heavily skewed the class distribution is.
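
The imbalance figures above can be reproduced with a few lines of Python. This is a minimal sketch, not DataClinic's implementation: the function name `imbalance_summary` and the toy class counts are hypothetical, and the use of the population standard deviation is an assumption, since the report does not say which variant DataClinic uses.

```python
import statistics


def imbalance_summary(class_counts):
    """Summarize class imbalance from a mapping of class name -> image count."""
    counts = list(class_counts.values())
    mean = statistics.mean(counts)
    # Assumption: population standard deviation; sample std is another choice.
    std = statistics.pstdev(counts)
    return {
        "min": min(counts),
        "max": max(counts),
        "ratio": max(counts) / min(counts),  # max-to-min imbalance ratio
        "mean": mean,
        "std": std,
        "cv": std / mean,  # coefficient of variation
    }


# Hypothetical toy counts, only to show the function's output shape.
toy_summary = imbalance_summary({"style_a": 100, "style_b": 300, "style_c": 800})

# Checks against the figures reported in this section.
reported_ratio = 13060 / 98
reported_cv = 3269 / 3016
print(f"{reported_ratio:.1f}")  # 133.3
print(f"{reported_cv:.2f}")     # 1.08
```

A coefficient of variation above 1 is a quick signal that a handful of classes dominate the dataset.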

Below are the mean images for representative classes, revealing distinctly different visual characteristics for each art style:

Class mean images: Abstract Expressionism, Art Nouveau, Baroque

The mean images show that Abstract Expressionism exhibits abstract characteristics with mixed colors and textures, while Baroque displays dark backgrounds with strong chiaroscuro typical of classical painting. The fact that mean images are clearly distinguishable across styles indicates that some degree of inter-style differentiation exists at the pixel level.

Level 2: Feature Space Analysis (Wolfram ImageIdentify Net V2)

Level 2 uses Wolfram ImageIdentify Net V2 (1,280-dimensional feature vectors) to place images in a high-dimensional embedding space and analyze inter-class separability and density distributions. This is a general-purpose vision model not specialized for any particular domain, measuring "how differently a computer perceives images."

3 High-Density Clusters Discovered

Analysis of the PCA-reduced feature space revealed 3 high-density clusters. This means WikiArt's 27 art styles are broadly grouped into 3 visual types from a general vision model's perspective: classical painting styles (Baroque, Renaissance, etc.), modern abstract styles (Abstract Expressionism, Minimalism, etc.), and stylized art movements (Ukiyo_e, Art_Nouveau, etc.).

The PCA visualization below shows the distribution of mean features per class:

L2 PCA - WikiArt per-class mean feature distribution

Unclear Boundaries Between Styles

However, the most important finding is that the cluster boundaries are unclear. In particular, the abstract art styles (Abstract_Expressionism, Action_painting, Color_Field_Painting, Minimalism) overlap significantly in the embedding space, so a computer vision model has difficulty telling them apart. From an art-historical standpoint, these movements developed in close succession, so this result is somewhat expected.

Distribution: Bell-shaped

The density distribution is bell-shaped, with most images concentrated in the center of the feature space and a small number of outliers at the extremes. This indicates a healthy data structure close to a normal distribution.

L2 density distribution

Level 3: Domain-Specific Analysis (BLIP Image-Text Matching)

Level 3 uses BLIP Image-Text Matching Nets (56-dimensional feature vectors). BLIP is a multimodal model trained on both images and text, measuring "in what artistic context an image is understood." The compressed 56-dimensional representation extracts only the core features specialized for the art domain.

Cluster Separation: Still Unclear

In the BLIP-based analysis, cluster separation between art styles remains unclear. Similar to L2, abstract art styles fail to separate in the text-image matching space. This suggests that the difficulty in distinguishing these styles is not merely due to visual similarity, but also because textual descriptions cannot clearly differentiate them.

The L3 PCA visualization shows the feature distribution by art style:

L3 PCA - BLIP-based feature distribution by art style

Geometry: Fair / Distribution: Good

In the L3 analysis, the Geometry metric was rated Fair and the Distribution metric was rated Good. A Good distribution rating indicates that the data is adequately dispersed in the BLIP feature space. Where L2 showed a bell-shaped distribution, L3 shows a similar or somewhat more even one.

L3 density distribution

Outlier Analysis: High-Density vs. Low-Density Samples

DataClinic uses density-based outlier analysis to identify the most typical samples (high-density) and most atypical samples (low-density) in the dataset. Based on the L2 analysis, the following samples were identified.

High-Density Samples — Representative Images

High-density samples are images with the most "typical" characteristics in their class. These images form the core concept of each class during model training.

Minimalism (density 0.319): an abstract painting showing typical minimalist characteristics, with minimal form and color

Color Field Painting (density 0.318): a quintessential Color Field work with large color fields filling the canvas

Low-Density Samples — Outlier Images

Low-density samples are outliers that visually differ significantly from other images within their class. These images may be mislabeled or represent atypical works of the art style, requiring review.

Abstract Expressionism (density 0.047): an outlier image markedly different from other works in the Abstract Expressionism class

Naive Art Primitivism (density 0.048): an atypical work with very low density in the Naive Art category

Notably, the density values of the low-density samples (0.047, 0.048) are roughly one-seventh of those of the high-density samples (0.318, 0.319). Such extreme outliers are likely labeling errors, collection errors, or edge cases at the boundaries of art styles. DataClinic automatically identifies these samples and catalogs them for data engineer review.
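
The idea of ranking samples by density can be illustrated with a simple k-nearest-neighbor proxy. The report does not describe DataClinic's actual density estimator, so everything here is an assumption made for illustration: the inverse mean k-NN distance as the density score, the helper names `knn_density` and `lowest_density_indices`, and the toy 2-D points standing in for 1,280-dimensional embeddings. The resulting scores are not on the same scale as the density values quoted above.

```python
import math


def knn_density(points, k=2):
    """Density proxy: inverse of the mean distance to the k nearest neighbors.

    Requires k <= len(points) - 1.
    """
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn = sum(dists[:k]) / k
        densities.append(1.0 / (mean_knn + 1e-12))  # epsilon avoids division by zero
    return densities


def lowest_density_indices(densities, fraction=0.05):
    """Indices of the lowest-density samples (at least one), lowest first."""
    n = max(1, math.ceil(len(densities) * fraction))
    order = sorted(range(len(densities)), key=lambda i: densities[i])
    return order[:n]


# Hypothetical toy embeddings: a tight cluster plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
densities = knn_density(points, k=2)
flagged = lowest_density_indices(densities, fraction=0.05)
print(flagged)  # [4]
```

This brute-force version compares every pair of points, so a dataset of this size would need an approximate nearest-neighbor index instead.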

Recommendations: Simultaneous Bulk-Up + Diet Strategy

To raise the WikiArt dataset's quality score from 53 into the 75–80 range, two improvement strategies should be applied together: "bulk-up" to augment minority classes, and "diet" to remove unnecessary duplicates from majority classes.

1. Minority Class Bulk-Up (Data Collection and Augmentation)

  • Additional Collection: Collect additional images from the official WikiArt website, museum APIs, etc. for classes with fewer than 100 images, such as Analytical_Cubism (98 images) and Action_painting
  • Data Augmentation: Use augmentation techniques such as rotation, color adjustment, and cropping to bring each minority class up to at least 300 images
  • Target: Every class has a minimum of 300 images

2. Majority Class Diet (Deduplication)

  • Similarity-Based Deduplication: Classify image pairs with cosine similarity above 0.95 in the L2 embedding space as duplicates and remove one
  • High-Density Cluster Sampling: Select and retain 3,000 representative samples centered on high-density regions for majority classes exceeding 13,000 images (e.g., Impressionism)
  • Target: Limit maximum images per class to 5,000 or fewer
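
The similarity-based deduplication step above can be sketched as a greedy filter over embedding vectors. This is an illustration under stated assumptions rather than DataClinic's procedure: the function names are hypothetical, the toy 3-D vectors stand in for real L2 embeddings, and "remove one" is interpreted as keeping the first image of each near-duplicate pair. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.95):
    """Greedy pass: keep an item only if no already-kept item exceeds the threshold."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy embeddings.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of item 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # near-duplicate of item 2
]
kept = deduplicate(embeddings)
print(kept)  # [0, 2]
```

The greedy pass is order-dependent and quadratic in the number of images, so for a class of 13,060 images it would be reasonable to run it per class or to switch to an approximate nearest-neighbor search.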

3. Label Consistency Review

  • Boundary Style Review: Re-verify label consistency for overlapping classes in the embedding space, such as Abstract_Expressionism, Action_painting, Color_Field_Painting, and Minimalism
  • Outlier Review: Manually re-confirm or remove samples in the bottom 5% density (low-density outliers)
  • Class Merging Consideration: Consider merging overly similar styles (e.g., Analytical_Cubism + Synthetic_Cubism) into parent classes

Expected Improvement Results

Expected DataClinic score changes after applying all 3 improvements:

Metric | Current | Expected After Improvement
Class Imbalance | Poor (98–13,060 images) | Good (300–5,000 images)
Outlier Ratio | ~5% | <1%
Label Consistency | Fair | Good
Overall Score | 53 (Poor) | 75–80 (Fair–Good)
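
The class-size targets in the table (300 to 5,000 images per class) translate into a per-class plan of how many images to add or remove. The sketch below is a hypothetical helper, not a DataClinic feature; it uses only the two class counts stated in this report, and the name `rebalance_plan` is an assumption.

```python
def rebalance_plan(class_counts, floor=300, cap=5000):
    """For each class, compute the target size and the number of images to add or remove."""
    plan = {}
    for name, count in class_counts.items():
        target = min(max(count, floor), cap)  # clamp the count into [floor, cap]
        plan[name] = {
            "current": count,
            "target": target,
            "delta": target - count,  # positive = bulk-up, negative = diet
        }
    return plan


# The two class counts given in this report.
counts = {"Analytical_Cubism": 98, "Impressionism": 13060}
plan = rebalance_plan(counts)
for name, row in plan.items():
    print(name, row["delta"])
# Analytical_Cubism 202
# Impressionism -8060
```

Classes that already fall inside the target range get a delta of zero and need no resampling.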

DataClinic goes beyond discovering problems: it quantifies how much improvement each class needs, which helps allocate data engineering resources where they have the most effect.