Dissecting 150 Korean Foods with Data

The Korean Image (Food) dataset, published on AI Hub, is a large-scale Korean food vision dataset comprising 150 classes and 150,507 images, covering everything from traditional Korean dishes to modern street food. Its license permits commercial use, so it can be used directly to develop AI-powered food recognition services.

The 150 classes capture the full landscape of Korean food culture:

Soups & Stews — Galbitang(갈비탕, short rib soup), Mul-naengmyeon(물냉면, cold noodles), Samgyetang(삼계탕, ginseng chicken), Chueotang(추어탕, loach soup), Yukgaejang(육개장, spicy beef soup), Dakgaejang(닭개장, spicy chicken soup), Muguk(무국, radish soup)
Grilled — Galbi-gui(갈비구이, grilled ribs), Samgyeopsal(삼겹살, pork belly), Galchi-gui(갈치구이, grilled hairtail), Godeungeo-gui(고등어구이, grilled mackerel)
Braised & Stir-fried — Gaji-bokkeum(가지볶음, stir-fried eggplant), Kkaennip-jangajji(깻잎장아찌, pickled perilla leaves), Ganjang-gejang(간장게장, soy-marinated crab), Galbi-jjim(갈비찜, braised ribs)
Street Food — Gimbap(김밥, seaweed rice rolls), Ramyeon(라면, instant noodles), Mandu(만두, dumplings), Tteokbokki(떡볶이, spicy rice cakes), Sundae(순대, blood sausage)
Traditional Rice Cakes — Songpyeon(송편), Gyeongdan(경단, rice balls), Kkultteok(꿀떡, honey rice cake), Hangwa(한과, traditional confections)
Seafood — Meongge(멍게, sea squirt), Gwamegi (semi-dried fish), Jeotgal (fermented seafood)
Others — Korean fried chicken, Jajangmyeon (black bean noodles), Jjamppong (spicy seafood noodles), and other Korean-adapted versions of foreign dishes

Each food name carries the context of Korean food culture. Meongge(멍게, sea squirt), a food known for its distinctive briny aroma and vermilion color, is one of the most difficult foods for AI models to recognize. Gwamegi is a winter seasonal food from the Pohang region that is visually hard to distinguish from regular dried fish. How such foods, which demand domain knowledge to identify, affect data quality is at the core of this diagnosis.

DataClinic Overall Score: 71 (Fair)

Classes: 150
Total Images: 150,507
Avg. per Class: 1,003
Class Balance Std. Dev.: 16.8

A score of 71 falls in the 'Fair' grade, which is above average among large-scale public datasets. Class balance is excellent, but insufficient visual diversity in some classes drags the score down. Because the license permits commercial use, the dataset can be put to work in real-world AI development right away.

Level 1 examines image integrity, missing values, class balance, and pixel statistics. This is the first health check DataClinic performs upon receiving the raw data.

Class Balance: Textbook-Level

Image counts across 150 classes range from a minimum of 992 to a maximum of 1,125, with a standard deviation of just 16.8. This is a distribution that appears deliberately balanced. For comparison, the WikiArt dataset's class balance standard deviation runs into the thousands. This means the risk of model bias toward specific foods during training is extremely low.
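As a rough illustration of how such a balance figure can be computed, the sketch below summarizes per-class image counts with the Python standard library. The counts are hypothetical placeholders, not the real per-class numbers, and the use of the population standard deviation is an assumption; the report does not say which variant DataClinic uses.

```python
from statistics import mean, pstdev

def class_balance_summary(counts):
    """Summarize per-class image counts: min, max, mean, population std. dev."""
    values = list(counts.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population (not sample) std. dev.
    }

# Hypothetical counts for four classes (placeholders, not dataset figures).
example_counts = {"gimbap": 1000, "ramyeon": 1010, "songpyeon": 990, "hangwa": 1000}
summary = class_balance_summary(example_counts)
print(summary)
```

A small standard deviation relative to the mean, as reported here (16.8 against roughly 1,003 images per class), is what indicates that no class dominates training.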

Image Resolution: Wide Spectrum

Image dimensions range from 121x91px to 6,048x4,032px, a very wide distribution. This reflects collection from diverse sources, from smartphone snapshots to professional DSLR photography. Standardizing resolution during preprocessing is essential for AI training. In particular, images at the minimum resolution of 121x91px must be upscaled for standard models such as ResNet-50, which is typically trained on 224x224px inputs.
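One common way to standardize resolution is to resize so the shorter side matches the target and then take a centered square crop. The sketch below only computes the geometry for that approach; the function name and the 224px default are illustrative assumptions, and the actual pixel resampling would be done by an imaging library.

```python
def resize_then_center_crop(width, height, target=224):
    """Return (new_width, new_height, crop_box) such that the shorter side
    becomes `target` and a centered target x target square is cropped out."""
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return new_w, new_h, (left, top, left + target, top + target)

# The two extremes reported for this dataset.
smallest = resize_then_center_crop(121, 91)      # scale > 1, so this is upscaling
largest = resize_then_center_crop(6048, 4032)    # scale < 1, so this is downscaling
print(smallest)
print(largest)
```

The smallest image is enlarged by a factor of about 2.5, which cannot restore missing detail, so very small images may deserve separate review rather than blind upscaling.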

Channel Composition: Stable

99.42% of all images are standard 3-channel RGB. Another 0.33% contain an alpha channel (RGBA) and 0.25% use other formats, so only 0.58% of the images require alpha channel removal or RGB conversion during preprocessing.
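To make the conversion step concrete, here is a minimal per-pixel sketch of what "RGB conversion" can mean for grayscale and RGBA inputs. Compositing transparent pixels over a white background is an assumption chosen for illustration; in practice an imaging library would apply the equivalent operation to whole images.

```python
def to_rgb(pixel, background=(255, 255, 255)):
    """Convert a grayscale int, an RGB tuple, or an RGBA tuple to an RGB tuple."""
    if isinstance(pixel, int):
        # Grayscale: replicate the single channel three times.
        return (pixel, pixel, pixel)
    if len(pixel) == 3:
        # Already RGB: return unchanged.
        return tuple(pixel)
    # RGBA: alpha-composite over the (assumed) background color.
    r, g, b, a = pixel
    alpha = a / 255
    return tuple(
        round(channel * alpha + bg * (1 - alpha))
        for channel, bg in zip((r, g, b), background)
    )

print(to_rgb(128))
print(to_rgb((255, 0, 0, 255)))
print(to_rgb((0, 0, 0, 0)))
```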

Missing Values: Negligible

Out of the original 150,610 images, 103 (0.07%) were missing, leaving 150,507 images for actual diagnosis. A 0.07% rate is very low by large-scale web-crawled dataset standards and is practically negligible.

Domain Insight: What Class Mean Images Reveal

Looking at the mean images above, Gyeongdan(경단) and Kkultteok(꿀떡) appear sharp, while Gimbap(김밥) is slightly blurry. This indicates that Gyeongdan and Kkultteok are visually consistent foods (similar color, shape, arrangement), while Gimbap exhibits diverse photography styles -- cross-sections, roll forms, and various plating arrangements. The sharpness of a mean image directly reflects intra-class visual consistency.
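The link between mean-image sharpness and visual consistency can be reproduced on toy data. The sketch below averages tiny synthetic grayscale images and scores sharpness as the mean squared difference between neighboring pixels. Both the synthetic "edge" images and this sharpness measure are illustrative assumptions, not the method used in the report.

```python
def mean_image(images):
    """Pixel-wise average of equally sized grayscale images (lists of rows)."""
    h, w, n = len(images[0]), len(images[0][0]), len(images)
    return [[sum(img[y][x] for img in images) / n for x in range(w)] for y in range(h)]

def gradient_energy(image):
    """Mean squared difference between horizontally and vertically adjacent pixels."""
    h, w = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                total += (image[y][x + 1] - image[y][x]) ** 2
                count += 1
            if y + 1 < h:
                total += (image[y + 1][x] - image[y][x]) ** 2
                count += 1
    return total / count

def edge_image(edge_col, size=8):
    """Synthetic image: black left of `edge_col`, white from it onward."""
    return [[0.0 if x < edge_col else 1.0 for x in range(size)] for _ in range(size)]

consistent = [edge_image(4) for _ in range(4)]       # same layout every time
varied = [edge_image(c) for c in (2, 3, 5, 6)]       # layout shifts per image
sharp_score = gradient_energy(mean_image(consistent))
blurry_score = gradient_energy(mean_image(varied))
print(sharp_score, blurry_score)
```

When every image places the edge in the same spot, the average keeps a crisp edge; when the edge moves around, the average smears it into a ramp and the score drops.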

Level 2 extracts features from the entire dataset using Wolfram ImageIdentify Net V2 (1,280-dimensional feature vectors) and analyzes their distribution. This neural network is a general-purpose image recognition model, not specialized for the food domain. In other words, it views Korean food through the eyes of a layperson, not a chef.

Two Clusters: Soup vs. Dry

PCA and density topography analysis reveal that through a general-purpose lens, Korean food splits into two distinct clusters. Interpreting the results with domain knowledge:

Cluster A (Dry food group) — Grilled, stir-fried, rice cakes, pancakes, street food. Characterized by solid forms and rich colors
Cluster B (Soup-based food group) — Soups, stews, noodle dishes. Characterized by liquid forms in wide bowls

This is evidence that Korean cuisine's distinctive 'soup culture' is faithfully reflected in the image data. Even without knowing recipes or ingredients, the features of a general-purpose model separate 'with broth' from 'without broth' based solely on visual structure.
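The kind of two-group split described above can be illustrated with a small PCA sketch. It finds the first principal component by power iteration on centered vectors and projects each class mean onto it. The 3-dimensional vectors are hypothetical stand-ins for the 1,280-dimensional per-class mean features, and a real analysis would normally use a numerical library instead of pure Python.

```python
import random

def first_principal_component(vectors, iterations=100, seed=0):
    """Return (direction, scores): the leading PCA axis and each vector's
    projection onto it, computed by power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    rng = random.Random(seed)
    direction = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        # Apply the (unnormalized) covariance matrix without forming it: X^T (X w).
        proj = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
        direction = [sum(p * c[j] for p, c in zip(proj, centered)) for j in range(d)]
        norm = sum(x * x for x in direction) ** 0.5
        direction = [x / norm for x in direction]
    scores = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
    return direction, scores

# Hypothetical class-mean features: three "dry" classes, then three "soup" classes.
dry = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1], [1.9, 0.0, -0.1]]
soup = [[-2.0, 0.0, 0.1], [-2.1, 0.1, 0.0], [-1.8, -0.1, -0.1]]
direction, scores = first_principal_component(dry + soup)
dry_scores, soup_scores = scores[:3], scores[3:]
print(dry_scores, soup_scores)
```

The sign of a principal component is arbitrary, so what matters is that the two groups land on opposite sides of zero, not which side each one takes.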

The PCA visualization below shows the distribution of per-class mean features:

L2 PCA — Korean Image (Food) per-class mean feature distribution

Distribution: Bell-Shaped — Healthy

The overall density distribution maintains a bell-shaped curve. Most images cluster near the center of the feature space, with a small number of outliers at the extremes. This is a healthy data structure approximating a normal distribution.

L2 overall density plot — Korean Image (Food)

Per-Class Density Comparison: Ramyeon(라면) vs. Meongge(멍게)

Comparing per-class density distributions reveals interesting differences. Ramyeon shows a narrow, high-peaked density distribution, while Meongge has a relatively wide and flat distribution. Ramyeon has high visual consistency with its red broth + noodles + scallion combination, whereas Meongge is photographed in diverse states -- raw, prepared, and plated.

Ramyeon(라면) — Narrow, concentrated density distribution

Meongge(멍게) — Wide, diverse density distribution
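The difference between a narrow, high-peaked distribution and a wide, flat one can be shown with a one-dimensional Gaussian kernel density estimate. The sample values and the bandwidth below are hypothetical, and the report does not state which density estimator DataClinic uses, so this is only a sketch of the general idea.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating density at x with a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

# Hypothetical 1-D projections of image features for two classes.
narrow_class = [-0.2, -0.1, 0.0, 0.1, 0.2]   # visually consistent class
wide_class = [-2.0, -1.0, 0.0, 1.0, 2.0]     # visually diverse class

grid = [i / 10 for i in range(-30, 31)]
narrow_peak = max(gaussian_kde(narrow_class, 0.5)(x) for x in grid)
wide_peak = max(gaussian_kde(wide_class, 0.5)(x) for x in grid)
print(round(narrow_peak, 3), round(wide_peak, 3))
```

With the same number of samples and the same bandwidth, the tightly grouped class produces a much higher peak, which is the pattern described for Ramyeon versus Meongge.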

Level 3 applies a 129-dimensional lens specialized for the Korean food dataset, built on top of Wolfram ImageIdentify Net V2. Instead of a general-purpose perspective, it re-examines the data through the eyes of a Korean food expert.

Cluster Unification: Two Become One

The most notable change is that the two clusters visible in Level 2 merge into one. While the general-purpose lens responded to the visual structure of 'broth presence,' the domain-specific lens prioritizes the shared identity of 'Korean food.'

This has important implications for real-world AI service development. When building a Korean food recognition model, using a general-purpose backbone directly may cause soup-based and dry foods to be treated as entirely different domains. However, using a Korean food-specialized feature extractor creates a more unified recognition space.

The L3 PCA visualization confirms the unified distribution:

L3 PCA — Korean food domain-specific lens class distribution

Distribution: Still Bell-Shaped — Stable

Even with the domain-specific lens, the overall distribution maintains a bell-shaped form. The fact that the clusters merged while a healthy distribution shape was preserved suggests that domain specialization is not mere compression, but meaningful representation learning.

L3 overall density plot — Korean food domain-specific

DataClinic uses density-based outlier analysis to identify the most typical samples (high density) and the most atypical samples (low density) in the dataset.
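A simple way to approximate this kind of ranking is a k-nearest-neighbor density score: points whose neighbors are close get a high score, isolated points get a low one. The sketch below is an assumption-laden stand-in, since the report does not describe DataClinic's estimator or how its scores are normalized, and the 2-D points are hypothetical.

```python
import math

def knn_density(points, k=3):
    """Score each point by the inverse of its mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(1.0 / (sum(dists[:k]) / k + 1e-12))
    return scores

# Hypothetical 2-D features: a tight cluster plus one isolated sample.
points = [(0, 0), (0.1, 0), (0, 0.1), (-0.1, 0), (0, -0.1), (5, 5)]
scores = knn_density(points)
most_typical = scores.index(max(scores))
most_atypical = scores.index(min(scores))
print(most_typical, most_atypical)
```

Sorting samples by such a score gives the two review lists used in the rest of this section: the highest-scoring images as "typical" examples and the lowest-scoring ones as outlier candidates.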

High-Density Samples — AI's 'Typical Korean Food'

High-density samples -- images closest to the center of the feature space -- are dominated by Songpyeon(송편) and Mul-naengmyeon(물냉면, cold noodles). This is no coincidence.

Songpyeon(송편) is a visually highly consistent food:

Uniform half-moon silhouette
Fixed color palette of white, pink, and green
Standardized composition neatly arranged on a plate
Often simple backgrounds with uniform lighting

From an AI perspective, Songpyeon(송편) is a "predictable" image. Nearly all Songpyeon photos produce similar feature vectors, resulting in high measured density.

Songpyeon(송편) (High): density 0.6961
Songpyeon(송편) (High): density 0.6952
Songpyeon(송편) (High): density 0.6945
Naengmyeon(냉면) (High): density 0.6957

Low-Density Samples — The Identity of Outliers

The top low-density outliers include Gimbap(김밥), Sundae(순대, blood sausage), Kkaennip-jangajji(깻잎장아찌, pickled perilla leaves), and Samgyeopsal(삼겹살, pork belly). Their common trait is highly variable shooting angles, plating styles, and cooking states:

Gimbap — Cross-section views (exposing pickled radish and egg filling) vs. side views (cylindrical exterior) produce entirely different images
Samgyeopsal — Raw pink meat vs. grilled brown meat results in dramatically different colors
Kkaennip-jangajji — Standalone on a plate vs. used as a wrap
Sundae — Whole sausage vs. sliced cross-section exposure

Gimbap(김밥) (Low): density 0.0513
Sundae(순대) (Low): density 0.0541
Kkaennip-jangajji(깻잎장아찌) (Low): density 0.0552
Samgyeopsal(삼겹살) (Low): density 0.0566

Most Dissimilar Pair: Hangwa(한과) vs. Gimbap(김밥)/Yukgaejang(육개장)

Similarity analysis identified the most distant image pairs in feature space. The combination of Hangwa(한과) (traditional confections) and Gimbap(김밥)/Yukgaejang(육개장) (spicy beef soup) is representative. Hangwa features golden-brown, dry, uniform confection shapes, while Gimbap is a black-and-white cylinder with colorful cross-sections, and Yukgaejang is a bowl brimming with red broth -- diametrically opposite in color, texture, and form.

Hangwa — most distant pair reference image

Hangwa(한과)

Gimbap(김밥)

The most distant pair in feature space. Diametrically opposite in color, texture, and form.
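Finding the most distant pair amounts to comparing every pair of feature vectors and keeping the one with the largest distance. The sketch below uses cosine distance, which is an assumption because the report does not name its similarity measure, and the short vectors are hypothetical stand-ins for real image features.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def most_distant_pair(features):
    """Return the pair of keys whose vectors have the largest cosine distance."""
    return max(
        combinations(features, 2),
        key=lambda pair: cosine_distance(features[pair[0]], features[pair[1]]),
    )

# Hypothetical feature vectors keyed by class name.
features = {
    "hangwa": [1.0, 0.0, 0.1],
    "gimbap": [-0.9, 0.2, 0.0],
    "yukgaejang": [0.0, 1.0, 0.0],
    "ramyeon": [0.1, 0.9, 0.1],
}
pair = most_distant_pair(features)
print(pair)
```

The exhaustive comparison grows quadratically with the number of items, so it is practical for per-class means or sampled images but not for all 150,507 images at once.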

Practical Application of Outlier Analysis

Low-density outliers can mean one of two things: (1) label errors, where an image is assigned to a class it does not belong to, or (2) diversity-rich samples, which are atypical but frequently encountered variations in real-world photography. The second kind is valuable because it improves model robustness. Manually reviewing DataClinic outlier samples to separate errors from genuine diversity is the first step in quality improvement.

DataClinic recommends a Data Diet for this dataset. Despite textbook-level class balance, the quality score of 71 (Fair) is held back by insufficient visual diversity in certain classes.

What Is Data Diet?

Data Diet is not simply about reducing data. It involves identifying and removing near-duplicate images concentrated in high-density regions so that models can learn more diverse patterns.

Songpyeon(송편) — Half-moon shaped, pastel-colored images are densely clustered. Adding atypical images with varied lighting, rustic styling, and making-of scenes is recommended
Mul-naengmyeon(물냉면) — Center-of-bowl compositions are repetitive. More diverse angles and plating styles are recommended
High-density classes overall — After removing duplicate images, replace with images from diverse shooting environments (restaurants, homes, street food stalls)
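The removal half of this prescription can be sketched as a greedy near-duplicate filter: keep an image only if its features are not too similar to anything already kept. The cosine-similarity measure, the 0.98 threshold, and the 2-D vectors are all illustrative assumptions, not DataClinic's actual procedure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_dedup(vectors, threshold=0.98):
    """Return indices of vectors to keep: each kept vector has cosine
    similarity below `threshold` to every previously kept vector."""
    kept = []
    for idx, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[k]) < threshold for k in kept):
            kept.append(idx)
    return kept

# Hypothetical image features: indices 1 and 3 nearly duplicate 0 and 2.
vectors = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 1.0], [0.7, 0.7]]
kept = greedy_dedup(vectors)
print(kept)
```

The threshold controls how aggressive the diet is, so it is worth inspecting what gets dropped at a few settings before committing. The pairwise loop is fine within a single class of about 1,000 images, while larger scopes would call for an approximate nearest-neighbor index.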

Is Data Bulkup (Augmentation) Not Needed?

Since class balance is already excellent, Diet is more urgent than minority class augmentation. However, for atypical seafood classes like Meongge(멍게), Gwamegi, and Jeotgal, which are among the hardest to recognize, adding images from varied shooting conditions could significantly improve model robustness.

Key Takeaways

Class Balance: Textbook-level (Std. Dev. 16.8)

Missing Values: 0.07% — Negligible

Channel Composition: 99.42% RGB

Resolution Range: Preprocessing standardization required

High-Density Class Duplication: Data Diet recommended

Expected Improvement: Data Diet could raise score from 71 to 80+

Full diagnosis results and detailed analysis for all 150 classes are available at DataClinic Report #59.
