Korean Food Story — The Broth Culture AI Discovered

"Guk-mul-do eop-da" — literally "there isn't even broth," but its real meaning is "there's absolutely nothing" or "not a single penny." In Korean culture, broth (guk-mul) represents the most basic thing one should share, the minimum unit of a meal. This single idiom captures the essence of Korean cuisine.

The traditional Korean table is built on a triangular structure of bap (rice) + guk (soup) + banchan (side dishes). Galbitang, samgyetang, doenjang-jjigae, miyeok-guk... No matter how simple the meal, at least one bowl of soup is considered proper Korean dining etiquette. It is no coincidence that a large proportion of the 150 classes in this Korean food dataset are soup-based dishes. And as we will see later, this fact is faithfully reflected in AI as well.

The global spread of K-pop and K-dramas (the Korean Wave, or Hallyu) has ignited explosive interest in Korean food. Scenes of grilling samgyeopsal with soju, or cooking a late-night bowl of ramyeon, are now familiar to fans worldwide. In an era where a single Netflix drama can increase global search volume for a particular Korean dish dozens of times over, support for Korean food categories in food-recognition AI is no longer optional.

🌍 A Guide for Those New to Korean Food

  • Banchan (Side Dishes) — Small dishes served alongside the main course, typically 3 to 10 per meal. This is why items like pickled perilla leaves (kkaennip-jangajji) may look unfamiliar when plated individually.
  • Broth Culture (Guk-mul Culture) — Soups, stews, and hot pots are essential to Korean meals. This is exactly why general-purpose AI splits Korean food into two clusters.
  • Seasonal Foods — Songpyeon is eaten only during Chuseok (Korean Thanksgiving). The repetition of half-moon shapes and pastel colors in standardized photos makes it the "most typical" food for AI.
  • Cooking State — Samgyeopsal (pork belly) includes raw (pink), grilling (smoky), and cooked (brown) images all in the same class. This is the structural reason it produces the most outliers.

🤖 For AI Researchers: Why Korean Food Data Is Fascinating

Korean food image recognition is like compressing several hard problems in computer vision into a single dataset.

  • Fine-grained Classification at Its Extreme — Mul-naengmyeon (cold broth noodles) and bibim-naengmyeon (spicy mixed noodles) are both naengmyeon, but the presence or absence of broth makes their appearance completely different. Doenjang-jjigae and kimchi-jjigae share similar bowl compositions but differ in color. This is why ImageNet-level general models fall short.
  • Intra-class Variance Imbalance — Songpyeon (very low intra-class variance) vs. Gimbap (very high intra-class variance): even when the number of images per class is nearly identical, learning difficulty varies dramatically. A balanced dataset does not necessarily guarantee balanced learning.
  • The Power of Domain Specialization — Two clusters in the general-purpose lens (L2) merge into one under the domain-specialized lens (L3). The feature space created by a Korean-food-specific backbone has a fundamentally different structure.

Executive Summary

This post presents the key insights from the quality diagnostic report #59 of the Korean Food Image dataset, conducted using Pebblous DataClinic.

The Korean Food Image dataset is a large-scale Korean cuisine vision dataset comprising 150 classes and 150,507 images, ranging from galbitang to fried chicken. The DataClinic overall diagnosis resulted in a quality score of 71 (Fair). In terms of class balance, image counts range from a minimum of 992 to a maximum of 1,125 with a standard deviation of just 16.8 — a textbook-level distribution.

At L2 (feature space analysis) using Wolfram ImageIdentify Net V2 (1,280 dimensions), a pattern was discovered where general-purpose AI splits Korean food into two clusters: soup-based foods and dry foods. In contrast, at L3 (domain-optimized analysis, 129 dimensions), these two clusters merge into a single Korean food space.

The most "typical" food was Songpyeon, while the most heterogeneous images were found in Gimbap. Data Diet (deduplication) is recommended for classes with low visual diversity.

Dataset Overview — A World of 150 Korean Foods

The Korean Food Image dataset is a large-scale Korean cuisine vision dataset comprising 150 classes and 150,507 images, spanning from traditional Korean dishes to modern street food. Provided by AIHub with commercial use permitted, it can be immediately applied to AI-based food recognition service development.

The 150 classes faithfully capture the landscape of Korean food culture. Below is a breakdown by major categories:

  • 🍲 Soups & Stews: Galbitang, Samgyetang, Chueotang, Yukgaejang, Doenjang-jjigae, Muguk, Siraegi-guk, Bugeoguk, etc.
  • 🍜 Noodles & Rice: Mul-naengmyeon, Bibim-naengmyeon, Ramyeon, Jajangmyeon, Jjamppong, Kongnamul-gukbap, etc.
  • 🥩 Grilled Dishes: Samgyeopsal, Galbi-gui, Galchi-gui, Godeungeo-gui, Dakgalbi, etc.
  • 🥘 Braised & Stir-fried: Galbijjim, Jangjorim, Mechuri-al-jangjorim, Gaji-bokkeum, Ganjang-gejang, etc.
  • 🌿 Kimchi & Pickles: Kkaennip-jangajji, Geotjeori, Yeolmu-kimchi, Kkakdugi, Baechu-kimchi, etc.
  • 🍡 Street Food: Gimbap, Tteokbokki, Sundae, Mandu, Rabokki, Fried Chicken, etc.
  • 🍮 Rice Cakes & Confections: Songpyeon, Gyeongdan, Kkultteok, Hangwa, Yaksik, Injeolmi, etc.
  • 🦑 Seafood & Fermented Fish: Meongge, Gwamegi, Jeotgal, Hongeo, Kkomak, Jeonbok-gui, etc.

Each food name carries the context of Korean food culture. Meongge (sea squirt), known for its distinctive briny smell, is one of the hardest foods for AI models to recognize because of its unusual form and vermilion color. Gwamegi is a winter seasonal delicacy from the Pohang region that is visually difficult to distinguish from ordinary dried fish. Fried chicken is a foreign food reinterpreted Korean-style. How these "domain-knowledge-dependent foods" affect data quality is the key point of this diagnosis.

[Figure: class mean images for Gyeongdan, Gimbap, Kkultteok, Ramyeon, Mandu, and Meongge]

▲ Class mean images — pixel-averaged results from approximately 1,000 images per class. The sharper the image, the higher the visual consistency. Note that Gyeongdan and Kkultteok appear sharp, while Gimbap is relatively blurry.
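
The idea behind a class mean image can be sketched in a few lines. This is an illustration on synthetic arrays, not DataClinic's implementation: the `sharpness` score (mean squared difference between neighbouring pixels) and the toy "consistent" and "varied" classes are assumptions made for the example.

```python
import numpy as np

def mean_image(images):
    """Pixel-wise average of equally sized images (N x H x W)."""
    return np.asarray(images, dtype=np.float64).mean(axis=0)

def sharpness(image):
    """Mean squared difference between neighbouring pixels (a crude edge-energy score)."""
    img = np.asarray(image, dtype=np.float64)
    dy = (np.diff(img, axis=0) ** 2).mean()
    dx = (np.diff(img, axis=1) ** 2).mean()
    return dx + dy

rng = np.random.default_rng(0)

# Toy stand-in for a visually consistent class: the same bright square every time.
base = np.zeros((32, 32))
base[8:24, 8:24] = 1.0
consistent = np.stack([base for _ in range(200)])

# Toy stand-in for a varied class: the square lands at a random offset in each image.
varied = np.stack([
    np.roll(base, shift=(int(rng.integers(-8, 9)), int(rng.integers(-8, 9))), axis=(0, 1))
    for _ in range(200)
])

sharp_consistent = sharpness(mean_image(consistent))
sharp_varied = sharpness(mean_image(varied))
print(f"consistent: {sharp_consistent:.4f}, varied: {sharp_varied:.4f}")
```

Averaging misaligned images smears the edges, so the varied class yields the lower score, which is the same effect that makes the Gimbap mean image look blurry.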

Overall Diagnosis — Quality Score 71 (Fair)

DataClinic Overall Score: 71 (Fair)
  • Classes: 150
  • Total Images: 150,507
  • Avg. per Class: 1,003
  • Class Balance Std. Dev.: 16.8

An overall score of 71 falls in the "Fair" grade, which still ranks among the top tier of large-scale public datasets. Class balance is the dataset's fundamental strength, but insufficient visual diversity in certain classes limits the score. With commercial use permitted, the dataset is ready to use in real-world AI development.

Diagnostic Item | Result | Notes
Class Balance | ✅ Good | Std. dev. 16.8 (textbook-level)
Missing Values | ✅ Good | 0.07% (103 / 150,610 images)
Channel Composition | ✅ Good | 99.42% RGB
Image Resolution | ⚠️ Fair | 121×91 px to 6,048×4,032 px
Intra-class Diversity | ⚠️ Fair | Dense duplicates in Songpyeon, Mul-naengmyeon, etc.
Commercial Use | ✅ Permitted | Source: AIHub

Level 1: Basic Quality Check — A Pixel-Level Health Exam

Level 1 examines image integrity, missing values, class balance, and pixel statistics. It is the foundational health check that DataClinic runs first upon receiving raw data.

✅ Class Balance: Textbook Level

Image counts across the 150 classes range from a minimum of 992 to a maximum of 1,125, with a standard deviation of just 16.8. This is a distribution so well-balanced it appears deliberately engineered. For comparison, the WikiArt dataset has a class balance standard deviation in the thousands. This means the risk of model bias toward specific foods during training is extremely low.

However, numerical balance does not guarantee learning balance. 1,003 Songpyeon images and 1,003 Gimbap images are the same count, but nearly all Songpyeon images look alike while Gimbap looks completely different from every angle. Intra-class variance is explored in greater detail at L2/L3.
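
The class-balance figures can be put in relative terms with a little arithmetic. The sketch below is illustrative only: it assumes the reported 16.8 is a population standard deviation (the report does not say which form it uses), and `toy_counts` is a made-up list, not the real per-class counts.

```python
import statistics

# Figures reported in the diagnosis.
TOTAL_IMAGES = 150_507
NUM_CLASSES = 150
REPORTED_STD = 16.8

def balance_summary(counts):
    """Return mean, population std. dev., and coefficient of variation of per-class counts."""
    mean = statistics.fmean(counts)
    std = statistics.pstdev(counts)
    return {"mean": mean, "std": std, "cv": std / mean}

# Coefficient of variation implied by the reported figures.
reported_mean = TOTAL_IMAGES / NUM_CLASSES
reported_cv = REPORTED_STD / reported_mean

# Hypothetical counts, for illustration only.
toy_counts = [992, 1000, 1003, 1005, 1125]
toy = balance_summary(toy_counts)

print(f"reported mean per class: {reported_mean:.2f}")
print(f"reported CV: {reported_cv:.2%}")
print(f"toy CV: {toy['cv']:.2%}")
```

A coefficient of variation below 2% means per-class counts deviate from the mean by only a small fraction, which is what "textbook-level" balance amounts to numerically.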

⚠️ Image Resolution: A Wide Spectrum

Image dimensions range widely from a minimum of 121×91 px to a maximum of 6,048×4,032 px. This reflects collection from diverse sources, from smartphone snapshots to professional DSLR photography. Preprocessing to standardize input resolution is essential for AI training. The smallest 121×91 px images must be upscaled for standard models like ResNet-50, which is conventionally trained on 224×224 px inputs.
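
One common way to standardize resolution is to resize so the shorter side matches the target and then center-crop. The sketch below only computes the geometry for that plan; the function name and the 224 px default are assumptions for illustration, and the actual resampling would be done with an image library of your choice.

```python
def resize_and_crop_plan(width, height, target=224):
    """Plan an aspect-preserving resize so the shorter side equals `target`,
    followed by a centered target x target crop.

    Returns (new_width, new_height, crop_box), where crop_box is
    (left, top, right, bottom) in the resized image's coordinates.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return new_w, new_h, (left, top, left + target, top + target)

# The two extremes reported for this dataset.
smallest = resize_and_crop_plan(121, 91)
largest = resize_and_crop_plan(6048, 4032)
print(smallest)
print(largest)
```

For the smallest image the scale factor is above 1 (roughly 2.46×), so it is upscaled; for the largest it is far below 1, so most of the original detail is discarded before training.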

✅ Channel Composition: Stable

99.42% of all images are standard 3-channel RGB. 0.33% include an alpha channel (RGBa), and 0.25% are in other formats, meaning only 0.58% of images require alpha channel removal or RGB conversion during preprocessing.
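
A minimal sketch of the channel normalization step, assuming images are already loaded as `uint8` NumPy arrays. The report does not say what the "other formats" are, so the grayscale branch and the white background used for alpha blending are assumptions.

```python
import numpy as np

def to_rgb(pixels, background=255):
    """Convert an HxW (grayscale), HxWx3 (RGB), or HxWx4 (RGBA) uint8 array to HxWx3 RGB."""
    arr = np.asarray(pixels, dtype=np.uint8)
    if arr.ndim == 2:
        # Grayscale: repeat the single channel three times.
        return np.stack([arr] * 3, axis=-1)
    if arr.ndim == 3 and arr.shape[-1] == 3:
        # Already RGB.
        return arr
    if arr.ndim == 3 and arr.shape[-1] == 4:
        # RGBA: alpha-blend onto a solid background, then drop the alpha channel.
        rgb = arr[..., :3].astype(np.float32)
        alpha = arr[..., 3:4].astype(np.float32) / 255.0
        blended = rgb * alpha + background * (1.0 - alpha)
        return np.round(blended).astype(np.uint8)
    raise ValueError(f"unsupported channel layout: {arr.shape}")

# A fully opaque red-ish RGBA patch keeps its color after conversion.
opaque = np.zeros((2, 2, 4), dtype=np.uint8)
opaque[..., 0] = 200
opaque[..., 3] = 255
print(to_rgb(opaque)[0, 0])
```

Blending onto a background rather than simply dropping the alpha channel avoids exposing whatever color values sit under fully transparent pixels.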

✅ Missing Values: Negligible

Out of the original 150,610 images, 103 (0.07%) were missing, leaving 150,507 images for the actual diagnosis. 0.07% is an exceptionally low rate by the standards of large-scale web-crawled datasets.

Level 2: Korean Food Through a General AI's Eyes — Two Worlds

Level 2 extracts features from the entire dataset using Wolfram ImageIdentify Net V2 (1,280-dimensional feature vectors) and analyzes their distribution. This neural network is a general-purpose image recognition model not specialized in the food domain — essentially viewing the photos through the eyes of a foreigner unfamiliar with Korean cuisine.

🌊 Two Clusters: Soup vs. Dry

PCA and density topography analysis reveal that under a general-purpose lens, Korean food splits into two distinct clusters. Interpreted with domain knowledge:

🍜 Cluster B — Soup-based Foods
Liquid form in wide bowls. Heat, steam, and background broth as visual signatures.
Mul-naengmyeon, Bibim-naengmyeon, Galbitang, Samgyetang, Yukgaejang, Doenjang-jjigae, Muguk, Ramyeon, Jjamppong
🥩 Cluster A — Dry Foods
Solid forms, rich colors, and plated solid compositions as visual signatures.
Samgyeopsal, Galbi-gui, Gimbap, Songpyeon, Gyeongdan, Hangwa, Mandu, Sundae, Tteokbokki

This suggests that Korea's distinctive "broth culture" is faithfully reflected in the image data. Even without knowledge of recipes or ingredients, the features of a general-purpose AI naturally separate "with broth" from "without broth" based purely on visual structure.

The PCA visualization below shows the distribution of per-class mean features:

L2 PCA — Korean Food Image per-class mean feature distribution
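
A PCA projection of per-class mean features can be sketched with a plain SVD. Everything below runs on synthetic vectors: the 1,280-dimensional size matches the L2 feature length, but the two groups, their sizes (60 and 90), and the offset between them are assumptions chosen to mimic a two-cluster layout, not values from the report.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project rows of `features` onto their top principal components via SVD."""
    x = np.asarray(features, dtype=np.float64)
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
dim = 1280  # same length as the L2 feature vectors; the values are synthetic

# Hypothetical per-class mean features: two groups offset along one direction.
offset = rng.normal(size=dim)
soup_like = rng.normal(scale=0.5, size=(60, dim)) + offset
dry_like = rng.normal(scale=0.5, size=(90, dim)) - offset

coords = pca_project(np.vstack([soup_like, dry_like]))
pc1_soup, pc1_dry = coords[:60, 0], coords[60:, 0]
print(f"PC1 range, soup-like: {pc1_soup.min():.1f} to {pc1_soup.max():.1f}")
print(f"PC1 range, dry-like:  {pc1_dry.min():.1f} to {pc1_dry.max():.1f}")
```

When one direction dominates the variance, as the soup/dry split appears to in the L2 plot, the first principal component alone is enough to separate the two groups.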

📊 Distribution: Bell-shaped — Healthy

The overall density distribution maintains a bell-shaped curve. Most images are concentrated in the center of the feature space, with a small number of outliers distributed at both extremes. This is a healthy data structure approximating a normal distribution.

L2 overall density plot — Korean Food Image

🔬 Per-class Density Comparison — Portraits of 6 Foods

Comparing per-class density distributions reveals dramatically different patterns for each food. The 6 classes below show the spectrum from high density (visually consistent) to low density (diverse):

  • Songpyeon 🍡: Repeated half-moon shapes and pastel colors → narrow, tall peak. The most "typical" food.
  • Mul-naengmyeon 🍜: Consistent bowl + clear broth composition → high density within the soup cluster.
  • Ramyeon 🍜: Red broth + noodles + scallion combo visually fixed → narrow, concentrated distribution.
  • Samgyeopsal 🥩: Raw, grilling, and cooked stages mixed → distribution splits into two peaks.
  • Meongge 🦑: Raw, prepared, plated forms all vary → broad, low distribution.
  • Gimbap 🍱: Diverse angles (cross-section, side view, plated) → widest low-density distribution.

▲ L2 per-class density distributions. The narrower and taller the peak, the more visually consistent the food; the broader and lower, the greater the photographic diversity.

Level 3: The Korean Food Specialist Lens — Two Worlds Become One

Level 3 applies a 129-dimensional lens specialized for the Korean food dataset, built on top of Wolfram ImageIdentify Net V2. It re-examines the data not through a general-purpose perspective, but through the eyes of a Korean cuisine expert.

🎯 Cluster Unification: Two → One

The most notable change is that the two clusters visible at Level 2 merge into one. Where the general-purpose lens responded to the visual structure of "broth present vs. absent," the domain-specialized lens prioritizes the shared identity of "Korean food."

This has important implications for real-world AI service development. When building a Korean food recognition model, using a general-purpose backbone as-is may treat soup-based and dry foods as entirely different domains. However, using a Korean-food-specialized feature extractor creates a more unified recognition space.
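
One way to check whether a feature extractor keeps soup-based and dry foods apart is to compare the distance between the two group centroids with the spread inside each group. The sketch below is a hypothetical measurement on synthetic features; it does not reproduce how the L3 lens is built, and the `separation_ratio` name and toy data are assumptions.

```python
import numpy as np

def separation_ratio(group_a, group_b):
    """Distance between group centroids divided by the mean within-group spread."""
    a = np.asarray(group_a, dtype=np.float64)
    b = np.asarray(group_b, dtype=np.float64)
    between = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    spread_a = np.linalg.norm(a - a.mean(axis=0), axis=1).mean()
    spread_b = np.linalg.norm(b - b.mean(axis=0), axis=1).mean()
    return between / ((spread_a + spread_b) / 2.0)

rng = np.random.default_rng(0)

# Toy "general-purpose" features: the two groups sit far apart.
soup_general = rng.normal(size=(100, 16)) + 6.0
dry_general = rng.normal(size=(100, 16)) - 6.0

# Toy "domain-specialized" features: both groups share one centre.
soup_special = rng.normal(size=(100, 16))
dry_special = rng.normal(size=(100, 16))

ratio_general = separation_ratio(soup_general, dry_general)
ratio_special = separation_ratio(soup_special, dry_special)
print(f"general: {ratio_general:.2f}, specialized: {ratio_special:.2f}")
```

A ratio well above 1 indicates two distinct clusters; a ratio near 0 indicates that the groups overlap, as the merged L3 distribution suggests.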

The unified distribution can be seen in the L3 PCA visualization:

L3 PCA — Korean food domain-specialized lens class distribution

📈 Distribution: Still Bell-shaped — Stable

Even under the domain-specialized lens, the overall distribution maintains its bell shape. That the healthy distribution form is preserved even as the clusters merge suggests that domain specialization yields a meaningful representation rather than mere compression.

L3 overall density plot — Korean food domain-specialized

🔬 L3 Per-class Density — What the Domain Lens Changed

How did per-class distributions change from L2 to L3? Let us examine the notable changes across 6 classes:

  • Songpyeon 🍡: High density maintained at L3. The consistency of Chuseok's signature food remains unchanged under the Korean food lens.
  • Mul-naengmyeon 🍜: The two naengmyeon variants (mul and bibim) are grouped more closely in the Korean food space.
  • Ramyeon 🍜: Ramyeon's distinctive red broth serves as an excellent identifier under the Korean food lens.
  • Samgyeopsal 🥩: Variance from cooking state differences persists at L3, an inherent characteristic of the meat domain.
  • Meongge 🦑: Distribution remains broad even under the Korean food lens due to its unique form.
  • Gimbap 🍱: Low-density pattern persists at L3. Angle diversity is not resolved even by domain specialization.

▲ L3 per-class density distributions. Compared to L2, the overall clusters have merged, but the intra-class variance patterns are preserved according to each food's inherent characteristics.

Outlier Analysis — Why Songpyeon Is the Most "Typical"

DataClinic identifies the most typical samples (high density) and the most atypical samples (low density) in the dataset through density-based outlier analysis.
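
The report does not describe how DataClinic computes density, so the following is only a stand-in: a k-nearest-neighbour score in which samples surrounded by close neighbours score near 1 and isolated samples score near 0. The function name, `k=5`, and the toy data are assumptions.

```python
import numpy as np

def knn_density(features, k=5):
    """Score each row by 1 / (1 + mean distance to its k nearest neighbours)."""
    x = np.asarray(features, dtype=np.float64)
    # Pairwise Euclidean distances (fine for small N; large sets need an index).
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each sample's distance to itself
    nearest = np.sort(dists, axis=1)[:, :k]
    return 1.0 / (1.0 + nearest.mean(axis=1))

rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(50, 8))  # near-identical samples
stray = np.full((1, 8), 5.0)                 # one isolated sample
scores = knn_density(np.vstack([tight, stray]))

print(f"typical sample score: {scores[:50].mean():.3f}")
print(f"isolated sample score: {scores[50]:.3f}")
```

Sorting by such a score gives both ends of the analysis at once: the top of the list is the "typical" core of the data, and the bottom is the outlier candidates.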

🏆 High-density Samples — What AI Considers "Typical Korean Food"

High-density samples — images located at the very center of the feature space — are predominantly Songpyeon and Mul-naengmyeon. This is no coincidence.

Songpyeon is a Korean Chuseok (Thanksgiving, 15th day of the 8th lunar month) holiday food, made by filling rice flour dough with sesame, red bean, or chestnut, shaping it into a half-moon, and steaming it on pine needles. It is an extremely visually consistent food:

  • Uniform half-moon silhouette
  • Standardized color palette of white, pink, and green
  • Neatly arranged composition on plates
  • Often simple backgrounds with uniform lighting

From AI's perspective, Songpyeon is a "predictable" image. Since nearly all Songpyeon photos share similar feature vectors, their density is measured as high. Paradoxically, this "typicality" also makes it the number one candidate for Data Diet.

[Figure: high-density samples]
  • Songpyeon (high density), sample 1: density 0.6961
  • Songpyeon (high density), sample 2: density 0.6952
  • Songpyeon (high density), sample 3: density 0.6945
  • Mul-naengmyeon (high density): density 0.6957

⚠️ Low-density Samples — The Identity of Outliers

The top-ranking low-density outliers are Gimbap, Sundae, Kkaennip-jangajji, and Samgyeopsal. Their common trait is highly variable shooting angles, plating styles, and cooking states:

  • Gimbap — cross-section view (revealing pickled radish and egg fillings) vs. side view (cylindrical exterior). Often confused with Japanese maki rolls by foreigners, but Gimbap is an entirely different food using sesame-oil-coated rice and Korean-style ingredients. This visual diversity is the cause of its low density.
  • Samgyeopsal — pink before grilling vs. brown after grilling. The color difference is extreme.
  • Kkaennip-jangajji (pickled perilla leaves) — plated alone on a dish vs. used as a wrap. A classic banchan dilemma.
  • Sundae (Korean blood sausage) — whole form vs. sliced cross-section view.

[Figure: low-density outlier samples]
  • Gimbap (low density): density 0.0513 ⚠️
  • Sundae (low density): density 0.0541 ⚠️
  • Kkaennip-jangajji (low density): density 0.0552 ⚠️
  • Samgyeopsal (low density): density 0.0566 ⚠️

↔️ Most Different Pair: Hangwa vs. Gimbap

Similarity analysis identified the most distant image pair in the feature space: a Hangwa image and a Gimbap image. Hangwa is a golden-brown, dry, and standardized confectionery form, while Gimbap is a black-and-white cylinder with a colorful cross-section. The two are polar opposites in color, texture, and shape.

Hangwa (Korean traditional confectionery) is made from rice, beans, and honey, and is an indispensable food at ceremonies such as weddings and ancestral rites. Its neat appearance and standardized form make it a high-density food for AI. Gimbap, on the other hand, produces a completely different image depending on the shooting angle.

[Figure: Hangwa (Korean confectionery), the reference image, vs. Gimbap, the image most distant from it]

▲ The most distant pair in feature space. Polar opposites in color, texture, and shape.
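
Finding the most distant pair amounts to an argmax over pairwise distances. The sketch below assumes Euclidean distance on feature vectors (the report does not name the metric) and uses a tiny hand-made array rather than real features.

```python
import numpy as np

def farthest_pair(features):
    """Return (i, j, distance) for the two rows with the largest Euclidean distance."""
    x = np.asarray(features, dtype=np.float64)
    sq_norms = (x ** 2).sum(axis=1)
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, clipped to absorb rounding error.
    sq_dists = np.clip(sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T), 0.0, None)
    i, j = np.unravel_index(np.argmax(sq_dists), sq_dists.shape)
    return int(i), int(j), float(np.sqrt(sq_dists[i, j]))

# Hypothetical 2-D features: three points near the origin and one far away.
toy_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
i, j, dist = farthest_pair(toy_features)
print(i, j, round(dist, 3))
```

The full pairwise matrix grows quadratically with the number of images, so at the scale of 150,507 images this would need to be run per class, on class means, or in blocks.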

Recommendations — A Data Diet Prescription

DataClinic recommends a Data Diet for this dataset. Despite the textbook-quality class balance, the quality score of 71 (Fair) is held back by insufficient visual diversity in certain classes.

🥗 What Is a Data Diet?

A Data Diet is not simply about reducing data. It identifies and removes near-identical images concentrated in high-density regions, helping the model learn more diverse patterns.

  • Songpyeon — Dense with half-moon-shaped, pastel-colored images. Supplementing with images of the shaping process, rustic styles, and varied lighting conditions would make the model more robust in real-world environments.
  • Mul-naengmyeon — Repetitive center-of-bowl compositions. Supplementing with diverse angles and plating styles is recommended.
  • High-density classes overall — Ideally, duplicate images should be removed and replaced with images from diverse shooting environments such as restaurants, homes, and street food stalls.
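
The removal half of a Data Diet can be sketched as a greedy near-duplicate filter over feature vectors. This is a simplified stand-in, not DataClinic's procedure: the cosine-similarity criterion, the 0.98 threshold, and the `data_diet` name are all assumptions, and the input vectors are assumed to be non-zero.

```python
import numpy as np

def data_diet(features, threshold=0.98):
    """Greedy near-duplicate filter.

    Keeps a sample only if its cosine similarity to every already-kept
    sample is below `threshold`. Returns the indices of kept samples.
    """
    x = np.asarray(features, dtype=np.float64)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)  # assumes no zero vectors
    kept = []
    for idx, vec in enumerate(unit):
        if not kept or (unit[kept] @ vec).max() < threshold:
            kept.append(idx)
    return kept

# Hypothetical features: samples 0 and 1 are near-duplicates, as are samples 2 and 3.
toy_features = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 1.0]]
kept_indices = data_diet(toy_features)
print(kept_indices)
```

The threshold controls how aggressive the diet is: a lower value removes more images from dense classes such as Songpyeon, so it is worth tuning per class rather than fixing one value for the whole dataset.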

💊 Is Data Bulkup (Augmentation) Unnecessary?

Since the current class balance is excellent, a Diet is more urgent than minority class augmentation. However, for atypical seafood classes like Meongge, Gwamegi, and Jeotgal, adding images from varied shooting conditions could still significantly improve model robustness.

🎯 Key Summary

✅ Class balance: Textbook-level (std. dev. 16.8)

✅ Missing values: 0.07%, negligible

✅ Channel composition: 99.42% RGB

⚠️ Resolution range: Preprocessing standardization needed

⚠️ High-density class duplicates: Data Diet recommended

📈 Expected improvement: 71 → 80s possible after Data Diet

The full diagnostic results and detailed per-class analysis of all 150 classes are available at DataClinic Report #59.

Conclusion — Three Discoveries from 150 Korean Foods in Data

From the diagnosis of Korean Food Image dataset #59, we obtained three key discoveries.

First, broth culture manifests in data. When features were extracted with a general-purpose AI (Wolfram ImageIdentify Net V2), Korean food naturally formed two clusters: soup-based and dry food groups. Without any recipe or ingredient information, purely through visual patterns, the model captured an essential structure of Korean food culture. An AI that has never tasted Korean food picked out, on its own, the divide between dishes served with broth and dishes served without.

Second, domain specialization creates unification, not division. When the Korean-food-specialized lens (L3, 129 dimensions) was applied, the two clusters merged into one. This demonstrates how domain knowledge reinterprets data. A lens that understands Korean food treats the shared identity of galbitang and samgyeopsal as "Korean food" as more important than their visual differences.

Third, balanced counts do not guarantee balanced learning. Despite a textbook-level class balance with a standard deviation of 16.8, the learning difficulty gap between Songpyeon (extremely low intra-class variance) and Gimbap (extremely high intra-class variance) is enormous. Removing high-density duplicates via Data Diet is expected to raise the score from 71 to the 80s.

📊 Diagnosis at a Glance

Level | Key Finding | Implication
L1 | Class balance std. dev. 16.8 | Extremely low risk of biased learning
L2 | General AI forms soup/dry 2-cluster split | Possible soup-vs-dry confusion with general backbone
L3 | Clusters merge under Korean food lens | Clear benefit of domain-specialized feature extractor
Outliers | Songpyeon high density, Gimbap low density | Data Diet target: remove high-density class duplicates

For more dataset diagnostics and DataClinic use cases, visit dataclinic.ai.