2026.03 · Pebblous Data Communication Team

Reading time: ~12 min

Executive Summary

This article is based on the analysis results from DataClinic Report #224. PBLS_Military is a synthetic image dataset of 10 military equipment classes generated via computer graphics, including Korea's key defense export products. Comprising 3,171 images created by combining various scenarios, environments, and backgrounds without actual combat photography, it received a DataClinic overall score of 68 (Fair).

  • DataClinic Overall Score: 68
  • Weapon System Classes: 10
  • Total Images: 3,171
  • Images per Class (Perfectly Balanced): 216

DataClinic Grade Summary

  • L1 Integrity: Good
  • L1 Missing: Good
  • L1 Class Balance: Good
  • L1 Statistics: Bad
  • L2 DataLens: No Issues
  • L2 Geometry: Good
  • L2 Distribution: Bad
  • L3 DataLens: No Issues
  • L3 Geometry: Fair
  • L3 Distribution: Fair

Why Synthetic Data? — Building the Battlefield with Data

Imagine photographing an actual K-2 Black Panther tank in deserts, snowfields, and urban ruins under different lighting and weather conditions. Deploying more than a dozen tanks, each costing millions of dollars, to stage hundreds of scenes is practically impossible. That is why defense AI researchers turn to synthetic data.

Synthetic data consists of images generated via computer graphics, so training scenes can be produced in a very large number of combinations without any actual photography. PBLS_Military puts this concept into practice.

🎮

Infinite Environment Combos

Combat scenarios impossible to film in real life, built with CG

⚖️

Perfect Class Balance

Exactly 216 images for each of the 10 classes, so no class is over-represented in training

🔒

No Security Concerns

No need to photograph real military facilities, zero classified info exposure

The Battlefield Blueprint Hidden in Synthetic Data Filenames

PBLS_Military filenames are not simple numbers. They encode the exact battlefield conditions under which each image was rendered.

sn3_en4_bg9_mt01.png

  • sn: Scenario (1~4), camera angle & situation
  • en: Environment (1~6), lighting & weather
  • bg: Background (1~9), terrain & backdrop type
  • mt: Model Type (01~10), weapon system type

📐 Theoretical maximum combinations: 4 × 6 × 9 × 10 = 2,160 scenes. That is 4 × 6 × 9 = 216 scenario-environment-background combinations per model type, the same number as the 216 images reported per class.
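
The naming scheme can be read back mechanically. The snippet below is a minimal sketch, assuming every filename follows the single example shown in this article and that each field starts at 1; the names `parse_scene_code`, `SceneCode`, and `RANGES` are illustrative and not part of any DataClinic tooling.

```python
import re
from typing import NamedTuple

# Pattern inferred from the one example filename "sn3_en4_bg9_mt01.png".
FILENAME_RE = re.compile(r"^sn(\d+)_en(\d+)_bg(\d+)_mt(\d+)\.png$")

# Upper bounds as listed in the article (assumed to start at 1).
RANGES = {"scenario": 4, "environment": 6, "background": 9, "model_type": 10}


class SceneCode(NamedTuple):
    scenario: int
    environment: int
    background: int
    model_type: int


def parse_scene_code(filename: str) -> SceneCode:
    """Split a PBLS_Military-style filename into its four numeric fields."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    code = SceneCode(*(int(group) for group in match.groups()))
    for field, upper in RANGES.items():
        value = getattr(code, field)
        if not 1 <= value <= upper:
            raise ValueError(f"{field}={value} is outside 1..{upper}")
    return code


# Size of the rendering grid: per model type, and across all model types.
per_class = RANGES["scenario"] * RANGES["environment"] * RANGES["background"]
total = per_class * RANGES["model_type"]

print(parse_scene_code("sn3_en4_bg9_mt01.png"))
print(per_class, total)
```

Parsing into named fields rather than slicing by position keeps the check readable and rejects filenames that fall outside the documented ranges.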

Dataset Introduction — PBLS_Military

PBLS_Military is a synthetic military image dataset built by Pebblous, a Korean defense industry company. It includes 10 classes of military equipment — Korea's flagship defense exports and enemy vehicles essential for training. Set against winter and autumn natural environments, it comprises 3,171 HD widescreen images (up to 1,344×768px).

PBLS_Military — Collage of 10 Military Equipment Representative Images (DataClinic L1 Analysis)

▲ PBLS_Military Representative Image — K-9 Thunder Self-Propelled Howitzer (High-density typical sample, density 0.248)

⚠️ Not Available for Commercial Use

The PBLS_Military dataset contains military equipment images and is not licensed for commercial use. It may only be used for non-commercial purposes such as research, education, and defense AI development.

Level 1 — Basic Quality Diagnosis

Mean Images per Class — Each Weapon's "Archetype" Through AI's Eyes

A mean image is the pixel-level average of all images in a given class. It is normal for them to appear blurry — they show the common contours of overlapping images. The sharper the mean image, the more visually similar the images in that class are.

▲ Mean images (6 of the 10 classes shown): K-2 Black Panther, K200 KIFV, K806 WAV, BMP-3 (Enemy), AH-1 Cobra, Military Jeep
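
To make the idea of a mean image concrete, here is a minimal NumPy sketch. It assumes the images of one class are already loaded into a single array of shape (N, H, W, 3); the toy "edge" images and the `pixel_spread` helper are invented for illustration and are not how DataClinic measures similarity.

```python
import numpy as np


def mean_image(images: np.ndarray) -> np.ndarray:
    """Average a stack of same-sized RGB images, shape (N, H, W, 3), pixel by pixel."""
    if images.ndim != 4 or images.shape[0] == 0:
        raise ValueError("expected a non-empty array of shape (N, H, W, 3)")
    # Average in float to avoid uint8 overflow, then round back to 8-bit.
    return np.rint(images.astype(np.float64).mean(axis=0)).astype(np.uint8)


def pixel_spread(images: np.ndarray) -> float:
    """Mean per-pixel standard deviation across the stack (0 means identical images)."""
    return float(images.astype(np.float64).std(axis=0).mean())


def edge_image(col: int, height: int = 8, width: int = 16) -> np.ndarray:
    """Toy image: black on the left of `col`, white from `col` onward."""
    img = np.zeros((height, width, 3), dtype=np.uint8)
    img[:, col:, :] = 255
    return img


# A class whose images are all alike versus one whose composition varies.
similar = np.stack([edge_image(8) for _ in range(10)])
varied = np.stack([edge_image(c) for c in range(3, 13)])

print(pixel_spread(similar), pixel_spread(varied))
```

The `similar` stack averages back to a crisp edge, while the `varied` stack averages to a gradual ramp, which is the blur described above.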

✅ Strengths

  • 📐 Perfect Class Balance: Std. deviation 0.0 — exactly 216 images for all 10 classes
  • 🎨 RGB Channel Consistency: All images in RGB format, no grayscale or RGBA contamination
  • Zero Missing Values: No corrupted files, no empty images
  • 🖼️ HD Resolution: 1,338~1,344 × 768px widescreen rendering
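
The class-balance figure in the list above can be reproduced with a simple count. This is a sketch only: the `mt01` to `mt10` labels are placeholders derived from the filename scheme, and using the population standard deviation is an assumption about how the 0.0 value was obtained.

```python
from collections import Counter
from statistics import pstdev


def class_balance(labels):
    """Return (images per class, population standard deviation of those counts)."""
    counts = Counter(labels)
    return dict(counts), pstdev(list(counts.values()))


# Placeholder labels: 216 images for each of 10 model types.
balanced = [f"mt{m:02d}" for m in range(1, 11) for _ in range(216)]
counts, spread = class_balance(balanced)
print(len(counts), spread)

# Adding ten extra images to one class makes the spread non-zero.
_, skewed_spread = class_balance(balanced + ["mt01"] * 10)
print(skewed_spread)
```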

⚠️ Cautions

  • 📊 L1 Statistics: Bad — Lack of visual diversity
  • 🔄 Similar Compositions: Most images share similar framing
  • 📁 Synthetic Data Limitations: Texture and lighting realism gap vs. real-world images

💡 DataClinic Insight: Synthetic data outperforms real-world data on class balance and missing values. However, limited visual diversity (L1 Statistics: Bad) is a chronic weakness of synthetic data, because all images share a similar rendering style. When using this dataset for actual AI training, we recommend supplementing it with Domain Randomization techniques or mixing in real-world data.

Level 2 — DataLens Analysis (Wolfram ImageIdentify Net V2)

Level 2 uses Wolfram's ImageIdentify Net V2, trained on 3 million images, as a lens. Although this neural network was not specifically trained on military equipment, it analyzes the data through general visual patterns (shape, texture, color). Let's examine how PBLS_Military data is distributed in a 1,280-dimensional feature space.

▲ Level 2 PCA Distribution — Feature space distribution of 10 classes (Wolfram ImageIdentify Net V2)

▲ Level 2 Density Landscape — Cluster distribution of all data (single cluster)
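
A PCA view like the one captioned above reduces each 1,280-dimensional feature vector to two coordinates. The sketch below shows one standard way to do that with an SVD; the random `features` array merely stands in for the network embeddings, and the article does not say which PCA implementation DataClinic uses.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int = 2):
    """Project row-vector features onto their top principal components via SVD."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


# Placeholder embeddings: 200 samples in a 1,280-dimensional feature space.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 1280))

projected, explained = pca_project(features)
print(projected.shape, explained)
```

The `explained` values give the share of variance captured by each axis, which helps judge how much of the original structure a 2D plot can show.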

💡 L2 Key Finding — To a general-purpose AI, everything looks "similar": Wolfram's general-purpose neural network perceives all military equipment as a single cluster. Whether it's a K-2 tank or a Cobra helicopter, the general AI sees them all as "military equipment on a yellow-green background." The low density and multimodal distribution arise because 10 distinct equipment types are forcibly grouped together under a general-purpose lens. This is precisely why a military-specialized domain lens (Level 3) is needed.

Density Plots per Class — Distribution Patterns of Each Equipment

▲ Level 2 density plots per class: K-2 Black Panther, K200 KIFV, K806 WAV, BMP-3 (Enemy), AH-1 Cobra, Military Jeep

Level 3 — Military-Specialized DataLens (79 Dimensions)

Level 3 applies domain-specific optimization. The feature space compressed to 79 dimensions is tuned to maximize discriminability between military equipment. Unlike the general-purpose lens, 3 clusters emerge — naturally grouped by the shape, size, and functional characteristics of the equipment.

▲ Level 3 PCA Distribution — Class separation in domain-optimized 79 dimensions

▲ Level 3 Density Landscape — 3 clusters identified (L2 single cluster → L3 tri-split)

3 Groups Discovered by Military AI

1. Heavy Armored Combat Vehicles

Track-based heavy armored ground combat platforms such as K-2 Black Panther, T-80U, BMP-3, and K200. Low, wide hulls are the common visual feature.

2. SPH & Heavy Artillery

K-9 Thunder, K806, and others with long gun barrels or distinctive hull shapes. Turret proportions and shape serve as classification criteria.

3. Light Support & Air Power

Jeep, Truck, AH-1 Cobra, etc.: relatively light vehicles with vertical profiles. Some cluster boundary confusion occurs at L3.

L3 Density Plots per Class

▲ Level 3 density plots per class: K-2, K-9, K200, K806, T-80U, Military Truck

Outlier Sample Analysis — The Most Striking Scenes for AI

Let's examine the most "typical" and most "unusual" images in the dataset. This analysis reveals which scenes AI models learn as "archetypes" and which scenes may cause confusion.
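
The article does not state how DataClinic computes its density scores. As one plausible illustration, the sketch below scores each sample by the inverse of its mean distance to its k nearest neighbours in feature space; the function name, the choice of k, and the toy 79-dimensional data are all assumptions.

```python
import numpy as np


def knn_density(features: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each row by the inverse mean distance to its k nearest neighbours."""
    sq = (features**2).sum(axis=1)
    # Squared Euclidean distances from the Gram matrix, clipped against rounding.
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    dists = np.sqrt(np.clip(d2, 0.0, None))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    nearest = np.sort(dists, axis=1)[:, :k]
    return 1.0 / (nearest.mean(axis=1) + 1e-12)


# Toy data: 50 tightly packed "typical" samples plus one far-away outlier.
rng = np.random.default_rng(0)
core = rng.normal(0.0, 0.1, size=(50, 79))
outlier = np.full((1, 79), 10.0)
density = knn_density(np.vstack([core, outlier]))

order = np.argsort(density)  # lowest density first
print("most unusual:", order[0], "most typical:", order[-1])
```

Sorting by such a score yields both ends of the ranking at once: the first indices are outlier candidates and the last are the most typical samples.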

🎯 High Density — The "Core" Scenes AI Is Most Confident About (L3)

The K-9 Thunder and K-2 Black Panther occupy the core of the high-density cluster. They are the "face" of the dataset.

  • K-9 (Density 1.285) 🔥
  • K-9 (Density 1.280)
  • K-9 (Density 1.227)
  • K-2 (Density 1.153)
  • K-2 (Density 1.147)
  • K-9 (Density 1.147)

💡 Insight: the dominance of bg5 (Background 5) and en3 (Environment 3). All high-density samples share en3 and bg5, which indicates that one specific lighting-background combination sets the dataset's "standard." It is also linked to the near-duplicate images addressed in the Data Diet recommendation.
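
A quick way to check for this kind of dominance is to tally the environment-background pair encoded in the filenames of the top-ranked samples. The filenames below are made up for illustration; the report does not list the actual files behind the high-density samples.

```python
import re
from collections import Counter

# Captures the "enN" and "bgN" fields of a PBLS_Military-style filename.
CONDITION_RE = re.compile(r"_(en\d+)_(bg\d+)_")


def condition_counts(filenames):
    """Count how often each environment-background pair occurs."""
    counts = Counter()
    for name in filenames:
        match = CONDITION_RE.search(name)
        if match:
            counts["_".join(match.groups())] += 1
    return counts


# Hypothetical top-density filenames, not taken from the report.
top_samples = [
    "sn1_en3_bg5_mt02.png",
    "sn2_en3_bg5_mt02.png",
    "sn4_en3_bg5_mt01.png",
    "sn1_en1_bg2_mt07.png",
]
print(condition_counts(top_samples).most_common(1))
```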

⚠️ Low Density — The Most Confusing Outlier Scenes for AI (L3)

T-80U, BMP-3, and K806 frequently appear as low-density outliers. These scenes carry a high risk of AI misidentification.

  • T-80U (Density 0.283) 🔴
  • BMP-3 (Density 0.306)
  • K806 (Density 0.309)
  • BMP-3 (Density 0.310)
  • K-9 (Density 0.312)
  • K-2 (Density 0.312)

🔄 The Two Most Different Scenes — Extremes of the Dataset

The Military Jeep and K200 KIFV appear as the most visually different pair at L3.

  • Military Jeep: light, vertical profile
  • K200 KIFV: heavy armor, horizontal profile

⬆️ These two scenes are the farthest apart in the L3 feature space, making them the pair the AI distinguishes most clearly.
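
Finding the farthest pair amounts to a maximum over all pairwise distances. The brute-force sketch below assumes Euclidean distance, which the article does not confirm, and uses four toy 2D points in place of the 79-dimensional L3 features.

```python
from itertools import combinations
from math import dist


def farthest_pair(vectors):
    """Return (i, j, distance) for the two vectors with the largest Euclidean distance."""
    return max(
        ((i, j, dist(vectors[i], vectors[j]))
         for i, j in combinations(range(len(vectors)), 2)),
        key=lambda item: item[2],
    )


# Toy feature vectors; indices 2 and 3 are the most distant pair.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (4.0, 0.0)]
i, j, distance = farthest_pair(points)
print(i, j, distance)
```

The pairwise scan grows quadratically with the number of samples, which is still manageable for a dataset of a few thousand images.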

Recommendations — From 68 to a Higher Score

🥗

Data Diet

This is the core improvement recommended by DataClinic. The current data contains numerous near-duplicate images. In particular, images with the en3_bg5 combination are tightly clustered in the density space.

Removing duplicate images and replacing them with more diverse environment combinations can significantly improve the AI model's generalization performance.
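
One simple way to put such a "diet" into practice is greedy filtering on feature similarity. The sketch below keeps an image only if its cosine similarity to every image kept so far stays below a threshold; the 0.98 cutoff and the toy vectors are assumptions, and DataClinic's own procedure is not described in this article.

```python
import numpy as np


def drop_near_duplicates(features: np.ndarray, threshold: float = 0.98) -> list:
    """Greedily keep rows whose cosine similarity to every kept row is below threshold."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    kept = []
    for i in range(len(unit)):
        # Compare against the rows already kept; the first row is always kept.
        if not kept or float(np.max(unit[kept] @ unit[i])) < threshold:
            kept.append(i)
    return kept


# Toy features: rows 0 and 1 are near-duplicates, as are rows 2 and 3.
toy = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 1.0]])
print(drop_near_duplicates(toy))
```

The result depends on input order, because the first member of each near-duplicate group is the one that survives.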

🌍

Expand Environmental Diversity

The current data is dominated by "winter and autumn natural environments." For AI models to operate across diverse battlefields, desert, urban ruin, jungle, and nighttime environment data is also needed.

Domain Randomization: A technique that randomly varies background textures, lighting directions, and weather effects to enhance the AI model's real-world adaptability.
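
True domain randomization is applied in the renderer, where textures, lights, and weather can be changed directly. As a rough image-space stand-in, the sketch below perturbs brightness, colour cast, and noise, and randomly flips an already rendered frame; all parameter ranges are arbitrary assumptions.

```python
import numpy as np


def randomize(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply random brightness, colour cast, noise, and a random horizontal flip."""
    out = image.astype(np.float64)
    out *= rng.uniform(0.6, 1.4)                   # overall lighting intensity
    out *= rng.uniform(0.9, 1.1, size=(1, 1, 3))   # per-channel colour cast
    out += rng.normal(0.0, 5.0, size=out.shape)    # sensor-style noise
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                      # horizontal flip
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# A flat grey placeholder frame standing in for one rendered image.
frame = np.full((4, 6, 3), 128, dtype=np.uint8)
variant_a = randomize(frame, np.random.default_rng(0))
variant_b = randomize(frame, np.random.default_rng(1))
print(variant_a.shape, variant_a.dtype)
```

Passing the random generator in explicitly keeps each variant reproducible from its seed.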

🔥 Overrepresentation of "Fire" Scenes in Some Clusters — L3 Analysis

The Level 3 analysis reveals that fire scenes (explosions and flames) appear disproportionately often in certain clusters. If they remain concentrated there, the AI risks incorrectly learning "fire = that equipment type." Adjusting the proportion of fire scenes or distributing them evenly across classes is recommended.
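
Whether fire scenes are unevenly spread can be checked with a per-cluster share. The sketch below assumes each image already carries a cluster ID and a fire or no-fire flag; both label lists are invented here, since the report does not publish them.

```python
from collections import Counter


def fire_share_by_cluster(cluster_ids, has_fire):
    """Return the fraction of fire scenes within each cluster."""
    totals = Counter(cluster_ids)
    fires = Counter(c for c, fire in zip(cluster_ids, has_fire) if fire)
    return {cluster: fires[cluster] / totals[cluster] for cluster in totals}


# Hypothetical labels for eight images across three clusters.
cluster_ids = [1, 1, 1, 1, 2, 2, 3, 3]
has_fire = [True, True, True, False, False, False, True, False]

shares = fire_share_by_cluster(cluster_ids, has_fire)
print(shares)
```

Large gaps between the shares would suggest rebalancing fire scenes before training.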

Conclusion — Possibilities and Limits of the Synthetic Battlefield

PBLS_Military is a meaningful starting point for defense AI research. Perfect class balance, HD resolution, and systematic environment combinations are advantages that are difficult to achieve with real photography. The very fact that globally acclaimed Korean defense exports like the K-2 Black Panther and K-9 Thunder appear as AI training data reflects the elevated status of Korea's defense industry.

A DataClinic score of 68 is "a good start." By cleaning up duplicate images (Data Diet) and expanding environmental diversity, reaching the 80s is achievable. Going further, if this synthetic data is combined with real-world photographs (Hybrid Dataset), it will bring us one step closer to developing combat-ready defense AI models.

AI can learn the battlefield without live fire. Improving the quality of that learning is the next challenge for defense synthetic data.

PBLS_Military Key Summary Card

  • DataClinic Overall: 68
  • Weapon System Classes: 10
  • Synthetic Images: 3,171
  • HD: 1,344×768px

Original DataClinic Report: dataclinic.ai/en/report/224 · Not for commercial use