2026.03 · Pebblous Data Communication Team

Reading time: ~13 min

Executive Summary

This article is based on the analysis from DataClinic Report #124. The Military Border Operations Synthetic Data is a synthetic dataset specialized for maritime border surveillance, built by Pebblous under a National Information Society Agency (NIA) project. It comprises 149,447 images (88GB) of EO (electro-optical) and IR (infrared thermal) dual-sensor imagery and earned a DataClinic overall score of 88 (Good). It predates PBLS_Drone (87) and PBLS_Military (68), making it the origin of Pebblous defense synthetic data.

88
DataClinic Overall Score
149,447
Total Images
88GB
Dataset Size
EO+IR
Dual-Sensor Modality

DataClinic Grade Summary

Level | Item | Grade | Note
L1 | Integrity | Bad | Image size ratio exceeds threshold (variable resolution)
L1 | Missing | Good |
L1 | Class Balance | N/A | Single class
L1 | Statistics | Good |
L2 | DataLens | N/A |
L2 | Geometry | Good |
L2 | Distribution | Good |
L3 | (all items) | Not Diagnosed | This report covers up to L2 only

Why Border Surveillance AI? — Threats from the Night Sea

South Korea's coastline stretches approximately 15,000km. It is physically impossible to monitor this entire coastline 24/7 with human personnel alone. Detecting infiltrations that are invisible to the naked eye — especially at night, in fog, heavy rain, or the bitter cold of winter — is a core challenge of modern border operations.

North Korean submarine infiltration, small rubber boat maritime intrusion, and underwater approach are all real threats. AI-based border surveillance systems must learn all these scenarios and respond faster and more accurately than humans. However, actual infiltration scenes cannot be filmed or collected — that is why synthetic data is necessary.

🌊

15,000km Coastline

24/7 human monitoring impossible — AI automated alert system essential

🌑

Night & Adverse Weather

Real infiltrations exploit night, fog, and storms — EO cameras alone have limits

🏭

Role of Synthetic Data

Real infiltration scenes cannot be filmed — CG generates all infiltration scenarios

NIA AI Data Construction Project: The National Information Society Agency (NIA) supports AI training data construction through AI Hub (aihub.or.kr) to advance national AI capabilities. The Military Border Operations Synthetic Data was built by Pebblous under this program, aiming to establish public data infrastructure for defense AI research.

Dataset Introduction — Pebblous' First Defense Synthetic Data

The Military Border Operations Synthetic Data is Pebblous' earliest defense synthetic dataset, with DataClinic diagnosis completed in early 2025. As the predecessor of PBLS_Military (ground equipment) and PBLS_Drone (drone recognition), it marks the starting point where Pebblous began accumulating defense synthetic data expertise.

Military Border Operations Synthetic Data — Representative Image Collage (DataClinic L1 Analysis)

Military Border Operations Synthetic Data Representative — EO Night High-Angle Surveillance

High-density representative sample — EO night W6 H7 condition (density 0.664, dataset maximum)

📊 Dataset Specifications

  • 🖼️ 149,447 images (diagnosed: 149,446)
  • 📦 88GB
  • 📐 960x540 ~ 1920x1080 — variable resolution
  • 🎨 RGB channels — both EO and IR encoded as RGB
  • 🏷️ Single class — "images" (border surveillance scenes)
  • 📅 2025.02.24 diagnosis completed
  • 🏛️ Source: NIA (National Information Society Agency)

🎯 Data Characteristics

  • 📷 EO (Electro-Optical) + IR (Infrared Thermal) dual sensor
  • 🌙 Night (NT) + Day (DT) time periods
  • ❄️ Summer (SU) + Winter (WI) seasons
  • 🌧️ 7-level weather conditions (W1~W7)
  • 📡 7-level camera altitude/angle (H1~H7)
  • 🎯 Single to compound infiltration scenarios

⚠️ Not Available for Commercial Use

The Military Border Operations Synthetic Data was built for defense-specific purposes through an NIA project and is not licensed for commercial use. It may only be used for research, education, and defense AI development purposes.

Decoding the Filename — Complete 7-Dimensional Condition Encoding

The filenames in this dataset are not simple numbers. They fully encode which sensor, which season/time/weather, what altitude, and what infiltration combination was captured using 7 codes. This structure itself reveals the design philosophy of defense synthetic data.

Filename Structure Decoded

EO_SU_NT_W6_H7_B5_0027.jpg

  • EO / IR: Sensor Type (EO=Electro-Optical, IR=Infrared Thermal)
  • SU / WI: Season (SU=Summer, WI=Winter)
  • NT / DT: Time of Day (NT=Night, DT=Day)
  • W1~W7: Weather, 7 levels (Clear→Severe)
  • H1~H7: Camera Altitude/Angle (H1=Close-range Low, H7=Long-range High)
  • A1~E5: Infiltration Code, Single/Compound (core of the edge cases)
  • 0001~0027: Scene Number (camera position/angle sequence)
💡 What the Codes Reveal — The Dominance of W6_H7_0027: The condition dominating the high-density top ranks is EO_SU_NT_W6_H7_*_0027. EO sensor, summer night, W6 weather, H7 (long-range high angle), highest scene number (0027). This is the "standard surveillance scenario" of this dataset. Conversely, low-density outliers are dominated by IR, H1, DT, WI combinations — IR thermal, close-range, daytime, winter are the edge cases.
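The encoding above lends itself to a small parser. The following is a minimal sketch, not the dataset's official tooling: the regular expression is inferred from the single example filename, and it assumes that compound targets are written as concatenated two-character codes (for example A3A2A1) and that every file ends in .jpg. The names `FILENAME_RE`, `SceneCondition`, and `parse_filename` are hypothetical.

```python
import re
from dataclasses import dataclass

# Pattern inferred from the example "EO_SU_NT_W6_H7_B5_0027.jpg" (assumption).
# Compound targets are assumed to be concatenated codes such as "A3A2A1".
FILENAME_RE = re.compile(
    r"^(?P<sensor>EO|IR)_(?P<season>SU|WI)_(?P<time>NT|DT)"
    r"_W(?P<weather>[1-7])_H(?P<altitude>[1-7])"
    r"_(?P<targets>(?:[A-E][1-5])+)_(?P<scene>\d{4})\.jpg$"
)


@dataclass(frozen=True)
class SceneCondition:
    sensor: str        # "EO" or "IR"
    season: str        # "SU" or "WI"
    time_of_day: str   # "NT" or "DT"
    weather: int       # 1..7
    altitude: int      # 1..7
    targets: tuple     # e.g. ("A3", "A2", "A1")
    scene: int         # 1..27 in the article's description


def parse_filename(name: str) -> SceneCondition:
    """Split a filename into its seven condition fields."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name!r}")
    return SceneCondition(
        sensor=m["sensor"],
        season=m["season"],
        time_of_day=m["time"],
        weather=int(m["weather"]),
        altitude=int(m["altitude"]),
        targets=tuple(re.findall(r"[A-E][1-5]", m["targets"])),
        scene=int(m["scene"]),
    )


condition = parse_filename("EO_SU_NT_W6_H7_B5_0027.jpg")
print(condition)
```

Parsing into typed fields rather than slicing strings makes it straightforward to group or filter images by any of the seven conditions later on.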

Edge Cases — The Spectrum of Infiltration Scenarios

The infiltration target codes (A1~E5 combinations) in the filename are the true core of this dataset. From solo infiltration to compound infiltrations of 3-5+ persons, it covers the full range of tactical infiltration scenarios. These combinations are exactly the "edge cases of enemy infiltration".

EO night solo infiltration B5
EO Visible Light Night

Solo Infiltration — B5

W6 weather / H7 long-range high angle / Scene 0027. Dataset max density (0.664). The standard surveillance scene AI detects most easily.

EO night 3-person compound infiltration A3A2A1
EO Visible Light Night

3-Person Compound — A3+A2+A1

Same environment with 3 infiltrators. Density 0.660 — still high-density. Compound infiltration remains detectable in standard night surveillance.

EO night mixed infiltration E3D3
EO Visible Light Night

Mixed Infiltration — E3+D3

Different type (E-series + D-series) infiltration combination. Density 0.628 — lower high-density range. Mixed infiltration begins to increase AI detection difficulty.

IR night 2-person infiltration A4A3
IR Thermal Night

IR Night 2-Person — A4+A3

Switching to thermal camera causes density to plummet (0.105). Same scene, but AI recognizes IR images as "atypical" — the sensor modality barrier.

EO winter daytime infiltration B3A3
Daytime Winter

Winter Daytime — B3+A3

Winter (WI) + Daytime (DT) combination. Density 0.087 — extreme low-density edge case. Winter background and daylight create a completely different visual pattern from standard night surveillance.

EO daytime close-range 2-person infiltration A1A5
Daytime Top Edge Case

Daytime Close-Range — A1+A5

DT (daytime) + H1 (close-range) + compound target. Density 0.086 — overall minimum. The scene AI finds most confusing. Pre-alert situation.

💡 Three Axes of Edge Cases — What This Dataset Captures:
  • Sensor Switch: AI recognition difficulty spikes when switching from EO (visible) to IR (thermal)
  • Environmental Extremes: Winter (WI) + Daytime (DT) + Close-range (H1) is the most atypical combination
  • Infiltration Complexity: Detection difficulty increases from solo (B5) to same-type group (A3A2A1) to mixed-type compound (E3D3, B3A3)
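The third axis can be derived directly from the infiltration code. The sketch below assumes that the letter of each two-character code identifies the target series (the article speaks of "E-series" and "D-series") and that a compound code is a plain concatenation; the function name `classify_infiltration` and the three category labels are illustrative, not part of the dataset specification.

```python
import re


def classify_infiltration(code: str) -> str:
    """Return 'solo', 'same-type group', or 'mixed-type compound'.

    Assumes the letter (A-E) is the target series and the digit (1-5)
    is a variant within that series.
    """
    targets = re.findall(r"[A-E][1-5]", code)
    if not targets or "".join(targets) != code:
        raise ValueError(f"unrecognized infiltration code: {code!r}")
    if len(targets) == 1:
        return "solo"
    series = {target[0] for target in targets}
    return "same-type group" if len(series) == 1 else "mixed-type compound"


for example in ("B5", "A3A2A1", "E3D3", "B3A3"):
    print(example, "->", classify_infiltration(example))
```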

Level 1 — Basic Quality Diagnosis

Overall Mean Image — The "Archetype" of Border Surveillance Scenes

This is the result of pixel-wise averaging 149,446 images. EO and IR, night and day, summer and winter all blend together into a blurry composite. It reveals the common structure of border surveillance scenes: sea and sky in the background with small infiltration subjects in the center.

Overall Mean Image — 149,446 images pixel average (DataClinic L1)

✅ L1 Strengths

  • 🎨 RGB Channel Consistency: Both EO and IR unified as RGB encoding
  • Zero Missing Values: No corrupted or empty images among 149,447
  • 📊 L1 Statistics: Good — Rich structural and textural diversity
  • 🗂️ 149,447 images — Largest scale among the three datasets

⚠️ L1 Cautions

  • 📐 Integrity: Bad — Variable resolution issue
  • 🔀 960x540 ~ 1920x1080: Same aspect ratio but size threshold exceeded
  • 📋 No Labels: Single class for unsupervised learning
💡 Practical Implications of Variable Resolution: The coexistence of 960x540 and 1920x1080 images means resolution varied depending on shooting distance, equipment, and rendering conditions. Input image normalization is essential for AI model training, and multi-scale approaches that tolerate mixed resolutions, such as FPN-based detectors or sliced inference (SAHI), are recommended. This integrity issue is the main reason the score did not exceed 88.
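A quick way to reproduce this kind of finding on your own copy is to compare image sizes directly. The sketch below is only an illustration: DataClinic's actual integrity rule and threshold are not published in this article, so `max_area_ratio=2.0` is a hypothetical value, and `resolution_report` is a made-up helper name.

```python
from fractions import Fraction


def resolution_report(sizes, max_area_ratio=2.0):
    """Summarize (width, height) pairs.

    max_area_ratio is a hypothetical threshold, not DataClinic's.
    """
    sizes = list(sizes)
    areas = [w * h for w, h in sizes]
    aspects = {Fraction(w, h) for w, h in sizes}
    ratio = max(areas) / min(areas)
    return {
        "aspect_ratios": sorted(aspects),
        "same_aspect": len(aspects) == 1,
        "area_ratio": ratio,
        "exceeds_threshold": ratio > max_area_ratio,
    }


# The two resolution extremes named in the article.
report = resolution_report([(960, 540), (1920, 1080)])
print(report)
```

Both extremes reduce to the same 16:9 aspect ratio, so the problem is pixel count rather than shape, which is why resizing to one target resolution is enough to normalize the inputs.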

Level 2 — DataLens Analysis (Wolfram ImageIdentify Net V2)

At Level 2, Wolfram's general-purpose neural network analyzes 149,446 images in a 1,280-dimensional feature space. Despite being a single class, the visual difference between EO and IR is clearly visible in the distribution.

Level 2 PCA Distribution — EO/IR mixed distribution in 1280-dim feature space (Wolfram ImageIdentify Net V2)

Level 2 Density Terrain Map — 1 major cluster, mean density 0.211 (low)

L2 Key Metrics

1,280
Observed Dimensions
0.211
Mean Density (Low)
3.6%
Outlier Ratio
1
Major Clusters
💡 Why Mean Density 0.211 Is the Lowest Among the Three Datasets: The 0.211 mean is lower than that of PBLS_Drone (0.300) and PBLS_Military because EO and IR images are spread across two distant regions of the embedding space. Visible-light and thermal images have completely different visual characteristics at the pixel level, so the general-purpose network places them far apart. As a result, the overall distribution is dispersed, which lowers the mean density. This is not a weakness; it is evidence that the data richly covers diverse visual conditions.
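The effect described above can be illustrated with a toy calculation. DataClinic's density definition is not given in this article, so the sketch assumes a simple Gaussian kernel density with a fixed bandwidth, and it uses made-up 2-D points in place of the 1,280-dimensional embeddings. `kernel_density`, `eo`, and `ir` are hypothetical names.

```python
import math


def kernel_density(points, bandwidth=1.0):
    """Mean Gaussian-kernel similarity of each point to every point (self included).

    This is an assumed stand-in for the report's density metric.
    """
    n = len(points)
    densities = []
    for p in points:
        total = sum(
            math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2))
            for q in points
        )
        densities.append(total / n)
    return densities


# Toy embeddings: a larger tight group and a smaller group far away.
eo = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.2, 0.1)]
ir = [(10.0, 10.0), (10.1, 10.0)]

eo_only = kernel_density(eo)
combined = kernel_density(eo + ir)

print("mean density, EO only: ", sum(eo_only) / len(eo_only))
print("mean density, EO + IR: ", sum(combined) / len(combined))
print("IR point densities:    ", combined[len(eo):])
```

Under this assumed metric, each point only receives support from its own region, so splitting the data across two distant regions lowers the mean, and the smaller region ends up with the lowest per-point densities.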

Outlier Sample Analysis — The Invisible Boundary Between EO and IR

🟢 High Density — "Standard Surveillance" Scenes

The top 20 high-density samples are all EO, summer, night, W6 scenes, and most use the H7 camera position (ranks #8 and #15 below are H3 and H5). Scene numbers cluster around 0027 or the 0012~0018 range. These are the conditions the embedding treats as "typical border surveillance".

High density #1
EO W6 H7 B5 (0.664) 🔥
High density #2
EO W6 H7 A3A2A1 (0.660)
High density #3
EO W6 H7 B1 (0.659)
High density #4
EO W6 H7 A3 (0.658)
High density #8
EO W6 H3 E2 0012 (0.637)
High density #15
EO W6 H5 B1 0018 (0.621)

🔴 Low Density — The Most Challenging Edge Cases for AI

The lowest-density ranks are dominated by scenes with at least one of the following conditions: IR sensor, daytime (DT), winter (WI), or close range (H1). These are the opposite of the conditions that define the high-density profile.

Low density #1
EO DT H1 A1A5 (0.086) 🔴
Low density #2
IR NT H1 C1A1 (0.087)
Low density #3
EO NT H1 A4E4 0002 (0.087)
Low density #4
EO WI DT B3A3 (0.087)
Low density #7
IR NT W7 H1 A1A5 (0.089)
Low density #10
EO WI DT C3A3B3 (0.090)
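The contrast between the two galleries can be checked numerically from the values shown. The sketch below groups only the twelve samples listed in this article by camera code, so it says nothing about the other ranks or the full dataset; `mean_density_by_prefix` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# (condition label, density) pairs exactly as listed in the two galleries.
samples = [
    ("EO W6 H7 B5", 0.664), ("EO W6 H7 A3A2A1", 0.660),
    ("EO W6 H7 B1", 0.659), ("EO W6 H7 A3", 0.658),
    ("EO W6 H3 E2 0012", 0.637), ("EO W6 H5 B1 0018", 0.621),
    ("EO DT H1 A1A5", 0.086), ("IR NT H1 C1A1", 0.087),
    ("EO NT H1 A4E4 0002", 0.087), ("EO WI DT B3A3", 0.087),
    ("IR NT W7 H1 A1A5", 0.089), ("EO WI DT C3A3B3", 0.090),
]


def mean_density_by_prefix(samples, prefix):
    """Group samples by the first token like 'H7' (prefix + digits)."""
    groups = defaultdict(list)
    for label, density in samples:
        codes = [
            tok for tok in label.split()
            if tok.startswith(prefix) and tok[1:].isdigit()
        ]
        groups[codes[0] if codes else "unspecified"].append(density)
    return {key: mean(values) for key, values in groups.items()}


by_altitude = mean_density_by_prefix(samples, "H")
for code, value in sorted(by_altitude.items()):
    print(f"{code}: {value:.3f}")
```

Samples whose label omits the camera code fall into an "unspecified" group rather than being guessed.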

🔄 The Two Most Different Scenes — EO vs IR Extremes

Reference high-density EO night
EO Night — Reference (Pivot)
Density 0.664 (Max)
Farthest IR night
IR Night — Farthest Image
Density 0.100 (Extreme)

Both are "night infiltration" scenes, but to AI's eyes, EO (visible light) and IR (thermal) are completely different worlds.

💡 Key Finding — EO and IR Are Different Languages to AI: The closest image pairs are all different infiltration code combinations under EO night W6 H7 conditions. The farthest image pairs are EO night (high-density) vs IR night/day (low-density) pairs. In other words, sensor modality (EO vs IR) determines embedding distance more than the type or number of infiltrators. Operational defense AI requires a multimodal architecture that processes EO and IR as separate modalities.

Pebblous' Three Defense Synthetic Datasets — An Evolutionary Trajectory

Starting with the Military Border Operations Synthetic Data (#124), Pebblous' defense synthetic data has grown increasingly diverse and sophisticated. Comparing the three datasets reveals the direction of Pebblous' synthetic data technology evolution.

124
Military Border Operations Synthetic Data (2025.02)
NIA project · 149,447 images · 88GB · 88 pts · EO+IR dual sensor · Marine border surveillance
7-dimensional condition encoding in filenames · Variable resolution · Diagnosed up to L2
224
PBLS_Military (2026.03)
Self-built · 3,171 images · 10 classes · 68 pts · 10 ground equipment types · L1~L3 diagnosed
Perfect class balance · Duplicate image issue · DataDiet recommended
226
PBLS_Drone (2026.03)
Self-built · 28,801 images · 52GB · 87 pts · Drone recognition · L1~L3 diagnosed
Fixed FHD resolution · 12 drone types · DataBulkup recommended

Three Dataset Comparison

Metric | #124 Border | #224 Military | #226 Drone
DataClinic Score | 88 | 68 | 87
Images | 149,447 | 3,171 | 28,801
Data Size | 88GB | not listed | 52GB
Classes | Single | 10 | Single
L2 Mean Density | 0.211 | not listed | 0.300
Recommended Action | BulkUp | Diet | BulkUp

Recommendations — Beyond 88, the Next Steps

💪

Data BulkUp — Focus on IR Region

This is DataClinic's core recommendation. Currently, IR images are concentrated in low-density regions of the embedding space, making them underrepresented. Adding more diverse infiltration scenario images under IR conditions could significantly improve the AI model's thermal recognition performance.
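Before generating new images, a simple count per condition can help decide where to add them. Note that the report's finding is about embedding density, not raw counts, so the audit below is only a first-pass proxy. It assumes the filename layout described earlier; the `min_share` cutoff, the helper name `underrepresented`, and the example filenames are all hypothetical.

```python
from collections import Counter


def underrepresented(filenames, min_share=0.5):
    """Return (sensor, time) groups with fewer than min_share of the largest group's count."""
    counts = Counter()
    for name in filenames:
        sensor, _season, time_of_day = name.split("_")[:3]
        counts[(sensor, time_of_day)] += 1
    largest = max(counts.values())
    return {key: n for key, n in counts.items() if n < min_share * largest}


# Hypothetical filenames, for illustration only.
files = [
    "EO_SU_NT_W6_H7_B5_0027.jpg",
    "EO_SU_NT_W6_H7_B1_0027.jpg",
    "EO_SU_NT_W6_H7_A3_0027.jpg",
    "EO_SU_NT_W6_H3_E2_0012.jpg",
    "IR_SU_NT_W7_H1_A1A5_0002.jpg",
]
print(underrepresented(files))
```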

📐

Resolution Normalization — Improve Integrity

The coexistence of 960x540 and 1920x1080 images is the cause of the "Bad" integrity grade. To ensure consistency, it is recommended to generate new images at a single resolution (1920x1080) or to upscale the existing lower-resolution images to match.

⚠️ L3 Diagnosis Not Performed — Opportunity for Deeper Analysis

This report was diagnosed only up to L2. Unlike PBLS_Drone and PBLS_Military, it has no domain-specific lens (L3) analysis, so differences in infiltration-type distribution and EO/IR cluster separability could not be precisely characterized. An additional L3 diagnosis would allow much more accurate identification of DataBulkup target areas.

Multimodal AI Architecture Recommended: The finding that EO and IR occupy completely different positions in embedding space suggests that a multimodal fusion architecture (processing EO and IR through separate encoders, then fusing the results) is better suited to this dataset than a simple CNN classifier. In addition, domain adaptation techniques are needed so that models trained on synthetic images transfer to real surveillance camera footage.

Conclusion — The Origin of Pebblous Defense Synthetic Data

The Military Border Operations Synthetic Data is not just another dataset. It is the work where Pebblous took its first step into the new domain of defense AI synthetic data. Built as public infrastructure through an NIA project, this dataset laid the foundation for the Pebblous defense synthetic data series that continued with PBLS_Military and PBLS_Drone.

The high score of 88 stems from EO/IR dual sensors, complete 7-dimensional condition encoding, and the large-scale composition of 149,447 images. The variable resolution integrity issue and the need for IR low-density region augmentation are improvements to be addressed in the next version.

For AI to catch shadows crossing the night sea, diverse training data across many conditions is required. This dataset is the first attempt at that goal, and Pebblous' defense synthetic data journey continues today.

Key Summary Card

88
DataClinic Overall
149,447
Synthetic Images
88GB
Dataset Size
EO+IR
Dual Sensor

Original DataClinic Report: dataclinic.ai/en/report/124 · Not for commercial use · NIA project data