PebbloSim: Synthetic Data Generator for Physical AI

Executive Summary

The Physical AI market is poised for explosive growth with the rise of AI technologies that interact with the physical world across manufacturing, robotics, defense, and shipbuilding. However, the practical impossibility of collecting rare data such as defects, accidents, and edge cases -- critical to AI model robustness -- creates a 'Data Famine' that serves as the key bottleneck. PebbloSim is a strategic application designed to solve this problem, serving as the core execution engine of Pebblous Data Greenhouse's 'Action' layer.

PebbloSim adopts a 'Neuro-Symbolic Hybrid World Model' that combines the logical consistency of symbolic simulation with the visual expressiveness of neural generative models, producing high-quality synthetic data free from Physical Hallucination. A four-stage workflow comprising the Digital Twin Engine, GenSim Manager, Multimodal Generator, and PebbloScope Module runs the diagnosis-prescription-generation-verification cycle autonomously, while Vector-to-Param technology precisely targets data voids.

Over four quarterly PoC cycles spanning one year, PebbloSim progresses from automotive process validation (PoC 1) through defense (PoC 2) and shipbuilding (PoC 3) to full autonomous operation (PoC 4), providing auditable operational evidence for compliance with regulations and standards such as the EU AI Act and ISO/IEC 42001. PebbloSim is the core infrastructure realizing the paradigm shift from 'collecting' data to 'cultivating' it.

Introduction: The Data Bottleneck in the Physical AI Era

The Physical AI market is poised for explosive growth with the rise of AI technologies that interact with the physical world across manufacturing, robotics, defense, and shipbuilding. However, behind this innovation lies a critical bottleneck: 'Data Famine'.

In particular, rare data such as defects, edge cases, and disasters, which are key factors determining AI model robustness, are nearly impossible to collect intentionally in the real world. Examples include rare defect scenarios that occur in less than 0.001% of cases in smart factory welding processes, and the extreme weather conditions and sudden accidents that autonomous vehicles may encounter. The shortage of such data is the greatest barrier to Physical AI adoption.

"Defect data cannot be collected, accident data requires waiting for incidents,
and disaster data must never occur. So how do we train AI?"

PebbloSim is a strategic application designed to solve this data bottleneck. Within the 'Pebblous Data Greenhouse' ecosystem, Pebblous's core asset, it serves as the core execution engine of the 'Action' layer, which autonomously cycles data through 'Observe-Judge-Act-Prove.' It is also the strongest proof that the Data Greenhouse is more than a simple observation system: it is an operational framework that takes 'Responsibility' for data quality and lifecycle.

PebbloSim makes a decisive contribution to accelerating industry-specific AI model development, the core objective of the AADS (Agentic AI Data Scientist) Phase 2 project, and to securing data leadership in the Physical AI market. This document defines PebbloSim's fundamental reason for existence (Why) and ultimate goal (What), and describes the concrete architecture and development strategy to realize them.

1. Vision & Core Concepts

The Physical AI market is poised for explosive growth, but behind it lies a critical bottleneck called 'Data Famine'. PebbloSim is designed to solve this problem as the core execution engine of the 'Action' layer within the 'Pebblous Data Greenhouse' ecosystem, which autonomously cycles data through 'Observe-Judge-Act-Prove' operations.

1.1 Core Concepts & Objectives

PebbloSim's core concept is defined as "a digital twin-based simulation and synthetic data generation platform for creating Physical AI training data." Going beyond simply building virtual environments, it organically integrates with the Data Greenhouse to autonomously produce high-quality data immediately usable for AI training.

Existing generative AI models (Sora, Stable Diffusion, etc.) learn probabilistic correlations rather than physical laws, so their output can violate physics, a failure mode this document calls 'Physical Hallucination.' PebbloSim instead layers neural rendering on top of the physical consistency guaranteed by the digital twin engine. This Zero Physical Hallucination principle is the foundation of PebbloSim's technical credibility and the prerequisite for synthetic data to substitute for real process data in industrial settings.

🎯 High-Quality AI-Ready Data

Generate edge case data with extremely low real-world occurrence using 'Hyper-Synthetic Data' technology

📋 Operational Evidence

Generate auditable records that prove the logic and rationale behind data growth

1.2 Key Application Domains

PebbloSim prioritizes four key domains where 'Data Famine' is most severe. These domains share a common challenge: collecting defect, accident, and edge case data is practically impossible, and there is high demand for high-quality synthetic data with guaranteed physical consistency.

🚗 Automotive/Manufacturing

Generate stability data for autonomous manufacturing systems by simulating anomalies such as robot collisions and part dislocations in flexible manufacturing environments

🛡️ Defense

Generate surveillance and tactical training data in on-premise environments

🚢 Shipbuilding

Optimize ship construction processes through digital twins combining 3D CAD with sensor data

🤖 Robotics

Generate data for complex movements and exception handling of humanoid robots

1.3 Key Differentiators

The synthetic data generation market already has numerous players. PebbloSim secures structural differentiation in this market through three key elements: recording the generation process itself as auditable operational evidence, guaranteeing physical consistency through a neuro-symbolic world model, and a self-reinforcing virtuous cycle (Data Flywheel) in which diagnostic and generative accuracy improve with use.

Differentiator 1: PebbloSim as 'Operational Evidence'
PebbloSim is not merely a tool that produces synthetic data files (.jpg, .mp4, etc.). It functions as core 'Operational Evidence' by generating an Audit Trail, a record of causal relationships that proves why data was created and through what process quality was improved.

This is a 'proof of process' that explains the 'outcome.' Conventional simulators only deliver "the data you requested," but PebbloSim also delivers "a record of how the Data Greenhouse autonomously solved the problem." It does not simply generate "an image of a rainy day." Instead, the full causal chain is recorded: "Data Clinic diagnosed 'insufficient rainy-weather data' (Why), AADS configured 'precipitation 10mm, illuminance 50 lux' (How), and PebbloSim generated this data (Action)." The 'Operational Evidence Package' generated by PebbloSim combines the following three elements:

📋 Diagnosis-Based Prescription

Records of reverse-converting 'Vector Space Void' coordinates detected by Data Clinic into simulation parameters (Vector-to-Param)

⚙️ Execution & Generation Logs

Execution records including digital twin state, applied physics laws, and generated multimodal data

✅ Improvement Verification Report

Before/After comparison showing how much the Quality Index improved with the generated data
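
The three elements above could be bundled into a single record. The following is a minimal sketch, not PebbloSim's actual schema: every class name, field name, identifier, and Quality Index value is a hypothetical placeholder, and only the rainy-weather diagnosis and its two parameters come from the example in the text.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Prescription:
    """Why: the diagnosed void and the parameters derived from it."""
    diagnosis: str
    void_coordinates: tuple[float, ...]   # hypothetical embedding coordinates
    sim_parameters: dict[str, float]


@dataclass(frozen=True)
class ExecutionLog:
    """How: what the simulator ran and what it produced."""
    twin_state_id: str
    physics_laws: tuple[str, ...]
    generated_files: tuple[str, ...]


@dataclass(frozen=True)
class VerificationReport:
    """Result: Quality Index before and after adding the generated data."""
    quality_index_before: float
    quality_index_after: float

    @property
    def improvement(self) -> float:
        return self.quality_index_after - self.quality_index_before


@dataclass(frozen=True)
class EvidencePackage:
    prescription: Prescription
    execution: ExecutionLog
    verification: VerificationReport

    def to_json(self) -> str:
        # Sorted keys keep the serialized record stable for later auditing.
        return json.dumps(asdict(self), sort_keys=True)


package = EvidencePackage(
    prescription=Prescription(
        diagnosis="insufficient rainy-weather data",
        void_coordinates=(0.12, -0.87),  # made-up values
        sim_parameters={"precipitation_mm": 10.0, "illuminance_lux": 50.0},
    ),
    execution=ExecutionLog(
        twin_state_id="twin-state-001",        # hypothetical identifier
        physics_laws=("gravity", "friction", "optics"),
        generated_files=("rain_0001.jpg",),    # hypothetical file name
    ),
    verification=VerificationReport(
        quality_index_before=0.62,  # placeholder, not a measured result
        quality_index_after=0.71,   # placeholder, not a measured result
    ),
)
print(package.to_json())
```

Keeping the three parts as separate immutable records means a reviewer can read the Why, How, and Result independently while still receiving them as one package.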

Differentiator 2: Neuro-Symbolic Hybrid World Model
To overcome the limitations of existing Generative AI that ignores physical laws, PebbloSim adopts an approach that combines the logical consistency of 'Symbolic Simulation' with the visual expressiveness of 'Neural Generative Models'.

Existing image and video generation models (e.g., Sora, Stable Diffusion) learn only probabilistic correlations between pixels, which can cause 'Physical Hallucination' that violates physics laws, such as cars floating in mid-air or shadows pointing in the wrong direction. In contrast, PebbloSim first builds a 'World Model' governed by physics laws including gravity, friction, and optics, then applies generative AI technology as the 'skin' on top of this 'skeleton.'

🎯 Zero Physical Hallucination

Guaranteed Physically Consistent data

💡 Explainable Causality

Clear causal explanations such as "The car skidded because the friction coefficient was 0.3"

🔧 Precise Controllability

Numerically precise control such as "rainfall 30mm/h, collision angle 45 degrees, speed 60km/h"
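
To make the link between a controlled parameter and an explainable outcome concrete, here is a small worked example using textbook kinematics rather than PebbloSim's physics engine. It assumes a flat road, locked wheels, and a constant friction coefficient, so the sliding distance is d = v^2 / (2 * mu * g). The function name is hypothetical, and the dry-road coefficient of 0.7 is an illustrative value chosen for comparison; only the 0.3 coefficient and the 60 km/h speed come from the text.

```python
G = 9.81  # gravitational acceleration in m/s^2


def stopping_distance_m(speed_kmh: float, friction_coefficient: float) -> float:
    """Sliding distance on a flat road with locked wheels: d = v^2 / (2 * mu * g)."""
    if friction_coefficient <= 0:
        raise ValueError("friction coefficient must be positive")
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return speed_ms ** 2 / (2 * friction_coefficient * G)


wet = stopping_distance_m(60, 0.3)  # values from the text
dry = stopping_distance_m(60, 0.7)  # illustrative comparison value
print(f"mu=0.3: {wet:.1f} m, mu=0.7: {dry:.1f} m")
```

Because the distance scales with 1/mu, lowering the coefficient from 0.7 to 0.3 lengthens the slide by a factor of 0.7/0.3, which is the kind of numeric cause a physics-governed simulation can report alongside each generated sample.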

Differentiator 3: Self-Reinforcing Virtuous Cycle (Data Flywheel)
A structure that gets smarter with use. Data production leads to enhanced AI intelligence, and enhanced intelligence produces more sophisticated data -- implementing a self-reinforcing virtuous cycle. This creates a technological Moat that is difficult for competitors to replicate.

This virtuous cycle operates through three mechanisms. First, Internalizing Intelligence -- The high-quality synthetic data (Curriculum Data) generated by PebbloSim is not merely delivered to customers; it is also used to retrain the core AI models that serve as the system's brain. Through this 'Self-Training Loop,' the system evolves over time to understand increasingly complex physical situations and design more sophisticated scenarios.

Second, Reinforcement Cycle -- When the system detects data gaps, simulations generate data to fill them. As this data improves model performance, the enhanced model discovers even more subtle data defects that were previously invisible. This infinite loop creates a Data Flywheel effect that exponentially increases the value of enterprise data assets.

Third, Asset Appreciation -- While conventional software becomes obsolete over time, the Data Greenhouse powered by PebbloSim becomes a 'value-appreciating asset' where diagnostic and generative accuracy improve as data accumulates. This creates an unassailable technological moat that competitors cannot replicate in the short term.

1.4 Regulatory Compliance & Business Value

PebbloSim's operational evidence serves as the key to solving the AI regulatory and reliability challenges that enterprises face.

📜 Regulatory Compliance Documentation

The EU AI Act and ISO/IEC 42001 (AI Management Systems) require proof of what data AI models were trained on. PebbloSim's operational evidence serves as auditable documentation demonstrating "we scientifically diagnosed and reinforced insufficient safety data in this manner."

🛡️ Physical AI Safety Assurance

In Physical AI domains such as robotics and autonomous driving, learning from accident data is essential. PebbloSim generates accident data that cannot be obtained in reality, serving as a 'Safety Assurance Certificate' proving that models were trained on this data.

2. System Architecture

PebbloSim is a core application that runs on top of the Data Greenhouse, an AI data operating system (OS). It is defined by a clear workflow: Engine (Infrastructure) + Scenario (Blueprint) = Simulator Instance (GenSim).
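
The composition rule above can be read as one reusable engine paired with many scenarios. The sketch below illustrates that reading only; the class names, fields, and sample values are hypothetical and do not describe PebbloSim's real interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """Infrastructure: the shared physics environment (hypothetical fields)."""
    name: str
    physics_laws: tuple[str, ...]


@dataclass(frozen=True)
class Scenario:
    """Blueprint: what to simulate and with which parameters."""
    name: str
    parameters: dict[str, float]


@dataclass(frozen=True)
class GenSim:
    """Simulator instance: one engine bound to one scenario."""
    engine: Engine
    scenario: Scenario

    def describe(self) -> str:
        return f"{self.scenario.name} on {self.engine.name}"


paint_line = Engine("paint-line-twin", ("gravity", "friction", "optics"))
scenarios = [
    Scenario("low-light micro-scratch", {"illuminance_lux": 40.0}),
    Scenario("robot collision", {"speed_m_per_s": 1.5}),
]

# The same engine is reused; only the scenario changes per instance.
instances = [GenSim(paint_line, scenario) for scenario in scenarios]
for instance in instances:
    print(instance.describe())
```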

2.1 Four-Stage Workflow

PebbloSim's workflow is not the operation of a standalone simulator, but the execution cycle of an application running on a Data Operating System (Data OS). It specializes in 'Data Bulk-up' that reinforces sparse data regions through physics-based simulation, while its process is fully synchronized with the Greenhouse's Observe-Judge-Act-Prove loop. This architecture enables PebbloSim to function not as an individual tool (simulator) but as scalable platform infrastructure.

[Figure: GenSim Loop. Four stages in a cycle: 🏭 Twin (Base Class) → 📐 Design (Architect) → ⚡ Generate (Action) → 👁️ Verify (Director)]

2.2 Core Module Function Definitions

Module	Alias	Role	Core Technology
Digital Twin Engine	The Base Class	Digital base environment precisely replicating real-world physics laws and environments	NVIDIA Omniverse, Reality Sync, Ground Truth provision
GenSim Manager	The Architect	Translates abstract commands into concrete simulation scripts	Ontology & LLM, Intent Translation
Multimodal Generator	Action Engine	Actively produces multimodal data within GenSim instances	Vector-to-Param, Precision Targeting
PebbloScope Module	The Director	Visually monitors simulations and provides final approval	Interactive Link, Human-in-the-Loop
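
One way to picture how the modules in the table hand work to each other is a small loop in which design, generation, and verification are pluggable stages. This is an illustrative sketch under stated assumptions: the function `run_gensim_loop`, the stage signatures, the retry limit, and the stub stages are all hypothetical, and the real modules are far richer than these stand-ins.

```python
from typing import Callable

Script = dict[str, float]
Batch = list[dict[str, float]]


def run_gensim_loop(
    command: str,
    design: Callable[[str], Script],      # stands in for the GenSim Manager
    generate: Callable[[Script], Batch],  # stands in for the Multimodal Generator
    verify: Callable[[Batch], bool],      # stands in for the PebbloScope approval
    max_rounds: int = 3,
) -> tuple[Batch, int]:
    """Repeat design -> generate -> verify until a batch is approved."""
    for round_no in range(1, max_rounds + 1):
        script = design(command)
        batch = generate(script)
        if verify(batch):
            return batch, round_no
    raise RuntimeError(f"no batch approved after {max_rounds} rounds")


def design_stub(command: str) -> Script:
    # A real translator would consult an ontology; this stub returns fixed values.
    return {"illuminance_lux": 40.0, "count": 3.0}


def generate_stub(script: Script) -> Batch:
    return [{"illuminance_lux": script["illuminance_lux"]} for _ in range(int(script["count"]))]


def verify_stub(batch: Batch) -> bool:
    return len(batch) > 0 and all(sample["illuminance_lux"] < 50.0 for sample in batch)


batch, rounds = run_gensim_loop(
    "create low-light micro-scratch samples", design_stub, generate_stub, verify_stub
)
print(len(batch), "samples approved in round", rounds)
```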

2.3 Greenhouse Integration Mechanism

PebbloSim's business value comes not from standalone operation but from its organic integration with the Greenhouse ecosystem. By automating the pipeline from diagnosis through generation to verification, enterprises eliminate the need to build separate manual pipelines for data quality improvement. This structurally reduces both the cost and time of quality improvement, freeing data engineers from repetitive tasks.

🔍 From Diagnosis to Prescription (Clinic → Architect)

Data bias/deficiency information diagnosed by Data Clinic is delivered to the GenSim Manager as blueprints for precision scenario generation

⚡ Execution & Supply (Action Engine → Greenhouse)

Vector-to-Param technology precision-targets voids in the neuro-symbolic representation space for high-efficiency data bulk-up

✅ Verification & Circulation (Director → Flywheel)

Only data passing PebbloScope's approval gate is assetized, completing the Data Flywheel structure through AI model retraining
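
This document does not describe how Vector-to-Param converts a void coordinate into simulation parameters. Purely as an illustrative stand-in, and not the patented method, the sketch below uses inverse-distance-weighted k-nearest-neighbor interpolation: it assumes a set of existing samples whose embedding coordinates and generating parameters are both known, and estimates parameters for a void from its nearest neighbors. The function name and all sample values are hypothetical.

```python
import math


def vector_to_param_knn(
    void: tuple[float, ...],
    known: list[tuple[tuple[float, ...], dict[str, float]]],
    k: int = 2,
) -> dict[str, float]:
    """Estimate simulation parameters for a void coordinate.

    `known` holds (embedding, parameters) pairs for existing samples.
    The k nearest embeddings are averaged, weighted by inverse distance.
    """
    if not known:
        raise ValueError("at least one known sample is required")
    nearest = sorted(known, key=lambda pair: math.dist(void, pair[0]))[:k]
    # A small epsilon avoids division by zero when the void sits on a sample.
    weights = [1.0 / (math.dist(void, embedding) + 1e-9) for embedding, _ in nearest]
    total = sum(weights)
    keys = nearest[0][1].keys()
    return {
        key: sum(w * params[key] for w, (_, params) in zip(weights, nearest)) / total
        for key in keys
    }


known_samples = [
    ((0.0, 0.0), {"illuminance_lux": 30.0}),
    ((1.0, 0.0), {"illuminance_lux": 50.0}),
    ((5.0, 5.0), {"illuminance_lux": 500.0}),
]
estimate = vector_to_param_knn((0.5, 0.0), known_samples, k=2)
print(estimate)
```

The void at (0.5, 0.0) lies midway between the first two samples, so the estimate falls midway between their illuminance values and the distant bright sample is ignored.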

3. Scenario-Based Workflow

Example Scenario: A workflow to resolve a situation where AI detection rates are declining due to insufficient 'micro-scratch' defect data in low-light conditions during automotive painting processes.

The automotive painting process was selected as the first validation scenario for a clear reason: it is a representative case where the absence of rare defect data directly bottlenecks AI model performance. Micro-scratches occur at less than 0.1% frequency in actual processes, making it difficult to acquire sufficient training data, and detection difficulty changes dramatically depending on on-site lighting conditions. This scenario enables quantitative proof of Sim-to-Real transfer effectiveness, making it the industrial reference that most intuitively demonstrates PebbloSim's ROI to both investors and customers.
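
The rarity figure translates into a rough collection cost. The sketch below assumes each inspected part is independently defective with a fixed probability, in which case the expected number of parts needed to observe n defects is n / p. Because the text gives 0.1% as an upper bound on the rate, the result is a lower bound. The function name is hypothetical, and the target of 1,000 samples is chosen to match the generation command in the workflow that follows.

```python
import math
from fractions import Fraction


def parts_to_inspect(target_defects: int, defect_rate: Fraction) -> int:
    """Expected number of parts to inspect to observe `target_defects` defects."""
    if target_defects < 0 or not 0 < defect_rate <= 1:
        raise ValueError("target must be non-negative and rate must be in (0, 1]")
    # Fraction keeps the division exact, so no floating-point rounding creeps in.
    return math.ceil(target_defects / defect_rate)


MICRO_SCRATCH_RATE = Fraction(1, 1000)  # 0.1%, the upper bound quoted in the text
required = parts_to_inspect(1_000, MICRO_SCRATCH_RATE)
print(f"{required:,} parts")
```

Under these assumptions, gathering 1,000 real micro-scratch examples would mean inspecting on the order of a million painted parts, before any filtering for lighting conditions.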

Step 1: Diagnosis & Prescription

Data Clinic diagnoses that "micro-scratch data at illumination below 50 lux accounts for less than 1%," and AADS generates a command to "create 1,000 micro-scratch data samples in low-light environments"

Step 2: Scenario Design & Translation

GenSim Manager translates the abstract command by referencing the ontology into specifics: 'light source brightness 30-50 lux,' 'scratch texture with depth 0.1mm and length within 2cm'

Step 3: Virtual Environment Construction & Data Generation

The Digital Twin Engine replicates the actual factory environment, and Vector-to-Param technology precision-targets only the missing 'dark environment' data for generation

Step 4: Visualization & Quality Verification

PebbloScope verifies that data is distributed in the intended regions, and validates ontology connections via Interactive Link

Step 5: Governance Approval & Ingestion

After user 'Approve', data is ingested into the data lake; all processes are recorded as Audit Logs for regulatory compliance
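
The diagnosis and translation steps of this scenario can be sketched in a few lines. This is an assumption-laden illustration, not the Data Clinic or GenSim Manager implementation: the function names, the uniform sampling, the 0.1 cm lower bound on scratch length, and the made-up existing dataset are all hypothetical. The 50 lux threshold, the 1% share, the 30-50 lux range, the 0.1mm depth, the 2cm length limit, and the 1,000-sample target come from the scenario.

```python
import random

LOW_LIGHT_LUX = 50.0   # threshold from the diagnosis
MIN_SHARE = 0.01       # "less than 1%" triggers a prescription
TARGET_SAMPLES = 1000  # size of the generation command


def low_light_share(illuminances: list[float]) -> float:
    """Fraction of existing samples captured below the low-light threshold."""
    if not illuminances:
        raise ValueError("dataset is empty")
    return sum(lux < LOW_LIGHT_LUX for lux in illuminances) / len(illuminances)


def needs_bulk_up(illuminances: list[float]) -> bool:
    return low_light_share(illuminances) < MIN_SHARE


def sample_scratch_params(n: int, seed: int = 0) -> list[dict[str, float]]:
    """Draw n parameter sets inside the ranges given in the scenario."""
    rng = random.Random(seed)  # seeded so a run can be reproduced for audit
    return [
        {
            "illuminance_lux": rng.uniform(30.0, 50.0),
            "scratch_depth_mm": 0.1,
            "scratch_length_cm": rng.uniform(0.1, 2.0),  # lower bound is an assumption
        }
        for _ in range(n)
    ]


# Made-up existing dataset: 0.5% of samples are low-light.
existing = [300.0] * 995 + [40.0] * 5
batch = sample_scratch_params(TARGET_SAMPLES) if needs_bulk_up(existing) else []
print(f"low-light share: {low_light_share(existing):.3%}, prescribed: {len(batch)}")
```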

Data Flywheel Effect
As this workflow repeats, the customer's data system evolves into a 'living asset,' and a Network Effect is created where Pebblous's AADS-LLM and VLM are also strengthened in tandem.

4. Phased Development Strategy

To successfully build a complex and innovative platform like PebbloSim, an incremental and iterative approach is essential rather than a 'Big Bang' development method. Pebblous adopts a development strategy that progressively completes PebbloSim through 4 PoC cycles (3 months each) over the course of one year.

The core of this approach is the 'Wedge Use Case' strategy. We prove immediate ROI through automotive processes (PoC 1) -- the most urgent domain with the clearest impact -- then use this as a springboard to expand into defense (PoC 2) and shipbuilding (PoC 3), ultimately completing the fully autonomous platform (PoC 4). Each phase deepens integration with the Data Greenhouse and is designed to directly contribute to achieving the quantitative targets of the AADS Phase 2 government project. In other words, it is a strategy of progressive deepening: 'Closing the Loop -> Sovereignty -> Data Depth -> Full Autonomy.'

This four-cycle strategy also doubles as a risk management framework. Each PoC takes the deliverables of the previous stage as prerequisites, so technical uncertainties discovered in early stages are immediately reflected in subsequent designs. Furthermore, since every cycle's outputs are directly incorporated into the AADS Phase 2 government project's technical reports, technology validation and project execution stay synchronized on a single timeline.

PoC #1 Foundation Building & Automotive Process Validation (Months 1-3)

Focus on 'Closing the Loop'. Prove that the 'diagnosis-prescription-generation-verification' pipeline operates seamlessly.

• Build basic physics environment (Class) for automotive manufacturing lines including robot arms, conveyor belts, etc.
• Implement manual selection and execution of 2-3 fixed scenarios
• Develop basic synthetic data generation module focused on visual data (RGB images)

PoC #2 Defense Domain Expansion & Sovereign System Validation (Months 4-6)

Focus on 'Sovereignty & Security'. Validate a standalone Data Greenhouse that operates entirely within closed defense networks.

• Add defense-specific complex scenario assets such as 'infiltration,' 'loitering,' and 'abandonment'
• Equip tamper-proof governance module (security audit compliance)
• Package to operate with proprietary sLLM and rendering engine without foreign platforms
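
The document does not say how the governance module achieves tamper resistance. One widely used building block is a hash chain, in which each log entry stores the hash of the previous one so that any later modification is detectable. The sketch below illustrates that general technique only; the function names and log fields are hypothetical, and a hash chain alone makes a log tamper-evident rather than tamper-proof.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash and the canonical JSON of the event."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], event: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, {"step": "diagnosis", "detail": "low-light data below 1%"})
append_entry(audit_log, {"step": "generation", "detail": "1000 samples generated"})
append_entry(audit_log, {"step": "approval", "detail": "user approved ingestion"})
print("chain valid:", verify_chain(audit_log))
```

In a closed-network deployment the final hash would additionally need to be stored somewhere the log writer cannot alter, which is outside the scope of this sketch.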

PoC #3 Multimodal Data Enhancement & Shipbuilding/Manufacturing Application (Months 7-9)

Focus on 'Depth of Data'. Enhance complex data processing capabilities and intelligent generation abilities that understand unstructured information.

• 'Spatiotemporal Synchronization' complex data generation engine combining 3D CAD with sensor logs
• Industrial VLM integration for automatic conversion of design drawing annotations into physical constraints
• Apply 'Interactive Link' to PebbloScope (neuro-symbolic bidirectional visualization)
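
One ingredient of spatiotemporal synchronization is lining up sensor log timestamps with simulation frames. The sketch below shows only that temporal part, using nearest-timestamp matching with a tolerance; it does not address spatial registration against 3D CAD. The function name, the tolerance, and the sample readings are hypothetical and not taken from PebbloSim.

```python
from bisect import bisect_left
from typing import Optional


def align_nearest(
    frame_times: list[float],
    sensor_log: list[tuple[float, float]],
    tolerance_s: float,
) -> list[tuple[float, Optional[float]]]:
    """Pair each frame time with the nearest sensor reading within the tolerance.

    `sensor_log` must be sorted by timestamp. Frames without a reading that is
    close enough are paired with None instead of a stale value.
    """
    times = [timestamp for timestamp, _ in sensor_log]
    aligned: list[tuple[float, Optional[float]]] = []
    for frame_time in frame_times:
        i = bisect_left(times, frame_time)
        # Only the readings just before and just after can be the nearest one.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - frame_time), default=None)
        if best is not None and abs(times[best] - frame_time) <= tolerance_s:
            aligned.append((frame_time, sensor_log[best][1]))
        else:
            aligned.append((frame_time, None))
    return aligned


frames = [0.0, 0.5, 1.0, 2.0]
readings = [(0.02, 20.1), (0.48, 20.4), (1.30, 21.0)]  # made-up (time, value) pairs
aligned = align_nearest(frames, readings, tolerance_s=0.1)
print(aligned)
```

Returning None for unmatched frames keeps gaps explicit, so a downstream generator can decide whether to interpolate or to skip those frames.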

PoC #4 Full Autonomy & Platform Completion (Months 10-12)

Completion of 'Autonomy & Connectivity'. Build a 'Self-Driving Data Ops' environment where humans only set goals and AI agents drive the entire process.

• Complete dedicated Agentic API Gateway for AADS agents
• Full automation of Vector-to-Param (core patent technology US 12,481,720)
• Autonomous PDIG loop execution and Human-in-the-Loop Smart Gate

5. Conclusion

5.1 Paradigm Shift

We declare the transition from an era of accidentally discovering and 'collecting' data in the real world, to an era of intentionally designing and 'cultivating' the data we need. PebbloSim serves as an inexhaustible source that infinitely supplies AI-Ready data, transcending the constraints of risk, cost, and time in the physical world.

5.2 Business Impact

The Physical AI era demands more than merely 'collecting' data. 'Cultivating' data and accumulating all evidence of the process in an auditable form is becoming a prerequisite for market entry. PebbloSim, at the center of this paradigm shift, completes the Data Greenhouse's diagnosis-prescription-generation-verification pipeline as a fully automated operating system.

📈 Growth

Evolving into an 'Appreciating Asset' whose value increases over time through the Data Flywheel

🤝 Trust

Proving AI model safety and transparency through Audit Trails compliant with ISO/IEC 5259 and ISO/IEC 42001 standards

Vision: Essential Infrastructure for the Physical AI Era
PebbloSim is the most powerful execution tool realizing Pebblous's "Makes Data Tangible" vision. It will establish itself as the essential 'Data Infrastructure' for key industries including automotive, defense, and shipbuilding to achieve breakthrough competitiveness by combining with AI.

Frequently Asked Questions

Q. What is PebbloSim?

PebbloSim is the core execution engine of the Data Greenhouse, a platform that generates high-quality synthetic data needed for Physical AI training through digital twin-based simulation. It precisely generates edge case data that is difficult to obtain in reality, ensuring the robustness of AI models.

Q. What is Physical AI data and why is it important?

Physical AI data is sensor data used to train AI systems that operate in the physical world, such as autonomous vehicles, robots, and drones. It consists of multimodal data collected from various sensors including camera footage, LiDAR point clouds, radar, and IMU, with dangerous situation (edge case) data that is difficult to safely collect in real environments being particularly important.

Q. How is multimodal synthetic data generated?

PebbloSim deploys virtual sensors in a digital twin-based simulation environment to generate data across various modalities including RGB camera, Depth, LiDAR, and radar in a synchronized state. The physics engine simulates accurate dynamics, and neural rendering technology adds realistic visual quality, creating synthetic data that is indistinguishable from reality.

Q. What is the relationship between Data Greenhouse and PebbloSim?

Data Greenhouse is the framework encompassing Pebblous's entire data ecosystem, and PebbloSim is its core execution engine. Within the Data Greenhouse's PDIG (Perceive-Diagnose-Intervene-Govern) cycle, PebbloSim handles the 'Intervene' stage, which corresponds to the 'Action' layer of the Observe-Judge-Act-Prove loop described earlier, and generates purpose-driven data, thereby activating the Data Flywheel virtuous cycle.

Q. What roles do AADS and Data Clinic play in PebbloSim?

AADS (Agentic AI Data Scientist) is a system where AI agents autonomously diagnose and prescribe data quality issues. Based on data problems diagnosed at Data Clinic (data gaps, biases, class imbalances, etc.), AADS directs PebbloSim to generate the necessary data, building an automated data quality improvement pipeline.

Q. What is the Neuro-Symbolic Hybrid World Model?

The Neuro-Symbolic Hybrid World Model is an approach that combines the logical consistency of symbolic simulation with the visual expressiveness of neural generative models. By adding realistic visualization through generative AI on top of accurate physics-based simulation, it generates data that is both physically consistent and realistic.
