Executive Summary
The most important signal in the "K-Physical AI Full-Stack Strategy" unveiled on June 19, 2026 is this: Korea's government pinned the bottleneck of the Physical AI race on behavior data, not on chips and not on models. As it compressed the 40 tasks identified in the first phase into three flagship projects, it placed at the center a network of behavior-data training centers across five regions of the country. The top priority is not securing chips or building a domestic foundation model, but producing, at home, the trajectories of robots actually moving through Korean environments.
This is the national version of a claim Pebblous has made all along: you can borrow a model, but you cannot borrow data. GPUs and model weights can be imported, but the actions a robot performs on a Korean factory line, road, or facility exist only in that environment. Robot-learning research has confirmed, over and over, that demonstrations gathered in one setting transfer poorly to a different robot body or a different environment.
So the real front line is not drawn at "how much you collect" but at "whether you can guarantee it is of learnable quality." Korea has been here once before. The "Data Dam," funded with roughly 1.4 trillion won, was criticized as an "expensive white elephant" because its quality checks were a formality. With behavior data the consequences are harsher, because a badly learned action turns into an accident in the physical world. This report traces why the government's declaration is really a data problem, and why the question that follows it is quality.
The gap with the global leaders, and the hand Korea actually holds, can be compressed into four numbers.
- 10 billion miles: Tesla FSD cumulative driving, a scale of behavior data no single nation can easily replicate
- 1M trajectories: the largest open and research robot datasets (AgiBotWorld, 2,976 hours)
- 1,220 robots: Korea's manufacturing robot density per 10,000 workers, the world's highest and 7.5× the global average
- 45%: sim-trained success rate on precise insertion, the gap synthetic data alone cannot close
The New Front Line: Behavior Data, Not Chips or Models
On June 19, 2026, Korea's Ministry of Science and ICT launched the "K-Physical AI Full-Stack Strategy" alongside the second phase of the "Physical AI Alliance" (ZDNet Korea, June 19, 2026). It folded the 40 tasks identified in the first phase into three flagship projects, and what stands out is that the centerpiece is neither a chip supply chain nor a Korean foundation model. The first thing the government said it would build was a behavior-data training center.
The training centers, anchored in five regions, will generate data along two tracks. One is teleoperation, where a human remotely pilots a robot in physical space to collect demonstrations. The other is digital twins, where reality is replicated in virtual space to mass-produce synthetic behavior data. The design pursues both the fidelity of real data and the scale of synthetic data at once. Governance brings together eight ministries and roughly ten industry associations, with 15 action groups handling execution by domain. The Physical AI budget was set at about 402.2 billion won, and the next steps announced were detailing the training centers and writing them into the 2027 budget.
The weight of the declaration lies in what gets built first. Physical AI breaks down into three rough components: chips (compute), models (intelligence), and data (experience). The first two are close to standard goods you can buy on the global market. Data alone is not. By saying it would build the training centers first, the government effectively made official a judgment that data is the one asset Korea has to secure on its own.
The point: The government named the bottleneck of Physical AI as "data." It left chips and models to imports and elevated the domestic production of behavior data to the top national task. Whether this strategy succeeds comes down to "how you collect that data, and how you guarantee it."
Why Behavior Data Can't Be Borrowed
Behavior data is the observation-action trajectory of a robot carrying out a specific task. When inputs such as camera frames and force-sensor readings arrive, it records, as a time series, how the joints and gripper moved in response. Unlike static text or images, its essence is temporal alignment, physical causality, and an outcome of success or failure. Models trained on this data are commonly called VLA (Vision-Language-Action) policies.
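As a rough illustration of the definition above, the sketch below models one demonstration as a time-ordered list of observation-action steps with an outcome label. The field names (`camera_frame_id`, `force_reading`, `gripper_opening`, and so on) are hypothetical choices for this example and do not come from any specific dataset format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One time step: what the robot observed and how it acted in response."""
    t: float                      # timestamp in seconds
    camera_frame_id: str          # pointer to a stored image, not the pixels
    force_reading: List[float]    # force-sensor values at time t
    joint_positions: List[float]  # action: commanded joint positions
    gripper_opening: float        # action: 0.0 (closed) to 1.0 (open)


@dataclass
class Trajectory:
    """A full demonstration: metadata, outcome label, and ordered steps."""
    task: str
    robot: str         # the body that produced the data
    environment: str   # the setting that produced the data
    success: bool      # outcome label for the whole demonstration
    steps: List[Step] = field(default_factory=list)

    def duration(self) -> float:
        """Seconds between the first and last recorded step."""
        if len(self.steps) < 2:
            return 0.0
        return self.steps[-1].t - self.steps[0].t

    def is_time_ordered(self) -> bool:
        """True if timestamps strictly increase from step to step."""
        return all(a.t < b.t for a, b in zip(self.steps, self.steps[1:]))


# Hypothetical three-step demonstration, values invented for illustration.
demo = Trajectory(
    task="insert_connector", robot="arm_a", environment="line_3", success=True,
    steps=[
        Step(0.0, "frame_000", [0.0, 0.0, 0.1], [0.00, 0.50], 1.0),
        Step(0.1, "frame_001", [0.0, 0.0, 0.4], [0.02, 0.48], 0.6),
        Step(0.2, "frame_002", [0.0, 0.1, 0.9], [0.03, 0.47], 0.2),
    ],
)
print(demo.duration(), demo.is_time_ordered())
```

Keeping `robot` and `environment` on every trajectory is a deliberate choice in this sketch: the data only makes sense relative to the body and setting that produced it.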
The problem is that this data is, by nature, hard to transfer. Studies of robot imitation learning have shown repeatedly that what raises performance is the diversity of environments and objects, not the sheer count of demonstrations (Data Scaling Laws in Imitation Learning, 2024). No matter how many more demonstrations you gather on the same factory floor at the same workbench, returns diminish past a certain point; what raises the ceiling is a new environment, a new object, a new task.
2.1 Two gaps: the embodiment gap and the domain gap
Non-transferability shows up in two forms. One is the embodiment gap: behavior collected on one robot arm does not carry over intact to a robot with different joint structure, actuation characteristics, or gripper. The other is the domain gap: in an environment with different lighting, flooring, object placement, or task ordering, a learned policy easily breaks down. This is why importing a million trajectories collected in the United States or China still cannot guarantee performance on the parts, fixtures, and work patterns of a Korean factory line.
A leading attempt to narrow these gaps is the Open X-Embodiment / RT-X project. Thirty-four labs pooled data from 22 different robot types into a unified dataset of a million trajectories and showed that some transfer is possible when training across many bodies at once (Open X-Embodiment). Yet even this kind of pooling does not substitute for "diversity of environments." The conclusion keeps coming back to the same place: behavior trajectories for Korean environments have to be built in Korea.
Model weights and GPUs are standard goods, but behavior data is strongly bound to the body and environment that produced it. "You can't borrow data" is not a metaphor; it is a summary of a technical fact named by the embodiment gap and the domain gap.
How the World Is Stockpiling Behavior Data
If you have to build the data yourself, the next question is "how much have the leaders built?" The short answer: the gap in raw scale is overwhelming. Tesla has accumulated more than 10 billion miles of behavior data from its self-driving vehicle fleet (as of May 2026, adding tens of millions of miles a day). That data is confined to a single environment, the road, but the scale is one no nation can easily replicate alone. Narrowing to robot manipulation, the gap is still wide. The table below compares the size of public and research behavior datasets.
| Dataset / owner | Scale | Characteristics |
|---|---|---|
| Tesla FSD | 10B+ cumulative miles | Self-driving fleet, large-scale collection in a single environment (roads) |
| Open X-Embodiment | 1M+ trajectories | 34 labs, 22 robot types pooled (cross-embodiment) |
| AgiBotWorld (CN) | 1M+ trajectories / 2,976 hrs | "Data factory" collecting from many robots at once in a dedicated facility |
| Physical Intelligence π0 | ~10,000 hrs | Demonstrations for a general-purpose manipulation foundation model |
| DROID | 76,000 trajectories / 350 hrs | In-the-wild collection across 13 institutions (12 months) |
The methods for manufacturing scale are evolving too. China runs a "data factory" model, lining up hundreds of robots in a 4,000-square-meter dedicated facility to capture demonstrations in parallel. Relatively low teleoperation labor costs also support mass production. In the United States, the hourly rate for teleoperation is falling fast as well, which is another way of saying the "collect more" race is accelerating.
| Teleoperation rate trend (per hour, U.S.) | Rate |
|---|---|
| Early 2024 | ~$340 |
| 2025 | ~$136 |
| March 2026 | ~$118 |
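The size of the decline can be read straight off the table. A minimal sketch of that arithmetic, using only the three rates listed above (variable names are illustrative):

```python
# Hourly U.S. teleoperation rates in USD, copied from the table above.
rates = [("Early 2024", 340), ("2025", 136), ("March 2026", 118)]

baseline = rates[0][1]
for label, rate in rates:
    share = rate / baseline
    print(f"{label}: ${rate}/hr, {share:.0%} of the early-2024 rate")

latest_share = rates[-1][1] / baseline  # fraction of the starting rate still paid
total_decline = 1 - latest_share        # fraction of the starting rate that disappeared
```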
That the rate fell to about a third in a little over two years is a signal that "collecting a lot" of behavior data is becoming a commodity skill. Once anyone can pile up volume, the center of gravity in competition naturally shifts to "what did you collect, and how usable is it?" So where does Korea stand in this current?
3.1 Korea's position: behind on scale, but first in the foundation for collection
On raw scale, Korea is a latecomer. But on a different axis, the "foundation for producing data," the story changes. Korea's manufacturing robot density is 1,220 units per 10,000 workers, the highest in the world and 7.5 times the global average of 162 (IFR World Robotics 2025). Having robots already deployed more densely than anywhere else means the physical starting line for standing up teleoperation and proving-ground sites is among the best in the world.
| Country | Manufacturing robot density (units / 10,000 workers) |
|---|---|
| South Korea | 1,220 |
| Singapore | 730 |
| China | 470 |
| Germany | 415 |
| Japan | 397 |
| Global average | 162 |
A strong foundation, though, does not make data accumulate on its own. Robot density is "collection potential," not "learnable data." Turning robots on the floor into data-producing assets, and then making the resulting data usable, are separate problems. That is precisely the subject of the next section.
The Trap of 'Mass Production': How Do You Guarantee Quality?
The fastest lever for catching up on the scale race is synthetic data. NVIDIA generated 780,000 trajectories in 11 hours with its Isaac Sim simulator, a volume that would take a human roughly nine months (6,500 hours) to demonstrate by hand. Training on real and synthetic data together improved the performance of the humanoid foundation model GR00T by about 40% (NVIDIA Isaac Sim / GR00T). The government's plan to run teleoperation and digital twins in parallel is a reasonable design aimed at exactly this efficiency.
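The throughput comparison above reduces to a few divisions. The sketch below uses only the three figures quoted in the paragraph; reading the 6,500 hours as round-the-clock time is an assumption made here to reconcile it with "roughly nine months," since the source does not say how the months were counted.

```python
synthetic_trajectories = 780_000
sim_hours = 11        # simulator time quoted above
human_hours = 6_500   # human demonstration time quoted above

sim_rate = synthetic_trajectories / sim_hours      # trajectories per simulator hour
human_rate = synthetic_trajectories / human_hours  # trajectories per human hour
speedup = human_hours / sim_hours                  # how many times faster the simulator is

# Implied time a human spends on one demonstration, in seconds.
seconds_per_human_demo = human_hours * 3600 / synthetic_trajectories

# Assumption: 24-hour days and 30-day months.
months_round_the_clock = human_hours / 24 / 30

print(f"simulator: {sim_rate:,.0f} trajectories/hour")
print(f"human:     {human_rate:,.0f} trajectories/hour")
print(f"speedup:   about {speedup:.0f}x")
print(f"one human demo: {seconds_per_human_demo:.0f} s")
print(f"human effort:   {months_round_the_clock:.1f} months of continuous work")
```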
But synthesis hits a wall that is hard to cross: the physics gap between simulation and reality, the so-called sim-to-real gap. It stands out especially in precise contact tasks such as fitting parts together. Techniques like domain randomization push some tasks to high success rates, but a contact-critical task such as insertion, when learned in simulation alone, sees success fall to around 45%. Synthesis is a lever for cheaply reinforcing diversity and scale; it does not fully replace real data.
| Sim-trained task / technique | Success rate |
|---|---|
| Domain Randomization | ~93% |
| AutoMate (assembly automation) | ~84.5% |
| TRANSIC (sim-to-real transfer), avg. | ~81% |
| Precise insertion (contact-critical task) | ~45% |
Reading the table top to bottom reveals one trend. General manipulation, trained by randomizing the environment, clears 90%, but the success rate drops steeply as the task moves toward contact-critical fitting of parts. The closer a task gets to physical contact, the harder it is for simulation to stand in for reality, and that is exactly why synthetic data has to be treated as a complement to real data, not a full replacement.
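For readers unfamiliar with domain randomization, the idea is to vary simulator settings from episode to episode so that a policy does not overfit to one rendering or one set of physics constants. The sketch below shows only the sampling step. The parameter names and ranges are hypothetical, and it does not call any real simulator API.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical per-episode simulator settings."""
    light_intensity: float   # relative brightness
    friction: float          # surface friction coefficient
    object_offset_mm: float  # shift of the target part from its nominal position


def sample_params(rng: random.Random) -> SimParams:
    """Draw one randomized configuration; ranges are illustrative only."""
    return SimParams(
        light_intensity=rng.uniform(0.3, 1.5),
        friction=rng.uniform(0.4, 1.2),
        object_offset_mm=rng.uniform(-20.0, 20.0),
    )


rng = random.Random(0)  # fixed seed so the batch is reproducible
episodes = [sample_params(rng) for _ in range(1000)]
print(episodes[0])
```

Randomizing lighting or placement is comparatively easy to cover this way. Contact dynamics during insertion depend on details a few sampled scalars may not capture, which is consistent with the drop at the bottom of the table.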
4.1 The lesson of the 'Data Dam'
Korea has already tried state-led, large-scale data construction once. From 2020 to 2022, the "Data Dam" project invested roughly 1.4 trillion won to build 691 types and 2.6 billion items of AI training data. The results were mixed. Critics called it "a project that degenerated into short-term jobs," "verification reduced to box-ticking," and "an expensive white elephant," and actual usage fell short of expectations. As the center of gravity shifted to the LLM era, the related budget plunged about 60%, from 538.2 billion won in 2022 to 218.8 billion won in 2023.
The lesson is clear. Volume of collection does not guarantee quality. The Data Dam's failure turned not on "how much" data was gathered but on whether it was verified as "usable." And behavior data carries harsher consequences than text. Mislabeled text yields a wrong answer; a badly learned action becomes a collision, a fall, or a malfunction in the physical world.
4.2 What defines the quality of behavior data
So what, concretely, is the "quality" of behavior data? Extending the criteria once used to measure text-data quality into the physical, time-series domain yields five axes.
- Coverage & diversity: Is the distribution of environments, objects, and tasks broad enough? Skew toward one environment leads to overfitting.
- Success/failure labels: Is each demonstration accurately labeled as success or failure? The absence of failure data leaves a policy fragile.
- Temporal alignment: Are sensor inputs and action outputs precisely synchronized? Even a slight misalignment distorts causality.
- Teleoperator variance: Does demonstration quality swing depending on who did the piloting?
- Synthetic distribution fidelity: Does the synthetic data faithfully reflect the real-world distribution (sim-to-real alignment)?
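Four of the five axes above can be turned into simple dataset-level numbers. The sketch below is one possible way to do that, not an established standard: the function names and the sample values are hypothetical, and synthetic distribution fidelity is left out because it needs a real-versus-synthetic comparison that does not fit in a few lines.

```python
import math
from collections import Counter
from statistics import pstdev
from typing import Dict, List


def normalized_entropy(labels: List[str]) -> float:
    """Coverage: 0.0 means one category only, 1.0 means an even spread."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


def failure_share(success_flags: List[bool]) -> float:
    """Labels: fraction of demonstrations labeled as failures."""
    return sum(1 for ok in success_flags if not ok) / len(success_flags)


def max_sync_offset(sensor_ts: List[float], action_ts: List[float]) -> float:
    """Temporal alignment: largest gap in seconds between paired timestamps."""
    return max(abs(s - a) for s, a in zip(sensor_ts, action_ts))


def operator_spread(success_by_operator: Dict[str, List[bool]]) -> float:
    """Teleoperator variance: std. deviation of per-operator success rates."""
    rates = [sum(flags) / len(flags) for flags in success_by_operator.values()]
    return pstdev(rates)


# Invented sample values, for illustration only.
environments = ["line_a"] * 8 + ["line_b"] * 2
coverage = normalized_entropy(environments)
failures = failure_share([True, True, True, False])
offset = max_sync_offset([0.00, 0.10, 0.20], [0.00, 0.11, 0.23])
spread = operator_spread({"op1": [True, True], "op2": [True, False]})

print(f"coverage={coverage:.2f} failures={failures:.2f} "
      f"offset={offset:.3f}s spread={spread:.2f}")
```

What counts as an acceptable value on each axis is task-dependent. The sketch only shows that the axes are measurable, which is the precondition for checking them before training rather than after.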
Even with a million robot trajectories, a model stays trapped in one environment if diversity is low or failure labels are missing. The cheaper collection gets, the more differentiation comes not from "more" but from "better quality."
You Can't Borrow Data, So Quality Is the Strategy
The logic so far comes in three beats. First, behavior data is bound to body and environment, so it cannot be borrowed. Second, the global leaders are breaking through that non-transferability with overwhelming scale, and Korea cannot match them on raw scale. Third, Korea's path forward is therefore quality, not scale. If, for the same volume, data with higher diversity, better labels, and tighter alignment yields a stronger policy, shifting the axis of competition from "how much" to "how well" is the rational choice.
This shift points to the same conclusion at both the national-strategy level and the data-industry level. Once an era opens in which the state "produces" data, demand appears at the same time for a layer that makes that data "usable." If the training center is the factory that makes behavior data, then diagnosing, refining, and verifying its output is a process distinct from the factory itself. As the Data Dam's lesson shows, bolting that process on after the fact, as a formality, only repeats the same failure on a more dangerous stage.
Data cannot be borrowed, and merely collecting it is not enough. Guaranteeing it is the strategy. Quality standards and diagnostic tools for behavior data are upstream infrastructure that should be ready before the training centers go live.
Why Pebblous Is Watching
This announcement runs straight into the subjects Pebblous has long worked on. In the text and image AI-Ready Data space, Pebblous has built a quality methodology for diagnosing, refining, and verifying data (DataClinic, synthetic data, simulation). The moment the government declared "mass production of behavior data" a national task, the need for a layer that turns that data into learnable quality grew right alongside it.
6.1 Data quality is performance, and it is safety
Noise in behavior data, whether teleoperator skill variance, faulty demonstrations, or distribution skew in synthetic data, gets learned directly as wrong behavior in a VLA policy. As the data scaling laws show, what sets the ceiling on performance is not volume but diversity, coverage, and alignment, which lines up exactly with the central claim of data-centric AI. Quality assurance is a precondition for performance and safety, not an option you attach later.
6.2 Practical implications for customers and partners
For companies that will supply data to the training centers or build on top of them, a diagnostic, labeling, and filtering pipeline that judges "is this collected behavior data of learnable quality?" becomes the variable that separates cost from performance. The cheaper collection gets, the more anyone can pile up volume, and the more differentiation comes from quality. That is why the "diagnose, refine, verify" stage takes on a larger role in the data supply chain.
Editor's Note. As a company that works on data quality, Pebblous sees the same question staying central in the behavior-data era: "is this data learnable?" This report is not a promotion of any particular product, but an analysis that reads the "quality" question raised by the government's strategy through the Pebblous lens.
Thank you for reading to the end. Now that the government has placed behavior data at the center of national strategy, the question for the coming year will be not "how much did you collect?" but "at what quality did you collect it?" Pebblous will keep looking into that question alongside you. If you have thoughts on behavior-data quality, or concerns from the field, we would welcome hearing them anytime.
Pebblous Data Communication Team
June 29, 2026
References
Sources for the policy announcements, academic research, and statistics cited in this report.
Academic Papers
- 1. Lin, Fanqi et al. (2024). "Data Scaling Laws in Imitation Learning for Robotic Manipulation." arXiv 2410.18647. arxiv.org/abs/2410.18647
- 2. Open X-Embodiment Collaboration. (2023). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." IEEE ICRA 2024. robotics-transformer-x.github.io
- 3. Bu, Qingwen et al. (2025). "AgiBot World Colosseo: A Large-scale Manipulation Platform for Robot Learning." IROS 2025. arxiv.org/abs/2503.06669
- 4. Khazatsky, Alexander et al. (2024). "DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset." RSS 2024. arxiv.org/abs/2403.12945
Industry & Technology
- 5. Physical Intelligence. (2024). "π0: A Vision-Language-Action Flow Model for General Robot Control." physicalintelligence.company/blog/pi0
- 6. NVIDIA Corporation. (2024). "Isaac Sim and Isaac GR00T: Synthetic Motion Data Generation for Humanoid Robots." NVIDIA Developer. developer.nvidia.com/isaac/sim
- 7. Electrek. (May 2026). "Tesla FSD Surpasses 10 Billion Cumulative Miles." electrek.co
Policy, Press & Statistics
- 8. ZDNet Korea. (June 19, 2026). "MSIT launches K-Physical AI Full-Stack Strategy and Physical AI Alliance Phase 2." zdnet.co.kr
- 9. VentureSquare. (June 19, 2026). "Physical AI Alliance Phase 2: Three Divisions and 15 Action Groups." VentureSquare.
- 10. Hankyung. (June 25, 2026). "Detailing of the Data Training Centers and 2027 Budget Plans." Hankyung.
- 11. International Federation of Robotics. (2025). World Robotics 2025: Robot Density. IFR. ifr.org
- 12. ZDNet Korea et al. (2020–2022). "Digital New Deal 'Data Dam': 1.4 Trillion Won Invested, 691 Types / 2.6 Billion Items Built." ZDNet Korea.
- 13. newstheai.com et al. (2022–2023). "Data Dam Quality Controversy: Criticism of Short-Term Jobs and Box-Ticking Verification."
- 14. MarketsandMarkets. (2025). Humanoid Robot Market: Global Forecast to 2030. (Market-size and CAGR estimates vary widely by source; not directly cited in the body.) marketsandmarkets.com