Executive Summary
The humanoid robot population is growing faster right now than at any point in history. Roughly 13,000 units shipped in 2025 alone, and one forecast puts the figure above 250,000 by 2030. Yet there is a paradox. The number of robots is exploding, but the experience those robots hold is not accumulating at the same rate. The manipulation data one team spends months collecting rarely transfers cleanly to another robot, to another organization, or even to the same team six months later. The real bottleneck this report examines is neither a bigger model nor a better actuator. It is data that fails to accumulate because no standard exists to hold it together.
The reason is that robot data demands something fundamentally different from text or images. It is not a simple sensor log but a web of relationships among a robot's body, its actions, the scene it acts in, the execution trace it leaves behind, and the outcome. The fragility is stark: a measured 40-millisecond mismatch between a camera and an IMU can throw position estimates off by as much as 10 meters. If the coordinate frame or the calibration goes unrecorded, the same motion becomes an entirely different signal on a different robot. And because this kind of real-world robot data must be produced by hand, it costs roughly 82 times more per hour than data generated in simulation (about $180 versus $2.20), so every time collected data can't be reused, that cost is paid all over again. Only when these relationships and their physical context are preserved and remain inspectable does data become experience that can be reused across time and across machines.
The ISO/WD 26264-1 draft, published in June 2026, is the first attempt to write that condition into an international standard. It did not appear out of nowhere. It continues the ISO/IEC 5259 series, which first made the quality of data such as text and images measurable. The draft splits robot data into a horizontal infrastructure (covering lifecycle, metadata, provenance, quality, versioning, and traceability) and capability-specific modules such as manipulation, locomotion, and interaction. Making data quality something you can measure and diagnose is becoming a precondition for Physical AI.
| Key figure | What it measures | Detail |
|---|---|---|
| ~100,000× | Data accumulation gap | Language models ≈ 100,000 years of experience vs. largest robot dataset ≈ 1 year |
| 40 ms → 10 m | Timing-induced position error | The failure mode when camera–IMU sync breaks |
| 82× | Real-world data collection cost | Real world $180/hr vs. simulation $2.20/hr |
| ~19× | Shipment growth (2025→2030) | From ~13,000 units to over 250,000 |
Five times as many robots, experience stuck in place
The count of humanoid robots now gains a digit almost every year. By Omdia's tally, 2024 shipments were around 2,600 units; in 2025 they jumped past 13,000, more than a fivefold rise. TrendForce expects the figure to clear 50,000 in 2026, and Goldman Sachs offers a baseline scenario of more than 250,000 units shipped by 2030. Some 85–90% of that surge comes from China. While companies like AgiBot, Unitree, and Leju stamp out robots by the thousands each year, Tesla and Figure AI are expanding pilot deployments on factory floors.
Plotted out, the steepness of that curve is hard to miss. The 2026 and 2030 figures are forecasts.
Annual humanoid robot shipments (units). 2026 and 2030 are forecasts. Sources: Omdia (2024–2025 actuals), TrendForce (2026 forecast), Goldman Sachs Research (2030 forecast).
Look only at the unit curve and you would expect data to swell at the same pace. More robots, more robot-generated data. But more robots do not translate into more experience. Data being recorded and data accumulating into something reusable are two entirely different things. Ken Goldberg of UC Berkeley frames the gap in orders of magnitude. The internet-scale text today's large language models train on amounts, in human terms, to roughly 100,000 years of experience — yet even the largest robot teleoperation dataset ever assembled comes to about one year. Robot learning is more complex than language, and still the data in hand is about 100,000 times smaller.
"We don't have anywhere near enough data to train robots. A hundred thousand years is just the amount of text to train a language model, and training robots is far more complex, so we'll need even more." — Ken Goldberg, UC Berkeley (Science Robotics, 2025)
Three reasons data doesn't accumulate
Why doesn't the gap close? A 2025 survey of embodied-AI data engineering from AIRS at the Chinese University of Hong Kong sorts the causes into three bottlenecks. The first is high collection cost. Robot data has to be produced one motion at a time by a person teleoperating a robot; you cannot scrape it off the web the way you scrape text. The second is data silos. Every organization and every robot piles up data in its own format, so none of it crosses over. The third is the evaluation void: there is no common yardstick for telling good data from bad.
The three bottlenecks are not separate items on a list; they branch from a single root. Expensively collected data gets trapped in a silo because there is no standard, and with no yardstick for its quality, no other team can trust it enough to pick it up. So the next team collects everything from scratch. That is why a roughly 19-fold increase in robots (from about 13,000 units in 2025 to more than 250,000 by 2030) does not yield a 19-fold accumulation of experience. More units is not the same as more data assets. Without standards and verifiable quality, scale is merely the scale of the silo.
When data loses its relationships
Robot data is not a sequence of sensor values. The central point of the arXiv paper "Data Standards for Humanoid Robotics" is that robot data is, at its core, embodied structure. A single manipulation episode binds together the robot's body, the action it performed, the scene it faced, the execution trace that action left behind, and the outcome — all as one relationship. The half-second of data in which an arm grasps a cup means something only when the robot's joint configuration, the cup's position as the camera saw it, the forces and torques in that instant, and the success-or-failure verdict are all coherent within a single coordinate frame.
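One way to picture that relationship is as a single record whose parts are only meaningful together. The following is a minimal sketch, not a schema taken from the paper or from any standard: every class name, field name, and the `is_coherent` check are hypothetical and chosen only to mirror the five elements named above (body, action, scene, execution trace, outcome).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Body:
    robot_model: str                   # hypothetical identifier
    joint_names: list[str]
    joint_positions_rad: list[float]   # one value per joint, in radians


@dataclass
class Action:
    name: str                          # e.g. "grasp"
    target_object: str


@dataclass
class Scene:
    frame_id: str                      # the single frame all positions are expressed in
    object_positions_m: dict[str, tuple[float, float, float]]


@dataclass
class TraceSample:
    t_s: float                         # seconds on one shared clock
    force_n: tuple[float, float, float]
    torque_nm: tuple[float, float, float]


@dataclass
class Episode:
    body: Body
    action: Action
    scene: Scene
    trace: list[TraceSample] = field(default_factory=list)
    success: Optional[bool] = None     # None means the outcome was never recorded

    def is_coherent(self) -> bool:
        """True only if every part of the relationship is present and consistent."""
        return (
            len(self.body.joint_names) == len(self.body.joint_positions_rad)
            and self.action.target_object in self.scene.object_positions_m
            and len(self.trace) > 0
            and self.success is not None
        )


episode = Episode(
    body=Body("example-arm", ["shoulder", "elbow"], [0.4, 1.1]),
    action=Action("grasp", "cup"),
    scene=Scene("robot_base", {"cup": (0.45, -0.10, 0.02)}),
    trace=[TraceSample(0.0, (0.0, 0.0, -2.5), (0.0, 0.0, 0.0))],
    success=True,
)
print(episode.is_coherent())  # → True
```

Dropping any one part, for example an action that targets an object absent from the scene, or a missing success verdict, makes the check fail, which is the sense in which the episode is a relationship rather than a log.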
The trouble is that this relationship breaks the moment it crosses formats. Several enormous datasets have been released, but each expresses the relationship in its own format. Google's Open X-Embodiment uses RLDS, Hugging Face's LeRobot uses its own LeRobotDataset layout (Parquet tables plus encoded video), and AgiBot World is built on HDF5. The very same event, "an arm grasps a cup," is written down differently from dataset to dataset: where the coordinate origin sits, how time was stamped, what the units are. So no matter how large the data grows, its meaning is not preserved across boundaries, and reuse stalls.
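To make the mismatch concrete, the sketch below takes one grasp position recorded under two invented conventions and maps both to a shared form. It is an illustration under stated assumptions only: the `Convention` class, the `to_canonical` function, and the two example conventions are hypothetical and do not describe how RLDS, LeRobot, or HDF5-based datasets actually store data. Frame rotation is deliberately left out, so only unit scaling and origin translation are handled.

```python
import math
from dataclasses import dataclass

# Conversion factors to the canonical units of this sketch (meters, seconds).
LENGTH_TO_M = {"m": 1.0, "cm": 0.01, "mm": 0.001}
TIME_TO_S = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


@dataclass
class Convention:
    length_unit: str
    time_unit: str
    # Where the recording origin sits, expressed in the robot base frame (meters).
    origin_in_base_m: tuple[float, float, float]


def to_canonical(position, timestamp, conv: Convention):
    """Convert one sample to meters and seconds in the robot base frame.

    Handles unit scaling and origin translation only (no rotation).
    """
    scale = LENGTH_TO_M[conv.length_unit]
    pos_m = tuple(p * scale + o for p, o in zip(position, conv.origin_in_base_m))
    t_s = timestamp * TIME_TO_S[conv.time_unit]
    return pos_m, t_s


# The same physical event, written down two different ways (invented values).
dataset_a = Convention("mm", "ms", (0.0, 0.0, 0.0))
dataset_b = Convention("m", "s", (0.20, 0.0, 0.0))

pos_a, t_a = to_canonical((450.0, -100.0, 20.0), 1500.0, dataset_a)
pos_b, t_b = to_canonical((0.25, -0.10, 0.02), 1.5, dataset_b)

same_event = (
    all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(pos_a, pos_b))
    and math.isclose(t_a, t_b)
)
print(same_event)  # → True
```

The conversion is only possible because each convention was written down. If the unit or the origin had gone unrecorded, the raw numbers (450.0 versus 0.25) would give no hint that they describe the same grasp.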
Lay the major robot-learning datasets side by side, scale against format. By trajectory count alone they are all enormous; read the format column alongside it, and their mutual incompatibility comes into view.
| Dataset | Released | Trajectories | Scale / composition | Format |
|---|---|---|---|---|
| RT-1 (Google) | 2022 | 130K | 13 robots, 700+ tasks | TFDS |
| Open X-Embodiment | 2023 | 1M+ | 22 robot types, 34 labs, 60 datasets merged | RLDS |
| DROID | 2024 | 76K | 350 hours, single Franka arm, 18 labs | RLDS |
| AgiBot World | 2025 | 1M+ | 2,976 hours, 87 skills, 106 scenes | HDF5 |
| ARIO | 2024 | 3M+ (estimated) | 258 scenarios, 5-sense multimodal, preprint | ARIO |
ARIO's trajectory count is from a pre–peer-review preprint (arXiv:2408.10899) and is therefore marked "estimated."
Open X-Embodiment is the clearest illustration of this incompatibility. To unify 60 scattered datasets into a single format, RLDS, 34 research labs collaborated. They gathered more than a million trajectories in one place, yet direct compatibility across the 22 embodiments was never fully solved. Working from public figures, our own estimate puts the engineering cost of merely converting the 34 labs' data into a common format at roughly $270,000–$540,000. That is how expensive it is to stitch silos together after the fact, when no standard was there from the start.
The fact that a dataset is enormous and the fact that it gets reused are two separate things. If the relationships aren't preserved across formats, a million trajectories are just a million isolated records. What the standard is after is not the quantity of data but the preservation of the relationships that turn quantity into experience.
Why physical coherence must be transparent
The dimension that tests the preservation of relationships most sharply is physical coherence. Robot data carries one demand that text and images do not: timing, coordinate frames, calibration, kinematics, units, and synchronization must not drift out of agreement with one another. And those six things must be inspectable within the data itself. What time reference was used to stamp each event, where the coordinate origin lies, when and how each sensor was calibrated — all of it has to remain on the record before another system can safely reuse that data.
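A simple way to think about "inspectable within the data itself" is a check that each of the six items is actually present in an episode's metadata before the data is reused. The sketch below assumes a plain dictionary layout; the key names and example values are hypothetical and are not taken from ISO/WD 26264-1 or from any existing dataset format.

```python
# The six coherence items named in the text, as hypothetical metadata keys.
REQUIRED_COHERENCE_FIELDS = (
    "timing",
    "coordinate_frames",
    "calibration",
    "kinematics",
    "units",
    "synchronization",
)


def missing_coherence_fields(metadata: dict) -> list[str]:
    """Return the coherence items that are absent or empty in the metadata."""
    return [name for name in REQUIRED_COHERENCE_FIELDS if not metadata.get(name)]


episode_metadata = {
    "timing": {"clock": "shared hardware clock", "resolution_ns": 1000},
    "coordinate_frames": {"origin": "robot_base", "handedness": "right"},
    "calibration": {},  # present but empty: nothing was actually recorded
    "kinematics": {"model_file": "arm.urdf"},
    "units": {"length": "m", "angle": "rad"},
    # "synchronization" is missing entirely
}

print(missing_coherence_fields(episode_metadata))
# → ['calibration', 'synchronization']
```

A check like this does not prove the recorded values are correct. It only guarantees that a downstream user can see what was assumed, which is the precondition for safe reuse described above.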
What happens when coherence breaks? This is not an abstract worry but a measured failure. The "Data Standards for Humanoid Robotics" paper reports, in hard numbers, how a small timing error swells into a large physical one. A 40-millisecond offset between a camera and an IMU pushes position estimates off by up to 10 meters and rotation by up to 3 degrees. When inter-machine perception timing slips by 849 milliseconds, an object closing at 6.30 meters per second is misestimated at 2.34 — an error of nearly 4 meters per second.
Gather the paper's quantitative failure modes in one place and the pattern is unmistakable. Each one says the same thing: reusing data turns dangerous the moment synchronization assumptions go unrecorded.
| Error type | Magnitude | Consequence |
|---|---|---|
| Camera–IMU timing offset | 40 ms | 10 m position error + 3° rotation error |
| Inter-machine perception timing | 849 ms | velocity estimate 6.30 → 2.34 m/s (3.96 m/s error) |
| Camera–LiDAR synchronization | 34 ms tolerance | at IoU 0.5 threshold (at 40 m/s) |
| Speed-and-separation monitoring | 100 ms uncertainty | 0.2 m of travel at 2 m/s approach speed |
| Audio–video synchronization | +45 ms to −125 ms | human detection threshold |
Source: arXiv:2606.19769, quoted directly. The figures in the table are the values reported in the paper.
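Two of the rows above can be checked with first-order arithmetic: the distance an object covers during a timing offset is its speed multiplied by the offset, and a velocity error is the difference between the true and the estimated value. The sketch below does only that. The function names are ours, and it is not meant to reproduce the camera–IMU or inter-machine results, which the table gives as values reported by the paper rather than as outputs of this simple relation.

```python
def travel_during_offset(speed_m_s: float, offset_ms: float) -> float:
    """Distance in meters covered at constant speed during a timing offset."""
    return speed_m_s * (offset_ms / 1000.0)


def velocity_error(true_m_s: float, estimated_m_s: float) -> float:
    """Absolute difference between true and estimated velocity, in m/s."""
    return abs(true_m_s - estimated_m_s)


# Speed-and-separation monitoring row: 100 ms of uncertainty at 2 m/s.
print(f"{travel_during_offset(2.0, 100.0):.2f} m")   # → 0.20 m

# Inter-machine perception row: 6.30 m/s misestimated as 2.34 m/s.
print(f"{velocity_error(6.30, 2.34):.2f} m/s")       # → 3.96 m/s

# Derived here, not quoted from the paper: travel during the 34 ms
# camera-LiDAR tolerance at the 40 m/s speed given in that row.
print(f"{travel_during_offset(40.0, 34.0):.2f} m")   # → 1.36 m
```

Even this crude relation shows why the offset has to be on the record: the same 100 ms is harmless for a stationary object and a 0.2 m discrepancy for one approaching at 2 m/s.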
What these numbers say is plain. If coordinate frames, calibration, and synchronization go unrecorded, the same motion data takes on an entirely different meaning for a different robot. Incoherence in the data carries straight through into a rupture in the model's internal representations. That is why physical coherence is the backbone of robot data quality. Making data reusable comes down, in the end, to keeping these six things transparently on the record.
Where transfer falls to 0%
The absence of coherence shows most dramatically in cross-embodiment transfer. Deploy a diffusion policy trained on one robot directly onto a new robot of different structure, and one experiment found the success rate drops to 0% — even though the same policy hit 81% on its original robot. Conversely, when data from several robots is bundled into a compatible form and trained together, as in Open X-Embodiment, success on out-of-distribution tasks improves by 50–200%. What made the difference was not the quantity of data but a compatible representation that carries across machines.
Built in two layers, inherited from ISO 5259
ISO/WD 26264-1 approaches the problem by splitting it into two layers. Its full title is "Humanoid robot datasets — Part 1: General requirements," and it is under development in working group WG 16 of ISO/TC 299, the robotics technical committee. The draft divides robot data into a horizontal infrastructure and capability-specific modules. The horizontal infrastructure is the foundation laid under all robot data — covering lifecycle, metadata, provenance, quality, versioning, and traceability. On top of it sit capability modules such as manipulation, locomotion, human-robot interaction (HRI), and cognition.
Picture how the two layers interlock and the point comes through clearly: whatever capability sits on top, the same data-quality infrastructure runs beneath it.
[Figure: Capability Modules (e.g., HRI) stacked on top of the Horizontal Infrastructure]
The two-layer structure proposed by ISO/WD 26264-1. The horizontal infrastructure is the data-quality foundation shared by every capability module.
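The two-layer idea can be sketched as a record in which one shared block of quality fields is mandatory and capability-specific payloads are attached on top. This is an illustration only: the class names, field names, and example values are hypothetical, and the draft's actual schema is not reproduced here. The six infrastructure fields and the four module names simply follow the lists given in the text.

```python
from dataclasses import dataclass, field
from typing import Any

# Capability modules named in the text (lower-cased here as hypothetical keys).
CAPABILITY_MODULES = {"manipulation", "locomotion", "hri", "cognition"}


@dataclass
class HorizontalInfrastructure:
    """Fields shared by every dataset record, whatever capability it covers."""
    lifecycle_stage: str
    metadata: dict[str, Any]
    provenance: str
    quality: dict[str, float]
    version: str
    trace_id: str


@dataclass
class DatasetRecord:
    infrastructure: HorizontalInfrastructure            # always required
    modules: dict[str, dict[str, Any]] = field(default_factory=dict)

    def add_module(self, capability: str, payload: dict[str, Any]) -> None:
        """Attach a capability-specific payload on top of the shared layer."""
        if capability not in CAPABILITY_MODULES:
            raise ValueError(f"unknown capability module: {capability}")
        self.modules[capability] = payload


record = DatasetRecord(
    infrastructure=HorizontalInfrastructure(
        lifecycle_stage="collected",
        metadata={"robot": "example-humanoid"},
        provenance="teleoperation, example lab",
        quality={"completeness": 0.98},
        version="1.0.0",
        trace_id="episode-0001",
    )
)
record.add_module("manipulation", {"task": "grasp cup"})
record.add_module("locomotion", {"terrain": "flat floor"})
print(sorted(record.modules))  # → ['locomotion', 'manipulation']
```

The design point is that a record cannot be built without the infrastructure block, while modules are optional and additive. That mirrors the claim that the same data-quality foundation runs beneath every capability.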
A text-quality standard crosses over to robots
The idea of a horizontal infrastructure is not new. A standard for measuring and managing the quality of data — text, images, and more — already exists: the ISO/IEC 5259 series. 5259 defines the quality characteristics of data, how to measure them, and how to manage them within a governance framework. ISO/WD 26264-1 extends this concept to robot data. It carries over 5259's skeleton of quality characteristics, measurement, and governance, and adds two dimensions native to robots: the physical coherence seen in the previous section, and the preservation of embodied relationships.
Position it this way and the character of the standard becomes clear. 26264 is not a separate rulebook for a robot-only world but the natural next chapter of a data-quality standard that began with text. Whatever the data is — a sentence, an image, or a robot's motion — the principle is the same: for it to become a reusable asset, its provenance, quality, and version must be recorded in a measurable way. Robot data simply adds one more, and the most exacting, test: agreement with the physical world.
Still just a draft, but the fork is already here
One thing needs to be stated plainly. What has been published is a WD, a Working Draft. It is the earliest drafting stage on the long road an ISO standard travels. From a WD it moves through a Committee Draft (CD) and a Draft International Standard (DIS) before reaching a full International Standard (IS), and that journey usually takes years. So ISO/WD 26264-1 is not a rule to follow today. It carries no normative force. But it is a map you can read in advance to see where the standard is heading.
Simplified, the stages the standard must pass through line up as the timeline that follows. The years of empty space between each stage are the time in which the industry has to make its choice.
| Stage | Full name | Status / activity |
|---|---|---|
| WD | Working Draft | 2026 · current position |
| CD | Committee Draft | committee circulation & consensus |
| DIS | Draft International Standard | member-body ballot |
| IS | International Standard | years out |
During that gap the industry stands at a fork. One road is the data moat. The giants can stockpile vast data in their own formats and enjoy a short-term edge. China's AgiBot ecosystem and, in the US, Tesla, Figure, and NVIDIA accumulating their own data all lean this way. The other road is an interoperable ecosystem — pooling data in a common format to use together, as the Open X-Embodiment consortium and Hugging Face's LeRobot do.
Which road pays off over the long run is something the cost structure hints at. Collecting an hour of real-world multimodal data costs about $180, while producing the same data in simulation costs about $2.20 — an 82-fold difference. If real-world data is this expensive, the ability to reuse well-collected data across time and across machines is what governs ROI. An organization that has stockpiled data in standard-conformant form accumulates experience like compound interest, while data locked in a proprietary format is spent, disposably, every time.
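The arithmetic behind that argument is short enough to write out. The sketch below uses the two hourly costs quoted above. The reuse model is our own simplification, not a figure from any source: it spreads the one-time collection cost evenly over the number of times the data is used and ignores storage, conversion, and curation costs.

```python
REAL_WORLD_USD_PER_HOUR = 180.0   # quoted in the text
SIMULATION_USD_PER_HOUR = 2.20    # quoted in the text


def cost_ratio() -> float:
    """How many times more an hour of real-world data costs than simulation."""
    return REAL_WORLD_USD_PER_HOUR / SIMULATION_USD_PER_HOUR


def effective_cost_per_hour(times_used: int) -> float:
    """Collection cost per hour of data, spread over each time it is used."""
    if times_used < 1:
        raise ValueError("times_used must be at least 1")
    return REAL_WORLD_USD_PER_HOUR / times_used


print(f"{cost_ratio():.0f}x")  # → 82x

# Hypothetical reuse counts: data used once vs. reused across projects.
for uses in (1, 5, 20):
    print(uses, effective_cost_per_hour(uses))
# → 1 180.0
# → 5 36.0
# → 20 9.0
```

Under this simplified model, data that can be used only once costs the full $180 per hour every time, while each additional reuse divides the effective cost. That is the compounding effect the paragraph above describes.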
The standard is still a draft, but the direction is already set. Whoever records data as measurable quality, and preserves its relationships and physical context, claims future-readiness first. The years of empty space between WD and IS are not time to fall behind — they are time to align early.
Editor's Note
What ISO/WD 26264-1 groups into a horizontal infrastructure (lifecycle, provenance, quality, versioning, traceability) is the same skeleton as the axes along which Pebblous has been diagnosing text and image data in DataClinic. Seen from that vantage point, where data quality is turned into measurable indicators and the preservation of relationships and context is treated as a condition of model performance, this report captures the moment when the AI-Ready Data problem, which began with text, crosses over into Physical AI.
References
Academic
1. Liu, S. et al. (2026). "Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI." arXiv:2606.19769.
2. Open X-Embodiment Collaboration. (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024. arXiv:2310.08864.
3. Khazatsky, A. et al. (2024). "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset." RSS 2024. arXiv:2403.12945.
4. Shi, P. et al. (2024). "All Robots in One (ARIO): A Comprehensive Unified Embodied Dataset." arXiv:2408.10899. (preprint)
5. Goldberg, K. (2025). "On the Data Gap in Robot Learning." Science Robotics. (incl. IEEE RAS Automation Research Spotlights talk)
6. AIRS, CUHK. (2025). "A Survey of Embodied Artificial Intelligence Data Engineering." (three bottlenecks: collection cost, data silos, evaluation void)
Standards · Policy · Statistics
7. ISO/TC 299/WG 16. (2026). "ISO/WD 26264-1 Humanoid robot datasets — Part 1: General requirements." (Working Draft)
8. ISO/IEC. "ISO/IEC 5259 series — Data quality for analytics and machine learning." (lineage standard)
9. Goldman Sachs Research. (2025). "Humanoid Robot: The AI Accelerant." (250,000-unit baseline scenario for 2030)
10. Omdia / Bloomberg. (2026, January 8). "Chinese Firms Dominated Global Humanoid Robot Shipments in 2025." (~13,000 units in 2025)
11. ARIA (Advanced Research + Invention Agency). (2025). "Position Paper: Revolutionising the Robotics Ecosystem Through Enhanced Interoperability."
12. MarketsandMarkets. (2025). "Humanoid Robot Market — Global Forecast 2025–2030." (CAGR 39.2%)
Industry · Press
13. The Robot Report. (2026). "AgiBot World 2026 dataset open-source to accelerate embodied AI development."
14. Hugging Face / LeRobot. (2025). "LeRobot Datasets — October 2025 Update." (PyTorch-based de facto data format)
※ Figures from the paper with a future identifier (arXiv:2606.x) are attributed to the values the paper reports. Some figures — shipments (Omdia ~13,000 vs. Counterpoint ~16,000), format-conversion cost ($270K–$540K, our own estimate) — are presented together with their source or estimation basis.