AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Contributors, AgiBot World

Executive Summary

AgiBotWorld 2026 is an open-source robot manipulation dataset with over one million trajectories, released by Chinese robotics startup AGIBOT in April 2026. One choice sets it apart from competing datasets: it does not discard failed demonstrations. When a robot drops an object or slips mid-grasp, that trajectory is kept and labeled with error_cause and restorable fields rather than filtered out. This choice, assigning value to failure data instead of throwing it away, is fast becoming a precondition for the next generation of robot foundation models.

The performance case holds. In evaluations using the Genie Operator-1 (GO-1) policy model, policies trained on AgiBotWorld achieved a 30% improvement over those trained on Open X-Embodiment and a 32% gain over RDT on complex tasks. The dataset spans 9.36 TB, 217 tasks, and 5 deployment scenarios. The differentiator is not scale but annotation philosophy.

The question this raises for data practitioners is straightforward. The "failed demonstrations" being discarded today may carry exactly the capability that is hardest to teach: how to recover after a mistake. What you keep and what you throw away is, in the end, your definition of data quality.

Key Numbers

The four figures below compress what AgiBotWorld 2026 is and what it achieves. The first two define the dataset's context; the last two measure the impact of its failure-annotation strategy.

Source: AgiBot World Colosseo (arXiv:2503.06669)

1M+ trajectories: 217 tasks, 9.36 TB

95% SayCan discard rate: 276k → 12k kept

+30% performance gain: vs Open X-Embodiment

+32% on complex tasks: vs RDT

1. The 95% Discard Habit

Google DeepMind's SayCan is the textbook case of robot dataset policy. It collected 276,000 episodes and kept only 12,000 — discarding more than 95% because they contained failures. This is not an outlier. It has been the industry standard: filter for successful demonstrations, treat everything else as noise.
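The discard rate follows directly from the two episode counts quoted above. As a quick arithmetic check, using only those figures:

```python
# Episode counts as quoted in the text above.
collected = 276_000
kept = 12_000

discarded = collected - kept
discard_rate = discarded / collected

print(f"discarded: {discarded:,} episodes")   # discarded: 264,000 episodes
print(f"discard rate: {discard_rate:.1%}")    # discard rate: 95.7%
```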

The intuition behind it is clear enough. If you only train on perfect examples, the model should learn to perform perfectly. The problem is that this strips out the situations robots encounter most often in the real world: the fumble and the recovery after it.

Real deployment is not a clean-room setting. Lighting varies, objects sit at unexpected angles, grippers meet friction the simulation never modeled. A model trained only on "how to succeed" has no learned pattern for "what to do when something goes wrong." The trajectories being discarded are not random noise — most of them contain high-quality motion right up to the moment of failure, and that motion carries genuine signal about task structure.

▲ SayCan (left) discards 95%+ and keeps only successes; AgiBotWorld (right) annotates failure trajectories and keeps everything

A trajectory that ends in failure often contains many steps of high-quality motion. The approach, the positioning, the force application — all of it is there, up to the final slip. Discarding the entire trajectory because of that last moment is, in effect, throwing away the most instructive part of the demonstration.
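To make the point concrete, here is a minimal sketch of how much of a failed trajectory precedes the failure. The `Trajectory` class, the `failure_frame` field, and the frame labels are hypothetical illustrations, not part of any dataset schema described here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    frames: List[str]             # placeholder per-frame records (hypothetical)
    failure_frame: Optional[int]  # index of the first failed frame, None on success


def split_at_failure(traj: Trajectory) -> Tuple[List[str], List[str]]:
    """Return (frames before the failure, frames from the failure onward)."""
    if traj.failure_frame is None:
        return traj.frames, []
    return traj.frames[:traj.failure_frame], traj.frames[traj.failure_frame:]


traj = Trajectory(
    frames=["approach", "approach", "position", "grasp", "slip"],
    failure_frame=4,
)
prefix, failed = split_at_failure(traj)
share = len(prefix) / len(traj.frames)
print(f"{share:.0%} of frames precede the failure")  # 80% of frames precede the failure
```

A success-only filter would drop all five frames of this toy trajectory, including the four that precede the slip.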

2. How AgiBotWorld Handles Failure

AgiBotWorld 2026 applies a Hierarchical Annotation Framework to every collected trajectory. The annotation runs across three layers, each capturing a different granularity of the robot's action.

At the task level, each trajectory records long-horizon subtask instructions alongside success indicators. At the object level, 2D bounding boxes track target objects and their interactions frame by frame. At the skill level, atomic actions such as "Pick" and "Place" are segmented with frame boundaries and success markers attached to each.
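The three layers can be pictured as nested records. The sketch below is one possible in-memory representation, assuming hypothetical class and field names; it does not reproduce the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkillSegment:
    """Skill level: one atomic action with frame boundaries and a success marker."""
    name: str         # e.g. "Pick" or "Place"
    start_frame: int
    end_frame: int    # inclusive
    success: bool


@dataclass
class ObjectBox:
    """Object level: a 2D bounding box for a target object in one frame."""
    frame: int
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class TrajectoryAnnotation:
    """Task level: a subtask instruction plus an overall success indicator."""
    instruction: str
    success: bool
    boxes: List[ObjectBox] = field(default_factory=list)
    skills: List[SkillSegment] = field(default_factory=list)

    def failed_skills(self) -> List[SkillSegment]:
        return [s for s in self.skills if not s.success]


example = TrajectoryAnnotation(
    instruction="Put the cup on the shelf",
    success=False,
    boxes=[ObjectBox(frame=0, label="cup", x_min=10, y_min=20, x_max=60, y_max=90)],
    skills=[
        SkillSegment("Pick", 0, 120, True),
        SkillSegment("Place", 121, 200, False),
    ],
)
print([s.name for s in example.failed_skills()])  # ['Place']
```

Keeping a success marker on each skill segment means a failed trajectory still exposes which atomic actions succeeded before the failure.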

▲ AgiBotWorld's hierarchical annotation — task, object, and skill layers capped by two failure-specific fields (error_cause, restorable) that span the whole hierarchy

Two additional fields make the difference. frame_detail.error_cause records why the failure happened, using categories such as gripper slip and position error. frame_detail.restorable flags whether recovery from that failure is possible. With these fields, a failed trajectory is not noise to be removed; it is a labeled event that the model can learn from.
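As a rough illustration of how these two fields could drive curation, the sketch below sorts records into three buckets instead of dropping failures. The field names error_cause and restorable under frame_detail come from the text; the dictionary layout, the string values, and the bucket names are assumptions made for this example.

```python
from collections import Counter


def bucket(record: dict) -> str:
    """Classify a record by its failure annotation (layout is assumed)."""
    detail = record.get("frame_detail") or {}
    if detail.get("error_cause") is None:
        return "success"
    # A failure the annotators marked as recoverable vs. one they did not.
    return "restorable_failure" if detail.get("restorable") else "terminal_failure"


# Hypothetical records; the error_cause strings are illustrative encodings.
records = [
    {"id": "a", "frame_detail": {}},
    {"id": "b", "frame_detail": {"error_cause": "gripper_slip", "restorable": True}},
    {"id": "c", "frame_detail": {"error_cause": "position_error", "restorable": False}},
    {"id": "d", "frame_detail": {"error_cause": "gripper_slip", "restorable": True}},
]

counts = Counter(bucket(r) for r in records)
print(dict(counts))
```

A success-only pipeline would keep one of these four records; a label-aware pipeline keeps all four and can weight or sample each bucket differently.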

The data collection method reinforces this. Rather than following scripted demonstrations, teleoperators respond in real time without a fixed script, an approach known as free-form collection. This naturally produces a range of error types, providing the substrate from which models can learn self-correction priors: the patterns of what to do after something goes wrong.

The numbers reflect the approach. In evaluations using the Genie Operator-1 (GO-1) policy model, AgiBotWorld-trained policies performed 30% better than those trained on Open X-Embodiment and 32% better than RDT on complex tasks. These are not marginal gains: they argue that annotating failure is a measurable performance lever, not just a philosophical stance.

3. Clean Data Gets Redefined

The traditional definition of clean data was a filtered set of successful demonstrations. AgiBotWorld 2026 puts forward a different proposition: a dataset that covers the full range of task-relevant states (success, failure, and recovery) is cleaner in the sense that matters for generalization.
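One way to read this definition is as a coverage check rather than a filter. The sketch below is a minimal illustration, assuming each trajectory carries a single hypothetical outcome label drawn from the three states named above.

```python
from typing import Dict, List, Tuple

REQUIRED_STATES = ("success", "failure", "recovery")


def coverage_report(outcomes: List[str]) -> Tuple[Dict[str, float], List[str]]:
    """Return the share of each required state and the states that never appear."""
    total = len(outcomes)
    shares = {
        state: (outcomes.count(state) / total if total else 0.0)
        for state in REQUIRED_STATES
    }
    missing = [state for state in REQUIRED_STATES if shares[state] == 0.0]
    return shares, missing


# Toy label lists: a success-only set versus a full-coverage set.
filtered = ["success"] * 10
full = ["success"] * 6 + ["failure"] * 2 + ["recovery"] * 2

print(coverage_report(filtered)[1])  # ['failure', 'recovery']
print(coverage_report(full)[1])      # []
```

Under the success-only definition the first set looks perfectly clean; under the coverage definition it is missing two of the three states a deployed policy will encounter.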

▲ The definition of "clean data" is moving from success-only filtering (left) to annotated full-coverage (right)

This shift is not unique to AgiBotWorld. Counterfactual Behavior Cloning (arXiv:2505.10760) and Temporal Behavior Tree-guided trajectory repair (arXiv:2604.04225) point in the same direction: imperfect demonstrations are signals to be annotated and reused, not filtered out. Voxel51's 2026 Physical AI report finds that 59% of teams struggle with poor labels and 47% cannot identify data that hurts model performance. In Physical AI, where incorrect labels directly affect spatial reasoning and safety outcomes, the curation problem is structural.

Editor's Note. Pebblous defines data curation not as "deciding what to discard" but as "deciding what to value." AgiBotWorld 2026 shows this definition holding at the frontier of robotics. In the data pipelines feeding the next generation of robot foundation models, preserving and annotating failure trajectories is becoming less of an option and more of a prerequisite.

Pebblous Data Communication Team
June 22, 2026

R. References

R.1 Academic Papers

1. AgiBot World Contributors. (2025). "AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems." arXiv:2503.06669.
2. Counterfactual Behavior Cloning Team. (2025). "Counterfactual Behavior Cloning: Offline Imitation Learning from Imperfect Human Demonstrations." arXiv:2505.10760.
3. Temporal Behavior Tree Team. (2026). "Temporal Behavior Tree-Guided Trajectory Repair for Robot Learning." arXiv:2604.04225.

R.2 Dataset

4. AGIBOT. (2026). "AgiBotWorld2026." HuggingFace. CC BY-NC-SA 4.0.

R.3 Industry

5. The Robot Report. (2026, April 7). "AGIBOT WORLD 2026 dataset is open-source to accelerate embodied AI development."
6. Humanoid Robotics Technology. (2026). "AGIBOT Launches AGIBOT WORLD 2026 to Power the Next Wave of Embodied AI."