Summary: 1X World Model

October 2025

The 1X World Model introduces a new paradigm for robotic learning that prioritizes imagination over direct action mapping. Instead of training robots primarily on costly teleoperation data, the system learns from large-scale human video, extracting physical and behavioral priors about how the world works. Given a visual scene and a language prompt, the model generates a short video depicting a plausible future outcome. An inverse dynamics model then converts this imagined future into executable motor commands. By separating "what should happen" from "how to do it," the approach improves generalization, reduces data collection costs, and aligns more naturally with humanoid embodiment. This architecture enables robots to scale learning through observation and prediction rather than task-specific supervision.

Technical Breakdown (System-Level)

Core Components

The first core component is the World Model (Generative Video Model). It takes as input the current visual observation combined with a text instruction and outputs a short video predicting future world states. This model learns physical dynamics, object affordances, and motion priors from large-scale video. Critically, it is trained primarily on human video, not robot trajectories.
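The world model's input/output contract can be sketched as follows. This is a minimal, illustrative stub, not the actual 1X interface: the function name `predict_future`, the frame shapes, and the default horizon are all assumptions made for the example.

```python
import numpy as np

def predict_future(frame: np.ndarray, instruction: str, horizon: int = 8) -> np.ndarray:
    """Hypothetical world-model interface: given the current RGB frame
    (H, W, 3) and a language instruction, return `horizon` predicted
    future frames with shape (horizon, H, W, 3).

    The real component is a large generative video model trained on
    human video; this stub simply repeats the input frame to make the
    contract concrete and runnable."""
    assert frame.ndim == 3 and frame.shape[-1] == 3
    return np.repeat(frame[None], horizon, axis=0)

# Example: imagine 8 future frames for a 64x64 scene.
obs = np.zeros((64, 64, 3), dtype=np.uint8)
future = predict_future(obs, "place the cup on the shelf")
```

The point of the contract is that the output lives entirely in observation space (video frames), with no reference to any robot's action space.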

The second core component is the Inverse Dynamics Model (IDM). It takes as input the current state combined with predicted future frames and outputs low-level robot actions such as joint commands and control signals. The IDM grounds the imagined future into physically executable behavior.
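The IDM's contract can be sketched the same way. The function name, the zero-valued output, and the `action_dim` placeholder are illustrative assumptions; a real IDM is a learned network trained on robot data.

```python
import numpy as np

def inverse_dynamics(current_frame: np.ndarray,
                     future_frames: np.ndarray,
                     action_dim: int = 20) -> np.ndarray:
    """Hypothetical IDM interface: infer the low-level action (e.g. joint
    commands) that carries the robot from each frame to the next along
    the imagined trajectory, returning shape (T, action_dim).

    `action_dim` is a placeholder, not the real actuator count; the
    stand-in below returns zero commands, one per predicted frame."""
    t = future_frames.shape[0]
    return np.zeros((t, action_dim))

# One action vector per imagined future frame.
actions = inverse_dynamics(np.zeros((64, 64, 3)), np.zeros((8, 64, 64, 3)))
```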

Key Architectural Idea

The system decouples planning from control. The world model answers "what should happen next?" while the inverse dynamics model answers "how do I make that happen with this body?" This separation avoids brittle end-to-end mappings from pixels to actions and allows the model to reason at a higher semantic and physical level before committing to control.
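The decoupling described above can be made concrete as a two-stage pipeline. Both stubs below are stand-ins invented for this sketch (names, shapes, and the 20-dimensional action space are assumptions), but the composition mirrors the architecture: the planner never touches the action space.

```python
import numpy as np

def imagine(frame: np.ndarray, instruction: str, horizon: int = 8) -> np.ndarray:
    """Stage 1 (planning): 'what should happen next?' -- predict future
    frames in observation space. Stand-in: repeat the current frame."""
    return np.repeat(frame[None], horizon, axis=0)

def ground(frame: np.ndarray, future: np.ndarray, action_dim: int = 20) -> np.ndarray:
    """Stage 2 (control): 'how do I make that happen with this body?' --
    derive one action per predicted frame. Stand-in: zero commands."""
    return np.zeros((future.shape[0], action_dim))

def act(frame: np.ndarray, instruction: str) -> np.ndarray:
    """Behavior emerges from composing the two stages; only the second
    stage knows anything about the robot's embodiment."""
    future = imagine(frame, instruction)   # semantic/physical reasoning
    return ground(frame, future)           # embodiment-specific grounding

commands = act(np.zeros((64, 64, 3)), "open the drawer")
```

Because the stages communicate only through predicted frames, either side can in principle be retrained or swapped (e.g. for a different body) without touching the other.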

Why Video Works

Internet video encodes object interaction patterns, human motion trajectories, implicit physics constraints, and causal relationships between action and outcome. Because 1X's robots are humanoid and kinematically similar to humans, these learned priors transfer more directly to robotic embodiment.

Comparison to VLA-Based Systems

Traditional VLA Approach

Vision-Language-Action systems directly map vision plus language to actions. This approach requires large amounts of robot-specific data and is often brittle outside the training distribution. VLA systems struggle with long-horizon reasoning, and the action space is tightly coupled to training tasks.

World Model Approach (1X)

In contrast, the world model approach maps vision plus language to an imagined future, then derives actions from that prediction. It learns from scalable human video data, achieves stronger generalization to unseen tasks and objects, and supports long-horizon planning through prediction. Action generation is mediated by physical plausibility.

Key Difference

VLAs treat action as the primary output. World models treat prediction as the primary output and derive action secondarily. This mirrors human cognition more closely: humans imagine outcomes before acting rather than directly computing motor commands from perception.
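The two paradigms can be contrasted as interface shapes. These protocols are a hypothetical formalization for illustration, not APIs from either system: a VLA exposes one perception-to-action mapping, while a world-model policy exposes prediction and action derivation as separate steps.

```python
from typing import Protocol, runtime_checkable
import numpy as np

@runtime_checkable
class VLAPolicy(Protocol):
    """VLA paradigm: a single mapping whose primary output is the action."""
    def act(self, frame: np.ndarray, instruction: str) -> np.ndarray: ...

@runtime_checkable
class WorldModelPolicy(Protocol):
    """World-model paradigm: the primary output is a predicted future;
    the action is derived from that prediction in a second step."""
    def predict(self, frame: np.ndarray, instruction: str) -> np.ndarray: ...
    def derive(self, frame: np.ndarray, future: np.ndarray) -> np.ndarray: ...
```

Splitting the interface this way is what lets the prediction step be trained on human video alone, since `predict` never mentions robot actions.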

Why This Matters

As hardware and low-level control become commoditized, the bottleneck shifts to generalization, reasoning, and adaptability. World models offer a path toward robots that can learn broadly, plan flexibly, and act robustly in real-world environments.