The Test-Time Compute Evolution of VLA Models

January 26, 2026

A year ago, OpenAI impressed the AI world with o1's reasoning traces. Now, the same revolution is happening in robotics.

Over the past month, frontier labs have released a new generation of Vision-Language-Action (VLA) models that represent a fundamental shift in how robots think and act. These aren't just incremental improvements; they're following the same evolutionary playbook that transformed large language models. The key? Increasing test-time compute through intermediate reasoning traces.

The First Generation: Promise and Limitations

The first generation of VLA models, such as pi0.6 and gr00t N1.5, showed remarkable promise. These models successfully operated across different embodiments, including humanoids with 20+ degrees of freedom. They were single-pass models, mapping observations directly to actions with no intermediate reasoning, and for many simple tasks, they worked.

But their Achilles' heel was generalization. These models struggled when environments changed: different lighting conditions, new backgrounds, or variations in object colors would cause performance to degrade significantly. The models hadn't learned to truly understand their physical environment; they had memorized patterns from their training data.

The naive solution was to throw more data at the problem. Tools like NVIDIA's DreamGen offered a way to multiply datasets by generatively swapping video backgrounds, creating artificial variety. While useful as a scaling tool, this approach proved insufficient: you can't imitate your way to true generalization, no matter how much synthetic data you generate.

Learning to Reason from LLMs

To understand what's happening in robotics now, we need to look back at the LLM breakthrough. OpenAI's o1 model didn't just generate answers; it generated reasoning traces. The model was trained to form a cohesive logical chain of thought before responding, rather than jumping straight to an answer.

This required two key components: training data that included good reasoning traces, and explicit rewards for generating those traces. The result was a decoder-only transformer that had been conditioned to think step-by-step, making its reasoning process visible and verifiable.
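The two ingredients above can be sketched concretely. This is a toy illustration, not o1's actual training setup: the `<think>`/`<answer>` markers, the target format, and the reward function are all invented for the example, but they show the shape of the idea, namely that targets contain an explicit trace before the answer, and the reward favors outputs that show their work.

```python
def build_target(reasoning_steps, answer):
    """Serialize a training target: reasoning trace first, answer last."""
    trace = "\n".join(f"<think> {step}" for step in reasoning_steps)
    return f"{trace}\n<answer> {answer}"

def trace_reward(output, correct_answer):
    """Toy reward: +1 for a correct answer, +0.5 bonus for showing work."""
    reward = 0.0
    if f"<answer> {correct_answer}" in output:
        reward += 1.0
    if "<think>" in output:
        reward += 0.5
    return reward

target = build_target(
    ["the cart holds 3 apples", "each apple costs 2 coins"],
    "6 coins",
)
print(trace_reward(target, "6 coins"))  # 1.5: correct answer, and it shows its work
```

A bare answer with no trace would only score 1.0 here, so optimizing this reward pushes the model toward emitting the intermediate reasoning, which is exactly where the extra test-time compute goes.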

The improvement wasn't just incremental; it was transformative. By increasing test-time compute and making the model's reasoning explicit, o1 achieved capabilities that seemed out of reach for previous-generation models.

Alpamayo: Text-Based Reasoning for Autonomous Driving

NVIDIA's Alpamayo brings this same approach to autonomous driving. Instead of directly outputting driving actions, Alpamayo generates text-based reasoning traces that express physical causality and laws. The model thinks through what it sees and grounds its decisions in physically feasible realities.

Consider this example from Alpamayo's demonstrations: when the model observes a ball rolling, it reasons "ball rolling could be a hazard." This isn't a scene-specific reaction; it's an expression of fundamental physical understanding. The reasoning applies universally, regardless of whether the ball appears in a suburban neighborhood, a parking lot, or an unfamiliar environment.

This is fundamentally different from training a model to avoid balls in 50 different scenes and hoping it develops an "instinctive" avoidance response. Alpamayo reasons about abstract physical laws that are universally true, making it far more generalizable than pattern-matching approaches. The model doesn't just react; it understands why it's reacting.
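The reason-then-act pattern can be sketched in a few lines. To be clear, the rule table and action names below are invented for illustration; Alpamayo learns such associations from data rather than looking them up. The point is the structure: an explicit trace citing a general physical rule is produced first, and the action is derived from the trace, so the same rule fires in any scene.

```python
# Hypothetical physical rules, keyed by observation. Invented for this sketch.
PHYSICAL_RULES = {
    "ball rolling": "a rolling object may be followed by a pedestrian; treat it as a hazard",
    "wet road": "reduced friction lengthens braking distance",
}

def reason_then_act(observations):
    """Produce a reasoning trace first, then derive the action from it."""
    trace = [PHYSICAL_RULES[o] for o in observations if o in PHYSICAL_RULES]
    # Any hazard-like inference triggers caution; otherwise proceed.
    action = "slow_down" if any("hazard" in r or "braking" in r for r in trace) else "proceed"
    return trace, action

trace, action = reason_then_act(["ball rolling", "clear sky"])
print(action)  # slow_down: the rule fires regardless of the scene around the ball
```

Because the action is a function of the trace rather than of the raw scene, swapping the neighborhood for a parking lot changes nothing: the same abstract rule still applies.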

1XWM: Imagining Futures Through Video

While Alpamayo uses text to reason, 1X's world model (1XWM) takes a different approach: it imagines futures through video. This model increases test-time compute not through language, but by generating visual simulations of possible outcomes and selecting the best path forward.

The training pipeline is sophisticated. The model starts with pre-training on internet video data, giving it a foundation in general physics. It then undergoes mid-training on ego-centric human data, learning how humans navigate and manipulate their environment. Finally, post-training on 1X's own humanoid embodiment data familiarizes the model with the capabilities and constraints of its robot.

When the robot needs to pick up a cup, 1XWM doesn't just execute a single planned trajectory. Instead, it imagines multiple possible trajectories from point A to point B. Some trajectories might spill liquid; others might result in collision with nearby obstacles. An intermediate reasoning step evaluates these imagined futures and selects the optimal trajectory to execute.

The model is literally simulating different realities before acting, running experiments in its mind to find the best approach.
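This sample-and-select loop has a simple skeleton. The rollout and scoring functions below are toy stand-ins for 1XWM's learned video world model (here, 1-D paths with random jitter, scored against a made-up jerkiness penalty), but the control flow is the technique itself: imagine several candidate futures, evaluate each, and execute only the best one.

```python
import random

random.seed(0)  # make the sampled candidates reproducible

def imagine_trajectory(start, goal, steps=5):
    """Sample one noisy candidate path from start to goal (toy rollout)."""
    path = [start]
    for i in range(1, steps + 1):
        t = i / steps
        jitter = random.uniform(-0.2, 0.2)  # stand-in for model uncertainty
        path.append(start + t * (goal - start) + jitter)
    return path

def score(path, goal):
    """Toy evaluator: penalize jerky motion (spills) and missing the goal."""
    jerk = sum(abs(b - a) for a, b in zip(path, path[1:]))
    return -jerk - abs(path[-1] - goal)

candidates = [imagine_trajectory(0.0, 1.0) for _ in range(8)]
best = max(candidates, key=lambda p: score(p, 1.0))
# Only `best` would be handed to the controller for execution.
```

Note that test-time compute scales with the number of imagined candidates: sampling more futures costs more inference but raises the odds that one of them avoids the spill or the collision.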

The Human Connection: Why Both Approaches Make Sense

What's striking about these two approaches is how deeply they resonate with human cognition. Different people reason differently, but research into consciousness suggests we develop virtual models of ourselves and simulate different realities before acting. This is exactly what 1XWM does: it creates a virtual model of its embodiment and tests different scenarios.

At the same time, anyone who has worked through a difficult problem knows the experience of inner monologue. We talk ourselves through challenges, building sequential chains of reasoning. This is Alpamayo's approach: explicit linguistic reasoning about cause and effect.

Both approaches are grounded in human intuition about how thinking works. We both simulate outcomes and narrate our reasoning. As VLA models approach the sophistication of human minds, it makes sense that they would develop both capabilities.

What's Next: Convergence and Implications

My take on the next generation of models: a combination of both approaches. Imagine a robot that can both imagine visual futures and articulate its reasoning about physical laws. Text-based reasoning could guide the generation of video simulations, while visual imagination could ground abstract reasoning in concrete possibilities.

This convergence will enable entirely new applications. Robots that can truly generalize across environments and tasks will proliferate through the physical world. The bottleneck won't be the models' capability; it will be our imagination about where to deploy them.

Footnote

Decoder-only transformer: A neural network architecture that generates outputs sequentially, predicting one token at a time based on previous tokens. Unlike encoder-decoder models, these transformers only have the generation component, making them well-suited for autoregressive tasks like language generation and, as it turns out, reasoning trace generation.
