1X World Model

In machine learning, a world model is a computer program that can imagine how the world evolves in response to an agent’s behavior. Building on advancements in video generation and world models for autonomous vehicles, we have trained a world model that serves as a virtual simulator for our robots.

From the same starting image sequence, our world model can imagine multiple futures from different robot action proposals.

It can also predict non-trivial phenomena such as rigid-body interactions, the effects of dropping objects, partial observability, deformable objects (curtains, laundry), and articulated objects (doors, drawers, curtains, chairs).

In this post we’ll share why world models for robots are important, the capabilities and limitations of our current models, and a new dataset and public competition to encourage more research in this direction.

The Robotics Problem

World models solve a very practical and yet often overlooked challenge when building general-purpose robots: evaluation. If you train a robot to perform 1000 unique tasks, it is very hard to know whether a new model has made the robot better at all 1000 tasks, compared to a prior model. Even the same model weights can experience a rapid degradation in performance in a matter of days due to subtle changes in the environment background or ambient lighting.

An example T-shirt folding model we trained that degrades in performance over the course of 50 days.

If the environment keeps changing over time, then old experiments performed in that environment are no longer reproducible because the old environment no longer exists! This problem gets worse if you are evaluating multi-task systems in a constantly-changing setting like the home or the office. This makes careful robotic science in the real world frustratingly hard.

Careful measurement of capabilities allows one to predict how capabilities will scale as data, compute, and model size increase. These “scaling laws” justify the enormous investment that goes into general-purpose AI systems like ChatGPT. If robotics is to have its “ChatGPT moment”, we must first establish its scaling laws.
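To make the idea of a scaling law concrete, here is a minimal sketch that fits a power law, loss ≈ a · compute^(-b), to a handful of measurements and then extrapolates it. The functional form, the function names, and all numbers are illustrative assumptions; the data is synthetic and is not a measurement from any real system.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * np.asarray(compute, dtype=float) ** (-b)

# Synthetic, noise-free points generated from a known law (a=5.0, b=0.1).
compute = np.array([1e3, 1e4, 1e5, 1e6])
loss = 5.0 * compute ** (-0.1)

a, b = fit_power_law(compute, loss)
extrapolated = float(predict_loss(a, b, 1e8))
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e8: {extrapolated:.3f}")
```

A fit like this is only as trustworthy as the measurements behind it, which is why reproducible evaluation matters so much.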

Other Ways To Evaluate

Physics-based simulators (Bullet, MuJoCo, Isaac Sim, Drake) are a reasonable way to quickly test robot policies. They are resettable and reproducible, allowing researchers to carefully compare different control algorithms. However, these simulators are mostly designed for rigid-body dynamics and require a lot of manual asset authoring. How do you simulate robot hands opening a cardboard box of coffee filters, cutting fruit with a knife, unscrewing a frozen jar of preserves, or interacting with other intelligent agents like humans? Everyday objects and animals encountered in home environments are notoriously difficult to simulate, so simulation environments used in robotics tend to be visually sterile and lack the diversity of the real-world use case. Small-scale evaluation on a limited number of tasks, in real or sim, is not predictive of large-scale evaluation in the real world.

World Models

We’re taking a radically new approach to evaluation of general-purpose robots: learning a simulator directly from raw sensor data and using it to evaluate our policies across millions of scenarios. By learning a simulator directly from real data, you can absorb the full complexity of the real world without manual asset creation.
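The evaluation loop this implies can be sketched in a few lines: a policy picks an action, the learned simulator predicts the next state, and a success check scores the episode. Everything named below (`evaluate_policy`, `toy_policy`, `toy_world_step`, `reached_goal`) is hypothetical, and the one-dimensional toy dynamics stand in for a learned video model purely so the example runs.

```python
from typing import Callable, Sequence

def evaluate_policy(
    policy: Callable[[int], int],
    world_step: Callable[[int, int], int],
    is_success: Callable[[int], bool],
    initial_states: Sequence[int],
    horizon: int,
) -> float:
    """Return the fraction of start states from which the policy succeeds."""
    successes = 0
    for state in initial_states:
        for _ in range(horizon):
            action = policy(state)
            state = world_step(state, action)  # imagined transition, not a real one
            if is_success(state):
                successes += 1
                break
    return successes / len(initial_states)

# Toy stand-ins (assumptions): the state is a position on a line, the goal is 0.
def toy_policy(state: int) -> int:
    return -1 if state > 0 else 1

def toy_world_step(state: int, action: int) -> int:
    return state + action

def reached_goal(state: int) -> bool:
    return state == 0

success_rate = evaluate_policy(
    toy_policy, toy_world_step, reached_goal,
    initial_states=[-3, 2, 5, -10], horizon=5,
)
print(success_rate)  # 0.75: the start at -10 is too far to reach 0 in 5 steps
```

Because the start states and the simulator are fixed, the same evaluation can be rerun unchanged on a newer policy.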

Over the last year, we’ve gathered thousands of hours of data on EVE humanoids doing diverse mobile manipulation tasks in homes and offices and interacting with people. We combined the video and action data to train a world model that can anticipate future video from observations and actions.

Action Controllability

Our world model can generate diverse outcomes from different action commands. Below we show generations that condition the world model on four different trajectories, each of which starts from the same initial frames. The examples shown were not included in the training data.
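Conditioning on different trajectories from the same starting frames amounts to an autoregressive rollout, which the sketch below illustrates. It is not our model: `rollout` and `toy_step` are hypothetical names, and the "world model" here simply shifts a bright pixel left or right so that the example is runnable.

```python
from typing import Callable, List, Sequence
import numpy as np

Frame = np.ndarray
StepFn = Callable[[Sequence[Frame], int], Frame]

def rollout(step_fn: StepFn, context: Sequence[Frame], actions: Sequence[int]) -> List[Frame]:
    """Generate one predicted frame per action, feeding predictions back in."""
    history = list(context)
    generated = []
    for action in actions:
        next_frame = step_fn(history, action)
        history.append(next_frame)
        generated.append(next_frame)
    return generated

def toy_step(history: Sequence[Frame], action: int) -> Frame:
    """Stand-in for a learned model: shift the last frame by `action` pixels."""
    return np.roll(history[-1], shift=action)

# One shared starting frame, three different action proposals.
context = [np.array([0, 0, 1, 0, 0])]
proposals = {"left": [-1, -1], "stay": [0, 0], "right": [1, 1]}
futures = {name: rollout(toy_step, context, acts) for name, acts in proposals.items()}

for name, frames in futures.items():
    print(name, frames[-1].tolist())
```

Each proposal ends in a different final frame even though all three share the same context, which is the property described above.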

The main value of the world model comes from simulating object interactions. In the following generations, we provide the model the same initial frames and three different sets of actions to grasp boxes. In each scenario, the box(es) grasped are lifted and moved in accordance with the motion of the gripper, while the other boxes remain undisturbed.

Even when actions are not provided, the world model generates plausible video. For example, it has learned that people and obstacles should be avoided when driving:

Long-Horizon Tasks

We can also generate long-horizon videos. The example below simulates a complete T-shirt folding demonstration. T-shirts and other deformable objects tend to be difficult to implement in rigid-body simulators.

Current Failure Modes

Object Coherence

Our model can fail to maintain the shape and color of objects during interaction, and at times, objects may completely disappear. Additionally, when objects are occluded or displayed at unfavorable angles, their appearance can become distorted throughout the generation.

Laws of Physics

The generation on the left demonstrates that our model has an emergent understanding of physical properties, as evidenced by the spoon falling to the table when released by the gripper. However, there are many instances where generations fail to adhere to physical laws, such as on the right where the plate remains suspended in the air.

Self-recognition

We placed EVE in front of a mirror to see if generations would result in mirrored actions, but we did not see successful recognition or “self-understanding”.

World Model Challenge

As shown by the examples above, there is still much work to be done. World models have the potential to solve general-purpose simulation and evaluation, enabling robots that are safe, reliable, and intelligent in a wide variety of scenarios. We therefore see this effort as a grand challenge in robotics that the community can work on together. To help accelerate progress on world models for robotics, we are releasing over 100 hours of vector-quantized video (Apache 2.0), pretrained baseline models, and the 1X World Model Challenge, a three-stage challenge with cash prizes.

Active Challenges

Compression Challenge | Prize: $10,000 USD

The first challenge, compression, is about how well one can minimize training loss on an extremely diverse robot dataset. The lower the loss, the better the model understands the training data. Even though there are many different ways to implement a world model, optimizing loss well is a general objective that underpins nearly all large-scale deep learning tasks. A $10k prize is awarded to the first submission that achieves a loss of 8.0 on our private test set. The GitHub repo provides code and pretrained weights for Llama and GENIE-based world models.
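This post does not spell out how the loss is computed. Assuming it is an average next-token cross-entropy over the vector-quantized video tokens, which is a common choice for token-based models, a minimal sketch looks like the following. The function name, the codebook size of 1024, and the example tokens are all hypothetical; consult the repo for the actual metric.

```python
import numpy as np

def token_cross_entropy(logits, targets):
    """Mean cross-entropy in nats.

    logits:  (N, V) unnormalized scores over a V-entry codebook.
    targets: (N,) integer ids of the true next tokens.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab_size = 1024  # hypothetical codebook size
targets = np.array([3, 17, 512, 1000])

# A model that knows nothing predicts uniformly and scores ln(V).
uniform_loss = token_cross_entropy(np.zeros((4, vocab_size)), targets)

# A model that puts almost all mass on the right token scores near zero.
peaked = np.zeros((4, vocab_size))
peaked[np.arange(4), targets] = 20.0
peaked_loss = token_cross_entropy(peaked, targets)

print(uniform_loss, peaked_loss)
```

Under this assumption, the uniform baseline gives a quick sanity check for any new implementation before comparing against a leaderboard number.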

Coming Soon

Sampling Challenge

The second challenge, sampling, is about how well and how quickly a model can generate videos of the future. Details of the Sampling Challenge will be announced soon, based on lessons learned from running the Stage 1 Challenge.

Evaluation Challenge

The third challenge, evaluation, is our holy grail: can you predict how well a robot performs before you test it in the real world? Details of the Evaluation Challenge will be announced after we’ve learned lessons from Stage 1 and Stage 2 Challenges.

Submit solutions to: challenge@1x.tech

We’re Hiring!

If you’re excited about these directions, we have open roles on the 1X AI team. Internally, we have a large dataset of high resolution robot data across even more diverse scenarios. Our ambitions for world models go beyond just solving the general evaluation problem; once you can step an agent in this world model and perform evaluation, you can follow on with policy enhancement and policy training in a completely learned simulation.

GitHub - starter code, evals, baseline implementations
Discord - chat with our engineers
