1X World Model

4 min read
1X World Model
September 17, 2024

In machine learning, a world model is a computer program that can imagine how the world evolves in response to an agent’s behavior. Building on advancements in video generation and world models for autonomous vehicles, we have trained a world model that serves as a virtual simulator for our robots.

From the same starting image sequence, our world model can imagine multiple futures from different robot action proposals.
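As a minimal sketch of what "multiple futures from different action proposals" means, the toy code below rolls one shared starting history forward under two different action sequences. Everything here is hypothetical: `predict_next`, `rollout`, and the 2D "state" are stand-ins for illustration only, whereas the actual model predicts video.

```python
from typing import Dict, List, Sequence, Tuple

State = Tuple[float, float]   # toy stand-in for a video frame
Action = Tuple[float, float]  # toy stand-in for a robot action


def predict_next(history: Sequence[State], action: Action) -> State:
    """Hypothetical one-step predictor: shifts the latest state by the action."""
    x, y = history[-1]
    dx, dy = action
    return (x + dx, y + dy)


def rollout(initial: Sequence[State], actions: Sequence[Action]) -> List[State]:
    """Autoregressively predict one future state per action."""
    history = list(initial)
    for action in actions:
        history.append(predict_next(history, action))
    return history[len(initial):]  # only the imagined future


# The same starting sequence is reused for every action proposal.
initial_frames: List[State] = [(0.0, 0.0), (0.0, 1.0)]
proposals: Dict[str, List[Action]] = {
    "left": [(-1.0, 0.0)] * 3,
    "right": [(1.0, 0.0)] * 3,
}
futures = {name: rollout(initial_frames, acts) for name, acts in proposals.items()}
```

The two entries in `futures` diverge only because the actions differ; the starting frames are identical.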

It can also predict non-trivial object interactions like rigid bodies, effects of dropping objects, partial observability, deformable objects (curtains, laundry), and articulated objects (doors, drawers, curtains, chairs).

In this post we’ll share why world models for robots are important, the capabilities and limitations of our current models, and a new dataset and public competition to encourage more research in this direction.

The Robotics Problem

World models solve a very practical and yet often overlooked challenge when building general-purpose robots: evaluation. If you train a robot to perform 1000 unique tasks, it is very hard to know whether a new model has made the robot better at all 1000 tasks, compared to a prior model. Even the same model weights can experience a rapid degradation in performance in a matter of days due to subtle changes in the environment background or ambient lighting.

An example T-shirt folding model we trained that degrades in performance over the course of 50 days.

If the environment keeps changing over time, then old experiments performed in that environment are no longer reproducible because the old environment no longer exists! This problem gets worse if you are evaluating multi-task systems in a constantly-changing setting like the home or the office. This makes careful robotic science in the real world frustratingly hard.

Careful measurement of capabilities allows one to predict how capabilities will scale when one increases data, compute, and model size – these “scaling laws” defend the enormous investment that goes into general-purpose AI systems like ChatGPT. If robotics is to have its “ChatGPT moment”, we must first establish its “Scaling Laws”.
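To illustrate the kind of extrapolation a scaling law enables, here is a small sketch that fits a power law of the form loss = a * compute^(-b) by least squares in log-log space. The data points are synthetic and made up for the example, and the pure power-law form (with no irreducible-loss term) is a simplifying assumption, not a measured result for any robot model.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic measurements generated from a known power law (illustrative only).
compute = [1e1, 1e2, 1e3, 1e4]
loss = [5.0 * c ** -0.1 for c in compute]

a, b = fit_power_law(compute, loss)
predicted_loss_at_1e5 = a * 1e5 ** -b  # extrapolate one decade further
```

The fit is only as trustworthy as the measurements behind it, which is why noisy or irreproducible evaluation makes this kind of prediction hard.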

Other Ways To Evaluate

Physics-based simulators (Bullet, MuJoCo, Isaac Sim, Drake) are a reasonable way to quickly test robot policies. They are resettable and reproducible, allowing researchers to carefully compare different control algorithms. However, these simulators are mostly designed for rigid body dynamics and require a lot of manual asset authoring. How would one simulate robot hands opening a cardboard box of coffee filters, cutting fruit with a knife, unscrewing a frozen jar of preserves, or interacting with other intelligent agents like humans? Everyday objects and animals encountered in home environments are notoriously difficult to simulate, so simulation environments used in robotics tend to be visually sterile and lack the diversity of the real-world use case. Small-scale evaluation on a limited number of tasks in real or sim is not predictive of large-scale evaluation in the real world.

World Models

We’re taking a radically new approach to evaluation of general-purpose robots: learning a simulator directly from raw sensor data and using it to evaluate our policies across millions of scenarios. By learning a simulator directly from real data, you can absorb the full complexity of the real world without manual asset creation.
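The evaluation idea can be sketched as follows: replay the exact same set of starting scenarios through a simulator for two policies and compare success rates. This is a toy illustration under stated assumptions; `toy_world_model`, the scalar observation, and the success criterion are all hypothetical placeholders for a learned video simulator and real task metrics.

```python
from typing import Callable, List

Policy = Callable[[float], float]  # maps an observation to an action


def toy_world_model(obs: float, action: float) -> float:
    """Hypothetical stand-in for a learned simulator's one-step prediction."""
    return obs + action


def evaluate(policy: Policy, starts: List[float], goal: float,
             horizon: int = 10, tol: float = 0.1) -> float:
    """Fraction of starting scenarios where the policy ends within tol of goal."""
    successes = 0
    for obs in starts:
        for _ in range(horizon):
            obs = toy_world_model(obs, policy(obs))
        if abs(obs - goal) <= tol:
            successes += 1
    return successes / len(starts)


goal = 1.0
starts = [-2.0, -1.0, 0.0, 2.0, 3.0]  # identical scenarios for both policies


def old_policy(obs: float) -> float:
    return 0.1 * (goal - obs)  # closes 10% of the remaining error per step


def new_policy(obs: float) -> float:
    return 0.5 * (goal - obs)  # closes 50% of the remaining error per step


old_score = evaluate(old_policy, starts, goal)
new_score = evaluate(new_policy, starts, goal)
```

Because the scenarios are stored rather than re-staged in a physical room, the comparison between the two policies stays reproducible even if the real environment later changes.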

Over the last year, we’ve gathered thousands of hours of data on EVE humanoids doing diverse mobile manipulation tasks in homes and offices and interacting with people. We combined the video and action data to train a world model that can anticipate future video from observations and actions.

Action Controllability

Our world model is capable of generating diverse outcomes based on different action commands. Below we show various generations conditioning the world model on four different trajectories, each of which starts from the same initial frames. As before, the examples shown are not included during training.

Left Door Trajectory
Right Door Trajectory
Play the Air Guitar

The main value of the world model comes from simulating object interactions. In the following generations, we provide the model the same initial frames and three different sets of actions to grasp boxes. In each scenario, the box(es) grasped are lifted and moved in accordance with the motion of the gripper, while the other boxes remain undisturbed.

Even when actions are not provided, the world model generates plausible video, such as learning that people and obstacles should be avoided when driving:

Long-Horizon Tasks

We can also generate long-horizon videos. The example below simulates a complete T-shirt folding demonstration. T-shirts and other deformable objects tend to be difficult to implement in rigid body simulators.

Current Failure Modes

Object Coherence

Our model can fail to maintain the shape and color of objects during interaction, and at times, objects may completely disappear. Additionally, when objects are occluded or displayed at unfavorable angles, their appearance can become distorted throughout the generation.

Laws of Physics

The generation on the left demonstrates that our model has an emergent understanding of physical properties, as evidenced by the spoon falling to the table when released by the gripper. However, there are many instances where generations fail to adhere to physical laws, such as on the right where the plate remains suspended in the air.

Self-recognition

We placed EVE in front of a mirror to see if generations would result in mirrored actions, but we did not see successful recognition or “self-understanding.”

World Model Challenge

As shown by the examples above, there is still much work to be done. World models have the potential to solve general purpose simulation and evaluation, enabling robots that are safe, reliable, and intelligent in a wide variety of scenarios. As such, we see this effort as a grand challenge in robotics that the community can work on solving together. To help accelerate progress towards solving world models for robotics, we are releasing over 100 hours of vector-quantized video (Apache 2.0), pretrained baseline models, and the 1X World Model Challenge, a three-stage challenge with cash prizes.

Active Challenges

Compression Challenge | Prize: $10,000 USD

The first challenge, compression, is about how well one can minimize training loss on an extremely diverse robot dataset. The lower the loss, the better the model understands the training data. Even though there are many different ways to implement a world model, optimizing loss well is a general objective that underpins nearly all large-scale deep learning tasks. A $10k prize is awarded to the first submission that achieves a loss of 8.0 on our private test set. The GitHub repo provides code and pretrained weights for Llama and GENIE-based world models.
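For intuition about what a token-level loss measures, the sketch below computes a mean cross-entropy over predicted token distributions. It assumes the loss is a per-token negative log-likelihood over discrete video tokens; the exact loss definition, units, and vocabulary used by the challenge are specified in its repository and are not reproduced here. The 4-token vocabulary and probabilities are made up.

```python
import math


def cross_entropy(probs_per_step, targets):
    """Mean negative log-likelihood (in nats) of the target token ids."""
    total = sum(math.log(p[t]) for p, t in zip(probs_per_step, targets))
    return -total / len(targets)


vocab_size = 4          # toy vocabulary, illustrative only
targets = [2, 0, 3]     # "true" next tokens

# A model that knows nothing spreads probability uniformly.
uniform = [[1.0 / vocab_size] * vocab_size for _ in targets]
uniform_loss = cross_entropy(uniform, targets)  # equals ln(vocab_size)

# A model that puts most of its mass on the right token scores lower.
confident = [
    [0.1, 0.1, 0.7, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
confident_loss = cross_entropy(confident, targets)
```

A uniform guess over V tokens gives a loss of ln(V), so any lower value reflects structure the model has actually learned about the data.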

Coming Soon

Sampling Challenge

The second challenge, sampling, is about how well and how quickly a model can generate videos of the future. Details of the Sampling Challenge will be announced soon, based on lessons learned from running the Stage 1 Challenge.

Evaluation Challenge

The third challenge, evaluation, is our holy grail: can you predict how well a robot performs before you test it in the real world? Details of the Evaluation Challenge will be announced after we’ve learned lessons from Stage 1 and Stage 2 Challenges.

Submit solutions to: challenge@1x.tech

We’re Hiring!

If you’re excited about these directions, we have open roles on the 1X AI team. Internally, we have a large dataset of high resolution robot data across even more diverse scenarios. Our ambitions for world models go beyond just solving the general evaluation problem; once you can step an agent in this world model and perform evaluation, you can follow on with policy enhancement and policy training in a completely learned simulation.

GitHub - starter code, evals, baseline implementations
Discord - chat with our engineers

2 min read
AI Update: Voice Commands & Chaining Tasks
May 31, 2024

We have previously developed an autonomous model that can merge many tasks into a single goal-conditioned neural network. However, when multi-task models are small (<100M parameters), adding data to fix one task’s behavior often adversely affects behaviors on other tasks. Increasing the model parameter count can mitigate this forgetting problem, but larger models also take longer to train, which slows down our ability to find out what demonstrations we should gather to improve robot behavior.

How do we iterate quickly on the data while building a generalist robot that can do many tasks with a single neural network? We want to decouple our ability to quickly improve task performance from our ability to merge multiple capabilities into a single neural network. To accomplish this, we’ve built a voice-controlled natural language interface to chain short-horizon capabilities across multiple small models into longer ones. With humans directing the skill chaining, this allows us to accomplish the long-horizon behaviors shown in this video:

Although humans can do long horizon chores trivially, chaining multiple autonomous robot skills in a sequence is hard because the second skill has to generalize to all the slightly random starting positions that the robot finds itself in when the first skill finishes. This compounds with every successive skill - the third skill has to handle the variation in outcomes of the second skill, and so forth.
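A quick back-of-the-envelope calculation shows how this compounding plays out. The sketch assumes each skill succeeds independently with a fixed probability, which is a simplification (in practice a poor hand-off makes later skills more likely to fail), and the 90% figure is made up for illustration.

```python
import math


def chain_success(per_skill_success):
    """Probability that every skill in the chain succeeds, assuming independence."""
    return math.prod(per_skill_success)


three_skills = chain_success([0.9] * 3)  # about 0.73
five_skills = chain_success([0.9] * 5)   # about 0.59
```

Even with individually reliable skills, the end-to-end success rate drops quickly as the chain grows.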

From the user perspective, the robot is capable of doing many natural language tasks and the actual number of models controlling the robot is abstracted away. This allows us to merge the single-task models into goal-conditioned models over time. Single-task models also provide a good baseline to do shadow mode evaluations: comparing how a new model’s predictions differ from an existing baseline at test-time. Once the goal-conditioned model matches single-task model predictions well, we can switch over to a more powerful, unified model with no change to the user workflow.
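A shadow mode comparison can be sketched as running both models on the same observations and measuring how far their predicted actions diverge, while only the baseline's actions would be executed on the robot. The models, the distance metric, and the switch-over threshold below are hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]  # observation -> action vector


def shadow_divergence(baseline: Model, candidate: Model,
                      observations: Sequence[Sequence[float]]) -> float:
    """Mean Euclidean distance between the two models' predicted actions."""
    total = 0.0
    for obs in observations:
        total += math.dist(baseline(obs), candidate(obs))
    return total / len(observations)


def single_task_model(obs: Sequence[float]) -> List[float]:
    return [obs[0], obs[1]]  # toy baseline


def goal_conditioned_model(obs: Sequence[float]) -> List[float]:
    return [obs[0] + 0.3, obs[1] + 0.4]  # toy candidate with a constant offset


logged_observations = [[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]]
divergence = shadow_divergence(
    single_task_model, goal_conditioned_model, logged_observations
)

SWITCH_THRESHOLD = 0.1  # hypothetical tolerance for "matches well"
ready_to_switch = divergence < SWITCH_THRESHOLD
```

Here the candidate is off by a constant (0.3, 0.4), so the divergence is 0.5 and the hypothetical check says it is not yet ready to replace the baseline.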

Directing robots with this high-level language interface offers a new user experience for data collection. Instead of using VR to control a single robot, an operator can direct multiple robots with high level language and let the low-level policies execute low-level actions to realize those high-level goals. Because high-level actions are sent infrequently, operators can even control robots remotely, as shown below:

Note that the above video is not completely autonomous; humans are dictating when robots should switch tasks. Naturally, the next step after building a dataset of vision-to-natural language command pairs is to automate the prediction of high level actions using vision-language models like GPT-4o, VILA, and Gemini Vision.

Stay tuned! 
Eric Jang

Less than 1 min read
Podcast: 1X CEO, Bernt Børnich on the Venture Europe Podcast
May 2, 2024

In the latest episode of the Venture Europe Podcast, Bernt Børnich, CEO of 1X, sits down with host Calin Fabri to explore the evolving world of humanoid robotics.

Bernt shares his journey from a curious child dismantling kitchen gadgets to founding and leading 1X. He gives insight into the development of NEO, 1X’s next-generation android designed to assist with everyday tasks at home. He discusses the importance of designing safe, compliant humanoids capable of working alongside people in their daily environments. 

Bernt also discusses 1X's strategic expansion, with AI development centered in the San Francisco Bay Area and a new manufacturing facility built in Norway.

Throughout the episode, he explores the technical and ethical challenges of integrating androids into society, aiming to create an abundant supply of labor.

Listen on Apple Podcast

Listen on Google Podcast

Listen on Amazon Music

2 min read
Scaling NEO Production: 1X builds in-house manufacturing facility
April 2, 2024

MOSS, NORWAY: 1X is currently developing its own production, actuator manufacturing, and robot assembly facility in Moss, Norway, right next to our campus and engineering team. This decision is more than just a matter of convenience: it's a commitment to keep building a vertically integrated company where every component of EVE and NEO is designed and produced in-house.

“The close proximity of both the actuator manufacturing, robot assembly, and testing site offers great advantages, especially for our team of creative engineers, brimming with fresh, yet untested ideas. Being adjacent to the manufacturing and assembly process allows them to quickly understand the practical aspects of transforming their creative concepts into feasible, efficient-to-manufacture products,” says VP of Manufacturing Operations & Engineering, Csaba Hartmann.

The manufacturing team consists of diverse professionals, including specialized manufacturing engineers and mechanical designers, process engineers, automation experts, quality engineers, supply chain experts, safety officers, and others. Each member plays a role in designing, trialing, and rolling out our large-scale manufacturing initiatives, contributing to enhancing scalability, rapid iterations, and safety at every stage of the manufacturing and assembly process. 

“Enabling teams that work side by side with each other and thus can easily get and act on feedback, is crucial for us to evolve and improve our products rapidly”, says Hartmann.

All 1X androids are designed with a safety-first mindset, featuring gearless motors and a soft exterior. Our commitment to safety extends beyond design, incorporating measures throughout the assembly process to ensure products are built to specs: thorough testing, quality control, and precise assembly processes.

We’re adopting quality control measures inspired by the automotive industry. We conduct thorough Design Failure Mode and Effects Analysis (DFMEA) on each assembly component to proactively identify and mitigate potential safety risks. 

“Our quality team interprets the results of the DFMEA and PFMEA and then defines the rigorous checks for the assembly process to ensure no safety aspect is overlooked,” says Hartmann. 

The assembly process includes rigorous checks of critical quality parameters to ensure no safety aspect is overlooked. Precision in the use of testing and assembly tools is emphasized to maintain high standards of accuracy. All components, especially motors, undergo extensive testing at multiple stages of assembly to validate their performance and reliability.

"At 1X, we prioritize scalable, cost-efficient manufacturing by integrating engineering expertise and rigorous quality control. Our approach leverages advanced technologies and carefully selected materials to enhance production efficiency. Committed to scalability, we ensure every process is optimized for cost-effectiveness and growth", says 1X CEO Bernt Børnich.

Join us

If you find this work interesting, we’d like to call attention to a few roles that we are hiring for to accelerate our mission toward creating an abundant supply of labor via safe intelligent androids:

We also have other open roles across mechanical, electrical, and software disciplines. Follow 1x_tech on X for more updates, and join us in living in the future.

CNN: Decoding humanoid robots
March 18, 2024
Less than 1 min read
1X Attends NVIDIA GTC
March 12, 2024

1X will be attending the NVIDIA GTC Conference on March 18th. Our involvement reflects 1X's dedication to advancing the field of embodied AI, showcasing our latest developments, and engaging with the global AI community.

The NVIDIA GTC Conference is renowned for being a pivotal event that gathers innovators, researchers, and industry leaders worldwide to explore the latest advancements in AI, machine learning, and related technologies. Attendees can look forward to a program full of insightful talks, dynamic workshops, and demonstrations.

For more information about the conference or to register:
NVIDIA GTC Conference Official Page
Conference Program

We look forward to connecting with professionals to share our passion for AI and robotics at the event. See you at NVIDIA GTC.

IEEE: What’s going on behind the scenes with 1X’s end-to-end autonomy
February 12, 2024
1 min read
AI Update: All Neural Networks. All Autonomous. All 1X Speed.
February 8, 2024

1X's mission is to provide an abundant supply of physical labor via safe, intelligent androids. Our environments are designed for humans, so we design our hardware to take after the human form for maximum generality. To make the best use of this general-purpose hardware, we also pursue the maximally general approach to autonomy: learning motor behaviors end-to-end from vision using neural networks.

We deployed this system on EVE for patrolling tasks in 2023, and are now excited to share some of the new capabilities our androids have learned purely end-to-end from data:

Every behavior you see in the above video is controlled by a single vision-based neural network that emits actions at 10Hz. The neural network consumes images and emits actions to control the driving, the arms, gripper, torso, and head. The video contains no teleoperation, no computer graphics, no cuts, no video speedups, no scripted trajectory playback. It's all controlled via neural networks, all autonomous, all 1X speed.
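The image-in, action-out loop described above can be sketched as a fixed-rate control loop. This is an illustrative skeleton only: `run_control_loop`, `get_image`, `send_action`, and the action dictionary are hypothetical names, and the stand-in policy is a trivial function rather than a neural network.

```python
import time


def run_control_loop(policy, get_image, send_action, rate_hz=10.0, num_steps=5):
    """Query the policy once per tick and hold a fixed control rate."""
    period = 1.0 / rate_hz
    next_tick = time.monotonic()
    for _ in range(num_steps):
        action = policy(get_image())  # image in, action out
        send_action(action)
        next_tick += period
        # Sleep only for the time remaining in this tick, if any.
        time.sleep(max(0.0, next_tick - time.monotonic()))


sent_actions = []
start = time.monotonic()
run_control_loop(
    policy=lambda image: {"drive": 0.0, "gripper": 0.0, "frame": image},
    get_image=lambda: 0,             # placeholder for a camera read
    send_action=sent_actions.append, # placeholder for the robot interface
    rate_hz=10.0,
    num_steps=3,
)
elapsed = time.monotonic() - start
```

Scheduling against `next_tick` rather than sleeping a fixed period keeps the rate from drifting when the policy's inference time varies.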

To train the ML models that generate these behaviors, we have assembled a high-quality, diverse dataset of demonstrations across 30 EVE robots. We use that data to train a “base model” that understands a broad set of physical behaviors, from cleaning to tidying homes to picking up objects to interacting socially with humans and other robots. We then fine-tuned that model into a more specific family of capabilities (e.g. a model for general door manipulation and another for warehouse tasks) and then fine-tuned those models further to align the behavior with solving specific tasks (e.g. open this specific door). This strategy allows us to onboard new skills in just a few minutes of data collection and training on a desktop GPU.

All of the capabilities shown in the video were trained by our android operators. They represent a new generation of "Software 2.0 Engineers" who express robot capabilities through data instead of writing code. Our ability to teach our robots short mobile manipulation skills is no longer constrained by the number of AI engineers, so this creates a lot of flexibility in what our androids can do for our customers.

Join Us!

If you find this work interesting, we’d like to call attention to two roles that we are hiring for to accelerate our mission toward general-purpose physically embodied intelligence:

Over the last year we’ve built out a data engine for solving general-purpose mobile manipulation tasks in a completely end-to-end manner. We’ve convinced ourselves that it works, so now we're hiring AI researchers in the SF Bay Area to scale it up to 10x as many robots and teleoperators. We're looking for experts in imitation learning, reinforcement learning, large-scale training, and skills relevant to scaling up deployments of autonomous vehicles. You'll be working in a fast-paced team of generalists that ship features to our fleet on a 24-hour release cycle. The work is a mix of pioneering new learning algorithms and fixing speed bottlenecks in our data flywheel. We are relentless in simplifying algorithms and infrastructure as much as possible. 

We're also hiring android operators in both our Oslo and Mountain View offices to collect data, train models with that data, and evaluate those models. Unlike most data collection jobs, our teleoperators are empowered to train their own models to automate their own tasks and think deeply about how data maps to learned robot behavior. If you want to experience what it is like to live in a real-life "Westworld", we'd love for you to apply.

We also have other open roles across mechanical, electrical, and software disciplines that make the foundation possible to ship all of this cutting-edge ML technology. Follow 1x_tech on X for more updates, and join us in living in the future.

TechCrunch: OpenAI-backed 1X raises another $100M for the race to humanoid robots
January 12, 2024

NEO Featured in NVIDIA GTC Keynote

1X on Social Media

1X
@1x_tech
We're proud of NEO, not just for its specifications but because it's not an industrial machine. It's lightweight, low-energy, soft, compliant, and safe among people. 1X designs our robots differently so they can work with us.
Bernt Øivind Børnich
@BerntBornich
Narrow wedge approaches to LLMs never worked and neither will they for humanoids, that's why safety and cost is king. Maximize the width of your data distribution and train on your test set when you can.
1X
@1x_tech
By 2030, 85 million jobs could be unfilled. That’s more than 3x the population of Scandinavia. If humanity is going to keep progressing, humans need support.
Eric Jang
@ericjang11

My talk at UPenn @GRASPlab is a summary of my worldviews in AI and robotics: humanoid form factor, consumer over enterprise, end2end deep learning, farm2table data. If this roadmap excites you, we're hiring on the 1X AI team! http://1x.tech/careers

1X
@1x_tech
1X’s mission is to create an abundant supply of physical labor through androids that work alongside humans. We're excited to share our latest progress on teaching EVEs general-purpose skills. The following is all autonomous, all 1X speed, all controlled with a single set of neural network weights.

A selection of our open positions

Senior Mechanical Engineer, Hands
Moss, Norway
Embedded Firmware Engineer, Generalist
Moss, Norway
Head of Industrial Automation
Moss, Norway
Production Solutions Engineer (Tooling Engineer)
Moss, Norway
Supply Chain and Production Planner
Moss, Norway