Robot ChatGPT Arrives: Large-scale Model Enters the Real World, DeepMind Makes Breakthrough

This article is translated from “机器之心”.

We know that after mastering language and images on the web, large models will eventually enter the real world, and "embodied intelligence" looks like the next step in their development.

Integrating large models into robots, so that plain natural language rather than complex programming can be turned into concrete action plans without additional data or training: this vision is promising but has seemed distant. After all, robotics is notoriously difficult.

However, the evolution of AI is faster than we imagine.

This Friday, Google DeepMind announced the launch of RT-2: the world’s first Vision-Language-Action (VLA) model that controls robots.

Now a robot can be directed to manipulate objects in plain language, much as one chats with ChatGPT, with no complex instructions required.

How intelligent is RT-2? DeepMind researchers demonstrated it with a robotic arm. When told to pick out the "extinct animal," the arm reached out, opened its gripper, and grabbed a dinosaur toy.

Previously, robots were unable to reliably understand objects they had never seen before, let alone make inferences such as connecting “extinct animals” to “plastic dinosaur toys.”

Tell the robot to give Taylor Swift a can of cola:

It is good news for humans that this robot is a true fan.

The development of large language models like ChatGPT is revolutionizing the field of robotics. Google has equipped robots with state-of-the-art language models, giving them an artificial brain.

In a recently submitted paper, DeepMind researchers state that RT-2 was trained on web and robot data, leveraging research progress from large language models such as Bard and combining it with robot data. The new model can also understand instructions in languages other than English.

Google executives said that RT-2 represents a major leap in how robots are built and programmed. "Because of this change, we had to rethink our entire research plan," said Vincent Vanhoucke, Head of Robotics at Google DeepMind. "Many of the things we did before have become completely useless."

How was RT-2 implemented?

DeepMind’s RT-2 is essentially a Robotic Transformer model.

Making robots that understand human language and can carry out even simple tasks in the real physical world, as in science fiction films, is no easy feat. Compared with virtual environments, the physical world is complex and messy, and robots typically need elaborate instructions to perform tasks that are simple for humans, who instinctively know what to do.

Previously, training robots took a long time, and researchers had to build separate solutions for different tasks. With RT-2's capabilities, a robot can analyze more information on its own and infer what to do next.

RT-2 is built on the foundation of vision-language models (VLMs) and introduces a new concept: the vision-language-action (VLA) model. It learns from web and robot data and transforms that knowledge into generalized instructions for robot control. The model can even reason from prompts, for example inferring which drink would best suit a tired person (an energy drink).

In fact, Google introduced the RT-1 version of the robot last year: a single pre-trained model that could take different sensory inputs, such as vision and text, and generate action instructions to perform a variety of tasks.

Building a good pre-trained model requires a large amount of data for self-supervised learning. RT-2 builds on RT-1 and reuses RT-1's demonstration data, which was collected by 13 robots in office and kitchen environments over a period of 17 months.

On this foundation, DeepMind created the VLA model.

As mentioned earlier, RT-2 builds on VLMs that have already been trained on web-scale data and can perform tasks such as visual question answering, image captioning, and object recognition. The researchers adapted two previously proposed VLMs, PaLI-X (Pathways Language and Image model) and PaLM-E (Pathways Language model Embodied), to serve as the backbone of RT-2. The vision-language-action versions of these models are called RT-2-PaLI-X and RT-2-PaLM-E.

To get the vision-language model to control a robot, one final step is needed: action control. The researchers took a remarkably simple approach: they represented robot actions as another language, namely text tokens, and trained on them together with web-scale vision-language datasets.

The encoding of robot actions is based on the discretization method proposed by Brohan et al. for the RT-1 model.

As shown in the figure below, the researchers represented robot actions as text strings, which can be sequences of token numbers representing robot actions, such as “1 128 91 241 5 101 127 217”.

The string starts with a flag indicating whether the robot should continue or terminate the current episode. The following tokens then command changes to the end effector's position and rotation, as well as actions such as grasping objects.

Since actions are represented as text strings, issuing an action command to the robot is as simple as issuing a string command. With this representation, an existing vision-language model can be directly fine-tuned into a vision-language-action model.
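The tokenization described above can be sketched in a few lines. This is a minimal illustration, not DeepMind's implementation: the bin count of 256 follows the RT-1 discretization cited above, while the normalized action range and the 7-dimensional action layout (position delta, rotation delta, gripper) are assumptions for the example.

```python
# Sketch of RT-2-style action tokenization: continuous robot actions are
# discretized into integer bins and written out as a text-token string.
import numpy as np

N_BINS = 256                      # per RT-1's discretization
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action range

def encode_action(terminate: int, deltas: np.ndarray) -> str:
    """Map a continuous action to a token string like '0 140 12 ...'.

    `deltas` holds 7 values: end-effector position change (x, y, z),
    rotation change (roll, pitch, yaw), and gripper extension.
    """
    clipped = np.clip(deltas, ACTION_LOW, ACTION_HIGH)
    bins = np.round(
        (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (N_BINS - 1)
    ).astype(int)
    return " ".join(map(str, [terminate, *bins]))

def decode_action(tokens: str):
    """Invert the encoding: token string -> (terminate flag, continuous deltas)."""
    parts = [int(t) for t in tokens.split()]
    terminate, bins = parts[0], np.array(parts[1:])
    deltas = bins / (N_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW
    return terminate, deltas

action = np.array([0.1, -0.2, 0.0, 0.05, 0.0, 0.0, 1.0])
tokens = encode_action(0, action)          # e.g. "0 140 102 128 134 128 128 255"
flag, recovered = decode_action(tokens)    # round-trips up to quantization error
```

Because the decoder is the exact inverse of the encoder, the reconstruction error is bounded by half a bin width, which is what makes such a coarse 256-level discretization workable in practice.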

During the inference process, text tokens are decomposed into robot actions, enabling closed-loop control.
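The closed-loop control described above can be sketched as follows. The `FakeVLA` and `FakeRobot` classes are hypothetical stand-ins for the model and hardware, not a real DeepMind API; the point is only the loop structure: observe, generate action tokens, decode, execute, repeat until the terminate flag appears.

```python
# Minimal sketch of closed-loop VLA control: each step feeds the current
# observation to the model, decodes the emitted token string into an action,
# and executes it, until the leading terminate flag is 1.
def run_episode(vla_model, robot, instruction: str, max_steps: int = 100) -> int:
    steps = 0
    for _ in range(max_steps):
        image = robot.get_camera_image()                    # current observation
        token_str = vla_model.generate(image, instruction)  # e.g. "0 128 91 241 5 101 127 217"
        tokens = [int(t) for t in token_str.split()]
        terminate, action_bins = tokens[0], tokens[1:]
        if terminate:                                       # episode is done
            break
        robot.apply_action(action_bins)                     # move end effector / gripper
        steps += 1
    return steps

# Hypothetical stubs so the loop runs end to end.
class FakeRobot:
    def get_camera_image(self):
        return "image"
    def apply_action(self, bins):
        self.last_action = bins

class FakeVLA:
    def __init__(self):
        self.step = 0
    def generate(self, image, instruction):
        self.step += 1
        flag = 1 if self.step >= 3 else 0   # terminate on the third call
        return f"{flag} 128 91 241 5 101 127 217"

executed = run_episode(FakeVLA(), FakeRobot(), "pick up the cola")  # -> 2
```

Re-running the model on every new camera frame, rather than planning a whole trajectory once, is what makes the control loop closed: the robot can react if an object moves mid-episode.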


The researchers conducted a series of qualitative and quantitative experiments on the RT-2 model.

The figure below shows the performance of RT-2 in semantic understanding and basic reasoning. For example, for the task of “putting strawberries in the correct bowl,” RT-2 not only needs to understand the representation of strawberries and bowls but also needs to reason in the context of the scene to know that strawberries should be placed with similar fruits. For the task of “picking up a bag about to fall off the table,” RT-2 needs to understand the physical properties of the bag to eliminate ambiguity between two bags and identify objects in an unstable position.

It is worth noting that none of the interactions tested in these scenarios appeared in the robot's training data.

The figure below shows that the RT-2 model outperforms the previous RT-1 and visual pre-training (VC-1) baselines on four benchmark tests.

RT-2 retains the robot's performance on its original tasks and improves its performance in previously unseen scenarios, from RT-1's 32% to 62%.

The results indicate that the visual-language model (VLM) can be transformed into a powerful visual-language-action (VLA) model. By combining VLM pre-training with robot data, robots can be directly controlled.

As with ChatGPT, if such capabilities are widely applied, the world is likely to change significantly. Google has no immediate plans to deploy RT-2 robots, but the researchers believe that robots able to understand human language will not stop at merely demonstrating their abilities.

Just imagine, robots with built-in language models can be placed in warehouses, help you fetch medicine, and even serve as home assistants—folding clothes, retrieving items from the dishwasher, and tidying up around the house.

It may truly open the door to using robots in human environments, taking over the physical-labor tasks that, as predicted in OpenAI's report, were left untouched by large models' impact on jobs.
