Today's guest is **Tan Jie, Senior Research Scientist and Tech Lead on the Google DeepMind robotics team**. His research focuses on applying foundation models and deep reinforcement learning to robotics.
There have always been two narratives in the field of robotics between China and the US: the market generally believes that China develops faster in hardware, while the US leads in robotic brain design.
**In this episode, Tan Jie will offer us a glimpse into the cutting-edge narrative of robotics from a Silicon Valley perspective, especially that of Google DeepMind.**
His team recently released new work, "Gemini Robotics 1.5 brings AI agents into the physical world," and we also discuss its latest findings.
Because of the guest's work environment, the conversation mixes some Chinese and English; we ask for your understanding and support.
> **02:00 Robotics is doing graphics in the real world; graphics is doing robotics in simulation.**
Guest's brief biography: Liked playing games as a child, pursued a Ph.D. in computer graphics.
The transition from graphics to robotics.
My first paper at Google, "Sim-to-Real: Learning Agile Locomotion For Quadruped Robots," pioneered the application of reinforcement learning and sim-to-real in legged robots.
Paradigm shifts: the first in the past decade was reinforcement learning; the second was large language models.
The impact of large language models on robotics (large language models are like the cerebrum, reinforcement learning is like the cerebellum).
> **13:06 Is the robotics foundation model truly a very independent discipline? So far, not yet.**
What stage has robotics development reached today?
It is no exaggeration to say that a decade can pass between a demo and real-world deployment.
From my perspective, I have to admit that the development of robotics intelligence in recent years has mainly relied on multimodal large models.
But what do multimodal models lack? They lack the output of robot actions.
When you truly have a generalist model, specialized models simply cannot compete with it.
> **23:44 The biggest problem in Robotics is data; it's in a very complex unstructured environment where anything can happen.**
The biggest problem is still data.
But robotics operates in a very complex unstructured environment where anything can happen.
It requires an extremely large amount of very diverse data, but such data does not currently exist.
Many startups now call themselves "data factories."
What does the so-called "data pyramid" include?
> **27:52 Gemini Robotics 1.5: We have a method called motion transfer, which is our unique secret.**
What are the most important discoveries of Gemini Robotics 1.5?
First, we incorporated "thinking" into the VLA model.
The second very important breakthrough is cross-embodiment transfer.
In the Gemini Robotics 1.5 work, we made a distinction between fast and slow models.
It should be a transitional approach, as it is currently constrained by computational power and model size.
If you want a unified model, it must be very large.
Motion Transfer? It's very secret.
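One common way to read the fast/slow split mentioned above is as a two-rate loop: a slow model reasons about the task occasionally, while a fast model emits actions at every control tick. The toy sketch below illustrates only that general pattern. The names `slow_planner`, `fast_policy`, and `run_episode` are hypothetical, and nothing here is taken from the actual Gemini Robotics 1.5 design.

```python
from typing import List


def slow_planner(instruction: str) -> List[str]:
    """Hypothetical slow model: split an instruction into subgoals.

    A real system would call a large reasoning model here; this toy
    version just splits the instruction on commas.
    """
    return [step.strip() for step in instruction.split(",") if step.strip()]


def fast_policy(subgoal: str, tick: int) -> str:
    """Hypothetical fast model: emit one low-level action per control tick."""
    return f"tick {tick}: act toward '{subgoal}'"


def run_episode(instruction: str, ticks_per_subgoal: int = 3) -> List[str]:
    """Call the slow planner once, then run the fast policy at every tick."""
    log: List[str] = []
    tick = 0
    for subgoal in slow_planner(instruction):
        for _ in range(ticks_per_subgoal):
            log.append(fast_policy(subgoal, tick))
            tick += 1
    return log


if __name__ == "__main__":
    for line in run_episode("pick up the cup, place it on the shelf", 2):
        print(line)
```

In this sketch the slow model runs once per instruction and the fast model runs at every tick, which is the kind of compute trade-off that would disappear if one sufficiently large unified model could run at control rate.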
> **47:32 Generating a huge amount of simulated data is an important means to compensate for its shortcomings.**
One point we attach great importance to is data, data, data.
Teleoperation data is very difficult to acquire.
We will put more effort into using, for example, simulation data, human video, data from YouTube, and even model-generated data, such as some data generated by VEO.
Real data has no sim-to-real gap, but generalizability is determined by data coverage, not by whether the data is real or simulated.
In the near future, traditional physical simulation will gradually be replaced by generative model-based simulation.
I believe in scalable data.
> **01:03:48 A world model is Vision-Language-Vision, vision and language in, generating the next frame of images.**
The definition of a world model is: if you provide the previous frame and the robot's action, you can predict the next frame.
From another perspective, VEO is a video generation model, but Genie is more like a world model.
When you can provide an input at each frame that changes the next frame, that is a world model; an already generated, static video of a few seconds is not.
A world model is essentially Vision-Language-Vision: with vision and language as input, it generates the next frame.
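The definition above (previous frame plus action in, next frame out) can be written as a small interface. The sketch below is a toy illustration under stated assumptions: a "frame" is reduced to the (x, y) position of a single object, and `toy_world_model`, `interactive_rollout`, and `static_clip` are hypothetical names, not part of Genie, VEO, or any real library.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Tuple[int, int]   # toy "frame": (x, y) position of one object
Action = Tuple[int, int]  # toy action: a (dx, dy) displacement


def toy_world_model(prev_frame: Frame, action: Action) -> Frame:
    """Predict the next frame from the previous frame and the action."""
    return (prev_frame[0] + action[0], prev_frame[1] + action[1])


def interactive_rollout(
    model: Callable[[Frame, Action], Frame],
    first_frame: Frame,
    actions: Sequence[Action],
) -> List[Frame]:
    """World-model style: a new action is fed in at every step."""
    frames = [first_frame]
    for action in actions:
        frames.append(model(frames[-1], action))
    return frames


def static_clip(first_frame: Frame, num_frames: int) -> List[Frame]:
    """Video-generation style: the clip is fixed once generated.

    No action is accepted, so nothing can steer the frames afterwards.
    The object simply drifts one unit to the right per frame.
    """
    x, y = first_frame
    return [(x + i, y) for i in range(num_frames)]


if __name__ == "__main__":
    print(interactive_rollout(toy_world_model, (0, 0), [(1, 0), (0, 2)]))
    print(static_clip((0, 0), 3))
```

The only structural difference between the two functions is the per-step action argument, which is the distinction the guest draws between an interactive world model and a pre-generated video.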
> **01:08:29 If you have a dexterous hand, haptics become very important. The reason I previously thought haptics were not important was limited by the hardware at the time.**
If you have a dexterous hand, haptics become very important.
I previously thought haptics were not important because I was limited by the hardware of the time.
We are still in the gripper era.
For all tasks that can be accomplished by grippers, I still believe vision can solve 95% of the problems.
In the future, humanoid robots will not be the only form, but they will definitely be a mainstream form.
If your goal is to solve AGI in the physical world, then I will be very focused on what the final form looks like; other things might be distractions.
> **01:17:35 A person with a sense of mission will not tolerate saying "I'm on a wrong ship."**
Have there been any changes in Google AI or robotics research culture in recent years?
Whether it's promotion, performance review, incentive, or various structures, Google wants to create an environment where more people can work together to solve bigger problems.
Like Gemini Robotics, it's more top-down.
I found that people in China might not push as hard as I do; I might work 70 to 80 hours a week.
Seriously, this era really can't wait, otherwise others will have already done it.
A lot of AI is mathematics, and Chinese people are generally better at mathematics.
《106. Talking with Wang He about the Academic Edge History of Embodied Intelligence and the Man-made Chaos after Capital Bombardment》
《109. Robots Encountering a Data Famine? Talking with Xie Chen: Simulation and Synthetic Data, Meta's Sky-High Acquisition, and Alexandr Wang》
[More Information]
The text version of this episode has been published. Please search for our studio's official public account:
语言即世界 (language is world)
Original title:
121. 对DeepMind谭捷的访谈:机器人、跨本体、世界模型、Gemini Robotics 1.5和Google