Google's RT-2 Brings Human-Like Learning to Robots
Two Steps Forward.
Google’s latest robotics model, Robotic Transformer 2 (RT-2), is a first step toward what Google describes as “a major leap in the way robots are built and programmed.” The shift led Google to rethink its entire research program, because much of what its researchers had been working on was invalidated.
Robots still fall short of human-level dexterity and fail at some basic tasks, but Google’s use of AI language models to give robots new skills of reasoning and improvisation represents an encouraging breakthrough: it links semantics with robots. Let me put this in perspective.
For years, engineers at Google and other companies trained robots to do a mechanical task – flipping a burger, for example – by programming them with a specific list of instructions: Lower the spatula 6.5 inches, slide it forward until it encounters resistance, raise it 4.2 inches, rotate it 180 degrees, and so on. Robots would then repeat the task over and over, with engineers adjusting the instructions each time until they got it right.
This approach has limited uses, and training robots this way is slow and labor-intensive. It requires collecting lots of data from real-world tests. And if the robot needed to do something new – to flip a pancake instead of a burger – it had to be reprogrammed from square one.
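To make the brittleness concrete, here is a minimal sketch of what such a hand-scripted routine might look like. The `Step` type, the command names and the `run` helper are hypothetical, invented for illustration; only the distances and angles come from the burger example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One hard-coded motion command (hypothetical format)."""
    command: str
    amount: float
    unit: str


# The burger routine, spelled out step by step by an engineer.
FLIP_BURGER: List[Step] = [
    Step("lower_spatula", 6.5, "in"),
    Step("slide_forward_until_resistance", 0.0, "in"),
    Step("raise_spatula", 4.2, "in"),
    Step("rotate_spatula", 180.0, "deg"),
]


def run(program: List[Step]) -> List[str]:
    """Pretend to execute a program by returning the commands it would send."""
    return [f"{step.command} {step.amount} {step.unit}" for step in program]


for line in run(FLIP_BURGER):
    print(line)
```

Nothing in `FLIP_BURGER` carries over to a pancake: a new task means a new list with new numbers, tuned by hand all over again.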
What if…?
Partly because of these limitations, hardware robots have been slower to improve than software-based machines. But recently, researchers at Google had an idea. What if, instead of being programmed for specific tasks one at a time, robots could use an AI language model – one that had been trained on vast amounts of internet text – to learn new skills for themselves?
The researchers began experimenting with these language models and realized that the models already held a lot of knowledge, so they started connecting them to robots.
Google’s first attempt at marrying language models and physical robots, PaLM-SayCan, was introduced last year. It drew moderate attention, but its usefulness was limited. The robots couldn’t interpret images – an important skill if they are to navigate the world. The system could write out step-by-step instructions for different tasks, but it couldn’t turn those steps into actions.
However, Google’s new robotics model, RT-2, can do exactly that. The company calls it a “vision-language-action” (VLA) model – an AI system that not only sees and analyzes the world around it, but tells a robot how to move.
It does this by translating the robot’s movements into a series of numbers, called tokens, and incorporating those tokens into the same training data as the language model. Eventually, just as ChatGPT or Bard learns to anticipate which words should come next in a poem or a history essay, RT-2 can learn to guess how a robot’s arm should move to pick up a ball or throw an empty soda can into the recycling bin.
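The idea of turning movements into tokens can be sketched in a few lines. This is a minimal illustration, not Google's implementation: it assumes each movement value is clipped to a known range and rounded into one of 256 evenly spaced bins, and the function names, bin count and ranges are all assumptions made for the example.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of discrete levels per movement value


def encode_action(action: Sequence[float], low: Sequence[float],
                  high: Sequence[float], num_bins: int = NUM_BINS) -> str:
    """Turn continuous movement values into a string of integer tokens."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)       # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)   # rescale to 0.0 .. 1.0
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return " ".join(str(t) for t in tokens)


def decode_action(text: str, low: Sequence[float], high: Sequence[float],
                  num_bins: int = NUM_BINS) -> List[float]:
    """Map a token string back to approximate movement values (bin centers)."""
    values = []
    for token, lo, hi in zip(text.split(), low, high):
        center = (int(token) + 0.5) / num_bins
        values.append(lo + center * (hi - lo))
    return values


# Hypothetical three-value action: two arm displacements and a gripper opening.
LOW, HIGH = [-1.0, -1.0, 0.0], [1.0, 1.0, 1.0]
tokens = encode_action([0.0, -1.0, 1.0], LOW, HIGH)
print(tokens)
print(decode_action(tokens, LOW, HIGH))
```

Because the encoded action is just a short string of numbers, it can sit in the same training data as ordinary text, and a language model can learn to predict it the same way it predicts the next word. Decoding loses a little precision, at most one bin width per value.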
In other words, this model can learn to speak robot!
New Risks
In demonstrations, the robots haven’t been perfect. One incorrectly identified the flavor of a can of LaCroix placed on the table in front of it. Another time, when it was asked what kind of fruit was on a table, a robot simply answered “white.” (It was a banana.)
Granted, moving objects around in the chaotic physical world is harder than doing it in a controlled lab environment. And given that AI language models frequently make mistakes or invent nonsensical answers – researchers call that hallucination or confabulation – using them as the brains of robots would carry new risks.
Google, though, claims RT-2 is equipped with numerous safety features. In addition to a big red button on the back of every robot – which stops the robot in its tracks when pressed – the system uses sensors to avoid bumping into people or objects.
The AI software built into RT-2 has its own safeguards, which it can use to prevent the robot from doing anything harmful. One example: Google’s robots can be trained not to pick up containers with water in them, because water can damage their hardware if it spills.
Not the Jetsons…Yet!
Google has no immediate plans to sell RT-2 robots or release them more widely, but its researchers believe these new language-equipped machines will eventually be useful for more than just parlor tricks. Robots with built-in language models could be put into warehouses, used in medicine or even deployed as household assistants.
We’ll all have to wait for “Rosie the maid,” but AI and robotics continue to make amazing advances, and will someday be a part of our new normal. Until then, we’ll have to rely on our old stand-by…humans.