- Google Research and Alphabet-owned Everyday Robots combine what they call ‘SayCan’ (language models grounded in real-world, pre-trained skills) with PaLM, the Pathways Language Model.
- In their paper ‘Do As I Can, Not As I Say,’ Google researchers explain how they structure the robot’s planning to select one of its ‘skills’ in response to a high-level human instruction, then score how likely each candidate skill is to complete that instruction.
Google Research and Alphabet-owned Everyday Robots integrate SayCan (language models grounded in real-world, pre-trained skills) with PaLM, or Pathways Language Model, Google’s largest language model. Researchers at Everyday Robots are using large language models to help robots avoid misinterpreting human communication in ways that could result in inappropriate or even dangerous actions.
This combination, known as PaLM-SayCan, demonstrates a way forward for simplifying human-robot communication and enhancing robotic task performance.
Vincent Vanhoucke, distinguished scientist and head of robotics at Google Research, explains, “PaLM can help the robotic system process more complex, open-ended prompts and respond to them in ways that are reasonable and sensible.”
Large language models such as OpenAI’s GPT-3 can simulate how humans use language and assist programmers with code-completion suggestions like GitHub’s Copilot, but they do not translate to the physical world in which robots may one day operate, such as a domestic setting.
On the robotics side, factory robots are rigidly programmed today. Google’s research demonstrates how humans could one day use natural language to ask a robot a question that requires the robot to comprehend the question’s context and then take an appropriate action in a given environment.
For instance, the current GPT-3 response to “I spilled my drink, can you help?” is “You could try using a vacuum cleaner.” That is potentially dangerous behavior. LaMDA, Google’s conversational or dialogue-based AI, responds, “Do you want me to find a cleaner?” while FLAN responds, “I’m sorry, I didn’t mean to spill it.”
The team from Google Research and Everyday Robots tested the PaLM-SayCan method in a kitchen environment using a robot.
Their strategy involved ‘grounding’ PaLM in the context of a robot receiving high-level commands from a human, where the robot must determine what actions are useful and what it is capable of in that environment.
Now, when a Google researcher says “I spilled my drink, can you help?” the robot responds by fetching a sponge and attempting to place the empty can in the correct recycling bin. Additional training could teach it to clean up the spill itself.
Vanhoucke describes how grounding the language model works in PaLM-SayCan.
“PaLM suggests possible approaches to a task based on language comprehension, and robot models do the same based on a skill set that is technically feasible. The combined system then cross-references the two to identify more effective and realizable robot strategies.”
In addition to facilitating human-robot communication, this strategy enhances the robot’s performance and capacity to plan and execute tasks.
In their paper titled ‘Do As I Can, Not As I Say,’ Google researchers describe how they structure a robot’s planning capabilities to identify one of its ‘skills’ based on a high-level instruction from a human, and then assess how likely each candidate skill is to fulfill the instruction.
“Practically, we structure the planning as a dialog between a user and a robot, in which a user provides the high-level instruction, e.g. ‘How would you bring me a coke can?’ and the language model responds with an explicit sequence e.g. ‘I would: 1. Find a coke can, 2. Pick up the coke can, 3. Bring it to you, 4. Done’.”
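As a rough illustration of this dialog framing, a planning prompt like the one the researchers describe could be assembled as follows. The function name and exact prompt wording are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of the dialog-style planning prompt described above.
# The real PaLM-SayCan prompt wording is not reproduced here.

def build_planning_prompt(instruction, completed_steps):
    """Format the human instruction and the steps chosen so far as a dialog,
    so a language model can propose the next step."""
    lines = [
        f"Human: {instruction}",
        "Robot: I would:",
    ]
    for i, step in enumerate(completed_steps, start=1):
        lines.append(f"{i}. {step}")
    # The language model is asked to continue from the next step number.
    lines.append(f"{len(completed_steps) + 1}.")
    return "\n".join(lines)

prompt = build_planning_prompt(
    "How would you bring me a coke can?",
    ["Find a coke can", "Pick up the coke can"],
)
print(prompt)
```

Running this prints the instruction, the two completed steps, and a dangling “3.” for the model to complete.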
“SayCan, given a high-level instruction, selects the skill to perform by combining probabilities from a language model (representing the probability that a skill is useful for the instruction) and probabilities from a value function (representing the probability of successfully executing said skill). This yields a skill that is both feasible and useful. The process repeats, appending the selected skill to the robot’s response and querying the models again, until the output step is to conclude.”
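The selection loop quoted above, multiply the language model’s usefulness score by the value function’s feasibility score, pick the best skill, append it, and repeat until the plan concludes, can be sketched as below. The skill list and both scoring functions are toy placeholders standing in for the real language model and learned value functions:

```python
# Illustrative sketch of SayCan's skill-selection loop. Hard-coded scores
# stand in for a real language model and per-skill value functions.

SKILLS = ["find a sponge", "pick up the sponge", "bring it to you", "done"]

def llm_usefulness(instruction, history, skill):
    """Stand-in for the language model: P(skill is useful | instruction, history)."""
    # Toy scoring: strongly prefer the skill that comes next in a sensible order.
    order = {s: i for i, s in enumerate(SKILLS)}
    return 1.0 if order[skill] == len(history) else 0.1

def value_function(skill):
    """Stand-in for the value function: P(robot can execute skill right now)."""
    return 0.9  # assume every skill is currently feasible

def saycan_plan(instruction, max_steps=10):
    history = []
    for _ in range(max_steps):
        # Combine usefulness and feasibility, then pick the highest-scoring skill.
        best = max(
            SKILLS,
            key=lambda s: llm_usefulness(instruction, history, s) * value_function(s),
        )
        history.append(best)
        if best == "done":
            break
    return history

plan = saycan_plan("I spilled my drink, can you help?")
print(plan)
```

With these toy scores the loop selects the skills in order and stops at “done”; in the real system the usefulness scores come from PaLM and the feasibility scores from value functions trained on the robot’s skills.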