How GPT-4 Powers Alter3: Mapping Natural Language to Robot Actions

**LLMs Revolutionizing Robot Control**

Researchers at the University of Tokyo and Alternative Machine have developed Alter3, a humanoid robot that can understand and execute natural language commands. Alter3 uses the power of large language models (LLMs), in this case GPT-4, to perform complex tasks such as taking selfies or mimicking behaviors like pretending to be a ghost. The work is part of a larger trend that combines LLMs with robotics systems, pushing the boundaries of robotics research and opening promising opportunities for the future.

**How LLMs Enable Robot Control**

Alter3 relies on GPT-4 as its backend model, which processes natural language instructions describing an action or situation. The system uses an “agentic framework”: in the planning stage, the model acts as a planner, determining the specific steps required for the desired action. These steps are then passed to a coding agent that generates the commands the robot executes. Although GPT-4 hasn’t been trained on Alter3’s programming commands, its in-context learning ability allows it to adapt to the robot’s API, mapping each step to API commands that are sent to the robot for execution.
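To make that two-stage pipeline concrete, here is a minimal sketch in Python using the OpenAI SDK. The prompt wording and the illustrative `set_axis` command are assumptions for the sake of the example; the paper’s actual prompts and Alter3’s real command set are not reproduced here.

```python
# Minimal sketch of the plan-then-code pipeline, assuming the OpenAI
# Python SDK (>= 1.0). Prompts and set_axis() are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PLANNER_PROMPT = (
    "You control a humanoid robot. Break the instruction below into a "
    "numbered list of concrete physical steps.\n\nInstruction: {instruction}"
)

CODER_PROMPT = (
    "Map each step below to robot API calls. The only available call is\n"
    "  set_axis(axis_id: int, value: int)  # value in 0..255\n"
    "Return one call per line.\n\nSteps:\n{steps}"
)

def ask_gpt4(prompt: str) -> str:
    """Send a single-turn prompt to GPT-4 and return the text reply."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def instruction_to_commands(instruction: str) -> str:
    # Stage 1: the planner agent decomposes the instruction into steps.
    steps = ask_gpt4(PLANNER_PROMPT.format(instruction=instruction))
    # Stage 2: the coding agent maps each step to executable commands,
    # relying only on the in-context API description in the prompt.
    return ask_gpt4(CODER_PROMPT.format(steps=steps))

print(instruction_to_commands("Take a selfie with your phone."))
```

The key design point is that the robot’s API is described entirely in the prompt, which is why no fine-tuning on Alter3’s command set is needed.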

**Incorporating Human Feedback and Memory**

To ensure more accurate execution, the researchers implemented a feedback system that lets humans provide corrections and refinements. If the robot’s action sequence doesn’t produce the desired behavior, a human can offer instructions like “Raise your arm a bit more.” Another GPT-4 agent reasons over the generated code, makes the necessary corrections, and returns the refined action sequence to the robot. Refined sequences can also be stored for later reuse, giving Alter3 a memory of corrected motions, and the feedback loop enables continuous learning and improvement.
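The correction agent can be sketched in the same style, building on the `ask_gpt4` and `instruction_to_commands` helpers from the previous snippet. The prompt is again an assumption rather than the authors’ wording.

```python
# Hedged sketch of the human-in-the-loop correction step; reuses the
# ask_gpt4() and instruction_to_commands() helpers defined above.
def refine_commands(commands: str, feedback: str) -> str:
    """Ask GPT-4 to revise an action sequence given verbal feedback."""
    prompt = (
        f"Here is a robot action sequence:\n{commands}\n\n"
        f'A human observer says: "{feedback}"\n'
        "Rewrite the sequence to satisfy the feedback. "
        "Return only the corrected API calls."
    )
    return ask_gpt4(prompt)

# Example: nudge a pose after watching the robot move.
commands = instruction_to_commands("Pretend to be a ghost.")
commands = refine_commands(commands, "Raise your arm a bit more.")
```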

**Versatility and Realism with GPT-4**

Alter3 has been put to the test on various tasks, including everyday actions like taking selfies and drinking tea, as well as mimicry motions such as pretending to be a ghost or a snake. The researchers found that GPT-4’s extensive knowledge about human behaviors enables Alter3 to create more realistic behavior plans. Moreover, the model can infer emotions from text and reflect them in Alter3’s physical responses, even when emotional expressions are not explicitly stated. This adds a new level of depth and authenticity to humanoid robot interactions.
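As a rough illustration of how emotion inference could slot into such a pipeline, the hypothetical helper below (again reusing `ask_gpt4`) asks GPT-4 for an emotion label that downstream code could use to bias posture or motion speed. The label set and prompt are assumptions, not the paper’s implementation.

```python
# Illustrative only: infer an emotion label from the instruction so that
# downstream code could adjust the generated motion accordingly.
def infer_emotion(instruction: str) -> str:
    prompt = (
        "Name the dominant emotion implied by this text as one word "
        "(e.g. joy, sadness, fear, anger, neutral):\n\n" + instruction
    )
    return ask_gpt4(prompt).strip().lower()

# The emotion is implied, not stated, yet GPT-4 can still surface it.
print(infer_emotion("I dropped my ice cream on the sidewalk."))
```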

**The Growing Importance of Foundation Models**

The use of foundation models, like GPT-4, is gaining popularity in robotics research. For instance, Figure, a company valued at $2.6 billion, leverages OpenAI models to understand human instructions and carry out real-world actions. As foundation models become more multimodal, robotics systems will become better equipped to reason about their environment and make informed decisions. Alter3 is a prime example of using off-the-shelf foundation models as reasoning and planning modules in robotics control systems. It doesn’t require a fine-tuned version of GPT-4 and can be adapted for other humanoid robots.

**Challenges in Robot Development**

While LLMs offer immense potential, there are still challenges in developing robots that can perform basic tasks like object grasping, maintaining balance, and navigating their surroundings. These fundamental skills are essential for robots to interact effectively with the world, but they often require additional work beyond what LLMs can handle. Chris Paxton, an AI and robotics research scientist, highlights the importance of these tasks and the need for more data to address them effectively.

In conclusion, the integration of LLMs with robotics systems, as demonstrated by Alter3, represents a significant advancement in the field of robotics. By leveraging the power of language models like GPT-4, robots can understand and execute complex instructions, mimic human behaviors, and even infer emotions. While there are still challenges to overcome in creating robots with basic physical capabilities, the combination of LLMs and robotics systems holds great promise for the future of human-robot interactions.