RT-H: Action Hierarchies Using Language
arXiv (2024)
Abstract
Language provides a way to break down complex concepts into digestible
pieces. Recent works in robot imitation learning use language-conditioned
policies that predict actions given visual observations and the high-level task
specified in language. These methods leverage the structure of natural language
to share data between semantically similar tasks (e.g., "pick coke can" and
"pick an apple") in multi-task datasets. However, as tasks become more
semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data
between tasks becomes harder, so learning to map high-level tasks to actions
requires much more demonstration data. To bridge tasks and actions, our insight
is to teach the robot the language of actions, describing low-level motions
with more fine-grained phrases like "move arm forward". Predicting these
language motions as an intermediate step between tasks and actions forces the
policy to learn the shared structure of low-level motions across seemingly
disparate tasks. Furthermore, a policy that is conditioned on language motions
can easily be corrected during execution through human-specified language
motions. This enables a new paradigm for flexible policies that can learn from
human intervention in language. Our method RT-H builds an action hierarchy
using language motions: it first learns to predict language motions, and
conditioned on this and the high-level task, it predicts actions, using visual
context at all stages. We show that RT-H leverages this language-action
hierarchy to learn policies that are more robust and flexible by effectively
tapping into multi-task datasets. We show that these policies not only allow
for responding to language interventions, but can also learn from such
interventions and outperform methods that learn from teleoperated
interventions. Our website and videos are found at
https://rt-hierarchy.github.io.
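
The two-stage hierarchy described in the abstract (first predict a language motion, then predict an action conditioned on that motion and the task, with the option for a human to override the motion) can be illustrated with a small sketch. This is not the RT-H model. The class name `HierarchicalPolicy`, the two query callables, the toy lookup table, and the use of a list of floats in place of camera images are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-ins: a real system would use camera images and a
# learned model; here an observation is just a short list of floats.
Observation = Sequence[float]
Action = Tuple[float, float, float]


@dataclass
class HierarchicalPolicy:
    """Toy action hierarchy: task -> language motion -> action."""

    # Stage 1: (observation, task) -> language motion such as "move arm forward"
    motion_query: Callable[[Observation, str], str]
    # Stage 2: (observation, task, language motion) -> low-level action
    action_query: Callable[[Observation, str, str], Action]

    def step(
        self, obs: Observation, task: str, correction: Optional[str] = None
    ) -> Tuple[str, Action]:
        # A human-specified language motion, if given, replaces the predicted one.
        motion = correction if correction is not None else self.motion_query(obs, task)
        action = self.action_query(obs, task, motion)
        return motion, action


# Toy stage-1 rule (assumption): choose a motion from the first feature.
def toy_motion_query(obs: Observation, task: str) -> str:
    return "move arm forward" if obs[0] < 0.5 else "close gripper"


# Toy stage-2 rule (assumption): look up a fixed end-effector delta per motion.
MOTION_TO_DELTA = {
    "move arm forward": (0.1, 0.0, 0.0),
    "move arm left": (0.0, 0.1, 0.0),
    "close gripper": (0.0, 0.0, 0.0),
}


def toy_action_query(obs: Observation, task: str, motion: str) -> Action:
    return MOTION_TO_DELTA[motion]


policy = HierarchicalPolicy(toy_motion_query, toy_action_query)

# Autonomous step: the policy picks its own language motion.
auto_motion, auto_action = policy.step([0.2], "pick coke can")

# Corrected step: a human supplies the language motion instead.
fixed_motion, fixed_action = policy.step(
    [0.2], "pick coke can", correction="move arm left"
)
```

Because the action stage only sees the language motion it is handed, a correction changes behavior without any teleoperation. In this sketch that is a dictionary lookup; in the paper both stages are learned and use visual context.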