Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics
arXiv (2024)
Abstract
We show that off-the-shelf text-based Transformers, with no additional
training, can perform few-shot in-context visual imitation learning, mapping
visual observations to action sequences that emulate the demonstrator's
behaviour. We achieve this by transforming visual observations (inputs) and
trajectories of actions (outputs) into sequences of tokens that a
text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a
framework we call Keypoint Action Tokens (KAT). Despite being trained only on
language, we show that these Transformers excel at translating tokenised visual
keypoint observations into action trajectories, performing on par with or
better than state-of-the-art imitation learning (diffusion policies) in the low-data
regime on a suite of real-world, everyday tasks. Rather than operating in the
language domain as is typical, KAT leverages text-based Transformers to operate
in the vision and action domains to learn general patterns in demonstration
data for highly efficient imitation learning, indicating promising new avenues
for repurposing natural language models for embodied tasks. Videos are
available at https://www.robot-learning.uk/keypoint-action-tokens.
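To make the in-context prompting pattern concrete, below is a minimal Python sketch of the idea the abstract describes: serialising demonstration (keypoint, action) pairs into a text prompt and asking a text-pretrained model to complete the action sequence for a new observation. The serialisation format, the helper names, and the use of the OpenAI chat API are illustrative assumptions; the paper's actual Keypoint Action Token encoding may differ.

```python
# Sketch of few-shot in-context imitation via a text LLM, in the spirit of KAT.
# Assumptions (not from the paper): keypoints and actions are lists of (x, y, z)
# tuples, and numbers are serialised as plain space-separated decimals.
from openai import OpenAI

client = OpenAI()

def serialize(points):
    """Flatten a list of (x, y, z) tuples into a compact, token-friendly string."""
    return " ".join(f"{v:.3f}" for p in points for v in p)

def build_prompt(demos, test_keypoints):
    """Each demo is a (keypoints, action_trajectory) pair provided in-context."""
    lines = ["Map observation keypoints to an action trajectory."]
    for obs, act in demos:
        lines.append(f"Observation: {serialize(obs)}")
        lines.append(f"Actions: {serialize(act)}")
    lines.append(f"Observation: {serialize(test_keypoints)}")
    lines.append("Actions:")
    return "\n".join(lines)

def predict_actions(demos, test_keypoints, model="gpt-4-turbo"):
    """Query the LLM once; the completion is parsed back into waypoints downstream."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(demos, test_keypoints)}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```

The key design point the abstract highlights is that no weights are updated: the demonstrations live entirely in the prompt, so adding or removing a demonstration changes behaviour immediately, which is what makes the approach few-shot and data-efficient.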