Vision-Language Models as a Source of Rewards

Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Dmitry Nikulin, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang

arXiv (2023)

Abstract
Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models and used to train RL agents that achieve those goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.
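The abstract's core idea, scoring how well the current visual observation matches a language goal with a CLIP-style model and converting that score into a reward, can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers CLIP interface; the checkpoint name, prompt handling, and threshold value are assumptions made for illustration, not the paper's reported setup.

```python
# Minimal sketch: deriving a binary reward for a language goal from CLIP
# image-text similarity. Model choice, prompt format, and threshold are
# illustrative assumptions, not the paper's exact recipe.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_reward(frame: Image.Image, goal: str, threshold: float = 0.3) -> float:
    """Return 1.0 if the current observation appears to satisfy the language goal."""
    inputs = processor(text=[goal], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # Cosine similarity between the frame embedding and the goal-text embedding.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb * text_emb).sum(dim=-1).item()
    # Threshold the similarity to obtain a sparse binary reward for the RL agent.
    return 1.0 if similarity > threshold else 0.0
```

In an RL loop, the returned value would replace (or supplement) the environment reward at each step, so the same agent can be trained toward any goal expressible as a text prompt; the thresholding step is one simple way to turn a continuous similarity score into the sparse reward signal the abstract describes.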