Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 (2021)

Cited 15 | Viewed 82
Abstract
GuessWhat?! is a visual dialog guessing game in which a Questioner agent generates a sequence of questions and an Oracle agent answers each question about a target object in an image. Based on the dialog history between the Questioner and the Oracle, a Guesser agent makes a final guess at the target object. While previous work has focused on dialogue policy optimization and visual-linguistic information fusion, most approaches learn the visual-linguistic encoding for the three agents solely on the GuessWhat?! dataset, without shared prior knowledge of visual-linguistic representation. To bridge these gaps, this paper proposes new Oracle, Guesser, and Questioner models that take advantage of a pretrained visual-linguistic model, VilBERT. For the Oracle model, we introduce a two-way background/target fusion mechanism to understand both intra- and inter-object questions. For the Guesser model, we introduce a state-estimator that best exploits VilBERT's strength in single-turn referring expression comprehension. For the Questioner, we share the state-estimator from the pretrained Guesser to guide the question generator. Experimental results show that our proposed models outperform state-of-the-art models by significant margins of 7%, 10%, and 12% for the Oracle, Guesser, and end-to-end Questioner, respectively.
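To make the Guesser's role concrete, the sketch below illustrates the general idea of a state-estimator: maintaining a belief distribution over candidate objects that is updated after each question-answer turn. This is a minimal, hypothetical illustration assuming per-turn object logits from a VilBERT-style encoder; the function `guesser_state_update` and the multiplicative-update rule are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

def guesser_state_update(prior, turn_logits):
    """Hypothetical belief update for a Guesser state-estimator:
    softmax this turn's per-object logits into evidence, multiply
    by the prior belief, and renormalize. The paper's actual
    state-estimator may differ."""
    evidence = np.exp(turn_logits - turn_logits.max())  # stable softmax
    evidence /= evidence.sum()
    posterior = prior * evidence
    return posterior / posterior.sum()

# Toy dialog over 3 candidate objects: each turn, a VilBERT-style
# encoder (assumed) scores every candidate against the dialog so far.
belief = np.full(3, 1 / 3)                      # uniform prior
for logits in [np.array([0.2, 1.5, 0.1]),       # turn 1 evidence
               np.array([0.0, 2.0, -1.0])]:     # turn 2 evidence
    belief = guesser_state_update(belief, logits)

guess = int(np.argmax(belief))                  # final guess: object 1
```

In the paper's end-to-end Questioner, this same belief state is shared with the question generator, so questions can be conditioned on which candidates remain plausible.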
Keywords
visual dialog guessing game, Questioner agent, Oracle agent, Guesser agent, dialogue policy optimization, visual-linguistic information fusion, visual-linguistic encoding, GuessWhat, visual-linguistic representation, pretrained visual-linguistic model, state-estimator, question generator, visual dialog agents, VilBERT model, single-turn referring expression comprehension