PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues.

AAAI(2023)

引用 0|浏览0
暂无评分
摘要
Humans naturally use referring expressions with verbal utterances and nonverbal gestures to refer to objects and events. As these referring expressions can be interpreted differently from the speaker's or the observer's perspective, people effectively decide on the perspective in comprehending the expressions. However, existing models do not explicitly learn perspective grounding , which often causes the models to perform poorly in understanding embodied referring expressions. To make it exacerbate, these models are often trained on datasets collected in non-embodied settings without nonverbal gestures and curated from an exocentric perspective. To address these issues, in this paper, we present a perspective-aware multitask learning model, called PATRON, for relation and object grounding tasks in embodied settings by utilizing verbal utterances and nonverbal cues. In PATRON, we have developed a guided fusion approach, where a perspective grounding task guides the relation and object grounding task. Through this approach, PATRON learns disentangled task-specific and task-guidance representations, where task-guidance representations guide the extraction of salient multimodal features to ground the relation and object accurately. Furthermore, we have curated a synthetic dataset of embodied referring expressions with multimodal cues, called CAESARPRO. The experimental results suggest that PATRON outperforms the evaluated state-of-the-art visual-language models. Additionally, the results indicate that learning to ground perspective helps machine learning models to improve the performance of the relation and object grounding task. Furthermore, the insights from the extensive experimental results and the proposed dataset will enable researchers to evaluate visual-language models' effectiveness in understanding referring expressions in other embodied settings.
更多
查看译文
关键词
expression grounding,multimodal,perspective-aware
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要