
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

Annual Meeting of the Association for Computational Linguistics (2024)

University of Chinese Academy of Sciences (UCAS); Beijing Academy of Artificial Intelligence

Abstract
Visual grounding (VG) aims at locating the foreground entities that match the given natural language expression. Previous datasets and methods for the classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios. Since users usually prefer to provide intention-based expressions for the desired object instead of covering all the details, it is necessary for agents to interpret intention-driven instructions. Thus, in this work, we take a step further towards intention-driven visual-language (V-L) understanding. To promote classic VG towards human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build the largest-scale IVG dataset, named IntentionVG, with free-form intention expressions. Considering that practical agents need to move and find specific targets among various scenarios to realize the grounding task, our IVG task and IntentionVG dataset take the crucial properties of both multi-scenario perception and egocentric view into consideration. Besides, various types of models are set up as baselines to realize our IVG task. Extensive experiments on our IntentionVG dataset and baselines demonstrate the necessity and efficacy of our method for the V-L field. To foster future research in this direction, our newly built dataset and baselines will be publicly available.
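To illustrate the difference between classic VG and the proposed IVG task, the minimal Python sketch below contrasts a literal referring expression with an intention-based one. All field names, file names, and values are illustrative assumptions, not the actual IntentionVG annotation schema.

```python
# Hypothetical sketch contrasting classic VG and intention-driven VG (IVG).
# Field names and values are illustrative assumptions, not the real
# IntentionVG annotation format.

# Classic VG: the expression literally describes the target object.
classic_vg_sample = {
    "image": "kitchen_scene.jpg",             # assumed file name
    "expression": "the red mug on the left side of the counter",
    "target_bbox": [120, 85, 210, 170],       # [x1, y1, x2, y2], assumed
}

# IVG: the expression states the user's intention; the agent must infer
# which object in the scene satisfies it.
ivg_sample = {
    "image": "kitchen_scene_egocentric.jpg",  # egocentric view, assumed
    "intention": "I'm thirsty and want something to drink water from",
    "target_bbox": [120, 85, 210, 170],       # same grounding format, assumed
    "scenario": "kitchen",                    # multi-scenario tag, assumed
}

if __name__ == "__main__":
    for name, sample in [("classic VG", classic_vg_sample), ("IVG", ivg_sample)]:
        query = sample.get("expression") or sample.get("intention")
        print(f"{name}: query={query!r} -> bbox={sample['target_bbox']}")
```

In both cases the output is a bounding box for the target object; the difference is that the IVG query never names the object, so the model must reason from the stated intention to the entity that fulfills it.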
Key words
Data Visualization, Natural Language Generation