Reasoning across language and vision in machines and humans

Reasoning across language and vision in machines and humans(2013)

引用 23|浏览20
暂无评分
摘要
Humans not only outperform AI and computer-vision systems, but use an unknown computational mechanism to perform tasks for which no suitable approaches exist. I present work investigating both novel tasks and how humans approach them in the context of computer vision and linguistics. I demonstrate a system which, like children, acquires high-level linguistic knowledge about the world. Robots learn to play physically-instantiated board games and use that knowledge to engage in physical play. To further integrate language and vision I develop an approach which produces rich sentential descriptions of events depicted in videos. I then show how to simultaneously detect and track objects, recognize events, and produce sentences. This tighter integration of language and vision enables a novel task: sentential video retrieval. A video corpus can be searched for clips which depict a target sentence rather than just a collection of individual query words. This work assumes a compositional representation of events, composing sentence models from word models. Perhaps the reason why humans perform tasks such as the above with ease is because of a tight integration of language and vision exploiting the compositionality inherent in both modalities. I present work indicating that this may be the case. Humans are shown videos while fMRI data is acquired and sentences which describe those videos are recovered compositionally.
更多
查看译文
关键词
target sentence,suitable approach,rich sentential description,sentential video retrieval,computer vision,physical play,high-level linguistic knowledge,composing sentence model,present work,novel task
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要