Entity linking across vision and language
Multimedia Tools Appl. (2017)
Abstract
We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we jointly address two questions: (a) What do the textual entity mentions refer to? and (b) What or who appears in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. The MRF incorporates beliefs produced by independent methods for the textual and visual entities, and these beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modeling yields significantly better performance than text-based and vision-based approaches. In particular, textual mentions that cannot be resolved using text-only methods are resolved correctly by our method. The approaches described here bring us closer to automated multimedia indexing.
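The idea of propagating beliefs across modalities can be illustrated with a toy pairwise MRF. The sketch below is not the paper's model: the node names, belief values, agreement weights, and the brute-force MAP search are all assumptions chosen for illustration, since the abstract does not specify the potentials or the inference procedure. It shows only how an ambiguous textual mention could be resolved once it is tied to a confident visual detection.

```python
from itertools import product

LABELS = ["lion", "zebra"]

# Hypothetical unary beliefs from independent per-modality methods.
# The textual mention (e.g. a pronoun) is ambiguous on its own;
# the visual detector is fairly confident about the key-frame region.
unary = {
    "mention_it": {"lion": 0.5, "zebra": 0.5},
    "frame_region": {"lion": 0.8, "zebra": 0.2},
}

# Hypothetical cross-modal edge: the subtitle containing the mention
# is assumed to overlap in time with the key frame.
edges = [("mention_it", "frame_region")]

# Assumed pairwise potential: linked nodes are rewarded for sharing a label.
AGREE, DISAGREE = 2.0, 1.0


def score(assignment):
    """Unnormalized MRF score: product of unary and pairwise potentials."""
    s = 1.0
    for node, label in assignment.items():
        s *= unary[node][label]
    for a, b in edges:
        s *= AGREE if assignment[a] == assignment[b] else DISAGREE
    return s


def map_assignment():
    """Exhaustive MAP search; only feasible for very small graphs."""
    nodes = list(unary)
    best = max(
        product(LABELS, repeat=len(nodes)),
        key=lambda labels: score(dict(zip(nodes, labels))),
    )
    return dict(zip(nodes, best))


if __name__ == "__main__":
    print(map_assignment())
```

With these assumed numbers, the text-only belief for the mention is a tie, while the joint assignment labels both nodes "lion". A realistic model would have many more nodes and would need approximate inference rather than enumeration.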
Keywords
Entity linking, Animal labeling, Multimedia indexing, Language-vision alignment