Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding

WACV (2023)

Abstract
To understand the surrounding world, our brains are continuously inundated with multisensory information and its complex interactions at any given moment. While processing this information might seem effortless for the human brain, building a machine that performs similar tasks is challenging: complex interactions cannot be handled by a single type of integration and require more sophisticated approaches. In this paper, we propose a simple new method for multisensory integration in video understanding. Unlike previous works that use a single fusion type, we design a multi-head model with individual event-specific layers that handle different audio-visual relationships, enabling different ways of audio-visual fusion. Experimental results show that our event-specific layers can discover unique properties of the audio-visual relationships in videos, e.g., semantically matched moments and rhythmic events. Moreover, although our network is trained with single labels, our multi-head design can inherently output additional, semantically meaningful multi-labels for a video. As an application, we demonstrate that our proposed method can expose the extent of event characteristics in popular benchmark datasets.
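To make the multi-head idea concrete, below is a minimal, hypothetical PyTorch sketch of per-event fusion heads over audio and visual features. The feature dimensions, number of heads, concatenation-based fusion, shared classifier, and max-over-heads aggregation are all illustrative assumptions; the paper's actual event-specific layer designs are not reproduced here.

```python
import torch
import torch.nn as nn

class EventSpecificFusion(nn.Module):
    """Sketch of a multi-head audio-visual model: one independent fusion
    head per event type, each free to combine the modalities in its own way.
    All hyperparameters below are illustrative assumptions."""

    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256,
                 num_heads=4, num_classes=28):
        super().__init__()
        # One event-specific fusion layer per head (here a simple MLP over
        # concatenated features; the paper allows each head its own fusion type).
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(audio_dim + visual_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )
            for _ in range(num_heads)
        ])
        # Shared classifier applied to each head's fused representation.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, audio_dim), visual_feat: (B, visual_dim)
        fused_input = torch.cat([audio_feat, visual_feat], dim=-1)
        # Each head yields its own prediction: (B, num_heads, num_classes).
        head_logits = torch.stack(
            [self.classifier(head(fused_input)) for head in self.heads], dim=1
        )
        # Aggregate over heads for single-label training (max is one simple
        # choice); per-head logits can serve as additional multi-label output.
        logits, _ = head_logits.max(dim=1)
        return logits, head_logits


# Usage with dummy features
model = EventSpecificFusion()
audio = torch.randn(2, 128)
visual = torch.randn(2, 512)
logits, per_head = model(audio, visual)
print(logits.shape, per_head.shape)  # torch.Size([2, 28]) torch.Size([2, 4, 28])
```

In this sketch, the per-head logits are what would permit the multi-label behavior described in the abstract: even when training supervises only the aggregated single-label prediction, each head can specialize to a different audio-visual relationship and fire on different events.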
Keywords
fusion, video, event-specific, audio-visual