Joint Learning of Audio–Visual Saliency Prediction and Sound Source Localization on Multi-face Videos

International Journal of Computer Vision (2023)

Abstract
Visual and audio events occur simultaneously, and both attract attention. However, most existing saliency prediction methods consider only the visual modality and ignore the influence of audio. In this paper, we propose a multi-task learning method for audio–visual saliency prediction and sound source localization on multi-face videos that leverages visual, audio and face information. Specifically, we first introduce a large-scale database of multi-face videos viewed under audio–visual conditions, containing eye-tracking data and sound source annotations. Using this database, we find that sound influences human attention and, conversely, that attention offers a cue for determining the sound source in multi-face videos. Guided by these findings, we introduce an audio–visual multi-task network (AVM-Net) to predict saliency and locate the sound source. AVM-Net consists of three branches corresponding to the visual, audio and face modalities. The visual branch has a two-stream architecture to capture spatial and temporal information. The audio and face branches encode audio signals and faces, respectively. Finally, a spatio-temporal multi-modal graph is constructed to model the interaction among multiple faces. By jointly optimizing these branches, the method exploits the intrinsic correlation between saliency prediction and sound source localization, so that each task improves the other. Experiments show that the proposed method outperforms 12 state-of-the-art saliency prediction methods and achieves competitive results in sound source localization.
Keywords
Saliency prediction, Audio–visual, Multi-face video, Deep learning, Sound source localization
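
The joint optimization mentioned in the abstract can be illustrated with a minimal sketch of a two-task objective. The abstract does not state the actual loss functions of AVM-Net, so everything here is an assumption made for illustration: KL divergence stands in for the saliency loss, softmax cross-entropy over per-face scores stands in for the sound source localization loss, and the names `kl_divergence`, `cross_entropy`, `joint_loss` and the `weight` parameter are hypothetical.

```python
import math


def kl_divergence(target, predicted, eps=1e-8):
    """KL(target || predicted) between two saliency maps given as flat lists.

    Both maps are normalized to sum to 1 before comparison.
    Assumed stand-in for the saliency loss; not taken from the paper.
    """
    t_sum = sum(target)
    p_sum = sum(predicted)
    loss = 0.0
    for t, p in zip(target, predicted):
        t_n = t / (t_sum + eps)
        p_n = p / (p_sum + eps)
        loss += t_n * math.log(t_n / (p_n + eps) + eps)
    return loss


def cross_entropy(face_logits, speaker_index):
    """Softmax cross-entropy over one score per face.

    speaker_index is the annotated sound source (the speaking face).
    Assumed stand-in for the localization loss; not taken from the paper.
    """
    m = max(face_logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in face_logits))
    return log_sum - face_logits[speaker_index]


def joint_loss(sal_target, sal_pred, face_logits, speaker_index, weight=1.0):
    """Weighted sum of the two task losses; weight is a hypothetical hyperparameter."""
    saliency_term = kl_divergence(sal_target, sal_pred)
    localization_term = cross_entropy(face_logits, speaker_index)
    return saliency_term + weight * localization_term
```

Under these assumptions, a single scalar objective drives both tasks, which is one common way for shared branches to receive gradients from saliency prediction and sound source localization at the same time.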