Multi-scale network with shared cross-attention for audio–visual correlation learning

Neural Comput. Appl. (2023)

Abstract
Cross-modal audio–visual correlation learning aims to capture and understand semantic correspondences between audio and video. The task faces two challenges: (i) audio and visual feature sequences belong to different feature spaces, and (ii) semantic mismatches between audio and visual sequences inevitably occur. Existing works address these challenges mainly by extracting discriminative features efficiently, while ignoring the abundant granular features of the audio and visual modalities. In this work, we introduce the multi-scale network with shared cross-attention (MSNSCA) for audio–visual correlation learning, a supervised representation learning framework that captures semantic audio–visual correspondences by integrating a multi-scale feature extraction module and a shared cross-attention module into an end-to-end trainable deep network. MSNSCA extracts more effective granular audio–visual features with strong audio–visual semantic matching capability. Experiments on various audio–visual learning tasks, including audio–visual matching and retrieval on benchmark datasets, demonstrate the effectiveness of the proposed MSNSCA model.
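The abstract does not specify the exact architecture, so the following is only a minimal PyTorch sketch of the shared cross-attention idea it names: both attention directions (audio→visual and visual→audio) reuse a single set of attention weights, letting each modality's queries attend over the other modality's features. All names here (SharedCrossAttention, dim, num_heads) are hypothetical placeholders, not the paper's code.

```python
import torch
import torch.nn as nn

class SharedCrossAttention(nn.Module):
    """Cross-attention whose projection weights are shared across both
    directions (audio attends to visual, and visual attends to audio)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One attention module reused for both directions (the "shared" part).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio:  (batch, T_a, dim) audio feature sequence
        # visual: (batch, T_v, dim) visual feature sequence
        # Audio queries attend over visual keys/values, and vice versa,
        # through the same weight matrices.
        a2v, _ = self.attn(query=audio, key=visual, value=visual)
        v2a, _ = self.attn(query=visual, key=audio, value=audio)
        # Residual connection plus normalization on each modality stream.
        return self.norm_a(audio + a2v), self.norm_v(visual + v2a)

# Example usage with dummy multi-scale features projected to a common dim.
if __name__ == "__main__":
    layer = SharedCrossAttention(dim=256)
    audio = torch.randn(2, 50, 256)   # e.g. 50 audio frames
    visual = torch.randn(2, 30, 256)  # e.g. 30 video frames
    audio_out, visual_out = layer(audio, visual)
    print(audio_out.shape, visual_out.shape)  # (2, 50, 256) (2, 30, 256)
```

Sharing one attention module across both directions keeps the two modalities' projections in a common space, which is one plausible way to address challenge (i) above; the paper's actual weight-sharing scheme may differ.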
Keywords
Cross-modal retrieval, Multi-scale network, Cross-attention, Representation learning