Synchformer: Efficient Synchronization from Sparse Cues

CoRR (2024)

Abstract
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model and a training scheme that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
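As a rough illustration of the segment-level contrastive pre-training the abstract describes, the sketch below computes a symmetric InfoNCE loss between temporally aligned audio and visual segment embeddings, which is one common way to train paired encoders before a downstream synchronization head. The function name, tensor shapes, and temperature value are assumptions for illustration; the paper's actual encoders and loss formulation are not detailed in this abstract.

```python
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Hypothetical segment-level contrastive (InfoNCE) loss.

    audio_feats, visual_feats: (batch, num_segments, dim) embeddings
    produced by separate audio and visual encoders for temporally
    aligned segments of the same clips.
    """
    B, S, D = audio_feats.shape
    # Flatten so each (clip, segment) pair is one contrastive sample.
    a = F.normalize(audio_feats.reshape(B * S, D), dim=-1)
    v = F.normalize(visual_feats.reshape(B * S, D), dim=-1)

    # Cosine-similarity logits; diagonal entries are the aligned pairs,
    # all other segments in the batch serve as negatives.
    logits = a @ v.t() / temperature
    targets = torch.arange(B * S, device=logits.device)

    # Symmetric audio-to-visual and visual-to-audio InfoNCE terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Training the encoders this way leaves the synchronization model itself to be fit separately on the frozen (or lightly tuned) segment features, which is the decoupling the abstract refers to.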
Keywords
Audio-visual synchronization, transformers, multi-modal contrastive learning, evidence attribution