Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
British Machine Vision Conference (2023)
Abstract
Previous one-stage action detection approaches have modelled temporal
dependencies using only the visual modality. In this paper, we explore
different strategies to incorporate the audio modality, using multi-scale
cross-attention to fuse the two modalities. We also demonstrate the correlation
between the distance from a timestep to the action centre and the accuracy of
the predicted boundaries. Thus, we propose a novel network head to estimate the
closeness of timesteps to the action centre, which we call the centricity
score. This leads to increased confidence for proposals that exhibit more
precise boundaries. Our method can be integrated with other one-stage
anchor-free architectures and we demonstrate this on three recent baselines on
the EPIC-Kitchens-100 action detection benchmark, where we achieve
state-of-the-art performance. Detailed ablation studies showcase the benefits
of fusing audio and our proposed centricity scores. Code and models for our
proposed method are publicly available at
https://github.com/hanielwang/Audio-Visual-TAD.git
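
To make the two components described above concrete, here is a minimal PyTorch sketch; it is not the authors' released implementation (see the repository linked above for that). It reduces the paper's multi-scale cross-attention to a single scale, and all module names, feature dimensions, and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses audio into visual features via cross-attention:
    queries come from the visual stream, keys/values from audio."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, T, dim), audio: (batch, T_a, dim)
        fused, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)  # residual + norm


class CentricityHead(nn.Module):
    """Predicts, per timestep, a score in (0, 1) estimating how close
    that timestep lies to an action centre."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, 1, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim, T) fused audio-visual features
        return torch.sigmoid(self.net(feats)).squeeze(1)  # (batch, T)


if __name__ == "__main__":
    v = torch.randn(2, 128, 256)                 # visual features: (batch, T, dim)
    a = torch.randn(2, 64, 256)                  # audio features: (batch, T_a, dim)
    fused = CrossModalFusion(256)(v, a)          # (2, 128, 256)
    centricity = CentricityHead(256)(fused.transpose(1, 2))  # (2, 128)
    cls_scores = torch.rand(2, 97, 128)          # e.g. 97 EPIC-Kitchens verb classes
    # Weighting classification confidence by centricity suppresses proposals
    # generated far from an action centre, whose boundaries tend to be imprecise.
    final = cls_scores * centricity.unsqueeze(1)
```

The final multiplication reflects the abstract's claim: because boundary accuracy correlates with distance to the action centre, down-weighting off-centre timesteps raises the confidence of proposals with more precise boundaries relative to the rest.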