TIM: A Time Interval Machine for Audio-Visual Action Recognition
arXiv (2024)
Abstract
Diverse actions give rise to rich audio-visual signals in long videos. Recent
works showcase that the two modalities of audio and video exhibit different
temporal extents of events and distinct labels. We address the interplay
between the two modalities in long videos by explicitly modelling the temporal
extents of audio and visual events. We propose the Time Interval Machine (TIM)
where a modality-specific time interval poses as a query to a transformer
encoder that ingests a long video input. The encoder then attends to the
specified interval, as well as the surrounding context in both modalities, in
order to recognise the ongoing action.
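The core mechanism, using a time interval as a query to an encoder over long-video features, can be sketched as below. This is an illustrative simplification, not the paper's implementation: the interval encoding, the single-query scaled dot-product attention, and all dimensions are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_interval(start, end, d_model):
    """Hypothetical sinusoidal-style encoding of an interval's start and
    end times into one query vector (illustrative; the paper's exact
    time-interval encoding may differ)."""
    freqs = np.exp(-np.arange(d_model // 2) / (d_model // 2))
    s = np.concatenate([np.sin(start * freqs), np.cos(start * freqs)])
    e = np.concatenate([np.sin(end * freqs), np.cos(end * freqs)])
    return (s + e) / 2.0                            # shape (d_model,)

def attend(query, context):
    """Scaled dot-product attention of one query over T context tokens."""
    d = query.shape[-1]
    scores = context @ query / np.sqrt(d)           # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ context                        # (d_model,)

# Toy stand-in for a long-video input: T timesteps of audio-visual features.
T, d_model = 50, 32
context = rng.normal(size=(T, d_model))

# A modality-specific (start, end) interval queries the encoder; the
# attention output is an interval-conditioned feature for recognition.
q = encode_interval(start=1.2, end=3.4, d_model=d_model)
pooled = attend(q, context)
print(pooled.shape)  # (32,)
```

In the full model a classification head on the pooled feature would predict the action label for the queried interval, with attention free to draw on surrounding context in both modalities.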
We test TIM on three long audio-visual video datasets: EPIC-KITCHENS,
Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On
EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly
larger pre-training, by 2.9% top-1 action recognition accuracy. Additionally,
we show that TIM can be adapted for action detection, using dense multi-scale
interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and
showing strong performance on the Perception Test. Our ablations show the
critical role of integrating the two modalities and modelling their time
intervals in achieving this performance. Code and models at:
https://github.com/JacobChalk/TIM
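The "dense multi-scale interval queries" used for detection can be illustrated with a small proposal generator: slide windows of several temporal scales over the video and emit each as a candidate (start, end) query. The scale set and stride ratio here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def multiscale_intervals(video_len, scales=(1.0, 2.0, 4.0), stride_ratio=0.5):
    """Generate dense interval proposals at several temporal scales.
    A sketch of dense multi-scale interval queries; the scales and
    stride_ratio are hypothetical values for illustration."""
    intervals = []
    for scale in scales:
        stride = scale * stride_ratio
        start = 0.0
        # Slide a window of width `scale` across the video.
        while start + scale <= video_len + 1e-9:
            intervals.append((start, start + scale))
            start += stride
    return np.array(intervals)

# For an 8-second clip this yields 15 + 7 + 3 = 25 candidate queries.
props = multiscale_intervals(video_len=8.0)
print(len(props))  # 25
```

Each proposal would then be fed to the encoder as an interval query, and detections scored per query, so the same recognition machinery covers detection.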