GTA: Global Temporal Attention for Video Action Understanding

British Machine Vision Conference (2021)

Abstract
Self-attention learns pairwise interactions via dot products to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. In particular, we demonstrate that the entangled modeling of spatial-temporal information by flattening all pixels is sub-optimal, failing to capture temporal relationships among frames explicitly. We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA randomly initializes a global attention matrix that is intended to learn stable temporal structures that generalize across different samples. GTA is further augmented in a cross-channel multi-head fashion to exploit feature interactions for better temporal modeling. We apply GTA not only to pixels but also to semantically similar regions identified automatically by a learned transformation matrix. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
Keywords
global temporal attention,action,video
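
To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of a GTA-style layer: a randomly initialized, sample-independent T x T temporal attention matrix (one per cross-channel head) that mixes frames, in contrast to the instance-specific matrix of standard self-attention. The module name, tensor shapes, head count, initialization scale, and residual connection are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Sketch of a GTA-style layer: a learned, sample-independent T x T
    temporal attention matrix (one per cross-channel head) mixes frames,
    instead of the instance-specific matrix of standard self-attention."""

    def __init__(self, channels: int, num_frames: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0, "channels must divide evenly into heads"
        self.num_heads = num_heads
        # Randomly initialized global attention matrix, learned end to end
        # and shared across all samples (init scale is a guess).
        self.attn = nn.Parameter(0.01 * torch.randn(num_heads, num_frames, num_frames))
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -- e.g. features produced after spatial attention
        b, c, t, h, w = x.shape
        d = c // self.num_heads
        # Split channels into heads: (B, heads, d, T, H*W)
        v = x.view(b, self.num_heads, d, t, h * w)
        # Row-wise softmax turns each global matrix into attention weights.
        attn = self.attn.softmax(dim=-1)  # (heads, T, T)
        # Re-weight frames along the temporal axis with the shared matrix.
        out = torch.einsum('htu,bhdus->bhdts', attn, v)
        out = out.reshape(b, c, t, h, w)
        return x + self.proj(out)  # residual connection (an assumption)

# Usage: a batch of 2 clips, 64 channels, 8 frames, 14x14 spatial grid.
gta = GlobalTemporalAttention(channels=64, num_frames=8)
clip = torch.randn(2, 64, 8, 14, 14)
print(gta(clip).shape)  # torch.Size([2, 64, 8, 14, 14])
```

The cross-channel aspect is approximated here by giving each head its own matrix over a disjoint channel group; the paper additionally applies GTA over semantically similar regions found by a learned transformation matrix, which this sketch omits.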