STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition

Applied Intelligence (2023)

Abstract
Video action recognition must distinguish between actions by modeling fine-grained differences in their spatio-temporal features. We propose the rethinking spatio-temporal cross attention transformer (STAR++), a multi-modal transformer-based model that uses both RGB and skeleton information and extends the spatio-temporal cross attention transformer (STAR-Transformer). STAR++ unifies the encoder-decoder structure of the base STAR-Transformer into an encoder structure and applies a new method that uses interval attention as the spatio-temporal cross attention. As the layers deepen, the interval attention moves from local features to global features, which lets the model learn in a way suited to the properties of transformers and improves performance. STAR++ also proposes a deformable 3D token selection that dynamically selects the tokens used in an attention operation so that tokens can be learned efficiently. STAR++ demonstrated competitive performance compared with other state-of-the-art models on the Penn Action and NTU-RGB+D 60 and 120 action recognition benchmark datasets. In addition, an ablation study confirmed that each proposed module has an essential effect on the performance improvement.
Keywords
Action recognition, Vision transformer, Multi-modal, Deformable 3D token selection, Spatio-temporal cross attention