Learning Correlation Structures for Vision Transformers
CVPR 2024
Abstract
We introduce a new attention mechanism, dubbed structural self-attention
(StructSA), that leverages rich correlation patterns naturally emerging in
key-query interactions of attention. StructSA generates attention maps by
recognizing space-time structures of key-query correlations via convolution and
uses them to dynamically aggregate local contexts of value features. This
effectively leverages rich structural patterns in images and videos such as
scene layouts, object motion, and inter-object relations. Using StructSA as a
main building block, we develop the structural vision transformer (StructViT)
and evaluate its effectiveness on both image and video classification tasks,
achieving state-of-the-art results on ImageNet-1K, Kinetics-400,
Something-Something V1 & V2, Diving-48, and FineGym.
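To make the mechanism concrete, below is a minimal, hypothetical sketch of the core idea described in the abstract: compute query-key correlations, convolve each query's correlation map to capture local (structural) patterns rather than only pointwise similarities, and use the resulting weights to aggregate value features. All names (StructSASketch, struct_conv) and design details (single-channel convolution, 2D-only, no space-time kernels) are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class StructSASketch(nn.Module):
    """Illustrative sketch of structure-aware self-attention on 2D feature maps.

    Assumption-laden simplification: the published StructSA operates on
    space-time correlations with richer kernels; this only shows the idea of
    convolving query-key correlation maps before the softmax.
    """

    def __init__(self, dim, num_heads=4, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        # Convolution over each query's correlation map: detects local
        # structural patterns in the similarities, not just point values.
        self.struct_conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        B, C, H, W = x.shape
        N = H * W
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q = q.reshape(B, self.num_heads, self.head_dim, N).transpose(-2, -1)  # B,h,N,d
        k = k.reshape(B, self.num_heads, self.head_dim, N)                    # B,h,d,N
        v = v.reshape(B, self.num_heads, self.head_dim, N).transpose(-2, -1)  # B,h,N,d

        corr = (q @ k) * self.scale                      # B,h,N,N raw correlations
        # View each query's row of correlations as an HxW map and convolve it,
        # so the attention weights reflect local correlation structure.
        corr = corr.reshape(B * self.num_heads * N, 1, H, W)
        corr = self.struct_conv(corr).reshape(B, self.num_heads, N, N)

        attn = corr.softmax(dim=-1)
        out = (attn @ v).transpose(-2, -1).reshape(B, C, H, W)
        return self.proj(out)


# Usage example (shapes only):
x = torch.randn(2, 64, 14, 14)
y = StructSASketch(64)(x)   # output has the same shape as the input
```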