Semantic-Disentangled Transformer With Noun-Verb Embedding for Compositional Action Recognition

IEEE TRANSACTIONS ON IMAGE PROCESSING(2024)

引用 0|浏览4
暂无评分
摘要
Recognizing actions performed on unseen objects, known as Compositional Action Recognition (CAR), has attracted increasing attention in recent years. The main challenge is to overcome the distribution shift of "action-objects" pairs between the training and testing sets. Previous works for CAR usually introduce extra information (e.g. bounding box) to enhance the dynamic cues of video features. However, these approaches do not essentially eliminate the inherent inductive bias in the video, which can be regarded as the stumbling block for model generalization. Because the video features are usually extracted from the visually cluttered areas in which many objects cannot be removed or masked explicitly. To this end, this work attempts to implicitly accomplish semantic-level decoupling of "object-action" in the high-level feature space. Specifically, we propose a novel Semantic-Decoupling Transformer framework, dubbed as DeFormer, which contains two insightful sub-modules: Objects-Motion Decoupler (OMD) and Semantic-Decoupling Constrainer (SDC). In OMD, we initialize several learnable tokens incorporating annotation priors to learn an instance-level representation and then decouple it into the appearance feature and motion feature in high-level visual space. In SDC, we use textual information in the high-level language space to construct a dual-contrastive association to constrain the decoupled appearance feature and motion feature obtained in OMD. Extensive experiments verify the generalization ability of DeFormer. Specifically, compared to the baseline method, DeFormer achieves absolute improvements of 3%, 3.3%, and 5.4% under three different settings on STH-ELSE, while corresponding improvements on EPIC-KITCHENS-55 are 4.7%, 9.2%, and 4.4%. Besides, DeFormer gains state-of-the-art results either on ground-truth or detected annotations.
更多
查看译文
关键词
Compositional action recognition,video transformer,semantic-disentangled
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要