Motion Stimulation for Compositional Action Recognition

IEEE Transactions on Circuits and Systems for Video Technology(2023)

引用 14|浏览63
暂无评分
摘要
Recognizing the unseen combinations of action and different objects, namely (zero-shot) compositional action recognition, is extremely challenging for conventional action recognition algorithms in real-world applications. Previous methods focus on enhancing the dynamic clues of objects that appear in the scene by building region features or tracklet embedding from ground-truths or detected bounding boxes. These methods rely heavily on manual annotation or the quality of detectors, which are inflexible for practical applications. In this work, we aim to mining the temporal clues from moving objects or hands without explicit supervision. Thus, we propose a novel Motion Stimulation (MS) block, which is specifically designed to mine dynamic clues of the local regions autonomously from adjacent frames. Furthermore, MS consists of the following three steps: motion feature extraction, motion feature recalibration, and action-centric excitation. The proposed MS block can be directly and conveniently integrated into existing video backbones to enhance the ability of compositional generalization for action recognition algorithms. Extensive experimental results on three action recognition datasets, the Something-Else, IKEA-Assembly and EPIC-KITCHENS datasets, indicate the effectiveness and interpretability of our MS block.
更多
查看译文
关键词
Motion stimulation,compositional action recognition,motion feature extraction,motion feature recalibration,action-centric excitation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要