Deformable patch embedding-based shift module-enhanced transformer for panoramic action recognition

Vis. Comput. (2023)

Abstract
360° video action recognition is a promising field given the growing popularity of omnidirectional cameras. To obtain a more precise understanding of actions in panoramic scenes, we propose a deformable patch embedding-based, temporal shift module-enhanced vision transformer model (DS-ViT), which aims to simultaneously eliminate the distortion effects caused by equirectangular projection (ERP) and model the temporal relationships across a video sequence. Panoramic action recognition is a practical but challenging domain because panoramic feature extraction methods are lacking. With deformable patch embedding, our scheme adaptively learns the position offsets between pixels, which effectively captures the distorted features. The temporal shift module facilitates temporal information exchange by shifting part of the channels along the time axis, which adds zero parameters. Thanks to the powerful encoder, DS-ViT can efficiently learn the distorted features from ERP inputs. Simulation results on the recent EgoK360 dataset show that our proposed solution outperforms the state-of-the-art two-stream solution by 9.29% in action accuracy and 8.18% in activity accuracy.
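The channel-shifting idea can be illustrated with a minimal sketch. It follows the commonly used temporal shift formulation (one slice of channels taken from the next frame, one slice from the previous frame, the rest left untouched, zero padding at the sequence boundaries). The function name `temporal_shift`, the `fold_div` parameter, the bidirectional split, and the zero padding are assumptions made for illustration; the abstract does not specify how DS-ViT configures its shift.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames: list of T frames, each a list of C channel values.
    fold_div: hypothetical parameter; C // fold_div channels are shifted
              in each temporal direction, the rest are left in place.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Start from zeros so positions with no source frame stay zero-padded.
    out = [[0.0] * num_channels for _ in range(num_frames)]

    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # first slice: take the value from the next frame
            elif c < 2 * fold:
                src = t - 1   # second slice: take the value from the previous frame
            else:
                src = t       # remaining channels are not shifted
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


# Three frames with four channels each; with fold_div=4, one channel is
# shifted in each direction and two channels stay where they are.
features = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
shifted = temporal_shift(features, fold_div=4)
```

The operation only copies values between time steps and has no weights, which is consistent with the abstract's statement that the shift adds zero parameters.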
Keywords
Panoramic, Action recognition, Vision transformer, Temporal shift