Video-Based Human Pose Regression via Decoupled Space-Time Aggregation
CVPR 2024
Abstract
By leveraging temporal dependency in video sequences, multi-frame human pose
estimation algorithms have demonstrated remarkable results in complicated
situations, such as occlusion, motion blur, and video defocus. These algorithms
are predominantly based on heatmaps, resulting in high computation and storage
requirements per frame, which limits their flexibility and real-time
application in video scenarios, particularly on edge devices. In this paper, we
develop an efficient and effective video-based human pose regression method,
which bypasses intermediate representations such as heatmaps and instead
directly maps the input to the output joint coordinates. Despite the inherent
spatial correlation among adjacent joints of the human pose, the temporal
trajectory of each individual joint exhibits relative independence. In light of
this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to
separately capture the spatial contexts between adjacent joints and the
temporal cues of each individual joint, thereby avoiding the conflation of
spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token
for each joint to facilitate the modeling of their spatiotemporal dependencies.
With the proposed joint-wise local-awareness attention mechanism, our method is
capable of efficiently and flexibly utilizing the spatial dependency of
adjacent joints and the temporal dependency of each joint itself. Extensive
experiments demonstrate the superiority of our method. Compared to previous
regression-based single-frame human pose estimation methods, DSTA significantly
enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017.
Furthermore, our approach either surpasses or is on par with the
state-of-the-art heatmap-based multi-frame human pose estimation methods.
Project page: https://github.com/zgspose/DSTA.
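The decoupled aggregation described in the abstract — spatial attention restricted to adjacent joints within a frame, and temporal attention along each joint's own trajectory — can be sketched in plain NumPy. This is an illustrative reading, not the authors' implementation: the token shape `(T, J, C)`, the single-head attention, the additive fusion, and the chain-style adjacency mask are all assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # scaled dot-product attention over the second-to-last axis of k/v
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block non-adjacent joints
    return softmax(scores) @ v

def decoupled_space_time(tokens, adjacency):
    # tokens: (T frames, J joints, C channels) -- one feature token per joint,
    #         as in DSTA's joint-wise tokenization (shapes are assumptions).
    # adjacency: (J, J) boolean mask of which joints count as "adjacent".
    T, J, C = tokens.shape
    # Spatial branch: per frame, each joint attends only to adjacent joints
    # (a local-awareness mask), capturing spatial context between neighbors.
    spatial = attention(tokens, tokens, tokens, mask=adjacency[None, :, :])
    # Temporal branch: per joint, attend along that joint's own trajectory
    # across frames, keeping joints independent of each other in time.
    per_joint = tokens.transpose(1, 0, 2)               # (J, T, C)
    temporal = attention(per_joint, per_joint, per_joint).transpose(1, 0, 2)
    # Additive fusion of the two branches is an assumption of this sketch.
    return spatial + temporal                           # (T, J, C)
```

A quick usage example with a 17-joint skeleton over 5 frames, using a simple chain adjacency (each joint adjacent to itself and its immediate neighbors):

```python
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 17, 32))
adj = (np.eye(17) + np.eye(17, k=1) + np.eye(17, k=-1)).astype(bool)
fused = decoupled_space_time(tokens, adj)   # shape (5, 17, 32)
```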