Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning
arXiv (2024)
Abstract
For a complete comprehension of multi-person scenes, it is essential to go
beyond basic tasks like detection and tracking. Higher-level tasks, such as
understanding the interactions and social activities among individuals, are
also crucial. Progress towards models that can fully understand scenes
involving multiple people is hindered by a lack of sufficient annotated data
for such high-level tasks. To address this challenge, we introduce Social-MAE,
a simple yet effective transformer-based masked autoencoder framework for
multi-person human motion data. The framework uses masked modeling to pre-train
the encoder to reconstruct masked human joint trajectories, enabling it to
learn generalizable and data-efficient representations of motion in crowded
human scenes. Social-MAE comprises a transformer as the MAE encoder and a
lighter-weight transformer as the MAE decoder, which operates on multi-person
joint trajectories in the frequency domain. After the reconstruction task, the
MAE decoder is replaced with a task-specific decoder and the model is
fine-tuned end-to-end for a variety of high-level social tasks. Our proposed
model, combined with our pre-training approach, achieves state-of-the-art
results on various high-level social tasks, including multi-person pose
forecasting, social grouping, and social action understanding. These
improvements are demonstrated across four popular multi-person datasets
encompassing both 2D and 3D human body pose.
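
To make the pipeline described above concrete, below is a minimal sketch of a Social-MAE-style masked autoencoder in PyTorch, based only on this abstract. The choice of the DCT as the frequency-domain transform, all layer sizes, the masking ratio, the tokenization (one token per person-joint trajectory), and names such as SocialMAE and mask_ratio are illustrative assumptions, not the authors' implementation; in particular, whether masked tokens are dropped before the encoder (as in image MAE) or replaced with a learned mask token, as here, is also an assumption.

# Sketch of a Social-MAE-style masked autoencoder for multi-person motion.
# Everything below the abstract's high-level description is an assumption.
import math
import torch
import torch.nn as nn

def dct_matrix(n: int) -> torch.Tensor:
    # Orthonormal DCT-II basis: one plausible "frequency domain" transform.
    t = torch.arange(n).float()
    basis = torch.cos(math.pi / n * (t[None, :] + 0.5) * t[:, None])
    basis[0] /= math.sqrt(2)
    return basis * math.sqrt(2.0 / n)

class SocialMAE(nn.Module):  # hypothetical name
    def __init__(self, num_joints=13, coord_dim=3, seq_len=16,
                 d_model=128, enc_layers=6, dec_layers=2, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        token_dim = coord_dim * seq_len  # one token per (person, joint)
        self.register_buffer("dct", dct_matrix(seq_len))
        self.embed = nn.Linear(token_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, enc_layers)
        # Lighter-weight decoder, per the abstract (fewer layers here).
        dec = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, dec_layers)
        self.head = nn.Linear(d_model, token_dim)

    def forward(self, x):
        # x: (batch, persons, joints, seq_len, coord_dim) joint trajectories
        b, p, j, t, c = x.shape
        # Map each trajectory into the frequency domain along the time axis.
        freq = torch.einsum("ft,bpjtc->bpjfc", self.dct, x)
        tokens = freq.reshape(b, p * j, t * c)
        emb = self.embed(tokens)
        # Randomly mask a fraction of person-joint tokens.
        keep = torch.rand(b, p * j, device=x.device) > self.mask_ratio
        emb = torch.where(keep[..., None], emb, self.mask_token)
        latent = self.encoder(emb)
        recon = self.head(self.decoder(latent))
        # Reconstruction loss only on the masked tokens.
        return ((recon - tokens) ** 2)[~keep].mean()

For fine-tuning, the abstract states that the MAE decoder is replaced with a task-specific decoder; in this sketch that would mean discarding self.decoder and self.head, attaching, say, a pose-forecasting head on top of self.encoder, and training the whole model end-to-end.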