A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals
arXiv (2024)
Abstract
Estimating full-body human motion via sparse tracking signals from
head-mounted displays and hand controllers in 3D scenes is crucial to
applications in AR/VR. One of the biggest challenges of this task is the
one-to-many mapping from sparse observations to dense full-body motions, which
entails inherent ambiguities. To help resolve this ambiguity, we
introduce a new framework that combines the rich contextual information provided by
3D scenes to benefit full-body motion tracking from sparse observations. To
estimate plausible human motions given sparse tracking signals and 3D scenes,
we develop S^2Fusion, a unified framework fusing Scene and
sparse Signals with a conditional difFusion model.
S^2Fusion first extracts the spatial-temporal relations residing in
the sparse signals via a periodic autoencoder, and then produces time-aligned
feature embeddings as additional inputs. Subsequently, by drawing the initial noisy
motion from a pre-trained prior, S^2Fusion utilizes conditional
diffusion to fuse scene geometry and sparse tracking signals to generate
full-body scene-aware motions. The sampling procedure of S^2Fusion is
further guided by specially designed scene-penetration and phase-matching
losses, which effectively regularize the motion of the lower body
even in the absence of any tracking signals, making the generated motion much
more plausible and coherent. Extensive experimental results have demonstrated
that our S^2Fusion outperforms the state-of-the-art in terms of
estimation quality and smoothness.
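
The loss-guided sampling described above can be illustrated with a minimal, self-contained sketch. The code below is not the authors' implementation: the denoiser is a dummy MLP, the scene-penetration term is replaced by a hypothetical floor-plane penalty, the phase-matching term by a placeholder scalar feature, and the diffusion step count, joint layout, and window length are all assumed. It only shows the general mechanism of steering each DDPM denoising step with the gradient of auxiliary losses evaluated on the predicted clean motion.

# Minimal sketch of loss-guided diffusion sampling (assumptions noted inline).
import torch
import torch.nn as nn

T = 50        # number of diffusion steps (assumed)
D = 22 * 3    # flattened 3D joint positions per frame (assumed skeleton)
F = 60        # frames per motion window (assumed)

betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class DummyDenoiser(nn.Module):
    """Stand-in for the conditional diffusion model: predicts the noise eps
    from the noisy motion and the (normalized) timestep. The real model would
    also condition on scene geometry and sparse tracking signals."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(F * D + 1, 256), nn.SiLU(),
                                 nn.Linear(256, F * D))
    def forward(self, x_t, t):
        t_feat = torch.full((x_t.shape[0], 1), float(t) / T)
        return self.net(torch.cat([x_t.flatten(1), t_feat], dim=1)).view_as(x_t)

def penetration_loss(x0):
    # Hypothetical stand-in for the scene-penetration term:
    # penalize joints below a floor plane at y = 0.
    y = x0.view(-1, F, D // 3, 3)[..., 1]
    return torch.relu(-y).square().mean()

def phase_loss(x0, target_phase):
    # Hypothetical stand-in for the phase-matching term:
    # match a scalar periodic feature of the motion.
    phase = x0.view(-1, F, D).mean(dim=(1, 2))
    return (phase - target_phase).square().mean()

@torch.no_grad()
def guided_sample(model, target_phase, scale=1.0):
    x_t = torch.randn(1, F, D)  # the paper draws this from a pre-trained
                                # motion prior rather than N(0, I)
    for t in reversed(range(T)):
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            eps = model(x_in, t)
            # Estimate the clean motion x0 from the current noisy sample.
            x0_hat = (x_in - torch.sqrt(1 - alpha_bars[t]) * eps) \
                     / torch.sqrt(alpha_bars[t])
            loss = penetration_loss(x0_hat) + phase_loss(x0_hat, target_phase)
            grad = torch.autograd.grad(loss, x_in)[0]
        # Standard DDPM mean step, nudged against the guidance gradient.
        mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        mean = mean - scale * grad
        x_t = mean + (torch.sqrt(betas[t]) * torch.randn_like(x_t)
                      if t > 0 else 0.0)
    return x_t

motion = guided_sample(DummyDenoiser(), target_phase=torch.zeros(1))
print(motion.shape)  # torch.Size([1, 60, 66])

In the actual method, the penetration term would query the scene geometry and the phase term would compare against features from the periodic autoencoder; the sketch only conveys how such losses can shape the sampling trajectory without any lower-body tracking signal.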