MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model
arXiv (2024)
Abstract
The body movements accompanying speech aid speakers in expressing their
ideas. Co-speech motion generation is one of the important approaches for
synthesizing realistic avatars. Due to the intricate correspondence between
speech and motion, generating realistic and diverse motion is a challenging
task. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion
generation framework based on the diffusion model to ensure both the
authenticity and diversity of generated motion. We propose a progressive fusion
strategy to enhance inter-modal and intra-modal interaction and efficiently
integrate multi-modal information. Specifically, we employ a masked style
matrix based on emotion and identity information to control the generation of
different motion styles. Temporal modeling of speech and motion is partitioned
into style-guided specific feature encoding and shared feature encoding, aiming
to learn both inter-modal and intra-modal features. In addition, we propose a
geometric loss that enforces coherence of joint velocities and accelerations
across frames. Our framework generates vivid, diverse, and style-controllable
motion of arbitrary length by taking speech as input and editing identity and
emotion. Extensive experiments demonstrate that our method outperforms current
co-speech motion generation methods on both the upper body and the more
challenging full body.
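The abstract does not spell out the geometric loss, so the following is a minimal sketch, assuming the common formulation in which velocity and acceleration are taken as first- and second-order frame differences of the joint positions and matched against the ground truth with an L1 penalty. The function name `geometric_loss`, the weights `w_vel`/`w_acc`, and the tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch

def geometric_loss(pred, target, w_vel=1.0, w_acc=1.0):
    """Hypothetical sketch of a velocity/acceleration coherence loss.

    pred, target: (batch, frames, joints, 3) joint positions.
    Velocity is the first-order frame difference, acceleration the
    second-order difference; both are matched between the prediction
    and the ground truth with an L1 penalty.
    """
    # First-order differences along the time axis -> per-frame velocities.
    vel_pred = pred[:, 1:] - pred[:, :-1]
    vel_tgt = target[:, 1:] - target[:, :-1]

    # Second-order differences -> per-frame accelerations.
    acc_pred = vel_pred[:, 1:] - vel_pred[:, :-1]
    acc_tgt = vel_tgt[:, 1:] - vel_tgt[:, :-1]

    loss_vel = torch.mean(torch.abs(vel_pred - vel_tgt))
    loss_acc = torch.mean(torch.abs(acc_pred - acc_tgt))
    return w_vel * loss_vel + w_acc * loss_acc


# Toy usage: 2 sequences, 60 frames, 55 joints, 3D positions.
pred = torch.randn(2, 60, 55, 3)
target = torch.randn(2, 60, 55, 3)
print(geometric_loss(pred, target))
```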