A Motion Matching-based Framework for Controllable Gesture Synthesis from Speech

International Conference on Computer Graphics and Interactive Techniques (2022)

ABSTRACT

Recent deep learning-based approaches have shown promising results for synthesizing plausible 3D human gestures from speech input. However, these approaches typically offer limited freedom to incorporate user control. Furthermore, training such models in a supervised manner often does not capture the multi-modal nature of the data, particularly because the same audio input can produce different gesture outputs. To address these problems, we present an approach for generating controllable 3D gestures that combines the advantages of database matching and deep generative modeling. Our method predicts 3D body motion by sequentially searching for the most plausible audio-gesture clips from a database using a k-Nearest Neighbors (k-NN) algorithm that considers the similarity to both the input audio and the previous body pose information. To further improve the synthesis quality, we propose a conditional Generative Adversarial Network (cGAN) model that provides a data-driven refinement of the k-NN result by comparing its plausibility against ground-truth audio-gesture pairs. Our novel approach enables direct and more varied control manipulation that is not possible with prior learning-based counterparts. Our experiments show that the proposed approach outperforms recent models on control-based synthesis tasks using high-level signals such as motion statistics, while enabling flexible and effective user control for lower-level signals.
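The core matching step described above, sequentially selecting database clips by similarity to both the input audio and the previous body pose, can be sketched as a simple k-NN search over a combined cost. This is a minimal illustration, not the paper's implementation: the feature representations, distance metric, and the weight `w_pose` are all assumptions, and the function name `knn_motion_match` is hypothetical.

```python
import numpy as np

def knn_motion_match(audio_feat, prev_pose, db_audio, db_init_pose,
                     k=8, w_pose=0.5):
    """Rank database clips by a weighted sum of audio-feature distance
    and pose-continuity distance, and return the k best indices.

    Hypothetical sketch; feature extraction and weighting are assumed.
    """
    # Distance from the query audio features to each clip's audio features.
    d_audio = np.linalg.norm(db_audio - audio_feat, axis=1)
    # Distance from the last synthesized pose to each clip's starting pose,
    # encouraging smooth transitions between consecutive clips.
    d_pose = np.linalg.norm(db_init_pose - prev_pose, axis=1)
    # Combined matching cost; w_pose trades continuity against audio fit.
    cost = (1.0 - w_pose) * d_audio + w_pose * d_pose
    return np.argsort(cost)[:k]
```

In a full pipeline, one of the k returned clips would be appended to the output motion (the paper's cGAN then refines the result), and its final pose would become `prev_pose` for the next search step.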