Discrete Diffusion for Co-Speech Gesture Synthesis

Ankur Chemburkar, Shuhong Lu, Andrew Feng

ICMI '23 Companion: Companion Publication of the 25th International Conference on Multimodal Interaction (2023)

Abstract
In this paper, we describe the gesture synthesis system we developed for our entry to the GENEA Challenge 2023. One challenge in learning a co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance. Therefore, compared to a deterministic regression model, a probabilistic model is preferred to handle this one-to-many mapping problem. Our system utilizes the vector-quantized variational autoencoder (VQ-VAE) and discrete diffusion as the framework for predicting co-speech gestures. Since the gesture motions are produced by sampling discrete gesture tokens through the discrete diffusion process, the method can produce diverse gestures given the same speech input. Based on the user evaluation results, we further discuss the strengths and limitations of our system, and provide the lessons learned while developing and tuning it. The subjective evaluation results show that our method ranks in the middle for human-likeness among all submitted entries. In the speech appropriateness evaluations, our method achieved preference scores of 55.4% for matched agent gestures and 51.1% for matched interlocutor gestures. Overall, we demonstrated the potential of discrete diffusion models in gesture generation.
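The abstract outlines a two-stage design: a VQ-VAE that quantizes motion into discrete gesture tokens, and a discrete diffusion model that samples those tokens conditioned on speech before decoding them back into motion. The sketch below illustrates one common formulation of the sampling stage, an absorbing-state (mask-and-replace) reverse process over codebook indices. It is not the authors' implementation; all module names, shapes, and hyperparameters (TokenDenoiser, CODEBOOK_SIZE, NUM_STEPS, etc.) are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): sampling discrete
# VQ-VAE gesture tokens with a mask-and-replace discrete diffusion reverse
# process conditioned on speech features.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 512          # assumed VQ-VAE codebook size
MASK_ID = CODEBOOK_SIZE      # extra "masked" token used by the forward process
SEQ_LEN = 64                 # assumed number of gesture tokens per clip
SPEECH_DIM = 128             # assumed dimensionality of speech features
NUM_STEPS = 20               # assumed number of reverse diffusion steps


class TokenDenoiser(nn.Module):
    """Toy stand-in for the denoising network: predicts logits over the
    codebook for every (possibly masked) gesture token, given speech."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(CODEBOOK_SIZE + 1, 256)  # +1 for MASK_ID
        self.speech_proj = nn.Linear(SPEECH_DIM, 256)
        self.head = nn.Linear(256, CODEBOOK_SIZE)

    def forward(self, tokens, speech):
        h = self.token_emb(tokens) + self.speech_proj(speech)
        return self.head(h)  # (batch, seq, codebook) logits


@torch.no_grad()
def sample_gesture_tokens(denoiser, speech):
    """Reverse process: start fully masked, then progressively unmask
    positions using tokens drawn from the denoiser's predicted categorical
    distributions. Returns token ids for the VQ-VAE decoder."""
    batch = speech.shape[0]
    tokens = torch.full((batch, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(NUM_STEPS):
        logits = denoiser(tokens, speech)
        probs = F.softmax(logits, dim=-1)
        sampled = torch.multinomial(
            probs.view(-1, CODEBOOK_SIZE), 1).view(batch, SEQ_LEN)
        # Unmask a growing fraction of still-masked positions each step.
        keep_frac = (step + 1) / NUM_STEPS
        newly_unmasked = (tokens == MASK_ID) & (torch.rand(batch, SEQ_LEN) < keep_frac)
        tokens = torch.where(newly_unmasked, sampled, tokens)
    # Fill any position that remained masked with the last sampled draw.
    tokens = torch.where(tokens == MASK_ID, sampled, tokens)
    return tokens


# Usage: repeated calls with the same speech features yield different token
# sequences, reflecting the one-to-many speech-to-gesture mapping.
speech_features = torch.randn(1, SEQ_LEN, SPEECH_DIM)
gesture_tokens = sample_gesture_tokens(TokenDenoiser(), speech_features)
```

Because each reverse step draws tokens from a categorical distribution, repeated sampling for the same speech input produces different, plausible gesture sequences, which is the property the abstract highlights for handling the one-to-many mapping.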