PhISANet: Phonetically Informed Speech Animation Network

Salvador Medina, Sarah L. Taylor, Carsten Stoll, Gareth Edwards, Alex Hauptmann, Shinji Watanabe, Iain Matthews

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
Realistic animation is crucial for immersive and seamless human-avatar interactions as digital avatars become more prevalent. This work presents PhISANet, an encoder-decoder model that realistically animates the face and tongue solely from speech. PhISANet leverages neural audio representations trained on vast amounts of speech to map the speech signal into animation parameters that control the lower face and tongue of realistic 3D models. By integrating a novel multi-task learning strategy during training, PhISANet reincorporates the phonetic information from the input speech, improving articulation in the generated animations. A thorough quantitative and qualitative study validates this improvement and determines that WavLM and Whisper features are ideal for training a speech-animation model that generalizes across speaker gender, age, and language.
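The abstract does not give implementation details, but the objective it describes (regressing animation parameters while recovering the phonetic content of the speech via an auxiliary CTC task) can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the module names, layer sizes, and loss weight `lam` are hypothetical and not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnimationDecoder(nn.Module):
    """Shared decoder with two heads: animation-parameter regression
    and phoneme logits for a CTC auxiliary task (illustrative only)."""
    def __init__(self, feat_dim=768, n_anim=64, n_phones=40):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True)
        self.anim_head = nn.Linear(256, n_anim)       # lower-face/tongue params
        self.ctc_head = nn.Linear(256, n_phones + 1)  # +1 for the CTC blank

    def forward(self, feats):
        # feats: (B, T, feat_dim), e.g. frame-level WavLM or Whisper features
        h, _ = self.rnn(feats)
        return self.anim_head(h), F.log_softmax(self.ctc_head(h), dim=-1)

def multitask_loss(anim_pred, anim_gt, log_probs, phones,
                   input_lens, target_lens, lam=0.3):
    """Regression loss plus a CTC term that reinjects phonetic
    information; `lam` is a hypothetical weight, not from the paper."""
    reg = F.mse_loss(anim_pred, anim_gt)
    # F.ctc_loss expects (T, B, C) log-probabilities
    ctc = F.ctc_loss(log_probs.transpose(0, 1), phones, input_lens, target_lens)
    return reg + lam * ctc

# Dummy usage: two utterances of 100 feature frames each
model = AnimationDecoder()
feats = torch.randn(2, 100, 768)
anim_pred, log_probs = model(feats)
loss = multitask_loss(anim_pred, torch.randn(2, 100, 64), log_probs,
                      torch.randint(1, 41, (2, 30)),  # dummy phoneme targets
                      torch.full((2,), 100), torch.full((2,), 30))
```

The design point the abstract emphasizes is that the CTC head is auxiliary: it is only used during training to force the shared representation to retain phonetic detail, which in turn improves the articulation of the regressed face and tongue parameters.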
Keywords
Speech Animation,Multi-task Learning,CTC,Tongue,EMA