Multimodal Speaker Adaptation of Acoustic Model and Language Model for ASR Using Speaker Face Embedding

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)(2019)

Abstract
We present an investigation into the adaptation of the acoustic model and the language model for automatic speech recognition (ASR) using speaker face embeddings, applied to the transcription of a multimedia dataset. We begin by reviewing relevant previous work on the integration of visual signals into ASR systems. Our experimental investigation shows a small improvement in word error rate (WER) for the transcription of a collection of instruction videos when the acoustic model and the language model are adapted with fixed-length face embedding vectors. We also present potential approaches to integrating human facial information and body gestures into ASR as further directions for research on this topic.
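The abstract describes adapting the acoustic model with fixed-length face embedding vectors. A common way to do this is to append the speaker's face embedding to every acoustic frame before feeding the acoustic model. The sketch below illustrates that idea under stated assumptions; the function name `adapt_features`, the feature dimensions, and the use of NumPy are illustrative, not taken from the paper.

```python
import numpy as np

def adapt_features(acoustic_feats, face_embedding):
    """Append a fixed-length speaker face embedding to every acoustic frame.

    acoustic_feats: (num_frames, feat_dim) array, e.g. filterbank features.
    face_embedding: (embed_dim,) fixed-length vector from a face recognizer.
    Returns a (num_frames, feat_dim + embed_dim) array that an acoustic
    model could consume as speaker-adapted input.
    """
    num_frames = acoustic_feats.shape[0]
    # Repeat the utterance-level face embedding once per frame.
    tiled = np.tile(face_embedding, (num_frames, 1))
    return np.concatenate([acoustic_feats, tiled], axis=1)

# Toy example: 100 frames of 40-dim filterbanks, 128-dim face embedding.
feats = np.random.randn(100, 40)
face_vec = np.random.randn(128)
adapted = adapt_features(feats, face_vec)
print(adapted.shape)  # (100, 168)
```

Because the embedding is constant over the utterance, this is analogous to i-vector-based speaker adaptation, with the visual embedding standing in for the acoustic speaker vector.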
Keywords
multimodal speech recognition, face embedding, adaptation