Automatic Speech Recognition and Natural Language Understanding for Emotion Detection in Multi-party Conversations

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Abstract
Conversational emotion and sentiment analysis approaches rely on Natural Language Understanding (NLU) and audio processing components to detect emotions and sentiment based on what is being said. While there has been marked progress in pushing the state of the art of these methods on benchmark multimodal datasets, such as the Multimodal EmotionLines Dataset (MELD), the advances still seem to lag behind what has been achieved in mainstream Automatic Speech Recognition (ASR) and NLU applications, and we were unable to identify any widely used products, services, or production-ready systems that would enable users to reliably detect emotions from audio recordings of multi-party conversations. Published state-of-the-art scientific studies of multi-view emotion recognition seem to take it for granted that a human-generated or edited transcript is available as input to the NLU modules, providing no information about what happens in a realistic application scenario, where only audio is available and the NLU processing has to rely on text generated by ASR. Motivated by this insight, we present a study designed to evaluate the feasibility of applying widely used, state-of-the-art commercial ASR products as the initial audio processing component in an emotion-from-speech detection system. We propose an approach that relies on commercially available products and services, such as Google Speech-to-Text, Mozilla DeepSpeech, and the NVIDIA NeMo toolkit, to process the audio, and applies state-of-the-art NLU approaches for emotion recognition, in order to quickly create a robust, production-ready emotion-from-speech detection system applicable to multi-party conversations.
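
To make the proposed two-stage design concrete, the sketch below wires a commercial ASR service into an off-the-shelf text-based emotion classifier. This is a minimal illustration under stated assumptions, not the authors' actual system: it assumes Google Cloud Speech-to-Text (one of the products named in the abstract) via the google-cloud-speech Python client, and a Hugging Face text-classification model (j-hartmann/emotion-english-distilroberta-base) as an illustrative stand-in for the paper's NLU emotion component. The input file path and the model choice are assumptions for demonstration only.

    # Minimal ASR -> NLU emotion pipeline sketch (illustrative, not the paper's system).
    # Assumes: `pip install google-cloud-speech transformers`, Google Cloud
    # credentials configured, and a mono 16 kHz LINEAR16 WAV input file.
    from google.cloud import speech
    from transformers import pipeline

    def transcribe(audio_path: str) -> str:
        """Stage 1: transcribe an audio file with Google Cloud Speech-to-Text."""
        client = speech.SpeechClient()
        with open(audio_path, "rb") as f:
            audio = speech.RecognitionAudio(content=f.read())
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        )
        response = client.recognize(config=config, audio=audio)
        # Concatenate the top hypothesis of each recognized segment.
        return " ".join(r.alternatives[0].transcript for r in response.results)

    # Stage 2: text-based emotion recognition over the (possibly noisy) ASR output.
    # The model name is an assumed example, not the NLU component evaluated in the paper.
    emotion_classifier = pipeline(
        "text-classification",
        model="j-hartmann/emotion-english-distilroberta-base",
    )

    def emotion_from_speech(audio_path: str) -> dict:
        """End-to-end: audio in, predicted emotion label (with score) out."""
        transcript = transcribe(audio_path)
        prediction = emotion_classifier(transcript)[0]  # e.g. {"label": "joy", "score": 0.93}
        return {"transcript": transcript, **prediction}

    if __name__ == "__main__":
        print(emotion_from_speech("utterance.wav"))  # hypothetical input file

Because stage 2 consumes whatever text stage 1 produces, ASR errors propagate directly into the emotion classifier, which is precisely the realistic condition the study evaluates.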