Improving Speech Emotion Recognition via Fine-tuning ASR with Speaker Information

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Abstract
Speech emotion recognition (SER), which derives emotion labels (happiness, anger, sadness, and the neutral state) from speech inputs, is an important component of human-computer interaction (HCI) interfaces. However, two issues limit the capability of SER. First, the small scale of currently available datasets is insufficient for training and reliably evaluating advanced neural models. Second, SER performance is affected by linguistic, phonetic, and speaker information, yet previous works focused on only one of these factors or largely omitted speaker information from SER systems, even though human emotion is highly individualistic. To address these issues, we propose to fine-tune an automatic speech recognition (ASR) model based on Conformer, a recent advanced deep learning architecture pretrained on large-scale datasets, for the downstream task of SER. This lets our SER model exploit the many available ASR datasets to capture the phonetics and linguistics of each speech utterance. In addition, we propose segment-wise concatenation of a speaker embedding with statistically pooled ASR embeddings to simultaneously encode phonetic, linguistic, and speaker information in a single speech segment representation. Experimental results demonstrate that our proposed model achieves state-of-the-art (SoTA) performance on the benchmark interactive emotional dyadic motion capture (IEMOCAP) corpus.
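The fusion step described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration only, assuming mean-and-standard-deviation statistics pooling over frame-level Conformer ASR outputs and a fixed utterance-level speaker embedding (e.g., from a speaker-verification model); the module name `SERHead`, all dimensions, and the classifier head are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class SERHead(nn.Module):
    """Sketch of the fusion in the abstract: statistics-pool frame-level
    ASR (Conformer) embeddings, concatenate a speaker embedding, classify.
    All names and sizes here are illustrative assumptions."""

    def __init__(self, asr_dim=256, spk_dim=192, num_emotions=4):
        super().__init__()
        # 4 emotion classes per the abstract: happiness, anger, sadness, neutral
        self.classifier = nn.Sequential(
            nn.Linear(2 * asr_dim + spk_dim, 256),  # mean+std pooling doubles asr_dim
            nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, asr_frames, spk_emb):
        # asr_frames: (batch, time, asr_dim) frame-level Conformer outputs
        # spk_emb:    (batch, spk_dim) utterance-level speaker embedding
        mean = asr_frames.mean(dim=1)
        std = asr_frames.std(dim=1)
        pooled = torch.cat([mean, std], dim=-1)       # statistics pooling
        fused = torch.cat([pooled, spk_emb], dim=-1)  # segment-wise concatenation
        return self.classifier(fused)

# Usage with random tensors standing in for real encoder outputs:
head = SERHead()
logits = head(torch.randn(8, 120, 256), torch.randn(8, 192))
print(logits.shape)  # torch.Size([8, 4])
```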
Keywords
Transfer Learning, Speech Emotion Recognition, Speech Recognition, Speaker Verification, Conformer