Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition

IEEE SIGNAL PROCESSING LETTERS (2022)

Abstract
Speech emotion recognition that combines linguistic content and audio signals in dialog is a challenging task. However, previous approaches have failed to explore emotion cues in contextual interactions and have ignored the long-range dependencies between elements from different modalities. To tackle these issues, this letter proposes a multimodal speech emotion recognition method using audio and text data. We first present a contextual transformer module that introduces contextual information by embedding the previous utterances between interlocutors, which enhances the emotion representation of the current utterance. Then, the proposed cross-modal transformer module focuses on the interactions between the text and audio modalities, adaptively promoting fusion from one modality to the other. Furthermore, we construct an associative topological relation over each mini-batch and learn the association between deep fused features with a graph convolutional network. Experimental results on the IEMOCAP and MELD datasets show that our method outperforms current state-of-the-art methods.
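As a rough illustration of the cross-modal interaction step described in the abstract, the sketch below uses bidirectional multi-head attention so that text features query audio features and vice versa before fusion. The module names, dimensions, and PyTorch framing are assumptions for illustration only, not the authors' released implementation.

```python
# Minimal sketch of bidirectional cross-modal attention between text and
# audio feature sequences (hypothetical layer names and sizes).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Text queries attend over audio keys/values, and vice versa.
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # text:  (batch, text_len, dim)   utterance token features
        # audio: (batch, audio_len, dim)  acoustic frame features
        t_fused, _ = self.text_to_audio(query=text, key=audio, value=audio)
        a_fused, _ = self.audio_to_text(query=audio, key=text, value=text)
        # Residual connection plus layer norm on each fused stream.
        return self.norm_t(text + t_fused), self.norm_a(audio + a_fused)

# Example: fuse a 10-token text sequence with 50 audio frames for a batch of 2.
text = torch.randn(2, 10, 256)
audio = torch.randn(2, 50, 256)
fused_text, fused_audio = CrossModalBlock()(text, audio)
```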
Keywords
Transformers, Emotion recognition, Convolution, Acoustics, Speech recognition, Stacking, Pipelines, Contextual interaction, cross-modal interaction, graph convolutional network, speech emotion recognition