Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition

IEEE SIGNAL PROCESSING LETTERS (2022)

Abstract
Speech emotion recognition that combines linguistic content and audio signals in dialog is a challenging task. However, previous approaches have failed to explore emotion cues in contextual interactions and have ignored the long-range dependencies between elements from different modalities. To tackle these issues, this letter proposes a multimodal speech emotion recognition method using audio and text data. We first present a contextual transformer module that introduces contextual information by embedding the previous utterances between interlocutors, which enhances the emotion representation of the current utterance. Then, the proposed cross-modal transformer module focuses on the interactions between the text and audio modalities, adaptively promoting fusion from one modality to the other. Furthermore, we construct an associative topological relation over each mini-batch and learn the association between deep fused features with a graph convolutional network. Experimental results on the IEMOCAP and MELD datasets show that our method outperforms current state-of-the-art methods.
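As a rough illustration of the cross-modal interaction step described in the abstract, the sketch below uses bidirectional multi-head attention so that text features query audio features and vice versa before fusion. The module names, dimensions, and PyTorch framing are assumptions for illustration only, not the authors' released implementation.

```python
# Minimal sketch of bidirectional cross-modal attention between text and
# audio feature sequences (hypothetical layer names and sizes).
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Text queries attend over audio keys/values, and vice versa.
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # text:  (batch, text_len, dim)   utterance token features
        # audio: (batch, audio_len, dim)  acoustic frame features
        t_fused, _ = self.text_to_audio(query=text, key=audio, value=audio)
        a_fused, _ = self.audio_to_text(query=audio, key=text, value=text)
        # Residual connection plus layer norm on each fused stream.
        return self.norm_t(text + t_fused), self.norm_a(audio + a_fused)

# Example: fuse a 10-token text sequence with 50 audio frames for a batch of 2.
text = torch.randn(2, 10, 256)
audio = torch.randn(2, 50, 256)
fused_text, fused_audio = CrossModalBlock()(text, audio)
```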
Keywords
Transformers, Emotion recognition, Convolution, Acoustics, Speech recognition, Stacking, Pipelines, Contextual interaction, cross-modal interaction, graph convolutional network, speech emotion recognition