Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition.

Interspeech (2021)

Abstract
Various studies have confirmed the necessity and benefits of leveraging multimodal features for speech emotion recognition (SER), and recent results show that the temporal information captured by the transformer is highly useful for improving multimodal speech emotion recognition. However, the dependency between different modalities and high-level temporal-feature learning with a deeper transformer have yet to be investigated. Thus, we propose a multimodal transformer with shared weights for speech emotion recognition. The proposed network shares weights across the modalities in each transformer layer to learn the correlation among multiple modalities. In addition, since the emotion conveyed in speech generally involves both audio and text features, which exhibit not only internal dependence but also mutual dependence, we design a deep multimodal attention mechanism to capture these two kinds of emotional dependence. We evaluate our model on the publicly available IEMOCAP dataset. The experimental results demonstrate that the proposed model yields promising results.
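To illustrate the weight-sharing idea described above, the following is a minimal PyTorch sketch (not the authors' implementation): the same transformer encoder layer is applied to both the audio and text streams at every depth, so the two modalities are processed by a common set of parameters. The layer sizes, pooling, and the concatenation-based fusion head are illustrative assumptions; the paper instead uses a deep multimodal attention mechanism for fusion.

```python
import torch
import torch.nn as nn


class SharedWeightMultimodalTransformer(nn.Module):
    """Sketch of a multimodal transformer that shares weights across modalities."""

    def __init__(self, d_model=256, nhead=4, num_layers=4, num_classes=4):
        super().__init__()
        # One stack of encoder layers reused for both modalities (weight sharing).
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        ])
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, T_audio, d_model); text_feats: (batch, T_text, d_model)
        a, t = audio_feats, text_feats
        for layer in self.layers:
            # Identical parameters process both streams at every layer.
            a = layer(a)
            t = layer(t)
        # Mean-pool each stream and fuse by concatenation (a simplification;
        # the paper describes a multimodal attention mechanism instead).
        fused = torch.cat([a.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = SharedWeightMultimodalTransformer()
    audio = torch.randn(2, 100, 256)  # e.g. frame-level acoustic features
    text = torch.randn(2, 30, 256)    # e.g. token-level text embeddings
    logits = model(audio, text)
    print(logits.shape)  # torch.Size([2, 4])
```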
Keywords
Speech emotion recognition, Transformer, Sharing weights, Multimodal attention