DBT: multimodal emotion recognition based on dual-branch transformer

Yufan Yi, Yan Tian, Cong He, Yajing Fan, Xinli Hu, Yiping Xu

The Journal of Supercomputing (2022)

Labeled datasets for speech emotion recognition are scarce because emotion is subjective and expert annotators need considerable time to assign emotion categories. The wav2vec2.0 model, in contrast, is a general-purpose model that learns speech representations through self-supervised training, so we apply it to the speech emotion recognition task. We propose a multimodal dual-branch transformer network (DBT). In the speech branch, we first extract speech features with wav2vec2.0, using a fine-tuning strategy and a self-attention-based interlayer feature fusion strategy; a fully convolutional classification network then performs emotion classification. In the text branch, we use RoBERTa for text emotion recognition. The two modalities are fused with an improved weighted Dempster–Shafer (DS) strategy. In addition, we propose an accuracy-weighted label smoothing method that improves recognition accuracy. We perform comprehensive experiments on two benchmarks, IEMOCAP (English) and CASIA (Chinese). The experimental results show that the proposed method achieves higher accuracy than state-of-the-art methods.
wav2vec2.0, Model fine-tuning, Adaptive interlayer fusion, Weighted label smoothing, Weighted DS strategy
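
The abstract does not spell out how the improved weighted DS fusion is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each branch outputs class probabilities, applies classic Shafer discounting with a per-branch reliability weight, and then combines the two branches with the standard Dempster rule. The function names and the weights `w_speech` and `w_text` are hypothetical.

```python
def discount(probs, weight):
    """Shafer discounting: scale each class mass by `weight` and assign
    the remaining mass to the whole frame Theta (i.e. "unknown")."""
    singles = [weight * p for p in probs]
    theta = 1.0 - sum(singles)
    return singles, theta


def ds_combine(m1, m2):
    """Dempster's rule for mass functions whose focal sets are the
    single classes plus the whole frame Theta."""
    s1, t1 = m1
    s2, t2 = m2
    # Mass landing on class c: both agree on c, or one says c and the other Theta.
    raw = [a * b + a * t2 + t1 * b for a, b in zip(s1, s2)]
    raw_theta = t1 * t2
    # Everything else is conflict, so this total equals 1 - K.
    total = sum(raw) + raw_theta
    if total <= 0.0:
        raise ValueError("sources are in total conflict")
    return [r / total for r in raw], raw_theta / total


def fuse(speech_probs, text_probs, w_speech, w_text):
    """Return (predicted class index, fused class masses, mass left on Theta).
    The weights are assumed inputs, e.g. validation accuracy of each branch."""
    singles, theta = ds_combine(
        discount(speech_probs, w_speech),
        discount(text_probs, w_text),
    )
    label = max(range(len(singles)), key=singles.__getitem__)
    return label, singles, theta


# Hypothetical softmax outputs over three emotion classes.
label, masses, unknown = fuse([0.6, 0.3, 0.1], [0.2, 0.7, 0.1], 0.7, 0.9)
```

In this sketch a lower weight moves more of a branch's mass to "unknown", so the more reliable branch dominates the fused decision when the two disagree.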