Speech Emotion Recognition Model Based on Attention CNN Bi-GRU Fusing Visual Information

Engineering Letters (2022)

Abstract
Interference such as data redundancy and irrelevant features often lowers the recognition accuracy of emotion recognition models. In this paper, we propose a speech emotion recognition (SER) method based on an attentional convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-GRU) that fuses visual information. First, a ResNet-based attentional convolutional neural network (RACNN) is pretrained on log-mel spectrograms to extract speech features. Second, static facial appearance features extracted by a CNN are fused with the speech features using a deep Bi-GRU to obtain speech appearance features. A series of gated recurrent units with attention mechanisms (AGRUs) is used to extract facial geometric features. Then, hybrid features are obtained by combining the integrated speech appearance features with the facial geometric features, and kernel linear discriminant analysis (KLDA) is used to discriminate them. The proposed method achieves accuracies of 87.92% and 89.65% on the RAVDESS and eNTERFACE'05 emotion databases, respectively. The experimental results indicate that the method improves the accuracy and robustness of SER.
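The fusion step described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the feature sizes, the random untrained weights, the per-frame concatenation of speech and facial features, and the attention pooling over the Bi-GRU outputs are all assumptions made for illustration, and the random arrays only stand in for RACNN and CNN outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Random, untrained GRU parameters (illustrative only)."""
    return {
        "W": rng.normal(0.0, 0.1, (3, hidden_dim, input_dim)),   # input weights: z, r, n
        "U": rng.normal(0.0, 0.1, (3, hidden_dim, hidden_dim)),  # recurrent weights: z, r, n
        "b": np.zeros((3, hidden_dim)),
    }

def gru_forward(x_seq, p):
    """Run a GRU over x_seq of shape (T, input_dim); return states of shape (T, hidden_dim)."""
    h = np.zeros(p["b"].shape[1])
    states = []
    for x in x_seq:
        z = sigmoid(p["W"][0] @ x + p["U"][0] @ h + p["b"][0])        # update gate
        r = sigmoid(p["W"][1] @ x + p["U"][1] @ h + p["b"][1])        # reset gate
        n = np.tanh(p["W"][2] @ x + p["U"][2] @ (r * h) + p["b"][2])  # candidate state
        h = (1.0 - z) * n + z * h
        states.append(h)
    return np.stack(states)

def bi_gru(x_seq, p_fwd, p_bwd):
    """Concatenate a forward pass and a time-reversed backward pass per frame."""
    fwd = gru_forward(x_seq, p_fwd)
    bwd = gru_forward(x_seq[::-1], p_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

def attention_pool(h_seq, w):
    """Softmax-weighted sum of frame states; w is a hypothetical scoring vector."""
    scores = h_seq @ w
    scores = scores - scores.max()  # subtract the max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ h_seq, alpha

rng = np.random.default_rng(0)
T, speech_dim, face_dim, hidden = 20, 64, 32, 16  # hypothetical sizes

speech_feats = rng.normal(size=(T, speech_dim))  # stand-in for RACNN speech features
face_feats = rng.normal(size=(T, face_dim))      # stand-in for CNN facial appearance features

# Assumed fusion scheme: concatenate the two modalities frame by frame.
fused_in = np.concatenate([speech_feats, face_feats], axis=1)

p_fwd = init_gru(speech_dim + face_dim, hidden, rng)
p_bwd = init_gru(speech_dim + face_dim, hidden, rng)
h_seq = bi_gru(fused_in, p_fwd, p_bwd)

attn_w = rng.normal(size=2 * hidden)
utterance_vec, alpha = attention_pool(h_seq, attn_w)
print(h_seq.shape, utterance_vec.shape)
```

In a full pipeline, a fixed-length vector such as `utterance_vec` would be combined with the facial geometric features before the discriminant step; that part is omitted here because the abstract does not specify it in enough detail.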
Keywords
SER, visual information, Bi-GRU, AGRUs, KLDA