A Time-Frequency Attention Mechanism with Subsidiary Information for Effective Speech Emotion Recognition

Man-Machine Speech Communication (2023)

Abstract
Speech emotion recognition (SER) is the task of automatically identifying human emotions from the analysis of utterances. In practice, the task is often affected by subsidiary information such as speaker or phoneme identity. Traditional domain adaptation approaches are commonly applied to remove this unwanted domain-specific knowledge, but they unavoidably discard useful categorical information as well. In this paper, we propose a time-frequency attention mechanism based on multi-task learning (MTL). It uses the content of the feature map itself to compute self-attention over the time and channel dimensions, while the attention weights over the frequency dimension are derived from domain information extracted by the MTL branch. We conduct extensive evaluations on the IEMOCAP benchmark to assess the effectiveness of the proposed representation. Results show a recognition performance of 73.24% weighted accuracy (WA) and 73.18% unweighted accuracy (UA) over four emotions, outperforming the baseline by about 4%.
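To make the described mechanism concrete, the following is a minimal PyTorch sketch of a time-frequency attention block in the spirit of the abstract: channel and time weights are computed from the feature map itself, while frequency weights are conditioned on an auxiliary embedding (e.g. a speaker or phoneme representation from the MTL head). All layer sizes, shapes, and module names here are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class TimeFrequencyAttention(nn.Module):
    """Sketch of a time-frequency attention block (assumed design, not the paper's exact one)."""

    def __init__(self, channels: int, freq_bins: int, aux_dim: int):
        super().__init__()
        # Channel attention from a globally pooled descriptor (squeeze-and-excite style).
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 2),
            nn.ReLU(),
            nn.Linear(channels // 2, channels),
            nn.Sigmoid(),
        )
        # Time attention from a 1-D convolution over the frame axis.
        self.time_conv = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Frequency attention conditioned on the auxiliary (MTL) embedding.
        self.freq_fc = nn.Sequential(
            nn.Linear(aux_dim, freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # x:   (batch, channels, time, freq) feature map from a CNN front end
        # aux: (batch, aux_dim) subsidiary embedding from the auxiliary task
        b, c, t, f = x.shape

        # Channel weights -> (batch, channels, 1, 1)
        ch_w = self.channel_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)

        # Time weights: pool over frequency, convolve over frames -> (batch, 1, time, 1)
        time_w = self.time_conv(x.mean(dim=3)).view(b, 1, t, 1)

        # Frequency weights from the auxiliary embedding -> (batch, 1, 1, freq)
        freq_w = self.freq_fc(aux).view(b, 1, 1, f)

        return x * ch_w * time_w * freq_w


if __name__ == "__main__":
    block = TimeFrequencyAttention(channels=64, freq_bins=40, aux_dim=128)
    feats = torch.randn(8, 64, 100, 40)   # e.g. log-Mel feature maps
    aux_emb = torch.randn(8, 128)         # e.g. speaker embedding from the MTL head
    print(block(feats, aux_emb).shape)    # torch.Size([8, 64, 100, 40])
```

In this sketch the three attention maps are applied multiplicatively to the feature map; the actual paper may combine them differently or use other pooling and projection choices.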
Keywords
speech emotion recognition, convolutional neural network, multi-task learning, attention mechanism