Graph based emotion recognition with attention pooling for variable-length utterances

Neurocomputing (2022)

Abstract
Previous speech emotion recognition (SER) methods typically handle variable-length utterance inputs by padding shorter utterances or clipping longer ones to a common length, which may introduce invalid information or discard useful emotional segments. To address this issue, in this paper we cast the SER problem as a graph classification task, transforming variable-length utterances into graphs so that neither padding nor cutting is needed. In our approach, the frames (short windowed segments) of an utterance are represented as nodes in a graph. Acoustic features extracted from the frames are treated as node feature vectors, and nodes are connected according to their temporal relationship. Different graph convolutional networks (GCNs) are explored for node/frame embedding learning, and several graph pooling methods are compared for obtaining a graph/utterance-level emotional representation from the node embeddings. Extensive experiments with different GCN components and pooling mechanisms are conducted on the IEMOCAP and MSP-IMPRO datasets. The experimental results show that a combination of GraphSAGE with multi-head attention pooling (MHAPool) achieves the best weighted accuracy (WA) and comparable unweighted accuracy (UA) on both datasets compared with other state-of-the-art SER models, which demonstrates the effectiveness of the proposed graph-based network for the SER task.
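The pipeline described above (frames as nodes, temporal edges, node embedding, attention pooling) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the exact connectivity, layer sizes, or the form of MHAPool, so the chain graph (each frame linked to its immediate neighbours), the single GraphSAGE-style mean-aggregation layer, the query-vector attention heads, and all function and parameter names below are assumptions chosen for illustration.

```python
import numpy as np

def frames_to_graph(num_frames):
    """Adjacency of a chain graph: frame t is linked to frames t-1 and t+1.
    (Assumed connectivity; the abstract only says 'temporal relationship'.)"""
    adj = np.zeros((num_frames, num_frames))
    idx = np.arange(num_frames - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return adj

def sage_mean_layer(x, adj, w_self, w_neigh):
    """GraphSAGE-style layer: combine each node with the mean of its neighbours."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)  # avoid division by zero
    neigh_mean = (adj @ x) / deg
    return np.maximum(x @ w_self + neigh_mean @ w_neigh, 0.0)  # ReLU

def multi_head_attention_pool(h, queries):
    """Pool node embeddings h (N, D) into a fixed-size vector of length K * D.
    queries (K, D) holds one hypothetical learnable query vector per head."""
    scores = h @ queries.T                         # (N, K) score per node and head
    scores -= scores.max(axis=0, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over nodes, per head
    return (weights.T @ h).reshape(-1)             # concatenate the K head outputs

def embed_utterance(features, params):
    """Map (num_frames, feat_dim) frame features to one utterance-level vector."""
    adj = frames_to_graph(len(features))
    h = sage_mean_layer(features, adj, params["w_self"], params["w_neigh"])
    return multi_head_attention_pool(h, params["queries"])

# Random weights stand in for trained parameters (illustration only).
rng = np.random.default_rng(0)
feat_dim, hidden_dim, num_heads = 40, 16, 4
params = {
    "w_self": rng.normal(size=(feat_dim, hidden_dim)),
    "w_neigh": rng.normal(size=(feat_dim, hidden_dim)),
    "queries": rng.normal(size=(num_heads, hidden_dim)),
}
short_vec = embed_utterance(rng.normal(size=(50, feat_dim)), params)
long_vec = embed_utterance(rng.normal(size=(300, feat_dim)), params)
```

Under these assumptions, a 50-frame and a 300-frame utterance both map to a vector of length `num_heads * hidden_dim` without padding or clipping, because the pooling weights are normalised over however many nodes the graph contains. A classifier over the emotion classes would then operate on that fixed-size vector.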
Keywords
Speech emotion recognition, Graph convolutional network, Attention pooling