Learning coordinated emotion representation between voice and face

Applied Intelligence (2022)

Abstract
Voice and face are the two most important perceptual modalities for humans. In recent years, many researchers have shown great interest in learning cross-modal representations for different face-voice association tasks. However, existing methods focus on various biometric attributes and rarely take the emotion semantics shared between voice and face into account. In this paper, we present a novel two-stream model, called the Emotion Representation Learning Network (EmoRL-Net), which learns cross-modal coordinated emotion representations for various downstream matching and retrieval tasks. We first propose two sub-network architectures that learn unimodal features from the two modalities. We then train EmoRL-Net with an objective function that combines one explicit and two implicit constraints. Meanwhile, an online semi-hard negative mining strategy is used to construct triplet units within each mini-batch, thereby stabilizing and speeding up the learning process. Extensive experiments demonstrate that the proposed method benefits various face-voice emotion tasks, including cross-modal verification, 1:2 matching, 1:N matching, and retrieval. The experimental results also show that the proposed method outperforms state-of-the-art approaches.
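To make the training strategy concrete, the following is a minimal sketch of online semi-hard negative mining with a cross-modal triplet loss, the mechanism the abstract describes for constructing triplet units within a mini-batch. It assumes a PyTorch setting in which voice and face embeddings are paired row-wise; the function name, the margin value, and the fallback to the hardest negative are illustrative assumptions, and the sketch does not reproduce the paper's full objective (one explicit plus two implicit constraints).

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(voice_emb, face_emb, labels, margin=0.2):
    """Cross-modal triplet loss with online semi-hard negative mining (sketch).

    voice_emb, face_emb: (B, D) embeddings; row i of each tensor comes from
    the same sample, so face i is the positive for voice anchor i.
    labels: (B,) emotion class ids. A semi-hard negative is a face with a
    different label that lies farther than the positive but within the margin.
    """
    # Pairwise distances between every voice anchor and every face embedding.
    dist = torch.cdist(voice_emb, face_emb)      # (B, B)
    pos_dist = dist.diagonal()                   # d(anchor_i, positive_i)

    # Valid negatives: faces whose emotion label differs from the anchor's.
    neg_mask = labels.unsqueeze(1) != labels.unsqueeze(0)   # (B, B)

    # Semi-hard condition: pos_dist < d(a, n) < pos_dist + margin.
    semi_hard = neg_mask & (dist > pos_dist.unsqueeze(1)) \
                         & (dist < pos_dist.unsqueeze(1) + margin)

    # Keep semi-hard candidates; anchors with none fall back to the
    # hardest (closest) negative so every anchor contributes to the loss.
    inf = torch.full_like(dist, float('inf'))
    cand = torch.where(semi_hard, dist, inf)
    no_semi = torch.isinf(cand).all(dim=1)
    fallback = torch.where(neg_mask, dist, inf)
    cand[no_semi] = fallback[no_semi]
    neg_dist = cand.min(dim=1).values

    # Standard triplet hinge, averaged over the mini-batch.
    return F.relu(pos_dist - neg_dist + margin).mean()
```

Mining negatives inside the mini-batch, rather than offline over the whole dataset, avoids collapsed gradients from trivially easy negatives while skipping the hardest outliers, which is the usual rationale for the stabilization and speed-up the abstract reports.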
Keywords
Face-voice emotion relationship, Coordinated emotion representation, Cross-modal matching, Metric learning