Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition

Conference of the International Speech Communication Association (INTERSPEECH), 2022

Abstract
This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a promising approach. In conventional multi-modal modeling methods, a recognition model is trained on an audio-visual paired dataset solely to enhance audio-visual emotion recognition performance. However, such a model fails to estimate emotions from single-modal inputs, which indicates that it overfits to combinations of the individual modal features. Our supposition is that the ideal form of emotion recognition is a single model that accurately performs both audio-visual multi-modal processing and single-modal processing. This is expected to promote the utilization of individual modal knowledge and thereby improve audio-visual emotion recognition. Our proposed method therefore employs a cross-modal transformer model that can handle different types of inputs. In addition, we introduce a novel training method named interactive co-learning, which allows the model to learn from both modalities jointly and from either modality alone. Experiments on a multi-label emotion recognition task demonstrate the effectiveness of the proposed method.
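The abstract does not include an implementation, so the following is only a minimal sketch of how a cross-modal transformer with interactive co-learning could be set up in PyTorch. The class and function names (CrossModalTransformer, co_learning_step), the feature dimensions, the random modality-selection scheme, and the multi-label BCE loss are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed structure) of interactive co-learning for
# audio-visual emotion recognition with a single cross-modal model.
import random
import torch
import torch.nn as nn


class CrossModalTransformer(nn.Module):
    """Toy cross-modal encoder: when both modalities are present their token
    sequences are encoded jointly; a single modality is encoded on its own."""

    def __init__(self, dim=256, num_classes=8, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.audio_proj = nn.Linear(80, dim)    # e.g. log-mel frame features
        self.visual_proj = nn.Linear(512, dim)  # e.g. face-embedding features
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio=None, visual=None):
        tokens = []
        if audio is not None:
            tokens.append(self.audio_proj(audio))
        if visual is not None:
            tokens.append(self.visual_proj(visual))
        x = torch.cat(tokens, dim=1)           # concatenate modality tokens
        h = self.encoder(x)                    # joint or single-modal encoding
        return self.classifier(h.mean(dim=1))  # multi-label emotion logits


def co_learning_step(model, optimizer, audio, visual, labels):
    """One interactive co-learning update: randomly present both modalities,
    audio only, or visual only, so one model learns all three conditions."""
    criterion = nn.BCEWithLogitsLoss()         # multi-label emotion targets
    mode = random.choice(["both", "audio", "visual"])
    if mode == "both":
        logits = model(audio=audio, visual=visual)
    elif mode == "audio":
        logits = model(audio=audio)
    else:
        logits = model(visual=visual)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = CrossModalTransformer()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    audio = torch.randn(4, 100, 80)            # (batch, frames, mel bins)
    visual = torch.randn(4, 50, 512)           # (batch, frames, embedding)
    labels = torch.randint(0, 2, (4, 8)).float()
    print(co_learning_step(model, opt, audio, visual, labels))
```

In this sketch the single-model requirement is met by making both modality inputs optional, and the "learn from both and either modality" idea is approximated by randomly dropping a modality at each training step; the paper's actual training schedule may differ.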
Keywords
audio-visual emotion recognition, cross-modal transformer, interactive co-learning