Audio-Visual Emotion Forecasting: Characterizing and Predicting Future Emotion Using Deep Learning

2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)

Cited by 9 | Views 64
Abstract
Emotion forecasting is the task of predicting the future emotion of a speaker (i.e., the emotion label of the future speaking turn) based on the speaker's past and current audio-visual cues. Emotion forecasting systems require new problem formulations that differ from traditional emotion recognition systems. In this paper, we first explore two types of forecasting windows (i.e., analysis windows for which the speaker's emotion is being forecasted): utterance forecasting and time forecasting. Utterance forecasting is based on speaking turns and forecasts what the speaker's emotion will be after one, two, or three speaking turns. Time forecasting forecasts what the speaker's emotion will be after a certain range of time, such as 3-8, 8-13, and 13-18 seconds. We then investigate the benefit of using the past audio-visual cues in addition to the current utterance. We design emotion forecasting models using deep learning. We compare the performance of fully-connected deep neural networks (FC-DNNs), deep long short-term memory (D-LSTM), and deep bidirectional long short-term memory (D-BLSTM) recurrent neural networks (RNNs). This allows us to examine the benefit of modeling dynamic patterns in emotion forecasting tasks. Our experimental results on the IEMOCAP benchmark dataset demonstrate that D-BLSTM and D-LSTM outperform FC-DNN by up to 2.42% in unweighted recall. When using both the current and past utterances, deep dynamic models show an improvement of up to 2.39% compared to their performance when using only the current utterance. We further analyze the benefit of using current and past utterance information compared to using the current and randomly chosen utterance information, and we find the performance improvement rises to 7.53%. The novelty in this study comes from its formulation of emotion forecasting problems and the understanding of how current and past audio-visual cues reveal future emotional information.
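The utterance-forecasting formulation above amounts to pairing each speaking turn's features with the emotion label of a turn k steps in the future. The sketch below illustrates this label-shifting idea; it is not from the paper, and the data layout (a list of per-turn dicts with `features` and `emotion` keys) is an illustrative assumption.

```python
# Hedged sketch (assumed data layout, not the paper's code): build
# utterance-forecasting pairs by shifting emotion labels k speaking
# turns into the future.

def make_forecast_pairs(turns, k):
    """Pair each turn's features with the emotion label of the
    turn k steps ahead (utterance forecasting with window k)."""
    pairs = []
    # The last k turns have no label k steps ahead, so stop early.
    for i in range(len(turns) - k):
        features_i = turns[i]["features"]
        future_label = turns[i + k]["emotion"]
        pairs.append((features_i, future_label))
    return pairs

# Toy sequence of four speaking turns (features are placeholders).
turns = [
    {"features": [0.1, 0.2], "emotion": "neutral"},
    {"features": [0.3, 0.1], "emotion": "happy"},
    {"features": [0.0, 0.5], "emotion": "angry"},
    {"features": [0.2, 0.2], "emotion": "sad"},
]
pairs = make_forecast_pairs(turns, k=1)
# Each feature vector is now paired with the NEXT turn's emotion label.
```

Time forecasting would follow the same pattern, except the future label is selected by a time offset (e.g., the turn starting 3-8 seconds ahead) rather than a fixed number of turns.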
Keywords
utterance forecasting, speaking turns, time forecasting, emotion forecasting models, deep learning, fully-connected deep neural network, current utterances, past utterances, deep dynamic models, audio-visual emotion forecasting, emotion label, emotion forecasting systems, forecasting windows, audio-visual cues, utterance information, emotion recognition systems, future emotion prediction, emotional information, deep bidirectional long short-term memory recurrent neural networks, D-BLSTM recurrent neural networks, deep long short-term memory, dynamic pattern modeling, IEMOCAP benchmark dataset