A Comparison of Time-based Models for Multimodal Emotion Recognition

Ege Kesim, Selahattin Serdar Helli, Sena Nur Cavsak, Senem Tanberk

CoRR (2023)

Abstract
Emotion recognition has become an important research topic in the field of human-computer interaction. Prior studies on understanding emotions from audio and video have focused mainly on analyzing facial expressions and classifying six basic emotions. In this study, the performance of different sequence models frequently used in the literature is compared on the multimodal emotion recognition problem. The audio and images were first processed by multi-layered CNN models, and the outputs of these models were fed into various sequence models: GRU, Transformer, LSTM, and Max Pooling. Accuracy, precision, harmonic F1 score, and macro F1 score were calculated for all models. The multimodal CREMA-D dataset was used in the experiments. On CREMA-D, the GRU-based architecture achieved the best harmonic F1 score (0.640), the LSTM-based architecture the best precision (0.699) and macro F1 score (0.678), while the Max Pooling-based architecture achieved the best sensitivity (0.620). Overall, the performances of the sequence models were observed to be close to each other. Notably, max pooling performed similarly to the other models even though it is a predefined, parameter-free layer.
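As a rough illustration (not the authors' code), max pooling over time reduces a sequence of per-frame CNN features to a single fixed-size vector without any learned parameters, unlike GRU/LSTM/Transformer aggregation. A minimal sketch, with hypothetical feature values:

```python
# Hypothetical per-frame CNN features: T = 3 frames, each a D = 3-dim vector.
features = [
    [0.2, -1.0, 0.5],
    [1.3,  0.4, -0.2],
    [-0.7, 0.9, 0.1],
]

# Max pooling over time: for each feature dimension, keep the maximum
# activation across all frames. The result is one D-dim vector that a
# classifier head can consume, regardless of sequence length T.
pooled = [max(frame[d] for frame in features) for d in range(len(features[0]))]
# pooled == [1.3, 0.9, 0.5]
```

Because this aggregation has no trainable weights, its competitive performance in the comparison suggests that much of the discriminative signal already resides in the per-frame CNN features.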
Keywords
multimodal emotion recognition,models,time-based