A MODEL ENSEMBLE APPROACH FOR AUDIO-VISUAL SCENE CLASSIFICATION Technical Report

semanticscholar(2021)

Abstract
In this technical report, we present our approach to Task 1b, Audio-Visual Scene Classification (AVSC), in the DCASE 2021 Challenge. We use networks pre-trained on image datasets to extract video embeddings, whereas for audio embeddings, models trained from scratch are more appropriate for feature extraction. We propose several models for the AVSC task based on different audio and video embeddings, combined using an early fusion strategy. In addition, we propose using an acoustic and visual segment model (AVSM) to extract text embeddings. Data augmentation methods are used during training. Furthermore, we adopt a two-stage classification strategy that leverages score fusion of two classifiers. Finally, a model ensemble of two-stage AVSC classifiers is used to obtain more robust predictions. The proposed systems are evaluated on the development dataset of TAU Urban Audio Visual Scenes 2021. Compared with the official baseline system, our approach achieves a much lower log loss of 0.141 and a much higher accuracy of 95.3%.
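The abstract names three generic building blocks: early fusion of audio and video embeddings, score fusion of classifier outputs, and evaluation by log loss and accuracy. The following is a minimal NumPy sketch of what these terms commonly mean, not the authors' implementation. The function names, the uniform default weights, and the toy arrays are all assumptions made for illustration; the report's actual embedding sizes, fusion weights, and classifiers are not given here.

```python
import numpy as np


def early_fusion(audio_emb, video_emb):
    """Early fusion (assumed form): concatenate per-clip audio and video
    embeddings along the feature axis before classification."""
    return np.concatenate([audio_emb, video_emb], axis=1)


def score_fusion(prob_list, weights=None):
    """Score fusion (assumed form): weighted average of the class-probability
    matrices produced by several classifiers or ensemble members."""
    probs = np.stack(prob_list, axis=0)  # (n_models, n_clips, n_classes)
    if weights is None:
        # Hypothetical default: equal weight for every model.
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights, probs, axes=1)  # (n_clips, n_classes)
    # Renormalise so each row is a valid probability distribution.
    return fused / fused.sum(axis=1, keepdims=True)


def log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss: mean negative log-probability of the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(y_true)), y_true])))


def accuracy(y_true, probs):
    """Fraction of clips whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(probs, axis=1) == y_true))


# Toy data (made up): 2 clips, 3-dim audio and 4-dim video embeddings.
audio = np.ones((2, 3))
video = np.zeros((2, 4))
fused_features = early_fusion(audio, video)  # shape (2, 7)

# Toy outputs of two classifiers over 2 classes (made up).
p1 = np.array([[0.8, 0.2], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.2, 0.8]])
labels = np.array([0, 1])

fused_scores = score_fusion([p1, p2])
print(fused_features.shape, log_loss(labels, fused_scores), accuracy(labels, fused_scores))
```

The same `score_fusion` helper can be applied twice, first to the two classifiers inside a two-stage system and then across several such systems, which is one plausible reading of the ensemble described in the abstract.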