Self-supervised Object Detection Network From Sound Cues Based on Knowledge Distillation with Multimodal Cross Level Feature Alignment

Liu Shibei, Chen Ying

2023 9th International Conference on Computer and Communications (ICCC) (2023)

Abstract
Sound, as one of the inherent attributes of objects, can provide valuable information for object detection. At present, methods that locate objects solely by monitoring ambient sound are not very robust. To address this problem, a multimodal self-supervised knowledge distillation object detection network with cross-level feature alignment is proposed. RGB and depth images serve as inputs to the teacher networks, and audio serves as the input to the student network. A multi-teacher cross-level feature alignment loss based on attention fusion is designed: it fuses the student's deep and shallow features to learn the teachers' corresponding intermediate-layer features, so that more comprehensive knowledge is extracted more efficiently. A localization distillation loss is also added to transfer more localization information. On the multimodal audio-visual detection (MAVD) dataset, the network's mAP increased by 11.6% over the baseline network, demonstrating the superiority of the proposed detection network.
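The abstract does not give the loss formulas, so the following pure-Python sketch is only a rough illustration of how a multi-teacher cross-level alignment loss and a localization distillation loss could be combined. Everything concrete in it is an assumption, not the authors' method: the student's shallow and deep features are taken to be already projected to the teacher feature size; the attention-fusion weights are a softmax over dot-product similarity with each teacher feature; the RGB and depth teachers are weighted equally; the localization term is an L1 distance between box coordinates; and the function names and the weight `lam` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def attention_fuse(student_levels: List[List[float]], teacher: List[float]) -> List[float]:
    """Fuse shallow and deep student features into one vector.

    Hypothetical fusion: each level's weight is a softmax over its
    dot-product similarity to the teacher feature (the paper's exact
    attention-fusion module is not specified in the abstract).
    """
    weights = softmax([dot(s, teacher) for s in student_levels])
    dim = len(teacher)
    return [sum(w * s[i] for w, s in zip(weights, student_levels)) for i in range(dim)]


def cross_level_alignment_loss(
    student_levels: List[List[float]], teachers: Dict[str, List[float]]
) -> float:
    """Average, over teachers, of the MSE between the fused student
    feature and that teacher's intermediate-layer feature (equal teacher
    weights are an assumption)."""
    losses = [mse(attention_fuse(student_levels, t), t) for t in teachers.values()]
    return sum(losses) / len(losses)


def localization_distillation_loss(
    student_boxes: List[List[float]], teacher_boxes: List[List[float]]
) -> float:
    """Mean absolute (L1) difference between student and teacher box
    coordinates; an assumed stand-in for the paper's localization loss."""
    total, count = 0.0, 0
    for sb, tb in zip(student_boxes, teacher_boxes):
        for s, t in zip(sb, tb):
            total += abs(s - t)
            count += 1
    return total / count


def total_distillation_loss(student_levels, teachers, student_boxes, teacher_boxes, lam=0.5):
    """Alignment loss plus a hypothetical weight `lam` times the localization loss."""
    return cross_level_alignment_loss(student_levels, teachers) + lam * (
        localization_distillation_loss(student_boxes, teacher_boxes)
    )


if __name__ == "__main__":
    shallow = [0.2, 0.1, 0.0]  # audio student, shallow layer (toy values)
    deep = [0.9, 0.4, 0.3]     # audio student, deep layer (toy values)
    teachers = {"rgb": [1.0, 0.5, 0.2], "depth": [0.8, 0.3, 0.4]}
    loss = total_distillation_loss(
        [shallow, deep], teachers, [[0.1, 0.1, 0.5, 0.5]], [[0.0, 0.0, 0.5, 0.6]]
    )
    print(f"total distillation loss: {loss:.4f}")
```

In this sketch the fusion weights are recomputed per teacher, so the same audio features can lean on the deep level to match one teacher and on the shallow level to match another; a learned attention module, as the abstract implies, would replace the fixed similarity score.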