Self-supervised Object Detection Network From Sound Cues Based on Knowledge Distillation with Multimodal Cross Level Feature Alignment

Liu Shibei, Chen Ying

2023 9th International Conference on Computer and Communications (ICCC) (2023)

Sound, as an inherent attribute of objects, can provide valuable information for object detection. Current methods that localize objects solely by monitoring ambient sound lack robustness. To address this problem, a multimodal self-supervised knowledge distillation object detection network with cross-level feature alignment is proposed. RGB and depth images serve as inputs to the teacher networks and audio serves as input to the student network, and a multi-teacher cross-level feature alignment loss based on attention fusion is designed. The loss fuses the student's deep and shallow features so that they learn the teachers' corresponding middle-layer features, extracting comprehensive knowledge more efficiently. A localization distillation loss is also added to transfer more localization information. On the Multimodal Audio-Visual Detection (MAVD) dataset, the network's mAP improves by 11.6% over the baseline network, demonstrating the superiority of the proposed detection network.
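The cross-level alignment idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the pooling-based attention weights, and the mean-squared-error distance are all assumptions chosen for illustration, since the abstract does not specify them. The sketch fuses a shallow and a deep student feature map with per-channel attention weights, then averages the distance between the fused map and each teacher's middle-layer feature map.

```python
import numpy as np


def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_fuse(shallow, deep):
    """Fuse two student feature maps of shape (C, H, W).

    Assumption: attention weights are a softmax over the two levels,
    computed per channel from globally average-pooled responses.
    """
    stacked = np.stack([shallow, deep])      # (2, C, H, W)
    scores = stacked.mean(axis=(2, 3))       # (2, C) pooled response per level
    weights = softmax(scores, axis=0)        # (2, C), sums to 1 across levels
    return (weights[:, :, None, None] * stacked).sum(axis=0)


def alignment_loss(student_shallow, student_deep, teacher_feats):
    """Mean MSE between the fused student map and each teacher map.

    Assumption: teacher_feats is a list of (C, H, W) middle-layer maps,
    e.g. one from an RGB teacher and one from a depth teacher, already
    projected to the student's shape.
    """
    fused = attention_fuse(student_shallow, student_deep)
    per_teacher = [np.mean((fused - t) ** 2) for t in teacher_feats]
    return float(np.mean(per_teacher))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shallow = rng.normal(size=(4, 3, 3))
    deep = rng.normal(size=(4, 3, 3))
    rgb_teacher = rng.normal(size=(4, 3, 3))
    depth_teacher = rng.normal(size=(4, 3, 3))
    print(alignment_loss(shallow, deep, [rgb_teacher, depth_teacher]))
```

In a real training setup the attention weights would typically be learned and the loss would be combined with the localization distillation term; both are outside the scope of this sketch.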