Knowledge Distillation for Efficient Audio-Visual Video Captioning

2023 31st European Signal Processing Conference (EUSIPCO)

Abstract
Automatically describing audio-visual content in text, namely video captioning, has received significant attention due to its potential applications across diverse fields. Deep neural networks are the dominant methods, offering state-of-the-art performance. However, these methods are often difficult to deploy on low-power devices such as smartphones because of their large parameter counts. In this paper, we propose to exploit a simple pooling front-end and down-sampling algorithms, combined with knowledge distillation of audio and visual attributes, using a reduced number of audio-visual frames. With the help of knowledge distillation from the teacher model, our proposed method greatly reduces the redundant information in the audio-visual streams without losing critical context for caption generation. Extensive experimental evaluations on the MSR-VTT dataset demonstrate that our proposed approach reduces inference time by about 80% with only a small sacrifice (less than 0.02%) in captioning accuracy.
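The abstract describes a student model that consumes fewer audio-visual frames and is trained with knowledge distillation from a full-frame teacher. The sketch below illustrates this idea in PyTorch-style pseudocode under loose assumptions: the helper names (downsample_frames, distillation_loss) and the hyperparameters (keep_every, temperature, alpha) are illustrative choices, not the paper's reported configuration or values.

```python
import torch
import torch.nn.functional as F


def downsample_frames(frames: torch.Tensor, keep_every: int = 4) -> torch.Tensor:
    """Uniformly sub-sample the temporal axis of (batch, time, feature) tensors.

    A simple stand-in for the pooling/down-sampling front-end: the student
    sees only every `keep_every`-th frame of the audio-visual stream.
    """
    return frames[:, ::keep_every, :]


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a soft-target distillation term with caption cross-entropy.

    `temperature` and `alpha` are assumed hyperparameters for illustration.
    """
    # Soft targets from the teacher (trained on the full frame sequence),
    # softened by the distillation temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard-label loss against the ground-truth caption tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         targets.view(-1))
    return alpha * kd + (1 - alpha) * ce
```

In use, the teacher would be run (without gradients) on the full frame sequence while the student runs on the down-sampled one, and the combined loss trains only the student; this is the standard teacher-student recipe and only a plausible reading of the pipeline summarized in the abstract.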
Keywords
Image Processing, Audio Processing, Natural Language Processing, Deep Learning, Video Captioning