Enhancing Multimodal Alignment with Momentum Augmentation for Dense Video Captioning

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)

Abstract
Dense video captioning aims to localize multiple events in an untrimmed video and generate a caption for each event. Fusing different modalities (e.g., RGB, flow, audio) via a transformer structure is a promising way to improve captioning performance. However, it is challenging for the cross-modal encoder to learn multimodal interactions because the modalities have inherently different distributions. In this paper, we propose a novel transformer structure that uses contrastive learning to align different modalities. Specifically, to avoid the limitations of small batch sizes and false contrastive targets, we design an event-aligned momentum augmentation strategy that applies contrastive learning to dense video captioning. Experimental results show that our approach outperforms all existing multimodal fusion methods for dense video captioning.
Keywords
Dense video captioning, Multimodal fusion, Contrastive learning, Momentum augmentation
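
The abstract does not give the loss or the momentum update, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a MoCo-style setup, swapped in here as a stand-in for the paper's event-aligned momentum augmentation: a momentum (exponential moving average) copy of an encoder supplies extra contrastive targets beyond the current batch, and an InfoNCE loss pulls together features of two modalities from the same event. All names (`ema_update`, `info_nce`, the momentum `m`, the temperature) are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ema_update(momentum_params, online_params, m=0.99):
    """Momentum update: theta_k <- m * theta_k + (1 - m) * theta_q.

    Assumption: parameters are flattened into plain lists of floats.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(momentum_params, online_params)]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query feature.

    query:     feature of one modality for an event (e.g. RGB)
    positive:  feature of another modality for the same event (e.g. audio)
    negatives: features of other events, e.g. drawn from a momentum queue
    """
    q = l2_normalize(query)
    # Cosine similarities scaled by temperature; index 0 is the positive pair.
    logits = [dot(q, l2_normalize(positive)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denom - logits[0]


# Toy usage: an aligned cross-modal pair should score a lower loss
# than a misaligned one against the same queue of negatives.
rgb_event = [1.0, 0.0]
audio_same_event = [0.9, 0.1]
audio_other_event = [0.0, 1.0]
queue = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = info_nce(rgb_event, audio_same_event, queue)
loss_misaligned = info_nce(rgb_event, audio_other_event, queue)
```

In this sketch, treating features of different events as negatives is one simple way to reduce false contrastive targets; how the paper actually selects event-aligned targets is not stated in the abstract.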