Background Disturbance Mitigation for Video Captioning Via Entity-Action Relocation

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)(2023)

Abstract
Video captioning aims to generate sentences that accurately describe video content, in which the video background serves as a prompt. State-of-the-art methods tend to explore richer video representations and fuse them with language to improve caption quality, with great success. However, they focus on exploiting foreground semantics and ignore the potential negative impact of video background disturbance on caption generation, i.e., entities and actions can be misjudged because of a similar video background. To ameliorate this issue, we propose Entity-Action Relocation (EAR), which enhances the adaptability of entities and actions to various backgrounds by relocating them onto new backgrounds. Specifically, for an extracted original video feature, we construct a mixed background for all entities and actions to form a distracting video feature sample. Contrastive learning is then applied to pull the caption generated from the original representation closer to the caption generated from the distracting representation, and to push the former away from captions generated for other videos, explicitly concentrating on the entities and actions of the current video scene. Extensive experiments on two public datasets (MSR-VTT and MSVD) demonstrate that mitigating video background disturbance yields competitive caption generation.
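The contrastive step described above can be sketched as an InfoNCE-style loss: the caption embedding from the original video feature (anchor) is pulled toward the embedding from the mixed-background sample (positive) and pushed away from caption embeddings of other videos (negatives). This is a minimal NumPy illustration under that assumption; the function name, `temperature` value, and embedding shapes are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss on L2-normalized embeddings.

    anchor:    caption embedding from the original video representation
    positive:  caption embedding from the distracting (mixed-background) sample
    negatives: caption embeddings generated from other videos, shape (K, d)
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    pos = np.exp(a @ p / temperature)          # similarity to pull closer
    neg = np.exp(n @ a / temperature).sum()    # similarities to push apart
    return -np.log(pos / (pos + neg))
```

Minimizing this loss increases the original-vs-distracted caption similarity relative to the similarities with other videos' captions, which is the mechanism EAR uses to make the model focus on entities and actions rather than the background.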
Keywords
Video Captioning,Background Disturbance Mitigation,Entity-Action Relocation,Contrastive Learning