Conditional Video-Text Reconstruction Network with Cauchy Mask for Weakly Supervised Temporal Sentence Grounding

2023 IEEE International Conference on Multimedia and Expo (ICME 2023)

Abstract
Temporal sentence grounding aims to detect the segment of an untrimmed video that is most relevant to a given query. To alleviate the expensive cost of annotating temporal labels, researchers have paid increasing attention to the weakly supervised setting. Prior studies neglected reconstruction of the video representation, which led to unbalanced alignment learning. Moreover, they generated proposals with strategies that ignored the temporal structure of the query. In this paper, we propose a novel Conditional Video-Text Reconstruction Network (CVTRN) that supports conditional reconstruction of both video and text representations. Specifically, video and text features are fused to compute semantic alignment, which serves as the condition for reconstruction. A new mask strategy for mask-conditioned sentence reconstruction is also devised; it focuses more on boundary regions than the Gaussian mask widely used in previous methods. Experimental results on two public benchmark datasets show that our CVTRN outperforms state-of-the-art methods.
Keywords
Weakly supervised, temporal sentence grounding, conditional reconstruction, Cauchy mask
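The abstract does not give the exact form of the Cauchy mask, so the sketch below is only illustrative: it assumes the mask over normalized clip positions is parameterized by a predicted proposal center and width (hypothetical names here), and contrasts an un-normalized Cauchy density with the Gaussian mask used in earlier reconstruction-based methods. The point it illustrates is that the Cauchy density's heavier tails retain more weight far from the center, i.e., near the proposal boundaries.

```python
import numpy as np

def gaussian_mask(num_clips, center, width):
    """Gaussian-shaped temporal mask over normalized clip positions (prior work's choice)."""
    t = np.linspace(0.0, 1.0, num_clips)
    m = np.exp(-0.5 * ((t - center) / width) ** 2)
    return m / m.max()

def cauchy_mask(num_clips, center, width):
    """Cauchy-shaped temporal mask (illustrative); heavier tails keep more weight near boundaries."""
    t = np.linspace(0.0, 1.0, num_clips)
    m = 1.0 / (1.0 + ((t - center) / width) ** 2)  # un-normalized Cauchy density
    return m / m.max()

# Hypothetical proposal: center 0.5, width 0.15, over 16 clips.
g = gaussian_mask(16, center=0.5, width=0.15)
c = cauchy_mask(16, center=0.5, width=0.15)
# Far from the center the Gaussian mask is close to zero, while the Cauchy mask
# still assigns noticeable weight, emphasizing boundary regions of the proposal.
print(np.round(g, 2))
print(np.round(c, 2))
```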