Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks

IEEE Transactions on Pattern Analysis and Machine Intelligence(2022)

引用 118|浏览79
暂无评分
摘要
Given only video-level action categorical labels during training, weakly-supervised temporal action localization (WS-TAL) learns to detect action instances and locates their temporal boundaries in untrimmed videos. Compared to its fully supervised counterpart, WS-TAL is more cost-effective in data labeling and thus favorable in practical applications. However, the coarse video-level supervision inevitably incurs ambiguities in action localization, especially in untrimmed videos containing multiple action instances. To overcome this challenge, we observe that significant temporal contrasts among video snippets, e.g., caused by temporal discontinuities and sudden changes, often occur around true action boundaries. This motivates us to introduce a Contrast-based Localization EvaluAtioN Network (CleanNet), whose core is a new temporal action proposal evaluator, which provides fine-grained pseudo supervision by leveraging the temporal contrasts among snippet-level classification predictions. As a result, the uncertainty in locating action instances can be resolved via evaluating their temporal contrast scores. Moreover, the new action localization module is an integral part of CleanNet which enables end-to-end training. This is in contrast to many existing WS-TAL methods where action localization is merely a post-processing step. Besides, we also explore the usage of temporal contrast on temporal action proposal (TAP) generation task, which we believe is the first attempt with the weak supervision setting. Experiments on the THUMOS14, ActivityNet v1.2 and v1.3 datasets validate the efficacy of our method against existing state-of-the-art WS-TAL algorithms.
更多
查看译文
关键词
Action localization,weakly supervised learning,temporal contrast
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要