Multi-level alignment for few-shot temporal action localization

SSRN Electronic Journal (2023)

Abstract
Temporal action localization (TAL), which aims to localize actions in long untrimmed videos, requires a large amount of annotated training data. However, obtaining segment-level annotations for large-scale datasets is expensive. To overcome this challenge, a new few-shot learning method is proposed that localizes temporal actions for unseen classes with only a few training samples. In this study, a new multi-level encoder cosine-similarity alignment module is adopted that exploits the alignment of visual information at each temporal location. The proposed method aligns video snippets that contain similar foreground action instances and implicitly captures intra-class variations. In addition, it incorporates cosine similarity into the Transformer encoder layers to support the self-attention mechanism, placing greater emphasis on refined features at the higher encoder layers. Toward this objective, an episodic training scheme is adopted to learn the alignment of similar video snippets from only a few training examples. At test time, the learned context information is then adapted to novel classes. Experimental results show that the proposed method outperforms state-of-the-art methods for few-shot temporal action localization with single and multiple action instances on the ActivityNet-1.3 dataset and achieves competitive results on the THUMOS-14 and HACS datasets.
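The two operations central to the abstract, cosine-similarity alignment of snippet features and a cosine-similarity variant of self-attention, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, feature dimensions, and temperature value are hypothetical choices for exposition.

```python
# Minimal sketch (assumed, not the paper's code): cosine-similarity alignment
# between query and support snippet features, and self-attention with the
# dot product replaced by cosine similarity.
import torch
import torch.nn.functional as F

def cosine_alignment(query_feats, support_feats):
    """Align each query snippet to few-shot support snippets.

    query_feats:   (T_q, D) snippet features of the query video
    support_feats: (T_s, D) snippet features of a support video
    Returns a (T_q, T_s) soft-alignment matrix whose rows sum to 1.
    """
    q = F.normalize(query_feats, dim=-1)      # unit-norm rows
    s = F.normalize(support_feats, dim=-1)
    sim = q @ s.t()                           # cosine similarities in [-1, 1]
    return sim.softmax(dim=-1)                # soft alignment over support snippets

def cosine_self_attention(x, temperature=0.1):
    """Self-attention where attention logits are cosine similarities.

    x: (T, D) snippet features; temperature (assumed value) sharpens weights.
    """
    xn = F.normalize(x, dim=-1)
    attn = (xn @ xn.t() / temperature).softmax(dim=-1)  # (T, T) weights
    return attn @ x                                      # re-weighted features

# Toy usage with random features
torch.manual_seed(0)
query, support = torch.randn(64, 256), torch.randn(16, 256)
align = cosine_alignment(query, support)   # (64, 16) alignment matrix
refined = cosine_self_attention(query)     # (64, 256) refined features
```

Normalizing the features before the similarity computation bounds the attention logits, which is one plausible reason a cosine formulation can stabilize the emphasis on refined features at higher encoder layers.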
Keywords
Few-shot learning, Temporal action localization, Feature alignment, Cosine similarity