Test-Time Zero-Shot Temporal Action Localization
CVPR 2024
Abstract
Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate
actions in untrimmed videos unseen during training. Existing ZS-TAL methods
involve fine-tuning a model on a large amount of annotated training data. While
effective, training-based ZS-TAL approaches assume the availability of labeled
data for supervised learning, which can be impractical in some applications.
Furthermore, the training process naturally induces a domain bias into the
learned model, which may adversely affect the model's generalization ability to
arbitrary videos. These considerations prompt us to approach the ZS-TAL problem
from a radically novel perspective, relaxing the requirement for training data.
To this end, we introduce a novel method that performs Test-Time adaptation for
Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained
Vision and Language Model (VLM). T3AL operates in three steps. First, a
video-level pseudo-label of the action category is computed by aggregating
information from the entire video. Then, action localization is performed
using a novel procedure inspired by self-supervised learning. Finally,
frame-level textual descriptions extracted with a state-of-the-art captioning
model are employed for refining the action region proposals. We validate the
effectiveness of T3AL through experiments on the THUMOS14 and
ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly
outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the
benefit of a test-time adaptation approach.
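The first two steps of the pipeline can be sketched with a CLIP-like VLM. The snippet below is a minimal illustration, assuming the open_clip library; the function names, prompt template, and score threshold are hypothetical, and the self-supervised refinement and caption-based proposal refinement are omitted. It is a sketch of the general idea, not the authors' implementation.

```python
# Sketch of T3AL-style steps 1-2 with a CLIP-like VLM (assumes open_clip).
# All names and the threshold below are illustrative assumptions.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def video_pseudo_label(frames, class_names):
    """Step 1: aggregate frame embeddings into a video-level embedding,
    then pick the best-matching action class as the pseudo-label."""
    imgs = torch.stack([preprocess(f) for f in frames])      # frames: PIL images
    v = model.encode_image(imgs).mean(dim=0, keepdim=True)   # video-level embedding
    t = model.encode_text(tokenizer([f"a video of {c}" for c in class_names]))
    v = v / v.norm(dim=-1, keepdim=True)
    t = t / t.norm(dim=-1, keepdim=True)
    return class_names[(v @ t.T).argmax().item()]

@torch.no_grad()
def propose_regions(frames, pseudo_label, thresh=0.25):
    """Step 2 (simplified): frames whose similarity to the pseudo-label text
    exceeds a threshold form contiguous action region proposals."""
    imgs = torch.stack([preprocess(f) for f in frames])
    f = model.encode_image(imgs)
    f = f / f.norm(dim=-1, keepdim=True)
    t = model.encode_text(tokenizer([f"a video of {pseudo_label}"]))
    t = t / t.norm(dim=-1, keepdim=True)
    scores = (f @ t.T).squeeze(-1)
    active, regions, start = (scores > thresh).tolist(), [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(active) - 1))
    return regions  # list of (start_frame, end_frame) proposals
```

In the paper, the proposals would then be refined at test time (no labeled training data) using frame-level captions from a captioning model; the threshold-based grouping here merely stands in for that localization procedure.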