Leveraging Temporal Contextualization for Video Action Recognition
arXiv (2024)
Abstract
Pretrained vision-language models have shown effectiveness in video
understanding. However, recent studies have not sufficiently leveraged
essential temporal information from videos, simply averaging frame-wise
representations or referencing consecutive frames. We introduce Temporally
Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding
that effectively and efficiently leverages comprehensive video information. We
propose Temporal Contextualization (TC), a novel layer-wise temporal
information infusion mechanism for video that extracts core information from
each frame, interconnects relevant information across the video to summarize
into context tokens, and ultimately leverages the context tokens during the
feature encoding process. Furthermore, our Video-conditional Prompting (VP) module processes the context tokens to generate informative prompts in the text modality. We conduct extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of TC-CLIP. Ablation studies on TC and VP support our design choices. Code is available at https://github.com/naver-ai/tc-clip.
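The abstract describes TC only at a high level: select core information per frame, summarize it across the video into context tokens, and reuse those tokens during encoding. The following is a minimal PyTorch sketch of one way such a layer could be structured; the class name, the top-k saliency scoring, the soft-assignment pooling, and all shapes and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalContextualization(nn.Module):
    # Hypothetical sketch of the TC idea, not the paper's code:
    # 1) score tokens in each frame and keep the top-k as "seed" tokens,
    # 2) pool all seeds across the video into a few context tokens,
    # 3) let every frame's tokens attend over [own tokens + context tokens].

    def __init__(self, dim=512, num_heads=8, seeds_per_frame=16, num_context=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)           # per-token saliency score
        self.pool = nn.Linear(dim, num_context)  # soft assignment to context slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.k = seeds_per_frame

    def forward(self, x):
        # x: (B, T, N, D) = batch, frames, tokens per frame, channels
        B, T, N, D = x.shape

        # 1) select the k most salient tokens in each frame
        scores = self.score(x).squeeze(-1)                     # (B, T, N)
        idx = scores.topk(self.k, dim=-1).indices              # (B, T, k)
        seeds = torch.gather(x, 2, idx.unsqueeze(-1).expand(-1, -1, -1, D))

        # 2) summarize seeds from the whole video into context tokens
        seeds = seeds.reshape(B, T * self.k, D)
        assign = self.pool(seeds).softmax(dim=1)               # (B, T*k, C)
        context = torch.einsum('bsc,bsd->bcd', assign, seeds)  # (B, C, D)

        # 3) per-frame attention over frame tokens plus shared video context
        C = context.shape[1]
        ctx = context.unsqueeze(1).expand(B, T, C, D)
        kv = torch.cat([x, ctx], dim=2).reshape(B * T, N + C, D)
        q = x.reshape(B * T, N, D)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(B, T, N, D)


# Toy usage: 2 clips, 8 frames, 50 ViT tokens of width 512
tc = TemporalContextualization(dim=512)
video = torch.randn(2, 8, 50, 512)
print(tc(video).shape)  # torch.Size([2, 8, 50, 512])
```

The point of the sketch is the information flow: context tokens are computed once from the whole video, then broadcast back into every frame's attention, so each frame is encoded with video-level context rather than in isolation.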
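The VP module is described in a single sentence, so the sketch below shows one plausible reading: learnable text-side prompt vectors cross-attend to the video's context tokens, yielding prompts conditioned on the specific input video. The class name, the prompt count, and the residual design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VideoConditionalPrompting(nn.Module):
    # Hypothetical sketch: learnable prompt vectors attend to the video's
    # context tokens, so the prompts fed to the text encoder are conditioned
    # on the specific input video rather than being static.

    def __init__(self, dim=512, num_prompts=8, num_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context):
        # context: (B, C, D) video-level context tokens (e.g. from the TC stage)
        q = self.prompts.unsqueeze(0).expand(context.shape[0], -1, -1)  # (B, P, D)
        out, _ = self.cross_attn(q, context, context)
        return q + out  # residual: video-conditioned prompt tokens


# Toy usage with 4 context tokens per clip
vp = VideoConditionalPrompting(dim=512)
ctx = torch.randn(2, 4, 512)
print(vp(ctx).shape)  # torch.Size([2, 8, 512])
```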