A prompt tuning method for few-shot action recognition.

2023 IEEE International Conference on Visual Communications and Image Processing (VCIP)(2023)

Abstract
Vision-language pre-training models learn visual concepts from image-text or video-text pairs, and these concepts can be transferred to downstream visual-textual tasks. In this paper, we use them as prior knowledge to mitigate the unreliability of minimizing a loss over only a handful of training samples in few-shot action recognition. Specifically, we design a two-stage framework of vision-language pre-training followed by prompt tuning. In the pre-training stage, multi-modal encoders are jointly trained on video-text pairs to learn the semantic correspondence between video and text. In the prompt tuning stage, a prompt module with an instance-level bias is trained on a few video samples so that the pre-trained concepts can be exploited for the classification task. Experimental results show that the proposed method outperforms the baseline and state-of-the-art few-shot action recognition methods on two public video benchmarks.
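The abstract does not give implementation details of the prompt module, so the following is only a toy NumPy sketch of the general idea under stated assumptions: a frozen video encoder produces a feature, a small projection (here `bias_proj`, a hypothetical name) turns that feature into an instance-level bias added to shared learnable context tokens, and classification scores are cosine similarities between the video feature and the resulting per-class prompts. All sizes and function names are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # embedding dim (toy size; real models use e.g. 512)
N_CLS = 3      # number of action classes
N_CTX = 4      # number of learnable context tokens

# Frozen pre-trained pieces (stand-ins for the text/video encoders).
class_token_emb = rng.normal(size=(N_CLS, D))   # embeddings of class names

def encode_video(frames):
    """Stand-in for the frozen video encoder: mean-pool frame features."""
    return frames.mean(axis=0)

# Trainable prompt module: shared context tokens + an instance-bias projection.
ctx = rng.normal(scale=0.02, size=(N_CTX, D))    # learnable context tokens
bias_proj = rng.normal(scale=0.02, size=(D, D))  # maps video feature -> bias

def classify(frames):
    v = encode_video(frames)                     # (D,) video feature
    bias = v @ bias_proj                         # instance-level bias, (D,)
    # Each class prompt = bias-shifted context pooled with the class embedding.
    prompts = (ctx + bias).mean(axis=0) + class_token_emb   # (N_CLS, D)
    # Cosine similarity between the video feature and each class prompt.
    v_n = v / np.linalg.norm(v)
    p_n = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    return p_n @ v_n                             # (N_CLS,) class scores

frames = rng.normal(size=(16, D))                # 16 toy frame features
logits = classify(frames)
print(logits.shape)
```

In training, only `ctx` and `bias_proj` would receive gradients from a cross-entropy loss over the few labeled videos, while both encoders stay frozen, which is what makes the approach viable with so few samples.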
Keywords
Few-shot learning, Action recognition, Prompt tuning, Vision-language pre-training