GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition
CoRR (2024)
Abstract
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have
shown impressive performance in various visual recognition tasks. This
advancement paves the way for notable performance in Zero-Shot Egocentric
Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global
video-text matching task, which often leads to suboptimal alignment of vision
and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs,
emphasizing fine-grained concept-description alignment that capitalizes on the
rich semantic and contextual details in egocentric videos. In this paper, we
introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for
ZS-EAR, designed to enhance the fine-grained alignment of concept and
description between vision and language. Extensive experiments demonstrate
GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric
video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%), EGTEA (39.6%), and
CharadesEgo (31.5%).
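
For context on what the abstract contrasts against, below is a minimal sketch of the conventional *global* video-text matching baseline for zero-shot action recognition: frame features are pooled into a single video embedding and compared against class-prompt embeddings by cosine similarity. This is not GPT4Ego's method; it assumes OpenAI's `clip` package, and the class names, prompt template, and frame sampling are illustrative placeholders.

```python
# Sketch of global video-text matching for zero-shot action recognition.
# Assumes OpenAI's `clip` package (pip install git+https://github.com/openai/CLIP).
# Class names and the prompt template below are hypothetical examples,
# not the label set of any specific benchmark.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["open door", "cut vegetable", "pour water"]
prompts = clip.tokenize(
    [f"a photo of a person {c}" for c in class_names]
).to(device)

@torch.no_grad()
def classify_video(frames):
    """frames: list of PIL.Image sampled from one egocentric clip."""
    pixels = torch.stack([preprocess(f) for f in frames]).to(device)
    frame_feats = model.encode_image(pixels)            # (T, D) per-frame features
    video_feat = frame_feats.mean(dim=0, keepdim=True)  # single global video embedding
    text_feats = model.encode_text(prompts)             # (C, D) class embeddings
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * video_feat @ text_feats.T          # scaled cosine similarities
    return class_names[logits.argmax(dim=-1).item()]
```

Because the whole clip is collapsed into one vector before matching, fine-grained visual concepts and their textual descriptions are never aligned directly; this coarse alignment is the shortcoming the paper's concept-description approach targets.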