LALM: Long-Term Action Anticipation with Language Models
CoRR (2023)
Abstract
Understanding human activity is a crucial yet intricate task in egocentric
vision, a field that focuses on capturing visual perspectives from the camera
wearer's viewpoint. While traditional methods heavily rely on representation
learning trained on extensive video data, there exists a significant
limitation: obtaining effective video representations proves challenging due to
the inherent complexity and variability in human activities. Furthermore,
exclusive dependence on video-based learning may constrain a model's capability
to generalize across long-tail classes and out-of-distribution scenarios.
In this study, we introduce a novel approach for long-term action
anticipation using language models (LALM), adept at addressing the complex
challenges of long-term activity understanding without the need for extensive
training. Our method incorporates an action recognition model to track previous
action sequences and a vision-language model to articulate relevant
environmental details. By leveraging the context provided by these past events,
we devise a prompting strategy for action anticipation using large language
models (LLMs). Moreover, we implement Maximal Marginal Relevance for example
selection to facilitate in-context learning of the LLMs. Our experimental
results demonstrate that LALM surpasses the state-of-the-art methods in the
task of long-term action anticipation on the Ego4D benchmark. We further
validate LALM on two additional benchmarks, affirming its capacity for
generalization across intricate activities with different sets of taxonomies.
These results are achieved without specific fine-tuning.
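
The abstract names Maximal Marginal Relevance (MMR) as the method for choosing in-context examples but gives no details. The following is a minimal sketch of generic MMR selection, not the authors' implementation. It assumes each candidate example and the query are already embedded as vectors, that similarity is cosine similarity, and that a trade-off weight `lam` balances relevance against redundancy. The function names, the toy vectors, and the value of `lam` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedily pick up to k candidate indices by Maximal Marginal Relevance.

    Each step picks the candidate maximizing
        lam * sim(query, c) - (1 - lam) * max(sim(c, s) for s already selected).
    lam = 1.0 reduces to plain relevance ranking.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (hypothetical): candidates 0 and 1 are near-duplicates.
query = [1.0, 0.3]
pool = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

balanced = mmr_select(query, pool, k=2, lam=0.5)        # trades relevance for diversity
relevance_only = mmr_select(query, pool, k=2, lam=1.0)  # ignores redundancy
print(balanced, relevance_only)
```

In this toy pool, pure relevance ranking returns the two near-duplicate candidates, while the balanced setting replaces one of them with the more distinct candidate. That matches the usual reason for applying MMR to in-context example selection: each prompt slot should add information the others do not.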