Learning Object State Changes in Videos: An Open-World Perspective
CoRR(2023)
摘要
Object State Changes (OSCs) are pivotal for video understanding. While humans
can effortlessly generalize OSC understanding from familiar to unknown objects,
current approaches are confined to a closed vocabulary. Addressing this gap, we
introduce a novel open-world formulation for the video OSC problem. The goal is
to temporally localize the three stages of an OSC -- the object's initial
state, its transitioning state, and its end state -- whether or not the object
has been observed during training. Towards this end, we develop VidOSC, a
holistic learning approach that: (1) leverages text and vision-language models
for supervisory signals to obviate manually labeling OSC training data, and (2)
abstracts fine-grained shared state representations from objects to enhance
generalization. Furthermore, we present HowToChange, the first open-world
benchmark for video OSC localization, which offers an order of magnitude
increase in the label space and annotation volume compared to the best existing
benchmark. Experimental results demonstrate the efficacy of our approach, in
both traditional closed-world and open-world scenarios.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要