GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
arXiv (2023)
Abstract
We address the task of generating temporally consistent and physically
plausible images of actions and object state transformations. Given an input
image and a text prompt describing the targeted transformation, our generated
images preserve the environment and transform objects in the initial image. Our
contributions are threefold. First, we leverage a large body of instructional
videos and automatically mine a dataset of triplets of consecutive frames
corresponding to initial object states, actions, and resulting object
transformations. Second, equipped with this data, we develop and train a
conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a
variety of objects and actions and show superior performance compared to
existing methods. In particular, we introduce a quantitative evaluation where
GenHowTo achieves 88%, outperforming prior work by a large margin.
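The data-mining contribution above (automatically extracting triplets of frames for initial state, action, and resulting state) could be sketched as follows. This is a minimal illustration under assumed inputs, not the paper's actual pipeline: it supposes each video frame already carries a phase label ("state" or "action"), and the function name and frame-selection heuristic are hypothetical.

```python
from itertools import groupby

def mine_triplets(labeled_frames):
    """Hypothetical sketch of triplet mining from an instructional video.

    labeled_frames: iterable of (frame_idx, phase) pairs, where phase is
    "state" or "action" (assumed to come from some upstream classifier).
    Collapses consecutive frames with the same phase into runs, then emits
    one (initial_state, action, end_state) frame triplet for every
    state -> action -> state run pattern.
    """
    # Collapse consecutive identical phases into (phase, frame_indices) runs.
    runs = []
    for phase, group in groupby(labeled_frames, key=lambda x: x[1]):
        runs.append((phase, [idx for idx, _ in group]))

    triplets = []
    for (p0, f0), (p1, f1), (p2, f2) in zip(runs, runs[1:], runs[2:]):
        if (p0, p1, p2) == ("state", "action", "state"):
            # Heuristic: last frame of the initial state, middle frame of
            # the action, first frame of the resulting state.
            triplets.append((f0[-1], f1[len(f1) // 2], f2[0]))
    return triplets

frames = [(0, "state"), (1, "state"), (2, "action"),
          (3, "action"), (4, "action"), (5, "state")]
print(mine_triplets(frames))  # -> [(1, 3, 5)]
```

The triplets of this form are what would then supervise an image- and text-conditioned diffusion model such as the one the abstract describes.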