VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang

arXiv (2023)

Abstract
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
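The abstract's core idea — treating video generation as next-token prediction over a single multimodal token stream — can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the special tokens (`BOS`, `BOV`, `EOS`), the `build_sequence` layout, and the dummy model are hypothetical placeholders, not VideoPoet's actual tokenizer or decoding interface.

```python
# Hedged sketch of decoder-only, autoregressive generation over a
# multimodal token stream. Token ids and layout are illustrative only.

BOS, BOV, EOS = 0, 1, 2  # hypothetical markers: begin, begin-video, end

def build_sequence(text_tokens, video_tokens):
    """Concatenate modalities into one flat stream, LLM-style."""
    return [BOS] + text_tokens + [BOV] + video_tokens

def generate(prefix, next_token, max_new=8):
    """Autoregressive decoding: repeatedly append the model's next token."""
    seq = list(prefix)
    for _ in range(max_new):
        tok = next_token(seq)
        seq.append(tok)
        if tok == EOS:
            break
    return seq

# Stand-in "model": emits four video tokens (ids >= 100), then stops.
def dummy_next_token(seq):
    n = sum(1 for t in seq if t >= 100)
    return 100 + n if n < 4 else EOS

prefix = build_sequence([10, 11, 12], [])  # text prompt, no video yet
out = generate(prefix, dummy_next_token)
print(out)  # [0, 10, 11, 12, 1, 100, 101, 102, 103, 2]
```

The point of the sketch is the interface, not the model: once images, video, text, and audio are all mapped to discrete tokens in one sequence, a single decoder-only transformer can serve many conditioning setups (text-to-video, image-to-video, etc.) by varying only the prefix.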