MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers
arXiv (2023)
Abstract
Recent advances in generative AI have significantly enhanced image and video
editing, particularly in the context of text prompt control. State-of-the-art
approaches predominantly rely on diffusion models to accomplish these tasks.
However, the computational demands of diffusion-based methods are substantial,
often necessitating large-scale paired datasets for training, which hinders
their deployment in real applications. To address these issues, this paper
breaks down the text-based video editing task into two stages. First, we
leverage a pre-trained text-to-image diffusion model to simultaneously edit a
few keyframes in a zero-shot manner. Second, we introduce an efficient model
called MaskINT, which is built on non-autoregressive masked generative
transformers and specializes in frame interpolation between the edited
keyframes, using structural guidance from intermediate frames. Experimental
results suggest that MaskINT achieves performance comparable to diffusion-based
methods while significantly reducing inference time. This research offers a
practical solution for text-based video editing and showcases the potential of
non-autoregressive masked generative transformers in this domain.
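The efficiency claim rests on non-autoregressive masked decoding: instead of generating tokens one at a time, all token positions start masked, and each step predicts every masked position in parallel, keeping only the most confident predictions according to an unmasking schedule. The sketch below illustrates this MaskGIT-style loop in isolation; the `dummy_predictor`, the cosine schedule, and all parameter names are illustrative stand-ins, not the actual MaskINT model (which would condition on the edited keyframes and structural cues).

```python
import math
import random

MASK = -1  # sentinel value for a still-masked token position

def dummy_predictor(tokens):
    """Stand-in for the transformer: returns a (token, confidence) pair per
    position. In MaskINT this would condition on edited keyframes and
    structural guidance from intermediate frames."""
    return [(random.randrange(1024), random.random()) for _ in tokens]

def iterative_decode(num_tokens, steps=8):
    """Non-autoregressive masked decoding: all positions start masked; each
    step predicts every masked token in parallel and commits only the most
    confident ones, following a cosine unmasking schedule (an assumption
    borrowed from MaskGIT-style decoders)."""
    tokens = [MASK] * num_tokens
    for t in range(1, steps + 1):
        # fraction of tokens that should remain masked after this step
        mask_ratio = math.cos(math.pi / 2 * t / steps)
        keep_masked = int(num_tokens * mask_ratio)
        preds = dummy_predictor(tokens)
        # gather still-masked positions with their predicted confidence
        cand = [(conf, i, tok) for i, (tok, conf) in enumerate(preds)
                if tokens[i] == MASK]
        cand.sort(reverse=True)  # most confident first
        # commit everything except the `keep_masked` least confident positions
        n_unmask = len(cand) - keep_masked
        for conf, i, tok in cand[:n_unmask]:
            tokens[i] = tok
    return tokens

result = iterative_decode(16)
```

With 8 steps over 16 positions, the loop fills the sequence in 8 parallel passes rather than 16 sequential ones, which is the source of the inference-time advantage the abstract describes.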