FiT: Flexible Vision Transformer for Diffusion Model
CoRR(2024)
摘要
Nature is infinitely resolution-free. In the context of this reality,
existing diffusion models, such as Diffusion Transformers, often face
challenges when processing image resolutions outside of their trained domain.
To overcome this limitation, we present the Flexible Vision Transformer (FiT),
a transformer architecture specifically designed for generating images with
unrestricted resolutions and aspect ratios. Unlike traditional methods that
perceive images as static-resolution grids, FiT conceptualizes images as
sequences of dynamically-sized tokens. This perspective enables a flexible
training strategy that effortlessly adapts to diverse aspect ratios during both
training and inference phases, thus promoting resolution generalization and
eliminating biases induced by image cropping. Enhanced by a meticulously
adjusted network structure and the integration of training-free extrapolation
techniques, FiT exhibits remarkable flexibility in resolution extrapolation
generation. Comprehensive experiments demonstrate the exceptional performance
of FiT across a broad range of resolutions, showcasing its effectiveness both
within and beyond its training resolution distribution. Repository available at
https://github.com/whlzy/FiT.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要