Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
CoRR (2024)
Abstract
As a dominant force in text-to-image generation tasks, Diffusion
Probabilistic Models (DPMs) face a critical challenge in controllability,
struggling to adhere strictly to complex, multi-faceted instructions. In this
work, we aim to address this alignment challenge for conditional generation
tasks. First, we provide an alternative view of state-of-the-art DPMs as a way
of inverting advanced Vision-Language Models (VLMs). With this formulation, we
naturally propose a training-free approach that bypasses the conventional
sampling process associated with DPMs. By directly optimizing images with the
supervision of discriminative VLMs, the proposed method can potentially achieve
better text-image alignment. As a proof of concept, we demonstrate the pipeline
with the pre-trained BLIP-2 model and identify several key designs for improved
image generation. To further enhance the image fidelity, a Score Distillation
Sampling module of Stable Diffusion is incorporated. By carefully balancing the
two components during optimization, our method can produce high-quality images
with near state-of-the-art performance on T2I-CompBench.
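
The idea of optimizing an image directly under two balanced signals can be illustrated with a toy sketch. It is not the paper's implementation: the quadratic terms below are stand-ins chosen only so the example runs without any model, and the names `optimize_image`, `target`, `prior`, and `lam` are hypothetical. The first term plays the role of the VLM alignment loss, the second plays the role of the Score Distillation Sampling signal, and `lam` is the assumed balancing weight.

```python
def optimize_image(target, prior, lam=0.5, lr=0.1, steps=200):
    """Toy gradient descent on pixel values with two weighted terms.

    target: stand-in for what the alignment (VLM) signal pulls the image toward.
    prior:  stand-in for what the fidelity (SDS-like) signal pulls it toward.
    lam:    hypothetical weight balancing the two terms.
    """
    x = [0.0] * len(target)  # start from a blank "image"
    for _ in range(steps):
        # Gradient of 0.5 * (x - target)^2, standing in for the alignment loss.
        g_align = [xi - ti for xi, ti in zip(x, target)]
        # Gradient of 0.5 * (x - prior)^2, standing in for the fidelity signal.
        g_prior = [xi - pi for xi, pi in zip(x, prior)]
        # One descent step on the weighted sum of the two gradients.
        x = [xi - lr * (ga + lam * gp) for xi, ga, gp in zip(x, g_align, g_prior)]
    return x


if __name__ == "__main__":
    print(optimize_image([1.0, 0.0], [0.0, 1.0], lam=1.0))
```

In this toy setting the result settles at the weighted average `(target + lam * prior) / (1 + lam)`, so `lam = 0` follows the alignment term alone and larger values pull the result toward the prior. In the actual method the two signals would come from a pre-trained VLM and a diffusion model rather than from fixed vectors.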