A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
CoRR (2024)
Abstract
Aligning text-to-image (T2I) diffusion models with preference has been gaining
increasing research attention. While prior works directly optimize T2I models
on preference data, these methods are developed under the bandit assumption of
a latent reward on the entire diffusion reverse chain, ignoring the sequential
nature of the generation process. As the literature suggests, this may harm
both the efficacy and the efficiency of alignment. In this paper, we take a
finer, dense-reward perspective and derive a tractable alignment objective that
emphasizes the initial steps of the T2I reverse chain. In particular, we
introduce temporal discounting into the DPO-style explicit-reward-free loss, to
break the temporal symmetry therein and suit the T2I generation hierarchy. In
experiments on single- and multiple-prompt generation, our method is
competitive with strong relevant baselines, both quantitatively and
qualitatively. Further studies illustrate the insight behind our approach.
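The abstract's key technical idea, injecting temporal discounting into a DPO-style preference loss summed over the diffusion reverse chain, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact objective: the function name discounted_dpo_loss, its tensor layout, and the default beta and gamma values are hypothetical assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def discounted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                        beta=0.1, gamma=0.95):
    """Hypothetical sketch of a temporally discounted, DPO-style
    explicit-reward-free loss over a diffusion reverse chain.

    Args (all shaped [batch, T], one entry per reverse-diffusion step):
      logp_w / logp_l:         policy log-probs of each denoising step on
                               the preferred / dispreferred chain.
      ref_logp_w / ref_logp_l: the same quantities under a frozen
                               reference model.
      beta:  DPO inverse-temperature (assumed value).
      gamma: discount in (0, 1]; gamma < 1 down-weights later steps,
             emphasizing the initial steps of the reverse chain.
    """
    T = logp_w.shape[1]
    # gamma^t for t = 0..T-1: breaks the temporal symmetry of a plain
    # DPO objective, which would weight every step equally.
    discount = gamma ** torch.arange(T, dtype=logp_w.dtype,
                                     device=logp_w.device)

    # Per-step implicit-reward margins relative to the reference model.
    margin_w = logp_w - ref_logp_w
    margin_l = logp_l - ref_logp_l

    # Discounted sum over the chain, then a Bradley-Terry style
    # logistic loss on the winner-minus-loser margin.
    logits = beta * (discount * (margin_w - margin_l)).sum(dim=1)
    return -F.logsigmoid(logits).mean()
```

With gamma = 1 this reduces to an undiscounted per-step DPO-style objective; gamma < 1 shifts the preference signal toward the early, high-level steps of the T2I generation hierarchy, which is the asymmetry the paper argues for.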