Counting Guidance for High Fidelity Text-to-Image Synthesis

CoRR(2023)

引用 0|浏览21
暂无评分
摘要
Recently, the quality and performance of text-to-image generation significantly advanced due to the impressive results of diffusion models. However, text-to-image diffusion models still fail to generate high fidelity content with respect to the input prompt. One problem where text-to-diffusion models struggle is generating the exact number of objects specified in the text prompt. E.g. given a prompt "five apples and ten lemons on a table", diffusion-generated images usually contain the wrong number of objects. In this paper, we propose a method to improve diffusion models to focus on producing the correct object count given the input prompt. We adopt a counting network that performs reference-less class-agnostic counting for any given image. We calculate the gradients of the counting network and refine the predicted noise for each step. To handle multiple types of objects in the prompt, we use novel attention map guidance to obtain high-fidelity masks for each object. Finally, we guide the denoising process by the calculated gradients for each object. Through extensive experiments and evaluation, we demonstrate that our proposed guidance method greatly improves the fidelity of diffusion models to object count.
更多
查看译文
关键词
synthesis,guidance,text-to-image
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要