Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
CoRR (2024)
Abstract
Text-to-image (TTI) diffusion models have demonstrated impressive results in
generating high-resolution images of complex and imaginative scenes. Recent
approaches have further extended these methods with personalization techniques
that allow them to integrate user-illustrated concepts (e.g., the user
him/herself) using a few sample image illustrations. However, the ability to
generate images with multiple interacting concepts, such as human subjects, as
well as concepts that may be entangled in one, or across multiple, image
illustrations remains elusive. In this work, we propose a concept-driven TTI
personalization framework that addresses these core challenges. We build on
existing works that learn custom tokens for user-illustrated concepts, allowing
them to interact with existing text tokens in the TTI model. However,
importantly, to disentangle and better learn the concepts in question, we
jointly learn (latent) segmentation masks that disentangle these concepts in
user-provided image illustrations. We do so by introducing an Expectation
Maximization (EM)-like optimization procedure where we alternate between
learning the custom tokens and estimating masks encompassing corresponding
concepts in user-supplied images. We obtain these masks from the
cross-attention maps of the U-Net-parameterized latent diffusion model,
followed by DenseCRF optimization. We illustrate that such joint alternating
refinement leads to the learning of better tokens for concepts and, as a
by-product, latent masks. We demonstrate the benefits of the proposed approach
qualitatively and quantitatively (through user studies) with a number of
examples and use cases that can combine up to three entangled concepts.