LLMGA: Multimodal Large Language Model based Generation Assistant
CoRR (2023)
Abstract
In this paper, we introduce a Multimodal Large Language Model-based
Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and
proficiency in reasoning, comprehension, and response inherent in Large
Language Models (LLMs) to assist users in image generation and editing.
Diverging from existing approaches where Multimodal Large Language Models
(MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our
LLMGA provides a detailed language generation prompt for precise control over
SD. This not only augments LLM context understanding but also reduces noise in
generation prompts, yields images with more intricate and precise content, and
elevates the interpretability of the network. To this end, we curate a
comprehensive dataset comprising prompt refinement, similar image generation,
inpainting & outpainting, and visual question answering. Moreover, we
propose a two-stage training scheme. In the first stage, we train the MLLM to
grasp the properties of image generation and editing, enabling it to generate
detailed prompts. In the second stage, we optimize SD to align with the MLLM's
generation prompts. Additionally, we propose a reference-based restoration
network to alleviate texture, brightness, and contrast disparities between
generated and preserved regions during image editing. Extensive results show
that LLMGA has promising generative capabilities and can enable wider
applications in an interactive manner.
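
The abstract's core idea is that the MLLM emits a detailed natural-language prompt that then conditions Stable Diffusion, rather than a fixed-size embedding. The following is a minimal Python sketch of that pipeline using the Hugging Face diffusers library; `refine_prompt` is a hypothetical stand-in for LLMGA's trained MLLM, and the model ID and prompt wording are illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch (not the authors' code): an MLLM expands a terse user
# request into a detailed generation prompt, which then conditions
# Stable Diffusion via ordinary text conditioning.
import torch
from diffusers import StableDiffusionPipeline


def refine_prompt(user_request: str) -> str:
    # Hypothetical placeholder: LLMGA would invoke its fine-tuned MLLM
    # here to produce a detailed, low-noise prompt describing content,
    # composition, and style. A hand-written expansion stands in below.
    return (
        f"{user_request}, highly detailed, coherent composition, "
        "natural lighting, sharp focus"
    )


# Load a standard Stable Diffusion checkpoint (assumed model ID).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The detailed prompt, not a learned embedding, controls generation.
detailed_prompt = refine_prompt("a cat sitting on a windowsill at dusk")
image = pipe(detailed_prompt).images[0]
image.save("output.png")
```

In the paper's second training stage, SD itself is further optimized to align with such MLLM-generated prompts; the sketch above omits that fine-tuning and shows only the inference-time data flow.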