Guiding Instruction-based Image Editing via Multimodal Large Language Models
ICLR 2024 (2023)
Abstract
Instruction-based image editing improves the controllability and flexibility
of image manipulation via natural commands without elaborate descriptions or
regional masks. However, human instructions are sometimes too brief for current
methods to capture and follow. Multimodal large language models (MLLMs) show
promising capabilities in cross-modal understanding and visual-aware response
generation via LMs. We investigate how MLLMs facilitate edit instructions and
present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive
instructions and provides explicit guidance. The editing model jointly captures
this visual imagination and performs manipulation through end-to-end training.
We evaluate various aspects of Photoshop-style modification, global photo
optimization, and local editing. Extensive experimental results demonstrate
that expressive instructions are crucial to instruction-based image editing,
and our MGIE can lead to a notable improvement in automatic metrics and human
evaluation while maintaining competitive inference efficiency.
Keywords
image editing, multimodal large language model