Large Language Models as Consistent Story Visualizers
CoRR(2023)
摘要
Recent generative models have demonstrated impressive capabilities in
generating realistic and visually pleasing images grounded on textual prompts.
Nevertheless, a significant challenge remains in applying these models for the
more intricate task of story visualization. Since it requires resolving
pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution,
and ensuring consistent characters and background synthesis across frames. Yet,
the emerging Large Language Model (LLM) showcases robust reasoning abilities to
navigate through ambiguous references and process extensive sequences.
Therefore, we introduce \textbf{StoryGPT-V}, which leverages the merits of the
latent diffusion (LDM) and LLM to produce images with consistent and
high-quality characters grounded on given story descriptions. First, we train a
character-aware LDM, which takes character-augmented semantic embedding as
input and includes the supervision of the cross-attention map using character
segmentation masks, aiming to enhance character generation accuracy and
faithfulness. In the second stage, we enable an alignment between the output of
LLM and the character-augmented embedding residing in the input space of the
first-stage model. This harnesses the reasoning ability of LLM to address
ambiguous references and the comprehension capability to memorize the context.
We conduct comprehensive experiments on two visual story visualization
benchmarks. Our model reports superior quantitative results and consistently
generates accurate characters of remarkable quality with low memory
consumption. Our code will be made publicly available.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要