Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
"I Can't Believe It's Not Better!" Workshop Series(2023)
Abstract
Recent advances in image tokenizers, such as VQ-VAE, have enabled
text-to-image generation using auto-regressive methods, similar to language
modeling. However, these methods have yet to leverage pre-trained language
models, despite their adaptability to various downstream tasks. In this work,
we explore this gap by adapting a pre-trained language model for
auto-regressive text-to-image generation, and find that pre-trained language
models offer limited help. We provide a two-fold explanation by analyzing
tokens from each modality. First, we demonstrate that image tokens possess
significantly different semantics compared to text tokens, rendering
pre-trained language models no more effective in modeling them than randomly
initialized ones. Second, the text tokens in the image-text datasets are too
simple compared to normal language model pre-training data, which causes the
catastrophic degradation of language models' capability.
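
To make the setup concrete, below is a minimal, hypothetical sketch (not from the paper) of auto-regressive text-to-image modeling over a shared token sequence: text tokens from a language-model tokenizer are concatenated with discrete image tokens from a VQ-VAE-style codebook, and a decoder-only Transformer is trained with an ordinary next-token objective. All vocabulary sizes, sequence lengths, and the toy model itself are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary sizes: text tokens from an LM tokenizer,
# image tokens from a VQ-VAE-style codebook.
TEXT_VOCAB = 32000
IMAGE_VOCAB = 8192
SEQ_TEXT, SEQ_IMAGE = 16, 64  # toy sequence lengths

class TinyAutoregressiveT2I(nn.Module):
    """Minimal decoder-only model over a combined text+image vocabulary."""
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB  # image ids are offset after text ids
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier tokens.
        seq_len = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

# One training step: next-token prediction over the concatenated sequence
# [text tokens ; image tokens], exactly like ordinary language modeling.
text_ids = torch.randint(0, TEXT_VOCAB, (2, SEQ_TEXT))
image_ids = torch.randint(0, IMAGE_VOCAB, (2, SEQ_IMAGE)) + TEXT_VOCAB
tokens = torch.cat([text_ids, image_ids], dim=1)

model = TinyAutoregressiveT2I()
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
print(loss.item())
```

In this framing, initializing the Transformer from a pre-trained language model is a drop-in change; the paper's finding is that doing so offers little benefit because the image-token portion of the sequence does not share the semantics the language model was pre-trained on.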