Multi-modal Auto-regressive Modeling via Visual Words
arXiv (2024)
Abstract
Large Language Models (LLMs), benefiting from auto-regressive modelling over massive corpora of unannotated text, demonstrate powerful perceptual and reasoning capabilities. However, extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs) faces a major difficulty: image information is processed in the LMM as continuous visual embeddings, which cannot provide the discrete supervised labels needed for classification. In this paper, we successfully perform multi-modal auto-regressive modelling with a unified objective for the first time. Specifically, we propose the concept of visual words, which maps visual features to probability distributions over the LLM's vocabulary, providing supervision information for visual modelling. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.
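The core "visual words" idea, as described in the abstract, can be sketched as a learned projection from continuous visual features to a probability distribution over the LLM's text vocabulary. The sketch below is an assumption-laden illustration, not the paper's actual architecture: the projection weights, dimensions, and the plain softmax mapping are all hypothetical placeholders.

```python
import numpy as np

def visual_words(visual_feats, W, b):
    """Map continuous visual features to probability distributions over
    the LLM vocabulary (hedged sketch: a linear projection followed by
    a softmax; the paper's exact mapping may differ)."""
    logits = visual_feats @ W + b                    # (..., vocab_size)
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
visual_dim, vocab_size = 64, 1000                   # toy sizes, not from the paper
feats = rng.standard_normal((2, 16, visual_dim))    # batch of patch features
W = rng.standard_normal((visual_dim, vocab_size)) * 0.02
b = np.zeros(vocab_size)

probs = visual_words(feats, W, b)
print(probs.shape)  # each visual token now has a distribution over the vocab
```

Each visual token thus receives a soft "label" over the text vocabulary, which is what supplies the discrete-style supervision signal the abstract says continuous embeddings lack.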