How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
arXiv (2024)
Abstract
Scaling up the parameter count of multimodal large language models (MLLMs)
has brought significant new capabilities, particularly in-context learning, in
which MLLMs improve task performance without updating pre-trained parameters. This
effectiveness, however, hinges on the appropriate selection of in-context
examples, a process that is currently biased towards visual data and overlooks
textual information. Furthermore, supervised retrievers for MLLMs, which are
crucial for optimal in-context example selection, remain unexplored. Our study
offers an in-depth evaluation of the impact of
textual information on the unsupervised selection of in-context examples in
multimodal contexts, uncovering a notable sensitivity of retriever performance
to the modalities employed. In response, we introduce MSIER, a novel supervised
MLLM retriever that uses a neural network to select examples that improve the
efficiency of multimodal in-context learning. Extensive experiments across three
distinct tasks demonstrate the method's effectiveness. Additionally, we
investigate the influence of
modalities on our supervised retrieval method's training and pinpoint factors
contributing to our model's success. This exploration paves the way for future
advancements, highlighting the potential for refined in-context learning in
MLLMs through the strategic use of multimodal data.
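
To make the idea of modality-aware unsupervised selection concrete, the
following is a minimal sketch, not the authors' MSIER method or their
evaluation setup. It assumes each candidate example and the query already have
an image embedding and a text embedding (for instance from a CLIP-style
encoder), and ranks candidates by a weighted sum of cosine similarities. The
function names and the `text_weight` parameter are hypothetical and chosen only
for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def retrieve_examples(query_img, query_txt, pool_img, pool_txt,
                      k=2, text_weight=0.5):
    """Return indices of the k pool examples most similar to the query.

    Hypothetical helper: text_weight=0.0 gives image-only retrieval,
    text_weight=1.0 gives text-only retrieval, and values in between
    mix the two cosine similarities linearly.
    """
    img_sim = l2_normalize(pool_img) @ l2_normalize(query_img)
    txt_sim = l2_normalize(pool_txt) @ l2_normalize(query_txt)
    score = (1.0 - text_weight) * img_sim + text_weight * txt_sim
    # Sort by descending score and keep the top k indices.
    return np.argsort(-score)[:k].tolist()
```

Setting `text_weight` to zero reproduces a purely visual retriever, so sweeping
it is one simple way to observe how strongly the selected examples depend on
the modalities used.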