EAMA: Entity-Aware Multimodal Alignment Based Approach for News Image Captioning
arXiv (2024)
Abstract
News image captioning requires a model to generate an informative, entity-rich
caption from a news image and its associated news article. Though
Multimodal Large Language Models (MLLMs) have demonstrated remarkable
capabilities in addressing various vision-language tasks, our research finds
that current MLLMs remain limited in handling entity information on the
news image captioning task. Moreover, while MLLMs can process
long inputs, generating high-quality news image captions still requires a
trade-off between the sufficiency and conciseness of the textual input. To
explore the potential of MLLMs and address the problems we identified, we propose EAMA,
an Entity-Aware Multimodal Alignment based approach for news image captioning.
Our approach first aligns the MLLM through a Balance Training Strategy with two
extra alignment tasks, the Entity-Aware Sentence Selection task and the Entity
Selection task, together with the News Image Captioning task, to enhance its
capability in handling multimodal entity information. The aligned MLLM then
uses the additional entity-related information it explicitly extracts to
supplement its textual input while generating news image captions. Our approach
achieves a higher CIDEr score than all previous models on the GoodNews
dataset (72.33 -> 88.39) and the NYTimes800k dataset (70.83 -> 85.61).
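
For readers who want the size of the reported improvements at a glance, the short sketch below works out the absolute and relative CIDEr gains from the four scores quoted above. The helper name cider_gain and the dictionary layout are illustrative assumptions, not part of the paper; only the four numbers come from the abstract.

```python
from typing import Dict, Tuple

# CIDEr scores quoted in the abstract: (previous best, reported result).
scores: Dict[str, Tuple[float, float]] = {
    "GoodNews": (72.33, 88.39),
    "NYTimes800k": (70.83, 85.61),
}


def cider_gain(previous: float, current: float) -> Tuple[float, float]:
    """Hypothetical helper: (absolute gain in points, relative gain in percent)."""
    absolute = round(current - previous, 2)
    relative = round(100.0 * (current - previous) / previous, 1)
    return absolute, relative


gains = {name: cider_gain(prev, cur) for name, (prev, cur) in scores.items()}

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.2f} CIDEr ({relative:.1f}% relative)")
```

Under these numbers the gain is about 16.06 points (roughly 22.2% relative) on GoodNews and about 14.78 points (roughly 20.9% relative) on NYTimes800k.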