A Text-Guided Generation and Refinement Model for Image Captioning

IEEE Transactions on Multimedia (2023)

Abstract
A high-quality image description requires not only the logic and fluency of language but also the richness and accuracy of content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language struggle to meet these two requirements simultaneously. Inspired by the progressive learning mechanism, we follow the "generating + refining" route and propose a novel Text-Guided Generation and Refinement (dubbed TGGAR) model that uses a guide text to improve the quality of captions. The guide text is selected from the training set according to content similarity, and is then used to explore salient objects and extend candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. We further divide the decoder into two sub-modules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module, and iteratively modifies details in the primary caption to make it rich and accurate. Extensive experiments on the MSCOCO captioning dataset demonstrate that our framework, with fewer parameters, remains comparable to transformer-based methods and achieves state-of-the-art performance compared with other relevant approaches.
Keywords
Attention mechanism, generating and refining decoder, image captioning, text-guided
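
The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the "generate + refine" decoding idea it outlines: a Generator (LSTM + GOA) produces a primary caption, a caption encoder re-reads it, and a Refiner (attention-based LSTM + GOA) rewrites it. All module names (GateOnAttention, GenerateRefineDecoder), dimensions, the sigmoid-gate formulation of GOA, greedy decoding, and the single refinement pass are illustrative assumptions, not the authors' implementation.

# Sketch only: sizes, names, and the single-pass refinement are assumptions.
import torch
import torch.nn as nn

class GateOnAttention(nn.Module):
    # Assumed reading of "Gate on Attention": a sigmoid gate that decides how
    # much attended context to mix into the LSTM hidden state.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden, context):
        g = torch.sigmoid(self.gate(torch.cat([hidden, context], dim=-1)))
        return g * context + (1 - g) * hidden

class GenerateRefineDecoder(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Generator: standard LSTM + GOA over the encoded image features.
        self.gen_lstm = nn.LSTMCell(dim, dim)
        self.gen_goa = GateOnAttention(dim)
        # Refiner: caption encoder, then an attention-based LSTM + GOA that
        # re-decodes while also attending to the primary caption.
        self.cap_encoder = nn.GRU(dim, dim, batch_first=True)
        self.ref_lstm = nn.LSTMCell(dim, dim)
        self.ref_goa = GateOnAttention(dim)
        self.out = nn.Linear(dim, vocab_size)

    def attend(self, query, feats):
        # Simple dot-product attention over features of shape (B, R, D).
        scores = torch.softmax((feats @ query.unsqueeze(-1)).squeeze(-1), -1)
        return (scores.unsqueeze(-1) * feats).sum(1)

    def decode(self, feats, max_len, lstm, goa, extra=None):
        B, dim = feats.size(0), feats.size(-1)
        h = c = feats.new_zeros(B, dim)
        tok = feats.new_zeros(B, dtype=torch.long)  # <bos> assumed index 0
        words = []
        for _ in range(max_len):
            h, c = lstm(self.embed(tok), (h, c))
            mem = feats if extra is None else torch.cat([feats, extra], 1)
            ctx = self.attend(h, mem)
            tok = self.out(goa(h, ctx)).argmax(-1)  # greedy decoding
            words.append(tok)
        return torch.stack(words, 1)

    def forward(self, feats, max_len=16):
        primary = self.decode(feats, max_len, self.gen_lstm, self.gen_goa)
        cap_feats, _ = self.cap_encoder(self.embed(primary))
        refined = self.decode(feats, max_len, self.ref_lstm, self.ref_goa,
                              extra=cap_feats)
        return primary, refined

decoder = GenerateRefineDecoder(vocab_size=10000)
image_regions = torch.randn(2, 36, 512)  # e.g. 36 region features per image
primary, refined = decoder(image_regions)
print(primary.shape, refined.shape)  # torch.Size([2, 16]) for both

In this sketch the Refiner sees both the image regions and the encoded primary caption in its attention memory, which is one plausible way to let it "modify details" of the first-pass sentence; the paper's exact refinement mechanism may differ.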