Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

Cited by 55 | Views 297
Abstract
OCR-based image captioning is the task of automatically describing an image by reading and understanding the written text it contains. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties lie in making full use of the knowledge carried by the textual entities to facilitate sentence generation, and in predicting a text token from the limited information the image provides. These problems have not been fully investigated in existing research. In this paper, we present a novel architecture, the Multimodal Attention Captioner with OCR Spatial Relationship (dubbed MMA-SR), which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner, and separate attention scores for text tokens and objects are computed by the multimodal attention network. Based on the attended features and the LSTM states, each word is selected either from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. Notably, MMA-SR increases the CIDEr-D score from 93.7% to 98.0%.
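To make the decoding step described in the abstract concrete, below is a minimal, illustrative PyTorch sketch of one MMA-SR-style step: attending separately over OCR-token and object features, then jointly scoring common-vocabulary words against copying an OCR token from the image. All module names, dimensions, and the simple linear attention and fusion forms here are assumptions for illustration only, not the paper's actual implementation; the learned spatial-relationship modeling between OCR tokens is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentionStep(nn.Module):
    """One decoding step: attend over OCR-token and object features
    separately, then choose between generating a vocabulary word and
    copying an OCR token (a pointer-style choice, as in the abstract)."""
    def __init__(self, feat_dim=2048, hid_dim=1024, vocab_size=10000):
        super().__init__()
        self.ocr_att = nn.Linear(feat_dim + hid_dim, 1)    # attention over OCR tokens
        self.obj_att = nn.Linear(feat_dim + hid_dim, 1)    # attention over objects
        self.vocab_head = nn.Linear(feat_dim * 2 + hid_dim, vocab_size)
        self.copy_head = nn.Linear(feat_dim + hid_dim, 1)  # one score per OCR token

    def forward(self, h, ocr_feats, obj_feats):
        # h: (B, hid_dim); ocr_feats: (B, N_ocr, feat_dim); obj_feats: (B, N_obj, feat_dim)
        def attend(att, feats):
            q = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            w = F.softmax(att(torch.cat([feats, q], -1)).squeeze(-1), dim=-1)
            return torch.bmm(w.unsqueeze(1), feats).squeeze(1)  # (B, feat_dim)

        ocr_ctx = attend(self.ocr_att, ocr_feats)   # attended OCR context
        obj_ctx = attend(self.obj_att, obj_feats)   # attended object context

        # Vocabulary logits from the fused contexts and the LSTM state.
        vocab_logits = self.vocab_head(torch.cat([ocr_ctx, obj_ctx, h], -1))
        # Copy logits: score each OCR token in the image against the state.
        q = h.unsqueeze(1).expand(-1, ocr_feats.size(1), -1)
        copy_logits = self.copy_head(torch.cat([ocr_feats, q], -1)).squeeze(-1)
        # Concatenate so the next word comes from the vocabulary OR the image text.
        return torch.cat([vocab_logits, copy_logits], dim=-1)

# Usage with random tensors (batch of 2, 5 OCR tokens, 8 objects):
step = MultimodalAttentionStep()
logits = step(torch.randn(2, 1024), torch.randn(2, 5, 2048), torch.randn(2, 8, 2048))
print(logits.shape)  # torch.Size([2, 10005]) = vocab_size + N_ocr
```

In the paper, the copy scores would additionally incorporate the learned spatial relationships between OCR tokens; here they depend only on the token features and the decoder state, which is the main simplification of this sketch.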
Keywords
OCR-Based Image Captioning, Multimodal Attention, OCR Spatial Relationship