VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation.

Neurocomputing (2019)

Cited by 43
Abstract
Recently, attributes have demonstrated their effectiveness in guiding image captioning systems. However, most attribute-based image captioning methods treat attribute prediction as a separate task and rely on a standalone stage to obtain the attributes for a given image; for example, a pre-trained network such as a Fully Convolutional Network (FCN) is usually adopted. Inherently, these methods ignore the correlation between the attribute prediction task and the image representation extraction task, and at the same time increase the complexity of the image captioning system. In this paper, we aim to couple the attribute prediction stage and the image representation extraction stage tightly, and we propose a novel and efficient image captioning framework called the Visual-Densely Semantic Attention Network (VD-SAN). In particular, the whole captioning system consists of shared convolutional layers from a Dense Convolutional Network (DenseNet), which are further split into a semantic attribute prediction branch and an image feature extraction branch, two semantic attention models, and a long short-term memory network (LSTM) for caption generation. To evaluate the proposed architecture, we construct the Flickr30K-ATT and MS-COCO-ATT datasets based on the popular image captioning datasets Flickr30K and MS COCO, respectively; each image in Flickr30K-ATT or MS-COCO-ATT is annotated with an attribute list in addition to its captions. Empirical results demonstrate that our captioning system achieves significant improvements over state-of-the-art approaches.
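The abstract does not give the exact form of the semantic attention models, but the general idea, scoring each predicted attribute against the decoder's hidden state and mixing the attribute embeddings with the resulting weights, can be sketched as below. The bilinear scoring matrix `W` and all dimensions are illustrative assumptions, not details from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def semantic_attention(h, attr_emb, W):
    """One semantic-attention step (a generic sketch, not the paper's
    exact formulation): score each attribute embedding against the
    LSTM hidden state with a bilinear form, then return the
    attention-weighted attribute context vector."""
    scores = attr_emb @ (W @ h)   # (k,) one score per attribute
    alpha = softmax(scores)       # (k,) attention weights, sum to 1
    context = alpha @ attr_emb    # (d,) weighted attribute context
    return context, alpha


# Toy sizes (hypothetical): 5 predicted attributes, 8-dim embeddings
# and an 8-dim LSTM hidden state.
rng = np.random.default_rng(0)
h = rng.standard_normal(8)
attr_emb = rng.standard_normal((5, 8))
W = rng.standard_normal((8, 8))
context, alpha = semantic_attention(h, attr_emb, W)
```

At each decoding step the context vector would be fed into the LSTM alongside the visual features, so attributes with high attention weight steer the next generated word.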
Keywords
Image caption, Semantic attributes, Convolutional neural network, Long short-term memory networks