Delving Into Precise Attention In Image Captioning

NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V(2019)

Abstract
Recent image captioning models usually take the output of the last convolutional layer of a pretrained CNN encoder directly as the image representation. This intuitive design has two weaknesses: (1) the top-layer feature is not position-sensitive, which makes it harder for the decoder to generate precise spatial attention over the object of interest; (2) irrelevant features can mislead the decoder into focusing on irrelevant regions. To tackle these weaknesses, we propose the Feature Selection and Fusion Network (FSFN). To address the first weakness, a Feature Fusion module generates fine-grained, position-sensitive features by fusing multi-scale features. To address the second, a Feature Selection module selects the more informative features, which prevents the decoder from focusing on irrelevant regions. Extensive experiments demonstrate that our model addresses both weaknesses and achieves results comparable to the state of the art on the MSCOCO dataset under cross-entropy loss, without any bells and whistles. Furthermore, our model improves performance across different encoders and decoders.
Keywords
Image captioning, Feature selection, Feature fusion
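
The abstract names two ideas, fusing multi-scale features and then selecting the informative ones, without giving their exact form. The following is only an illustrative sketch of one common way such steps are realized, not the FSFN architecture itself: the function names, the nearest-neighbour upsampling, the channel concatenation, and the sigmoid channel gate are all assumptions made for this example.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_features(fine, coarse):
    """Assumed fusion: upsample the coarse map to the fine map's
    resolution and concatenate the two along the channel axis."""
    factor = fine.shape[1] // coarse.shape[1]
    return np.concatenate([fine, upsample_nearest(coarse, factor)], axis=0)

def select_features(feat, w, b):
    """Assumed selection: a per-channel sigmoid gate computed from
    globally average-pooled features (hypothetical weights w, b)."""
    pooled = feat.mean(axis=(1, 2))                  # shape (C,)
    gate = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))   # values in (0, 1)
    return feat * gate[:, None, None]

# Toy inputs standing in for two CNN layers at different scales.
rng = np.random.default_rng(0)
fine = rng.standard_normal((4, 14, 14))    # earlier layer, higher resolution
coarse = rng.standard_normal((8, 7, 7))    # top layer, lower resolution

fused = fuse_features(fine, coarse)        # shape (12, 14, 14)
w = rng.standard_normal((12, 12)) * 0.1    # random, untrained gate weights
b = np.zeros(12)
selected = select_features(fused, w, b)    # same shape, channels rescaled
```

In this sketch the fused map keeps the spatial resolution of the earlier layer, which is what would give a downstream attention mechanism finer positions to attend over, while the gate only rescales channels and never changes the shape.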