Label-attention transformer with geometrically coherent objects for image captioning

Information Sciences (2023)

Cited by 6 | Views 75
Abstract
Encoder-decoder-based image captioning techniques are generally utilized to describe meaningful information present in an image. In this work, we investigate two unexplored ideas for image captioning using the transformer: 1) an object-focused label attention module (LAM), and 2) a geometrically coherent proposal (GCP) module that focuses on the scale and position of objects to benefit the transformer model by attaining better image perception. These modules enforce objects' relevance to their surrounding environment and explore the effectiveness of learning an explicit association between vision and language constructs. LAM and GCP tolerate variation in object classes and their association with labels in multi-label classification. The proposed framework, label-attention transformer with geometrically coherent objects (LATGeO), acquires proposals of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using LAM. The LAM module associates the extracted object classes with the available dictionary using self-attention layers. Object coherence is acquired in the GCP module using the localized ratio of the proposals' geometrical features. In this study, experiments are performed on the MSCOCO dataset. The evaluation of LATGeO on MSCOCO demonstrates that objects' relevance to their surroundings, together with their visual features bound to geometrically localized ratios and associated labels, generates improved and meaningful captions.
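The "localized ratio of the proposals' geometrical features" can be illustrated with a minimal sketch: pairwise relative-geometry features computed from two proposal bounding boxes, similar to the relation features common in object-relation transformers. This is a hypothetical formulation for illustration only; the paper's exact GCP features are defined in the full text, and the function name `geometric_relation` is an assumption.

```python
import numpy as np

def geometric_relation(box_a, box_b):
    """Relative geometry between two object proposals.

    Boxes are (x, y, w, h) tuples. Returns scale- and
    position-normalized ratios between the two proposals.
    Hypothetical sketch of GCP-style "localized ratios";
    the paper's exact formulation may differ.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return np.array([
        (xb - xa) / wa,   # horizontal offset, normalized by box A's width
        (yb - ya) / ha,   # vertical offset, normalized by box A's height
        np.log(wb / wa),  # relative width (log ratio)
        np.log(hb / ha),  # relative height (log ratio)
    ])

# e.g. a "bicycle" proposal directly to the right of an
# equally sized "person" proposal:
rel = geometric_relation((0.0, 0.0, 10.0, 10.0), (10.0, 0.0, 10.0, 10.0))
# → offsets (1.0, 0.0), identical scale (0.0, 0.0)
```

Such ratio features are translation- and scale-invariant within the image, which is one plausible reason position- and scale-aware proposals help the transformer perceive object relationships.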
Keywords
Image captioning,Transformers,Self-attention,Label-attention,Geometrically coherent proposals,Memory-augmented-attention