Triple-level relationship enhanced transformer for image captioning

MULTIMEDIA SYSTEMS(2023)

引用 0|浏览6
暂无评分
摘要
Region features and grid features are often used in the field of image captioning. As they are often extracted by different networks, fusing them for image captioning needs connections between them. However, these connections often rely on simple coordinates, which will lead the captions lacks of precise expression of visual relationships. Meanwhile, the scene graph features contain object relationship information, through the multi-layer calculation, and the extracted object relationship information is of higher level and more complete, which can compensate the shortage of region features and grid features to a certain extent. Therefore, a Triple-Level Relationship Enhanced Transformer (TRET) is proposed in this paper, which can process three features in parallel. TRET can obtain and combine different levels of object relationship features to achieve the advantages of complementarity between different features. Specifically, we obtain high-level object-relational information with the help of Graph Based Attention, and achieve the fusion of low-level relational information and high-level object-relational information with the help of Cross Relationship Enhanced Attention, so as to better align the information of both modalities, visual and text. To validate our model, we conduct comprehensive experiments on the MS-COCO dataset. The results indicate that our method achieves better performance compared with the existing state-of-the-art methods and effectively enhances the ability of describing the representation of object relationships in the generated outcomes.
更多
查看译文
关键词
Image captioning,Transformer,Attention,Feature fusion
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要