Feature fusion via multi-target learning for ancient artwork captioning.

Inf. Fusion (2023)

Abstract
Image captioning has made steady progress thanks to advances in computer vision and natural language processing. Current research on image captioning focuses mainly on natural images. However, these approaches are not applicable to ancient artwork captioning, which involves different appearance attributes and complex cultural metaphors. In this work, we propose a Multi-target Learning Framework (MLF) for generating captions for ancient artworks, with ceramics as the case study. MLF contains three novel data-driven modules: RTE, MTE, and MFD. Specifically, for a given image, the Regular Target Encoder (RTE) first encodes its regular targets, which relate to color, textual, profile, and craftsmanship features. Second, the Metaphorical Target Encoder (MTE) encodes its metaphorical targets, which relate to cultural semantic features. Finally, the Multimodal Fused Decoder (MFD) fuses the multimodal feature vectors from RTE and MTE separately, then decodes them to generate detailed captions containing both regular and metaphorical information, guided by a word distribution map. Both quantitative and qualitative evaluation results on our constructed dataset demonstrate the advantages of our work.
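The encode, fuse, and decode flow described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the abstract does not specify the encoder architectures, the fusion operator, or how the word distribution map enters decoding. Concatenation as the fusion step, a dot-product word scorer, and a multiplicative per-word prior standing in for the word distribution map are all assumptions made here for illustration, and every function and parameter name is hypothetical.

```python
from typing import Dict, List

# Attribute order for the regular targets, taken from the abstract's list.
REGULAR_KEYS = ["color", "textual", "profile", "craftsmanship"]


def encode_regular(attrs: Dict[str, List[float]]) -> List[float]:
    """Stand-in for RTE (assumption): concatenate per-attribute features."""
    return [x for key in REGULAR_KEYS for x in attrs[key]]


def encode_metaphorical(cultural: List[float]) -> List[float]:
    """Stand-in for MTE (assumption): pass cultural-semantic features through."""
    return list(cultural)


def fuse(regular: List[float], metaphorical: List[float]) -> List[float]:
    """Assumed fusion operator: plain concatenation of the two vectors."""
    return regular + metaphorical


def decode_step(
    fused: List[float],
    word_weights: Dict[str, List[float]],
    word_prior: Dict[str, float],
) -> str:
    """Pick one word: dot-product score scaled by a per-word prior.

    The prior is a hypothetical stand-in for the word distribution map.
    """
    def score(word: str) -> float:
        dot = sum(f * w for f, w in zip(fused, word_weights[word]))
        return dot * word_prior[word]

    return max(word_weights, key=score)
```

In this sketch, changing only the prior can change which word is selected, which is one simple way a word-level distribution could steer a decoder toward metaphorical vocabulary.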
Keywords
Image captioning, Ancient artwork, Multi-target learning, Feature fusion, Multimodal learning