Towards Bridged Vision and Language: Learning Cross-Modal Knowledge Representation for Relation Extraction
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2024)
Abstract
In natural language processing, relation extraction (RE) aims to detect and classify the semantic relationship between two given entities within a sentence. Previous RE methods consider only the textual content and suffer a performance decline on social media, where texts often lack context. Incorporating text-related visual information can supply the missing semantics for relation extraction in social media posts. However, textual relations are usually abstract and of high-level semantics, which creates a semantic gap between visual content and textual expressions. In this paper, we propose RECK, a neural network for relation extraction with cross-modal knowledge representations. Unlike previous multimodal methods that train a common subspace for all modalities, we bridge the semantic gap by explicitly selecting knowledge paths from external knowledge through cross-modal object-entity pairs. We further extend the paths into a knowledge graph and adopt a graph attention network to capture multi-grained relevant concepts, which provide higher-level and key semantic information from external knowledge. In addition, we employ a cross-modal attention mechanism to align and fuse the multimodal information. Experimental results on a multimodal RE dataset show that our model achieves new state-of-the-art performance with knowledge evidence.
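The cross-modal attention and fusion step described in the abstract can be illustrated by a minimal PyTorch sketch. This is not the authors' implementation: all module names, feature dimensions, and the relation count are assumptions chosen for illustration. Queries come from textual entity representations, keys and values from knowledge-graph concept features, and the attended knowledge context is concatenated with the text features for relation classification.

```python
# Minimal sketch (assumed dimensions and names, not the RECK implementation):
# cross-modal attention that aligns knowledge-graph concept features with
# textual entity features and fuses them for relation classification.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, concept_dim=256, hidden_dim=256, num_relations=23):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, hidden_dim)     # queries from textual entities
        self.k_proj = nn.Linear(concept_dim, hidden_dim)  # keys from knowledge concepts
        self.v_proj = nn.Linear(concept_dim, hidden_dim)  # values from knowledge concepts
        self.classifier = nn.Linear(text_dim + hidden_dim, num_relations)

    def forward(self, entity_repr, concept_repr):
        # entity_repr:  (batch, text_dim)                 pooled entity-pair representation
        # concept_repr: (batch, n_concepts, concept_dim)  concept nodes from the knowledge graph
        q = self.q_proj(entity_repr).unsqueeze(1)                 # (batch, 1, hidden)
        k = self.k_proj(concept_repr)                             # (batch, n, hidden)
        v = self.v_proj(concept_repr)                             # (batch, n, hidden)
        scores = torch.matmul(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)                      # attend over concepts
        knowledge_ctx = torch.matmul(attn, v).squeeze(1)          # (batch, hidden)
        fused = torch.cat([entity_repr, knowledge_ctx], dim=-1)   # fuse text + knowledge
        return self.classifier(fused)                             # relation logits


# Toy usage with random features standing in for text-encoder entity embeddings
# and graph-attention concept embeddings.
model = CrossModalAttentionFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 10, 256))
print(logits.shape)  # torch.Size([4, 23])
```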
Keywords
Multimodal relation extraction, graph attention network, knowledge graphs