Cross-Modal-Aware Representation Learning with Syntactic Hypergraph Convolutional Network for VideoQA

2023 IEEE International Conference on Multimedia and Expo (ICME 2023)

Abstract
A key challenge in video question answering (VideoQA) is how to accurately align textual concepts with their relevant visual regions across modalities. Existing methods mostly rely on alignment between individual words and relevant video regions, but an individual word generally cannot capture the complete information of a textual concept, which is often expressed by a composition of several words. To address this issue, we propose to build a syntactic dependency tree for each question with an off-the-shelf parser and use it to extract meaningful word compositions (i.e., textual concepts). By viewing the words and compositions as nodes and hyperedges, respectively, a hypergraph convolutional network (HCN) is built to learn the representations of textual concepts. Then, to enable cross-modal interaction between relevant concepts from different modalities, an optimal transport (OT) based alignment method is developed to establish the connection between textual concepts and their relevant visual regions. Experimental results on three benchmarks show that our method outperforms all competing baselines. Further analyses demonstrate the effectiveness of each component and show that our model is good at modeling different levels of semantic composition and filtering out irrelevant information.
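The abstract outlines two mechanisms: a hypergraph convolution over words (nodes) and dependency-derived word compositions (hyperedges), and an OT-based alignment between textual concepts and visual regions. The Python sketch below is not the authors' implementation; it illustrates both ideas under stated assumptions (PyTorch, a hand-written toy dependency parse, a standard HGNN propagation rule, and Sinkhorn iterations as the OT solver). All tensor shapes and the composition-extraction rule are illustrative.

```python
# Minimal sketch (assumptions only, not the paper's code): hypergraph convolution over
# dependency-derived word compositions, followed by Sinkhorn-style OT alignment between
# textual concepts and visual region features.
import torch
import torch.nn.functional as F

# Toy question with a hand-written dependency parse; a real system would obtain heads
# from an off-the-shelf parser such as spaCy or Stanza.
words = ["what", "is", "the", "man", "holding"]
heads = [4, 4, 3, 4, -1]  # head index of each word; -1 marks the root ("holding")

# Each (head, dependent) arc defines a word composition; one whole-question composition
# is added so hyperedges can span more than two words.
compositions = [{i, h} for i, h in enumerate(heads) if h >= 0]
compositions.append(set(range(len(words))))

# Incidence matrix H: H[v, e] = 1 if word v belongs to composition (hyperedge) e.
n_nodes, n_edges = len(words), len(compositions)
H = torch.zeros(n_nodes, n_edges)
for e, comp in enumerate(compositions):
    for v in comp:
        H[v, e] = 1.0

def hypergraph_conv(X, H, theta):
    # Standard HGNN propagation: X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta,
    # used here as a stand-in for the paper's syntactic HCN.
    W = torch.eye(H.size(1))                                # uniform hyperedge weights
    Dv = torch.diag(H.sum(dim=1).clamp(min=1).pow(-0.5))    # node degrees
    De = torch.diag(H.sum(dim=0).clamp(min=1).pow(-1.0))    # hyperedge degrees
    return Dv @ H @ W @ De @ H.t() @ Dv @ X @ theta

d = 16
X = torch.randn(n_nodes, d)                  # word embeddings (random stand-ins)
theta = torch.randn(d, d) * 0.1
concepts = F.relu(hypergraph_conv(X, H, theta))   # composition-aware word representations

def sinkhorn(cost, eps=0.1, iters=50):
    # Entropy-regularized OT plan between uniform marginals via Sinkhorn iterations.
    K = torch.exp(-cost / eps)
    r = torch.full((cost.size(0),), 1.0 / cost.size(0))
    c = torch.full((cost.size(1),), 1.0 / cost.size(1))
    u, v = torch.ones_like(r), torch.ones_like(c)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)  # rows ~ textual concepts, cols ~ regions

regions = torch.randn(8, d)                  # visual region features (random stand-ins)
cost = 1.0 - F.normalize(concepts, dim=-1) @ F.normalize(regions, dim=-1).t()  # cosine cost
plan = sinkhorn(cost)
print(plan.shape, plan.sum())                # (5, 8); total transported mass ~ 1
```

High-mass entries of the transport plan indicate which visual regions a textual concept should attend to; in the paper this alignment drives the cross-modal interaction, whereas here it is only shown on random features.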
Keywords
Video question answering, syntactic hypergraph convolutional network, optimal transport