Dual attentional transformer for video visual relation prediction

Neurocomputing (2023)

Abstract
Video visual relation detection (VidVRD) aims to detect visual relations among instances, as well as the trajectories of the corresponding subjects and objects, in a video. Most current works improve the accuracy of object tracking but neglect the other key challenge: predicting reliable visual relations in videos, which is vital for downstream tasks. In this paper, we propose a dual attentional transformer network (VRD-DAT) for predicting the visual relations, also known as predicates, in multi-relation videos. Specifically, our network first models action visual predicates (Act-T) and spatially localized visual relations (Spa-T) via two parallel visual transformer structures simultaneously. Then, an attentional weighting module produces the final, precisely merged visual relations. We conduct extensive experiments on two public datasets, ImageNet-VidVRD and VidOR, to demonstrate that our model outperforms other state-of-the-art methods on the task of video visual relation prediction. Quantitative and qualitative results also show that, with more accurate visual relations, the performance of the video visual relation detection task can be further boosted. © 2023 Elsevier B.V. All rights reserved.
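The abstract describes a two-branch design: parallel transformer encoders for action cues (Act-T) and spatial cues (Spa-T), whose outputs are merged by an attentional weighting module before predicate classification. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the module names, feature dimensions, temporal pooling, and scalar-gate fusion are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualAttentionalTransformer(nn.Module):
    """Hypothetical sketch of the VRD-DAT idea: two parallel transformer
    branches (Act-T for action predicates, Spa-T for spatial relations)
    fused by an attentional weighting module. Sizes are illustrative."""

    def __init__(self, dim=256, heads=8, layers=2, num_predicates=132):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=layers)
        self.act_t = make_encoder()  # action branch (Act-T)
        self.spa_t = make_encoder()  # spatial branch (Spa-T)
        # Attentional weighting: softmax gate over the two branch outputs
        # (an assumed fusion scheme, not taken from the paper).
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_predicates)

    def forward(self, pair_feats):
        # pair_feats: (batch, time, dim) features of a subject-object tracklet pair
        a = self.act_t(pair_feats).mean(dim=1)  # temporal pooling per branch
        s = self.spa_t(pair_feats).mean(dim=1)
        w = self.gate(torch.cat([a, s], dim=-1))  # (batch, 2) branch weights
        fused = w[:, :1] * a + w[:, 1:] * s       # attentionally weighted merge
        return self.classifier(fused)             # predicate logits

# Usage: score predicates for 4 subject-object pairs over 8 frames
model = DualAttentionalTransformer()
logits = model(torch.randn(4, 8, 256))
print(logits.shape)  # torch.Size([4, 132])
```

The output size of 132 matches the number of predicate categories in ImageNet-VidVRD; the per-pair input features and the scalar gating are placeholders for whatever encoding and fusion the paper actually uses.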
Keywords
Video visual relation prediction, Dual attentional transformer, Video visual relation detection