Multi-Scale Human-Object Interaction Detector

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
Transformers are reshaping computer vision, especially image-level recognition and instance-level detection. The human-object interaction detection transformer (HOI-TR) is the first transformer-based end-to-end learning system for human-object interaction (HOI) detection. Vision transformers, the first patch-based transformer architecture for image-level recognition and instance-level detection, build a simple multi-stage structure for multi-scale representation from single-scale patches. In this paper, we build a transformer-based multi-scale human-object interaction detector (MHOI), a novel method that integrates the vision transformer with the HOI detection transformer. We do not combine the two transformers directly, because single-scale patch partitioning leaves the vision transformer without a hierarchical architecture for handling large variations in the scale of visual entities. Specifically, MHOI uses overlapping convolutional patch embedding to embed patches of variable scales simultaneously into features of the same size (i.e., sequence length). It then introduces an efficient transformer decoder whose queries are designed around anchor points, together with essential auxiliary techniques, to boost HOI detection performance. Extensive experiments on several benchmarks show that the proposed framework consistently outperforms prior methods, achieving 29.67 mAP on HICO-DET and 58.7 mAP on V-COCO.
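
To illustrate how patches of different scales can yield features of the same sequence length, the following is a minimal sketch and not the authors' implementation. It assumes a shared stride, odd kernel sizes, and padding of kernel // 2; the image size, stride, and kernel sizes are hypothetical values chosen only for illustration. It applies the standard convolution output-size formula to show that the token count does not depend on the patch (kernel) size under these assumptions.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def sequence_length(height: int, width: int, kernel: int, stride: int) -> int:
    """Number of tokens produced by a convolutional patch embedding.

    Assumption (not from the paper): padding is kernel // 2, so with an odd
    kernel larger than the stride, neighbouring patches overlap.
    """
    padding = kernel // 2
    out_h = conv_output_size(height, kernel, stride, padding)
    out_w = conv_output_size(width, kernel, stride, padding)
    return out_h * out_w


# Hypothetical settings, chosen only for illustration.
image_hw = (224, 224)
stride = 4
kernels = (3, 5, 7)  # three patch scales, two of them larger than the stride

# Sequence length for each patch scale.
lengths = {k: sequence_length(*image_hw, k, stride) for k in kernels}
print(lengths)
```

Because every scale produces the same number of tokens under these assumptions, the per-scale embeddings could be combined position by position, for example by concatenation along the channel dimension. Whether MHOI combines them this way is not stated in the abstract.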
Keywords
Human–object interaction, vision transformer, multi-scale