Learning Human-Object Interaction Detection using Interaction Points

CVPR, pp. 4115-4124, 2020.

Keywords:
convolutional neural networks, keypoint detection, object detection, interaction point, interaction vector

Abstract:

Understanding interactions between humans and objects is one of the fundamental problems in visual classification and an essential step towards detailed scene understanding. Human-object interaction (HOI) detection strives to localize both the human and an object, as well as to identify the complex interactions between them. Most e…

Introduction
  • Detailed semantic understanding of image contents, beyond instance-level recognition, is one of the fundamental problems in computer vision.
  • HOI scenarios include a human interacting with multiple objects (“sit on a couch and type on laptop”), multiple humans sharing the same interaction and object (“throw and catch ball”), and fine-grained interactions (“walk horse”, “feed horse” and “jump horse”).
  • These complex and diverse interaction scenarios impose significant challenges when designing an HOI detection solution.
  • Individual scores from the three streams are fused in a late-fusion manner for interaction recognition.
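The late-fusion scheme described above combines per-stream confidences into one interaction score. A minimal sketch, assuming multiplicative fusion (the summary does not specify the exact combination rule, so both the function name and the fusion rule here are illustrative assumptions):

```python
def late_fusion(human_score, object_score, pairwise_score):
    """Fuse per-stream interaction confidences into a single score.

    Multiplicative fusion is one common choice for combining stream
    scores; the exact rule used by the multi-stream baselines is an
    assumption here, not taken from the source.
    """
    return human_score * object_score * pairwise_score

# Example: stream confidences for an interaction such as "kick sports ball"
score = late_fusion(0.9, 0.8, 0.7)  # 0.9 * 0.8 * 0.7 = 0.504
```

The criticism in the Highlights follows directly from this structure: each stream scores the interaction independently, so the fused score cannot recover joint human-object cues that no single stream captured.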
Highlights
  • Detailed semantic understanding of image contents, beyond instance-level recognition, is one of the fundamental problems in computer vision
  • Individual scores from a human, an object, and a pairwise stream are fused in a late-fusion manner for interaction recognition (e.g., producing the output “kick sports ball”). We argue that such a late-fusion strategy is sub-optimal, since appearance features alone are insufficient to capture complex human-object interactions.
  • A minor limitation of our approach is that multiple human-object interaction pairs cannot share the same interaction point
  • We propose a point-based framework for human-object interaction detection
  • The interaction point and its corresponding interaction vector are first generated by the keypoint detection network
  • Experiments are performed on two human-object interaction detection benchmarks
Methods
  • Compared methods: VSRL [9]*, InteractNet [8], BAR [12], GPNN [25], iCAN [7], HOI with knowledge [37], DCA [34], RPNN [38], TIK [14], PMFNet [32], Ours, and Ours + HICO.
  • As in [4], the authors report results on three HOI category sets (full, rare, and non-rare) under two different settings: Default and Known Objects.
  • Under the Known Object setting, the approach achieves an absolute gain of 2.9% over [14] on the full set.
Results
  • The interaction boxes generated by the interaction vectors are visualized.
  • These interaction boxes are paired with the positive human and object bounding-boxes using interaction grouping.
  • Fig. 6 shows examples of a human performing multiple interactions.
  • A minor limitation of the approach is that multiple HOI pairs cannot share the same interaction point.
  • Such cases are rare in practice
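The grouping step above pairs each interaction point with detected human and object boxes. A minimal sketch of the idea, assuming the interaction point lies between the two instances and its vector points toward the human, so the human center is expected near point + vector and the object center near point − vector; the nearest-center matching rule and all names here are assumptions, not the paper's exact criterion:

```python
import math


def group_interactions(interaction_point, interaction_vector,
                       human_centers, object_centers):
    """Match one interaction point to the closest human and object centers.

    Sketch of interaction grouping: predict where the human and object
    centers should lie from the interaction point and its vector, then
    pick the detections nearest to those predictions.
    """
    px, py = interaction_point
    vx, vy = interaction_vector
    pred_human = (px + vx, py + vy)    # expected human center
    pred_object = (px - vx, py - vy)   # expected object center

    def nearest(pred, centers):
        return min(range(len(centers)),
                   key=lambda i: math.dist(pred, centers[i]))

    return nearest(pred_human, human_centers), nearest(pred_object, object_centers)


# An interaction point halfway between a human at (10, 10) and an object
# at (30, 30); the vector points from the interaction point to the human.
h_idx, o_idx = group_interactions((20, 20), (-10, -10),
                                  human_centers=[(10, 10), (50, 5)],
                                  object_centers=[(30, 30), (0, 40)])
```

This also makes the stated limitation concrete: one interaction point yields one (human, object) pairing, so two HOI pairs whose interaction points coincide cannot both be recovered.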
Conclusion
  • The authors propose a point-based framework for HOI detection.
  • The authors' approach regards the HOI detection as a keypoint detection and grouping problem.
  • The interaction point and its corresponding interaction vector are first generated by the keypoint detection network.
  • The authors directly pair these interaction points with the human and object bounding boxes from the object detection branch using the proposed interaction grouping scheme.
  • Experiments are performed on two HOI detection benchmarks.
  • The authors' point-based approach outperforms state-of-the-art methods on both datasets.
Tables
  • Table 1: State-of-the-art comparison (in terms of mAP_role) on the V-COCO dataset. * refers to the implementation of [9] by [8]. Our approach sets a new state-of-the-art with an mAP_role of 51.0 and achieves an absolute gain of 3.2% over TIK [14]. The results are further improved (mAP_role of 52.3) when pre-training on HICO-DET and then fine-tuning on the V-COCO dataset.
  • Table 2: State-of-the-art comparison (in terms of mAP_role) on HICO-DET using two different settings, Default and Known Object, on all three sets (full, rare, non-rare). Note that Shen et al. [30], InteractNet [8] and GPNN [25] only report results under the Default setting. For both settings, our approach provides superior performance compared to existing methods. Under the Default setting, our approach achieves an mAP_role of 19.56 on the full set. Further, our approach obtains an absolute gain of 2.9% over TIK [14] on the full set under the Known Object setting.
  • Table 3: Impact of integrating our contributions into the baseline on V-COCO. Results are reported in terms of role mean average precision (mAP_role). For a fair comparison, we use the same backbone (Hourglass-104) for all ablation experiments. Our overall architecture achieves an absolute gain of 11.4% over the baseline.
  • Table 4: Performance comparison (in terms of mAP_role) of the classification capabilities of our approach for the rare and non-rare classes on HICO-DET. We show results with different score thresholds used during evaluation. Our proposed dynamic-threshold inference achieves a good performance trade-off between the rare and non-rare classes.
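The dynamic-threshold inference referenced in Table 4 trades off rare against non-rare classes by varying the score cut-off. A minimal sketch of the underlying idea, under the assumption that rare classes receive a lower threshold than non-rare ones; the threshold values, function name, and selection rule are all illustrative assumptions, not the paper's exact procedure:

```python
def dynamic_threshold(scores, rare_classes,
                      rare_thresh=0.1, common_thresh=0.5):
    """Keep predictions whose score exceeds a class-dependent threshold.

    Rare HOI classes get a lower cut-off than non-rare ones, so
    low-confidence but correct rare predictions survive evaluation.
    Threshold values here are assumptions for illustration.
    """
    kept = []
    for cls, score in scores:
        thresh = rare_thresh if cls in rare_classes else common_thresh
        if score > thresh:
            kept.append((cls, score))
    return kept


preds = [("feed horse", 0.2), ("ride horse", 0.4), ("walk horse", 0.6)]
kept = dynamic_threshold(preds, rare_classes={"feed horse"})
# "feed horse" survives its low rare-class threshold; "ride horse" is
# dropped by the higher common-class threshold.
```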
Related work
  • Object Detection: In recent years, significant progress has been made in the field of object detection [15, 17, 19, 21, 28, 29, 35, 36], mainly due to the advances in deep convolutional neural networks (CNNs). Generally, modern object detection approaches can be divided into single-stage [17, 20, 26, 27, 33] and two-stage methods [1, 15, 28]. Two-stage object detection methods typically generate candidate object proposals and then perform classification and regression of these proposals in the second stage. On the other hand, single-stage object detection approaches work by directly classifying and regressing the default anchor box in each position. Two-stage object detectors are generally known to be more accurate whereas the main advantage of single-stage methods is their speed.

    Within object detection, recent anchor-free single-stage detectors [13, 31, 40, 41] aim at eliminating the requirement of anchor boxes and treat object detection as keypoint estimation. CornerNet [13] detects the bounding-box of an object as a pair of keypoints: the top-left and bottom-right corners. ExtremeNet [41] further detects four extreme points and one center point of an object and groups the five keypoints into a bounding-box. CenterNet [40] models an object as a single point, the center point of its bounding-box, and has also been extended to human pose estimation [6] and the 3D detection task [24].
  • Human-Object Interaction Detection: Among existing human-object interaction (HOI) detection methods, the work of [9] is the first to explore the problem of visual semantic role labeling: localizing the agent (human) and the object, along with detecting the interaction between them. The work of [8] introduces a human-centric approach, called InteractNet, which extends the Faster R-CNN framework with an additional branch to learn an interaction-specific density map over target locations. Qi et al. [25] propose to utilize a graph convolutional network and regard the HOI task as a graph structure optimization problem. Chao et al. [3] build a multi-stream network based on the human and object regions-of-interest and a pairwise interaction branch. The inputs to this multi-stream architecture are the predicted bounding-boxes from a pre-trained detector (e.g., FPN [15]) and the original image. The human and object streams in such a multi-stream architecture use appearance features, extracted from the backbone network, to generate confidence predictions for the detected human and object bounding-boxes. The pairwise stream, on the other hand, simply encodes the spatial relationship between the human and the object by taking the union of the two boxes. Later works have extended this multi-stream architecture by, e.g., introducing instance-centric attention [7], pose information [14], and deep contextual attention based on context-aware appearance features [34].
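The keypoint-estimation view of detection used by these anchor-free methods reduces to finding local maxima in a predicted score heatmap. A minimal sketch of that extraction step, using plain nested loops over a list-of-lists heatmap (real implementations such as CornerNet/CenterNet run a 3×3 max-pool on GPU instead; the function name and threshold are assumptions):

```python
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for every local maximum above threshold.

    A cell is a peak if its score exceeds the threshold and is at least
    as large as all of its (up to 8) neighbours.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            s = heatmap[r][c]
            if s <= threshold:
                continue
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(s >= n for n in neighbours):
                peaks.append((r, c, s))
    return peaks


# A toy 3x3 heatmap with one confident peak at its center.
hm = [[0.1, 0.2, 0.1],
      [0.2, 0.9, 0.2],
      [0.1, 0.2, 0.1]]
```

The same peak-finding primitive underlies the interaction points of this paper: instead of object corners or centers, the heatmap channels score interaction categories.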
Funding
  • This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700800) and the Beijing Academy of Artificial Intelligence (BAAI).
References
  • [1] Jiale Cao, Yanwei Pang, and Xuelong Li. Triply supervised decoder networks for joint detection and segmentation. In CVPR, 2019.
  • [2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
  • [3] Yuwei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, 2018.
  • [4] Yuwei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. HICO: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
  • [5] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In ICCV, 2019.
  • [6] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
  • [7] Chen Gao, Yuliang Zou, and Jia-Bin Huang. iCAN: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
  • [8] Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In CVPR, 2018.
  • [9] Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
  • [10] Tanmay Gupta, Alexander Schwing, and Derek Hoiem. No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In ICCV, 2019.
  • [11] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • [12] Alexander Kolesnikov, Christoph H. Lampert, and Vittorio Ferrari. Detecting visual relationships using box attention. arXiv preprint arXiv:1807.02136, 2018.
  • [13] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  • [14] Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yan-Feng Wang, and Cewu Lu. Transferable interactiveness knowledge for human-object interaction detection. In CVPR, 2019.
  • [15] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [18] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [19] Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Enriched feature guided refinement network for object detection. In ICCV, 2019.
  • [20] Yanwei Pang, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Efficient featurized image pyramid network for single shot detector. In CVPR, 2019.
  • [21] Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Mask-guided attention network for occluded pedestrian detection. In ICCV, 2019.
  • [22] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
  • [23] Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. Detecting unseen visual relations using analogies. In ICCV, 2019.
  • [24] Charles Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas Guibas. Frustum PointNets for 3D object detection from RGB-D data. In CVPR, 2018.
  • [25] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
  • [26] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [27] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
  • [28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [29] Fahad Shahbaz Khan, Jiaolong Xu, Joost van de Weijer, Andrew Bagdanov, Rao Muhammad Anwer, and Antonio Lopez. Recognizing actions through action-specific person detection. IEEE Transactions on Image Processing, 24(11):4422–4432, 2015.
  • [30] Liyue Shen, Serena Yeung, Judy Hoffman, Greg Mori, and Li Fei-Fei. Scaling human-object interaction recognition through zero-shot learning. In WACV, 2018.
  • [31] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
  • [32] Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. Pose-aware multi-level feature network for human object interaction detection. In ICCV, 2019.
  • [33] Tiancai Wang, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Learning rich features at high-speed for single-shot object detection. In ICCV, 2019.
  • [34] Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Deep contextual attention for human-object interaction detection. In ICCV, 2019.
  • [35] Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In CVPR, 2018.
  • [36] Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, and Ling Shao. Learning compositional neural information fusion for human parsing. In ICCV, 2019.
  • [37] Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, and Mohan Kankanhalli. Learning to detect human-object interactions with knowledge. In CVPR, 2019.
  • [38] Penghao Zhou and Mingmin Chi. Relation parsing neural network for human-object interaction detection. In ICCV, 2019.
  • [39] Tianfei Zhou, Wenguan Wang, Siyuan Qi, Jianbing Shen, and Haibin Ling. Cascaded human-object interaction recognition. In CVPR, 2020.
  • [40] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [41] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krähenbühl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.