Learning Human-Object Interaction Detection Using Interaction Points
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4115-4124, 2020
- Detailed semantic understanding of image contents, beyond instance-level recognition, is one of the fundamental problems in computer vision.
- Real-world HOIs can be complex: a human interacting with multiple objects ("sit on a couch and type on laptop"), multiple humans sharing the same interaction and object ("throw and catch ball"), or fine-grained interactions ("walk horse", "feed horse" and "jump horse").
- These complex and diverse interaction scenarios impose significant challenges when designing an HOI detection solution.
- Individual scores from the three streams are fused in a late fusion fashion for interaction recognition
- In existing multi-stream approaches (e.g., for the output "kick sports ball"), individual scores from a human, an object, and a pairwise stream are fused in a late fusion manner for interaction recognition. We argue that such a late fusion strategy is sub-optimal, since appearance features alone are insufficient to capture complex human-object interactions
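The late-fusion step criticized above can be sketched as follows. The multiplicative fusion rule and all names here are illustrative assumptions, not the exact formulation of any cited method:

```python
import numpy as np

def late_fusion(human_scores, object_scores, pairwise_scores):
    """Combine per-stream action scores multiplicatively (one common
    late-fusion choice). Inputs are per-action confidence vectors for
    a single human-object pair."""
    return (np.asarray(human_scores, dtype=float)
            * np.asarray(object_scores, dtype=float)
            * np.asarray(pairwise_scores, dtype=float))

# Three candidate actions for one human-object pair.
fused = late_fusion([0.9, 0.2, 0.5], [0.8, 0.4, 0.5], [0.7, 0.1, 0.9])
```

Because each stream scores its cue independently and the scores are only combined at the end, a pair whose individual appearance cues are weak can never recover a high fused score, which is the sub-optimality the bullet above points at.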
- A minor limitation of our approach is that multiple human-object interaction pairs cannot share the same interaction point
- We propose a point-based framework for human-object interaction detection
- The interaction point and its corresponding interaction vector are first generated by the keypoint detection network
- Experiments are performed on two human-object interaction detection benchmarks
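A minimal sketch of the interaction grouping idea described above, assuming that a pair's interaction point lies near the midpoint of the human and object box centers; the matching radius `tol` and the function names are illustrative assumptions, not the paper's exact grouping rule:

```python
import numpy as np

def box_center(box):
    # box = (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def group_interactions(human_boxes, object_boxes, interaction_points, tol=8.0):
    """Illustrative grouping: an interaction point supports a
    (human, object) pair if it lies within `tol` pixels of the
    midpoint of the two box centers."""
    matches = []
    for hi, hbox in enumerate(human_boxes):
        for oi, obox in enumerate(object_boxes):
            midpoint = (box_center(hbox) + box_center(obox)) / 2.0
            for pi, point in enumerate(interaction_points):
                if np.linalg.norm(midpoint - np.asarray(point, dtype=float)) <= tol:
                    matches.append((hi, oi, pi))
    return matches
```

For example, a human box centered at (5, 5) and an object box centered at (25, 5) are grouped with a detected interaction point at (15, 5), their midpoint.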
- The compared methods include VSRL*, InteractNet, BAR, GPNN, iCAN, HOI with knowledge, DCA, RPNN, TIK and PMFNet, against our approach with and without HICO pre-training.
- As in previous works, the authors report results on three HOI category sets (full, rare and non-rare) under two settings: Default and Known Objects.
- In the Known Object setting, the approach achieves an absolute gain of 2.9% over TIK on the full set
- The interaction boxes generated by the interaction vectors are drawn
- These interaction boxes are paired with the positive human and object bounding-boxes using interaction grouping.
- Fig. 6 shows examples of a human performing multiple interactions.
- A minor limitation of the approach is that multiple HOI pairs cannot share the same interaction point.
- Such cases are rare in practice
- The authors propose a point-based framework for HOI detection.
- The authors' approach regards HOI detection as a keypoint detection and grouping problem.
- The interaction point and its corresponding interaction vector are first generated by the keypoint detection network.
- The authors directly pair those interaction points with the human and object bounding boxes from the object detection branch using the proposed interaction grouping scheme.
- Experiments are performed on two HOI detection benchmarks.
- The authors' point-based approach outperforms state-of-the-art methods on both datasets
- Table1: State-of-the-art comparison (in terms of mAProle) on the V-COCO dataset. * refers to implementation of [<a class="ref-link" id="c9" href="#r9">9</a>] by [<a class="ref-link" id="c8" href="#r8">8</a>]. Our approach sets a new state-of-the-art with mAProle of 51.0 and achieves an absolute gain of 3.2% over TIK [<a class="ref-link" id="c14" href="#r14">14</a>]. The results are further improved (mAProle of 52.3) when utilizing pre-training on HICO-DET and then fine-tuning on V-COCO dataset
- Table2: State-of-the-art comparison (in terms of mAProle) on HICO-DET using two different settings, Default and Known Object, on all three sets (full, rare, non-rare). Note that Shen et al [<a class="ref-link" id="c30" href="#r30">30</a>], InteractNet [<a class="ref-link" id="c8" href="#r8">8</a>] and GPNN [<a class="ref-link" id="c25" href="#r25">25</a>] only report results on the Default setting. For both settings, our approach provides superior performance compared to existing methods. In the Default setting, our approach achieves mAProle of 19.56 on the full set. Further, our approach obtains an absolute gain of 2.9% over TIK [<a class="ref-link" id="c14" href="#r14">14</a>] on the full set in the Known Object setting
- Table3: Impact of integrating our contributions into the baseline on V-COCO. Results are reported in terms of role mean average precision (mAProle). For fair comparison, we use the same backbone (Hourglass-104) for all ablation experiments. Our overall architecture achieves an absolute gain of 11.4% over the baseline
- Table4: Performance comparison (in terms of mAProle) of the classification capabilities of our approach for the rare and non-rare classes on HICO-DET. We show results with different score thresholds used during evaluation. Our proposed dynamic threshold inference achieves a good performance trade-off between the rare and non-rare classes
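The dynamic-threshold idea behind Table 4, keeping lower-confidence predictions for rare classes so they are not suppressed entirely, can be sketched as below; the threshold values and the interface are assumptions for illustration, not the paper's exact inference rule:

```python
def filter_detections(detections, rare_classes, rare_thresh=0.05, common_thresh=0.3):
    """Per-class score thresholding: rare classes get a lower
    threshold than common classes. `detections` is a list of
    (class_id, score) pairs."""
    kept = []
    for class_id, score in detections:
        thresh = rare_thresh if class_id in rare_classes else common_thresh
        if score >= thresh:
            kept.append((class_id, score))
    return kept

# A low-confidence rare-class prediction survives; the same score
# on a common class is filtered out.
kept = filter_detections([(1, 0.1), (2, 0.1), (2, 0.4)], rare_classes={1})
```

With a single global threshold, either rare classes are suppressed (threshold too high) or common classes accumulate false positives (threshold too low); a per-class threshold trades these off.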
- Object Detection: In recent years, significant progress has been made in the field of object detection [15, 17, 19, 21, 28, 29, 35, 36], mainly due to advances in deep convolutional neural networks (CNNs). Generally, modern object detection approaches can be divided into single-stage [17, 20, 26, 27, 33] and two-stage methods [1, 15, 28]. Two-stage object detection methods typically generate candidate object proposals and then perform classification and regression of these proposals in the second stage. On the other hand, single-stage object detection approaches work by directly classifying and regressing default anchor boxes at each position. Two-stage object detectors are generally known to be more accurate, whereas the main advantage of single-stage methods is their speed.
Within object detection, recent anchor-free single-stage detectors [13, 31, 40, 41] aim at eliminating the requirement of anchor boxes and treat object detection as keypoint estimation. CornerNet detects the bounding-box of an object as a pair of keypoints: the top-left and bottom-right corners. ExtremeNet further detects four extreme points and one center point per object and groups the five keypoints into a bounding-box. CenterNet models an object as a single point, the center point of its bounding-box, and has also been extended to human pose estimation and 3D detection tasks.

Human-Object Interaction Detection: Among existing human-object interaction (HOI) detection methods, the work on visual semantic role labeling was the first to explore this problem. Its objective is to localize the agent (human) and object, along with detecting the interaction between them. InteractNet is a human-centric approach that extends the Faster R-CNN framework with an additional branch to learn an interaction-specific density map over target locations. Qi et al. propose to utilize a graph convolutional neural network, regarding the HOI task as a graph structure optimization problem. Chao et al. build a multi-stream network based on human and object region-of-interest branches and a pairwise interaction branch. The inputs to this multi-stream architecture are the predicted bounding-boxes from a pre-trained detector (e.g., FPN) and the original image. The human and object streams in such an architecture use appearance features, extracted from the backbone network, to generate confidence predictions for the detected human and object bounding-boxes. The pairwise stream, on the other hand, simply encodes the spatial relationship between the human and object by taking the union of the two boxes.
Later works have extended the above-mentioned multi-stream architecture by, e.g., introducing instance-centric attention, pose information, and deep contextual attention based on context-aware appearance features.
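The keypoint-style detection that the anchor-free methods above rely on reduces, at inference, to finding local maxima in a predicted heatmap. A simplified sketch under that assumption (real implementations use max-pooling over per-class heatmaps on GPU; the score threshold here is an assumed value):

```python
import numpy as np

def extract_peaks(heatmap, score_thresh=0.5):
    """Return (row, col, score) for every cell that is a local maximum
    in its 3x3 neighbourhood and exceeds score_thresh."""
    h, w = heatmap.shape
    # Pad with -inf so border cells compare only against real values.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3]  # 3x3 neighbourhood of (y, x)
            if heatmap[y, x] >= score_thresh and heatmap[y, x] == window.max():
                peaks.append((y, x, float(heatmap[y, x])))
    return peaks
```

The same decoding applies whether the point being detected is an object corner, an object center, or an interaction point; only the heatmap's semantics change.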
- This work was supported by The National Key Research and Development Program of China (No. 2017YFA0700800) and Beijing Academy of Artificial Intelligence (BAAI)
- Jiale Cao, Yanwei Pang, and Xuelong Li. Triply supervised decoder networks for joint detection and segmentation. In CVPR, 2019.
- Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
- Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, 2018.
- Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. HICO: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
- Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In ICCV, 2019.
- Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. Rmpe: Regional multi-person pose estimation. In ICCV, 2017.
- Chen Gao, Yuliang Zou, and Jia-Bin Huang. iCAN: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
- Georgia Gkioxari, Ross Girshick, Piotr Dollar, and Kaiming He. Detecting and recognizing human-object interactions. In CVPR, 2018.
- Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
- Tanmay Gupta, Alexander Schwing, and Derek Hoiem. Nofrills human-object interaction detection: Factorization, layout encodings, and training techniques. In ICCV, 2019.
- Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Alexander Kolesnikov, Christoph H. Lampert, and Vittorio Ferrari. Detecting visual relationships using box attention. In arXiv preprint arXiv:1807.02136, 2018.
- Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In ECCV, 2018.
- Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yan-Feng Wang, and Cewu Lu. Transferable interactiveness knowledge for human-object interaction detection. In CVPR, 2019.
- Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
- Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
- Jing Nie, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Enriched feature guided refinement network for object detection. In ICCV, 2019.
- Yanwei Pang, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Efficient featurized image pyramid network for single shot detector. In CVPR, 2019.
- Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Mask-Guided attention network for occluded pedestrian detection. In ICCV, 2019.
- George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
- Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. Detecting unseen visual relations using analogies. In ICCV, 2019.
- Charles Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas Guibas. Frustum pointnets for 3d object detection from rgbd data. In CVPR, 2018.
- Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
- Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
- Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- Fahad Shahbaz Khan, Jiaolong Xu, Joost van de Weijer, Andrew Bagdanov, Rao Muhammad Anwer, and Antonio Lopez. Recognizing actions through action-specific person detection. IEEE Transactions on Image Processing, 24(11):4422–4432, 2015.
- Liyue Shen, Serena Yeung, Judy Hoffman, Greg Mori, and Li Fei-Fei. Scaling human-object interaction recognition through zero-shot learning. In WACV, 2018.
- Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.
- Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. Pose-aware multi-level feature network for human object interaction detection. In ICCV, 2019.
- Tiancai Wang, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Learning rich features at high-speed for single-shot object detection. In ICCV, 2019.
- Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Deep contextual attention for human-object interaction detection. In ICCV, 2019.
- Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In CVPR, 2018.
- Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, and Ling Shao. Learning compositional neural information fusion for human parsing. In ICCV, 2019.
- Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, and Mohan Kankanhalli. Learning to detect human-object interactions with knowledge. In CVPR, 2019.
- Penghao Zhou and Mingmin Chi. Relation parsing neural network for human-object interaction detection. In ICCV, 2019.
- Tianfei Zhou, Wenguan Wang, Siyuan Qi, Jianbing Shen, and Haibin Ling. Cascaded human-object interaction recognition. In CVPR, 2020.
- Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
- Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.