Deep Fitting Degree Scoring Network for Monocular 3D Object Detection

CVPR, pp. 1057-1066, 2019.

Keywords:
Average Precision, Intersection over Union, Degree of Freedom, 3D bounding box, 3D object detection

Abstract:

In this paper, we propose to learn a deep fitting degree scoring network for monocular 3D object detection, which aims to conclusively score the fitting degree between proposals and the object. Different from most existing monocular frameworks, which use the tight constraint to obtain the 3D location, our approach achieves high-precision localization through...

Introduction
  • In the monocular 3D object detection problem, dimension and orientation estimation are easier than location estimation, because appearance, the only available information, is strongly related to the former two sub-problems.
  • The tight constraint [32, 25] is a commonly used technique in monocular 3D object detection: it solves for the location by fitting the projected 3D proposal compactly inside the 2D bounding box (a least-squares sketch of this idea follows this list).
  • Inspired by the observation that people can judge the quality of 3D detection results by projecting the 3D bounding boxes onto the 2D image and checking the relation between the projections and the object, the authors believe that exploring the 3D spatial overlap between proposals and ground truth is the key to solving the location estimation problem.
  • The authors conducted experiments on the challenging KITTI dataset and achieved state-of-the-art monocular 3D object detection performance, which demonstrates the effectiveness of the framework.
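As a concrete illustration of the tight constraint, here is a minimal sketch, assuming KITTI's camera convention (x right, y down, z forward) and a pinhole intrinsic matrix K; the helper names `box3d_corners` and `solve_location_tight` and the brute-force corner-to-side search are illustrative choices, not code from the paper:

```python
import itertools
import numpy as np

def box3d_corners(dims, yaw):
    """Rotated corners of a 3D box, translation not yet applied.
    KITTI camera convention: x right, y down, z forward; dims = (h, w, l),
    with the box bottom at y = 0."""
    h, w, l = dims
    x = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    y = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
    z = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])   # rotation about y (yaw)
    return (R @ np.vstack([x, y, z])).T                # shape (8, 3)

def solve_location_tight(K, box2d, dims, yaw):
    """Recover a translation T so the projected 3D box fits box2d tightly.
    box2d = (xmin, ymin, xmax, ymax); K is the 3x3 camera intrinsic matrix.
    Brute-forces which corner touches which side and keeps the best fit."""
    corners = box3d_corners(dims, yaw)
    # Each side of the 2D box pins one image coordinate: (target value, axis).
    sides = [(box2d[0], 0), (box2d[2], 0), (box2d[1], 1), (box2d[3], 1)]
    best_T, best_res = None, np.inf
    for assign in itertools.product(range(8), repeat=4):
        A, b = [], []
        for (u, axis), ci in zip(sides, assign):
            KX = K @ corners[ci]
            # Tightness: (K(RX + T))_axis = u * (K(RX + T))_z  -- linear in T.
            A.append(K[axis] - u * K[2])
            b.append(u * KX[2] - KX[axis])
        T, res, rank, _ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
        res = res[0] if res.size else np.inf
        if T[2] > 0 and res < best_res:                # object must be in front
            best_T, best_res = T, res
    return best_T
```

This makes the paper's motivation visible: the recovered location inherits any error in the estimated dimensions, orientation, and 2D box, which is why the authors score candidates instead of trusting a single closed-form solution.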
Highlights
  • In the monocular 3D object detection problem, dimension and orientation estimation are easier than location estimation, because appearance, the only available information, is strongly related to the former two sub-problems.
  • The tight constraint [32, 25] is a commonly used technique in monocular 3D object detection: it solves for the location by fitting the projected 3D proposal compactly inside the 2D bounding box.
  • Our motivation is that although the 3D location is independent of 2D appearance, drawing the projection results on the 2D image gives a convolutional neural network (CNN) additional information to better understand the spatial relationship between the original 3D bounding boxes and the object (see the projection sketch after this list).
  • We conducted experiments on the challenging KITTI dataset and achieved state-of-the-art monocular 3D object detection performance, which demonstrates the effectiveness of our framework.
  • We have proposed a unified pipeline for monocular 3D object detection.
  • By measuring the relation between the projections and the object, our Fitting Quality Network (FQNet) successfully estimates the 3D Intersection over Union (IoU) and filters out the suitable candidates. Both quantitative and qualitative results have demonstrated that our proposed method outperforms the state-of-the-art monocular 3D object detection methods.
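To make the projection cue concrete, the sketch below projects the eight corners of a 3D proposal into the image and draws its twelve edges with OpenCV; `project_box`, `BOX_EDGES`, and `draw_projection` are hypothetical helper names, and the plain wireframe drawing is only an assumed stand-in for the paper's exact rendering:

```python
import cv2
import numpy as np

def project_box(K, corners_cam):
    """Project camera-frame 3D corners (8, 3) to pixel coordinates (8, 2)."""
    uvw = K @ corners_cam.T              # homogeneous image points, (3, 8)
    return (uvw[:2] / uvw[2]).T

# The 12 edges of a cuboid as index pairs into its 8 corners.
BOX_EDGES = [(0, 1), (1, 2), (2, 3), (3, 0),   # bottom face
             (4, 5), (5, 6), (6, 7), (7, 4),   # top face
             (0, 4), (1, 5), (2, 6), (3, 7)]   # vertical edges

def draw_projection(img, K, corners_obj, T, color=(0, 255, 0)):
    """Draw the wireframe of one 3D proposal (e.g. corners from a function
    like box3d_corners above, shifted by its candidate translation T)
    onto the image patch that is fed to the scoring CNN."""
    pts = project_box(K, corners_obj + T).round().astype(int)
    for i, j in BOX_EDGES:
        cv2.line(img, tuple(map(int, pts[i])), tuple(map(int, pts[j])),
                 color, thickness=2)
    return img
```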
Methods
  • AP comparisons (Easy / Moderate / Hard) at IoU thresholds 0.5 and 0.7 on the two KITTI train/val splits:

    Method         | Split | AP, IoU = 0.5         | AP, IoU = 0.7
    Mono3D [8]     | val 1 | 30.50 / 22.39 / 19.16 | 5.22 / 5.19 / 4.13
    Deep3DBox [32] | val 2 | 30.02 / 23.77 / 18.83 | 9.99 / 7.71 / 5.30
Results
  • Apart from drawing the 3D detection boxes on 2D images, the authors also visualized the 3D detection boxes in 3D space for better inspection.
  • As shown in Figure 8, the approach fits the objects well and achieves high-precision 3D perception in various scenes with only a single monocular image as input.
Conclusion
  • The authors have proposed a unified pipeline for monocular 3D object detection.
  • By measuring the relation between the projections and the object, the FQNet successfully estimates the 3D IoU and filters out the suitable candidates (a sketch of this scoring-and-selection step follows this list).
  • Both quantitative and qualitative results have demonstrated that the proposed method outperforms the state-of-the-art monocular 3D object detection methods.
  • Extending the method from monocular 3D object detection to monocular 3D object tracking is an interesting direction for future work.
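A minimal sketch of this scoring-and-selection step, under clearly labeled assumptions: `sample_candidates` uses hand-picked Gaussian jitter (not the paper's sampling scheme), `score_fn` stands in for the trained FQNet regressor, and `iou3d_axis_aligned` ignores yaw, unlike the rotated-box 3D IoU the network is trained to predict:

```python
import numpy as np

def sample_candidates(T0, n=64, sigma=(0.2, 0.1, 0.5)):
    """Candidate translations around a seed T0 (e.g. the tight-constraint
    solution). Gaussian jitter, widest along depth z, where monocular
    localization is least certain; these sigmas are assumed, not the paper's."""
    rng = np.random.default_rng(0)
    return T0 + rng.normal(0.0, sigma, size=(n, 3))

def pick_best(candidates, score_fn):
    """Keep the candidate whose projection the scorer rates highest.
    score_fn stands in for the trained FQNet: translation -> predicted 3D IoU."""
    scores = np.array([score_fn(T) for T in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])

def iou3d_axis_aligned(b1, b2):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax).
    Simplification: the paper's boxes carry a yaw angle, which this ignores."""
    lo = np.maximum(b1[:3], b2[:3])
    hi = np.minimum(b1[3:], b2[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol = lambda b: np.prod(b[3:] - b[:3])
    return float(inter / (vol(b1) + vol(b2) - inter))
```

Sampling more densely along depth reflects that monocular depth is the dominant error source; scoring candidates turns localization into a ranking problem rather than a single direct regression.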
Tables
  • Table1: Comparisons of the Average Orientation Similarity (AOS) with the state-of-the-art methods on the KITTI dataset
  • Table2: Comparisons of the AP with the state-of-the-art methods on the KITTI Bird's Eye View validation dataset
  • Table3: Comparisons of the Average Error of dimension estimation with state-of-the-art methods on the KITTI validation dataset
  • Table4: Comparisons of the 3D AP with the state-of-the-art methods on the KITTI 3D Object validation dataset
Related work
  • Monocular 3D Object Detection: Monocular 3D object detection is much more difficult than 2D object detection because of the ambiguities arising from 2D-3D mapping. Many methods have taken the first steps, and they can be roughly categorized into two classes: handcrafted approaches and deep learning based approaches.

    Most of the early works belong to the handcrafted approaches, which concentrated on designing efficient handcrafted features. Payet and Todorovic [33] used image contours as basic features and proposed mid-level features called bags of boundaries (BOBs). Fidler et al. [18] extended the Deformable Part Model (DPM) and represented an object class as a deformable 3D cuboid composed of faces and parts. Pepik et al. [34] included viewpoint information and part-level 3D geometry information in the DPM and achieved a robust 3D object representation. Although these handcrafted methods are very carefully designed and perform well in some scenarios, their generalization ability is still limited.
Funding
  • This work was supported in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, Grant U1713214, Grant 61672306, and Grant 61572271
References
[1] A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, and U. J. Nunes. DepthCN: Vehicle detection using 3D-LiDAR and ConvNet. In ITSC, 2017.
[2] Y. Bai, Y. Lou, F. Gao, S. Wang, Y. Wu, and L.-Y. Duan. Group-sensitive triplet embedding for vehicle re-identification. TMM, 20(9):2385-2399, 2018.
[3] J. Beltran, C. Guindel, F. M. Moreno, D. Cruzado, F. Garcia, and A. de la Escalera. BirdNet: A 3D object detection framework from LiDAR information. arXiv preprint arXiv:1805.01195, 2018.
[4] M. Bertozzi, A. Broggi, and A. Fascioli. Vision-based intelligent vehicles: State of the art and perspectives. Robotics and Autonomous Systems, 32(1):1-16, 2000.
[5] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
[6] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In CVPR, 2017.
[7] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In ICCV, 2015.
[8] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3D object detection for autonomous driving. In CVPR, 2016.
[9] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals for accurate object class detection. In NIPS, 2015.
[10] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals using stereo imagery for accurate object class detection. TPAMI, 40(5):1259-1272, 2018.
[11] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, 2017.
[12] L. Del Pero, J. Bowdish, B. Kermgard, E. Hartley, and K. Barnard. Understanding Bayesian rooms using composite 3D object models. In CVPR, 2013.
[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[14] L. Duan, Y. Lou, S. Wang, W. Gao, and Y. Rui. AI-oriented large-scale video management for smart city: Technologies, standards and beyond. IEEE MultiMedia, 2018.
[15] L.-Y. Duan, V. Chandrasekhar, J. Chen, J. Lin, Z. Wang, T. Huang, B. Girod, and W. Gao. Overview of the MPEG-CDVS standard. TIP, 25(1):179-194, 2016.
[16] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In ICRA, 2017.
[17] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627-1645, 2010.
[18] S. Fidler, S. Dickinson, and R. Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NIPS, 2012.
[19] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[20] S. Gidaris and N. Komodakis. LocNet: Improving localization accuracy for object detection. In CVPR, 2016.
[21] S. Gupta, P. Arbelaez, R. Girshick, and J. Malik. Aligning 3D models to RGB-D images of cluttered scenes. In CVPR, 2015.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 37(9):1904-1916, 2015.
[23] J. Janai, F. Guney, A. Behl, and A. Geiger. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519, 2017.
[24] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander. Joint 3D proposal generation and object detection from view aggregation. arXiv preprint arXiv:1712.02294, 2017.
[25] A. Kundu, Y. Li, and J. M. Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In CVPR, 2018.
[26] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool. Dynamic 3D scene analysis from a moving vehicle. In CVPR, 2007.
[27] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 37(4-5):421-436, 2018.
[28] B. Li. 3D fully convolutional network for vehicle detection in point cloud. In IROS, 2017.
[29] B. Li, T. Zhang, and T. Xia. Vehicle detection from 3D LiDAR using fully convolutional network. arXiv preprint arXiv:1608.07916, 2016.
[30] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3D object detection with RGB-D cameras. In ICCV, 2013.
[31] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
[32] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3D bounding box estimation using deep learning and geometry. In CVPR, 2017.
[33] N. Payet and S. Todorovic. From contours to 3D object detection and pose estimation. In ICCV, 2011.
[34] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Multi-view and 3D deformable part models. TPAMI, 37(11):2232-2245, 2015.
[35] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. arXiv preprint arXiv:1711.08488, 2017.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[37] A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic grasping of novel objects using vision. IJRR, 27(2):157-173, 2008.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[39] S. Song and J. Xiao. Sliding Shapes for 3D object detection in depth images. In ECCV, 2014.
[40] S. Song and J. Xiao. Deep Sliding Shapes for amodal 3D object detection in RGB-D images. In CVPR, 2016.
[41] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, 2015.
[42] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3D voxel patterns for object category recognition. In CVPR, 2015.
[43] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Subcategory-aware convolutional neural networks for object proposals and detection. In WACV, 2017.
[44] Y. Xiang and S. Savarese. Object detection by 3D aspectlets and occlusion reasoning. In ICCVW, 2013.
[45] J. Xiao, B. Russell, and A. Torralba. Localizing 3D cuboids in single-view images. In NIPS, 2012.
[46] M. Z. Zia, M. Stark, and K. Schindler. Towards scene understanding with detailed 3D object representations. IJCV, 112(2):188-203, 2015.