Self-supervised Sparse-to-Dense: Self-supervised Depth Completion from LiDAR and Monocular Camera

Fangchang Ma, Guilherme Venturelli Cavalheiro, Sertac Karaman

International Conference on Robotics and Automation (ICRA), 2019.

Keywords: pixel level, dense annotation, super resolution, Visual Learning, Laboratory for Information & Decision Systems

Abstract:

Depth completion, the technique of estimating a dense depth image from sparse depth measurements, has a variety of applications in robotics and autonomous driving. However, depth completion faces three main challenges: the irregularly spaced pattern in the sparse depth input, the difficulty in handling multiple sensor modalities (when color images are available), and the lack of dense, pixel-level ground truth annotations. This work addresses all three challenges with a deep regression network that maps sparse depth (and, optionally, color images) directly to dense depth, and with a model-based self-supervised training framework that requires only sequences of RGB and sparse depth images.

Introduction
  • Depth sensing is fundamental in a variety of robotic tasks, including obstacle avoidance, 3D mapping [1, 2], and localization [3].
  • LiDAR, given its high accuracy and long sensing range, has been integrated into a large number of robots and autonomous vehicles.
  • The LiDAR measurements are highly sparse and irregularly spaced in the image space.
  • It is a non-trivial task to improve prediction accuracy using the corresponding color image, if available, since depth and color are different sensor modalities.
  • Dense ground truth depth is generally not available, and obtaining pixel-level annotations can be both labor-intensive and non-scalable
Highlights
  • Depth sensing is fundamental in a variety of robotic tasks, including obstacle avoidance, 3D mapping [1, 2], and localization [3]
  • Dense ground truth depth is generally not available, and obtaining pixel-level annotations can be both labor-intensive and non-scalable
  • We address all these challenges with two contributions: (1) a network architecture that learns a direct mapping from sparse depth (and, when available, color images) to dense depth; (2) a model-based self-supervised training framework
  • We have developed a deep regression model for depth completion of sparse LiDAR measurements
  • Our model achieves state-of-the-art performance on the KITTI depth completion benchmark, and outperforms existing published work by a significant margin at the time of submission
  • We propose a highly scalable, model-based self-supervised training framework for depth completion networks
  • This framework requires only sequences of RGB and sparse depth images, and outperforms a number of existing solutions trained with semi-dense annotations (a sketch of the photometric warping loss behind such a framework follows this list)
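The self-supervised framework warps nearby RGB frames into the current view through the predicted depth and penalizes photometric differences, with the relative pose between frames estimated from the sparse depth via PnP and RANSAC [35, 36]. Below is a minimal PyTorch sketch of such a photometric warping loss; the helper names and the plain L1 penalty are illustrative assumptions, not the authors' code, and the full objective also includes terms (e.g., a depth loss on the sparse input) beyond this sketch.

```python
import torch
import torch.nn.functional as F

def warp_frame(rgb_near, depth_pred, K, R, t):
    """Inverse-warp a nearby RGB frame into the current view using the
    predicted depth and a relative pose (R, t). Illustrative sketch."""
    B, _, H, W = depth_pred.shape
    device = depth_pred.device

    # Pixel grid in homogeneous coordinates, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D with the predicted depth, move to the nearby frame.
    cam = torch.inverse(K) @ pix * depth_pred.reshape(B, 1, -1)
    cam_near = R @ cam + t.reshape(B, 3, 1)

    # Project into the nearby image; normalize to [-1, 1] for grid_sample.
    proj = K @ cam_near
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(rgb_near, grid, align_corners=True)

def photometric_loss(rgb_cur, rgb_near, depth_pred, K, R, t):
    # L1 photometric error between the current frame and the warped neighbor.
    return (warp_frame(rgb_near, depth_pred, K, R, t) - rgb_cur).abs().mean()
```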
Methods
  • Table 1 compares methods on rmse [mm], mae [mm], irmse [1/km], and imae [1/km]: NadarayaW [4], SparseConvs [4], ADNN [22], IP-Basic [20], NConv-CNN [21], NN+CNN2 [4], and Ours-d take sparse depth (d) as input, while SGDU [18] and Ours-RGBd take RGBd; the numeric entries did not survive extraction.

  • The authors' d-network leads prior work by a large margin in almost all metrics.
  • The authors' predicted depth images have cleaner and sharper object boundaries, which can be attributed to the fact that the network is quite deep and has large skip connections.
  • Note that all these supervised methods produce poor predictions at the top of the image, for two reasons: (a) the LiDAR returns no measurements there, so the input to the network is all zeros at the top; (b) the semi-dense annotations (covering roughly 30% of pixels) contain no labels in these top regions.
  • Increasing network depth and the number of encoder-decoder pairs, as well as a proper split of filters between the RGB and d branches (16/48 split), yields a small but positive impact on the results (a sketch of such a two-branch stem follows this list)
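The 16/48 split above refers to the first layer of the RGBd network: color and sparse depth pass through separate convolutions before the shared encoder-decoder. A minimal PyTorch sketch of such a two-branch stem follows; the layer sizes and class name are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TwoBranchStem(nn.Module):
    """Separate first convolutions for color (16 filters) and sparse depth
    (48 filters), concatenated into a 64-channel feature map. A sketch."""
    def __init__(self):
        super().__init__()
        self.rgb_conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True))
        self.d_conv = nn.Sequential(
            nn.Conv2d(1, 48, kernel_size=3, padding=1),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))

    def forward(self, rgb, sparse_d):
        # Pixels without a LiDAR return are encoded as zeros in sparse_d.
        return torch.cat([self.rgb_conv(rgb), self.d_conv(sparse_d)], dim=1)
```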
Results
  • The authors present experimental results to demonstrate the performance of the approach.
  • The authors first compare the network architecture, trained in a purely supervised fashion, against state-of-the-art published methods.
  • The authors showcase training results using the self-supervised framework, and present an empirical study of how the algorithm performs under different levels of sparsity in the input depth signals.
  • The authors train the best network in a purely supervised fashion to benchmark against other published results.
  • The authors use the official error metrics for the KITTI depth completion benchmark [4], including rmse, mae, irmse, and imae (a helper computing these metrics is sketched after this list).
  • The results are listed in Table 1 and visualized in Figure 4
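The four official metrics are computed only over pixels with valid ground truth. A small PyTorch helper for reference; this is an illustrative sketch using the paper's units (depth in mm, inverse depth in 1/km), not the official KITTI evaluation code.

```python
import torch

def kitti_metrics(pred, gt):
    """rmse/mae on depth [mm] and irmse/imae on inverse depth [1/km],
    computed over pixels where ground truth is available. A sketch."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    rmse = torch.sqrt(((p - g) ** 2).mean())           # [mm]
    mae = (p - g).abs().mean()                         # [mm]
    inv_p, inv_g = 1e6 / p, 1e6 / g                    # mm -> 1/km
    irmse = torch.sqrt(((inv_p - inv_g) ** 2).mean())  # [1/km]
    imae = (inv_p - inv_g).abs().mean()                # [1/km]
    return rmse, mae, irmse, imae
```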
Conclusion
  • The authors have developed a deep regression model for depth completion of sparse LiDAR measurements.
  • The authors propose a highly scalable, model-based self-supervised training framework for depth completion networks.
  • This framework requires only sequences of RGB and sparse depth images, and outperforms a number of existing solutions trained with semi-dense annotations.
  • The authors present empirical results demonstrating that depth completion errors decrease as a power function of the number of input depth measurements (a fitting sketch follows this list).
  • The authors will investigate techniques for improving the self-supervised framework, including better loss functions and taking dynamic objects into account
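Such a power law, rmse ≈ a·n^b, shows up as a straight line in log-log space, so it can be checked with a linear fit. A minimal NumPy sketch; the data points below are made-up placeholders, not numbers from the paper.

```python
import numpy as np

# Fit rmse ~ a * n**b by linear regression in log-log space.
n = np.array([500, 1000, 2000, 4000, 8000], dtype=float)  # depth samples
rmse = np.array([2400.0, 1800.0, 1350.0, 1010.0, 760.0])  # [mm], synthetic

b, log_a = np.polyfit(np.log(n), np.log(rmse), deg=1)
print(f"rmse ~ {np.exp(log_a):.1f} * n^{b:.3f}")  # b < 0: error shrinks with n
```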
Tables
  • Table1: Comparison against state-of-the-art algorithms on the test set
  • Table2: Ablation study of the network architecture for depth input. Empty cells indicate the same value as the first row of each section. See Section 6.2 for detailed discussion
  • Table3: Evaluation of the self-supervised framework on the validation set
Related work
  • Depth completion. Depth completion is an umbrella term that covers a collection of related problems with a variety of different input modalities (e.g., relatively dense depth input [5, 6, 7] vs. sparse depth measurements [8, 9]; with color images for guidance [6, 10] vs. without [4]). The problems and solutions are usually sensor-dependent, and as a result they face vastly different levels of algorithmic challenges.

    For instance, depth completion for structured light sensors (e.g., Microsoft Kinect) [11] is sometimes also referred to as depth inpainting [12], or depth enhancement [5, 6, 7] when noise is taken into account. The task is to fill in small missing holes in the relatively dense depth images. This problem is relatively easy, since most pixels (typically over 80%) are observed. Consequently, even simple filtering-based methods [5] can provide good results. As a side note, the inpainting problem is also closely connected to depth denoising [13] and depth super-resolution [14, 15, 16, 17, 18, 19].
Funding
  • This work was supported in part by the Office of Naval Research (ONR) grant N00014-17-12670 and the NVIDIA Corporation
  • In particular, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the DGX-1 used for this research
References
  • [1] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 127–136. IEEE, 2011.
  • [2] J. Zhang and S. Singh. LOAM: Lidar odometry and mapping in real-time. In Robotics: Science and Systems, volume 2, 2014.
  • [3] R. W. Wolcott and R. M. Eustice. Fast LIDAR localization using multiresolution Gaussian mixture maps. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 2814–2821. IEEE, 2015.
  • [4] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. arXiv preprint arXiv:1708.06500, 2017.
  • [5] M. Camplani and L. Salgado. Efficient spatio-temporal hole filling strategy for Kinect depth maps. In Three-Dimensional Image Processing (3DIP) and Applications II, volume 8290, page 82900E. International Society for Optics and Photonics, 2012.
  • [6] J. Shen and S.-C. S. Cheung. Layer depth denoising and completion for structured-light RGB-D cameras. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1187–1194. IEEE, 2013.
  • [7] S. Lu, X. Ren, and F. Liu. Depth enhancement via low-rank matrix completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3390–3397, 2014.
  • [8] F. Ma, L. Carlone, U. Ayaz, and S. Karaman. Sparse sensing for resource-constrained depth reconstruction. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 96–103. IEEE, 2016.
  • [9] F. Ma, L. Carlone, U. Ayaz, and S. Karaman. Sparse depth sensing for resource-constrained robots. arXiv preprint arXiv:1703.01398, 2017.
  • [10] F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. arXiv preprint arXiv:1709.07492, 2017.
  • [11] Y. Zhang and T. Funkhouser. Deep depth completion of a single RGB-D image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 175–185, 2018.
  • [12] J. T. Barron and B. Poole. The fast bilateral solver. In European Conference on Computer Vision, pages 617–632. Springer, 2016.
  • [13] J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In Advances in Neural Information Processing Systems, pages 291–298, 2006.
  • [14] M. Hornácek, C. Rhemann, M. Gelautz, and C. Rother. Depth super resolution by rigid body self-similarity in 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1123–1130, 2013.
  • [15] J. Xie, C.-C. Chou, R. Feris, and M.-T. Sun. Single depth image super resolution and denoising via coupled dictionary learning with local constraints and shock filtering. In Multimedia and Expo (ICME), 2014 IEEE International Conference on, pages 1–6. IEEE, 2014.
  • [16] J. Lu and D. Forsyth. Sparse depth super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2245–2253, 2015.
  • [17] J. Xie, R. S. Feris, and M.-T. Sun. Edge-guided single depth image super resolution. IEEE Transactions on Image Processing, 25(1):428–438, 2016.
  • [18] N. Schneider, L. Schneider, P. Pinggera, U. Franke, M. Pollefeys, and C. Stiller. Semantically guided depth upsampling. In German Conference on Pattern Recognition, pages 37–48. Springer, 2016.
  • [19] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4452–4461, 2016.
  • [20] J. Ku, A. Harakeh, and S. L. Waslander. In defense of classical image processing: Fast depth completion on the CPU. arXiv preprint arXiv:1802.00036, 2018.
  • [21] A. Eldesokey, M. Felsberg, and F. S. Khan. Propagating confidences through CNNs for sparse data regression. arXiv preprint arXiv:1805.11913, 2018.
  • [22] N. Chodosh, C. Wang, and S. Lucey. Deep convolutional compressed sensing for LiDAR depth completion. arXiv preprint arXiv:1803.08949, 2018.
  • [23] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In Advances in Neural Information Processing Systems, pages 1161–1168, 2006.
  • [24] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
  • [25] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
  • [26] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. arXiv preprint arXiv:1612.02401, 2016.
  • [27] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
  • [28] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. arXiv preprint arXiv:1704.07813, 2017.
  • [29] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, 2018.
  • [30] Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.
  • [31] R. Li, S. Wang, Z. Long, and D. Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv preprint arXiv:1709.06841, 2017.
  • [32] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [34] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • [35] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155, 2009.
  • [36] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. In Readings in Computer Vision, pages 726–740. 1987.
  • [37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
  • [38] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, A. Almansa, and F. Champagnat. On regression losses for deep depth estimation.