Digging Into Self-Supervised Monocular Depth Estimation

ICCV, pp. 3827-3837, 2019.

Keywords:
deep network, unsupervised monocular depth estimation, monocular video, monocular depth estimation, depth map

Abstract:

Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.

Introduction
  • The authors seek to automatically infer a dense depth image from a single color input image.
  • Estimating absolute, or even relative depth, seems ill-posed without a second input image to enable triangulation.
  • Generating high quality depth-from-color is attractive because it could inexpensively complement LIDAR sensors used in self-driving cars, and enable new single-photo applications such as image-editing and AR-compositing.
  • Collecting large and varied training datasets with accurate ground truth depth for supervised learning [55, 9] is itself a formidable challenge.
Highlights
  • We seek to automatically infer a dense depth image from a single color input image
  • We propose three architectural and loss innovations that, combined, lead to large improvements in monocular depth estimation when training with monocular video, stereo pairs, or both: (1) a novel appearance matching loss to address the problem of occluded pixels that occur when using monocular supervision; (2) an auto-masking loss to ignore confusing, stationary pixels; and (3) a full-resolution multi-scale sampling method (a minimal loss sketch follows these highlights).
  • Self-supervised depth estimation frames the learning problem as one of novel view-synthesis, by training a network to predict the appearance of a target image from the viewpoint of another image.
  • Classical binocular and multi-view stereo methods typically address this ambiguity by enforcing smoothness in the depth maps, and by computing photo-consistency on patches when solving for per-pixel depth via global optimization, e.g. [11].
  • We have presented a versatile model for self-supervised monocular depth estimation, achieving state-of-the-art depth predictions.
  • We showed how our three contributions together give a simple and efficient model for depth estimation, which can be trained with monocular video data, stereo data, or mixed monocular and stereo data.
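Below is a minimal PyTorch sketch of the per-pixel minimum reprojection loss and auto-masking described in the highlights above. The function names and the average-pooled SSIM approximation are ours, and implementation details (border handling, tie-breaking) are simplified; this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def dssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Structural dissimilarity, (1 - SSIM) / 2, computed with a 3x3
    # average-pooled window (a simplified stand-in for the usual SSIM loss).
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(pred, target, alpha=0.85):
    # pe(Ia, Ib) = alpha/2 * (1 - SSIM(Ia, Ib)) + (1 - alpha) * |Ia - Ib|,
    # averaged over color channels to give a (B, 1, H, W) error map.
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * dssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1

def min_reprojection_loss(target, warped_srcs, raw_srcs):
    # warped_srcs: source frames warped into the target view via predicted
    #              depth and pose; raw_srcs: the same frames, unwarped.
    reproj = torch.cat([photometric_error(w, target) for w in warped_srcs], 1)
    identity = torch.cat([photometric_error(s, target) for s in raw_srcs], 1)
    min_reproj, _ = reproj.min(1, keepdim=True)      # per-pixel min handles occlusion
    min_identity, _ = identity.min(1, keepdim=True)
    # Auto-mask: keep a pixel only if warping explains it better than doing
    # nothing, suppressing static frames and objects moving with the camera.
    mask = (min_reproj < min_identity).float()
    return (mask * min_reproj).sum() / mask.sum().clamp(min=1)
```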
Methods
  • The authors describe the depth prediction network that takes a single color input I_t and produces a depth map D_t (a sketch of a common output parameterization follows this list).
  • [59] introduced a set of high quality depth maps for the KITTI dataset, making use of 5 consecutive frames and handling moving objects using the stereo pair.
  • This improved ground truth depth is provided for 652 of the 697 test frames contained in the Eigen test split [8].
  • Variants evaluated: Monodepth2 without pretraining, Monodepth2 without pretraining + post-processing (pp), Monodepth2, Monodepth2 + pp, Monodepth2 (1024 × 320), and Monodepth2 (1024 × 320) + pp, all trained with monocular (M) supervision.
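Monodepth-style networks typically do not regress metric depth directly: the decoder emits a sigmoid output ("disparity" in [0, 1]) that is affinely rescaled and inverted into depth. A minimal sketch of this parameterization, with an illustrative 0.1–100 unit depth range (the bounds here are an assumed design choice, not taken from this summary):

```python
def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    # Map the network's sigmoid output in [0, 1] to a bounded inverse-depth
    # range, then invert to obtain depth. Depth bounds are illustrative.
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp
```

Because the self-supervised objective only constrains depth up to scale when training on monocular video, predictions are commonly aligned to ground truth with per-image median scaling at evaluation time (as noted for Table 3).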
Results
  • Existing monocular methods produce lower quality depths than the best fully-supervised models.
  • The authors make use of reflection padding, in place of zero padding, in the decoder, and return the value of the closest border pixels in the source image when samples land outside of the image boundaries (see the sketch after this list).
  • The authors found that this significantly reduces the border artifacts seen in existing approaches.
  • While some other monocular depth prediction works have elected not to use ImageNet pretraining, the authors show in Table 1 that even without pretraining, they still achieve state-of-the-art results.
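A hedged PyTorch sketch of both ideas above; `ConvBlock` and `sample_source` are illustrative names rather than the authors' released API:

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    # Decoder convolution using reflection padding in place of zero padding,
    # which the Results section credits with reducing border artifacts.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pad = nn.ReflectionPad2d(1)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3)
        self.act = nn.ELU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.pad(x)))

def sample_source(source, pix_coords):
    # pix_coords: (B, H, W, 2) sampling locations normalized to [-1, 1].
    # padding_mode="border" returns the closest border pixel when a sample
    # lands outside the image, instead of filling with zeros.
    return F.grid_sample(source, pix_coords,
                         padding_mode="border", align_corners=True)
```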
Conclusion
  • The authors have presented a versatile model for self-supervised monocular depth estimation, achieving state-of-the-art depth predictions.
  • The authors introduced three contributions: (i) a minimum reprojection loss, computed for each pixel, to deal with occlusions between frames in monocular video, (ii) an auto-masking loss to ignore confusing, stationary pixels, and (iii) a full-resolution multi-scale sampling method (sketched below).
  • The authors showed how together they give a simple and efficient model for depth estimation, which can be trained with monocular video data, stereo data, or mixed monocular and stereo data.
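A minimal sketch of the full-resolution multi-scale idea in contribution (iii): disparities predicted at every decoder scale are upsampled to the input resolution before warping and computing the photometric loss, instead of evaluating the loss on downsampled images. `loss_at_full_res` is a hypothetical callback standing in for the warp-and-error computation (e.g. the minimum reprojection sketch earlier):

```python
import torch.nn.functional as F

def full_res_multiscale_loss(disps, full_size, loss_at_full_res):
    # disps: list of disparity maps, one per decoder scale.
    total = 0.0
    for disp in disps:
        # Upsample each intermediate disparity to the input resolution first.
        disp_full = F.interpolate(disp, size=full_size, mode="bilinear",
                                  align_corners=False)
        total = total + loss_at_full_res(disp_full)
    return total / len(disps)
```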
Tables
  • Table 1: Quantitative results. Comparison of our method to existing methods on KITTI 2015 [13] using the Eigen split. Best results in each category are in bold; second best are underlined. All results here are presented without post-processing [15]; see supplementary Section F for improved post-processed results. While our contributions are designed for monocular training, we still gain high accuracy in the stereo-only category. We additionally show we can get higher scores at a larger 1024 × 320 resolution, similar to [47]; see supplementary Section G. These high-resolution numbers are bolded if they beat all other models, including our low-res versions.
  • Table 2: Ablation. Results for different variants of our model (Monodepth2) with monocular training on KITTI 2015 [13] using the Eigen split. (a) The baseline model, with none of our contributions, performs poorly. The addition of our minimum reprojection, auto-masking and full-res multi-scale components significantly improves performance. (b) Even without ImageNet pretrained weights, our much simpler model brings large improvements above the baseline; see also Table 1. (c) If we train with the full Eigen dataset (instead of the subset introduced for monocular training by [76]), our improvement over the baseline increases.
  • Table 3: Make3D results. All M results benefit from median scaling, while MS uses the unmodified network prediction.
  • Table 4: Odometry results on the KITTI [13] odometry dataset. Results show the average absolute trajectory error, and standard deviation, in meters.
  • Table 5: Our network architecture. Here k is the kernel size, s the stride, chns the number of output channels for each layer, res the downscaling factor for each layer relative to the input image, and input corresponds to the input of each layer, where ↑ is a 2× nearest-neighbor upsampling of the layer (a hedged sketch of one decoder stage follows this list).
  • Table 6: Ablation. Results for different variants of our model (Monodepth2) with monocular training (except where specified) on KITTI 2015 [13].
  • Table 7: KITTI improved ground truth. Comparison to existing methods on KITTI 2015 [13] using 93% of the Eigen split and the improved ground truth from [59]. Baseline methods were evaluated using their provided disparity files, which were either available publicly or obtained through private communication with the authors.
  • Table 8: Single-scale monocular evaluation. Comparison to existing monocular supervised methods on KITTI 2015 [13] using the Eigen split with improved ground truth from [59], using a single scale for each method. † indicates newer results from the online implementation.
  • Table 9: KITTI depth prediction benchmark. Comparison of our monocular plus stereo approaches to fully supervised methods on the KITTI depth prediction benchmark [27]. D indicates models that were trained with ground truth depth supervision, while M and S are monocular and stereo self-supervision respectively.
  • Table 10: Effect of post-processing. We observe that post-processing, originally motivated only for stereo training, also brings consistent benefits to all our monocular-trained models. Interestingly, for some metrics post-processing results in a larger quantitative gain than training models at higher resolution.
  • Table 11: Ablation study on the input/output resolutions of our model. † Timings for the highest-resolution models comprise 10 epochs of training the 640 × 192 model and 5 epochs of the 1024 × 320 model.
  • Table 12: Ablation of the effect of pose networks on depth prediction. Results shown are for depth prediction on the KITTI dataset, when trained from monocular sequences only. 'Input Frames' indicates how many frames are fed to the pose network. 'Shared encoder (arXiv v1)' denotes the architecture proposed in v1 of this paper.
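Table 5 describes a U-Net-style [53] decoder whose layers consume 2× nearest-neighbor-upsampled features (the ↑ in the table) concatenated with encoder skip connections. A hedged sketch of one such stage; channel sizes and the exact layer ordering are illustrative, not a transcription of Table 5:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    # One decoder stage: nearest-neighbor upsample by 2x, concatenate the
    # matching encoder skip feature, then a 3x3 convolution with ELU.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch,
                              kernel_size=3, padding=1)
        self.act = nn.ELU(inplace=True)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = torch.cat([x, skip], dim=1)
        return self.act(self.conv(x))
```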
Related work
  • We review models that, at test time, take a single color image as input and predict the depth of each pixel as output.

    2.1. Supervised Depth Estimation

    Estimating depth from a single image is an inherently ill-posed problem, as the same input image can project to multiple plausible depths. To address this, learning-based methods have shown themselves capable of fitting predictive models that exploit the relationship between color images and their corresponding depth. Various approaches have been explored, from combining local predictions [19, 55] and non-parametric scene sampling [24] through to end-to-end supervised learning [9, 31, 10]. Learning-based algorithms are also among the best performing for stereo estimation [72, 42, 60, 25] and optical flow [20, 63].

    Many of the above methods are fully supervised, requiring ground truth depth during training. However, this is challenging to acquire in varied real-world settings. As a result, there is a growing body of work that exploits weakly supervised training data, e.g. in the form of known object sizes [66], sparse ordinal depths [77, 6], supervised appearance matching terms [72, 73], or unpaired synthetic depth data [45, 2, 16, 78], all while still requiring the collection of additional depth or other annotations. Synthetic training data is an alternative [41], but it is not trivial to generate large amounts of synthetic data containing varied real-world appearance and motion. Recent work has shown that conventional structure-from-motion (SfM) pipelines can generate sparse training signal for both camera pose and depth [35, 28, 68], where SfM is typically run as a pre-processing step decoupled from learning. Recently, [65] built upon our model by incorporating noisy depth hints from traditional stereo algorithms, improving depth predictions.
References
  • [1] Filippo Aleotti, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. Generative adversarial networks for unsupervised monocular depth prediction. In ECCV Workshops, 2018.
  • [2] Amir Atapour-Abarghouei and Toby Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In CVPR, 2018.
  • [3] V Madhu Babu, Kaushik Das, Anima Majumdar, and Swagat Kumar. UnDEMoN: Unsupervised deep network for depth and ego-motion estimation. In IROS, 2018.
  • [4] Arunkumar Byravan and Dieter Fox. SE3-Nets: Learning rigid body motion using deep neural networks. In ICRA, 2017.
  • [5] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI, 2019.
  • [6] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In NeurIPS, 2016.
  • [7] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv, 2015.
  • [8] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
  • [9] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
  • [10] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018.
  • [11] Yasutaka Furukawa and Carlos Hernández. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2015.
  • [12] Ravi Garg, Vijay Kumar BG, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  • [13] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
  • [14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [15] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
  • [16] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks. In ECCV, 2018.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [18] Carol Barnes Hochberg and Julian E Hochberg. Familiar size and the perception of depth. The Journal of Psychology, 1952.
  • [19] Derek Hoiem, Alexei A Efros, and Martial Hebert. Automatic photo pop-up. TOG, 2005.
  • [20] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet2: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
  • [21] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NeurIPS, 2015.
  • [22] Joel Janai, Fatma Güney, Anurag Ranjan, Michael Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In ECCV, 2018.
  • [23] Huaizu Jiang, Erik Learned-Miller, Gustav Larsson, Michael Maire, and Greg Shakhnarovich. Self-supervised relative depth learning for urban scene understanding. In ECCV, 2018.
  • [24] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth transfer: Depth extraction from video using non-parametric sampling. PAMI, 2014.
  • [25] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, 2017.
  • [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, 2014.
  • [27] KITTI Single Depth Evaluation Server. http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction, 2017.
  • [28] Maria Klodt and Andrea Vedaldi. Supervising the new with the old: Learning SfM from SfM. In ECCV, 2018.
  • [29] Shu Kong and Charless Fowlkes. Pixel-wise attentional gating for parsimonious pixel labeling. arXiv, 2018.
  • [30] Yevhen Kuznietsov, Jörg Stückler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In CVPR, 2017.
  • [31] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.
  • [32] Bo Li, Yuchao Dai, and Mingyi He. Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recognition, 2018.
  • [33] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv, 2017.
  • [34] Ruibo Li, Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, and Lingxiao Hang. Deep attention-based classification network for robust depth prediction. In ACCV, 2018.
  • [35] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
  • [36] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. PAMI, 2015.
  • [37] Miaomiao Liu, Mathieu Salzmann, and Xuming He. Discrete-continuous depth estimation from a single image. In CVPR, 2014.
  • [38] Chenxu Luo, Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, Ram Nevatia, and Alan Yuille. Every pixel counts++: Joint learning of geometry and motion with 3D holistic understanding. arXiv, 2018.
  • [39] Yue Luo, Jimmy Ren, Mude Lin, Jiahao Pang, Wenxiu Sun, Hongsheng Li, and Liang Lin. Single view stereo matching. In CVPR, 2018.
  • [40] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, 2018.
  • [41] Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? IJCV, 2018.
  • [42] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
  • [43] Ishit Mehta, Parikshit Sakurikar, and PJ Narayanan. Structured adversarial training for unsupervised monocular depth estimation. In 3DV, 2018.
  • [44] Raúl Mur-Artal, José María Martínez Montiel, and Juan D Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. Transactions on Robotics, 2015.
  • [45] Jogendra Nath Kundu, Phani Krishna Uppala, Anuj Pahuja, and R. Venkatesh Babu. AdaDepth: Unsupervised content congruent adaptation for depth estimation. In CVPR, 2018.
  • [46] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS-W, 2017.
  • [47] Sudeep Pillai, Rares Ambrus, and Adrien Gaidon. SuperDepth: Self-supervised, super-resolved monocular depth estimation. In ICRA, 2019.
  • [48] Andrea Pilzer, Dan Xu, Mihai Marian Puscas, Elisa Ricci, and Nicu Sebe. Unsupervised adversarial depth estimation using cycled generative networks. In 3DV, 2018.
  • [49] Matteo Poggi, Filippo Aleotti, Fabio Tosi, and Stefano Mattoccia. Towards real-time unsupervised monocular depth estimation on CPU. In IROS, 2018.
  • [50] Matteo Poggi, Fabio Tosi, and Stefano Mattoccia. Learning monocular depth estimation with unsupervised trinocular assumptions. In 3DV, 2018.
  • [51] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, 2019.
  • [52] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised deep learning for optical flow estimation. In AAAI, 2017.
  • [53] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
  • [55] Ashutosh Saxena, Min Sun, and Andrew Ng. Make3D: Learning 3D scene structure from a single still image. PAMI, 2009.
  • [56] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2002.
  • [57] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [58] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
  • [59] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In 3DV, 2017.
  • [60] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017.
  • [61] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv, 2017.
  • [62] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In CVPR, 2018.
  • [63] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In CVPR, 2018.
  • [64] Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. TIP, 2004.
  • [65] Jamie Watson, Michael Firman, Gabriel J Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In ICCV, 2019.
  • [66] Yiran Wu, Sihao Ying, and Lianmin Zheng. Size-to-depth: A new perspective for single image depth estimation. arXiv, 2018.
  • [67] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, 2016.
  • [68] Nan Yang, Rui Wang, Jörg Stückler, and Daniel Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In ECCV, 2018.
  • [69] Zhenheng Yang, Peng Wang, Yang Wang, Wei Xu, and Ram Nevatia. LEGO: Learning edge with geometry all at once by watching videos. In CVPR, 2018.
  • [70] Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, and Ramakant Nevatia. Unsupervised learning of geometry with edge-aware depth-normal consistency. In AAAI, 2018.
  • [71] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018.
  • [72] Jure Žbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 2016.
  • [73] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
  • [74] Zhenyu Zhang, Chunyan Xu, Jian Yang, Ying Tai, and Liang Chen. Deep hierarchical guidance and regularization learning for end-to-end depth estimation. Pattern Recognition, 2018.
  • [75] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. Transactions on Computational Imaging, 2017.
  • [76] Tinghui Zhou, Matthew Brown, Noah Snavely, and David Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
  • [77] Daniel Zoran, Phillip Isola, Dilip Krishnan, and William T Freeman. Learning ordinal relationships for mid-level vision. In ICCV, 2015.
  • [78] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, 2018.