iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Lin Yen-Chen, Pete Florence, Jonathan T. Barron

Abstract:

We present iNeRF, a framework that performs pose estimation by "inverting" a trained Neural Radiance Field (NeRF). NeRFs have been shown to be remarkably effective for the task of view synthesis - synthesizing photorealistic novel views of real-world scenes or objects. In this work, we investigate whether we can apply analysis-by-synthesis via NeRF for 6DoF pose estimation…

Introduction
  • The authors present iNeRF, a framework that performs pose estimation by “inverting” a trained Neural Radiance Field (NeRF).
  • To render a pixel, a set of points is sampled along the corresponding camera ray for use as input to the MLP, which outputs a set of densities and colors.
  • These values are used to approximate the image formation model behind volume rendering [11] using numerical quadrature [24], producing an estimate of the color of that pixel, as formalized after this list.
  • The authors refer readers to Mildenhall et al. [27] for a complete description of NeRF.
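
For concreteness, the quadrature [24] that NeRF uses to approximate volume rendering [11] can be written out as follows; this restates the standard NeRF rendering equation rather than anything specific to iNeRF. Given samples at depths t_1 < … < t_N along a ray r, with the MLP predicting density σ_i and color c_i at each sample:

    \hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) c_i,
    \qquad T_i = \exp\!\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),
    \qquad \delta_i = t_{i+1} - t_i

Here T_i is the accumulated transmittance up to the i-th sample, and \hat{C}(r) is the estimated color of the pixel corresponding to ray r.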
Highlights
  • We present iNeRF, a framework that performs pose estimation by “inverting” a trained Neural Radiance Field (NeRF)
  • A Neural Radiance Field (NeRF) does this by representing a scene as a “radiance field”: a volumetric density that models the shape of the scene, and a view-dependent color that models the appearance of occupied regions of the scene, both of which lie within a bounded 3D volume
  • We show that iNeRF can improve NeRF by estimating the camera poses of images with unknown poses and using these images as additional training data for NeRF
  • When the batch size of rays is reduced from 2048 to 1024, the percentage of < 5° rotation errors drops from 71% to 55%, and the percentage of < 5 cm translation errors drops from 73% to 39%. This difference across datasets may be due to the fact that the LLFF use-case in NeRF uses a normalized device coordinate (NDC) space, or may be a byproduct of the difference in scene content
  • We have presented iNeRF, a framework for pose estimation that works by inverting a trained NeRF model
  • We have demonstrated that iNeRF is able to perform accurate pose estimation using gradient-based optimization (a sketch of this loop follows this list)
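
To make this concrete, below is a minimal sketch of such a gradient-based pose optimization in PyTorch. The differentiable renderer render_rays(pose, ray_ids) is a hypothetical stand-in for a trained NeRF (it is not an API from the paper or its code release); following the paper's use of SE(3) exponential coordinates [18] and the Adam optimizer [13], the sketch optimizes a 6-vector twist applied to the initial pose T0.

    import torch

    def skew(w):
        # 3x3 skew-symmetric matrix [w]_x, assembled with torch.stack so
        # that autograd can flow back through w.
        zero = torch.zeros((), dtype=w.dtype)
        return torch.stack([
            torch.stack([zero, -w[2],  w[1]]),
            torch.stack([w[2],  zero, -w[0]]),
            torch.stack([-w[1], w[0],  zero]),
        ])

    def se3_exp(xi):
        # Exponential map sending a twist xi = (omega, v) in R^6 to a 4x4
        # rigid transform, via the Rodrigues formula.
        omega, v = xi[:3], xi[3:]
        theta = torch.sqrt(omega.pow(2).sum() + 1e-12)  # safe at xi = 0
        K = skew(omega / theta)
        I = torch.eye(3, dtype=xi.dtype)
        R = I + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
        G = I + ((1 - torch.cos(theta)) / theta) * K \
              + ((theta - torch.sin(theta)) / theta) * (K @ K)
        top = torch.cat([R, (G @ v).reshape(3, 1)], dim=1)
        bottom = torch.tensor([[0., 0., 0., 1.]], dtype=xi.dtype)
        return torch.cat([top, bottom], dim=0)

    def estimate_pose(render_rays, target_rgb, T0, steps=300, batch=2048):
        # Recover the camera pose of target_rgb (one RGB value per ray) by
        # minimizing photometric error against the NeRF's renderings.
        xi = torch.zeros(6, requires_grad=True)   # twist, so T = exp(xi) @ T0
        opt = torch.optim.Adam([xi], lr=1e-2)
        for _ in range(steps):
            T = se3_exp(xi) @ T0                  # current pose estimate
            ids = torch.randint(0, target_rgb.shape[0], (batch,))
            loss = ((render_rays(T, ids) - target_rgb[ids]) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return (se3_exp(xi) @ T0).detach()

The batch sizes discussed in the Results (1024 and 2048 rays) correspond to the batch argument here.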
Results
  • (Figure: overlaid NeRF rendering and observed image.) The authors first conduct extensive experiments on the synthetic dataset from NeRF [27] and the real-world complex scenes from LLFF [26] to evaluate iNeRF for 6DoF pose estimation.
  • When the batch size of rays is reduced from 2048 to 1024, the percentage of < 5° rotation errors drops from 71% to 55%, and the percentage of < 5 cm translation errors drops from 73% to 39%; with a batch size of 2048, more than 70% of the data has both < 5° and < 5 cm error after iNeRF is applied.
  • This difference across datasets may be due to the fact that the LLFF use-case in NeRF uses a normalized device coordinate (NDC) space, or may be a byproduct of the difference in scene content.
  • On LineMOD, the authors report the percentage of poses whose ADD(-S) error is less than 10% of the object's diameter, and observe a 3.6% improvement compared to the initialization. (A sketch of the rotation and translation error metrics follows this list.)
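
The rotation and translation errors used for these thresholds can be computed as below; this is a minimal sketch using the standard definitions (geodesic rotation distance and Euclidean translation distance), not code from the paper.

    import numpy as np

    def rotation_error_deg(R_est, R_gt):
        # Geodesic angle, in degrees, between two 3x3 rotation matrices.
        cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    def translation_error(t_est, t_gt):
        # Euclidean distance between two translation vectors.
        return np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))

A pose counts toward the reported percentages when rotation_error_deg(...) < 5 and translation_error(...) < 0.05 (with translations in meters).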
Conclusion
  • The authors have presented iNeRF, a framework for pose estimation that works by inverting a trained NeRF model.
  • The authors have demonstrated that iNeRF is able to perform accurate pose estimation using gradient-based optimization.
  • The authors have shown how iNeRF can be used to improve NeRF reconstruction quality by allowing images without known pose labels to be used when training NeRF.
  • This suggests a future research direction of jointly optimized reconstruction and pose estimation.
Summary
  • Objectives:

    The authors' goal is to solve for the optimal relative transformation from an initial estimated pose T0, as formalized below.
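
Written out (a sketch in standard notation: the photometric objective over a batch of sampled rays \mathcal{R}, optimized by gradient descent while the trained NeRF weights \Theta stay frozen):

    \hat{T} = \operatorname*{arg\,min}_{T \in SE(3)} \;
        \sum_{r \in \mathcal{R}} \big\lVert \hat{C}(r \mid T, \Theta) - C^{*}(r) \big\rVert_2^2,
    \qquad T = e^{[\xi]}\, T_0

where C^{*}(r) is the observed pixel color for ray r, \hat{C}(r \mid T, \Theta) is the color NeRF renders for that ray under candidate pose T, and \xi \in \mathbb{R}^6 are exponential coordinates for the update to the initial estimate T_0.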
Tables
  • Table 1: Benchmark on the Fern scene. NeRFs trained with pose labels generated by iNeRF achieve higher PSNR.
  • Table 2: Quantitative results on the LineMOD dataset, reported as Average Recall (%) of the ADD(-S) metric. “With Real Pose Labels” refers to methods that additionally train on 15% of the test data following [2] and therefore have seen real posed images, rather than only synthetic posed images. (A sketch of the ADD(-S) metric follows.)
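
For reference, ADD averages distances between corresponding model points under the estimated and ground-truth poses, while ADD-S (for symmetric objects) uses the closest point instead [9]. A minimal sketch, assuming pts is an (N, 3) array of object model points (the names here are illustrative, not from the paper's code):

    import numpy as np

    def add_metric(pts, R_est, t_est, R_gt, t_gt):
        # ADD: mean distance between corresponding transformed model points.
        p_est = pts @ R_est.T + t_est
        p_gt = pts @ R_gt.T + t_gt
        return np.linalg.norm(p_est - p_gt, axis=1).mean()

    def add_s_metric(pts, R_est, t_est, R_gt, t_gt):
        # ADD-S: match each ground-truth point to its nearest estimated point.
        # (O(N^2) pairwise distances; a KD-tree is typical for large N.)
        p_est = pts @ R_est.T + t_est
        p_gt = pts @ R_gt.T + t_gt
        d = np.linalg.norm(p_gt[:, None, :] - p_est[None, :, :], axis=2)
        return d.min(axis=1).mean()

A pose is counted as correct when the metric is below 10% of the object's diameter, and Average Recall is the fraction of test images meeting that threshold.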
Related work
  • Neural 3D shape representations. Recently, several works have investigated representing 3D shapes implicitly with neural networks. In this formulation, the geometric or appearance properties of a 3D point x = (x, y, z) are parameterized as the output of a neural network. The advantage of this approach is that scenes with complex topologies can be represented at high resolution with low memory usage. When ground-truth 3D geometry is available as supervision, neural networks can be optimized to represent a signed distance function [30] or an occupancy function [25]. However, ground-truth 3D shapes are hard to obtain in practice. This motivates subsequent work on relaxing this constraint by formulating differentiable rendering pipelines that allow neural 3D shape representations to be learned using only 2D images as supervision [12, 15, 16]. Niemeyer et al. [28] represent a surface as a neural 3D occupancy field and texture as a neural 3D texture field; ray intersection locations are first computed with numerical methods using the occupancy field, then provided as inputs to the texture field, which outputs the colors. Scene Representation Networks [38] learn a neural 3D representation that outputs a feature vector and an RGB color at each continuous 3D coordinate, and employ a recurrent neural network to perform differentiable ray-marching. NeRF [27] shows that by taking view directions as additional inputs, a learned neural network works well in tandem with volume rendering techniques and enables photorealistic view synthesis. NeRF in the Wild [23] extends NeRF to additionally model each image's individual appearance and transient content, thereby allowing high-quality 3D reconstruction of landmarks from unconstrained photo collections. NSVF [17] improves NeRF by incorporating a sparse voxel octree structure into the scene representation, which accelerates rendering by allowing voxels without scene content to be skipped. Unlike NeRF and its variants, which learn to represent a scene's structure from posed RGB images, we address the inverse problem: how to localize new observations whose camera poses are unknown, using an already-trained NeRF.
References
  • [1] Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, and Josef Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. CVPR, 2014.
  • [2] Eric Brachmann, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, et al. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. CVPR, 2016.
  • [3] Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. NeurIPS, 2019.
  • [4] Xu Chen, Zijian Dong, Jie Song, Andreas Geiger, and Otmar Hilliges. Category level object pose estimation via neural analysis-by-synthesis. ECCV, 2020.
  • [5] Alvaro Collet, Manuel Martinez, and Siddhartha S. Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. IJRR, 2011.
  • [6] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. CVPR Workshops, 2018.
  • [7] Vittorio Ferrari, Tinne Tuytelaars, and Luc Van Gool. Simultaneous object recognition and segmentation from single or multiple model views. IJCV, 2006.
  • [8] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
  • [9] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. ACCV, 2012.
  • [10] Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, et al. BOP: Benchmark for 6D object pose estimation. ECCV, 2018.
  • [11] James T. Kajiya and Brian P. Von Herzen. Ray tracing volume densities. SIGGRAPH, 1984.
  • [12] Hiroharu Kato, Deniz Beker, Mihai Morariu, Takahiro Ando, Toru Matsuoka, Wadim Kehl, and Adrien Gaidon. Differentiable rendering: A survey. arXiv preprint arXiv:2006.12057, 2020.
  • [13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • [14] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. DeepIM: Deep iterative matching for 6D pose estimation. ECCV, 2018.
  • [15] Chen-Hsuan Lin, Chaoyang Wang, and Simon Lucey. SDF-SRN: Learning signed distance 3D object reconstruction from static images. NeurIPS, 2020.
  • [16] Chen-Hsuan Lin, Oliver Wang, Bryan C. Russell, Eli Shechtman, Vladimir G. Kim, Matthew Fisher, and Simon Lucey. Photometric mesh optimization for video-aligned 3D object reconstruction. CVPR, 2019.
  • [17] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 2020.
  • [18] Kevin M. Lynch and Frank C. Park. Modern Robotics. Cambridge University Press, 2017.
  • [19] Wei-Chiu Ma, Shenlong Wang, Jiayuan Gu, Sivabalan Manivasagam, Antonio Torralba, and Raquel Urtasun. Deep feedback inverse problem solver. ECCV, 2020.
  • [20] Fabian Manhardt, Diego Martin Arroyo, Christian Rupprecht, Benjamin Busam, Tolga Birdal, Nassir Navab, and Federico Tombari. Explaining the ambiguity of object detection and 6D pose from visual data. CVPR, 2019.
  • [21] Lucas Manuelli, Wei Gao, Peter Florence, and Russ Tedrake. kPAM: Keypoint affordances for category-level robotic manipulation. ISRR, 2019.
  • [22] Pat Marion, Peter R. Florence, Lucas Manuelli, and Russ Tedrake. LabelFusion: A pipeline for generating ground truth labels for real RGBD data of cluttered scenes. ICRA, 2018.
  • [23] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural radiance fields for unconstrained photo collections. arXiv preprint arXiv:2008.02268, 2020.
  • [24] Nelson Max. Optical models for direct volume rendering. IEEE TVCG, 1995.
  • [25] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. CVPR, 2019.
  • [26] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM TOG, 2019.
  • [27] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934, 2020.
  • [28] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. CVPR, 2020.
  • [29] Andrea Palazzi, Luca Bergamini, Simone Calderara, and Rita Cucchiara. End-to-end 6-DoF object pose estimation through differentiable rasterization. ECCV, 2018.
  • [30] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. CVPR, 2019.
  • [31] Keunhong Park, Arsalan Mousavian, Yu Xiang, and Dieter Fox. LatentFusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. CVPR, 2020.
  • [32] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. 6-DoF object pose from semantic keypoints. ICRA, 2017.
  • [33] Fred Rothganger, Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. IJCV, 2006.
  • [34] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. CVPR, 2020.
  • [35] Tanner Schmidt, Richard Newcombe, and Dieter Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2016.
  • [36] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. ICRA, 2015.
  • [37] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. CVPR, 2013.
  • [38] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. NeurIPS, 2019.
  • [39] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3D orientation learning for 6D object detection from RGB images. ECCV, 2018.
  • [40] Supasorn Suwajanakorn, Noah Snavely, Jonathan J. Tompson, and Mohammad Norouzi. Discovery of latent 3D keypoints via end-to-end geometric reasoning. NeurIPS, 2018.
  • [41] Richard Szeliski. Image alignment and stitching: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2006.
  • [42] Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. Real-time seamless single shot 6D object pose prediction. CVPR, 2018.
  • [43] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. CoRL, 2018.
  • [44] Shubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. CVPR, 2015.
  • [45] Julien Valentin, Matthias Nießner, Jamie Shotton, Andrew Fitzgibbon, Shahram Izadi, and Philip H. S. Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. CVPR, 2015.
  • [46] Hans Wallach. Über visuell wahrgenommene Bewegungsrichtung [On visually perceived direction of motion]. Psychologische Forschung, 1935.
  • [47] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. DenseFusion: 6D object pose estimation by iterative dense fusion. CVPR, 2019.
  • [48] Gu Wang, Fabian Manhardt, Jianzhun Shao, Xiangyang Ji, Nassir Navab, and Federico Tombari. Self6D: Self-supervised monocular 6D object pose estimation. arXiv preprint arXiv:2004.06468, 2020.
  • [49] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. CVPR, 2019.
  • [50] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. RSS, 2018.
  • [51] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. DPOD: 6D pose object detector and refiner. ICCV, 2019.