Reconstruct Locally, Localize Globally: A Model Free Method for Object Pose Estimation

    Ming Cai

    CVPR, pp. 3150-3160, 2020.

    Keywords:
    CAD model, object instance, 3D object, fully convolutional network, object pose estimation

    Abstract:

    Six degree-of-freedom pose estimation of a known object in a single image is a long-standing computer vision objective. It is classically posed as a correspondence problem between a known geometric model, such as a CAD model, and image locations. If a CAD model is not available, it is possible to use multi-view visual reconstruction methods...

    Introduction
    • The pose of an object describes the geometric relation of the object instance with respect to the capturing camera.
    • It is mathematically encoded by the Euclidean transformation between the representations of the object structure in two coordinate spaces: the object-centric frame and the camera-centric frame (a short numeric sketch of this parameterization follows the list).
    • The task the authors are interested in is to accurately estimate the six-degree-of-freedom (6dof) pose of a previously-seen rigid object instance from an RGB image.
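    A minimal numpy sketch of this parameterization (all numeric values are illustrative, not from the paper): the pose is a rotation R and translation t that map object-frame points into the camera frame, which the intrinsics K then project to pixels.

    import numpy as np

    def project(points_obj, R, t, K):
        """Map object-frame points to the camera frame with pose (R, t),
        then project them with pinhole intrinsics K."""
        points_cam = points_obj @ R.T + t        # X_cam = R @ X_obj + t
        uvw = points_cam @ K.T                   # homogeneous pixel coordinates
        return uvw[:, :2] / uvw[:, 2:3]          # perspective division

    # Illustrative values: a 30 degree rotation about the optical axis,
    # a 0.5 m translation along it, and simple pinhole intrinsics.
    theta = np.deg2rad(30.0)
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    t = np.array([0.0, 0.0, 0.5])
    K = np.array([[600.0, 0.0, 320.0],
                  [0.0, 600.0, 240.0],
                  [0.0,   0.0,   1.0]])
    points_obj = np.array([[0.05, 0.05, 0.05],
                           [-0.05, 0.05, 0.05],
                           [0.05, -0.05, 0.05]])   # object-frame points in metres
    print(project(points_obj, R, t, K))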
    Highlights
    • The pose of an object describes the geometric relation of the object instance with respect to the capturing camera
    • Whereas for 3D object coordinate learning, we propose to explicitly build constraints from images of two viewpoints that arise from an out-of-plane movement (a sketch of one such cross-view constraint follows this list)
    • We show why the landmark branch is needed and how it benefits the learning of object coordinates
    • We have proposed a method that performs accurate 6dof object pose estimation from a single RGB image
    • We explore self-supervision to learn from image deformation, eliminating the need for a 3D model in the system
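    The paper's exact loss is not reproduced on this page; the following is only a rough PyTorch sketch of one plausible cross-view constraint of the kind described above, assuming pixel correspondences between the two views are available (for example, from the known relative pose of the two rendered views or from the landmark branch).

    import torch
    import torch.nn.functional as F

    def cross_view_consistency(coords_a, coords_b, matches):
        """Pixels in two views that observe the same physical surface point
        should be regressed to the same 3D coordinate in the object frame.

        coords_a, coords_b : (3, H, W) predicted object-coordinate maps of the two views
        matches            : (N, 4) long tensor of (ya, xa, yb, xb) pixel correspondences
        """
        ya, xa, yb, xb = matches.unbind(dim=1)
        pa = coords_a[:, ya, xa]     # (3, N) coordinates sampled in view A
        pb = coords_b[:, yb, xb]     # (3, N) coordinates sampled in view B
        return F.smooth_l1_loss(pa, pb)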
    Methods
    • The authors first describe the creation of the dataset used in the previous section. They conduct ablation studies to investigate the effect of each supervisory signal on the object coordinate head.
    • The authors run the method on two real-world datasets, LINEMOD [18] and Occlusion LINEMOD [18], and compare with state-of-the-art learning-based methods that require the 3D model in their pipeline.
    • The locations of the viewpoints are randomized so that the object appears across the whole image frame at various scales.
    • The black background is replaced with real-world images from the NYU-Depth V2 dataset [34] (a compositing sketch follows this list).
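    A rough sketch of such a compositing step, assuming the rendering sits on a black background and using a simple intensity threshold as the object mask (a real pipeline would more likely reuse the renderer's own mask); file paths are placeholders.

    import numpy as np
    from PIL import Image

    def composite_on_background(render_path, background_path, threshold=10):
        """Paste a rendered object, originally on a black background, over a real image."""
        render = np.array(Image.open(render_path).convert("RGB"))
        bg = Image.open(background_path).convert("RGB")
        bg = np.array(bg.resize((render.shape[1], render.shape[0])))
        mask = (render.max(axis=2) > threshold)[..., None]   # (H, W, 1) foreground mask
        return Image.fromarray(np.where(mask, render, bg).astype(np.uint8))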
    Results
    • If the average distance (ADD) between model points transformed by the estimated pose and by the ground-truth pose is less than 10% of the object diameter, the pose estimate is considered correct (see the metric sketch below).
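    This is the standard ADD-10 criterion on LINEMOD [18], with ADD-S used for symmetric objects (Table 2). A minimal numpy sketch of the metrics (function names are illustrative):

    import numpy as np

    def add_metric(model_points, R_gt, t_gt, R_est, t_est):
        """ADD: mean distance between model points transformed by the
        ground-truth pose and by the estimated pose."""
        gt = model_points @ R_gt.T + t_gt
        est = model_points @ R_est.T + t_est
        return np.linalg.norm(gt - est, axis=1).mean()

    def add_s_metric(model_points, R_gt, t_gt, R_est, t_est):
        """ADD-S for symmetric objects: each ground-truth point is matched
        to its nearest estimated point before averaging."""
        gt = model_points @ R_gt.T + t_gt
        est = model_points @ R_est.T + t_est
        dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
        return dists.min(axis=1).mean()

    def is_correct(add_value, diameter, ratio=0.1):
        """ADD-10: accept the pose if the distance is below 10% of the object diameter."""
        return add_value < ratio * diameter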
    Conclusion
    • The authors have proposed a method that performs accurate 6dof object pose estimation from a single RGB image.
    • The authors' learning-based method implicitly encodes the object reconstruction into a network by regressing each object pixel to its 3D object coordinate.
    • At inference time, these predictions provide dense 2D-3D correspondences for geometric pose solving (see the PnP sketch after this list).
    • The learning of the network explicitly enforces multi-view geometric constraints on the object coordinates.
    • The 3D-model-free method reduces the performance gap between approaches with and without a 3D model.
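    Dense per-pixel predictions of this kind lend themselves to a standard robust PnP solve. A minimal OpenCV sketch, assuming a predicted coordinate map and a foreground mask; the solver flag and RANSAC settings are illustrative choices, not the paper's.

    import cv2
    import numpy as np

    def pose_from_coordinate_map(coord_map, mask, K):
        """Every foreground pixel paired with its predicted 3D object coordinate
        gives one 2D-3D correspondence; robust PnP recovers the pose.

        coord_map : (H, W, 3) predicted object coordinates per pixel
        mask      : (H, W) boolean foreground mask
        K         : (3, 3) camera intrinsics
        """
        ys, xs = np.nonzero(mask)
        pts_3d = coord_map[ys, xs].astype(np.float64)            # (N, 3) object coordinates
        pts_2d = np.stack([xs, ys], axis=1).astype(np.float64)   # (N, 2) pixel coordinates
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts_3d, pts_2d, K.astype(np.float64), None,
            iterationsCount=100, reprojectionError=3.0, flags=cv2.SOLVEPNP_EPNP)
        R, _ = cv2.Rodrigues(rvec)   # axis-angle to rotation matrix
        return ok, R, tvec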
    Summary
    • Objectives:

      The authors aim to learn the object coordinates in a self-supervised way, without the 3D model.
    • The authors aim to build a network that densely establishes 2D-3D correspondences by mapping RGB image pixels to 3D coordinates in the object space (a network-head sketch follows this list).
    • The authors aim to present a model-free method and propose to explore alternative supervisions.
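    The network itself is not reproduced on this page; the following is only a minimal PyTorch sketch of a two-branch fully convolutional head of the kind the objectives and keywords suggest (an object-coordinate branch plus a landmark branch). Channel counts, the number of landmarks and the backbone features are placeholders, not the paper's values.

    import torch
    import torch.nn as nn

    class CoordinateHead(nn.Module):
        """Two-branch head: per-pixel 3D object coordinates and landmark heatmaps."""

        def __init__(self, in_channels=256, num_landmarks=8):
            super().__init__()
            self.shared = nn.Sequential(
                nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True))
            self.coord = nn.Conv2d(128, 3, 1)                   # (B, 3, H, W) object coordinates
            self.landmarks = nn.Conv2d(128, num_landmarks, 1)   # (B, K, H, W) landmark heatmaps

        def forward(self, features):
            x = self.shared(features)
            return self.coord(x), self.landmarks(x)

    # Usage with a dummy backbone feature map:
    head = CoordinateHead()
    coords, heatmaps = head(torch.randn(1, 256, 64, 64))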
    Tables
    • Table 1: The pose estimation performance of different combinations of the loss terms on the test set of expo
    • Table 2: LINEMOD: percentages of correct pose estimates under ADD-10. * denotes that the object is symmetric and is evaluated with ADD-S
    • Table 3: Results on Occlusion LINEMOD. Note that all the methods require the 3D model in their pipeline except ours
    Related work
    • Feature-based Methods and Template-based Methods: It is necessary to review how geometry-based methods solve 6dof pose estimation, since our method is essentially a combination of learning and geometry. Traditionally, these methods [13, 30, 32, 21] consist of two key components: feature detection and matching, followed by geometric pose solving and refinement. The features, such as ORB [44, 33], SIFT [30] and FAST [43], are descriptors of the local appearance around keypoints. They are handcrafted to be invariant to viewpoint changes and descriptive enough for matching, as sketched below. From the matched 2D-3D correspondences, the transformation between the camera and the object can be estimated by geometric algorithms such as [17, 56, 26, 53]. Robust fitting like [12] is applied to find the optimal pose.
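    As a concrete illustration of the first stage of such a pipeline, a short OpenCV sketch of ORB detection and brute-force matching (inputs assumed to be 8-bit grayscale images); the matched keypoints would then be lifted to 2D-3D correspondences and passed to a PnP solver with robust fitting such as RANSAC [12].

    import cv2

    def match_orb_features(img_query, img_ref, max_matches=200):
        """Detect ORB keypoints in both images, describe them, and match the
        binary descriptors with cross-checked brute-force Hamming matching."""
        orb = cv2.ORB_create(nfeatures=1000)
        kp_q, des_q = orb.detectAndCompute(img_query, None)
        kp_r, des_r = orb.detectAndCompute(img_ref, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(des_q, des_r), key=lambda m: m.distance)
        return kp_q, kp_r, matches[:max_matches]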
    Funding
    • We gratefully acknowledge the support of the Australian Research Council through the Centre of Excellence for Robotic Vision CE140100016 and Laureate Fellowship FL130100102 to IR
    References
    • Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building rome in a day. Commun. ACM, 54(10):105–112, Oct. 2011, 3
    • Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6d object pose estimation using 3d object coordinates. In ECCV, 2014. 1, 2, 3
    • Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. DSAC - Differentiable RANSAC for Camera Localization. In CVPR, 2017. 2
    • Eric Brachmann, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, and Carsten Rother. Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 1
    • Eric Brachmann and Carsten Rother. Learning Less Is More - 6D Camera Localization via 3D Surface Regression. In CVPR, 2018. 2
    • Mai Bui, Shadi Albarqouni, Slobodan Ilic, and Nassir Navab. Scene coordinate and correspondence learning for imagebased localization. arXiv preprint arXiv:1805.08443, 2018. 2
    • Mai Bui, Sergey Zakharov, Shadi Albarqouni, Slobodan Ilic, and Nassir Navab. When regression meets manifold learning for object recognition and pose estimation. 2018. 1
    • Ming Cai, Huangying Zhan, Chamara Saroj Weerasekera, Kejie Li, and Ian Reid. Camera relocalization by exploiting multi-view constraints for scene coordinates regression. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019. 2
    • Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015. 1
    • Daniel F. Dementhon and Larry S. Davis. Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15(1):123–141, Jun 1995. 1
    • Thanh-Toan Do, Trung Pham, Ming Cai, and Ian D. Reid. Lienet: Real-time monocular object instance 6d pose estimation. In British Machine Vision Conference 2018, BMVC 2018, page 2, 2018. 8
    • Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981. 2
    • Iryna Gordon and David G Lowe. What and where: 3d object recognition with accurate pose. In Toward category-level object recognition, pages 67–82. Springer, 2006. 2
    • Chunhui Gu and Xiaofeng Ren. Discriminative mixture-oftemplates for viewpoint classification. In European Conference on Computer Vision, pages 408–421. Springer, 2010. 3
    • Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003. 5
    • Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 2, 3, 6
    • J. A. Hesch and S. I. Roumeliotis. A direct least-squares (dls) method for pnp. In 2011 International Conference on Computer Vision, pages 383–390, Nov 2011. 2
    • Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Computer Vision – ACCV 2012, pages 548–562, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. 1, 2, 3, 6
    • Tomas Hodan, Xenophon Zabulis, Manolis Lourakis, Stepan Obdrzalek, and Jiri Matas. Detection and fine 3d pose estimation of texture-less objects in rgb-d images. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4421–4428. IEEE, 2015. 3
    • Yinlin Hu, Joachim Hugonot, Pascal Fua, and Mathieu Salzmann. Segmentation-driven 6d object pose estimation. In CVPR, 2019. 1, 3
    • Jie Tang, S. Miller, A. Singh, and P. Abbeel. A textured object recognition pipeline for color and depth image data. In 2012 IEEE International Conference on Robotics and Automation, pages 3467–3474, May 2012. 2
    • Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again. In Proceedings of the IEEE International Conference on Computer Vision, 2017. 1, 3, 8
    • Alex Kendall and Roberto Cipolla. Modelling Uncertainty in Deep Learning for Camera Relocalization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2016. 3
    • Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, 2015. 2, 3
    • Vincent Lepetit, Pascal Fua, et al. Monocular model-based 3d tracking of rigid objects: A survey. Foundations and Trends in Computer Graphics and Vision, 1(1):1–89, 2005. 1
    • Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o(n) solution to the pnp problem. International Journal Of Computer Vision, 81:155–166, 2009. 2
    • Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. In European Conference on Computer Vision (ECCV), 2018. 1, 3, 8
    • Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017. 6
    • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015. 2
    • David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004. 2
    • Eric Marchand, Patrick Bouthemy, Francois Chaumette, and Valerie Moreau. Robust real-time visual tracking using a 2d-3d model-based approach. In Proceedings of the seventh IEEE international conference on computer vision, volume 1, pages 262–268. IEEE, 1999. 1
    • Manuel Martinez, Alvaro Collet, and Siddhartha S Srinivasa. Moped: A scalable and low latency object recognition and pose estimation system. In 2010 IEEE International Conference on Robotics and Automation, pages 2043–2049. IEEE, 2010. 2
    • Raul Mur-Artal and Juan D. Tardos. ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGBD cameras. IEEE Transactions on Robotics, 33(5):1255– 1262, 2017. 2
    • Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012. 6
    • Markus Oberweger, Mahdi Rad, and Vincent Lepetit. Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018. 3, 8
    • Qi Pan, Gerhard Reitmayr, and Tom Drummond. Proforma: Probabilistic feature-based on-line rapid model acquisition. In BMVC, volume 2, page 6. Citeseer, 2009. 1, 2
    • Qi Pan, Gerhard Reitmayr, Edward Rosten, and Tom Drummond. Rapid 3d modelling from live video. In The 33rd International Convention MIPRO, pages 252–257. IEEE, 2010. 1, 2
    • Kiru Park, Timothy Patten, and Markus Vincze. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019. 1, 3, 8
    • Karl Pauwels, Leonardo Rubio, Javier Diaz, and Eduardo Ros. Real-time model-based rigid object pose estimation and tracking combining dense and sparse visual cues. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013. 1
    • Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In CVPR, 2019. 1, 3, 8
    • Mahdi Rad and Vincent Lepetit. Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828–3836, 2017. 1, 3, 8
    • Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Feature Mapping for Learning Fast and Accurate 3D Pose Inference from Synthetic Images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018. 3
    • Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European conference on computer vision, pages 430–443. Springer, 2006.
    • Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision, 2011. 2
    • Johannes Lutz Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 7
    • Johannes Lutz Schonberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016. 7
    • Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In The European Conference on Computer Vision (ECCV), September 2018. 1
    • Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. Real-Time Seamless Single Shot 6D Object Pose Prediction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018. 1, 3, 8
    • J. Thewlis, S. Albanie, H. Bilen, and A. Vedaldi. Unsupervised learning of landmarks by exchanging descriptor vectors. In International Conference on Computer Vision, 2019. 5, 6
    • James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 5916–5925, 2017. 5, 6
    • Daniel Wagner, Gerhard Reitmayr, Alessandro Mulloni, Tom Drummond, and Dieter Schmalstieg. Pose tracking from natural features on mobile phones. In Proceedings of the 7th IEEE/ACM international symposium on mixed and augmented reality, pages 125–134. IEEE Computer Society, 2008. 1, 2
    • Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martin-Martin, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In Computer Vision and Pattern Recognition (CVPR), 2019. 1, 3, 8
    • Ping Wang, Guili Xu, Zhengsheng Wang, and Yuehua Cheng. An efficient solution to the perspective-three-point pose problem. Comput. Vis. Image Underst., 166(C):81–87, Jan. 2018. 2
    • Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. Robotics: Science and Systems (RSS), 2018. 1, 3, 8
    • Yang Xiao, Xuchong Qiu, Pierre-Alain Langlois, Mathieu Aubry, and Renaud Marlet. Pose from shape: Deep pose estimation for arbitrary 3D objects. In British Machine Vision Conference (BMVC), 2019. 1, 3
    • Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng. Complete solution classification for the perspective-three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):930–943, Aug 2003. 2
    • Zhengyou Zhang. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2), Oct 1994. 1