Online Adaptation for Consistent Mesh Reconstruction in the Wild

NeurIPS, 2020.

Keywords:
3D reconstruction, non-rigid structure from motion, non-rigid structure, single view, camera pose

Abstract:

This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D mesh, 2D keypoints, or camera pose for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video.

Introduction
  • When humans try to understand the object shown in Fig. 1(a), they instantly recognize it as a “duck”.
  • Existing research mostly focuses on limited domains for which 3D annotations can be captured in constrained environments.
  • These approaches do not generalize well to non-rigid objects captured in naturalistic environments.
  • Due to constrained environments and limited annotations, it is nearly impossible to generalize these approaches to the 3D reconstruction of non-rigid objects from images and videos captured in the wild
Highlights
  • When we humans try to understand the object shown in Fig. 1(a), we instantly recognize it as a “duck”
  • Due to constrained environments and limited annotations, it is nearly impossible to generalize these approaches to the 3D reconstruction of non-rigid objects from images and videos captured in the wild
  • Our goal is to recover coherent sequences of mesh shapes, texture maps, and camera poses from unlabeled videos, with a two-stage learning approach: (i) first, we learn a 3D mesh reconstruction model on a collection of single-view images of a category, described in Sec. 3.1; (ii) at inference time, we adapt the model to fit the sequence via temporal consistency constraints, as described in Sec. 3.2 (a minimal sketch of this adaptation loop follows this list)
  • We propose a method to reconstruct temporally consistent 3D meshes of deformable objects from videos captured in the wild
  • We learn a category-specific 3D mesh reconstruction model that jointly predicts the shape, texture, and camera pose from single-view images, which is capable of capturing asymmetric non-rigid motion deformation of objects
  • The method can be applied to tasks such as bird watching, motion analysis, and shape analysis, to name a few. Another important application is to simplify an artist's workflow, as an initial animated and textured 3D shape can be directly derived from a video
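To make the two-stage idea above concrete, here is a minimal sketch of what such a self-supervised online-adaptation loop could look like. It is not the authors' code: `model`, `render_mask`, and the loss weights are hypothetical stand-ins, and only a silhouette term plus a simple temporal-smoothness term are shown, standing in for the paper's shape, texture, and part consistency losses.

```python
import torch
import torch.nn.functional as F

def temporal_smoothness(x):
    # Penalize frame-to-frame changes in a per-frame prediction x: [T, ...].
    return F.mse_loss(x[1:], x[:-1])

def online_adapt(model, render_mask, frames, masks, steps=100, lr=1e-4):
    """Fine-tune a pretrained single-view reconstruction model on one video.

    frames: [T, 3, H, W] test video; masks: [T, 1, H, W] foreground masks.
    Hypothetical interfaces: model(frames) -> (verts [T, V, 3], cam [T, 7]);
    render_mask(verts, cam) -> [T, 1, H, W] soft silhouettes in [0, 1].
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        verts, cam = model(frames)                      # per-frame predictions
        sil = render_mask(verts, cam).clamp(1e-6, 1 - 1e-6)
        loss = F.binary_cross_entropy(sil, masks)       # match observed masks
        loss = loss + 0.1 * temporal_smoothness(verts)  # shape consistency
        loss = loss + 0.1 * temporal_smoothness(cam)    # smooth camera motion
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

In the paper, the adaptation signals additionally include texture and part-correspondence consistency (the Lc, Lt, Ls terms referenced in Table 2); the loss weights above are placeholders.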
Methods
  • The authors conduct experiments on animals, i.e., birds and zebras. The authors evaluate the contributions in two aspects: (i) the improvement of single-view mesh reconstruction, and (ii) the reconstruction of a sequence of frames via online adaptation.
  • For test-time adaptation on videos, the authors collect a new bird video dataset and, in the following, use it to quantitatively evaluate the test-time-tuned model.
  • For each slow-motion video collected from the Internet, the authors apply a segmentation model [3] trained on the CUB bird dataset [42] to obtain its foreground segmentation for online adaptation
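As an illustration of this mask-extraction step, the sketch below uses torchvision's off-the-shelf DeepLabV3 and keeps the Pascal VOC "bird" class as the foreground. This is only a rough stand-in: the authors train DeepLab [3] on the CUB bird dataset [42] rather than using a VOC/COCO-pretrained model.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Rough stand-in for the paper's CUB-trained DeepLab [3]: a COCO/VOC-pretrained
# DeepLabV3, keeping VOC class index 3 ("bird") as the foreground mask.
seg_model = deeplabv3_resnet50(pretrained=True).eval()

@torch.no_grad()
def bird_masks(batch):
    """batch: [T, 3, H, W] ImageNet-normalized frames -> [T, H, W] masks."""
    logits = seg_model(batch)["out"]          # [T, 21, H, W] VOC class scores
    return (logits.argmax(dim=1) == 3).float()
```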
Results
  • The authors visualize the reconstructed meshes by the ACMR-vid model for video frames in Fig. 5.
  • With online adaptation as discussed in Sec. 3.2, the ACMR-vid model reconstructs plausible meshes for each video frame as shown in Fig. 5(c) and (d).
  • The authors visualize the effectiveness of ARAP for online adaptation in Fig. 6.
  • Without this constraint, the reconstructed meshes are less plausible, especially from unobserved views.
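For reference, the as-rigid-as-possible (ARAP) constraint [37] penalizes deviations of the deformed mesh from locally rigid motion: E = Σ_i Σ_{j∈N(i)} ||(p'_i − p'_j) − R_i (p_i − p_j)||², with per-vertex best-fit rotations R_i. Below is a hedged sketch using uniform edge weights; the exact weighting and how the term enters the paper's adaptation objective are assumptions.

```python
import torch

def arap_energy(rest, deformed, edges):
    """ARAP penalty between a rest mesh and its deformation.

    rest, deformed: [V, 3] vertex positions; edges: [E, 2] long tensor of
    directed edges (i, j). Uniform edge weights are assumed for simplicity.
    """
    i, j = edges[:, 0], edges[:, 1]
    e0 = rest[i] - rest[j]          # rest-pose edge vectors     [E, 3]
    e1 = deformed[i] - deformed[j]  # deformed edge vectors      [E, 3]

    # Local step: per-vertex covariance S_i = sum_j e0 e1^T, then the
    # best-fit rotation R_i via SVD (Kabsch), with a reflection fix.
    V = rest.shape[0]
    S = torch.zeros(V, 3, 3, device=rest.device, dtype=rest.dtype)
    S.index_add_(0, i, e0.unsqueeze(2) * e1.unsqueeze(1))
    U, _, Vh = torch.linalg.svd(S)
    det = torch.linalg.det(Vh.transpose(1, 2) @ U.transpose(1, 2))
    D = torch.eye(3, device=rest.device, dtype=rest.dtype).repeat(V, 1, 1)
    D[:, 2, 2] = det                # flip the smallest axis where det < 0
    # Detach R: treat rotations as fixed in the local step when used as a loss.
    R = (Vh.transpose(1, 2) @ D @ U.transpose(1, 2)).detach()

    # Residual: deformed edges vs. rigidly rotated rest-pose edges.
    rot_e0 = torch.einsum("nab,nb->na", R[i], e0)
    return ((e1 - rot_e0) ** 2).sum()
```

Adding a weighted `arap_energy(rest_verts, pred_verts, edges)` term to the adaptation loss is one way such a regularizer can keep meshes plausible from unobserved views, consistent with the ablation in Fig. 6.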
Conclusion
  • The authors propose a method to reconstruct temporally consistent 3D meshes of deformable objects from videos captured in the wild.
  • The authors learn a category-specific 3D mesh reconstruction model that jointly predicts the shape, texture, and camera pose from single-view images, which is capable of capturing asymmetric non-rigid motion deformation of objects.
  • The authors adapt this model to any unlabeled video by exploiting self-supervised signals in videos, including those of shape, texture, and part consistency.
  • Another important application is to simplify an artist's workflow, as an initial animated and textured 3D shape can be directly derived from a video
Tables
  • Table 1: Quantitative evaluation of mask IoU and keypoint re-projection (PCK@0.1) on the CUB dataset [42]; a sketch of both metrics follows this list
  • Table 2: Quantitative evaluation of mask re-projection accuracy on the bird video dataset. “(T)” indicates the model is test-time trained on the given video; Lc, Lt, and Ls are defined in Eq. 4, 5, and 6, respectively
  • Table 3: Evaluation on synthetic data
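For clarity, here is a hedged sketch of the two evaluation metrics reported above. The PCK@0.1 threshold is taken here as 0.1 × max(H, W); normalization conventions vary across protocols, so treat that choice as an assumption.

```python
import torch

def mask_iou(pred, gt):
    """pred, gt: [N, H, W] binary masks -> mean intersection-over-union."""
    pred, gt = pred.bool(), gt.bool()
    inter = (pred & gt).flatten(1).sum(1).float()
    union = (pred | gt).flatten(1).sum(1).float()
    return (inter / union.clamp(min=1)).mean()

def pck(pred_kp, gt_kp, img_hw, alpha=0.1):
    """pred_kp, gt_kp: [N, K, 2] keypoints in pixels; img_hw: (H, W).

    A keypoint counts as correct if its re-projection lands within
    alpha * max(H, W) pixels of the ground-truth location.
    """
    thresh = alpha * max(img_hw)
    dist = (pred_kp - gt_kp).norm(dim=-1)   # [N, K] Euclidean distances
    return (dist <= thresh).float().mean()
```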
Related work
  • Non-rigid structure from motion (NR-SFM). NR-SFM aims to recover the pose and 3D structure of a non-rigid object, or an object deforming non-rigidly over time, solely from 2D landmarks and without 3D supervision [2]. It is a highly ill-posed problem that must be regularized by additional shape priors [2, 54]. Recently, deep networks [19, 28] have been developed that serve as more powerful priors than the traditional approaches. However, obtaining reliable landmarks or correspondences for videos remains a bottleneck. Our method bears resemblance to deep NR-SFM [28], which jointly predicts camera pose and shape deformation; unlike that work, we reconstruct dense meshes instead of sparse keypoints, without requiring labeled correspondences from videos.
Reference
  • A. Arnab, C. Doersch, and A. Zisserman. Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, June 2019. 3
  • C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In CVPR, 2000. 1, 3
  • L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017. 6, 7
  • C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016. 1, 2
  • C. Doersch and A. Zisserman. Sim2real transfer learning for 3d human pose estimation: motion to the rescue. In NeurIPS, 2019. 3
  • Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018. 3
  • P. Guo and R. Farrell. Aligned to the object, not to the image: A unified pose-aligned representation for fine-grained recognition. In WACV, 2019. 7
  • M. Habermann, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt. Deepcap: Monocular human performance capture using weak supervision. In CVPR, 2020. 5
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 8
  • P. Henderson and V. Ferrari. Learning to generate and reconstruct 3d meshes with only 2d supervision. In BMVC, 2018.
  • J. F. Hughes, A. Van Dam, J. D. Foley, M. McGuire, S. K. Feiner, and D. F. Sklar. Computer graphics: principles and practice. Pearson Education, 2014. 3
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 8
  • A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018. 1, 2, 3, 4, 6, 7, 8, 9
  • A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik. Learning 3d human dynamics from video. In CVPR, 2019. 1, 3
  • H. Kato and T. Harada. Learning view priors for single-view 3d reconstruction. In CVPR, 2019. 2, 6
  • H. Kato and T. Harada. Self-supervised learning of 3d objects from natural images. arXiv preprint arXiv:1911.08850, 2019. 2
  • H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In CVPR, 2018. 1, 2, 3
  • A. Khoreva, A. Rohrbach, and B. Schiele. Video object segmentation with language referring expressions. In ACCV, 2018.
  • C. Kong and S. Lucey. Deep non-rigid structure from motion. In ICCV, 2019. 3, 5
  • N. Kulkarni, A. Gupta, D. F. Fouhey, and S. Tulsiani. Articulation-aware canonical surface mapping. In CVPR, 2020.
  • N. Kulkarni, A. Gupta, and S. Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019. 1, 2
  • X. Li, S. Liu, K. Kim, S. De Mello, V. Jampani, M.-H. Yang, and J. Kautz. Self-supervised single-view 3d reconstruction via semantic consistency. arXiv preprint arXiv:2003.06473, 2020. 1, 2, 3, 4, 6, 7
  • X. Li, S. Liu, S. D. Mello, X. Wang, J. Kautz, and M.-H. Yang. Joint-task self-supervised learning for temporal correspondence. In NeurIPS, 2019. 5, 6
  • C.-H. Lin, O. Wang, B. C. Russell, E. Shechtman, V. G. Kim, M. Fisher, and S. Lucey. Photometric mesh optimization for video-aligned 3d object reconstruction. In CVPR, 2019. 1, 2, 3
  • S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In ICCV, 2019. 2, 3
  • M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 2015. 3, 4
  • X. Luo, J. Huang, R. Szeliski, K. Matzen, and J. Kopf. Consistent video depth estimation. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), 39(4), 2020. 3
  • D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi. C3dpo: Canonical 3d pose networks for non-rigid structure from motion. In ICCV, 2019. 1, 3, 5
  • J. Pan, X. Han, W. Chen, J. Tang, and K. Jia. Deep mesh reconstruction from single rgb images via topology modification networks. In ICCV, 2019. 2
  • D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, 2019. 3
  • X. B. Peng, A. Kanazawa, J. Malik, P. Abbeel, and S. Levine. Sfv: Reinforcement learning of physical skills from videos. ACM Trans. Graph., 37(6), Nov. 2018. 3
  • J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017. 7
  • D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In NeurIPS, 2016. 2
  • H. Rhodin, N. Robertini, D. Casas, C. Richardt, H.-P. Seidel, and C. Theobalt. General automatic human shape and motion capture using volumetric contour cues. In ECCV, 2016. 3
  • K. Robinette, S. Blackwell, H. Daanen, M. Boehmer, S. Fleming, T. Brill, D. Hoeferlin, and D. Burnsides. Civilian American and European Surface Anthropometry Resource (CAESAR) final report. Tech. Rep. AFRL-HE-WP-TR-2002-0169, US Air Force Research Laboratory, 2002. 4
  • S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. J. Black. Three-d safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV, 2019. 3
  • O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, 2007. 5
  • Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt. Test-time training for out-of-distribution generalization. arXiv preprint arXiv:1909.13231, 2019. 2
  • L. Tran and X. Liu. On learning 3d face morphable model from in-the-wild images. TPAMI, 2019. 3
  • H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion capture. In NeurIPS, 2017. 3
  • T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, 2018. 3
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset, 2011. 7, 9
  • B. Wandt, H. Ackermann, and B. Rosenhahn. 3d reconstruction of human motion from monocular image sequences. TPAMI, 2016. 3
  • N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018. 2
  • C. Wen, Y. Zhang, Z. Li, and Y. Fu. Pixel2mesh++: Multi-view 3d mesh generation via deformation. In ICCV, 2019.
  • O. Wiles and A. Zisserman. Silnet: Single- and multi-view reconstruction by learning from silhouettes. arXiv preprint arXiv:1711.07888, 2017. 2
  • S. Wu, C. Rupprecht, and A. Vedaldi. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In CVPR, 2020. 1, 2
  • Y. Wu and K. He. Group normalization. In ECCV, 2018. 8
  • X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016. 2
  • J. Y. Zhang, P. Felsen, A. Kanazawa, and J. Malik. Predicting 3d human dynamics from video. In ICCV, 2019. 1, 3
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 6
  • X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR, 2016. 3
  • R. Zhu, C. Wang, C.-H. Lin, Z. Wang, and S. Lucey. Object-centric photometric bundle adjustment with deep shape prior. In WACV, 2018. 2
  • Y. Zhu, D. Huang, F. De La Torre, and S. Lucey. Complex non-rigid motion 3d reconstruction by union of subspaces. In CVPR, 2014. 3
  • S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. J. Black. Three-d safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV, 2019. 4, 7
  • S. Zuffi, A. Kanazawa, D. Jacobs, and M. J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In CVPR, 2017. 4