Online Adaptation for Consistent Mesh Reconstruction in the Wild
NeurIPS, 2020.
Abstract:
This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D mesh, 2D keypoints, or camera pose for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video.
Introduction
- When humans try to understand the object shown in Fig. 1(a), they instantly recognize it as a “duck”.
- Existing research mostly focuses on limited domains for which 3D annotations can be captured in constrained environments.
- These approaches do not generalize well to non-rigid objects captured in naturalistic environments.
- Due to constrained environments and limited annotations, it is nearly impossible to generalize these approaches to the 3D reconstruction of non-rigid objects from images and videos captured in the wild.
Highlights
- When we humans try to understand the object shown in Fig. 1(a), we instantly recognize it as a “duck”.
- Due to constrained environments and limited annotations, it is nearly impossible to generalize these approaches to the 3D reconstruction of non-rigid objects from images and videos captured in the wild.
- Our goal is to recover coherent sequences of mesh shapes, texture maps, and camera poses from unlabeled videos, with a two-stage learning approach: (i) first, we learn a 3D mesh reconstruction model on a collection of single-view images of a category, described in Sec. 3.1; (ii) at inference time, we adapt the model to fit the sequence via temporal consistency constraints, as described in Sec. 3.2 (see the sketch after this list).
- We propose a method to reconstruct temporally consistent 3D meshes of deformable objects from videos captured in the wild.
- We learn a category-specific 3D mesh reconstruction model that jointly predicts the shape, texture, and camera pose from single-view images, and that is capable of capturing asymmetric non-rigid deformation of objects.
- The method can be applied to tasks such as bird watching, motion analysis, and shape analysis, to name a few. Another important application is to simplify an artist's workflow, as an initial animated and textured 3D shape can be directly derived from a video.
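The two-stage recipe above lends itself to a compact adaptation loop. Below is a minimal PyTorch-style sketch of the second stage, under assumed interfaces: `model(frames)` returning per-frame vertices, texture, and cameras, a differentiable silhouette `renderer`, and a `sil_loss` such as the soft-IoU term sketched in the Methods section; these names are illustrative stand-ins, not the paper's released code.

```python
import copy
import torch

def adapt_to_video(model, renderer, sil_loss, frames, masks, steps=100, lr=1e-4):
    """Fine-tune a copy of the pretrained category-level model on one
    unlabeled test video, using only self-supervised losses.

    Assumed (hypothetical) interfaces:
      model(frames)          -> (verts, texture, cams) per-frame predictions
      renderer(verts, cams)  -> soft silhouettes, shape (T, H, W)
      sil_loss(sil, masks)   -> scalar mask re-projection loss
    """
    adapted = copy.deepcopy(model)  # keep the pretrained weights intact
    optim = torch.optim.Adam(adapted.parameters(), lr=lr)
    for _ in range(steps):
        verts, texture, cams = adapted(frames)
        sil = renderer(verts, cams)
        loss = sil_loss(sil, masks)  # 2D evidence: rendered vs. predicted masks
        # Temporal-consistency regularizers: all frames of one instance should
        # share a common texture map and stay close to a common base shape.
        loss = loss + texture.var(dim=0).mean() + verts.var(dim=0).mean()
        optim.zero_grad()
        loss.backward()
        optim.step()
    return adapted
```

The variance terms are one simple way to encode the idea that all frames of an instance share a single texture map and base shape; the paper's actual consistency losses are the ones given in its Eqs. 4–6.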
Methods
- The authors conduct experiments on animals, i.e., birds and zebras, and evaluate the contributions in two aspects: (i) the improvement of single-view mesh reconstruction, and (ii) the reconstruction of a sequence of frames via online adaptation.
- The authors describe a new bird video dataset that they curate, and evaluate the test-time tuned model on it in the following.
- For test-time adaptation on videos, the authors collect a new bird video dataset for quantitative evaluation.
- For each slow-motion video collected from the Internet, the authors apply a segmentation model [3] trained on the CUB bird dataset [42] to obtain its foreground segmentation for online adaptation (a sketch of the resulting silhouette loss follows).
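Since the predicted foreground mask is the only per-frame 2D evidence used at test time, a natural self-supervised objective is a silhouette re-projection loss. A minimal sketch, assuming (T, H, W) tensors with values in [0, 1] and a rendered silhouette from a differentiable renderer; this is a plausible form of such a loss, not the paper's exact objective:

```python
import torch

def mask_reprojection_loss(sil, mask, eps=1e-6):
    """Negative soft IoU between rendered silhouettes and foreground masks.

    Both inputs are assumed to be (T, H, W) tensors in [0, 1]; `sil` would
    come from a differentiable renderer such as a soft rasterizer.
    """
    inter = (sil * mask).sum(dim=(1, 2))
    union = (sil + mask - sil * mask).sum(dim=(1, 2))
    return 1.0 - (inter / (union + eps)).mean()
```

This function can serve as the `sil_loss` callable in the adaptation sketch above.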
Results
- The authors visualize the meshes reconstructed by the ACMR-vid model for video frames in Fig. 5.
- With online adaptation, as discussed in Sec. 3.2, the ACMR-vid model reconstructs plausible meshes for each video frame, as shown in Fig. 5(c) and (d).
- The authors visualize the effectiveness of the as-rigid-as-possible (ARAP) constraint for online adaptation in Fig. 6.
- Without this constraint, the reconstructed meshes are less plausible, especially from unobserved views (the ARAP energy is sketched below).
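For reference, the as-rigid-as-possible energy of Sorkine and Alexa [37] penalizes local deviations from rigidity between a rest mesh with vertices $\mathbf{v}_i$ and a deformed mesh with vertices $\mathbf{v}'_i$; the paper's ARAP constraint is a regularizer of this standard form (the notation follows the original ARAP paper rather than the paper's own equation):

```latex
E_{\mathrm{ARAP}}(\mathbf{V}, \mathbf{V}') =
  \sum_{i=1}^{n} \sum_{j \in \mathcal{N}(i)} w_{ij}
  \bigl\lVert (\mathbf{v}'_i - \mathbf{v}'_j) - \mathbf{R}_i (\mathbf{v}_i - \mathbf{v}_j) \bigr\rVert^2
```

Here $\mathcal{N}(i)$ is the one-ring neighborhood of vertex $i$, $w_{ij}$ are edge weights (e.g., cotangent weights), and each $\mathbf{R}_i$ is the rotation best aligning the neighborhood between the two meshes. Keeping this energy small discourages implausible stretching, which is consistent with the observation that meshes degrade from unobserved views when the term is removed.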
Conclusion
- The authors propose a method to reconstruct temporally consistent 3D meshes of deformable objects from videos captured in the wild.
- The authors learn a category-specific 3D mesh reconstruction model that jointly predicts the shape, texture, and camera pose from single-view images, and that is capable of capturing asymmetric non-rigid deformation of objects.
- The authors adapt this model to any unlabeled video by exploiting self-supervised signals in videos, including those of shape, texture, and part consistency.
- Another important application is to simplify an artist's workflow, as an initial animated and textured 3D shape can be directly derived from a video.
Summary
Introduction:
When humans try to understand the object shown in Fig. 1(a), they instantly recognize it as a “duck”.
- Existing research mostly focuses on limited domains for which 3D annotations can be captured in constrained environments.
- These approaches do not generalize well to non-rigid objects captured in naturalistic environments.
- Due to constrained environments and limited annotations, it is nearly impossible to generalize these approaches to the 3D reconstruction of non-rigid objects from images and videos captured in the wild.
Objectives:
The authors' goal is to recover coherent sequences of mesh shapes, texture maps, and camera poses from unlabeled videos, with a two-stage learning approach: (i) first, the authors learn a 3D mesh reconstruction model on a collection of single-view images of a category, described in Sec. 3.1; (ii) at inference time, they adapt the model to fit the sequence via temporal consistency constraints, as described in Sec. 3.2.
Methods:
The authors conduct experiments on animals, i.e., birds and zebras, and evaluate the contributions in two aspects: (i) the improvement of single-view mesh reconstruction, and (ii) the reconstruction of a sequence of frames via online adaptation.
- The authors describe a new bird video dataset that they curate, and evaluate the test-time tuned model on it.
- For test-time adaptation on videos, the authors collect a new bird video dataset for quantitative evaluation.
- For each slow-motion video collected from the Internet, the authors apply a segmentation model [3] trained on the CUB bird dataset [42] to obtain its foreground segmentation for online adaptation.
Results:
The authors visualize the meshes reconstructed by the ACMR-vid model for video frames in Fig. 5.
- With online adaptation, as discussed in Sec. 3.2, the ACMR-vid model reconstructs plausible meshes for each video frame, as shown in Fig. 5(c) and (d).
- The authors visualize the effectiveness of ARAP for online adaptation in Fig. 6.
- Without this constraint, the reconstructed meshes are less plausible, especially from unobserved views.
Conclusion:
The authors propose a method to reconstruct temporally consistent 3D meshes of deformable objects from videos captured in the wild.
- The authors learn a category-specific 3D mesh reconstruction model that jointly predicts the shape, texture, and camera pose from single-view images, and that is capable of capturing asymmetric non-rigid deformation of objects.
- The authors adapt this model to any unlabeled video by exploiting self-supervised signals in videos, including those of shape, texture, and part consistency.
- Another important application is to simplify an artist's workflow, as an initial animated and textured 3D shape can be directly derived from a video.
Tables
- Table 1: Quantitative evaluation of mask IoU and keypoint re-projection (PCK@0.1) on the CUB dataset [42]
- Table 2: Quantitative evaluation of mask re-projection accuracy on the bird video dataset. “(T)” indicates the model is test-time trained on the given video; Lc, Lt, and Ls are defined in Eqs. 4, 5, and 6, respectively (illustrative forms of such terms are sketched after this list).
- Table 3: Evaluation on synthetic data
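The paper's exact Eqs. 4–6 are not reproduced on this page. Purely for orientation, consistency losses of this kind commonly take forms like the following, where the $T$ frames of a video are encouraged to share a common texture map $\bar{I}$ and base shape $\bar{\mathbf{V}}$ (the notation here is an illustrative assumption, not the paper's):

```latex
\mathcal{L}_t = \frac{1}{T} \sum_{t=1}^{T} \bigl\lVert I_t - \bar{I} \bigr\rVert_1,
\qquad
\mathcal{L}_s = \frac{1}{T} \sum_{t=1}^{T} \bigl\lVert \mathbf{V}_t - \bar{\mathbf{V}} \bigr\rVert_2^2
```

with $\mathcal{L}_c$ analogously encouraging the semantic part assignment of each mesh vertex to stay constant across frames.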
Related work
- Non-rigid structure from motion (NR-SFM). NR-SFM aims to recover the pose and 3D structure of a non-rigid object, or an object deforming non-rigidly over time, solely from 2D landmarks and without 3D supervision [2]. It is a highly ill-posed problem and needs to be regularized by additional shape priors [2, 54]. Recently, deep networks [19, 28] have been developed that serve as more powerful priors than the traditional approaches. However, obtaining reliable landmarks or correspondences for videos is still a bottleneck. Our method bears resemblance to deep NR-SFM [28], which jointly predicts camera pose and shape deformation. Unlike these methods, we reconstruct dense meshes instead of sparse keypoints, without requiring labeled correspondences from videos (the classical low-rank NR-SFM factorization is recalled below).
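For context, the classical formulation of Bregler et al. [2] factorizes the 2D landmark tracks of each frame into an orthographic camera and a low-rank combination of shape bases:

```latex
\mathbf{W}_f = \mathbf{R}_f \mathbf{S}_f, \qquad
\mathbf{S}_f = \sum_{k=1}^{K} c_{fk}\,\mathbf{B}_k
```

where $\mathbf{W}_f \in \mathbb{R}^{2 \times P}$ stacks the $P$ landmarks of frame $f$, $\mathbf{R}_f$ contains the first two rows of the camera rotation, and the bases $\mathbf{B}_k \in \mathbb{R}^{3 \times P}$ with coefficients $c_{fk}$ constrain the per-frame shape $\mathbf{S}_f$ to a $K$-dimensional subspace.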
Reference
- A. Arnab, C. Doersch, and A. Zisserman. Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, June 2019. 3
- C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In CVPR, 2000. 1, 3
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017. 6, 7
- C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016. 1, 2
- C. Doersch and A. Zisserman. Sim2real transfer learning for 3d human pose estimation: motion to the rescue. In NeurIPS, 2019. 3
- Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018. 3
- P. Guo and R. Farrell. Aligned to the object, not to the image: A unified pose-aligned representation for fine-grained recognition. In WACV, 2019. 7
- M. Habermann, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt. Deepcap: Monocular human performance capture using weak supervision. In CVPR, 2020. 5
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 8
- P. Henderson and V. Ferrari. Learning to generate and reconstruct 3d meshes with only 2d supervision. In BMVC, 2018.
- J. F. Hughes, A. Van Dam, J. D. Foley, M. McGuire, S. K. Feiner, and D. F. Sklar. Computer graphics: principles and practice. Pearson Education, 2014. 3
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 8
- A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018. 1, 2, 3, 4, 6, 7, 8, 9
- A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik. Learning 3d human dynamics from video. In CVPR, 2019. 1, 3
- H. Kato and T. Harada. Learning view priors for single-view 3d reconstruction. In CVPR, 2019. 2, 6
- H. Kato and T. Harada. Self-supervised learning of 3d objects from natural images. arXiv preprint arXiv:1911.08850, 2019. 2
- H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In CVPR, 2018. 1, 2, 3
- A. Khoreva, A. Rohrbach, and B. Schiele. Video object segmentation with language referring expressions. In ACCV, 2018.
- C. Kong and S. Lucey. Deep non-rigid structure from motion. In ICCV, 2019. 3, 5
- N. Kulkarni, A. Gupta, D. F. Fouhey, and S. Tulsiani. Articulation-aware canonical surface mapping. In CVPR, 2020.
- N. Kulkarni, A. Gupta, and S. Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019. 1, 2
- X. Li, S. Liu, K. Kim, S. De Mello, V. Jampani, M.-H. Yang, and J. Kautz. Self-supervised single-view 3d reconstruction via semantic consistency. arXiv preprint arXiv:2003.06473, 2020. 1, 2, 3, 4, 6, 7
- X. Li, S. Liu, S. D. Mello, X. Wang, J. Kautz, and M.-H. Yang. Joint-task self-supervised learning for temporal correspondence. In NeurIPS, 2019. 5, 6
- C.-H. Lin, O. Wang, B. C. Russell, E. Shechtman, V. G. Kim, M. Fisher, and S. Lucey. Photometric mesh optimization for video-aligned 3d object reconstruction. In CVPR, 2019. 1, 2, 3
- S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In ICCV, 2019. 2, 3
- M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 2015. 3, 4
- X. Luo, J. Huang, R. Szeliski, K. Matzen, and J. Kopf. Consistent video depth estimation. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), 39(4), 2020. 3
- D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi. C3dpo: Canonical 3d pose networks for non-rigid structure from motion. In ICCV, 2019. 1, 3, 5
- J. Pan, X. Han, W. Chen, J. Tang, and K. Jia. Deep mesh reconstruction from single rgb images via topology modification networks. In ICCV, 2019. 2
- D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, 2019. 3
- X. B. Peng, A. Kanazawa, J. Malik, P. Abbeel, and S. Levine. Sfv: Reinforcement learning of physical skills from videos. ACM Trans. Graph., 37(6), Nov. 2018. 3
- J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017. 7
- D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In NeurIPS, 2016. 2
- H. Rhodin, N. Robertini, D. Casas, C. Richardt, H.-P. Seidel, and C. Theobalt. General automatic human shape and motion capture using volumetric contour cues. In ECCV, 2016. 3
- K. Robinette, S. Blackwell, H. Daanen, M. Boehmer, S. Fleming, T. Brill, D. Hoeferlin, and D. Burnsides. Civilian American and European Surface Anthropometry Resource (CAESAR) final report. Tech. Rep. AFRL-HE-WP-TR-2002-0169, US Air Force Research Laboratory, 2002. 4
- S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. J. Black. Three-d safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV, 2019. 3
- O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry processing, 2007. 5
- Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt. Test-time training for out-of-distribution generalization. arXiv preprint arXiv:1909.13231, 2019. 2
- L. Tran and X. Liu. On learning 3d face morphable model from in-the-wild images. TPAMI, 2019. 3
- H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion capture. In NeurIPS, 2017. 3
- T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, 2018. 3
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset, 2011. 7, 9
- B. Wandt, H. Ackermann, and B. Rosenhahn. 3d reconstruction of human motion from monocular image sequences. TPAMI, 2016. 3
- N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018. 2
- C. Wen, Y. Zhang, Z. Li, and Y. Fu. Pixel2mesh++: Multi-view 3d mesh generation via deformation. In ICCV, 2019.
- O. Wiles and A. Zisserman. Silnet: Single-and multi-view reconstruction by learning from silhouettes. arXiv preprint arXiv:1711.07888, 2017. 2
- S. Wu, C. Rupprecht, and A. Vedaldi. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In CVPR, 2020. 1, 2
- Y. Wu and K. He. Group normalization. In ECCV, 2018. 8
- X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016. 2
- J. Y. Zhang, P. Felsen, A. Kanazawa, and J. Malik. Predicting 3d human dynamics from video. In ICCV, 2019. 1, 3
- R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 6
- X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR, 2016. 3
- R. Zhu, C. Wang, C.-H. Lin, Z. Wang, and S. Lucey. Object-centric photometric bundle adjustment with deep shape prior. In WACV, 2018. 2
- Y. Zhu, D. Huang, F. De La Torre, and S. Lucey. Complex non-rigid motion 3d reconstruction by union of subspaces. In CVPR, 2014. 3
- S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. J. Black. Three-d safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV, 2019. 4, 7
- S. Zuffi, A. Kanazawa, D. Jacobs, and M. J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In CVPR, 2017. 4