FReeNet: Multi-Identity Face Reenactment

Jiangning Zhang
Xianfang Zeng
Yusu Pan
Liang Liu

CVPR, pp. 5325-5334, 2020.

Keywords:
Unified Landmark Converter, Natural Science Foundation of China, facial animation, Radboud Faces Database, Structural Similarity

Abstract:

This paper presents a novel multi-identity face reenactment framework, named FReeNet, to transfer facial expressions from an arbitrary source face to a target face with a shared model. The proposed FReeNet consists of two parts: a Unified Landmark Converter (ULC) and a Geometry-aware Generator (GAG). The ULC adopts an encoder-decoder architecture to efficiently convert the expression of an arbitrary source person in a latent landmark space, and the GAG reenacts a photorealistic target image from the reference image and the converted landmark image.

Introduction
  • Face reenactment is the task of transferring a facial expression from a source face to a target face; it has promising applications such as film-making, facial animation, and augmented reality.
  • These methods first capture the facial movement of a source video, fit it into a parametric space over a predefined model, and render the target video through morphing.
  • Such techniques are widely used for animating computer-graphics (CG) avatars in games and movies [26] because of their high-quality, high-resolution face reenactment.
  • However, these methods generally suffer from expensive model construction and high computational cost.
Highlights
  • Face reenactment is the task of transferring a facial expression from a source face to a target face; it has promising applications such as film-making, facial animation, and augmented reality.
  • Their facial expressions and movements are transferred to three reference images, reenacting high-quality target face images; the remaining faces are reenacted by our approach.
  • The faces generated by our method are photorealistic and expression-consistent, with facial appearances and contours matching the reference images.
  • We choose the Structural Similarity (SSIM) and Frechet Inception Distance (FID) metrics to quantitatively evaluate the proposed method on the Radboud Faces Database (RaFD); a small evaluation sketch follows this list.
  • As the comparison results in Table 1 show, the proposed Geometry-aware Generator outperforms the baseline on both metrics.
  • We propose a novel FReeNet to address the multi-identity face reenactment task, which aims at transferring facial expressions from source persons to target persons while keeping the identity and pose consistency to the reference images
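To make the two metrics above concrete, here is a minimal evaluation sketch (our own illustration, not the authors' code): SSIM via scikit-image's structural_similarity, and the FID formula evaluated from precomputed Inception-v3 activation statistics (means mu and covariances sigma), which a feature extractor such as the pytorch-fid package would normally provide. The helper names and array shapes are assumptions.

```python
# Minimal SSIM/FID helpers, assuming images are HxWx3 uint8 arrays and that
# Inception-v3 activation statistics (mu, sigma) were computed elsewhere.
import numpy as np
from scipy import linalg
from skimage.metrics import structural_similarity

def ssim_score(img_a, img_b):
    """Structural Similarity between two HxWx3 uint8 images."""
    return structural_similarity(img_a, img_b, channel_axis=2, data_range=255)

def fid_score(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 sqrt(sigma1 sigma2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from sqrtm noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```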
Results
  • The authors conduct and discuss a series of qualitative experiments on the RaFD and Multi-PIE datasets to demonstrate the high quality of the generated images and the flexibility of the proposed framework.
  • As the comparison results in Table 1 show, the proposed GAG outperforms the baseline on both metrics.
  • Neither model can keep the identity consistent without the landmark adaptation operation, which the authors call the identity shift problem.
  • The authors attribute this to the FID metric judging both the variety and the realism of images: the GAG model can generate more varied images because contour-inconsistent landmark images of other persons are used for a single identity.
  • As Section 4.5 shows intuitively, the TP loss boosts the quality of the reenacted faces by adding more facial details (see the triplet-loss sketch after this list).
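The triplet perceptual (TP) loss is only named in this summary, so below is a hedged sketch of how such a loss can be assembled from frozen VGG-19 features and a triplet hinge: the reenacted face is pulled toward the ground-truth face of the same identity and pushed away from a face of a different identity in feature space. The layer cut (relu3_3), margin, and pairing scheme are our assumptions, not the authors' published configuration.

```python
# A minimal triplet-style perceptual loss in PyTorch; the feature layers,
# margin, and pairing below are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class TripletPerceptualLoss(torch.nn.Module):
    def __init__(self, margin: float = 1.0):
        super().__init__()
        # Frozen VGG-19 features up to relu3_3 as the perceptual embedding.
        self.features = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.margin = margin

    def forward(self, reenacted, target_same_id, other_id):
        # Anchor: reenacted face; positive: ground-truth face of the same
        # identity; negative: a face of a different identity.
        fa = self.features(reenacted)
        fp = self.features(target_same_id)
        fn = self.features(other_id)
        pull = F.mse_loss(fa, fp)                 # match appearance/identity
        push = F.mse_loss(fa, fn)                 # separate other identities
        return F.relu(pull - push + self.margin)  # triplet hinge
```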
Conclusion
  • The authors propose a novel FReeNet to address the multi-identity face reenactment task, which transfers facial expressions from source persons to target persons while keeping the identity and pose consistent with the reference images.
  • A ULC module is proposed to effectively convert the expression of an arbitrary source person to the target person in a latent landmark space.
  • The GAG module takes the reference image and the converted landmark image as input to reenact a photorealistic target image (a skeleton of this two-stage flow follows this list).
  • The approach can also be transferred to other domains, such as gesture or body-pose transfer.
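A skeleton of that two-stage flow in PyTorch, for orientation only: the ULC and GAG bodies below are toy stand-ins (a small MLP and a two-layer conv net), and the landmark count, identity-code size, and image resolution are assumed. Only the data flow, source landmarks converted in landmark space and then fed with a reference image to the generator, follows the paper.

```python
# Toy skeleton of the ULC -> GAG inference flow; module internals are
# placeholders, dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ULC(nn.Module):
    """Unified Landmark Converter: encoder-decoder over flattened landmarks."""
    def __init__(self, n_points: int = 106, id_dim: int = 32):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_points * 2, 256), nn.ReLU())
        self.decode = nn.Sequential(nn.Linear(256 + id_dim, 256), nn.ReLU(),
                                    nn.Linear(256, n_points * 2))

    def forward(self, src_landmarks, target_id_code):
        z = self.encode(src_landmarks.flatten(1))
        out = self.decode(torch.cat([z, target_id_code], dim=1))
        return out.view_as(src_landmarks)  # landmarks adapted to the target

class GAG(nn.Module):
    """Geometry-aware Generator: reference image + landmark image -> face."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(  # placeholder for the real generator
            nn.Conv2d(3 + 1, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh())

    def forward(self, reference_img, landmark_img):
        return self.net(torch.cat([reference_img, landmark_img], dim=1))

# One reenactment step: convert landmarks, rasterize them (not shown), generate.
ulc, gag = ULC(), GAG()
src_lmk = torch.randn(1, 106, 2)        # source facial landmarks
id_code = torch.randn(1, 32)            # target identity embedding
converted = ulc(src_lmk, id_code)
ref_img = torch.randn(1, 3, 256, 256)   # reference image of the target
lmk_img = torch.randn(1, 1, 256, 256)   # landmark image drawn from `converted`
fake = gag(ref_img, lmk_img)            # reenacted target face
```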
Tables
  • Table1: Metric evaluation results of the reproduced baseline and our method with different components on the RaFD dataset. Missing entry (-) means that the model is not evaluated by the metric
  • Table2: Parameter and speed comparisons of different models when learning all transformations among n persons. Missing entry (-) means that the model has no corresponding component
  • Table3: ACE results of different loss terms on the RaFD dataset
Related work
  • Image Synthesis. Driven by the remarkable generation quality of GANs [7], researchers have achieved excellent results in various domains, such as image translation [27, 12, 43, 35], person image synthesis [3, 23], and face generation [2, 26, 15, 16]. Mirza et al. [24] designed a cGAN structure that conditions both the generator and the discriminator on auxiliary information for more controllable attribute generation. Subsequently, Pix2Pix [12] achieved impressive results on paired image translation tasks by using L1 and adversarial losses between the generated image and the ground truth. Zhu et al. [43] then proposed a cycle-consistency loss for unpaired image translation between two domains, which dramatically reduces the data annotation requirement. DualGAN [42] analogously learns two translators, one from each domain to the other, and can hence solve general-purpose image-to-image translation tasks. Furthermore, StarGAN [2] proposed a unified model for multi-domain facial attribute transfer and expression synthesis. Recently, some methods generate vivid faces directly from a latent input code. Karras et al. [15] described a progressive growing training methodology for face generation from an underlying code. StyleGAN [16] proposed a style-based generator that embeds the latent input code into an intermediate latent space, which controls the weights of image features at different scales and synthesizes extremely naturalistic face images. However, generation from a latent code is an uncontrollable process that is unsuitable for the face synthesis task, and such methods scale poorly to the many-to-many face reenactment task. Our method instead introduces a landmark space for expression transfer among multiple persons and uses converted landmark images as guidance to reenact target faces, which distinguishes it from existing methods.
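As a concrete illustration of the cycle-consistency loss credited to Zhu et al. [43] above, here is a minimal PyTorch sketch (our own toy code, not from any of the cited papers); G and F stand for the two domain translators, and the weight lam is the conventional choice, not a value from this paper.

```python
# Toy sketch of the cycle-consistency loss from CycleGAN-style training:
# translating x -> G(x) -> F(G(x)) should reproduce x (and symmetrically
# for y), so no paired supervision is required.
import torch.nn.functional as F_nn

def cycle_consistency_loss(G, F, real_x, real_y, lam=10.0):
    """L_cyc = lam * (||F(G(x)) - x||_1 + ||G(F(y)) - y||_1)."""
    forward_cycle = F_nn.l1_loss(F(G(real_x)), real_x)   # X -> Y -> X
    backward_cycle = F_nn.l1_loss(G(F(real_y)), real_y)  # Y -> X -> Y
    return lam * (forward_cycle + backward_cycle)
```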
Funding
  • This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61836015 and the Key R&D Program of Zhejiang Province (2019C01004).
Reference
  • Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, volume 99, pages 187–194, 1999.
  • Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • Haoye Dong, Xiaodan Liang, Ke Gong, Hanjiang Lai, Jia Zhu, and Jian Yin. Soft-gated warping-GAN for pose-guided person image synthesis. In NeurIPS, pages 474–484, 2018.
  • Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In CVPR, pages 379–388, 2018.
  • Pablo Garrido, Michael Zollhofer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Perez, and Christian Theobalt. Reconstruction of personalized 3D face rigs from monocular video. ACM TOG, 35(3):28, 2016.
  • Jiahao Geng, Tianjia Shao, Youyi Zheng, Yanlin Weng, and Kun Zhou. Warp-guided GANs for single-photo facial animation. In SIGGRAPH Asia 2018 Technical Papers, page 231. ACM, 2018.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
  • Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
  • Xiaojie Guo, Siyuan Li, Jiawan Zhang, Jiayi Ma, Lin Ma, Wei Liu, and Haibin Ling. PFLD: A practical facial landmark detector. arXiv preprint arXiv:1902.10859, 2019.
  • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
  • Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • Xiaohan Jin, Ye Qi, and Shangxuan Wu. CycleGAN face-off. arXiv preprint arXiv:1712.03451, 2017.
  • Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711. Springer, 2016.
  • Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
  • Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Perez, Christian Richardt, Michael Zollhofer, and Christian Theobalt. Deep video portraits. ACM TOG, 37(4):163, 2018.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Oliver Langner, Ron Dotsch, Gijsbert Bijlstra, Daniel H. J. Wigboldus, Skyler T. Hawk, and Ad van Knippenberg. Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24(8):1377–1388, 2010.
  • Gary B. Huang and Erik Learned-Miller. Labeled Faces in the Wild: Updates and new reporting procedures. Technical Report UM-CS-2014-003, University of Massachusetts, Amherst, May 2014.
  • Jiangjing Lv, Xiaohu Shao, Junliang Xing, Cheng Cheng, and Xi Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In CVPR, pages 3317–3326, 2017.
  • Luming Ma and Zhigang Deng. Real-time hierarchical facial performance capture. In ACM SIGGRAPH, page 11. ACM, 2019.
  • Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NeurIPS, pages 406–416, 2017.
  • Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject agnostic face swapping and reenactment. In ICCV, pages 7184–7193, 2019.
  • Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In ECCV, pages 818–833, 2018.
  • Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Lingxiao Song, Zhihe Lu, Ran He, Zhenan Sun, and Tieniu Tan. Geometry guided adversarial facial expression synthesis. In ACM MM, pages 627–635, 2018.
  • Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM TOG, 36(4):95, 2017.
  • Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In CVPR, pages 2387–2395, 2016.
  • Justus Thies, Michael Zollhofer, Christian Theobalt, Marc Stamminger, and Matthias Nießner. HeadOn: Real-time reenactment of human portrait videos. ACM TOG, 37(4):164, 2018.
  • Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. Face transfer with multilinear models. ACM TOG, volume 24, pages 426–433, 2005.
  • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
  • Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Olivia Wiles, A. Sophia Koepke, and Andrew Zisserman. X2Face: A network for controlling face generation using images, audio, and pose codes. In ECCV, pages 670–686, 2018.
  • Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, pages 2129–2138, 2018.
  • Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. ReenactGAN: Learning to reenact faces via boundary transfer. In ECCV, pages 603–619, 2018.
  • Runze Xu, Zhiming Zhou, Weinan Zhang, and Yong Yu. Face transfer with generative adversarial network. arXiv preprint arXiv:1710.06090, 2017.
  • Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In CVPR, pages 5525–5533, 2016.
  • Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, pages 2868–2876, 2017.
  • Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.