StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

In International Conference on Computer Vision (ICCV), 2017.

Keywords:
generative adversarial networks, object part, refinement process, Variational Autoencoders, stacked generative adversarial networks

Abstract:

Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions.

Introduction
  • Generating photo-realistic images from text is an important problem and has tremendous applications, including photo-editing, computer-aided design, etc.
  • Reed et al. only succeeded in generating plausible 64×64 images conditioned on text descriptions [26], which usually lack details and vivid object parts, e.g., beaks and eyes of birds.
  • They were unable to synthesize higher-resolution (e.g., 128×128) images without providing additional annotations of objects [24].
Highlights
  • Generating photo-realistic images from text is an important problem and has tremendous applications, including photo-editing, computer-aided design, etc.
  • Reed et al. only succeeded in generating plausible 64×64 images conditioned on text descriptions [26], which usually lack details and vivid object parts, e.g., beaks and eyes of birds.
  • The contribution of the proposed method is threefold: (1) we propose novel Stacked Generative Adversarial Networks for synthesizing photo-realistic images from text descriptions; (2) we introduce a Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold; (3) extensive experiments demonstrate the effectiveness of the overall design.
  • To generate high-resolution images with photo-realistic details, we propose simple yet effective Stacked Generative Adversarial Networks.
  • We propose Stacked Generative Adversarial Networks (StackGAN) with Conditioning Augmentation for synthesizing photo-realistic images (a minimal code sketch of the Conditioning Augmentation technique follows this list).
  • The Stage-II Generative Adversarial Network corrects defects in the Stage-I results and adds more details, yielding higher-resolution images with better image quality.
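Conditioning Augmentation is the ingredient behind the smooth latent conditioning manifold: rather than feeding the generator a fixed text embedding, it samples latent conditioning variables ĉ from a Gaussian whose mean and diagonal covariance are computed from the embedding. Below is a minimal PyTorch sketch of the idea; the layer sizes (1024-d embedding, 128-d condition) and the single-linear-layer design are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample latent conditioning variables c_hat ~ N(mu(e), Sigma(e)) from a
    text embedding e, with a KL penalty toward N(0, I) as regularization."""

    def __init__(self, embed_dim=1024, cond_dim=128):  # assumed sizes
        super().__init__()
        # One linear layer predicts both the mean and the log-variance.
        self.fc = nn.Linear(embed_dim, 2 * cond_dim)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=-1)
        # Reparameterization trick keeps sampling differentiable w.r.t. mu, logvar.
        c_hat = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # KL(N(mu, diag(sigma^2)) || N(0, I)), added (weighted) to the generator loss.
        kl = 0.5 * torch.mean(mu.pow(2) + logvar.exp() - logvar - 1.0)
        return c_hat, kl
```

The sampling noise gives the generator many plausible conditioning vectors per caption, which is what encourages the smoothness (and robustness to small perturbations) on the conditioning manifold.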
Methods
  • The authors conduct extensive quantitative and qualitative evaluations.
  • Two state-of-the-art methods on text-to-image synthesis, GAN-INT-CLS [26] and GAWWN [24], are compared.
  • Results by the two compared methods are generated using the code released by their authors.
  • The authors directly train a Stage-I GAN for generating 64×64 and 256×256 images to investigate whether the proposed stacked structure and Conditioning Augmentation are beneficial.
  • The authors modify the StackGAN to generate 128×128 and 256×256 images to investigate whether larger images produced by the method result in higher image quality.
  • The authors investigate whether inputting text at both stages of StackGAN is useful.
Results
  • Evaluation metrics

    It is difficult to evaluate the performance of generative models (e.g., GANs), so the authors adopt the Inception score [29]: I = exp(E_x D_KL(p(y|x) || p(y))), where x denotes one generated sample and y is the label predicted by the Inception model [30].
  • The intuition behind this metric is that good models should generate diverse but meaningful images, so the KL divergence between the conditional distribution p(y|x) and the marginal distribution p(y) should be large.
  • As suggested in [29], the authors evaluate this metric on a large number of samples (i.e., 30k randomly selected samples) for each model; a minimal computation sketch follows this list.
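A minimal NumPy sketch of the metric, assuming the softmax class probabilities p(y|x) from a pretrained Inception model [30] have already been computed for the generated samples (the function name and array layout are my own):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs p(y|x) for N generated images.
    Returns exp(E_x[KL(p(y|x) || p(y))]), the Inception score of [29]."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y) over the sample set
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# E.g., with the paper's protocol of 30k samples:
# score = inception_score(softmax_outputs)  # softmax_outputs: (30000, 1000)
```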
Conclusion
  • The authors propose Stacked Generative Adversarial Networks (StackGAN) with Conditioning Augmentation for synthesizing photo-realistic images.
  • The proposed method decomposes text-to-image synthesis into a novel sketch-refinement process; a toy sketch of the two-stage flow follows this list.
  • The Stage-I GAN sketches the object following basic color and shape constraints from the given text description.
  • The Stage-II GAN corrects defects in the Stage-I results and adds more details, yielding higher-resolution images with better image quality.
  • Compared to existing text-to-image generative models, the method generates higher-resolution images (e.g., 256×256) with more photo-realistic details and diversity.
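To make the sketch-refinement decomposition concrete, here is a toy PyTorch sketch of the two-stage flow: Stage-I maps noise plus the augmented text condition to a 64×64 image, and Stage-II consumes that image together with the same condition to produce a 256×256 refinement. The tiny layers are placeholders for illustration only, not the paper's actual generator architectures (which use upsampling and residual blocks).

```python
import torch
import torch.nn as nn

class StageIGenerator(nn.Module):
    """Toy Stage-I: noise z + condition c_hat -> low-resolution 64x64 'sketch'."""
    def __init__(self, z_dim=100, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + cond_dim, 3 * 64 * 64), nn.Tanh())

    def forward(self, z, c_hat):
        return self.net(torch.cat([z, c_hat], dim=-1)).view(-1, 3, 64, 64)

class StageIIGenerator(nn.Module):
    """Toy Stage-II: 64x64 result + the same condition -> refined 256x256 image.
    A single upsample-and-convolve stands in for the paper's residual refiner."""
    def __init__(self, cond_dim=128):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="nearest"),
            nn.Conv2d(3 + cond_dim, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, low_res, c_hat):
        # Replicate the condition spatially and concatenate it with the image,
        # so the refinement stage can re-read the text while fixing defects.
        cond = c_hat[:, :, None, None].expand(-1, -1, 64, 64)
        return self.refine(torch.cat([low_res, cond], dim=1))

# z, c = torch.randn(4, 100), torch.randn(4, 128)
# low = StageIGenerator()(z, c)        # (4, 3, 64, 64)
# high = StageIIGenerator()(low, c)    # (4, 3, 256, 256)
```

Feeding the text condition into both stages matters: it lets Stage-II recover information that Stage-I omitted, which is exactly the ablation the Methods section describes.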
Tables
  • Table 1: Inception scores and average human ranks of our StackGAN, GAWWN [24], and GAN-INT-CLS [26] on the CUB, Oxford-102, and MS-COCO datasets
  • Table 2: Inception scores calculated with 30,000 samples generated by different baseline models of our StackGAN
Related work
  • Generative image modeling is a fundamental problem in computer vision, and there has been remarkable progress in this direction with the emergence of deep learning techniques. Variational Autoencoders (VAEs) [13, 28] formulated the problem with probabilistic graphical models whose goal was to maximize the lower bound of the data likelihood. Autoregressive models (e.g., PixelRNN) [33] that utilized neural networks to model the conditional distribution of the pixel space have also generated appealing synthetic images. Recently, Generative Adversarial Networks (GANs) [8] have shown promising performance for generating sharper images, but training instability makes it hard for GAN models to generate high-resolution (e.g., 256×256) images. Several techniques [23, 29, 18, 1, 3] have been proposed to stabilize the training process and generate compelling results. An energy-based GAN [38] has also been proposed for more stable training behavior.
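For reference, the two-player minimax objective from [8] that these GAN variants all build on is:

```latex
\min_G \max_D \; V(D, G) =
\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] +
\mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The training instability noted above arises from optimizing this saddle-point game with alternating gradient updates, which is what the stabilization techniques cited here try to address.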
Contributions
  • Proposes Stacked Generative Adversarial Networks to generate 256×256 photo-realistic images conditioned on text descriptions
  • Introduces a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold
  • Proposes novel Stacked Generative Adversarial Networks for synthesizing photo-realistic images from text descriptions
  • Proposes a simple yet effective stacked structure
  • Introduces a Conditioning Augmentation technique to produce additional conditioning variables ĉ
Reference
  • [1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
  • [2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017.
  • [3] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. In ICLR, 2017.
  • [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
  • [5] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
  • [6] C. Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016.
  • [7] J. Gauthier. Conditional generative adversarial networks for convolutional face generation. Technical report, 2015.
  • [8] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [10] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In CVPR, 2017.
  • [11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [13] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • [14] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
  • [15] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [17] E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
  • [18] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
  • [19] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
  • [20] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.
  • [21] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
  • [22] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
  • [23] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [24] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
  • [25] S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
  • [26] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
  • [27] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. Technical report, 2016.
  • [28] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • [29] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • [30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
  • [31] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. In ICLR, 2017.
  • [32] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.
  • [33] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
  • [34] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. In NIPS, 2016.
  • [35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [36] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
  • [37] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. In ECCV, 2016.
  • [38] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.