Self-Attention Generative Adversarial Networks

arXiv: Machine Learning, abs/1805.08318, 2018.

Keywords:
image generation, long-range dependency, spectral normalization, self-attention generative adversarial networks, Generative Adversarial Networks

Abstract:

In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations.

Introduction
  • Image synthesis is an important problem in computer vision. There has been remarkable progress in this direction with the emergence of Generative Adversarial Networks (GANs) [5].
  • Since the convolution operator has a local receptive field, long-range dependencies can only be processed after passing through several convolutional layers.
  • This could prevent learning about long-term dependencies for a variety of reasons: a small model may not be able to represent them, optimization algorithms may have trouble discovering parameter values that carefully coordinate multiple layers to capture these dependencies, and these parameterizations may be statistically brittle and prone to failure when applied to previously unseen inputs.
  • The self-attention module calculates the response at a position as a weighted sum of the features at all positions, where the weights – or attention vectors – are calculated with only a small computational cost.
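The weighted-sum computation in the last bullet can be sketched in a few lines. Below is a minimal NumPy illustration of a SAGAN-style self-attention block over a flattened feature map; the reduced query/key dimension and the zero-initialized scalar gamma follow the paper's description, but the projections here are plain random matrices standing in for learned 1×1 convolutions, so treat it as an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def self_attention(x, w_f, w_g, w_h, gamma=0.0):
    """SAGAN-style self-attention over a flattened feature map.

    x    : (N, C) features, one row per spatial position.
    w_f  : (C, C_bar) query projection (the paper reduces C_bar below C).
    w_g  : (C, C_bar) key projection.
    w_h  : (C, C) value projection.
    gamma: learnable scalar, initialized to 0 so the block starts as an
           identity mapping and gradually learns to rely on attention.
    """
    f = x @ w_f                      # queries, (N, C_bar)
    g = x @ w_g                      # keys,    (N, C_bar)
    h = x @ w_h                      # values,  (N, C)
    scores = g @ f.T                 # scores[j, i]: position j attending to i
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    beta = np.exp(scores)
    beta /= beta.sum(axis=1, keepdims=True)       # row-wise softmax attention map
    o = beta @ h                     # weighted sum over ALL positions, not a local window
    return gamma * o + x             # residual connection back to the input

# Toy usage: a 4x4 feature map with 8 channels, flattened to 16 positions.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
w_f = rng.standard_normal((8, 2))
w_g = rng.standard_normal((8, 2))
w_h = rng.standard_normal((8, 8))
y = self_attention(x, w_f, w_g, w_h, gamma=1.0)
```

Because gamma starts at zero, inserting the block into an existing network initially leaves its behavior unchanged; attention is phased in only as gamma is learned.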
Highlights
  • Image synthesis is an important problem in computer vision.
  • It is important to understand that the Inception score has serious limitations: it is intended primarily to ensure that the model generates samples that can be confidently recognized as belonging to a specific class, and that the model generates samples from many classes, not necessarily to assess realism of details or intra-class diversity.
  • 50k samples are randomly generated for each model to compute the Inception score and Fréchet Inception distance.
  • We proposed Self-Attention Generative Adversarial Networks (SAGANs), which incorporate a self-attention mechanism into the Generative Adversarial Network framework.
  • We show that spectral normalization applied to the generator stabilizes Generative Adversarial Network training and that the two-timescale update rule (TTUR) speeds up training of regularized discriminators.
  • The Self-Attention Generative Adversarial Network achieves state-of-the-art performance on class-conditional image generation on ImageNet.
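The Inception-score caveat above can be made concrete. Here is a hedged NumPy sketch of the score, IS = exp(E_x KL(p(y|x) || p(y))), computed from classifier softmax outputs; the toy probability arrays stand in for real Inception-v3 predictions and are chosen to show that confident, class-diverse predictions max out the score even when each class is represented by a single repeated sample.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from classifier outputs.

    probs: (n_samples, n_classes) softmax predictions p(y|x); the standard
    score uses an Inception-v3 classifier, simulated here with toy arrays.
    IS = exp( mean_x KL( p(y|x) || p(y) ) ), where p(y) is the marginal.
    """
    p_y = probs.mean(axis=0, keepdims=True)       # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread evenly over 10 classes -> maximal score of 10.
is_confident = inception_score(np.eye(10))
# Unconfident (uniform) predictions -> minimal score of 1.
is_uniform = inception_score(np.full((10, 10), 0.1))
# The caveat: one "memorized" sample per class, each repeated 5 times,
# still scores 10 -- the metric is blind to intra-class diversity.
is_memorized = inception_score(np.repeat(np.eye(10), 5, axis=0))
```

The last case is exactly the limitation the highlight describes: a model that emits one perfect image per class would score as well as one with rich intra-class variety.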
Methods
  • To evaluate the proposed methods, the authors conducted extensive experiments on the LSVRC2012 (ImageNet) dataset [25].
  • In Section 5.1, the authors present experiments designed to evaluate the effectiveness of the two proposed techniques for stabilizing GAN training.
  • The proposed self-attention mechanism is investigated in Section 5.2.
  • SAGAN is compared with state-of-the-art methods [19, 17] on image generation in Section 5.3.
Results
  • The authors choose the Inception score (IS) [26] and the Fréchet Inception distance (FID) [8] for quantitative evaluation.
  • FID is a more principled and comprehensive metric, and has been shown to be more consistent with human evaluation in assessing the realism and variation of the generated samples [8].
  • FID calculates the Wasserstein-2 distance between the generated images and the real images in the feature space of an Inception-v3 network.
  • 50k samples are randomly generated for each model to compute the Inception score and FID.
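The FID computation described above has a closed form: the Wasserstein-2 distance between two Gaussians fit to the real and generated feature sets. A NumPy-only sketch follows (it uses a symmetric-PSD matrix square root in place of the more common scipy.linalg.sqrtm; extracting the Inception-v3 features themselves is outside the snippet's scope):

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)          # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(feats_real, feats_fake):
    """Frechet Inception distance between two sets of feature vectors
    (rows = samples, e.g. Inception-v3 activations).

    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}),
    the Wasserstein-2 distance between Gaussians fit to each set.
    """
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    # Tr((S1 S2)^{1/2}) computed via the similar symmetric form A S2 A, A = S1^{1/2}
    a = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(a @ s2 @ a)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

# Toy usage: same covariance, mean shifted by 1 in each of 4 dimensions,
# so the distance equals the squared mean shift (4.0 up to numerics).
rng = np.random.default_rng(1)
feats_real = rng.standard_normal((500, 4))
feats_fake = feats_real + 1.0
d = fid(feats_real, feats_fake)
```

Lower is better: identical feature distributions give an FID of 0, which is why FID tracks both realism and variation of samples.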
Conclusion
  • The authors proposed Self-Attention Generative Adversarial Networks (SAGANs), which incorporate a self-attention mechanism into the GAN framework.
  • The self-attention module is effective in modeling long-range dependencies.
  • The authors show that spectral normalization applied to the generator stabilizes GAN training and that TTUR speeds up training of regularized discriminators.
  • SAGAN achieves state-of-the-art performance on class-conditional image generation on ImageNet.
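The TTUR mentioned above is an optimizer configuration rather than an architectural change: the discriminator is trained with a larger learning rate (0.0004) than the generator (0.0001) while keeping 1:1 updates. A deliberately tiny sketch on a toy bilinear min-max objective f(g, d) = g * d follows; the real model uses Adam and real GAN losses, so only the imbalanced-rate schedule here reflects the paper.

```python
# Toy two-timescale update rule (TTUR) sketch: the discriminator parameter d
# takes larger steps than the generator parameter g, with one D update per
# G update. The 4:1 learning-rate ratio matches the paper's 0.0004 / 0.0001
# setting; the bilinear objective f(g, d) = g * d is purely illustrative.
lr_g, lr_d = 1e-4, 4e-4
g, d = 1.0, 1.0
for _ in range(1000):
    grad_g = d                                    # df/dg: generator minimizes f
    grad_d = g                                    # df/dd: discriminator maximizes f
    g, d = g - lr_g * grad_g, d + lr_d * grad_d   # simultaneous 1:1 updates
```

The motivation, following Heusel et al. [8], is that a faster-moving discriminator compensates for the slowdown introduced by regularization, so fewer discriminator steps are needed per generator step.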
Summary
  • Image synthesis is an important problem in computer vision. There has been remarkable progress in this direction with the emergence of Generative Adversarial Networks (GANs) [5].
  • We propose Self-Attention Generative Adversarial Networks (SAGANs), which introduce a self-attention mechanism into convolutional GANs. The self-attention module is complementary to convolutions and helps with modeling long range, multi-level dependencies across image regions.
  • We propose enforcing good conditioning of GAN generators using the spectral normalization technique that has previously been applied only to the discriminator [16].
  • Combined with the projection-based discriminator [17], the spectrally normalized model greatly improves class-conditional image generation on ImageNet.
  • Miyato et al. [16] originally proposed stabilizing the training of GANs by applying spectral normalization to the discriminator network.
  • Experiments are conducted to evaluate the effectiveness of the proposed stabilization techniques, i.e., applying spectral normalization (SN) to the generator and utilizing imbalanced learning rates (TTUR).
  • As shown in the middle sub-figures of Figure 3, adding SN to both the generator and the discriminator greatly stabilized our model “SN on G/D”, even when it was trained with 1:1 balanced updates.
  • When we apply the imbalanced learning rates to train the discriminator and the generator, the quality of images generated by our model “SN on G/D+TTUR” improves monotonically during the whole training process.
  • In the rest of the experiments, all models use spectral normalization for both the generator and discriminator and use the imbalanced learning rates to train the generator and the discriminator with 1:1 updates.
  • The attention mechanism gives more power to both generator and discriminator to directly model the long-range dependencies in the feature maps.
  • The comparison of our SAGAN and the baseline model without attention (2nd column of Table 1) demonstrates the effectiveness of the proposed self-attention mechanism.
  • The training is not stable when we replace the self-attention block with the residual block in 8×8 feature maps, which leads to a significant decrease in performance (e.g., FID increases from 22.98 to 42.13).
  • SAGAN is compared with state-of-the-art GAN models [19, 17] for class conditional image generation on ImageNet. As shown in Table 2, our proposed SAGAN achieves the best Inception score and FID.
  • The lower FID (18.65) achieved by SAGAN indicates that SAGAN can better approximate the original image distribution by using the self-attention module to model the global dependencies between image regions.
  • We show that spectral normalization applied to the generator stabilizes GAN training and that TTUR speeds up training of regularized discriminators.
  • SAGAN achieves state-of-the-art performance on class-conditional image generation on ImageNet. We thank Surya Bhupatiraju for feedback on drafts of this article.
Tables
  • Table1: Comparison of Self-Attention and Residual block on GANs. These blocks are added into different layers of the network. All models have been trained for one million iterations, and the best Inception scores (IS) and Fréchet Inception distance (FID) are reported
  • Table2: Comparison of the proposed SAGAN with state-of-the-art GAN models [19, 17] for class conditional image generation on ImageNet. FID of SNGAN-projection is calculated from officially released weights.
Related work
  • Generative Adversarial Networks. GANs have achieved great success in various image generation tasks, including image-to-image translation [9, 40, 29, 14], image super-resolution [12, 28] and text-to-image synthesis [24, 23, 37]. Despite this success, the training of GANs is known to be unstable and sensitive to the choices of hyper-parameters. Several works have attempted to stabilize the GAN training dynamics and improve the sample diversity by designing new network architectures [22, 37, 10], modifying the learning objectives and dynamics [1, 27, 15, 3, 39], adding regularization methods [7, 16] and introducing heuristic tricks [26, 19]. Recently, Miyato et al. [16] proposed limiting the spectral norm of the weight matrices in the discriminator in order to constrain the Lipschitz constant of the discriminator function. Combined with the projection-based discriminator [17], the spectrally normalized model greatly improves class-conditional image generation on ImageNet.
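The spectral-norm constraint above is implemented in practice with a cheap power iteration that tracks the largest singular value of each weight matrix. A NumPy sketch of the idea (following Miyato et al. [16] in spirit; the variable names and the explicit n_iters argument are this sketch's own, and the original runs a single iteration per training step while reusing the vector u across steps):

```python
import numpy as np

def spectral_normalize(w, u, n_iters=1):
    """Spectrally normalize a weight matrix: estimate its largest singular
    value sigma(W) with power iteration and return W / sigma, so the
    corresponding linear layer becomes (approximately) 1-Lipschitz.

    w: (out_dim, in_dim) weight matrix.
    u: persistent estimate of the top left singular vector (n_iters >= 1).
    """
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w @ v                # Rayleigh-quotient estimate of sigma(W)
    return w / sigma, u

# Toy usage: with many iterations the estimate is accurate, so the
# normalized matrix has spectral norm close to 1.
rng = np.random.default_rng(2)
w = rng.standard_normal((6, 4))
u = rng.standard_normal(6)
w_sn, u = spectral_normalize(w, u, n_iters=200)
```

In training, one iteration per step suffices because the weights change slowly, which is what makes the constraint nearly free; SAGAN's contribution here is applying it to the generator as well as the discriminator.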
Funding
  • Proposes the Self-Attention Generative Adversarial Network, which allows attention-driven, long-range dependency modeling for image generation tasks.
  • Proposes Self-Attention Generative Adversarial Networks, which introduce a self-attention mechanism into convolutional GANs.
  • Proposes enforcing good conditioning of GAN generators using the spectral normalization technique that has previously been applied only to the discriminator.
Reference
  • M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
  • D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
  • T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. In ICLR, 2017.
  • J. Cheng, L. Dong, and M. Lapata. Long short-term memory-networks for machine reading. In EMNLP, 2016.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6629–6640, 2017.
  • P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • J. H. Lim and J. C. Ye. Geometric GAN. arXiv:1705.02894, 2017.
  • M. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
  • L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • T. Miyato and M. Koyama. cGANs with projection discriminator. In ICLR, 2018.
  • A. Odena, J. Buckman, C. Olsson, T. B. Brown, C. Olah, C. Raffel, and I. Goodfellow. Is generator conditioning causally related to GAN performance? In ICML, 2018.
  • A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICLR, 2017.
  • A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit. A decomposable attention model for natural language inference. In EMNLP, 2016.
  • N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, and A. Ku. Image transformer. arXiv:1802.05751, 2018.
  • A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
  • S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • T. Salimans, H. Zhang, A. Radford, and D. N. Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
  • C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. In ICLR, 2017.
  • Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.
  • D. Tran, R. Ranganath, and D. M. Blei. Deep and hierarchical implicit models. arXiv:1702.08896, 2017.
  • A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. In NIPS, 2016.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.
  • X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv:1710.10916, 2017.
  • J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
  • J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.