AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Keywords:
deep attentional multimodal similarity model, image generation, COCO dataset, attentional generative network, image-text matching loss

Abstract:

In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize fine-grained details at different sub-regions of the image by paying attention to the relevant words in the natural language description. …

Introduction
  • Generating images according to natural language descriptions is a fundamental problem in many applications, such as art generation and computer-aided design.
  • A commonly used approach is to encode the whole text description into a global sentence vector as the condition for GAN-based image generation [20, 18, 36, 37].
  • However, conditioning the GAN only on the global sentence vector lacks important fine-grained information at the word level, which prevents the generation of high-quality images.
  • This problem becomes even more severe when generating complex scenes such as those in the COCO dataset [14].
  • The deep attentional multimodal similarity model (DAMSM) provides an additional fine-grained image-text matching loss for training the generator, as sketched immediately below.
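The following is a minimal PyTorch sketch of how such a word-region matching score can be computed: word features attend over image sub-region features, and the pooled word-context relevance acts as the image-text similarity. The tensor shapes, the smoothing factors gamma1/gamma2, and the function name are illustrative assumptions rather than the authors' exact implementation; the batch-level softmax matching loss built on top of this score is omitted.

```python
# Illustrative sketch (not the authors' code) of a DAMSM-style word-region
# matching score: word features attend over image sub-region features, and
# the aggregated region-context vectors are compared with the words.
import torch
import torch.nn.functional as F

def word_region_score(words, regions, gamma1=5.0, gamma2=5.0):
    """words:   (T, D) word features from a text encoder (assumed shape)
       regions: (N, D) sub-region features from an image encoder (assumed shape)
       Returns a scalar image-text matching score."""
    # similarity between every word and every sub-region
    sim = F.normalize(words, dim=1) @ F.normalize(regions, dim=1).t()   # (T, N)
    # attention of each word over the sub-regions
    attn = torch.softmax(gamma1 * sim, dim=1)                           # (T, N)
    # region-context vector for each word
    context = attn @ regions                                            # (T, D)
    # relevance of each word to its context, pooled over the words
    rel = F.cosine_similarity(words, context, dim=1)                    # (T,)
    return torch.log(torch.exp(gamma2 * rel).sum()) / gamma2            # scalar

# e.g.: score = word_region_score(torch.randn(12, 256), torch.randn(289, 256))
# In DAMSM training, scores for matched vs. mismatched image-text pairs in a
# batch would feed a softmax-style matching loss; that step is omitted here.
```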
Highlights
  • Generating images according to natural language descriptions is a fundamental problem in many applications, such as art generation and computer-aided design
  • The first component is an attentional generative network, in which an attention mechanism is developed for the generator to draw different sub-regions of the image by focusing on the words that are most relevant to the sub-region being drawn (see the attention sketch after this list)
  • Since the inception score cannot reflect whether the generated image is well conditioned on the given text description, we propose to use R-precision, a common evaluation metric for ranking retrieval results, as a complementary evaluation metric for the text-to-image synthesis task
  • We present a deep attentional multimodal similarity model to compute the fine-grained image-text matching loss for training the generator of the Attentional Generative Adversarial Network
  • Our Attentional Generative Adversarial Network significantly outperforms state-of-the-art GAN models, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset
  • Extensive experimental results demonstrate the effectiveness of the proposed attention mechanisms in the Attentional Generative Adversarial Network, which is especially critical for text-to-image generation for complex scenes
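The generator-side attention highlighted above can be pictured with the following sketch: each sub-region of the current hidden feature map attends over the (projected) word features and receives a word-context vector that conditions what is drawn there. Shapes and the function name are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch of generator-side word attention: every sub-region of
# the hidden image feature map attends over the word features, producing a
# word-context vector that conditions what is drawn at that location.
import torch

def attn_word_context(hidden, words):
    """hidden: (B, D, H, W) hidden image features at the current stage (assumed shape)
       words:  (B, D, T)    word features projected to the same dimension (assumed shape)
       Returns word-context features of shape (B, D, H, W)."""
    B, D, H, W = hidden.shape
    h = hidden.view(B, D, H * W)                       # (B, D, N) with N sub-regions
    scores = torch.bmm(h.transpose(1, 2), words)       # (B, N, T): region-word scores
    beta = torch.softmax(scores, dim=2)                # attention over words, per region
    context = torch.bmm(words, beta.transpose(1, 2))   # (B, D, N)
    return context.view(B, D, H, W)

# The next-stage generator would then consume hidden and context together
# (e.g., concatenated along the channel axis) to refine the image.
```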
Methods
  • On the COCO dataset, increasing the value of λ from 0.1 to 50 allows the AttnGAN1 to achieve both a high inception score and a high R-precision rate
  • This comparison demonstrates that properly increasing the weight of L_DAMSM helps to generate higher-quality images that are better conditioned on the given text descriptions (see the objective sketch after this list).
  • In the experiments, the authors do not observe any collapsed nonsensical mode in the visualization of AttnGAN-generated images
  • It indicates that, with extra supervision, the fine-grained image-text matching loss helps to stabilize the training process of the AttnGAN.
  • When the DAMSM loss is removed, the inception score and R-precision drop to 3.98 and 10.37%, respectively, which further demonstrates the effectiveness of the proposed L_DAMSM
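As a small sketch of how the λ trade-off above enters training, the generator objective combines the per-stage adversarial losses with the DAMSM matching loss weighted by λ. The names below are placeholders; the individual loss terms are assumed to be computed elsewhere.

```python
# A minimal sketch of the trade-off studied above. Variable names are
# placeholders, not the paper's code.
def generator_objective(stage_adv_losses, damsm_loss, lam):
    """Total generator loss: sum of per-stage adversarial losses plus
    lam * L_DAMSM, the fine-grained image-text matching term."""
    return sum(stage_adv_losses) + lam * damsm_loss

# e.g., sweeping lam over {0.1, 1, 5, 10, 50} reproduces the trade-off above
```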
Results
  • Since the inception score cannot reflect whether the generated image is well conditioned on the given text description, the authors propose to use R-precision, a common evaluation metric for ranking retrieval results, as a complementary evaluation metric for the text-to-image synthesis task.
  • The authors rank the candidate text descriptions for each generated image in descending similarity and check how many of the top r retrieved descriptions are relevant to compute the R-precision (a sketch of this protocol follows the list).
  • To compute the inception score and the R-precision, each model generates 30,000 images from randomly selected unseen text descriptions.
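The R-precision protocol above can be sketched as follows: each generated image is embedded, the candidate captions are ranked by cosine similarity, and the fraction of relevant captions among the top r is reported. The embedding model, candidate-pool construction, and names below are illustrative assumptions.

```python
# Illustrative sketch of the R-precision protocol described above: a generated
# image is used to rank a pool of candidate captions by similarity, and we
# check how many of the top-r retrieved captions are truly relevant.
import numpy as np

def r_precision(image_emb, caption_embs, relevant_idx, r=1):
    """image_emb:    (D,)   embedding of one generated image
       caption_embs: (C, D) embeddings of C candidate captions
       relevant_idx: set of indices of the ground-truth (relevant) captions
       r:            number of top-ranked captions to inspect"""
    sims = caption_embs @ image_emb / (
        np.linalg.norm(caption_embs, axis=1) * np.linalg.norm(image_emb) + 1e-8)
    top_r = np.argsort(-sims)[:r]                      # descending similarity
    return len(set(top_r.tolist()) & relevant_idx) / r

# Toy example: one relevant caption among 100 candidates, r = 1
emb_dim, num_candidates = 256, 100
rng = np.random.default_rng(0)
caps = rng.standard_normal((num_candidates, emb_dim))
img = caps[0] + 0.1 * rng.standard_normal(emb_dim)     # image close to caption 0
print(r_precision(img, caps, relevant_idx={0}, r=1))   # -> 1.0 in this toy case
```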
Conclusion
  • An Attentional Generative Adversarial Network, named AttnGAN, is proposed for fine-grained text-to-image synthesis.
  • The authors build a novel attentional generative network for the AttnGAN to generate high-quality images through a multi-stage process (a minimal multi-stage sketch follows this list).
  • The authors present a deep attentional multimodal similarity model to compute the fine-grained image-text matching loss for training the generator of the AttnGAN.
  • Extensive experimental results demonstrate the effectiveness of the proposed attention mechanisms in the AttnGAN, which is especially critical for text-to-image generation for complex scenes
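A minimal sketch of the coarse-to-fine, multi-stage flow referenced above: a first stage maps noise and the sentence embedding to a small feature map, and a refinement stage upsamples it conditioned on the word-context map from the attention sketch earlier. Layer sizes, resolutions, and module names are assumptions, not the AttnGAN architecture.

```python
# Illustrative multi-stage sketch (not the authors' architecture); reuses
# attn_word_context from the attention sketch above.
import torch
import torch.nn as nn

D, T = 32, 12                                   # feature dim and word count (assumed)

f0 = nn.Sequential(nn.Linear(100 + D, D * 8 * 8), nn.ReLU())    # noise + sentence -> features
refine = nn.Sequential(                          # one refinement stage, conditioned on word context
    nn.Conv2d(2 * D, D, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode='nearest'))
to_rgb = nn.Conv2d(D, 3, 3, padding=1)           # hidden features -> RGB image (shared here for brevity)

z = torch.randn(4, 100)                          # noise vectors
sent = torch.randn(4, D)                         # sentence embeddings (assumed dim)
words = torch.randn(4, D, T)                     # word embeddings (assumed dims)

h = f0(torch.cat([z, sent], dim=1)).view(-1, D, 8, 8)   # coarse 8x8 hidden features
coarse = to_rgb(h)                                       # stage-0 image
context = attn_word_context(h, words)            # word-context map (sketch above)
h = refine(torch.cat([h, context], dim=1))       # refined 16x16 hidden features
fine = to_rgb(h)                                 # stage-1 image, higher resolution
print(coarse.shape, fine.shape)                  # (4, 3, 8, 8) and (4, 3, 16, 16)
```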
Tables
  • Table1: Statistics of datasets
  • Table2: The best inception score and the corresponding R-precision rate of each AttnGAN model on CUB (top six rows) and COCO (the last row) test sets. More results in Figure 3
  • Table3: Inception scores by state-of-the-art GAN models [20, 18, 36, 37, 16] and our AttnGAN on CUB and COCO test sets
Related work
  • Generating high resolution images from text descriptions, though very challenging, is important for many practical applications such as art generation and computer-aided design. Recently, great progress has been achieved in this direction with the emergence of deep generative models [12, 27, 6]. Mansimov et al. [15] built the alignDRAW model, extending the Deep Recurrent Attention Writer (DRAW) [7] to iteratively draw image patches while attending to the relevant words in the caption. Nguyen et al. [16] proposed an approximate Langevin approach to generate images from captions. Reed et al. [21] used conditional PixelCNN [27] to synthesize images from text with a multi-scale model structure. Compared with other deep generative models, Generative Adversarial Networks (GANs) [6] have shown great performance for generating sharper samples [17, 3, 23, 13, 10, 35, 24, 34, 39, 40]. Reed et al. [20] first showed that the conditional GAN was capable of synthesizing plausible images from text descriptions. Their follow-up work [18] also demonstrated that the GAN was able to generate better samples by incorporating additional conditions (e.g., object locations). Zhang et al. [36, 37] stacked several GANs for text-to-image synthesis and used different GANs to generate images of different sizes. However, all of their GANs are conditioned on the global sentence vector, missing fine-grained word-level information for image generation.
Funding
  • Proposes an Attentional Generative Adversarial Network that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation
Reference
  • A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra. VQA: visual question answering. IJCV, 123(1):4–31, 2017. 1
  • D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014. 2
  • E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015. 2
  • H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015. 1, 4
  • Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In CVPR, 2017. 1
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 1, 2
  • K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015. 2
  • X. He, L. Deng, and W. Chou. Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5):14–36, 2008.
  • P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. 4
  • P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. 2
  • B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5(3):257–265, 1997. 4
  • D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014. 2
  • C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image superresolution using a generative adversarial network. In CVPR, 2017. 2
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 5
  • E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016. 2
  • A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017. 2, 5, 7
  • A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 2, 8
  • S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016. 1, 2, 5, 7
  • S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016. 1, 6
  • S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016. 1, 2, 5, 7
  • S. E. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In ICML, 2017. 2
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015. 4
  • T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016. 2, 5
  • T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving gans using optimal transport. In ICLR, 2018. 2
  • M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11):2673–2681, 1997. 4
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 4
  • A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. In NIPS, 2016. 2
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv:1706.03762, 2017. 2
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011. 5
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015. 1, 2
  • Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In CVPR, 2016. 1, 2
  • H. Zhang and K. Dana. Multi-style generative network for real-time transfer. arXiv:1703.06953, 2017. 1
  • H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In CVPR, 2018. 1
  • H. Zhang and V. M. Patel. Densely connected pyramid dehazing network. In CVPR, 2018. 2
  • H. Zhang, V. Sindagi, and V. M. Patel. Image de-raining using a conditional generative adversarial network. arXiv:1701.05957, 2017. 2
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017. 1, 2, 3, 5, 7
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. arXiv: 1710.10916, 2017. 1, 2, 3, 5, 7, 8
  • Z. Zhang, Y. Xie, F. Xing, M. Mcgough, and L. Yang. Mdnet: A semantically and visually interpretable medical image diagnosis network. In CVPR, 2017. 2
  • Z. Zhang, Y. Xie, and L. Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR, 2018. 2
  • Z. Zhang, L. Yang, and Y. Zheng. Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. In CVPR, 2018. 2
  • Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018. 1