# Adversarial Feature Learning

ICLR, Volume abs/1605.09782, 2017.

Abstract:

The ability of the Generative Adversarial Networks (GANs) framework to learn generative models mapping from simple latent distributions to arbitrarily complex data distributions has been demonstrated empirically, with compelling results showing generators learn to "linearize semantics" in the latent space of such models. Intuitively, su...

Introduction

- Deep convolutional networks have become a staple of the modern computer vision pipeline.
- In other perceptual domains such as natural language processing or speech recognition, deep networks have proven highly effective as well (Bahdanau et al, 2015; Sutskever et al, 2014; Vinyals et al, 2015; Graves et al, 2013).
- All of these recent results rely on a supervisory signal from large-scale databases of hand-labeled data, ignoring much of the useful information present in the structure of the data itself.
- When trained on databases of natural images, GANs produce impressive results (Radford et al, 2016; Denton et al, 2015).

Highlights

- Deep convolutional networks have become a staple of the modern computer vision pipeline
- We show that the Bidirectional Generative Adversarial Networks (BiGAN) objective forces the encoder E to do exactly this: in order to fool the discriminator at a particular z, the encoder must invert the generator at that z, such that E(G(z)) = z.
- We evaluate the feature learning capabilities of Bidirectional Generative Adversarial Networks by first training them unsupervised as described in Section 3.4, then transferring the encoder’s learned feature representations for use in auxiliary supervised learning tasks.
- To demonstrate that Bidirectional Generative Adversarial Networks are able to learn meaningful feature representations both on arbitrary data vectors, where the model is agnostic to any underlying structure, as well as very high-dimensional and complex distributions, we evaluate on both permutation-invariant MNIST (LeCun et al, 1998) and on the high-resolution natural images of ImageNet (Russakovsky et al, 2015).
- Besides the Bidirectional Generative Adversarial Networks framework presented above, we considered alternative approaches to learning feature representations using different Generative Adversarial Networks variants.
- We report results on each of these tasks in Table 3, comparing Bidirectional Generative Adversarial Networks with contemporary approaches to unsupervised (Krähenbühl et al, 2016) and self-supervised (Doersch et al, 2015; Agrawal et al, 2015; Wang & Gupta, 2015; Pathak et al, 2016) feature learning in the visual domain, as well as the baselines discussed in Section 4.1.
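
The inversion property E(G(z)) = z and the two kinds of (x, z) pairs the discriminator sees can be illustrated with a small NumPy sketch. Everything here is a toy assumption rather than the authors' setup: the linear generator `G`, its pseudo-inverse encoder `E`, the constant discriminator `D_uninformed`, and the helper name `bigan_value` are hypothetical and chosen only so the quantities are easy to check by hand.

```python
import numpy as np

def bigan_value(D, E, G, x_real, z_prior):
    """Minibatch estimate of the value function for deterministic E and G."""
    # Encoder pairs: real data together with its encoding, (x, E(x)).
    d_enc = D(x_real, E(x_real))
    # Generator pairs: generated data together with its latent, (G(z), z).
    d_gen = D(G(z_prior), z_prior)
    return np.mean(np.log(d_enc)) + np.mean(np.log(1.0 - d_gen))

# Toy stand-ins (hypothetical, for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 5))            # linear generator: latent dim 2 -> data dim 5
G = lambda z: z @ A
E = lambda x: x @ np.linalg.pinv(A)    # exact left-inverse of G on its range
D_uninformed = lambda x, z: np.full(len(x), 0.5)  # cannot tell the pairs apart

z = rng.normal(size=(64, 2))
x = G(rng.normal(size=(64, 2)))        # "real" data drawn from the generator's own distribution
value = bigan_value(D_uninformed, E, G, x, z)
inversion_error = np.max(np.abs(E(G(z)) - z))
```

When the discriminator outputs 0.5 for every pair, the estimate equals -log 4, and in this toy case the encoder inverts the generator up to floating-point error.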

Methods

- Besides the BiGAN framework presented above, the authors considered alternative approaches to learning feature representations using different GAN variants.

- Discriminator: The discriminator D in a standard GAN takes data samples x ∼ pX as input, making its learned intermediate representations natural candidates as feature representations for related tasks.
- Latent regressor: A drawback of the latent regressor approach is that, unlike the encoder in a BiGAN, the latent regressor encoder E is trained only on generated samples G(z), and never “sees” real data x ∼ pX.
- While this may not be an issue in the theoretical optimum where pG(x) = pX(x) exactly – i.e., G perfectly generates the data distribution pX – in practice, for highly complex data distributions pX, such as the distribution of natural images, the generator will almost never achieve this perfect result.
- The fact that the real data x are never input to this type of encoder limits its utility as a feature representation for related tasks, as shown later.
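
A minimal sketch of the latent regressor idea, under loudly stated assumptions: the generator is a fixed random linear map, the encoder is linear, and the regressor minimizes a squared-error reconstruction loss between z and E(G(z)). None of these choices come from the text above; they are picked so the example is small and verifiable. The point it illustrates is that the training set contains only generated pairs (G(z), z) and no real data.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim, n_train = 3, 8, 500

# Hypothetical fixed, already-trained generator: G(z) = z @ W_g.
W_g = rng.normal(size=(latent_dim, data_dim))

def G(z):
    return z @ W_g

# Training set for the latent regressor: only (G(z), z) pairs, never real x.
z_train = rng.normal(size=(n_train, latent_dim))
x_generated = G(z_train)

# Fit a linear encoder E(x) = x @ W_e by least squares on ||z - E(G(z))||^2.
W_e, *_ = np.linalg.lstsq(x_generated, z_train, rcond=None)

def E(x):
    return x @ W_e

# On fresh generated samples the regressor recovers z.
z_test = rng.normal(size=(100, latent_dim))
recon_error = np.mean((E(G(z_test)) - z_test) ** 2)
```

The fit says nothing about how E behaves on inputs outside the range of G, which is where real data may lie when pG differs from pX.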

Results

- The authors evaluate the feature learning capabilities of BiGANs by first training them unsupervised as described in Section 3.4, then transferring the encoder’s learned feature representations for use in auxiliary supervised learning tasks.
- To demonstrate that BiGANs are able to learn meaningful feature representations both on arbitrary data vectors, where the model is agnostic to any underlying structure, as well as very high-dimensional and complex distributions, the authors evaluate on both permutation-invariant MNIST (LeCun et al, 1998) and on the high-resolution natural images of ImageNet (Russakovsky et al, 2015).
- The BiGAN discriminator D(x, z) takes data x as its initial input, and at each linear layer thereafter, the latent representation z is transformed using a learned linear transformation to the hidden layer dimension and added to the non-linearity input.
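
The discriminator wiring described in the last bullet can be sketched as a plain NumPy forward pass. This is a sketch under assumptions, not the authors' implementation: the layer sizes, the leaky ReLU slope, the weight scale, the sigmoid output, and the choice to inject z into the final output layer as well are all illustrative, and the names `init_discriminator` and `discriminator` are hypothetical.

```python
import numpy as np

def leaky_relu(a, slope=0.2):
    return np.where(a > 0, a, slope * a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_discriminator(rng, x_dim, z_dim, hidden_dims):
    """Random weights; sizes and scales are illustrative assumptions."""
    params = []
    in_dim = x_dim
    for i, out_dim in enumerate(list(hidden_dims) + [1]):
        layer = {"W": rng.normal(scale=0.1, size=(in_dim, out_dim)),
                 "b": np.zeros(out_dim)}
        if i > 0:
            # Layers after the first get a learned linear map of z.
            layer["U"] = rng.normal(scale=0.1, size=(z_dim, out_dim))
        params.append(layer)
        in_dim = out_dim
    return params

def discriminator(params, x, z):
    h = x  # x is the initial input
    last = len(params) - 1
    for i, layer in enumerate(params):
        pre = h @ layer["W"] + layer["b"]
        if "U" in layer:
            # z is projected to the layer width and added before the non-linearity.
            pre = pre + z @ layer["U"]
        h = sigmoid(pre) if i == last else leaky_relu(pre)
    return h[:, 0]

rng = np.random.default_rng(0)
params = init_discriminator(rng, x_dim=784, z_dim=50, hidden_dims=[64, 32])
x = rng.normal(size=(4, 784))
z = rng.uniform(-1.0, 1.0, size=(4, 50))
scores = discriminator(params, x, z)
```

Adding the projected z to the pre-activation is equivalent to concatenating z to each layer's input, but it keeps the x pathway identical to an ordinary feed-forward network.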

Conclusion

- Despite making no assumptions about the underlying structure of the data, the BiGAN unsupervised feature learning framework offers a representation competitive with existing self-supervised and even weakly supervised feature learning approaches for visual feature learning, while still being a purely generative model with the ability to sample data x and predict latent representation z.
- BiGANs outperform the discriminator (D) and latent regressor (LR) baselines discussed in Section 4.1, confirming the intuition that these approaches may not perform well in the regime of highly complex data distributions such as that of natural images.
- Although existing self-supervised approaches have shown impressive performance and have thus far tended to outshine purely unsupervised approaches in the complex domain of high-resolution images, purely unsupervised approaches to feature learning or pre-training have several potential benefits.

Summary

- Deep convolutional networks have become a staple of the modern computer vision pipeline.
- In addition to the generator G from the standard GAN framework (Goodfellow et al, 2014), BiGAN includes an encoder E which maps data x to latent representations z.
- The encoder induces a distribution pE(z|x) = δ(z − E(x)) mapping data points x into the latent feature space of the generative model.
- We highlight some of the appealing theoretical properties of BiGANs.
- Definitions: Let pGZ(x, z) := pG(x|z)pZ(z) and pEX(x, z) := pE(z|x)pX(x) be the joint distributions modeled by the generator and encoder respectively.
- An important difference from the standard GAN is that BiGAN optimizes a Jensen-Shannon divergence between two joint distributions, pEX and pGZ, each defined over both data X and latent features Z.
- Theorem 3: The encoder and generator objective given an optimal discriminator, C(E, G) := max_D V(D, E, G), can be rewritten as an ℓ0 autoencoder loss function.
- To demonstrate that BiGANs are able to learn meaningful feature representations both on arbitrary data vectors, where the model is agnostic to any underlying structure, as well as very high-dimensional and complex distributions, we evaluate on both permutation-invariant MNIST (LeCun et al, 1998) and on the high-resolution natural images of ImageNet (Russakovsky et al, 2015).
- We report results on each of these tasks in Table 3, comparing BiGANs with contemporary approaches to unsupervised (Krähenbühl et al, 2016) and self-supervised (Doersch et al, 2015; Agrawal et al, 2015; Wang & Gupta, 2015; Pathak et al, 2016) feature learning in the visual domain, as well as the baselines discussed in Section 4.1.
- Despite making no assumptions about the underlying structure of the data, the BiGAN unsupervised feature learning framework offers a representation competitive with existing self-supervised and even weakly supervised feature learning approaches for visual feature learning, while still being a purely generative model with the ability to sample data x and predict latent representation z.
- BiGANs outperform the discriminator (D) and latent regressor (LR) baselines discussed in Section 4.1, confirming our intuition that these approaches may not perform well in the regime of highly complex data distributions such as that of natural images.
- We note that the results presented here constitute only a preliminary exploration of the space of model architectures possible under the BiGAN framework, and we expect results to improve significantly with advancements in generative image models and discriminative convolutional networks alike.
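
For reference, the minimax objective behind the quantities named in this summary can be written out as follows. This is a restatement in the notation of the definitions above (pX, pZ, pE, pG, pEX, pGZ), following the standard GAN value function extended to joint (x, z) pairs; the symbol JSD for the Jensen-Shannon divergence is shorthand introduced here.

```latex
\min_{G,E}\,\max_{D}\; V(D, E, G), \qquad
V(D, E, G) :=
  \mathbb{E}_{x \sim p_X}\Big[\mathbb{E}_{z \sim p_E(\cdot \mid x)}\big[\log D(x, z)\big]\Big]
+ \mathbb{E}_{z \sim p_Z}\Big[\mathbb{E}_{x \sim p_G(\cdot \mid z)}\big[\log\big(1 - D(x, z)\big)\big]\Big]

C(E, G) := \max_{D} V(D, E, G)
         = 2\,\mathrm{JSD}\big(p_{EX} \,\|\, p_{GZ}\big) - \log 4
```

The second line shows why the global minimum is reached exactly when the two joint distributions match, at which point C(E, G) = -log 4.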

- Table 1: One Nearest Neighbors (1NN) classification accuracy (%) on the permutation-invariant MNIST (LeCun et al, 1998) test set in the feature space learned by BiGAN, Latent Regressor (LR), Joint Latent Regressor (JLR), and an autoencoder (AE) using an ℓ1 or ℓ2 distance.
- Table 2: Classification accuracy (%) for the ImageNet LSVRC (Russakovsky et al, 2015) validation set with various portions of the network frozen, or reinitialized and trained from scratch, following the evaluation from Noroozi & Favaro (2016). In, e.g., the conv3 column, the first three layers – conv1 through conv3 – are transferred and frozen, and the last layers – conv4, conv5, and fully connected layers – are reinitialized and trained fully supervised for ImageNet classification. BiGAN is competitive with these contemporary visual feature learning methods, despite its generality. (*Results from Noroozi & Favaro (2016) are not directly comparable to those of the other methods as a different base convnet architecture with larger intermediate feature maps is used.)
- Table 3: Classification and Fast R-CNN (Girshick, 2015) detection results for the PASCAL VOC 2007 (Everingham et al, 2014) test set, and FCN (Long et al, 2015) segmentation results on the PASCAL VOC 2012 validation set, under the standard mean average precision (mAP) or mean intersection over union (mIU) metrics for each task. Classification models are trained with various portions of the AlexNet (Krizhevsky et al, 2012) model frozen. In the fc8 column, only the linear classifier (a multinomial logistic regression) is learned – in the case of BiGAN, on top of randomly initialized fully connected (FC) layers fc6 and fc7. In the fc6-8 column, all three FC layers are trained fully supervised with all convolution layers frozen. Finally, in the all column, the entire network is “fine-tuned”. BiGAN outperforms other unsupervised (unsup.) feature learning approaches, including the GAN-based baselines described in Section 4.1, and despite its generality, is competitive with contemporary self-supervised (self-sup.) feature learning approaches specific to the visual domain.
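
The 1NN evaluation named in the Table 1 caption amounts to labeling each test point with the label of its closest training point in the learned feature space. The sketch below is a generic illustration: the function name `one_nn_predict`, the toy features, and the choice of distance used for the neighbor lookup (Euclidean by default, Manhattan optionally) are assumptions and are not taken from the caption.

```python
import numpy as np

def one_nn_predict(train_feats, train_labels, test_feats, p=2):
    """Label each test feature with the label of its nearest training feature."""
    # Pairwise differences, shape (n_test, n_train, feature_dim).
    diffs = test_feats[:, None, :] - train_feats[None, :, :]
    if p == 1:
        dists = np.abs(diffs).sum(axis=-1)          # Manhattan distance
    elif p == 2:
        dists = np.sqrt((diffs ** 2).sum(axis=-1))  # Euclidean distance
    else:
        raise ValueError("p must be 1 or 2")
    return train_labels[np.argmin(dists, axis=1)]

# Toy features standing in for encoder outputs E(x) (hypothetical values).
train_feats = np.array([[0.0, 0.0], [10.0, 10.0]])
train_labels = np.array([0, 1])
test_feats = np.array([[1.0, 0.0], [9.0, 9.0]])
test_labels = np.array([0, 1])

predictions = one_nn_predict(train_feats, train_labels, test_feats)
accuracy_percent = 100.0 * np.mean(predictions == test_labels)
```

For a dataset the size of MNIST the full pairwise difference tensor would be too large to hold in memory, so the test set would need to be processed in chunks.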

Funding

- This work was supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Artificial Intelligence Research laboratory

Reference

- Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In ICCV, 2015.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
- Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
- Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
- Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv:1606.00704, 2016.
- Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes challenge: A retrospective. IJCV, 2014.
- Ross Girshick. Fast R-CNN. In ICCV, 2015.
- Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In ICML, 2013.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
- Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
- Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. In ICLR, 2016.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.
- Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
- Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
- Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- Ali Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, 2014.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. IJCV, 2015.
- Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, 2016.
- Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. Grammar as a foreign language. In NIPS, 2015.
- Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
- Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
