Context Encoders: Feature Learning by Inpainting
CVPR, pp. 2536-2544, 2016.
Our context encoders, trained to generate images conditioned on context, advance the state of the art in semantic inpainting and, at the same time, learn feature representations that are competitive with those of other models trained with auxiliary supervision.
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders – a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image and produce a plausible hypothesis for the missing parts.
- Our visual world is very diverse, yet highly structured, and humans have an uncanny ability to make sense of this structure.
- Consider the image shown in Figure 1a.
- Some of us can even draw it, as shown in Figure 1b.
- This ability comes from the fact that natural images, despite their diversity, are highly structured.
- We humans are able to understand this structure and make visual predictions even when seeing only parts of the scene.
- We introduce context encoders: convolutional neural networks that predict missing parts of a scene from their surroundings
- In Section 5.1, we present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions
- In Section 5.2, we demonstrate the transferability of our learned features to other tasks, using context encoders as a pretraining step for image classification, object detection, and semantic segmentation
- Random block: to prevent the network from latching onto the constant boundary of the masked region, the authors randomize the masking process. Random block masking, however, still leaves sharp boundaries that convolutional features could latch onto.
- Random region: to remove those boundaries completely, the authors experimented with removing arbitrary shapes from images, obtained from random masks in the PASCAL VOC 2012 dataset. They deform those shapes and paste them at arbitrary places in the other images.
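The block-masking scheme above can be sketched in a few lines. This is a minimal NumPy illustration: the block count, size ranges, and boolean-mask representation are assumptions, not the paper's exact implementation.

```python
import numpy as np

def random_block_mask(h, w, n_blocks=5, max_block=None, rng=None):
    """Drop several randomly placed rectangular blocks from an h x w image.

    Returns a boolean mask where True marks pixels that are kept.
    Block count and size ranges here are illustrative choices.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    max_block = max_block or max(h // 4, 1)
    keep = np.ones((h, w), dtype=bool)
    for _ in range(n_blocks):
        bh = int(rng.integers(h // 8 + 1, max_block + 1))
        bw = int(rng.integers(w // 8 + 1, max_block + 1))
        y = int(rng.integers(0, h - bh + 1))
        x = int(rng.integers(0, w - bw + 1))
        keep[y:y + bh, x:x + bw] = False  # this block becomes the region to inpaint
    return keep

mask = random_block_mask(128, 128)
dropped_fraction = 1.0 - mask.mean()
```

Randomizing block positions per image is what keeps the network from memorizing a fixed hole location, which motivates the schemes described above.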
- The authors evaluate the encoder features for their semantic quality and transferability to other image understanding tasks.
- In Section 5.1, the authors present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions.
- In Section 5.2, the authors demonstrate the transferability of the learned features to other tasks, using context encoders as a pretraining step for image classification, object detection, and semantic segmentation.
- The authors compare the results on these tasks with those of other unsupervised or self-supervised methods, demonstrating that the approach outperforms previous methods
- We show that it is possible to learn and predict this structure using convolutional neural networks (CNNs), a class of models that have recently shown success across a variety of image understanding tasks.
- We show that encoding just the context of an image patch and using the resulting feature to retrieve nearest neighbor contexts from a dataset produces patches which are semantically similar to the original patch.
- We further validate the quality of the learned feature representation by fine-tuning the encoder for a variety of image understanding tasks, including classification, object detection, and semantic segmentation.
- The context encoder can be useful as a better visual feature for computing nearest neighbors in nonparametric inpainting methods.
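As a sketch of that nearest-neighbor use, context features can be compared by cosine similarity. The vectors below stand in for context-encoder bottleneck activations, and the function name is illustrative:

```python
import numpy as np

def nearest_context(query_feat, db_feats):
    """Return the index of the database patch whose context feature is
    most similar (by cosine similarity) to the query's context feature."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    db = db_feats / (np.linalg.norm(db_feats, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(db @ q))

# Three mock context features; a query close to the first should retrieve it.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx = nearest_context(np.array([0.9, 0.1]), db)
```

In a nonparametric inpainting pipeline, the retrieved index would select a source patch to paste into the missing region.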
- Doersch et al. used the task of predicting the relative positions of neighboring patches within an image as a way to train unsupervised deep feature representations.
- We train our context encoders using an adversarial loss jointly with a reconstruction loss for generating inpainting results.
- While this works quite well for inpainting, the network learns low-level image features that latch onto the boundary of the central mask.
- We train context encoders with the joint loss function defined in Equation (3) for the task of inpainting the missing region.
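That joint objective can be sketched as a weighted sum of a masked L2 reconstruction term and a generator-side adversarial term. The 0.999/0.001 weighting follows the weighting reported in the paper; the function signature and the scalar discriminator output are simplifying assumptions.

```python
import numpy as np

def joint_inpainting_loss(pred, target, missing, d_on_pred,
                          lam_rec=0.999, lam_adv=0.001):
    """Joint loss = lam_rec * masked L2 + lam_adv * (-log D(pred)).

    `missing` is 1 on the dropped region and 0 elsewhere, so the
    reconstruction term only penalizes errors inside the hole;
    `d_on_pred` is the discriminator's probability that the
    completed image is real.
    """
    l_rec = np.mean((missing * (pred - target)) ** 2)
    l_adv = -np.mean(np.log(np.clip(d_on_pred, 1e-8, 1.0)))
    return lam_rec * l_rec + lam_adv * l_adv

# A perfect reconstruction that fully fools the discriminator has zero loss.
zero = joint_inpainting_loss(np.zeros((4, 4)), np.zeros((4, 4)),
                             np.ones((4, 4)), np.array([1.0]))
```

The heavy reconstruction weight keeps the output anchored to the ground truth, while the small adversarial term sharpens the result.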
- Figure 7 shows inpainting results for a context encoder trained with random region corruption using reconstruction loss.
- Context encoders are competitive with concurrent self-supervised feature learning methods [7, 39] and significantly outperform autoencoders and Agrawal et al.
- Fully convolutional networks (FCNs) were proposed as an end-to-end learnable method of predicting a semantic label at each pixel of an image, using a convolutional network pre-trained for ImageNet classification.
- We replace the classification pre-trained network used in the FCN method with our context encoders, afterwards following the FCN training and evaluation procedure for direct comparison with their original CaffeNet-based result.
- Table 1: Semantic inpainting accuracy on held-out images of the Paris StreetView dataset. NN inpainting is the basis for [19].
- Table 2: Quantitative comparison for classification, detection, and semantic segmentation. Classification and Fast R-CNN detection results are on the PASCAL VOC 2007 test set. Semantic segmentation results are on the PASCAL VOC 2012 validation set from the FCN evaluation described in Section 5.2.3, using the additional training data from [18], and removing overlapping images from the validation set [28].
- Computer vision has made tremendous progress on semantic image understanding tasks such as classification, object detection, and segmentation in the past decade. Recently, Convolutional Neural Networks (CNNs) [13, 27] have greatly advanced the performance in these tasks [15, 26, 28]. The success of such models on image classification paved the way to tackle harder problems, including unsupervised understanding and generation of natural images. We briefly review the related work in each of the sub-fields pertaining to this paper.
Unsupervised learning. CNNs trained for ImageNet classification with over a million labeled examples learn features which generalize very well across tasks. However, whether such semantically informative and generalizable features can be learned from raw images alone, without any labels, remains an open question. Some of the earliest works in deep unsupervised learning are autoencoders [3, 20]. Along similar lines, denoising autoencoders reconstruct the image from local corruptions, to make encoding robust to such corruptions. While context encoders could be thought of as a variant of denoising autoencoders, the corruption applied to the model's input is spatially much larger, requiring more semantic information to undo.
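The difference in corruption scale can be made concrete. Here per-pixel noise stands in for a denoising autoencoder's corruption and one large dropped block for a context encoder's; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))

# Denoising-autoencoder-style corruption: small, local, scattered noise
# that local statistics are enough to undo.
noisy = img + 0.1 * rng.standard_normal(img.shape)

# Context-encoder-style corruption: one large contiguous region removed,
# so undoing it requires semantic understanding rather than local smoothing.
inpaint_input = img.copy()
inpaint_input[16:48, 16:48] = 0.0  # a 32x32 hole: 25% of the pixels
```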
- This work was supported in part by DARPA, AFRL, Intel, DoD MURI award N000141110688, NSF awards IIS1212798, IIS-1427425, and IIS-1536003, the Berkeley Vision and Learning Center and Berkeley Deep Drive
Study subjects and analysis
We now evaluate the encoder features for their semantic quality and transferability to other image understanding tasks. We experiment with images from two datasets: Paris StreetView and ImageNet, without using any of the accompanying labels. In Section 5.1, we present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions.
- P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. ICCV, 2015. 2, 7, 8
- C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 2009. 3, 6
- Y. Bengio. Learning deep architectures for AI. Foundations and trends in Machine Learning, 2009. 1, 2
- M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Computer graphics and interactive techniques, 2000. 3
- R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008. 3
- C. Doersch, A. Gupta, and A. A. Efros. Context as supervisory signal: Discovering objects with predictable context. In ECCV, 2014. 2
- C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. ICCV, 2015. 2, 3, 7, 8
- C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes paris look like paris? ACM Transactions on Graphics, 2012. 6
- J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. ICML, 2014. 2
- A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. CVPR, 2015. 3, 4
- A. Efros and T. K. Leung. Texture synthesis by nonparametric sampling. In ICCV, 1999. 3, 6
- M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes challenge: A retrospective. IJCV, 2014. 6, 7, 8
- K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics, 1980. 2
- R. Girshick. Fast R-CNN. ICCV, 2015. 7
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 2
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 2, 3, 4
- R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. ICCV, 2015. 2
- B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011. 8
- J. Hays and A. A. Efros. Scene completion using millions of photographs. SIGGRAPH, 2007. 3, 6
- G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006. 1, 2
- D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, 2015. 2
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014. 6
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015. 6
- D. P. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2014. 3
- P. Krahenbuhl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. ICLR, 2016. 8
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 2, 3, 5, 7, 8
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989. 2
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 2, 4, 8
- T. Malisiewicz and A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, 2009. 2
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013. 2, 3
- A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in brain research, 2006. 3
- S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 2005. 3
- A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, 2016. 3, 5, 6
- V. Ramanathan, K. Tang, G. Mori, and L. Fei-Fei. Learning temporal embeddings for complex video analysis. ICCV, 2015. 2
- M. Ranzato, V. Mnih, J. M. Susskind, and G. E. Hinton. Modeling natural images using gated MRFs. PAMI, 2013. 3
- S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV, 2012. 3
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015. 2, 6
- P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008. 2
- X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. ICCV, 2015. 2, 7, 8
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. 4