Context Encoders: Feature Learning by Inpainting

CVPR, pp. 2536-2544, 2016.


Abstract:

We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders – a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders...

Introduction
  • The authors' visual world is very diverse, yet highly structured, and humans have an uncanny ability to make sense of this structure.
  • Consider the image shown in Figure 1a: even with part of the image missing, most people can easily imagine its content from the surrounding pixels, and some can even draw it, as shown in Figure 1b.
  • This ability comes from the fact that natural images, despite their diversity, are highly structured.
  • Humans are able to understand this structure and make visual predictions even when seeing only parts of the scene.
Highlights
  • Our visual world is very diverse, yet highly structured, and humans have an uncanny ability to make sense of this structure
  • Given an image with a missing region, most of us can easily imagine its content from the surrounding pixels, and some of us can even draw it, as shown in Figure 1b. This ability comes from the fact that natural images, despite their diversity, are highly structured
  • We introduce context encoders: convolutional neural networks that predict missing parts of a scene from their surroundings
  • In Section 5.1, we present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions
  • In Section 5.2, we demonstrate the transferability of our learned features to other tasks, using context encoders as a pretraining step for image classification, object detection, and semantic segmentation
  • Our context encoders trained to generate images conditioned on context advance the state of the art in semantic inpainting and, at the same time, learn feature representations that are competitive with other models trained with auxiliary supervision
Methods
  • Random block: To prevent the network from latching onto the constant boundary of the masked region, the authors randomize the masking process.
  • The random block masking still has sharp boundaries that convolutional features could latch onto.
  • Random region: To completely remove those boundaries, the authors experimented with removing arbitrary shapes from images, obtained from random masks in the PASCAL VOC 2012 dataset [12].
  • The authors deform those shapes and paste them in arbitrary places in other images, again covering up to 1/4 of the image (a minimal masking sketch follows this list).
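For concreteness, the following is a minimal NumPy sketch of the random-block corruption described above; the function name, the zero fill value, and the exact coverage fraction are assumptions made for illustration, not details of the authors' implementation. The random-region variant would replace the square block with a deformed PASCAL VOC segmentation mask pasted at a random location.

```python
import numpy as np

def mask_random_block(image, mask_frac=0.25, rng=None):
    """Corrupt an (H, W, C) image by zeroing a randomly placed square block
    covering roughly `mask_frac` of its area. Hypothetical helper, for
    illustration only."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    side = int(np.sqrt(mask_frac * h * w))      # block side length
    top = int(rng.integers(0, h - side + 1))    # random top-left corner
    left = int(rng.integers(0, w - side + 1))
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + side, left:left + side] = True
    corrupted = image.copy()
    corrupted[mask] = 0                         # region the network must predict
    return corrupted, mask
```

Training then asks the encoder-decoder to reproduce the pixels under `mask` given only `corrupted` as input.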
Results
  • The authors evaluate the encoder features for their semantic quality and transferability to other image understanding tasks.
  • In Section 5.1, the authors present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions.
  • In Section 5.2, the authors demonstrate the transferability of the learned features to other tasks, using context encoders as a pretraining step for image classification, object detection, and semantic segmentation.
  • The authors compare the results on these tasks with those of other unsupervised or self-supervised methods, demonstrating that the approach outperforms previous methods.
Conclusion
  • The authors' context encoders trained to generate images conditioned on context advance the state of the art in semantic inpainting and, at the same time, learn feature representations that are competitive with other models trained with auxiliary supervision.
Summary
  • Our visual world is very diverse, yet highly structured, and humans have an uncanny ability to make sense of this structure.
  • We show that it is possible to learn and predict this structure using convolutional neural networks (CNNs), a class of models that have recently shown success across a variety of image understanding tasks.
  • We show that encoding just the context of an image patch and using the resulting feature to retrieve nearest neighbor contexts from a dataset produces patches which are semantically similar to the original patch.
  • We further validate the quality of the learned feature representation by fine-tuning the encoder for a variety of image understanding tasks, including classification, object detection, and semantic segmentation.
  • The context encoder can be useful as a better visual feature for computing nearest neighbors in nonparametric inpainting methods.
  • Doersch et al. [7] used the task of predicting the relative positions of neighboring patches within an image as a way to train unsupervised deep feature representations.
  • We train our context encoders with an adversarial loss jointly with a reconstruction loss for generating inpainting results.
  • While this works quite well for inpainting, the network learns low-level image features that latch onto the boundary of the central mask.
  • In Section 5.1, we present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions.
  • In Section 5.2, we demonstrate the transferability of our learned features to other tasks, using context encoders as a pretraining step for image classification, object detection, and semantic segmentation.
  • We train context encoders with the joint loss function defined in Equation (3) for the task of inpainting the missing region (see the loss sketch following this Summary list).
  • Figure 7 shows inpainting results for a context encoder trained with random-region corruption using the reconstruction loss.
  • Context encoders are competitive with concurrent self-supervised feature learning methods [7, 39] and significantly outperform autoencoders and Agrawal et al [1].
  • Fully convolutional networks (FCNs) [28] were proposed as an end-to-end learnable method of predicting a semantic label at each pixel of an image, using a convolutional network pre-trained for ImageNet classification.
  • We replace the classification-pretrained network used in the FCN method with our context encoders, then follow the FCN training and evaluation procedure for direct comparison with their original CaffeNet-based result.
  • Our context encoders trained to generate images conditioned on context advance the state of the art in semantic inpainting and, at the same time, learn feature representations that are competitive with other models trained with auxiliary supervision.
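As referenced in the Summary above, the PyTorch sketch below is one way to read the joint objective of Equation (3): a masked L2 reconstruction term plus a standard adversarial (generator) term, with the reconstruction term weighted much more heavily. The function name and the specific weight values are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_region, true_region, disc_logits_on_pred,
               lambda_rec=0.999, lambda_adv=0.001):
    """Joint objective sketch: masked L2 reconstruction plus adversarial
    (generator) loss. Weight values are illustrative assumptions."""
    # Reconstruction: pixel-wise L2 between predicted and ground-truth
    # content of the missing region.
    rec_loss = F.mse_loss(pred_region, true_region)
    # Adversarial: push the discriminator's logits on the prediction toward
    # the "real" label, as in a standard GAN generator update.
    adv_loss = F.binary_cross_entropy_with_logits(
        disc_logits_on_pred, torch.ones_like(disc_logits_on_pred))
    return lambda_rec * rec_loss + lambda_adv * adv_loss
```

In this reading, `pred_region` and `true_region` are the predicted and ground-truth pixels of the masked region, and `disc_logits_on_pred` is the discriminator's output on the prediction; only the generator would be updated with this loss, with the discriminator trained separately as in a standard GAN.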
Tables
  • Table 1: Semantic inpainting accuracy for the Paris StreetView dataset on held-out images. NN inpainting is the basis for [19]
  • Table 2: Quantitative comparison for classification, detection, and semantic segmentation. Classification and Fast R-CNN detection results are on the PASCAL VOC 2007 test set. Semantic segmentation results are on the PASCAL VOC 2012 validation set from the FCN evaluation described in Section 5.2.3, using the additional training data from [18], and removing overlapping images from the validation set [28]
Related work
  • Computer vision has made tremendous progress on semantic image understanding tasks such as classification, object detection, and segmentation in the past decade. Recently, Convolutional Neural Networks (CNNs) [13, 27] have greatly advanced the performance in these tasks [15, 26, 28]. The success of such models on image classification paved the way to tackle harder problems, including unsupervised understanding and generation of natural images. We briefly review the related work in each of the sub-fields pertaining to this paper.

    Unsupervised learning: CNNs trained for ImageNet [37] classification with over a million labeled examples learn features which generalize very well across tasks [9]. However, whether such semantically informative and generalizable features can be learned from raw images alone, without any labels, remains an open question. Among the earliest works in deep unsupervised learning are autoencoders [3, 20]. Along similar lines, denoising autoencoders [38] reconstruct the image from local corruptions, making the encoding robust to such corruptions. While context encoders could be thought of as a variant of denoising autoencoders, the corruption applied to the model’s input is spatially much larger, requiring more semantic information to undo.
Funding
  • This work was supported in part by DARPA, AFRL, Intel, DoD MURI award N000141110688, NSF awards IIS-1212798, IIS-1427425, and IIS-1536003, the Berkeley Vision and Learning Center, and Berkeley Deep Drive.
Study subjects and analysis
datasets: 2
We now evaluate the encoder features for their semantic quality and transferability to other image understanding tasks. We experiment with images from two datasets: Paris StreetView [8] and ImageNet [37] without using any of the accompanying labels. In Section 5.1, we present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions

Reference
  • [1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, 2015.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 2009.
  • [3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
  • [4] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Computer Graphics and Interactive Techniques, 2000.
  • [5] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
  • [6] C. Doersch, A. Gupta, and A. A. Efros. Context as supervisory signal: Discovering objects with predictable context. In ECCV, 2014.
  • [7] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • [8] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? ACM Transactions on Graphics, 2012.
  • [9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  • [10] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
  • [11] A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999.
  • [12] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge: A retrospective. IJCV, 2014.
  • [13] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 1980.
  • [14] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [17] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In ICCV, 2015.
  • [18] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
  • [19] J. Hays and A. A. Efros. Scene completion using millions of photographs. In SIGGRAPH, 2007.
  • [20] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
  • [21] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
  • [22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
  • [23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [24] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • [25] P. Krahenbuhl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. In ICLR, 2016.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [29] T. Malisiewicz and A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, 2009.
  • [30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • [31] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 2006.
  • [32] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 2005.
  • [33] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [34] V. Ramanathan, K. Tang, G. Mori, and L. Fei-Fei. Learning temporal embeddings for complex video analysis. In ICCV, 2015.
  • [35] M. Ranzato, V. Mnih, J. M. Susskind, and G. E. Hinton. Modeling natural images using gated MRFs. PAMI, 2013.
  • [36] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV, 2012.
  • [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
  • [38] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • [39] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • [40] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.