## AI helps you reading Science

## AI Insight

AI extracts a summary of this paper

Weibo:

# Spatial Transformer Networks

Annual Conference on Neural Information Processing Systems, (2015)

EI

Keywords

Abstract

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipul...More

Code:

Data:

Introduction

- The landscape of computer vision has been drastically altered and pushed forward through the adoption of a fast, scalable, end-to-end learning framework, the Convolutional Neural Network (CNN) [18].
- Due to the typically small spatial support for max-pooling (e.g. 2 × 2 pixels) this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps in a CNN are not invariant to large transformations of the input data [5, 19].
- This limitation of CNNs is due to having only a limited, pre-defined pooling mechanism for dealing with variations in the spatial arrangement of data

Highlights

- Over recent years, the landscape of computer vision has been drastically altered and pushed forward through the adoption of a fast, scalable, end-to-end learning framework, the Convolutional Neural Network (CNN) [18]
- In this work we introduce the Spatial Transformer module, that can be included into a standard neural network architecture to provide spatial transformation capabilities
- We begin with experiments where we train different neural network models to classify MNIST data that has been distorted in various ways: rotation (R); rotation, scale and translation (RTS); projective transformation (P); elastic warping (E) – note that elastic warping is destructive and cannot be inverted in some cases
- In this paper we introduced a new self-contained module for neural networks – the spatial transformer
- This module can be dropped into a network and perform explicit spatial transformations of features, opening up new ways for neural networks to model data, and is learnt in an end-toend fashion, without making any changes to the loss function
- While we only explore feed-forward networks in this work, early experiments show spatial transformers to be powerful in recurrent models, and useful for tasks requiring the disentangling of object reference frames

Methods

- The authors explore the use of spatial transformer networks on a number of supervised learning tasks.
- 4.1 the authors begin with experiments on distorted versions of the MNIST handwriting dataset, showing the ability of spatial transformers to improve classification performance through actively transforming the input images.
- 4.3, the authors investigate the use of multiple parallel spatial transformers for fine-grained classification, showing state-of-the-art performance on CUB-200-2011 birds dataset [28] by automatically discovering object parts and learning to attend to them.
- The authors use the MNIST handwriting dataset as a testbed for exploring the range of transformations to which a network can learn invariance to by using a spatial transformer.
- The authors begin with experiments where the authors train different neural network models to classify MNIST data that has been distorted in various ways: rotation (R); rotation, scale and translation (RTS); projective transformation (P); elastic warping (E) – note that elastic warping is destructive and cannot be inverted in some cases.
- All networks have approximately the same number of parameters, are trained with identical optimisation schemes, and all with three weight layers in the classification network

Conclusion

- In this paper the authors introduced a new self-contained module for neural networks – the spatial transformer.
- This module can be dropped into a network and perform explicit spatial transformations of features, opening up new ways for neural networks to model data, and is learnt in an end-toend fashion, without making any changes to the loss function.
- While CNNs provide an incredibly strong baseline, the authors see gains in accuracy using spatial transformers across multiple tasks, resulting in state-of-the-art performance.
- While the authors only explore feed-forward networks in this work, early experiments show spatial transformers to be powerful in recurrent models, and useful for tasks requiring the disentangling of object reference frames

- Table1: Left: The percentage errors for different models on different distorted MNIST datasets. The different distorted MNIST datasets we test are TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion. All the models used for each experiment have the same number of parameters, and same base structure for all experiments. Right: Some example test images where a spatial transformer network correctly classifies the digit but a CNN fails. (a) The inputs to the networks. (b) The transformations predicted by the spatial transformers, visualised by the grid Tθ(G). (c) The outputs of the spatial transformers. E and RTS examples use thin plate spline spatial transformers (ST-CNN TPS), while R examples use affine spatial transformers (ST-CNN Aff) with the angles of the affine transformations given. For videos showing animations of these experiments and more see https://goo.gl/qdEhUu
- Table2: Left: The sequence error (%) for SVHN multi-digit recognition on crops of 64×64 pixels (64px), and inflated crops of 128 × 128 (128px) which include more background. *The best reported result from [<a class="ref-link" id="c1" href="#r1">1</a>] uses model averaging and Monte Carlo averaging, whereas the results from other models are from a single forward pass of a single model. Right: (a) The schematic of the ST-CNN Multi model. The transformations of each spatial transformer (ST) are applied to the convolutional feature map produced by the previous layer. (b) The result of the composition of the affine transformations predicted by the four spatial transformers in ST-CNN Multi, visualised on the input image
- Table3: Left: The accuracy (%) on CUB-200-2011 bird classification dataset. Spatial transformer networks with two spatial transformers (2×ST-CNN) and four spatial transformers (4×ST-CNN) in parallel outperform other models. 448px resolution images can be used with the ST-CNN without an increase in computational cost due to downsampling to 224px after the transformers. Right: The transformation predicted by the spatial transformers of 2×ST-CNN (top row) and 4×ST-CNN (bottom row) on the input image. Notably for the 2×STCNN, one of the transformers (shown in red) learns to detect heads, while the other (shown in green) detects the body, and similarly for the 4×ST-CNN

Related work

- In this section we discuss the prior work related to the paper, covering the central ideas of modelling transformations with neural networks [12, 13, 27], learning and analysing transformation-invariant representations [3, 5, 8, 17, 19, 25], as well as attention and detection mechanisms for feature selection [1, 6, 9, 11, 23].

Early work by Hinton [12] looked at assigning canonical frames of reference to object parts, a theme which recurred in [13] where 2D affine transformations were modeled to create a generative model composed of transformed parts. The targets of the generative training scheme are the transformed input images, with the transformations between input images and targets given as an additional input to the network. The result is a generative model which can learn to generate transformed images of objects by composing parts. The notion of a composition of transformed parts is taken further by Tieleman [27], where learnt parts are explicitly affine-transformed, with the transform predicted by the network. Such generative capsule models are able to learn discriminative features for classification from transformation supervision.

Funding

- We consider a strong baseline CNN model – an Inception architecture with batch normalisation [15] pre-trained on ImageNet [22] and fine-tuned on CUB – which by itself achieves state-of-the-art accuracy of 82.3% (previous best result is 81.0% [24])
- The 4×ST-CNN achieves an accuracy of 84.1%, outperforming the baseline by 1.8%

Study subjects and analysis

species: 200

In this section, we use a spatial transformer network with multiple transformers in parallel to perform fine-grained bird classification. We evaluate our models on the CUB-200-2011 birds dataset [28], containing 6k training images and 5.8k test images, covering 200 species of birds. The birds appear at a range of scales and orientations, are not tightly cropped, and require detailed texture and shape analysis to distinguish

Reference

- J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. ICLR, 2015.
- S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. BMVC., 2014.
- J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE PAMI, 35(8):1872–1886, 2013.
- M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In CVPR, 2015.
- T. S. Cohen and M. Welling. Transformation properties of learned visual representations. ICLR, 2015.
- D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
- B. J. Frey and N. Jojic. Fast, large-scale transformation-invariant clustering. In NIPS, 2001.
- R. Gens and P. M. Domingos. Deep symmetry networks. In NIPS, 2014.
- R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv:1312.6082, 2013.
- K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. Draw: A recurrent neural network for image generation. ICML, 2015.
- G. E. Hinton. A parallel computation that assigns canonical object-based frames of reference. In IJCAI, 1981.
- G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN. 2011.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
- M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. NIPS DLW, 2014.
- A. Kanazawa, A. Sharma, and D. Jacobs. Locally scale-invariant convolutional neural networks. In NIPS, 2014.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. CVPR, 2015.
- T. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. arXiv:1504.07889, 2015.
- Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS DLW, 2011.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014.
- P. Sermanet, A. Frome, and E. Real. Attention for fine-grained categorization. arXiv:1412.7054, 2014.
- M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. arXiv:1504.08289, 2015.
- K. Sohn and H. Lee. Learning invariant representations with local transformations. arXiv:1206.6418, 2012.
- M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.
- T. Tieleman. Optimizing Neural Networks that Generate Images. PhD thesis, University of Toronto, 2014.
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
- K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
- X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. arXiv:1411.4229, 2014.

Tags

Comments

数据免责声明

页面数据均来自互联网公开来源、合作出版商和通过AI技术自动分析结果，我们不对页面数据的有效性、准确性、正确性、可靠性、完整性和及时性做出任何承诺和保证。若有疑问，可以通过电子邮件方式联系我们：report@aminer.cn