Selective Kernel Networks

    CVPR, 2019.

    Keywords: receptive field size, RF size, non-classical RFs, selective kernel networks, multi-scale information
    TL;DR: We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information.

    Abstract:

    In standard Convolutional Neural Networks (CNNs), the receptive fields of artificial neurons in each layer are designed to share the same size. It is well known in the neuroscience community that the receptive field size of visual cortical neurons is modulated by the stimulus, which has rarely been considered in constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information.

    Introduction
    • The local receptive fields (RFs) of neurons in the primary visual cortex (V1) of cats [14] have inspired the construction of Convolutional Neural Networks (CNNs) [26] in the last century, and they continue to inspire the design of modern CNN structures
    • It is well-known that in the visual cortex, the RF sizes of neurons in the same area (e.g., V1 region) are different, which enables the neurons to collect multi-scale spatial information in the same processing stage.
    • Such a linear aggregation approach may be insufficient to provide neurons with powerful adaptation ability
    Highlights
    • The local receptive fields (RFs) of neurons in the primary visual cortex (V1) of cats [14] have inspired the construction of Convolutional Neural Networks (CNNs) [26] in the last century, and they continue to inspire the design of modern CNN structures
    • Some other receptive field properties of cortical neurons have not been emphasized in designing Convolutional Neural Networks, and one such property is the adaptive changing of receptive field size
    • The size of non-classical receptive field is related to the contrast of the stimulus: the smaller the contrast, the larger the effective non-classical receptive field size [37]
    • Fuse: As stated in the Introduction, our goal is to enable neurons to adaptively adjust their receptive field sizes according to the stimulus content (see the equations after this list)
    • Inspired by the adaptive receptive field (RF) sizes of neurons in visual cortex, we propose Selective Kernel Networks (SKNets) with a novel Selective Kernel (SK) convolution, to improve the efficiency and effectiveness of object recognition by adaptive kernel selection in a soft-attention manner
    • We discover several meaningful behaviors of kernel selection across channel, depth and category, and empirically validate the effective adaptation of receptive field sizes for Selective Kernel Networks, which leads to a better understanding of their mechanism
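    To make the fuse and select steps concrete, the two-branch computation can be summarized as follows (a restatement in the usual SK notation, where Ũ and Û are the outputs of the 3×3 and 5×5 branches and d is the reduced dimension of the compact feature z):

```latex
\begin{aligned}
\mathbf{U} &= \tilde{\mathbf{U}} + \hat{\mathbf{U}}, \qquad
s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{U}_c(i, j), \\
\mathbf{z} &= \delta\big(\mathcal{B}(\mathbf{W}\mathbf{s})\big), \qquad \mathbf{W} \in \mathbb{R}^{d \times C}, \\
a_c &= \frac{e^{\mathbf{A}_c \mathbf{z}}}{e^{\mathbf{A}_c \mathbf{z}} + e^{\mathbf{B}_c \mathbf{z}}}, \qquad b_c = 1 - a_c, \\
\mathbf{V}_c &= a_c \cdot \tilde{\mathbf{U}}_c + b_c \cdot \hat{\mathbf{U}}_c,
\end{aligned}
```

    where δ is the ReLU function, the calligraphic B denotes Batch Normalization, and the bold A, B ∈ R^{C×d} hold the per-branch attention weights. The softmax across branches keeps a_c + b_c = 1 for every channel, so each channel softly selects between the two kernel sizes.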
    Methods
    • Split: For any given feature map X ∈ R^{H′×W′×C′}, by default the authors first conduct two transformations F̃: X → Ũ ∈ R^{H×W×C} and F̂: X → Û ∈ R^{H×W×C} with kernel sizes 3 and 5, respectively
    • Note that both F̃ and F̂ are composed of efficient grouped/depthwise convolutions, Batch Normalization [15] and the ReLU [29] function in sequence.
    • The conventional convolution with a 5×5 kernel is replaced by a dilated convolution with a 3×3 kernel and dilation 2, whose effective receptive field is still 5×5 (a code sketch of the full SK unit follows this list)
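    A minimal PyTorch sketch of a two-branch SK unit built from the split/fuse/select description above; the class and argument names (SKUnit, reduction, min_dim) are illustrative choices, not the authors' reference implementation:

```python
# Minimal two-branch Selective Kernel (SK) unit sketch in PyTorch.
import torch
import torch.nn as nn


class SKUnit(nn.Module):
    def __init__(self, channels, groups=32, reduction=16, min_dim=32):
        super().__init__()
        d = max(channels // reduction, min_dim)  # size of the compact feature z

        # Split: two grouped 3x3 branches; the second uses dilation 2, so its
        # effective kernel is 3 + (3 - 1) * (2 - 1) = 5, mimicking a 5x5 RF.
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2,
                      groups=groups, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

        # Fuse: global average pooling followed by a small FC bottleneck.
        self.fc = nn.Sequential(
            nn.Linear(channels, d, bias=False),
            nn.BatchNorm1d(d), nn.ReLU(inplace=True))

        # Select: one linear head per branch, softmax across branches per channel.
        self.attn = nn.Linear(d, channels * 2, bias=False)

    def forward(self, x):
        u1 = self.branch3(x)                  # U~: 3x3 branch
        u2 = self.branch5(x)                  # U^: dilated 3x3 branch (~5x5 RF)
        u = u1 + u2                           # element-wise fusion of branches

        s = u.mean(dim=(2, 3))                # global average pooling -> (N, C)
        z = self.fc(s)                        # compact feature -> (N, d)

        logits = self.attn(z).view(-1, 2, u1.size(1))   # (N, 2, C)
        a = torch.softmax(logits, dim=1)                # a_c + b_c = 1 per channel
        a = a.unsqueeze(-1).unsqueeze(-1)               # (N, 2, C, 1, 1)
        return a[:, 0] * u1 + a[:, 1] * u2              # V = a*U~ + b*U^
```

    As Table1 below suggests, such units sit inside ResNeXt-style bottleneck blocks (replacing their 3×3 convolutions) to form SKNet-50.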
    Results
    • On the ImageNet and CIFAR benchmarks, the authors empirically show that SKNet outperforms the existing state-of-the-art architectures with lower model complexity.
    Conclusion
    • Inspired by the adaptive receptive field (RF) sizes of neurons in visual cortex, the authors propose Selective Kernel Networks (SKNets) with a novel Selective Kernel (SK) convolution, to improve the efficiency and effectiveness of object recognition by adaptive kernel selection in a soft-attention manner.
    • The authors discover several meaningful behaviors of kernel selection across channel, depth and category, and empirically validate the effective adaptation of RF sizes for SKNets, which leads to a better understanding of their mechanism.
    • The authors hope this work may inspire future study of architectural design and search
    Tables
    • Table1: The three columns refer to ResNeXt-50 with a 32×4d template, SENet-50 based on the ResNeXt-50 backbone, and the corresponding SKNet-50, respectively. Inside the brackets are the general shape of a residual block, including filter sizes and feature dimensionalities. The number of stacked blocks on each stage is presented outside the brackets. "G = 32" indicates grouped convolution. The inner brackets following fc indicate the output dimensions of the two fully connected layers in an SE module. #P denotes the number of parameters, and the definition of FLOPs follows [54], i.e., the number of multiply-adds (a small FLOPs-counting sketch follows this list)
    • Table2: Comparisons to the state-of-the-arts under roughly identical complexity. 224× denotes the single 224×224 crop for evaluation, and likewise 320×. Note that SENets/SKNets are all based on the corresponding ResNeXt backbones
    • Table3: Comparisons on ImageNet validation set when the computational cost of model with more depth/width/cardinality is increased to match that of SKNet. The numbers in brackets denote the gains of performance
    • Table4: Single 224×224 crop top-1 error rates (%) by variants of lightweight models on ImageNet validation set
    • Table5: Top-1 errors (%, average of 10 runs) on CIFAR. SENet-29 and SKNet-29 are both based on ResNeXt-29, 16×32d
    • Table6: Results of SKNet-50 with different settings in the second branch, while the setting of the first kernel is fixed. "Resulted kernel" in the last column denotes the approximate effective kernel size of the dilated convolution
    • Table7: Results of SKNet-50 with different combinations of multiple kernels. Single 224×224 crop is utilized for evaluation
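    As a side note on the complexity columns, here is a tiny sketch of the multiply-add FLOPs convention referenced in Table1; conv_flops is a hypothetical helper, not part of any library:

```python
# Hypothetical helper: FLOPs of a (possibly grouped) convolution counted as
# multiply-adds, i.e., one multiply-add per kernel weight per output element.
def conv_flops(h_out, w_out, c_in, c_out, kernel, groups=1):
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel

# Example: a 3x3 grouped convolution (G = 32) on a 56x56, 256-channel feature map.
print(conv_flops(56, 56, 256, 256, 3, groups=32))  # 57,802,752 multiply-adds
```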
    Related work
    • Multi-branch convolutional networks. Highway networks [39] introduce bypassing paths along with gating units. The two-branch architecture eases the difficulty of training networks with hundreds of layers. The idea is also used in ResNet [9, 10], but there the bypassing path is a pure identity mapping. Beyond the identity mapping, the shake-shake networks [7] and multi-residual networks [1] extend the major transformation with more identical paths. Deep neural decision forests [21] form a tree-structured multi-branch principle with learned splitting functions. FractalNets [25] and Multilevel ResNets [52] are designed so that the multiple paths can be expanded fractally and recursively. The InceptionNets [42, 15, 43, 41] carefully configure each branch with customized kernel filters in order to aggregate more informative and multifarious features. Note that the proposed SKNets follow the idea of InceptionNets with various filters for multiple branches, but differ in at least two important aspects: 1) the schemes of SKNets are much simpler, without heavily customized design, and 2) an adaptive selection mechanism for these multiple branches is utilized to realize adaptive RF sizes of neurons.
    Funding
    • This work was supported by the National Science Fund of China under Grant No. U1713208, the Program for Changjiang Scholars, and the National Natural Science Foundation of China under Grant No. 61836014
    References
    [1] M. Abdi and S. Nahavandi. Multi-residual networks. arXiv preprint arXiv:1609.05672, 2016.
    [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
    [3] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
    [4] D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai. Person search via a mask-guided two-stream CNN model. arXiv preprint arXiv:1807.08107, 2018.
    [5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In NIPS, 2017.
    [6] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017.
    [7] X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
    [8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
    [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
    [10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
    [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
    [12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
    [13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
    [14] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 1962.
    [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
    [16] L. Itti and C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2001.
    [17] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. TPAMI, 1998.
    [18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
    [19] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
    [20] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016.
    [21] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo. Deep neural decision forests. In ICCV, 2015.
    [22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
    [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
    [24] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, 2010.
    [25] G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
    [26] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
    [27] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164, 2018.
    [28] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
    [29] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
    [30] J. Nelson and B. Frost. Orientation-selective inhibition from beyond the classic visual receptive field. Brain Research, 1978.
    [31] B. A. Olshausen, C. H. Anderson, and D. C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 1993.
    [32] J. Park, S. Woo, J.-Y. Lee, and I. S. Kweon. BAM: Bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.
    [33] M. W. Pettet and C. D. Gilbert. Dynamic changes in receptive-field size in cat primary visual cortex. Proceedings of the National Academy of Sciences, 1992.
    [34] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
    [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
    [36] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
    [37] M. P. Sceniak, D. L. Ringach, M. J. Hawken, and R. Shapley. Contrast's effect on spatial summation by macaque V1 neurons. Nature Neuroscience, 1999.
    [38] L. Spillmann, B. Dresp-Langley, and C.-H. Tseng. Beyond the classical receptive field: The effect of contextual stimuli. Journal of Vision, 2015.
    [39] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
    [40] K. Sun, M. Li, D. Liu, and J. Wang. IGCV3: Interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178, 2018.
    [41] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
    [42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
    [43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
    [44] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. arXiv preprint arXiv:1704.06904, 2017.
    [45] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. CBAM: Convolutional block attention module. arXiv preprint arXiv:1807.06521, 2018.
    [46] G. Xie, J. Wang, T. Zhang, J. Lai, R. Hong, and G.-J. Qi. IGCV2: Interleaved structured sparse convolutional neural networks. arXiv preprint arXiv:1804.06202, 2018.
    [47] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
    [48] K. Xu, D. Li, N. Cassimatis, and X. Wang. LCANet: End-to-end lipreading with cascaded attention-CTC. In International Conference on Automatic Face & Gesture Recognition, 2018.
    [49] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
    [50] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
    [51] F. Yu, V. Koltun, and T. A. Funkhouser. Dilated residual networks. In CVPR, 2017.
    [52] K. Zhang, M. Sun, X. Han, X. Yuan, L. Guo, and T. Liu. Residual networks of residual networks: Multilevel residual networks. Transactions on Circuits and Systems for Video Technology, 2017.
    [53] T. Zhang, G.-J. Qi, B. Xiao, and J. Wang. Interleaved group convolutions. In CVPR, 2017.
    [54] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
    [55] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. arXiv preprint arXiv:1807.02758, 2018.