Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks

AAAI Conference on Artificial Intelligence, 2020.

Keywords:
empirical Rademacher complexity; binary dropout; neural network; conventional binary; generalization ability
Weibo:
We investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks and propose a feature distortion method to address the over-fitting problem.

Abstract:

Deep neural networks often consist of a great number of trainable parameters for extracting powerful features from given datasets. On the one hand, massive trainable parameters significantly enhance the performance of these deep networks. On the other hand, they bring the problem of over-fitting. To this end, dropout based methods disable some of the units during the training phase to alleviate over-fitting.

Introduction
Highlights
  • The superiority of deep neural networks, especially convolutional neural networks (CNNs), has been well demonstrated in a large variety of tasks, including image recognition (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016a; Wang et al. 2018a), object detection (Ren et al. 2015; Redmon et al. 2016), video analysis (Feichtenhofer, Pinz, and Zisserman 2016), and natural language processing (Wang, Li, and Smola 2019)
  • The proposed feature map distortion method outperforms the compared methods by a large margin on both datasets
  • The CNN trained with the proposed method achieves an accuracy of 85.24%, surpassing the state-of-the-art RDdrop method by 2.13% and 1.58% on the CIFAR-10 and CIFAR-100 datasets, respectively
  • The baseline ResNet-56 suffers from over-fitting, achieving a higher training accuracy but a lower test accuracy, while the proposed feature map distortion method overcomes this problem and achieves a higher test accuracy, demonstrating improved model generalization
  • Note that our method achieves better performance than DropBlock over a wider range of the probability p, which demonstrates the superiority of feature map distortion
  • Feature map distortion improves the accuracy from 76.80% to 77.71% compared to the conventional dropout method
Methods
Results
  • The test accuracies on both CIFAR-10 and CIFAR-100 are summarized in Table 1.
  • The CNN trained with the proposed method achieves an accuracy of 85.24%, surpassing the state-of-the-art RDdrop method by 2.13% and 1.58% on the CIFAR-10 and CIFAR-100 datasets, respectively
  • It shows that the proposed feature map distortion method can reduce the empirical Rademacher complexity effectively while preserving the representation power of the model, resulting in better test performance.
  • The results demonstrate that the method can simultaneously improve the generalization ability and preserve the useful information of the original features
Conclusion
  • Dropout based methods have been successfully used for enhancing the generalization ability of deep neural networks.
  • Eliminating some of the units in a neural network can be seen as a heuristic approach to minimizing the gap between the expected and empirical risks of the resulting network, but this heuristic is not optimal in practice (the standard bound that formalizes this gap is recalled after this list).
  • The authors propose to embed distortions onto feature maps of the given deep neural network by exploiting the Rademacher complexity.
  • Extensive experimental results show that the feature distortion technique can be embedded into mainstream deep networks to achieve better performance than conventional approaches on benchmark datasets
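For reference, the gap between the expected and empirical risks mentioned above is conventionally controlled by the empirical Rademacher complexity (ERC). The block below recalls the standard textbook bound and the standard ERC definition (in the spirit of Koltchinskii and Panchenko 2002); it is a generic statement, not necessarily the exact bound (Eq. (11)) used in the paper.

  % Standard Rademacher-complexity generalization bound for a loss in [0, 1]:
  % with probability at least 1 - \delta over an i.i.d. sample S of size m, for all h in H,
  \[
    \mathbb{E}\big[\ell(h)\big]
    \;\le\;
    \widehat{\mathbb{E}}_{S}\big[\ell(h)\big]
    \;+\; 2\,\widehat{\mathfrak{R}}_{S}(\mathcal{H})
    \;+\; 3\sqrt{\frac{\log(2/\delta)}{2m}},
    \qquad
    \widehat{\mathfrak{R}}_{S}(\mathcal{H})
    = \mathbb{E}_{\sigma}\Big[\sup_{h\in\mathcal{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_{i}\, h(x_{i})\Big],
  \]
  % where the \sigma_i are independent Rademacher variables taking values in {-1, +1}.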
Summary
  • Introduction:

    The superiority of deep neural networks, especially convolutional neural networks (CNNs), has been well demonstrated in a large variety of tasks, including image recognition (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016a; Wang et al. 2018a), object detection (Ren et al. 2015; Redmon et al. 2016), video analysis (Feichtenhofer, Pinz, and Zisserman 2016), and natural language processing (Wang, Li, and Smola 2019).
  • To avoid over-fitting, the empirical risk of the trained network should be close to the expected risk
  • To this end, Hinton et al. (2012) first proposed the conventional binary dropout approach, which reduces the co-adaptation of neurons by stochastically dropping a portion of them during the training phase.
  • This operation can be regarded either as a model ensemble technique or as a data augmentation method, and it significantly enhances the performance of the resulting network on the test set (a minimal sketch of standard binary dropout is given below)
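The following is a minimal sketch of the conventional binary (inverted) dropout operation referred to above (Hinton et al. 2012; Srivastava et al. 2014); it illustrates the standard technique only, not the feature map distortion method proposed in this paper. The function name and tensor shapes are illustrative.

  import torch

  def binary_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
      """Zero each unit independently with probability p and rescale the survivors,
      so the expected activation matches between training and test."""
      if not training or p == 0.0:
          return x                                            # the layer is the identity at test time
      keep_prob = 1.0 - p
      mask = torch.bernoulli(torch.full_like(x, keep_prob))   # 1 = keep, 0 = drop
      return x * mask / keep_prob                             # "inverted" scaling during training

  # Usage: apply to an intermediate feature map during training.
  features = torch.randn(8, 64)                               # a hypothetical batch of hidden activations
  regularized = binary_dropout(features, p=0.5, training=True)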
  • Objectives:

    Instead of fixing the value of the perturbation, the authors aim to learn the distortion of the feature maps by reducing the empirical Rademacher complexity (ERC) of the network.
  • The authors' goal is to reduce the first term in Eq. (11), which is related to the ERC, while constraining the intensity of the distortion ε_i^l (an illustrative sketch of this idea is given below)
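Since Eq. (11) itself is not reproduced on this page, the sketch below only illustrates the general idea under stated assumptions: an additive distortion eps on an intermediate feature map is optimized to decrease a crude Rademacher-style surrogate (a single sampled sign vector instead of the expectation over sign vectors), while an L2 term constrains the intensity of the distortion. The function name, the one-sample surrogate, and the coefficients lambda_c and lambda_e are assumptions for illustration, not the paper's exact formulation.

  import torch

  def distortion_penalty(features: torch.Tensor, eps: torch.Tensor,
                         lambda_c: float = 1.0, lambda_e: float = 0.1) -> torch.Tensor:
      """features: (batch, dim) activations of one layer; eps: a distortion of the same shape."""
      distorted = features + eps
      # Crude one-sample surrogate of the ERC-related term: correlation of the distorted
      # features with random +/-1 signs, averaged over the batch.
      sigma = torch.randint(0, 2, (features.shape[0], 1)).to(features.dtype) * 2 - 1
      complexity_term = (sigma * distorted).mean().abs()
      # Constrain the intensity of the distortion eps.
      intensity_term = eps.pow(2).mean()
      return lambda_c * complexity_term + lambda_e * intensity_term

  # Usage: eps would be updated by gradient descent, jointly with the ordinary task loss.
  feats = torch.randn(8, 64)
  eps = torch.zeros_like(feats, requires_grad=True)
  penalty = distortion_penalty(feats, eps)
  penalty.backward()                                          # gradients w.r.t. eps are now available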
  • Methods:

    The CNN model trained without extra regularization tricks is used as the baseline model.
  • The authors compare the method with the widely used dropout method (Hinton et al. 2012) and several state-of-the-art variants, including Vardrop (Kingma, Salimans, and Welling 2015), Sparse Vardrop (Molchanov, Ashukha, and Vetrov 2017), and RDdrop (Zhai and Wang 2018)
  • Results:

    The test accuracies on both CIFAR-10 and CIFAR-100 are summarized in Table 1.
  • The CNN trained with the proposed method achieves an accuracy of 85.24%, surpassing the state-of-the-art RDdrop method by 2.13% and 1.58% on the CIFAR-10 and CIFAR-100 datasets, respectively
  • It shows that the proposed feature map distortion method can reduce the empirical Rademacher complexity effectively while preserving the representation power of the model, resulting in better test performance.
  • The results demonstrate that the method can simultaneously improve the generalization ability and preserve the useful information of the original features
  • Conclusion:

    Dropout based methods have been successfully used for enhancing the generalization ability of deep neural networks.
  • Eliminating some of the units in a neural network can be seen as a heuristic approach to minimizing the gap between the expected and empirical risks of the resulting network, but this heuristic is not optimal in practice.
  • The authors propose to embed distortions onto feature maps of the given deep neural network by exploiting the Rademacher complexity.
  • Extensive experimental results show that the feature distortion technique can be embedded into mainstream deep networks to achieve better performance than conventional approaches on benchmark datasets
Tables
  • Table 1: Accuracies of conventional CNNs on the CIFAR-10 and CIFAR-100 datasets
  • Table 2: Accuracies of ResNet-56 on the CIFAR-10 and CIFAR-100 datasets
  • Table 3: Accuracies of ResNet-50 on the ImageNet dataset
Funding
  • This work is supported by the National Natural Science Foundation of China under Grant Nos. 61876007 and 61872012, and by the Australian Research Council under Project DE-180101438
Reference
  • [Ba and Frey 2013] Ba, J., and Frey, B. 2013. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, 3084–3092.
  • [Cubuk et al. 2018] Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2018. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
  • [Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
  • [DeVries and Taylor 2017] DeVries, T., and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • [Feichtenhofer, Pinz, and Zisserman 2016] Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1933–1941.
  • [Ghiasi, Lin, and Le 2018] Ghiasi, G.; Lin, T.-Y.; and Le, Q. V. 2018. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, 10727–10737.
  • [Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
  • [Hanneke 2016] Hanneke, S. 2016. The optimal sample complexity of pac learning. The Journal of Machine Learning Research 17(1):1319–1333.
  • [He et al. 2016a] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • [He et al. 2016b] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016b. Identity mappings in deep residual networks. In European conference on computer vision, 630–645. Springer.
  • [Hinton et al. 2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
  • [Kawaguchi, Kaelbling, and Bengio 2017] Kawaguchi, K.; Kaelbling, L. P.; and Bengio, Y. 2017. Generalization in deep learning. arXiv preprint arXiv:1710.05468.
  • [Kingma, Salimans, and Welling 2015] Kingma, D. P.; Salimans, T.; and Welling, M. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, 2575–2583.
  • [Koltchinskii, Panchenko, and others 2002] Koltchinskii, V.; Panchenko, D.; et al. 2002. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30(1):1–50.
  • [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
  • [Larsson, Maire, and Shakhnarovich 2016] Larsson, G.; Maire, M.; and Shakhnarovich, G. 2016. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.
  • [Molchanov, Ashukha, and Vetrov 2017] Molchanov, D.; Ashukha, A.; and Vetrov, D. 2017. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2498–2507. JMLR.org.
  • [Redmon et al. 2016] Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 779–788.
  • [Ren et al. 2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, 91–99.
  • [Sontag 1998] Sontag, E. D. 1998. VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences 168:69–96.
  • [Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15(1):1929–1958.
  • [Szegedy et al. 2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818–2826.
  • [Tompson et al. 2015] Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; and Bregler, C. 2015. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 648–656.
  • [Wan et al. 2013] Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In International conference on machine learning, 1058– 1066.
  • [Wang et al. 2018a] Wang, Y.; Xu, C.; Chunjing, X.; Xu, C.; and Tao, D. 2018a. Learning versatile filters for efficient convolutional neural networks. In Advances in Neural Information Processing Systems, 1608–1618.
  • [Wang et al. 2018b] Wang, Y.; Xu, C.; Xu, C.; and Tao, D. 2018b. Packing convolutional neural networks in the frequency domain. IEEE transactions on pattern analysis and machine intelligence.
  • [Wang, Li, and Smola 2019] Wang, C.; Li, M.; and Smola, A. J. 2019. Language models with transformers. CoRR abs/1904.09408.
  • [Zhai and Wang 2018] Zhai, K., and Wang, H. 2018. Adaptive dropout with Rademacher complexity regularization. In International Conference on Learning Representations.