Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions

ICLR, 2020.

Keywords:
Adversarial Examples; Detection of adversarial attacks
Weibo:
To further explain the success of the Capsule Network, we qualitatively showed that the success of the reconstructive attack was highly related to the visual similarity between the target class and the source class for the Capsule Network

Abstract:

Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack…

Introduction
  • Adversarial examples (Szegedy et al, 2013) are inputs that are designed by an adversary to cause a machine learning system to make a misclassification.
  • In this paper the authors develop methods for detecting adversarial examples by making use of class-conditional reconstruction networks.
  • These sub-networks, first proposed by Sabour et al (2017) as part of a Capsule Network (CapsNet), allow a model to produce a reconstruction of its input based on the identity and instantiation parameters of the winning capsule.
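In code, the detection idea amounts to thresholding the distance between an input and its class-conditional reconstruction. The sketch below is a minimal illustration under assumptions: `model.classify_and_reconstruct` is a hypothetical interface (returning the predicted class and the reconstruction from the winning capsule, or from a class-conditional reconstruction head in the CNN variants), and the percentile-based threshold calibration is a plausible choice, not necessarily the authors' exact procedure.

```python
import torch

def reconstruction_error(model, x):
    # `model.classify_and_reconstruct` is an assumed interface, not the authors' API.
    pred, recon = model.classify_and_reconstruct(x)
    # Per-example squared L2 distance between the input and its reconstruction.
    err = ((recon - x) ** 2).flatten(start_dim=1).sum(dim=1)
    return pred, err

def calibrate_threshold(model, clean_loader, quantile=0.95):
    # Choose the threshold as a high quantile of reconstruction errors on clean
    # validation images (the exact calibration procedure is an assumption here).
    errs = []
    with torch.no_grad():
        for x, _ in clean_loader:
            errs.append(reconstruction_error(model, x)[1])
    return torch.quantile(torch.cat(errs), quantile)

def detect(model, x, threshold):
    # Flag inputs whose reconstruction error exceeds the calibrated threshold.
    with torch.no_grad():
        pred, err = reconstruction_error(model, x)
    return pred, err > threshold  # True = flagged as adversarial or corrupted
```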
Highlights
  • Adversarial examples (Szegedy et al, 2013) are inputs that are designed by an adversary to cause a machine learning system to make a misclassification
  • Previous work (Carlini & Wagner, 2017a; Hosseini et al, 2019) used the True Positive Rate to measure the proportion of adversarial examples that are detected, which alone is insufficient for comparing different detection mechanisms because unsuccessful adversarial examples do not need to be detected
  • We have presented a class-conditional reconstruction-based detection method that does not rely on a specific predefined adversarial attack
  • Compared to convolutional neural network (CNN)-based models, we showed that the Capsule Network was able to detect adversarial examples with greater accuracy on all the datasets we explored
  • To further explain the success of the Capsule Network, we qualitatively showed that the success of the reconstructive attack was highly related to the visual similarity between the target class and the source class for the Capsule Network
  • We showed that images generated by the reconstructive attack against the Capsule Network are typically not adversarial, i.e., many of the resulting attacks resemble members of the target class even under a small ℓ∞ norm bound
Methods
  • The authors first demonstrate how reconstruction networks can detect standard white and black-box attacks in addition to naturally corrupted images.
  • The authors introduce the “reconstructive attack”, which is designed to circumvent the defense and show that it is a more powerful attack in this setting.
  • Based on this finding, the authors qualitatively study the kind of misclassifications caused by the reconstructive attack and argue that they suggest that CapsNets learn features that are better aligned with human perception.
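Roughly, the reconstructive attack augments a standard targeted (or untargeted) loss with a β-weighted term that keeps the class-conditional reconstruction close to the perturbed input, so the perturbation evades the reconstruction-error detector. The PGD-style sketch below is illustrative only: `model.forward_with_recon` is an assumed interface, and the step size, iteration count, and β are placeholder values rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def reconstructive_attack(model, x, target, eps=0.3, alpha=0.01, steps=100, beta=1.0):
    # `target` is a tensor of target-class indices; `model.forward_with_recon`
    # (assumed interface) returns logits and the reconstruction conditioned on
    # the target class.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits, recon = model.forward_with_recon(x_adv, class_idx=target)
        cls_loss = F.cross_entropy(logits, target)   # push prediction toward the target
        recon_loss = F.mse_loss(recon, x_adv)        # keep reconstruction close to input
        loss = cls_loss + beta * recon_loss
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()          # descend the combined loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project into the eps L-inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                # keep pixels in valid range
    return x_adv.detach()
```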
Results
  • EVALUATION METRICS

    The authors use Success Rate to measure the success of attacks. For targeted attacks, the success rate $S_t$ is defined as the proportion of inputs that are classified as the target class, $S_t = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(f(x_i) = t_i)$, while the success rate for untargeted attacks is defined as the proportion of inputs that are misclassified, $S_u = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(f(x_i) \neq y_i)$.

    Previous work (Carlini & Wagner, 2017a; Hosseini et al, 2019) used the True Positive Rate to measure the proportion of adversarial examples that are detected, which alone is insufficient for comparing different detection mechanisms because unsuccessful adversarial examples do not need to be detected.
  • In this paper, the authors propose to use the Undetected Rate, the proportion of attacks that are both successful and undetected, to evaluate detection mechanisms.
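A minimal NumPy sketch of these metrics; the array names (`preds`, `targets`, `labels`, `detected`) are illustrative and not taken from the authors' code.

```python
import numpy as np

def targeted_success_rate(preds, targets):
    # S_t: fraction of inputs classified as the attack's target class.
    return float(np.mean(preds == targets))

def untargeted_success_rate(preds, labels):
    # S_u: fraction of inputs that are misclassified.
    return float(np.mean(preds != labels))

def undetected_rate(success_mask, detected):
    # R: fraction of attacks that are both successful and evade the detector.
    return float(np.mean(success_mask & ~detected))

# Example usage:
#   R_t = undetected_rate(preds == targets, detected)   # targeted attacks
#   R_u = undetected_rate(preds != labels, detected)    # untargeted attacks
```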
Conclusion
  • The authors' detection mechanism relies on a similarity metric between the reconstruction and the input.
  • The authors showed that images generated by the reconstructive attack against the CapsNet are typically not adversarial, i.e., many of the resulting attacks resemble members of the target class even under a small ℓ∞ norm bound.
  • This is not the case for the CNN-based models.
  • The extensive qualitative studies indicate that the capsule model relies on visual features similar to those used by humans
  • The authors believe this is a step towards solving the true problem posed by adversarial examples
Tables
  • Table 1: Success Rate / Undetected Rate of white-box targeted and untargeted attacks on the MNIST dataset. In the table, St/Rt is shown for targeted attacks and Su/Ru is presented for untargeted attacks. A smaller success rate and undetected rate means a stronger defense model. Full results for FashionMNIST and SVHN can be seen in Table 5 in the Appendix
  • Table 2: Error Rate / Undetected Rate on the Corrupted MNIST dataset. A smaller error rate and undetected rate means a better defense model
  • Table 3: Success rate and the worst-case undetected rate of white-box targeted and untargeted reconstructive attacks. St/Rt is shown for targeted attacks and Su/Ru is presented for untargeted attacks. The worst-case undetected rate is reported by tuning the hyperparameter β in Eqn 1 and Eqn 2. The best defense models are shown in bold (a smaller success rate and undetected rate is better). All numbers are shown in %. A full table with more attacks can be seen in Table 6 in the Appendix
  • Table 4: Error rate of each model when the inputs are clean test images in each dataset
  • Table 5: Success rate and undetected rate of white-box targeted and untargeted attacks. In the table, St/Rt is shown for targeted attacks and Su/Ru is presented for untargeted attacks
  • Table 6: Success rate and the worst-case undetected rate of white-box targeted and untargeted reconstructive attacks. Below, St/Rt is shown for targeted attacks and Su/Ru is presented for untargeted attacks
  • Table 7: Success rate and undetected rate of black-box targeted and untargeted attacks. In the table, St/Rt is shown for targeted attacks and Su/Ru is presented for untargeted attacks. All numbers are shown in %
Related work
  • Adversarial examples were first introduced in (Biggio et al, 2013; Szegedy et al, 2013), where a given image was modified by following the gradient of a classifier’s output with respect to the image’s pixels. Goodfellow et al (2014) then developed the more efficient Fast Gradient Sign Method (FGSM), which can change the label of the input image X with a similarly imperceptible perturbation constructed by taking a single step in the direction of the sign of the gradient. Later, the Basic Iterative Method (BIM) (Kurakin et al, 2016) and Projected Gradient Descent (PGD) (Madry et al, 2017) improved on FGSM by taking multiple such steps, yielding stronger attacks. In addition, Carlini & Wagner (2017b) proposed another iterative optimization-based method to construct strong adversarial examples with small perturbations. (A minimal sketch of these gradient-based attacks is given at the end of this section.)

    An early approach to reducing vulnerability to adversarial examples was proposed by Goodfellow et al (2014), where a network was trained on both clean images and adversarially perturbed ones. Since then, there has been a constant “arms race” between better attacks and better defenses; Kurakin et al (2018) provide an overview of this field. However, many defenses against adversarial examples have been shown to rely on “obfuscated gradients” and can be circumvented in the white-box setting (Athalye et al, 2018).
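For concreteness, here is a minimal sketch of FGSM and its iterated (BIM/PGD-style) variant described above; `model` is any differentiable classifier returning logits, and the hyperparameter values are illustrative assumptions, not those used in the paper.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.1):
    # Single step in the direction of the sign of the loss gradient.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

def pgd(model, x, y, eps=0.1, alpha=0.01, steps=40):
    # Repeated gradient-sign steps, projected back into the eps L-inf ball around x.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```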
Funding
  • Proposes the Reconstructive Attack which seeks both to cause a misclassification and a low reconstruction error
  • Finds that CapsNets always perform better than convolutional networks
  • Develops methods for detecting adversarial examples by making use of class-conditional reconstruction networks
  • Proposes using the reconstruction sub-network in a CapsNet as an attack-independent detection mechanism
  • Reconstructs a given input from the pose parameters of the winning capsule and detects adversarial examples by comparing the reconstruction distributions for natural and adversarial images; extends this detection mechanism to standard convolutional neural networks and shows its effectiveness against black-box and white-box attacks on three image datasets: MNIST, FashionMNIST, and SVHN
Reference
  • Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.
  • Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. ACM, 2017a.
  • Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017b.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • Reuben Feinman, Ryan R. Curtin, Saurabh Shintre, and Andrew B. Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
  • Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, and George E. Dahl. Motivating the rules of the game for adversarial example research. arXiv preprint arXiv:1807.06732, 2018a.
  • Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. In International Conference on Learning Representations, 2018b.
  • Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku. Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960, 2017.
  • Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2014.
  • Ian Goodfellow, Yao Qin, and David Berthelot. Evaluation methodology for attacks against confidence thresholding models. 2018.
  • Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
  • Dan Hendrycks and Kevin Gimpel. Early methods for detecting adversarial images. In International Conference on Learning Representations, 2016.
  • Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
  • Hossein Hosseini, Sreeram Kannan, and Radha Poovendran. Are odds really odd? Bypassing statistical detection of adversarial examples. arXiv preprint arXiv:1907.12138, 2019.
  • Saumya Jetley, Nicholas A. Lord, and Philip H. S. Torr. With friends like these, who needs adversaries? In Advances in Neural Information Processing Systems, 2018.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations, 2016.
  • Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al. Adversarial attacks and defences competition. arXiv preprint arXiv:1804.00097, 2018.
  • Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Xin Li and Fuxin Li. Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2017.
  • Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017.
  • Felix Michels, Tobias Uelwer, Eric Upschulte, and Stefan Harmeling. On the vulnerability of capsule networks to adversarial attacks. arXiv preprint arXiv:1906.03612, 2019.
  • Norman Mu and Justin Gilmer. MNIST-C: A robustness benchmark for computer vision. In ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning, 2019.
  • Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 5, 2011.
  • Jathushan Rajasegaran, Vinoj Jayasundara, Sandaru Jayasekara, Hirunima Jayasekara, Suranga Seneviratne, and Ranga Rodrigo. DeepCaps: Going deeper with capsule networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10725–10733, 2019.
  • Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples. In International Conference on Machine Learning, 2019.
  • Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.
  • Lukas Schott, Jonas Rauber, Wieland Brendel, and Matthias Bethge. Robust perception through analysis by synthesis. arXiv preprint arXiv:1805.09190, 2018.
  • Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2017.
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2013.
  • Lucas Theis, Aaron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations, 2015.
  • Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. IEEE, 2010.
  • Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.