Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations

ACL, pp. 4157-4165, 2020.

Keywords:
faulty decision mechanism; natural language inference; neural model; inconsistent explanation; inconsistent natural language

Abstract:

To increase trust in artificial intelligence systems, a growing number of works enhance these systems with the capability of producing natural language explanations that support their predictions. In this work, we show that such appealing frameworks are nonetheless prone to generating inconsistent explanations, such as "A dog is a..."

Introduction
  • In order to explain the predictions produced by accurate yet black-box neural models, a growing number of works propose extending these models with natural language explanation generation modules, obtaining models that explain themselves in human language (Hendricks et al., 2016; Camburu et al., 2018; Park et al., 2018; Kim et al., 2018; Ling et al., 2017).

    In this work, the authors first draw attention to the fact that such models, while appealing, are prone to generating inconsistent explanations.
  • Inconsistent explanations reveal at least one of the following undesired behaviors: (i) at least one of the explanations does not faithfully describe the decision mechanism of the model, or (ii) the model relied on a faulty decision mechanism for at least one of the instances.
  • A pair of inconsistent explanations does not necessarily imply that the model relies on a faulty decision mechanism, because the explanations may not faithfully describe the decision mechanism of the model.
  • The authors here will not investigate the problem of identifying which of the two undesired behaviors is true for a pair of inconsistent explanations.
Highlights
  • In order to explain the predictions produced by accurate yet black-box neural models, a growing number of works propose extending these models with natural language explanation generation modules, obtaining models that explain themselves in human language (Hendricks et al., 2016; Camburu et al., 2018; Park et al., 2018; Kim et al., 2018; Ling et al., 2017).

    In this work, we first draw attention to the fact that such models, while appealing, are prone to generating inconsistent explanations
  • We introduce a framework for checking if models are robust against generating inconsistent natural language explanations
  • We apply our framework on a state-of-the-art neural natural language inference model that generates natural language explanations for its decisions (Camburu et al., 2018). We show that this model can generate a significant number of inconsistent explanations
  • We identified a total of 1044 pairs of inconsistent explanations starting from the SNLI test set, which contains 9824 instances
  • We drew attention that models generating natural language explanations are prone to producing inconsistent explanations
  • We introduced a generic framework for identifying such inconsistencies and showed that the best existing model on e-SNLI can generate a significant number of inconsistencies
Methods
  • Given a model m that can jointly produce predictions and natural language explanations, the authors propose a framework that, for any given instance x, attempts to generate new instances for which the model produces explanations that are inconsistent with the explanation produced for x; the authors refer to the latter as em(x).

    The authors approach the problem in two high-level steps; a minimal sketch of the resulting procedure is given after this list.
  • This work is the first to tackle this problem setting, especially due to the challenging requirement of generating a full target sequence; see Section 4 for a comparison with existing works.
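As a rough illustration of this two-step procedure, here is a minimal Python sketch of the attack loop. Every name in it is an assumption made for illustration rather than the authors' code: model stands for m and returns a (prediction, explanation) pair, create_inconsistent produces statements inconsistent with em(x) (e.g., by negation or swapping), reverse_explainer is a trained reverse model proposing inputs for a target explanation, and are_inconsistent is the inconsistency check.

```python
def attack(model, reverse_explainer, create_inconsistent, are_inconsistent, x):
    """Search for inputs whose explanations are inconsistent with e_m(x) (illustrative sketch)."""
    # Explanation e_m(x) produced by the model for the original instance x.
    _, e_x = model(x)

    # Step 1: build a set I_e of statements that are inconsistent with e_m(x).
    inconsistent_targets = create_inconsistent(e_x)

    adversarial_pairs = []
    for target_expl in inconsistent_targets:
        # Step 2: use the reverse (explanation -> input) model to propose
        # candidate inputs that could lead the model to produce target_expl.
        for x_rev in reverse_explainer(target_expl):
            _, e_rev = model(x_rev)
            # Keep the candidate only if the explanation the model actually
            # produces for it is inconsistent with the original explanation.
            if are_inconsistent(e_x, e_rev):
                adversarial_pairs.append((x, x_rev, e_x, e_rev))
    return adversarial_pairs
```

The final check against the model's actual output is what separates genuine inconsistencies from candidates merely proposed by the reverse model.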
Results
  • Results and discussion

    The authors identified a total of 1044 pairs of inconsistent explanations starting from the SNLI test set, which contains 9824 instances.

    Creating Ie: candidate inconsistent explanations are created by negating the original explanation (removing the tokens "not" and "n't" if they are present) and by swapping explanations; a toy sketch of the negation rule is shown after this list.
  • The authors noticed that there are, on average, 1.93 ± 1.77 distinct reverse hypotheses giving rise to a pair of inconsistent explanations.
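To make the negation rule above concrete, here is a toy Python sketch covering only the removal case described in the text (dropping "not"/"n't" when they are present); the function name and the whitespace tokenization are simplifying assumptions, and the swapping-based rules are omitted.

```python
import re

def negate_explanation(explanation: str) -> str:
    """Toy negation rule: drop "not" / "n't" tokens when present (removal case only)."""
    tokens = explanation.split()
    has_negation = any(
        t.lower() == "not" or t.lower().endswith("n't") for t in tokens
    )
    if has_negation:
        kept = []
        for t in tokens:
            if t.lower() == "not":
                continue  # drop standalone "not"
            # strip the "n't" suffix from contractions, e.g. "isn't" -> "is"
            kept.append(re.sub(r"n't$", "", t, flags=re.IGNORECASE))
        return " ".join(kept)
    # When no negation token is present, the full procedure would instead
    # insert one; where to insert it is template-dependent and omitted here.
    return explanation

print(negate_explanation("A dog is not an animal"))  # -> "A dog is an animal"
print(negate_explanation("The man isn't sleeping"))  # -> "The man is sleeping"
```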
Conclusion
  • The authors drew attention that models generating natural language explanations are prone to producing inconsistent explanations.
  • This concern is general and can have a large practical impact.
  • The authors introduced a generic framework for identifying such inconsistencies and showed that the best existing model on e-SNLI can generate a significant number of inconsistencies.
  • Future work will focus on developing more advanced procedures for detecting inconsistencies, and on building robust models that do not generate inconsistencies.
Tables
  • Table1: Examples of detected inconsistent explanations – the reverse hypotheses generated by our method (right) are realistic
  • Table2: More examples of inconsistent explanations detected with our method
Funding
  • This work was supported by a JP Morgan PhD Fellowship, the Alan Turing Institute under the EPSRC grant EP/N510129/1, the EPSRC grant EP/R013667/1, the AXA Research Fund, and the EU Horizon 2020 Research and Innovation Programme under the grant 875160
Reference
  • Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. CoRR, abs/1711.02173.
  • Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. 2013. Evasion attacks against machine learning at test time. In ECML/PKDD (3), volume 8190 of LNCS, pages 387–402. Springer.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326.
  • Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In NeurIPS, pages 9560–9572.
  • Vicente Ivan Sanchez Carmona, Jeff Mitchell, and Sebastian Riedel. 2018. Behavior analysis of NLI models: Uncovering the influence of three factors on robustness. In NAACL-HLT, pages 1975–1985. Association for Computational Linguistics.
  • Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. 2018. Seq2Sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. CoRR, abs/1803.01128.
  • Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In ECCV (4), volume 9908 of LNCS, pages 3–19. Springer.
  • Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. 2017. Grounding visual explanations (extended abstract). CoRR, abs/1711.06465.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Hossein Hosseini, Baicen Xiao, and Radha Poovendran. 2017. Deceiving Google's cloud video intelligence API built for summarizing videos. In CVPR Workshops, pages 1305–1309. IEEE Computer Society.
  • Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. CoRR, abs/1804.06059.
  • Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John F. Canny, and Zeynep Akata. 2018. Textual explanations for self-driving vehicles. In ECCV (2), volume 11206 of LNCS, pages 577–593. Springer.
  • Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. CoRR, abs/1705.04146.
  • Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR, abs/1605.09090.
  • Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc.
  • Pasquale Minervini and Sebastian Riedel. 2018. Adversarially regularising neural NLI models to integrate logical background knowledge. In CoNLL, pages 65–74. Association for Computational Linguistics.
  • Tsendsuren Munkhdalai and Hong Yu. 2016. Neural semantic encoders. CoRR, abs/1607.04315.
  • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. 2018. Multimodal explanations: Justifying decisions and pointing to the evidence. CoRR, abs/1802.08129.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, pages 1135–1144. ACM.
  • Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. CoRR, abs/1509.06664.
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In ICLR (Poster).
  • Wenqi Wang, Benxiao Tang, Run Wang, Lina Wang, and Aoshuang Ye. 2019. A survey on adversarial attacks and defenses in text. CoRR, abs/1902.07285.
  • Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2019. Adversarial attacks on deep learning models in natural language processing: A survey.
  • Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In ICLR (Poster). OpenReview.net.
  • Below we present the list of templates that we manually found to match most of the e-SNLI explanations (Camburu et al., 2018). We recall that Camburu et al. (2018) did not impose any templates during the collection of the dataset; the templates emerged as a natural consequence of the task and of the SNLI dataset.