Do Not Trust Me: Explainability Against Text Classification

ECAI 2023 (2023)

Abstract
Explaining artificial intelligence models can be exploited to launch targeted adversarial attacks on text classification algorithms: understanding the reasoning behind a model's decisions makes it easier to prepare such samples. Most current text-based adversarial attacks rely on brute force. Instead, we use the SHAP approach to identify the importance of tokens in each sample and modify the crucial ones to prepare targeted attacks. We base our results on experiments using 5 datasets. Our approach outperforms TextBugger and TextFooler, achieving better results on 4 out of 5 datasets against TextBugger and 3 out of 5 datasets against TextFooler, while minimizing the perturbation introduced into the texts. In particular, on the WikiPL dataset we outperform the efficacy of TextFooler by over 1900% and of TextBugger by over 200%, while additionally keeping high cosine similarity between the original text sample and the adversarial example. The evaluation was further supported by a survey assessing the quality of the adversarial examples and confirming that the text perturbations did not change the intended class according to subjective, human classification.
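To make the described workflow concrete, the following is a minimal, illustrative sketch of how SHAP token importances could drive such a targeted perturbation. The victim pipeline (TF-IDF plus logistic regression), the `perturb_token` substitution rule, and the toy texts are assumptions introduced here for illustration, not the paper's actual models, datasets, or attack; the sketch only shows the general idea of ranking tokens by SHAP value, perturbing the most influential ones, and checking cosine similarity to the original text.

```python
# Hypothetical sketch: SHAP-guided token perturbation against a toy text classifier.
import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Toy victim model: TF-IDF features + logistic regression (stand-in for the real classifier).
texts = ["the movie was wonderful and moving",
         "the movie was dull and far too long"]
labels = [1, 0]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Explain the probability of the positive class with KernelSHAP.
predict_pos = lambda m: clf.predict_proba(m)[:, 1]
explainer = shap.KernelExplainer(predict_pos, X.toarray())

def shap_guided_attack(text, n_tokens=2, perturb_token=lambda t: t[::-1]):
    """Perturb the n_tokens with the largest absolute SHAP values.

    perturb_token is a hypothetical substitution rule (here a simple character
    reversal); a real attack would use synonyms or visually similar characters,
    as TextBugger and TextFooler do.
    """
    x = vectorizer.transform([text]).toarray()
    shap_values = explainer.shap_values(x, nsamples=100)[0]
    vocab = np.array(vectorizer.get_feature_names_out())
    # Rank vocabulary features by their importance for this sample.
    ranked = vocab[np.argsort(-np.abs(shap_values))]
    targets = [w for w in ranked if w in text.split()][:n_tokens]
    adversarial = " ".join(perturb_token(w) if w in targets else w
                           for w in text.split())
    # Cosine similarity between original and adversarial feature vectors.
    sim = cosine_similarity(x, vectorizer.transform([adversarial]).toarray())[0, 0]
    return adversarial, sim

adv, sim = shap_guided_attack(texts[0])
print(adv, sim)
```

Under these assumptions, only the few tokens that SHAP identifies as most influential are touched, which is why the perturbed text can stay close to the original in cosine similarity while still shifting the classifier's decision.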
Keywords
explainability, classification, text