Layerwise universal adversarial attack on NLP models

Olga Tsymboi, Danil Malaev, Andrei Petrovskii, Ivan Oseledets

ACL (2023)

Abstract
In this work, we examine the vulnerability of language models to universal adversarial triggers (UATs). We propose a new white-box approach to constructing layerwise UATs (LUATs), which searches for triggers by perturbing the hidden layers of a network. On three transformer models and three datasets from the GLUE benchmark, we demonstrate that our method provides better transferability in the model-to-model setting, with an average gain of 9.3% in fooling rate over the baseline. Moreover, we investigate trigger transferability in the task-to-task setting. Using small subsets of datasets similar to the target tasks to choose the perturbed layer, we show that LUATs are more efficient than vanilla UATs by 7.1% in fooling rate.
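The universal-trigger search sketched in the abstract can be illustrated with a toy greedy example. Everything below is an illustrative stand-in, not the paper's LUAT method: the "model" is a bag-of-embeddings linear classifier, the "hidden layer" is a mean of embeddings, and the search scores every vocabulary token exhaustively instead of using a gradient-based (HotFlip-style) approximation or perturbing transformer hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (names and shapes are illustrative assumptions, not from the paper)
V, d = 20, 8                      # vocabulary size, embedding dimension
E = rng.normal(size=(V, d))       # embedding table
W = rng.normal(size=(d, 2))       # linear classifier: logits = h @ W

def hidden(token_ids):
    # Stand-in "hidden layer": mean of token embeddings.
    return E[token_ids].mean(axis=0)

def logits(token_ids):
    return hidden(token_ids) @ W

def avg_loss(trigger, inputs, labels):
    """Mean cross-entropy of the true labels with the trigger prepended."""
    total = 0.0
    for x, y in zip(inputs, labels):
        z = logits(list(trigger) + list(x))
        p = np.exp(z - z.max())
        p /= p.sum()
        total += -np.log(p[y] + 1e-12)
    return total / len(inputs)

def greedy_trigger(inputs, labels, trigger_len=2, iters=5):
    """Greedy coordinate search for a universal trigger.

    For each trigger slot in turn, keep the vocabulary token that maximizes
    the average loss over the whole batch (so the same trigger attacks every
    input). Here each candidate is scored exhaustively; a gradient-based
    HotFlip score would approximate this ranking in one backward pass.
    """
    trigger = [0] * trigger_len
    for _ in range(iters):
        for pos in range(trigger_len):
            scores = [
                avg_loss(trigger[:pos] + [tok] + trigger[pos + 1:], inputs, labels)
                for tok in range(V)
            ]
            trigger[pos] = int(np.argmax(scores))
    return trigger
```

Because the current token is always among the scored candidates, each greedy update can only keep the average loss the same or increase it, so the returned trigger is never worse (on the search batch) than the all-zeros initialization.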