Robustness-Congruent Adversarial Training for Secure Machine Learning Model Updates
CoRR(2024)
摘要
Machine-learning models demand for periodic updates to improve their average
accuracy, exploiting novel architectures and additional data. However, a
newly-updated model may commit mistakes that the previous model did not make.
Such misclassifications are referred to as negative flips, and experienced by
users as a regression of performance. In this work, we show that this problem
also affects robustness to adversarial examples, thereby hindering the
development of secure model update practices. In particular, when updating a
model to improve its adversarial robustness, some previously-ineffective
adversarial examples may become misclassified, causing a regression in the
perceived security of the system. We propose a novel technique, named
robustness-congruent adversarial training, to address this issue. It amounts to
fine-tuning a model with adversarial training, while constraining it to retain
higher robustness on the adversarial examples that were correctly classified
before the update. We show that our algorithm and, more generally, learning
with non-regression constraints, provides a theoretically-grounded framework to
train consistent estimators. Our experiments on robust models for computer
vision confirm that (i) both accuracy and robustness, even if improved after
model update, can be affected by negative flips, and (ii) our
robustness-congruent adversarial training can mitigate the problem,
outperforming competing baseline methods.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要