Pandora's White-Box: Increased Training Data Leakage in Open LLMs
CoRR (2024)
Abstract
In this paper we undertake a systematic study of privacy attacks against open
source Large Language Models (LLMs), where an adversary has access to either
the model weights, gradients, or losses, and tries to exploit them to learn
something about the underlying training data. Our headline results are the
first membership inference attacks (MIAs) against pre-trained LLMs that are
able to simultaneously achieve high TPRs and low FPRs, and a pipeline showing
that over 50% (!) of the fine-tuning dataset can be extracted from a
fine-tuned LLM in natural settings. We consider varying degrees of access to
the underlying model, customization of the language model, and resources
available to the attacker. In the pre-trained setting, we propose three new
white-box MIAs: an attack based on the gradient norm, a supervised neural
network classifier, and a single step loss ratio attack. All outperform
existing black-box baselines, and our supervised attack closes the gap between
MIA success against LLMs and MIA success against other types of models. In fine-tuning, we
find that given access to the loss of the fine-tuned and base models, a
fine-tuned loss ratio attack, FLoRA, is able to achieve near-perfect MIA
performance. We then leverage these MIAs to extract fine-tuning data from
fine-tuned language models. We find that the pipeline of generating from
fine-tuned models prompted with a small snippet of the prefix of each training
example, followed by using FLoRA to select the most likely training sample,
succeeds in extracting the majority of the fine-tuning dataset after only 3 epochs of
fine-tuning. Taken together, these findings show that highly effective MIAs are
available in almost all LLM training settings, and highlight that great care
must be taken before LLMs are fine-tuned on highly sensitive data and then
deployed.
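
As a concrete illustration of the white-box setting for pre-trained models, the sketch below computes a gradient-norm membership score for a causal LM via HuggingFace transformers. This is a minimal sketch, not the paper's exact attack: the model name (`gpt2`) is a stand-in for an open pre-trained LLM, and the decision threshold is illustrative and would have to be calibrated on known member/non-member data.

```python
# Gradient-norm membership-inference score for a causal LM, assuming
# white-box (weights + gradients) access. Intuition: training members
# tend to sit closer to a loss minimum, so their loss gradients are
# smaller; a low norm suggests membership.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an open pre-trained LLM (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def gradient_norm_score(text: str) -> float:
    """L2 norm of the LM loss gradient w.r.t. all model parameters."""
    enc = tokenizer(text, return_tensors="pt")
    model.zero_grad()
    out = model(**enc, labels=enc["input_ids"])
    out.loss.backward()
    sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            sq += p.grad.detach().pow(2).sum().item()
    return sq ** 0.5

score = gradient_norm_score("The quick brown fox jumps over the lazy dog.")
is_member = score < 10.0  # illustrative threshold; calibrate in practice
```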
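For the fine-tuning setting, the following sketch combines the two ingredients the abstract describes: a FLoRA-style loss ratio between the fine-tuned and base models, and prefix-prompted generation followed by ratio-based selection of the most likely training sample. The model names, generation settings, and the convention that a lower ratio indicates fine-tuning-set membership are assumptions made for illustration, not the paper's exact recipe.

```python
# FLoRA-style scoring and extraction sketch: generate candidate
# completions from the fine-tuned model given a short prefix, then keep
# the candidate whose loss ratio (fine-tuned loss / base-model loss)
# most strongly suggests memorization during fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "gpt2"  # stand-in for the open base model (assumption)
ft_name = "gpt2"    # stand-in for its fine-tuned variant (assumption)
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name).eval()
ft = AutoModelForCausalLM.from_pretrained(ft_name).eval()

@torch.no_grad()
def nll(model, text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt")
    return model(**enc, labels=enc["input_ids"]).loss.item()

def flora_score(text: str) -> float:
    """Fine-tuned loss ratio; assumed here: lower => more likely a member."""
    return nll(ft, text) / nll(base, text)

@torch.no_grad()
def extract(prefix: str, n_candidates: int = 8) -> str:
    """Sample candidates from the fine-tuned model, select by loss ratio."""
    enc = tokenizer(prefix, return_tensors="pt")
    outs = ft.generate(
        **enc,
        do_sample=True,
        max_new_tokens=64,
        num_return_sequences=n_candidates,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outs]
    return min(candidates, key=flora_score)
```

With distinct base and fine-tuned checkpoints, `extract` would be run once per known training-example prefix; the loss-ratio selection step is what lifts raw generation into the high extraction rates the abstract reports.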