Eight Methods to Evaluate Robust Unlearning in LLMs
CoRR (2024)
Abstract
Machine unlearning can be useful for removing harmful capabilities and
memorized text from large language models (LLMs), but there are not yet
standardized methods for rigorously evaluating it. In this paper, we first
survey techniques and limitations of existing unlearning evaluations. Second,
we apply a comprehensive set of tests for the robustness and competitiveness of
unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich
(2023). While WHP's unlearning generalizes well when evaluated with the
"Familiarity" metric from Eldan and Russinovich, we find i)
higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP
performs on par with the original model on Harry Potter Q&A tasks, iii) it
represents latent knowledge comparably to the original model, and iv) there is
collateral unlearning in related domains. Overall, our results highlight the
importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.