CRISPR: Eliminating Bias Neurons from an Instruction-following Language Model.
CoRR (2023)
Abstract
Large language models (LLMs) executing tasks through instruction-based
prompts often face challenges stemming from distribution differences between
user instructions and training instructions. This leads to distractions and
biases, especially when dealing with inconsistent dynamic labels. In this
paper, we introduce a novel bias mitigation method, CRISPR, designed to
alleviate instruction-label biases in LLMs. CRISPR utilizes attribution methods
to identify bias neurons influencing biased outputs and employs pruning to
eliminate the bias neurons. Experimental results demonstrate the method's
effectiveness in mitigating biases in instruction-based prompting, enhancing
language model performance on social bias benchmarks without compromising
pre-existing knowledge. CRISPR is highly practical and model-agnostic, and it
offers flexibility in adapting to evolving social biases.
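
The attribute-then-prune idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the attribution method, the scoring rule, or how pruning is applied, so everything below is an assumption made for illustration. The sketch assumes a single hidden layer with a linear readout, uses activation-times-gradient as the attribution (which for a linear readout equals activation times weight), scores each neuron by how much more it supports a biased label than the correct one, and prunes by zeroing the neuron's outgoing weights. All function names are hypothetical.

```python
from typing import List


def attribution(hidden: List[float], out_weights: List[float]) -> List[float]:
    """Per-neuron contribution to one output logit (activation x weight).

    For a linear readout, this equals activation x gradient of the logit.
    """
    return [h * w for h, w in zip(hidden, out_weights)]


def find_bias_neurons(
    hidden_batch: List[List[float]],
    w_biased: List[float],
    w_correct: List[float],
    k: int,
) -> List[int]:
    """Return the k neurons that most favor the biased label on average.

    Assumed scoring rule: attribution toward the biased label minus
    attribution toward the correct label, averaged over the batch.
    """
    n = len(w_biased)
    totals = [0.0] * n
    for hidden in hidden_batch:
        a_biased = attribution(hidden, w_biased)
        a_correct = attribution(hidden, w_correct)
        for i in range(n):
            totals[i] += a_biased[i] - a_correct[i]
    means = [t / len(hidden_batch) for t in totals]
    ranked = sorted(range(n), key=lambda i: means[i], reverse=True)
    return ranked[:k]


def prune(out_weights: List[float], neuron_ids: List[int]) -> List[float]:
    """Zero the outgoing weights of the selected neurons (returns a copy)."""
    pruned = list(out_weights)
    for i in neuron_ids:
        pruned[i] = 0.0
    return pruned


def logit(hidden: List[float], out_weights: List[float]) -> float:
    """Linear readout: sum of activation x weight."""
    return sum(attribution(hidden, out_weights))


# Toy data: three hidden neurons, two samples; neuron 2 drives the biased label.
hidden_batch = [[1.0, 0.5, 2.0], [1.0, 0.5, 1.0]]
w_biased = [0.1, 0.2, 1.5]
w_correct = [0.3, 0.4, 0.1]

bias_ids = find_bias_neurons(hidden_batch, w_biased, w_correct, k=1)
w_biased_pruned = prune(w_biased, bias_ids)
w_correct_pruned = prune(w_correct, bias_ids)
```

In this toy setup, the biased logit exceeds the correct logit on the first sample before pruning, and the order reverses once the single highest-scoring neuron is removed. Real models would need attributions computed through the full network and a check that pruning does not degrade unrelated behavior.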