Can Large Language Models Explain Themselves?
arXiv (2024)
Abstract
Instruction-tuned large language models (LLMs) excel at many tasks and will
even provide explanations for their behavior, termed self-explanations. Since
these models are directly accessible to the public, there is a risk that
convincing but wrong explanations lead to unsupported confidence in LLMs.
Therefore, the interpretability-faithfulness of self-explanations is an
important consideration for AI Safety. Assessing this faithfulness is
challenging, as the models are too complex for humans to annotate what a
correct explanation is. To address
this, we propose employing self-consistency checks as a measure of
faithfulness. For example, if an LLM says a set of words is important for
making a prediction, then it should not be able to make the same prediction
without these words. While self-consistency checks are a common approach to
faithfulness, they have not previously been applied to LLMs' self-explanations.
We apply self-consistency checks to three types of self-explanations:
counterfactuals, importance measures, and redactions. Our work demonstrates
that faithfulness is both task- and model-dependent; for example, in sentiment
classification, counterfactual explanations are more faithful for Llama2,
importance measures for Mistral, and redactions for Falcon 40B. Finally, our
findings are robust to prompt variations.
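
The self-consistency check the abstract illustrates, redacting the words a model claims were important and testing whether its prediction survives, can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: `query_llm` is a hypothetical helper standing in for whatever chat-completion call is used, and the sentiment task and prompts are illustrative only.

```python
# Minimal sketch of a self-consistency check for importance-measure
# self-explanations. `query_llm` is a HYPOTHETICAL stand-in for any
# instruction-tuned LLM endpoint (e.g. Llama2, Mistral, Falcon 40B);
# it is not an API from the paper.

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM, return its text reply."""
    raise NotImplementedError("wire this to your model of choice")

def importance_faithfulness_check(sentence: str) -> bool:
    # 1. Get the model's sentiment prediction on the original input.
    original = query_llm(
        f'What is the sentiment of: "{sentence}"? Answer positive or negative.'
    ).strip().lower()

    # 2. Ask the model to explain itself: which words mattered?
    reply = query_llm(
        f'Which words in "{sentence}" were most important for that sentiment? '
        "List them comma-separated."
    )
    important_words = {w.strip().lower() for w in reply.split(",")}

    # 3. Redact the words the model claims are important.
    redacted = " ".join(
        "[REDACTED]" if w.strip(".,!?").lower() in important_words else w
        for w in sentence.split()
    )

    # 4. Re-query on the redacted input. If the self-explanation is
    #    faithful, the model should no longer reach the same prediction.
    after = query_llm(
        f'What is the sentiment of: "{redacted}"? Answer positive or negative.'
    ).strip().lower()

    return after != original  # True -> consistent with a faithful explanation
```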