Localizing Model Behavior with Path Patching

Nicholas Goldowsky-Dill, Chris MacLeod,Lucas Sato,Aryaman Arora

CoRR(2023)

引用 0|浏览9
暂无评分
摘要
Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.
更多
查看译文
关键词
model behavior,path
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要