Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
arXiv (2024)
Abstract
We introduce methods for discovering and applying sparse feature circuits.
These are causally implicated subnetworks of human-interpretable features for
explaining language model behaviors. Circuits identified in prior work consist
of polysemantic and difficult-to-interpret units like attention heads or
neurons, rendering them unsuitable for many downstream applications. In
contrast, sparse feature circuits enable detailed understanding of
unanticipated mechanisms. Because they are based on fine-grained units, sparse
feature circuits are useful for downstream tasks: We introduce SHIFT, where we
improve the generalization of a classifier by ablating features that a human
judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised
and scalable interpretability pipeline by discovering thousands of sparse
feature circuits for automatically discovered model behaviors.
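
The ablation step behind SHIFT can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the features come from a sparse-autoencoder-style dictionary (a ReLU of a linear encoding, with a linear decoder), which the abstract does not specify. The names `ablate_features`, `encoder`, `decoder`, and `irrelevant` are hypothetical.

```python
import numpy as np


def ablate_features(activation, encoder, decoder, irrelevant):
    """Zero the selected dictionary features and rebuild the activation.

    All shapes and names are illustrative assumptions:
      activation: (..., d_model) model activation
      encoder:    (d_model, n_features) dictionary encoder weights
      decoder:    (n_features, d_model) dictionary decoder weights
      irrelevant: indices of features judged task-irrelevant
    """
    # Feature activations under the assumed ReLU-linear encoding.
    features = np.maximum(activation @ encoder, 0.0)
    # Whatever the dictionary fails to reconstruct is carried through unchanged.
    residual = activation - features @ decoder
    # Ablate: set the chosen features to zero.
    edited = features.copy()
    edited[..., irrelevant] = 0.0
    # Map back to activation space and restore the unexplained part.
    return edited @ decoder + residual
```

Because the reconstruction error is added back, only the contribution of the ablated features changes; with no features selected, the function returns the original activation.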