Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah

arXiv (2022)

Citations: 25 | Views: 154

Abstract

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

Keywords

models
