Improving Dictionary Learning with Gated Sparse Autoencoders
CoRR (2024)
Abstract
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage – systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
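The sketch below illustrates the core idea from the abstract: a gating path decides which features fire, a separate magnitude path estimates how strongly they fire, and the L1 sparsity penalty is applied only to the gating path. This is a minimal PyTorch sketch, not the authors' implementation; the parameter names (W_enc, r_mag, l1_coeff), the weight-sharing between the two paths, and the frozen-decoder auxiliary term are assumptions made for illustration.

```python
# Minimal sketch of a Gated SAE, assuming a PyTorch setup.
# Gating path: which features fire. Magnitude path: how strongly.
# The L1 penalty touches only the gating pre-activations, so feature
# magnitudes are not shrunk toward zero (the "shrinkage" bias).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Per-feature rescaling letting the magnitude path reuse W_enc
        # from the gating path (an assumed weight-tying choice).
        self.r_mag = nn.Parameter(torch.zeros(d_sae))

    def forward(self, x: torch.Tensor):
        x_cent = x - self.b_dec
        # Gating path: decides WHICH feature directions are active.
        pi_gate = x_cent @ self.W_enc + self.b_gate
        # Magnitude path: estimates HOW MUCH each active feature fires.
        pi_mag = x_cent @ (self.W_enc * torch.exp(self.r_mag)) + self.b_mag
        f = (pi_gate > 0).float() * F.relu(pi_mag)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f, pi_gate

    def loss(self, x: torch.Tensor, l1_coeff: float = 1e-3):
        x_hat, _, pi_gate = self(x)
        recon = (x - x_hat).pow(2).sum(-1).mean()
        # Sparsity penalty applied ONLY to the gating pre-activations.
        sparsity = F.relu(pi_gate).sum(-1).mean()
        # Auxiliary reconstruction through a frozen decoder keeps the gate
        # path grounded; the exact weighting here is an assumption.
        x_hat_gate = F.relu(pi_gate) @ self.W_dec.detach() + self.b_dec.detach()
        aux = (x - x_hat_gate).pow(2).sum(-1).mean()
        return recon + l1_coeff * sparsity + aux
```

As a usage note, the module would be trained on batches of LM residual-stream activations of width d_model; the boolean gate passes no gradient, which is why the auxiliary frozen-decoder term is needed to train the gating path.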