Improving Dictionary Learning with Gated Sparse Autoencoders
CoRR (2024)
Abstract
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage – systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
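The sketch below illustrates the core idea from the abstract: a gating path decides which features fire, a separate magnitude path estimates how strongly they fire, and the L1 sparsity penalty is applied only to the gating path. This is a minimal PyTorch sketch, not the authors' implementation; the parameter names (W_enc, r_mag, l1_coeff), the weight-sharing between the two paths, and the frozen-decoder auxiliary term are assumptions made for illustration.

```python
# Minimal sketch of a Gated SAE, assuming a PyTorch setup.
# Gating path: which features fire. Magnitude path: how strongly.
# The L1 penalty touches only the gating pre-activations, so feature
# magnitudes are not shrunk toward zero (the "shrinkage" bias).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Per-feature rescaling letting the magnitude path reuse W_enc
        # from the gating path (an assumed weight-tying choice).
        self.r_mag = nn.Parameter(torch.zeros(d_sae))

    def forward(self, x: torch.Tensor):
        x_cent = x - self.b_dec
        # Gating path: decides WHICH feature directions are active.
        pi_gate = x_cent @ self.W_enc + self.b_gate
        # Magnitude path: estimates HOW MUCH each active feature fires.
        pi_mag = x_cent @ (self.W_enc * torch.exp(self.r_mag)) + self.b_mag
        f = (pi_gate > 0).float() * F.relu(pi_mag)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f, pi_gate

    def loss(self, x: torch.Tensor, l1_coeff: float = 1e-3):
        x_hat, _, pi_gate = self(x)
        recon = (x - x_hat).pow(2).sum(-1).mean()
        # Sparsity penalty applied ONLY to the gating pre-activations.
        sparsity = F.relu(pi_gate).sum(-1).mean()
        # Auxiliary reconstruction through a frozen decoder keeps the gate
        # path grounded; the exact weighting here is an assumption.
        x_hat_gate = F.relu(pi_gate) @ self.W_dec.detach() + self.b_dec.detach()
        aux = (x - x_hat_gate).pow(2).sum(-1).mean()
        return recon + l1_coeff * sparsity + aux
```

As a usage note, the module would be trained on batches of LM residual-stream activations of width d_model; the boolean gate passes no gradient, which is why the auxiliary frozen-decoder term is needed to train the gating path.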