Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

Tohoku University, RIKEN, MBZUAI
NeurIPS 2025

Abstract

Recent interpretability work on large language models (LLMs) has increasingly relied on feature discovery via proxy modules, where the quality of features learned by, e.g., sparse autoencoders (SAEs) is then evaluated. This paradigm raises a critical question: do such learned features actually have better properties than those already represented within the original model parameters? Unfortunately, only a few studies have made this comparison systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, viewed as key-value memories, using modern interpretability benchmarks. Our extensive evaluation reveals that SAEs and FFs exhibit a similar range of interpretability, with SAEs showing an observable but minimal improvement in some aspects. Surprisingly, in certain aspects even vanilla FFs yield better interpretability than SAEs, and the features discovered by SAEs and FFs diverge. These results question the advantage of SAEs, in terms of both feature quality and faithfulness, over directly interpreting FF feature vectors, and suggest that FF key-value parameters serve as a strong baseline in modern interpretability research.

SAEs, Transcoders, and FF-KVs

SAE

An SAE decomposes and reconstructs neuron activations, typically those taken after the FF layer (in the residual stream). That is, let \(\bm{x}_\mathrm{FF_{out}} \in \mathbb{R}^{d_\mathrm{model}}\) be the neuron activations after the FF layer and \(d_{\mathrm{SAE}}\) be the dimension of the SAE features; the SAE module computes:

\(\bm{x}_\mathrm{FF_{out}} \approx \text{ReLU}(\bm{x}_\mathrm{FF_{out}}\bm{W}_{\mathrm{enc}} + \bm{b}_{\mathrm{enc}})\, \bm{W}_{\mathrm{dec}} + \bm{b}_{\mathrm{dec}}\)

with \(\bm{W}_{\mathrm{enc}} \in \mathbb{R}^{d_\mathrm{model} \times d_\mathrm{SAE}}\), \(\bm{W}_{\mathrm{dec}} \in \mathbb{R}^{d_\mathrm{SAE} \times d_\mathrm{model}}\), \(\bm{b}_{\mathrm{enc}} \in \mathbb{R}^{d_\mathrm{SAE}}\), and \(\bm{b}_{\mathrm{dec}} \in \mathbb{R}^{d_\mathrm{model}}\). \(\text{ReLU}(\cdot): \mathbb{R}^d\to \mathbb{R}^d\) denotes the element-wise ReLU activation. Each dimension of the encoder activations is treated as a potentially interpretable feature, and the decoder matrix \(\bm{W}_{\mathrm{dec}}\) maps each feature dimension to its feature vector in the representation space. The module is trained with a sparsity loss so that the activations are as sparse as possible, disentangling the potentially polysemantic input neurons.
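As a minimal sketch of this computation (in PyTorch, with illustrative tensor names of our own rather than any particular SAE library's API):

```python
import torch

def sae_forward(x_ff_out, W_enc, b_enc, W_dec, b_dec):
    # x_ff_out: (batch, d_model) activations taken after the FF layer
    # W_enc: (d_model, d_sae), b_enc: (d_sae,)
    # W_dec: (d_sae, d_model), b_dec: (d_model,)
    feats = torch.relu(x_ff_out @ W_enc + b_enc)  # (batch, d_sae) sparse feature activations
    recon = feats @ W_dec + b_dec                 # (batch, d_model) reconstruction of x_ff_out
    return feats, recon
```

During training, the reconstruction error between `x_ff_out` and `recon` is typically combined with a sparsity penalty on `feats` (e.g., an L1 term).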

Transcoder

Notably, Transcoders have recently been proposed as an alternative to SAEs, and they are perhaps the work closest to this study. A Transcoder approximates the original FF layer by training a sparse MLP as a proxy that predicts the FF output \(\bm{x}_\mathrm{FF_{out}}\) from the FF input \(\bm{x}_\mathrm{FF_{in}}\); its internal activations (\(\in \mathbb{R}^{d_\mathrm{TC}}\)) are then evaluated in the same way as standard SAE features. Still, that work did not directly evaluate how interpretable the original FF's internal activations are, and our work addresses this overlooked question.
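Under the same conventions, the Transcoder forward pass can be sketched as follows (again an illustrative sketch, not the interface of any released Transcoder implementation); the only difference from the SAE sketch above is that the input is \(\bm{x}_\mathrm{FF_{in}}\) while the reconstruction target is \(\bm{x}_\mathrm{FF_{out}}\):

```python
import torch

def transcoder_forward(x_ff_in, W_enc, b_enc, W_dec, b_dec):
    # x_ff_in: (batch, d_model) FF input; the training target is the FF output x_ff_out
    feats = torch.relu(x_ff_in @ W_enc + b_enc)   # (batch, d_tc) sparse hidden activations
    pred_ff_out = feats @ W_dec + b_dec           # (batch, d_model) predicted FF output
    return feats, pred_ff_out
```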

FF-KVs

Feed-forward layers in Transformers first project the FF input \(\bm{x}_\mathrm{FF_{in}} \in \mathbb{R}^{d_\mathrm{model}}\) to a \(d_\mathrm{FF}\)-dimensional representation (\(d_\mathrm{model} < d_\mathrm{FF}\)), apply an element-wise non-linear activation \(\bm{\phi}(\cdot): \mathbb{R}^d\to \mathbb{R}^d\), such as SwiGLU, and then project it back:

\(\bm{x}_\mathrm{FF_{out}} = \bm{\phi}(\bm{x}_\mathrm{FF_{in}} \bm{W}_K + \bm{b}_K)\, \bm{W}_V + \bm{b}_V = \displaystyle\sum_{i=1}^{d_\mathrm{FF}} \bm{\phi}(\bm{x}_\mathrm{FF_{in}} \bm{W}_{K[:,i]} + \bm{b}_{K[i]})\, \bm{W}_{V[i,:]} + \bm{b}_V\)

where \(\bm{W}_K \in \mathbb{R}^{d_\mathrm{model} \times d_\mathrm{FF}}\) and \(\bm{W}_V \in \mathbb{R}^{d_\mathrm{FF} \times d_\mathrm{model}}\) are learnable weight matrices, and \(\bm{b}_K \in \mathbb{R}^{d_\mathrm{FF}}\), \(\bm{b}_V \in \mathbb{R}^{d_\mathrm{model}}\) are learnable biases. \(d_{\mathrm{FF}}\) is typically set to \(4d_{\mathrm{model}}\). One interpretation of the FF layer is as a knowledge retrieval module: it first computes keys (activations) from an input \(\bm{x}_\mathrm{FF_{in}}\) and then aggregates their associated values (feature vectors).
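This key-value reading can be made concrete with a short sketch (PyTorch; GELU is used as a stand-in activation here, since gated activations such as SwiGLU involve a second projection that is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def ff_key_value(x_ff_in, W_K, b_K, W_V, b_V):
    # x_ff_in: (batch, d_model); W_K: (d_model, d_ff); W_V: (d_ff, d_model)
    keys = F.gelu(x_ff_in @ W_K + b_K)   # (batch, d_ff) "key" activations
    out = keys @ W_V + b_V               # weighted sum of "value" vectors (rows of W_V)
    return out, keys

# Equivalently, each slot i contributes keys[:, i:i+1] * W_V[i, :],
# i.e., an activation-scaled copy of its value (feature) vector.
```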

TopK FF-KV

To better align with SAE research, we also introduce sparsity into FF activations by applying a top-\(k\) activation function to the key vector. This keeps only the \(k\) neurons with the largest activations at each inference step and zeroes out the rest.

\(\bm{x}_\mathrm{FF_{out}} \approx \text{Top-}k(\bm{\phi}(\bm{x}_\mathrm{FF_{in}}\bm{W}_K + \bm{b}_K))\, \bm{W}_V + \bm{b}_V\)
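A minimal sketch of the top-\(k\) masking (the helper name and the choice of GELU are ours):

```python
import torch
import torch.nn.functional as F

def topk_ff_kv(x_ff_in, W_K, b_K, W_V, b_V, k):
    keys = F.gelu(x_ff_in @ W_K + b_K)                                   # (batch, d_ff)
    top = torch.topk(keys, k, dim=-1)                                    # k largest activations per token
    sparse_keys = torch.zeros_like(keys).scatter_(-1, top.indices, top.values)
    return sparse_keys @ W_V + b_V                                       # aggregate the values as before
```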

Normalized FF-KV

SAE feature vectors are typically normalized, whereas those in FF layers are not. If a particular feature vector \(\bm{W}_{V[i,:]}\) has a large norm, the magnitude of its corresponding activation may be underestimated. To address this concern, we normalize each row of \(\bm{W}_V\) and fold the removed norm back into the activations as a per-feature weight.

\(\bm{x}_\mathrm{FF_{out}} \approx \text{Top-}k(\bm{\phi}(\bm{x}_\mathrm{FF_{in}}\bm{W}_K + \bm{b}_K) \odot \bm{s})\, \tilde{\bm{W}}_V + \bm{b}_V\)
\(\bm{s} = [\|\bm{W}_{V[1,:]}\|_2, \|\bm{W}_{V[2,:]}\|_2, \cdots, \|\bm{W}_{V[d_\mathrm{FF},:]}\|_2] \in \mathbb{R}^{d_\mathrm{FF}}\)

Here, \(\tilde{\bm{W}}_V = \text{diag}(\bm{s})^{-1} \bm{W}_V\), where \(\text{diag}(\cdot)\) expands a vector \(\mathbb{R}^d\) to a diagonal matrix \(\mathbb{R}^{d\times d}\).
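A sketch of this normalized variant, which moves each row norm of \(\bm{W}_V\) into the activations before the top-\(k\) selection (function names are illustrative):

```python
import torch
import torch.nn.functional as F

def normalized_topk_ff_kv(x_ff_in, W_K, b_K, W_V, b_V, k):
    s = W_V.norm(dim=-1)                                   # (d_ff,) row norms of W_V
    W_V_tilde = W_V / s.unsqueeze(-1)                      # unit-norm feature (value) vectors
    keys = F.gelu(x_ff_in @ W_K + b_K) * s                 # fold the norms into the activations
    top = torch.topk(keys, k, dim=-1)                      # select after rescaling, as in the equation
    sparse_keys = torch.zeros_like(keys).scatter_(-1, top.indices, top.values)
    return sparse_keys @ W_V_tilde + b_V
```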

Automatic Evaluation

We evaluate FF-KVs, SAEs, and Transcoders using the metrics from SAEBench, and additionally report the Feature Alive Rate as a complementary statistic. Overall, the SAE-based results and the FF-KV results show similar tendencies: for each metric, the absolute differences between methods are small, typically within seed variance or cross-layer variation. Moreover, the relative difficulty of the tasks (metrics) is consistent across methods; for example, both SAEs and FF-KVs achieve higher RAVEL-Isolation scores than RAVEL-Causality scores. These results suggest that the activations in the original FF module already provide interpretability comparable to proxy-based methods such as SAEs and Transcoders.

Figure: Automatic evaluation results.
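For reference, one possible way to compute a feature alive rate, assuming it denotes the fraction of features that fire at least once over an evaluation corpus (our reading; the paper's exact definition may differ):

```python
import torch

def feature_alive_rate(activations, eps=0.0):
    # activations: (num_tokens, num_features) feature activations over an evaluation corpus
    alive = (activations > eps).any(dim=0)   # a feature is "alive" if it ever fires
    return alive.float().mean().item()
```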

Human Evaluation

We randomly sampled \(50\) features each from the FF-KV, TopK FF-KV, SAE, and Transcoder of the Gemma-2-2B model, yielding 200 features in total. Each feature \(f_p\) is presented to annotators with its ten most strongly associated texts \(\mathcal{S}_p\), ranked by activation magnitude over a subset of the OpenWebText corpus (200M tokens in total); a sketch of this retrieval step is given after the table below. The results show that the number of conceptual features is nearly the same across the four methods, so in this sense the quality of the obtained features is comparable. Transcoders find a larger number of interpretable features (superficial or conceptual), but the proportion of superficial features is higher than for the other methods.

Method        Superficial Features   Conceptual Features   Uninterpretable Features
FF-KV                6                       8                        36
TopK FF-KV           9                       9                        32
SAE                  6                       9                        35
Transcoder          16                      11                        23
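The retrieval step referenced above can be sketched as follows: for each feature, cache its activation on every text snippet and keep the ten with the largest values (a minimal sketch with hypothetical variable names, not the exact pipeline used in the paper):

```python
import torch

def top_activating_texts(activations, texts, n=10):
    # activations: (num_snippets, num_features) max activation of each feature on each snippet
    # texts: list of num_snippets text snippets
    top = torch.topk(activations, n, dim=0)  # per-feature top-n snippet indices
    return [[texts[j] for j in top.indices[:, p].tolist()]
            for p in range(activations.shape[1])]
```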

BibTeX

@inproceedings{ye2025ffkvsae,
  title={Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders},
  author={Mengyu Ye and Jun Suzuki and Tatsuro Inaba and Tatsuki Kuribayashi},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://neurips.cc/virtual/2025/poster/120144}
}