
How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?

Overview

The linear representation hypothesis (LRH) is one of the most useful ideas in mechanistic interpretability: intermediate layers of language models store meaningful concepts — "features" — as linear directions in activation space. This idea underlies the effectiveness of linear probes, activation steering, and sparse autoencoders. The LRH also raises a basic question: if a layer has d neurons, how many features m can it actually store?

This paper gives a mathematical framework that separates the LRH into two distinct claims — linear representation (features are linearly encoded into activations, i.e. how models write to the residual stream) and linear accessibility (features can be linearly decoded back out, i.e. how models read from the residual stream) — and shows that these are meaningfully different. The headline results: under classical compressed sensing (linear encoding, nonlinear decoding), d = O(k log(m/k)) dimensions suffice, where k is the number of features active per input. But if you also require linear decoding, the bound becomes d = O(k² log m / ε²), where ε bounds the reconstruction error. That extra factor of k is the price of linear readout. The good news: in both cases, you can store exponentially many features relative to neurons, confirming the "superposition hypothesis."

The Setup

Let's set up the notation. An input text ℓ gets mapped to an activation vector f(ℓ) ∈ ℝᵈ by some intermediate layer with d neurons. A feature zᵢ is some intrinsic property of the text: is it a question? is the sentiment positive? is it about dogs? (Though in principle these features do not have to be human-interpretable.) There are m total features, collected into a vector z(ℓ) = [z₁(ℓ), ..., zₘ(ℓ)] ∈ ℝᵐ.

Linear representation says the activations are a linear combination of feature values:

f(ℓ) = A z(ℓ)

where A ∈ ℝᵈˣᵐ is an encoding matrix. Its columns a₁, ..., aₘ are the representation vectors — the directions in activation space corresponding to each feature.

Linear accessibility says each feature can be read out by a dot product with some probe vector bᵢ:

ẑᵢ = ⟨bᵢ, f(ℓ)⟩

Collecting the probes as the columns of a matrix B ∈ ℝᵈˣᵐ, the full linear decode is ẑ = BᵀAz, and we want ‖BᵀAz − z‖∞ < ε for all valid inputs. The ∞-norm means every single feature must be recovered to within ε.

The key assumption is k-sparsity: any given input has at most k active features and k ≪ m. This makes intuitive sense — any piece of text is "about" only a small fraction of all possible concepts.
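
To make these objects concrete, here is a minimal numpy sketch of the setup. The sizes d, m, k, the Gaussian encoding matrix, and the choice of probes B = A are illustrative assumptions of mine, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 1024, 10000, 5     # neurons, total features, active features (illustrative)

# A k-sparse feature vector: only k of the m features are non-zero, values in [-1, 1].
z = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
z[support] = rng.uniform(-1, 1, size=k)

# Linear representation: activations are a linear combination of feature values.
A = rng.normal(size=(d, m)) / np.sqrt(d)   # columns are the representation vectors
f = A @ z                                  # the layer's activation vector

# Linear accessibility: each feature is read out with a dot product against a probe.
B = A                                      # simplest choice: probe_i = representation_i
z_hat = B.T @ f                            # z_hat[i] = <b_i, f(l)>

print("worst readout error:", np.abs(z_hat - z).max())
```

With m far larger than d the readout can only be approximate; the rest of the post is about how small that error can be made.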

The Main Results

Classical compressed sensing (nonlinear decoding)

If you only need linear encoding but allow arbitrary (nonlinear) decoding algorithms, classical results show that d = O(k log(m/k)) suffices. The decoder here is ℓ₁-minimisation (basis pursuit), which finds the vector with the smallest ℓ₁ norm that is consistent with the observed activations x (i.e. argmin ‖z'‖₁ subject to Az' = x). The remarkable fact is that for k-sparse inputs the minimiser is exactly z' = z. This is powerful, but it's an optimisation problem, not something a single matrix multiply can do.
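
As a quick illustration of what the basis-pursuit decoder does, here is a sketch using scipy's linear-programming solver. The reformulation z' = u − v with u, v ≥ 0 is the standard way to turn ℓ₁-minimisation into a linear program; the random matrix and toy sizes are my own choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d, m, k = 40, 200, 4                      # illustrative sizes

A = rng.normal(size=(d, m)) / np.sqrt(d)  # random encoding matrix
z = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
z[support] = rng.uniform(-1, 1, size=k)
x = A @ z                                 # observed activations

# Basis pursuit: min ||z'||_1  s.t.  A z' = x.
# LP reformulation: z' = u - v with u, v >= 0, minimise sum(u) + sum(v).
c = np.ones(2 * m)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * m), method="highs")
z_rec = res.x[:m] - res.x[m:]

print("max recovery error:", np.abs(z_rec - z).max())  # ~0 when d is large enough
```

The point of the contrast: this decoder solves an optimisation problem per input, whereas linear accessibility demands a single matrix multiply.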

Linear compressed sensing (this paper)

Requiring linear decoding changes things. The paper proves nearly matching bounds:

Upper bound: d = O(k² log m / ε²) dimensions suffice.
Lower bound: d = Ωε(k²/log k · log(m/k)) dimensions are necessary.

Here O gives the order of the upper bound and Ω the order of the lower bound.

The gap between classical and linear compressed sensing is a factor of roughly k. Linear decoding is strictly harder — but still allows exponentially many features.

The Upper Bound: How the Construction Works

The upper bound proof is constructive and pretty understandable for non-mathematicians, so it's worth going through. The idea is to set B = A (the same matrix encodes and decodes, so each feature's representation vector doubles as its probe) and find A such that AᵀAz ≈ z for all k-sparse inputs (remember, we want to show that ‖BᵀAz − z‖∞ < ε).

Step 1: Expanding the reconstruction

Let a₁, ..., aₘ be the columns of A. The reconstructed value of feature i is:

ẑᵢ = ⟨aᵢ, Az⟩ = Σⱼ zⱼ ⟨aᵢ, aⱼ⟩   (summing over j = 1, ..., m)

Separating the j = i term:

ẑᵢ = zᵢ ⟨aᵢ, aᵢ⟩ + Σⱼ≠ᵢ zⱼ ⟨aᵢ, aⱼ⟩

The first term is what we want (the true feature value, scaled by the self-inner-product). The second term is interference — other features contributing to the measurement of feature i. The cross-term ⟨aᵢ, aⱼ⟩ tells us how much the presence of feature j corrupts the readout of feature i: if feature j is present, reading out feature i becomes that much harder, in proportion to ⟨aᵢ, aⱼ⟩.

Step 2: Using sparsity

Since z is k-sparse, at most k terms in the interference sum can be nonzero; all others contribute 0. Let T be the set of indices where zⱼ ≠ 0 (this is known as the support of z), so |T| ≤ k. The expression simplifies to:

ẑᵢ = zᵢ ⟨aᵢ, aᵢ⟩ + Σⱼ∈T\{i} zⱼ ⟨aᵢ, aⱼ⟩

where j ∈ T\{i} means j ranges over the set T without the element i. So there are at most k interference terms, and each of them needs to be kept small.
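
Here is a small numerical check of Steps 1 and 2 (my own toy example, not from the paper): pick a feature i on the support and split its readout into the signal term and the interference sum over T \ {i}.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k = 64, 500, 5
A = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)   # every column has unit norm
z = np.zeros(m)
T = rng.choice(m, size=k, replace=False)                # support of z
z[T] = rng.uniform(-1, 1, size=k)

i = T[0]
a_i = A[:, i]
z_hat_i = a_i @ (A @ z)                                  # <a_i, Az>

signal = z[i] * (a_i @ a_i)                              # z_i <a_i, a_i> = z_i here
interference = sum(z[j] * (a_i @ A[:, j]) for j in T if j != i)

print(z_hat_i, signal + interference)                    # agree up to floating point
print("interference magnitude:", abs(interference))
```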

Step 3: μ-incoherence

To control the error, we want two things from A: self-inner-products equal to 1 (so the first term is exactly zᵢ), and small cross-inner-products (so interference is bounded). This is formalised as μ-incoherence: a matrix is μ-incoherent if ⟨aᵢ, aᵢ⟩ = 1 for all i, and |⟨aᵢ, aⱼ⟩| < μ for all i ≠ j. This essentially requires the columns of A to be nearly orthonormal.

If A is μ-incoherent, the error is bounded by:

|ẑᵢ − zᵢ| ≤ Σⱼ∈T\{i} |zⱼ| · |⟨aᵢ, aⱼ⟩| < k · 1 · μ = kμ

where we use the triangle inequality, |zⱼ| ≤ 1 (the feature range is [−1, 1]), |⟨aᵢ, aⱼ⟩| < μ (incoherence), and the fact that T contributes at most k terms. To guarantee error below ε, it therefore suffices to take μ = ε/k.

Step 4: Random matrices are incoherent

The authors use Lemma 5 to show the incoherence of random matrices. I followed it as far as I understood the steps, but I won't recount it all here. Essentially we need to answer this question: do μ-incoherent matrices with columns of dimension d exist, and how big does d need to be? Let C ∈ ℝᵈˣᵐ be a random matrix with entries ±1/√d. Then the self-inner-products equal exactly 1 (d terms of 1/d each). For the cross-inner-products, Hoeffding's inequality and properties of the Rademacher distribution show that |⟨cᵢ, cⱼ⟩| < μ for all i ≠ j with high probability, as long as:

d = O(log m / μ²)

Plugging in μ = ε/k:

d = O(k² log m / ε²)

That's the upper bound for residual stream dimension. A random matrix can satisfy μ-incoherence.
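
A small sanity check in the spirit of Lemma 5 (the sizes here are my own and far smaller than the theory's constants would suggest): draw a random ±1/√d matrix, measure its incoherence μ empirically, and confirm that the worst readout error over random k-sparse inputs stays below the kμ bound from Step 3.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, k = 1024, 4000, 5            # more features than neurons, illustrative sizes

# Random +-1/sqrt(d) matrix: every column has exactly unit norm.
C = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(d)

# Measured incoherence: mu = max over i != j of |<c_i, c_j>|.
G = C.T @ C                        # Gram matrix, diagonal is exactly 1
mu = np.abs(G - np.eye(m)).max()
print(f"incoherence mu = {mu:.3f}, error bound k*mu = {k * mu:.3f}")

# Worst linear-readout error over random k-sparse inputs with values in [-1, 1].
worst = 0.0
for _ in range(50):
    z = np.zeros(m)
    T = rng.choice(m, size=k, replace=False)
    z[T] = rng.uniform(-1, 1, size=k)
    worst = max(worst, np.abs(C.T @ (C @ z) - z).max())
print(f"worst observed error = {worst:.3f}")   # sits below the k*mu bound
```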

The Lower Bound (Result Only)

The lower bound proof is more involved, and I got lost very quickly. I used Opus 4.6 to give a broad overview:

The high-level strategy is: assume d is too small, then show a contradiction must exist — some feature i will have too many other features interfering with its readout, all in the same direction, so that the errors compound and exceed ε.

The proof uses a result from Alon (2003) showing that matrices which are near-identity (large diagonal, small off-diagonal) must have high rank. This is applied not to the full matrix BA, but to all large principal submatrices of it, establishing that any sufficiently large subset of features must contain an interfering pair. A graph reduction argument combined with Turán's theorem (which bounds how many edges a clique-free graph can have) then shows that some feature must have at least 2k interfering neighbours. The pigeonhole principle ensures k of these interfere in the same direction, giving the contradiction.

Anyway, the final result comes out as d = Ωε(k²/log k · log(m/k)) as stated above.

My Thoughts: What Does This Mean for Scaling?

The theoretical picture is that with d neurons and k-sparse features, you can store m = exp(Θ(d/k²)) features under fully linear encoding and decoding. That's exponentially many, which is great. But the k² dependence raises some interesting questions about how this plays out as models scale.

The k² tax and what it means for current models

For current language models doing current tasks, the k² constraint might still be workable. A typical large model might have d ≈ 4096–12288 neurons per layer. Empirically, sparse autoencoders tend to find that only tens to low hundreds of features fire on any given input. If the effective sparsity k is in the tens to hundreds, k² is in the hundreds to tens of thousands, and d/k² is still large enough to store a comfortable number of features. Current models seem to happily live in this regime.
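
To get a feel for the numbers, here is the capacity exponent d/k² for a few illustrative (d, k) pairs. The constants hidden in m = exp(Θ(d/k²)) are unknown, so only the relative sizes mean anything:

```python
# Illustrative only: capacity scales like m = exp(Theta(d / k^2)), so the
# quantity d / k^2 controls how many features fit (unknown constants dropped).
for d in (4096, 12288):
    for k in (32, 100, 300):
        print(f"d = {d:6d}, k = {k:4d}  ->  d / k^2 = {d / k**2:8.2f}")
```

The exponent collapses quickly as k grows, which is the worry explored in the next section.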

Could more capable models hit the wall?

But here's a question I find interesting: as models become more capable — doing harder reasoning, maintaining more context, tracking more abstract relationships simultaneously — the effective k might grow. Not because the input text got more complex, but because the model's internal representation of what matters about that text got richer. A simple model reading "the cat sat on the mat" might activate features for animal, domestic scene, simple syntax. A more capable model might track many more features, potentially unrelated ones or things we can't even interpret. If k doubles, you need 4× the neurons to maintain the same feature capacity at the same readout accuracy. The model might have to undergo a paradigm shift in how information is written to and read from the residual stream in order to allow an increase in the number of active features.

Can models escape into nonlinear decoding?

Classical compressed sensing gives you d = O(k log(m/k)) — no k² penalty — but requires ℓ₁-minimisation to decode, which is an iterative optimisation algorithm. A single transformer layer does a linear transformation plus a nonlinearity (and the paper shows that single-layer nonlinearity doesn't help asymptotically, for monotonic activations anyway). To decode nonlinearly, the transformer would have to allocate more than a single layer. That costs extra computational steps, which would have to buy more performance than simply being able to read and write in a single matrix multiplication.

Model scaling

As models scale toward more complex reasoning, we might see linear representations break down specifically in the layers responsible for the densest, most complex computation — because that's where k grows large enough for the k² penalty to harm performance meaningfully. Features that are "easy" (sparse, linguistic, categorical) would stay linear, but features supporting hard reasoning might not, in an effort to escape the k² cost and converge on the linear-in-k relationship with d. That being said, models show evidence of cross-layer features (Sparse Crosscoders for Cross-Layer Features), where features spread over multiple layers. Maybe this is one way in which these systems learn to get around the k² penalty of linear readout.

Other Findings

Feature geometry is more flexible than you might think

It's often assumed that different features are represented by orthogonal directions. The paper shows (Proposition 9) that this isn't necessary: you can construct valid encodings where all representation vectors point in nearly the same direction, all probe vectors point in nearly the same direction, and yet each feature's own representation and probe vectors are nearly orthogonal to each other. The only real requirement is that feature i's probe vector is approximately orthogonal to feature j's representation vector (for i ≠ j). I imagine this works because representation vector i and probe vector i are not required to be nearly orthogonal (unlike the cross pairs), so the signal is still recoverable. However, if you additionally constrain representation and probe vectors to have unit norm (Proposition 11), then the intuitive picture is restored: each feature's representation and probe align, and different features become approximately orthogonal.
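
Here is a toy three-dimensional construction in the spirit of Proposition 9 (my own, not taken from the paper): both representation vectors hug one direction, both probes hug another, each feature's own representation and probe are nearly orthogonal in angle, and yet ⟨bᵢ, aⱼ⟩ is 1 when i = j and 0 otherwise.

```python
import numpy as np

eps, M = 0.01, 5000.0
u, v, w = np.eye(3)                      # orthonormal basis of a 3-d toy space

# Representation vectors: both nearly parallel to u.
a = [u + eps * v, u - eps * v]
# Probe vectors: both nearly parallel to w, plus a small "dual" correction.
b = [M * w + 0.5 * u + (0.5 / eps) * v,
     M * w + 0.5 * u - (0.5 / eps) * v]

cos = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Readout is exact: <b_i, a_j> = 1 if i == j else 0 (up to float error).
print([[round(float(b[i] @ a[j]), 6) for j in range(2)] for i in range(2)])
print("cos(a1, a2) =", round(cos(a[0], a[1]), 4))   # reps nearly parallel
print("cos(b1, b2) =", round(cos(b[0], b[1]), 4))   # probes nearly parallel
print("cos(a1, b1) =", round(cos(a[0], b[0]), 4))   # own rep and probe nearly orthogonal
```

Notice the probes need large norms to pull this off, which is exactly the freedom the unit-norm condition of Proposition 11 takes away.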

Activation functions don't help

Theorem 12 shows that allowing probes of the form g(x) = σ(Wx + b) with a monotonic activation function (like ReLU) doesn't change the lower bound for binary classification. The argument is that if σ(⟨w, x⟩ + b) separates zᵢ = 1 from zᵢ = 0, then ⟨w, x⟩ alone must separate them (since σ is monotonic), so a plain linear threshold suffices. Note that this result does not say anything about the recovery of real-valued feature vectors.

Read the Paper

Discussion

Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.