The gap narrowed fast
A year ago, “looking inside” a large language model mostly meant inspecting attention maps or probing hidden states. Useful sometimes, but not the same as understanding the model’s internal computation.
That changed in 2024 and 2025. Three developments matter:
| Lab | What they did | Scale | Source |
|---|---|---|---|
| Anthropic | Extracted interpretable features from Claude 3 Sonnet using sparse autoencoders | 34M features, 65%+ variance explained, fewer than 300 active features per token | Scaling Monosemanticity |
| Anthropic | Published “Mapping the Mind of a Large Language Model” with causal steering experiments | Millions of concepts identified, including the Golden Gate Bridge feature | Anthropic Research |
| DeepMind | Released Gemma Scope: 400+ sparse autoencoders for Gemma 2, 30M+ learned features | ~15% of Gemma 2 9B training compute used for interpretability, 20 PiB of activations stored | Gemma Scope Blog |
Mechanistic interpretability is no longer just an academic curiosity. It is becoming a practical tool.
How it works
The core question: what internal variables, circuits, and computations cause the model to do what it does?
Two hypotheses drive the field:
| Hypothesis | What it means |
|---|---|
| Linear representation | Concepts are directions in activation space |
| Superposition | Models store more features than dimensions by overlapping them in nearly orthogonal directions |
Source: Anthropic, Scaling Monosemanticity
That is why a single neuron does not equal one concept. A neuron participates in many concepts. A concept is spread across many neurons.
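The superposition hypothesis can be illustrated numerically: random unit vectors in a high-dimensional space are nearly orthogonal with high probability, so a model can pack far more feature directions than it has dimensions. A minimal sketch (the dimensions and counts here are illustrative, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 4096  # 4096 "feature" directions in a 512-dim activation space

# Random unit vectors standing in for feature directions
features = rng.standard_normal((k, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Largest |cosine similarity| over all distinct pairs:
# despite k being 8x larger than d, no two directions overlap much.
cos = features @ features.T
np.fill_diagonal(cos, 0.0)
max_overlap = np.abs(cos).max()

print(f"{k} directions in {d} dims, max pairwise |cos| = {max_overlap:.3f}")
```

The small maximum overlap is why features can interfere only slightly while sharing the same neurons, and why single-neuron inspection fails.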
Sparse autoencoders (SAEs) solve this by decomposing activation vectors into a larger dictionary of learned features with a sparsity constraint, so only a few activate at once. DeepMind’s Gemma Scope used a JumpReLU SAE architecture trained at every layer and sublayer of Gemma 2.
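The SAE forward pass described above can be sketched in a few lines. This is a toy with random weights, not DeepMind's implementation; real SAEs train the weights on stored model activations, use far larger dimensions, and learn the JumpReLU threshold per feature:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 256, 2048  # toy sizes; real SAEs are far wider

# Randomly initialized toy weights (trained on activations in practice)
W_enc = rng.standard_normal((d_sae, d_model)) * 0.02
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_model, d_sae)) * 0.02
b_dec = np.zeros(d_model)
theta = 0.8  # JumpReLU threshold (learned per-feature in practice)

def jumprelu(z, theta):
    # JumpReLU: pass the pre-activation through unchanged, but zero it
    # unless it clears the threshold. This enforces sparsity.
    return z * (z > theta)

def sae_forward(x):
    f = jumprelu(W_enc @ x + b_enc, theta)  # sparse feature activations
    x_hat = W_dec @ f + b_dec               # reconstruction of the input
    return f, x_hat

x = rng.standard_normal(d_model)  # stand-in for a residual-stream activation
f, x_hat = sae_forward(x)
print(f"active features: {int((f > 0).sum())} / {d_sae}")
```

Only a handful of the 2048 dictionary features fire on a given input, which mirrors the "fewer than 300 active features per token" figure from Scaling Monosemanticity.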
What we actually found
Not just syntax. Real semantic structure.
Anthropic’s features from Claude 3 Sonnet include:
| Feature category | Examples |
|---|---|
| Entities | Cities, people, atomic elements, scientific fields |
| Code | Programming syntax, bugs, code backdoors |
| Safety | Deception, sycophancy, bias, dangerous content, scam emails |
| Multilingual | The Golden Gate Bridge feature fires on English, Japanese, Chinese, Greek, Vietnamese, and Russian text, and on images |
Source: Anthropic Research
The Golden Gate Bridge feature is especially clear. Anthropic found a local neighborhood including Alcatraz Island, Ghirardelli Square, the Golden State Warriors, Gavin Newsom, the 1906 earthquake, and Vertigo. That is not a keyword detector. It is a conceptual cluster.
Critically, these features are causal, not merely correlational. In steering experiments, Anthropic amplified the Golden Gate Bridge feature and Claude began answering as if it were the bridge itself.
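Mechanically, steering of this kind amounts to adding a scaled feature direction (an SAE decoder column) to the model's activation at some layer. A sketch with made-up vectors, not Anthropic's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

h = rng.standard_normal(d_model)            # original activation at some layer
feature_dir = rng.standard_normal(d_model)
feature_dir /= np.linalg.norm(feature_dir)  # unit "feature" direction (hypothetical)

alpha = 10.0                                # steering strength
h_steered = h + alpha * feature_dir         # intervene on the forward pass

# The projection onto the feature direction jumps by exactly alpha,
# so downstream layers see the feature as strongly active.
before = float(h @ feature_dir)
after = float(h_steered @ feature_dir)
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

Because the intervention happens inside the forward pass rather than in the prompt, a behavior change downstream is evidence the feature plays a causal role.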
Why this matters for companies
| Use case | What interpretability enables |
|---|---|
| Auditing | Ask whether a jailbreak feature, toxic content feature, or bias feature exists inside the model. Expose latent capabilities missed by input/output evaluation |
| Debugging | Inspect which features activate on specific failure cases. Trace circuits that cause hallucination or sycophancy on certain inputs |
| Compliance | The EU AI Act creates transparency obligations for certain AI systems. Interpretability provides documentation of known failure modes, internal safety analysis, and model governance artifacts |
Source for compliance framing: EU Commission FAQ on transparent AI systems
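The auditing workflow reduces to a simple pattern: encode activations with a trained SAE and flag inputs where a known safety-relevant feature fires above a threshold. Everything below is a placeholder sketch; the encoder weights, the feature index, and the threshold would come from a real trained SAE and its feature catalog:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 128, 1024

# Placeholder encoder weights; a real audit loads a trained SAE
W_enc = rng.standard_normal((d_sae, d_model)) * 0.05
AUDITED_FEATURE = 42   # hypothetical index of e.g. a "deception" feature
THRESHOLD = 1.0        # hypothetical activation threshold for flagging

def feature_activation(x, idx):
    # ReLU activation of a single dictionary feature on activation x
    return max(0.0, float(W_enc[idx] @ x))

# Stand-ins for per-prompt model activations collected during evaluation
prompts = {f"prompt_{i}": rng.standard_normal(d_model) for i in range(5)}

flagged = [name for name, x in prompts.items()
           if feature_activation(x, AUDITED_FEATURE) > THRESHOLD]
print("flagged prompts:", flagged)
```

The point of the pattern is that it inspects internal state rather than outputs, so it can surface latent behavior that input/output evaluation misses.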
Current tools
| Tool | Use case | Source |
|---|---|---|
| TransformerLens | Transformer internals, activation patching, circuits | Research workflows |
| SAE-Vis | Visualizing sparse autoencoder features | Feature inspection |
| Neuronpedia | Feature browser, public SAE demos | Used by DeepMind for Gemma Scope |
What still does not work
| Limitation | Detail |
|---|---|
| Cannot map everything | Anthropic notes the discovered features are likely only a small fraction of what the model represents. Exhaustive mapping would be cost-prohibitive with current techniques |
| Circuits are still hard | Finding a feature does not explain how it gets used. Representation does not equal mechanism |
| Cross-layer superposition | Features distributed across layers complicate clean decompositions |
| Interpretations can be unstable | A feature can look cleanly semantic in one context and fuzzy in another. 2025 NeurIPS work on falsifying SAE explanations makes this limitation explicit |
| Scaling is expensive | Gemma Scope: 20 PiB of activations, hundreds of billions of SAE parameters |
The right mental model
Mechanistic interpretability cannot yet fully explain general reasoning, prove a model is safe, or replace external evals and red-teaming.
What it can do: identify internal features with meaningful semantic content, show causal influence on behavior, locate circuits involved in specific outputs, and support auditing, debugging, and safety evaluation.
We can now inspect parts of an LLM’s internals with enough fidelity to change how we build, test, and govern AI systems. That is already a meaningful shift.