The gap narrowed fast

A year ago, “looking inside” a large language model mostly meant inspecting attention maps or probing hidden states. Useful sometimes, but not the same as understanding the model’s internal computation.

That changed in 2024 and 2025. Three developments matter:

| Lab | What they did | Scale | Source |
| --- | --- | --- | --- |
| Anthropic | Extracted interpretable features from Claude 3 Sonnet using sparse autoencoders | 34M features, 65%+ variance explained, fewer than 300 active features per token | Scaling Monosemanticity |
| Anthropic | Published “Mapping the Mind of a Large Language Model” with causal steering experiments | Millions of concepts identified, including the Golden Gate Bridge feature | Anthropic Research |
| DeepMind | Released Gemma Scope: 400+ sparse autoencoders for Gemma 2, 30M+ learned features | ~15% of Gemma 2 9B training compute used for interpretability, 20 PiB of activations stored | Gemma Scope Blog |

Mechanistic interpretability is no longer just an academic curiosity. It is becoming a practical tool.

How it works

The core question: what internal variables, circuits, and computations cause the model to do what it does?

Two hypotheses drive the field:

| Hypothesis | What it means |
| --- | --- |
| Linear representation | Concepts are directions in activation space |
| Superposition | Models store more features than dimensions by overlapping them in nearly orthogonal directions |

Source: Anthropic, Scaling Monosemanticity

That is why a single neuron does not equal one concept. A neuron participates in many concepts. A concept is spread across many neurons.
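
The superposition idea is easy to demonstrate numerically. The toy sketch below (plain NumPy, with invented sizes, not any real model's dimensions) packs far more near-orthogonal feature directions into a space than it has dimensions, then shows that a sparse sum of them can still be read back out:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2048  # pack 2048 hypothetical "features" into a 512-dim space

# Random unit vectors in high dimensions are nearly orthogonal.
directions = rng.normal(size=(n, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Average |cosine| between distinct feature directions is small (~1/sqrt(d)).
gram = directions @ directions.T
off_diag = np.abs(gram[~np.eye(n, dtype=bool)])
print(f"mean |cos| between distinct features: {off_diag.mean():.3f}")

# Superpose a sparse set of active features into one activation vector.
active = rng.choice(n, size=5, replace=False)
activation = directions[active].sum(axis=0)

# Dot-product readout recovers the active set despite n >> d, because
# interference from the inactive features is small.
scores = directions @ activation
recovered = np.argsort(scores)[-5:]
print("recovered active set:", set(recovered.tolist()) == set(active.tolist()))
```

This is why dense-but-sparse storage works: with few features active at once, the cross-talk between nearly orthogonal directions stays below the signal.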

Sparse autoencoders (SAEs) solve this by decomposing activation vectors into a larger dictionary of learned features with a sparsity constraint, so only a few activate at once. DeepMind’s Gemma Scope used a JumpReLU SAE architecture trained at every layer and sublayer of Gemma 2.
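
A minimal sketch of that shape, with random untrained weights and made-up sizes purely to show the encode/decode structure and how the JumpReLU threshold produces sparsity (a trained SAE would learn `W_enc`, `W_dec`, and per-feature thresholds from stored activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 64, 1024  # dictionary much wider than the activation space

# Hypothetical, untrained parameters; real SAEs learn these.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)
theta = 1.5  # JumpReLU threshold (a single scalar here for simplicity)

def jump_relu(z, theta):
    # JumpReLU passes a pre-activation through unchanged, but only where
    # it clears the threshold; everything below theta is zeroed.
    return z * (z > theta)

def encode(x):
    return jump_relu(x @ W_enc, theta)

def decode(f):
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)  # one residual-stream activation vector
f = encode(x)                 # sparse vector of feature activations
x_hat = decode(f)             # reconstruction from the few active features

print(f"active features: {(f != 0).sum()} of {d_dict}")
```

Training pushes `x_hat` toward `x` under a sparsity penalty, so each dictionary entry is nudged toward a reusable, interpretable direction.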

What we actually found

Not just syntax. Real semantic structure.

Anthropic’s features from Claude 3 Sonnet include:

| Feature category | Examples |
| --- | --- |
| Entities | Cities, people, atomic elements, scientific fields |
| Code | Programming syntax, bugs, code backdoors |
| Safety | Deception, sycophancy, bias, dangerous content, scam emails |
| Multilingual | Golden Gate Bridge fires on English, Japanese, Chinese, Greek, Vietnamese, Russian, and images |

Source: Anthropic Research

The Golden Gate Bridge feature is especially clear. Anthropic found a local neighborhood including Alcatraz Island, Ghirardelli Square, the Golden State Warriors, Gavin Newsom, the 1906 earthquake, and Hitchcock's San Francisco-set film Vertigo. That is not a keyword detector. It is a conceptual cluster.

Critically, these features are causal. Anthropic ran steering experiments that made Claude answer as if it were the Golden Gate Bridge itself.
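
In SAE terms, a steering experiment amounts to adding (or clamping) a feature's decoder direction in the residual stream during a forward pass. A schematic sketch, using a random stand-in for the real learned direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Stand-in for one learned feature's decoder direction (e.g. the Golden
# Gate Bridge feature); in practice this row comes from a trained SAE.
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)

def steer(activation, direction, strength):
    # Push the activation along the feature direction: positive strength
    # amplifies the concept, negative strength suppresses it.
    return activation + strength * direction

resid = rng.normal(size=d_model)  # an activation from a forward pass
steered = steer(resid, feature_dir, strength=10.0)

# The projection onto the feature direction shifts by exactly `strength`,
# since the direction is unit-norm.
before = resid @ feature_dir
after = steered @ feature_dir
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

Applying this shift at every token position during generation is what makes the model's outputs tilt toward the steered concept.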

Why this matters for companies

| Use case | What interpretability enables |
| --- | --- |
| Auditing | Ask whether a jailbreak feature, toxic content feature, or bias feature exists inside the model. Expose latent capabilities missed by input/output evaluation |
| Debugging | Inspect which features activate on specific failure cases. Trace circuits that cause hallucination or sycophancy on certain inputs |
| Compliance | The EU AI Act creates transparency obligations for certain AI systems. Interpretability provides documentation of known failure modes, internal safety analysis, and model governance artifacts |

Source for compliance framing: EU Commission FAQ on transparent AI systems

Current tools

| Tool | Use case | Source |
| --- | --- | --- |
| TransformerLens | Transformer internals, activation patching, circuits | Research workflows |
| SAE-Vis | Visualizing sparse autoencoder features | Feature inspection |
| Neuronpedia | Feature browser, public SAE demos | Used by DeepMind for Gemma Scope |

What still does not work

| Limitation | Detail |
| --- | --- |
| Cannot map everything | Anthropic says discovered features are only a small subset; full mapping would be cost-prohibitive |
| Circuits are still hard | Finding a feature does not explain how it gets used. Representation does not equal mechanism |
| Cross-layer superposition | Features distributed across layers complicate clean decompositions |
| Interpretations can be unstable | A feature can look semantic in one context and fuzzy in another. 2025 NeurIPS work on falsifying SAE explanations shows the community knows this |
| Scaling is expensive | Gemma Scope: 20 PiB of activations, hundreds of billions of SAE parameters |

The right mental model

Mechanistic interpretability cannot yet fully explain general reasoning, prove a model is safe, or replace external evals and red-teaming.

What it can do: identify internal features with meaningful semantic content, show causal influence on behavior, locate circuits involved in specific outputs, and support auditing, debugging, and safety evaluation.

We can now inspect parts of an LLM’s internals with enough fidelity to change how we build, test, and govern AI systems. That is already a meaningful shift.