The gap narrowed fast
A year ago, “looking inside” a large language model mostly meant inspecting attention maps or probing hidden states. Useful sometimes, but not the same as understanding the model’s internal computation.
That changed in 2024 and 2025. Three developments matter:
| Lab | What they did | Scale | Source |
|---|---|---|---|
| Anthropic | Extracted interpretable features from Claude 3 Sonnet using sparse autoencoders | 34M features, 65%+ variance explained, fewer than 300 active features per token | Scaling Monosemanticity |
| Anthropic | Published “Mapping the Mind of a Large Language Model” with causal steering experiments | Millions of concepts identified, including the Golden Gate Bridge feature | Anthropic Research |
| DeepMind | Released Gemma Scope: 400+ sparse autoencoders for Gemma 2, 30M+ learned features | ~15% of Gemma 2 9B training compute used for interpretability, 20 PiB of activations stored | Gemma Scope Blog |
Mechanistic interpretability is no longer just an academic curiosity. It is becoming a practical tool.
How it works
The core question: what internal variables, circuits, and computations cause the model to do what it does?
Two hypotheses drive the field:
| Hypothesis | What it means |
|---|---|
| Linear representation | Concepts are directions in activation space |
| Superposition | Models store more features than dimensions by overlapping them in nearly orthogonal directions |
Source: Anthropic, Scaling Monosemanticity
That is why a single neuron does not equal one concept. A neuron participates in many concepts. A concept is spread across many neurons.
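The superposition hypothesis can be illustrated numerically: random unit vectors in a high-dimensional space are nearly orthogonal with high probability, so a model can pack far more feature directions than it has dimensions. A minimal sketch (the dimensions and counts here are illustrative, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 4096  # 4096 "feature" directions in a 512-dim activation space

# Random unit vectors standing in for feature directions
features = rng.standard_normal((k, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Largest |cosine similarity| over all distinct pairs:
# despite k being 8x larger than d, no two directions overlap much.
cos = features @ features.T
np.fill_diagonal(cos, 0.0)
max_overlap = np.abs(cos).max()

print(f"{k} directions in {d} dims, max pairwise |cos| = {max_overlap:.3f}")
```

The small maximum overlap is why features can interfere only slightly while sharing the same neurons, and why single-neuron inspection fails.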
Sparse autoencoders (SAEs) solve this by decomposing activation vectors into a larger dictionary of learned features with a sparsity constraint, so only a few activate at once. DeepMind’s Gemma Scope used a JumpReLU SAE architecture trained at every layer and sublayer of Gemma 2.
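The SAE forward pass described above can be sketched in a few lines. This is a toy with random weights, not DeepMind's implementation; real SAEs train the weights on stored model activations, use far larger dimensions, and learn the JumpReLU threshold per feature:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 256, 2048  # toy sizes; real SAEs are far wider

# Randomly initialized toy weights (trained on activations in practice)
W_enc = rng.standard_normal((d_sae, d_model)) * 0.02
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_model, d_sae)) * 0.02
b_dec = np.zeros(d_model)
theta = 0.8  # JumpReLU threshold (learned per-feature in practice)

def jumprelu(z, theta):
    # JumpReLU: pass the pre-activation through unchanged, but zero it
    # unless it clears the threshold. This enforces sparsity.
    return z * (z > theta)

def sae_forward(x):
    f = jumprelu(W_enc @ x + b_enc, theta)  # sparse feature activations
    x_hat = W_dec @ f + b_dec               # reconstruction of the input
    return f, x_hat

x = rng.standard_normal(d_model)  # stand-in for a residual-stream activation
f, x_hat = sae_forward(x)
print(f"active features: {int((f > 0).sum())} / {d_sae}")
```

Only a handful of the 2048 dictionary features fire on a given input, which mirrors the "fewer than 300 active features per token" figure from Scaling Monosemanticity.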
What we actually found
Not just syntax. Real semantic structure.
Anthropic’s features from Claude 3 Sonnet include:
| Feature category | Examples |
|---|---|
| Entities | Cities, people, atomic elements, scientific fields |
| Code | Programming syntax, bugs, code backdoors |
| Safety | Deception, sycophancy, bias, dangerous content, scam emails |
| Multilingual | The Golden Gate Bridge feature fires on English, Japanese, Chinese, Greek, Vietnamese, and Russian text, and on images |
Source: Anthropic Research
The Golden Gate Bridge feature is especially clear. Anthropic found a local neighborhood including Alcatraz Island, Ghirardelli Square, the Golden State Warriors, Gavin Newsom, the 1906 earthquake, and Vertigo. That is not a keyword detector. It is a conceptual cluster.
Critically, these features are causal, not merely correlational. In steering experiments, Anthropic amplified the Golden Gate Bridge feature and Claude began answering as if it were the bridge itself.
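Mechanically, steering of this kind amounts to adding a scaled feature direction (an SAE decoder column) to the model's activation at some layer. A sketch with made-up vectors, not Anthropic's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

h = rng.standard_normal(d_model)            # original activation at some layer
feature_dir = rng.standard_normal(d_model)
feature_dir /= np.linalg.norm(feature_dir)  # unit "feature" direction (hypothetical)

alpha = 10.0                                # steering strength
h_steered = h + alpha * feature_dir         # intervene on the forward pass

# The projection onto the feature direction jumps by exactly alpha,
# so downstream layers see the feature as strongly active.
before = float(h @ feature_dir)
after = float(h_steered @ feature_dir)
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

Because the intervention happens inside the forward pass rather than in the prompt, a behavior change downstream is evidence the feature plays a causal role.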
Why this matters for companies
| Use case | What interpretability enables |
|---|---|
| Auditing | Ask whether a jailbreak feature, toxic content feature, or bias feature exists inside the model. Expose latent capabilities missed by input/output evaluation |
| Debugging | Inspect which features activate on specific failure cases. Trace circuits that cause hallucination or sycophancy on certain inputs |
| Compliance | The EU AI Act creates transparency obligations for certain AI systems. Interpretability provides documentation of known failure modes, internal safety analysis, and model governance artifacts |
Source for compliance framing: EU Commission FAQ on transparent AI systems
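The auditing workflow reduces to a simple pattern: encode activations with a trained SAE and flag inputs where a known safety-relevant feature fires above a threshold. Everything below is a placeholder sketch; the encoder weights, the feature index, and the threshold would come from a real trained SAE and its feature catalog:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 128, 1024

# Placeholder encoder weights; a real audit loads a trained SAE
W_enc = rng.standard_normal((d_sae, d_model)) * 0.05
AUDITED_FEATURE = 42   # hypothetical index of e.g. a "deception" feature
THRESHOLD = 1.0        # hypothetical activation threshold for flagging

def feature_activation(x, idx):
    # ReLU activation of a single dictionary feature on activation x
    return max(0.0, float(W_enc[idx] @ x))

# Stand-ins for per-prompt model activations collected during evaluation
prompts = {f"prompt_{i}": rng.standard_normal(d_model) for i in range(5)}

flagged = [name for name, x in prompts.items()
           if feature_activation(x, AUDITED_FEATURE) > THRESHOLD]
print("flagged prompts:", flagged)
```

The point of the pattern is that it inspects internal state rather than outputs, so it can surface latent behavior that input/output evaluation misses.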
Current tools
| Tool | Use case | Source |
|---|---|---|
| TransformerLens | Transformer internals, activation patching, circuits | Research workflows |
| SAE-Vis | Visualizing sparse autoencoder features | Feature inspection |
| Neuronpedia | Feature browser, public SAE demos | Used by DeepMind for Gemma Scope |
What still does not work
| Limitation | Detail |
|---|---|
| Cannot map everything | Anthropic notes the discovered features are likely only a small fraction of what the model represents. Exhaustive mapping would be cost-prohibitive with current techniques |
| Circuits are still hard | Finding a feature does not explain how it gets used. Representation does not equal mechanism |
| Cross-layer superposition | Features distributed across layers complicate clean decompositions |
| Interpretations can be unstable | A feature can look cleanly semantic in one context and fuzzy in another. 2025 NeurIPS work on falsifying SAE explanations makes this limitation explicit |
| Scaling is expensive | Gemma Scope: 20 PiB of activations, hundreds of billions of SAE parameters |
The right mental model
Mechanistic interpretability cannot yet fully explain general reasoning, prove a model is safe, or replace external evals and red-teaming.
What it can do: identify internal features with meaningful semantic content, show causal influence on behavior, locate circuits involved in specific outputs, and support auditing, debugging, and safety evaluation.
We can now inspect parts of an LLM’s internals with enough fidelity to change how we build, test, and govern AI systems. That is already a meaningful shift.