The audit problem with LLMs
Most AI audits today work like this: send inputs, check outputs, flag anomalies. That is external evaluation. It tells you what the model does. It does not tell you why.
For low-risk applications, that is often enough. For high-risk systems under regulatory scrutiny, it is not. The EU AI Act (Art. 11, Art. 15) expects technical documentation that covers accuracy, robustness, and transparency. External evals alone leave a documentation gap: you can show the model behaves well on your test set, but you cannot show what internal mechanisms produce that behavior.
Mechanistic interpretability is starting to close that gap.
What interpretability adds to an audit
Anthropic’s Scaling Monosemanticity work extracted 34 million features from Claude 3 Sonnet using sparse autoencoders. Among them: features for deception, sycophancy, bias, code backdoors, and dangerous content.
Anthropic explicitly frames these features as a kind of “test set for safety,” because they expose latent capabilities that normal input/output evaluation can miss. Source: Mapping the Mind of a Large Language Model
That matters for audits because:
| External eval alone | External eval + interpretability |
|---|---|
| “The model did not produce toxic output on our test set” | “The model has an internal feature for toxic content, but it is suppressed by the safety finetune” |
| “The model passed our bias test” | “The model has gender bias features in professions. Activation levels are low after RLHF” |
| “No jailbreaks detected in testing” | “A jailbreak-pattern feature exists. Current safety training reduces its activation but does not eliminate it” |
The second column is not just more informative. It is a different kind of evidence.
Three audit patterns
1. Latent capability detection
A model can have internal representations for dangerous content even if it never outputs that content in normal use. Safety finetuning can suppress activation, but the capability may still exist.
Anthropic found features for:
- Biosafety-related content
- Criminal and dangerous content
- Code backdoors
- Manipulation and power-seeking concepts
Source: Scaling Monosemanticity
For audit purposes, this answers: “Does the model know how to do this, even if it currently refuses?”
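A minimal sketch of this audit step, assuming you have already collected per-token SAE feature activations for a set of probe prompts the model refused. The feature indices, names, and data below are hypothetical stand-ins, not Anthropic's actual feature catalog:

```python
import numpy as np

# Hypothetical flagged features (index -> label). A real audit would take
# these from an SAE feature catalog for the model under review.
FLAGGED = {2: "code backdoor", 7: "biosafety", 11: "manipulation"}

def latent_hits(acts: np.ndarray, threshold: float = 0.5) -> dict:
    """Return flagged features whose max activation over the probe set
    exceeds `threshold`, despite refusal at the output level.
    `acts` has shape (tokens, features)."""
    return {name: float(acts[:, idx].max())
            for idx, name in FLAGGED.items()
            if acts[:, idx].max() > threshold}

# Synthetic stand-in for recorded activations: 200 tokens x 16 features.
rng = np.random.default_rng(2)
acts = rng.uniform(0.0, 0.3, size=(200, 16))
acts[:, 7] += 0.6   # the "biosafety" feature fires internally anyway

print(latent_hits(acts))   # only the biosafety feature crosses threshold
```

A non-empty result is exactly the finding external evals miss: the output was a refusal, but the internal representation was active.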
2. Safety mechanism verification
If a safety finetune is supposed to prevent certain outputs, interpretability can show whether the training actually suppressed the relevant features or just added a surface-level refusal pattern.
This is the difference between “the model says no” and “the internal mechanism for that behavior is weakened.”
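One way to make that distinction measurable is to compare a feature's activation on the same prompts before and after the safety finetune. The sketch below uses synthetic activations and a hypothetical feature index; a real audit would plug in recorded SAE activations from both checkpoints:

```python
import numpy as np

def mean_activation(acts: np.ndarray, feature_idx: int) -> float:
    """Mean activation of one SAE feature over all tokens (rows)."""
    return float(acts[:, feature_idx].mean())

def suppression_ratio(base_acts: np.ndarray,
                      tuned_acts: np.ndarray,
                      feature_idx: int) -> float:
    """Relative reduction in a feature's mean activation after finetuning.
    1.0 = fully suppressed, 0.0 = unchanged."""
    base = mean_activation(base_acts, feature_idx)
    tuned = mean_activation(tuned_acts, feature_idx)
    return 1.0 - tuned / base if base > 0 else 0.0

# Synthetic stand-in: 1000 tokens x 16 SAE features on a fixed prompt set.
rng = np.random.default_rng(0)
base = rng.exponential(1.0, size=(1000, 16))
tuned = base.copy()
tuned[:, 3] *= 0.2   # feature 3 (hypothetical "jailbreak-pattern") weakened

print(f"suppression of feature 3: {suppression_ratio(base, tuned, 3):.2f}")
# → suppression of feature 3: 0.80
```

A ratio near 1.0 suggests the mechanism itself was weakened; a ratio near 0.0 with clean outputs suggests only a surface-level refusal pattern was added.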
3. Bias mapping
Anthropic reported features for gender bias in professions. With interpretability tools, an audit can identify which features encode bias, measure their activation strength, and track whether mitigation reduces them at the representation level, not just at the output level.
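A representation-level bias measurement can be as simple as comparing a feature's mean activation across counterfactual prompt pairs (e.g. “he”/“she” variants of profession sentences). The feature index and activations below are synthetic placeholders for illustration:

```python
import numpy as np

def bias_gap(acts_a: np.ndarray, acts_b: np.ndarray,
             feature_idx: int) -> float:
    """Difference in mean activation of one feature between two
    counterfactual prompt sets. Zero means no representation-level gap."""
    return float(acts_a[:, feature_idx].mean()
                 - acts_b[:, feature_idx].mean())

# Synthetic stand-in: 500 prompt pairs x 8 SAE features.
rng = np.random.default_rng(1)
n_feats = 8
acts_he  = rng.normal(0.0, 0.1, size=(500, n_feats)).clip(min=0)
acts_she = acts_he.copy()
acts_she[:, 5] += 0.4   # feature 5: hypothetical profession-gender feature

gaps = [bias_gap(acts_he, acts_she, i) for i in range(n_feats)]
worst = int(np.argmax(np.abs(gaps)))
print(worst, round(gaps[worst], 2))   # feature with the largest gap
```

Tracking this gap before and after a mitigation shows whether the intervention changed the representation or only the outputs.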
How this connects to the EU AI Act
The EU AI Act does not require mechanistic interpretability by name. But several provisions create transparency and documentation obligations where interpretability evidence is directly useful:
| AI Act requirement | What interpretability provides |
|---|---|
| Art. 11: Technical documentation | Internal analysis of model behavior, known failure modes, documented features |
| Art. 13: Transparency | Evidence of what the model represents internally, not just what it outputs |
| Art. 15: Accuracy, robustness | Internal mechanism analysis beyond external benchmarks |
| Art. 9: Risk management | Identification of latent capabilities (deception, bias, dangerous content) as risk factors |
According to the European Commission’s FAQ on transparent AI systems, the Act tasks the Commission with developing guidelines and a Code of Practice for transparent generative AI systems. Interpretability evidence fits naturally into this framework.
Practical limitations
| Limitation | Impact on audits |
|---|---|
| Feature coverage is incomplete | Anthropic says discovered features are a small subset of all concepts. An audit cannot claim full coverage |
| Circuit-level analysis is still early | Finding a feature does not explain how it causes output. Causal chains are partially mapped at best |
| Scaling cost is high | DeepMind’s Gemma Scope used ~15% of training compute and stored 20 PiB of activations. Not every team can run this |
| Explanations can be unstable | A 2025 NeurIPS paper on falsifying SAE explanations shows some interpretations are fragile |
What this means for engineering teams
If you operate AI systems that fall under regulatory scrutiny, interpretability evidence is not a legal requirement today, but it is becoming a credibility advantage.
A team that can say “we inspected the model’s internal features for bias and found X, and our mitigation reduced activation by Y%” has a stronger audit story than a team that only ran a benchmark.
Practical steps:
| Step | Action |
|---|---|
| 1 | Identify which of your AI systems are high-risk or may face regulatory review |
| 2 | For those systems, document whether interpretability tools are available for the model you use |
| 3 | Run SAE-based feature inspection on safety-relevant categories (bias, toxicity, deception) |
| 4 | Include interpretability findings in your technical documentation (Art. 11) |
| 5 | Track feature activation changes across model updates and finetuning rounds |
Interpretability is not a compliance checkbox. But it is a tool that makes your audit evidence materially stronger.
The models we deploy are no longer fully opaque. The question is whether your governance process has caught up to that fact.