The audit problem with LLMs

Most AI audits today work like this: send inputs, check outputs, flag anomalies. That is external evaluation. It tells you what the model does. It does not tell you why.

For low-risk applications, that is often enough. For high-risk systems under regulatory scrutiny, it is not. The EU AI Act (Art. 11, Art. 15) expects technical documentation that covers accuracy, robustness, and transparency. External evals alone leave a documentation gap: you can show the model behaves well on your test set, but you cannot show what internal mechanisms produce that behavior.

Mechanistic interpretability is starting to close that gap.

What interpretability adds to an audit

Anthropic’s Scaling Monosemanticity work extracted 34 million features from Claude 3 Sonnet using sparse autoencoders. Among them: features for deception, sycophancy, bias, code backdoors, and dangerous content.

Anthropic explicitly frames these features as a kind of “test set for safety,” because they expose latent capabilities that normal input/output evaluation can miss. Source: Mapping the Mind of a Large Language Model
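To make the mechanism concrete, here is a minimal sketch of the sparse-autoencoder idea: a wide, sparsely-activating dictionary over model activations. Everything here is a toy stand-in, not Anthropic's implementation: the dimensions are tiny, and the random weights stand in for a trained SAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions. Real SAEs map residual-stream activations (thousands of
# dims) to a far wider feature dictionary (up to millions of features).
d_model, n_features = 8, 32

# Hypothetical weights -- random stand-ins for a trained SAE.
W_enc = rng.normal(size=(d_model, n_features))
b_enc = rng.normal(size=n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = rng.normal(size=d_model)

def sae_features(activation):
    """Encode one model activation into non-negative feature activations."""
    return np.maximum(activation @ W_enc + b_enc, 0.0)

def sae_reconstruct(features):
    """Decode feature activations back into model-activation space."""
    return features @ W_dec + b_dec

x = rng.normal(size=d_model)   # one residual-stream activation
f = sae_features(x)            # mostly zero in a *trained* SAE (sparsity)
x_hat = sae_reconstruct(f)

# Training minimises reconstruction error plus an L1 sparsity penalty on f.
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
print(f"{(f > 0).sum()} of {n_features} features active")
```

Each learned feature ideally corresponds to one human-interpretable concept; auditing then means asking which features exist and when they fire.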

That matters for audits because:

| External eval alone | External eval + interpretability |
| --- | --- |
| "The model did not produce toxic output on our test set" | "The model has an internal feature for toxic content, but it is suppressed by the safety finetune" |
| "The model passed our bias test" | "The model has gender bias features in professions. Activation levels are low after RLHF" |
| "No jailbreaks detected in testing" | "A jailbreak-pattern feature exists. Current safety training reduces its activation but does not eliminate it" |

The second column is not just more informative. It is a different kind of evidence.

Three audit patterns

1. Latent capability detection

A model can have internal representations for dangerous content even if it never outputs that content in normal use. Safety finetuning can suppress activation, but the capability may still exist.

Anthropic found features for:

  • Biosafety-related content
  • Criminal and dangerous content
  • Code backdoors
  • Manipulation and power-seeking concepts

Source: Scaling Monosemanticity

For audit purposes, this answers: "Does the model know how to do this, even if it currently refuses?"
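In audit terms, the check is simple: does the internal feature fire on probe prompts even when the visible output is a refusal? A minimal sketch, with entirely hypothetical activation values and a made-up feature label:

```python
# Hypothetical audit data: activation of one SAE feature that prior analysis
# labelled "dangerous-content" (illustrative label), per probe prompt,
# alongside whether the model's visible output was a refusal.
probe_results = [
    {"prompt_id": "p1", "feature_activation": 0.02, "refused": False},
    {"prompt_id": "p2", "feature_activation": 4.7,  "refused": True},
    {"prompt_id": "p3", "feature_activation": 3.9,  "refused": True},
    {"prompt_id": "p4", "feature_activation": 0.00, "refused": True},
]

THRESHOLD = 1.0  # activation level treated as "feature clearly firing"

def latent_capability_flags(results, threshold=THRESHOLD):
    """Flag prompts where the feature fires despite a refusal: the model
    represents the concept internally even though it never outputs it."""
    return [r["prompt_id"] for r in results
            if r["refused"] and r["feature_activation"] > threshold]

flags = latent_capability_flags(probe_results)
print(flags)  # -> ['p2', 'p3']: refused, but the feature still fired
```

Prompt p4 shows the contrast: a refusal with no internal activation, which is the behavior a safety finetune is actually supposed to produce.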

2. Safety mechanism verification

If a safety finetune is supposed to prevent certain outputs, interpretability can show whether the training actually suppressed the relevant features or just added a surface-level refusal pattern.

This is the difference between “the model says no” and “the internal mechanism for that behavior is weakened.”
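That distinction can be quantified by comparing feature activations on the same probe set before and after the safety finetune. A sketch with invented numbers (the metric, not the data, is the point):

```python
import statistics

# Hypothetical mean activations of a jailbreak-pattern feature on a fixed
# probe set, measured on the base model and on the safety-finetuned model.
base_activations  = [3.2, 4.1, 2.8, 3.6]
tuned_activations = [0.9, 1.2, 0.7, 1.1]

def mean_suppression(base, tuned):
    """Fractional reduction in mean feature activation after finetuning.
    Near 1.0: the internal mechanism is genuinely weakened.
    Near 0.0: the finetune likely only added a surface-level refusal."""
    mb, mt = statistics.fmean(base), statistics.fmean(tuned)
    return (mb - mt) / mb

r = mean_suppression(base_activations, tuned_activations)
print(f"feature activation reduced by {r:.0%}")
```

A high suppression ratio is evidence that the training changed the mechanism; a low one, paired with clean output-level evals, is evidence of exactly the "says no but still represents it" gap.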

3. Bias mapping

Anthropic reported features for gender bias in professions. With interpretability tools, an audit can identify which features encode bias, measure their activation strength, and track whether mitigation reduces them at the representation level, not just at the output level.
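A bias audit along these lines can be reduced to ranking activation strength per category and selecting mitigation targets. The professions and values below are hypothetical placeholders:

```python
# Hypothetical activations of a "gender bias in professions" feature,
# measured on templated probe prompts per profession.
bias_activations = {
    "nurse": 2.4, "engineer": 1.9, "teacher": 0.8,
    "plumber": 1.5, "accountant": 0.3,
}

def rank_bias_targets(activations, floor=1.0):
    """Professions whose bias-feature activation exceeds a floor,
    strongest first -- candidates for mitigation and re-measurement."""
    return sorted((p for p, a in activations.items() if a > floor),
                  key=lambda p: -activations[p])

targets = rank_bias_targets(bias_activations)
print(targets)  # -> ['nurse', 'engineer', 'plumber']
```

Re-running the same measurement after mitigation gives a representation-level before/after comparison, rather than only an output-level one.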

How this connects to the EU AI Act

The EU AI Act does not require mechanistic interpretability by name. But several provisions create transparency and documentation obligations where interpretability evidence is directly useful:

| AI Act requirement | What interpretability provides |
| --- | --- |
| Art. 11: Technical documentation | Internal analysis of model behavior, known failure modes, documented features |
| Art. 13: Transparency | Evidence of what the model represents internally, not just what it outputs |
| Art. 15: Accuracy, robustness | Internal mechanism analysis beyond external benchmarks |
| Art. 9: Risk management | Identification of latent capabilities (deception, bias, dangerous content) as risk factors |

The European Commission’s FAQ on AI transparency notes that the Commission is tasked with developing guidelines and a Code of Practice for transparent generative AI systems. Interpretability evidence fits naturally into that framework.

Practical limitations

| Limitation | Impact on audits |
| --- | --- |
| Feature coverage is incomplete | Anthropic says discovered features are a small subset of all concepts. An audit cannot claim full coverage |
| Circuit-level analysis is still early | Finding a feature does not explain how it causes output. Causal chains are partially mapped at best |
| Scaling cost is high | DeepMind’s Gemma Scope used ~15% of training compute and stored 20 PiB of activations. Not every team can run this |
| Explanations can be unstable | A 2025 NeurIPS paper on falsifying SAE explanations shows some interpretations are fragile |

What this means for engineering teams

If you operate AI systems that fall under regulatory scrutiny, interpretability evidence is not a legal requirement today, but it is becoming a credibility advantage.

A team that can say “we inspected the model’s internal features for bias and found X, and our mitigation reduced activation by Y%” has a stronger audit story than a team that only ran a benchmark.

Practical steps:

| Step | Action |
| --- | --- |
| 1 | Identify which of your AI systems are high-risk or may face regulatory review |
| 2 | For those systems, document whether interpretability tools are available for the model you use |
| 3 | Run SAE-based feature inspection on safety-relevant categories (bias, toxicity, deception) |
| 4 | Include interpretability findings in your technical documentation (Art. 11) |
| 5 | Track feature activation changes across model updates and finetuning rounds |
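Step 5 amounts to a regression check over releases: flag any safety-relevant feature whose mean activation rose between consecutive versions. A sketch, with hypothetical feature labels, release names, and numbers:

```python
# Hypothetical per-release mean activations for safety-relevant features.
history = {
    "v1.0": {"toxicity": 2.1, "gender_bias": 1.4, "jailbreak": 3.0},
    "v1.1": {"toxicity": 1.2, "gender_bias": 1.5, "jailbreak": 1.1},
    "v1.2": {"toxicity": 1.3, "gender_bias": 2.2, "jailbreak": 1.0},
}

def regressions(history, tolerance=0.25):
    """Compare each release to its predecessor and flag features whose mean
    activation rose by more than `tolerance` -- a possible safety regression
    introduced by a model update or finetuning round."""
    releases = list(history)
    flagged = []
    for prev, curr in zip(releases, releases[1:]):
        for feat, act in history[curr].items():
            if act - history[prev][feat] > tolerance:
                flagged.append((curr, feat))
    return flagged

print(regressions(history))  # -> [('v1.2', 'gender_bias')]
```

Wired into CI for model releases, a check like this turns interpretability findings from a one-off report into an ongoing audit trail.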

Interpretability is not a compliance checkbox. But it is a tool that makes your audit evidence materially stronger.

The models we deploy are no longer fully opaque. The question is whether your governance process has caught up to that fact.