Mechanistic interpretability is the research program that tries to reverse-engineer the internal algorithms of trained neural networks, reading their weights and activations the way a biologist reads a cell. Its goal is to move from treating a large language model as an opaque function that maps inputs to outputs to understanding the specific computations it performs inside — the features it represents, the circuits that combine those features, and the directions in activation space that encode concepts like refusal, deception, or factual recall. The stakes are high: if we cannot inspect what a model is doing, we cannot reliably detect when it is misaligned, dishonest, or reasoning in ways its designers never intended.
For most of the deep learning era, interpretability meant producing saliency maps or asking models to explain themselves in natural language. Neither approach answers the harder question of what the network is actually computing. Mechanistic interpretability, a term popularized by Chris Olah and collaborators in the Distill.pub Circuits thread and carried forward by the Anthropic interpretability team and the Google DeepMind mechanistic interpretability group led by Neel Nanda, asks that harder question directly. Between 2023 and 2025, the field delivered its first genuine breakthroughs at the scale of frontier models. The picture that has emerged is both encouraging and sobering.
What Mechanistic Interpretability Actually Means
A feature, in this literature, is a direction in the activation space of a neural network that corresponds to a human-understandable concept — "the Golden Gate Bridge", "legal contracts written in German", "code that contains a buffer overflow". A circuit is a computational subgraph that combines features through the network's weights to perform a specific function, such as detecting that a sentence is a question or predicting the next token in a date. A neuron is the unit we might naively hope to interpret, but neurons in large models almost never correspond cleanly to single concepts. They are polysemantic: one neuron fires for many unrelated things at once.
This polysemanticity is not an accident. Models appear to pack more features than they have neurons by representing them in superposition — distributing each concept across many neurons so that the active concept at any moment is a particular pattern of activations rather than a single unit firing. Anthropic formalized this in Toy Models of Superposition (Elhage et al., 2022), showing in controlled settings why networks under capacity pressure adopt this strategy and what it means for interpretability.
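The core idea of superposition can be illustrated numerically. The sketch below is not the setup from Toy Models of Superposition; it simply assumes random unit-length feature directions and hypothetical sizes (2,048 features in 512 dimensions) to show that sparse features can be read back out despite sharing dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 2048, 512   # hypothetical sizes: 4x more features than dimensions

# Give each feature a random unit-length direction in activation space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only 3 of the 2,048 features are active.
active = np.array([3, 57, 120])
feature_values = np.zeros(n_features)
feature_values[active] = 1.0

# The activation vector is the sum (superposition) of the active directions.
activation = feature_values @ directions

# Read every feature back out by projecting onto its direction.
readout = directions @ activation

inactive = np.ones(n_features, dtype=bool)
inactive[active] = False
print("readout on active features:", np.round(readout[active], 2))
print("largest interference on inactive features:",
      np.round(np.abs(readout[inactive]).max(), 2))
```

The readout on the active features stays near 1 while the interference on inactive features stays small, but only because few features are active at once. No single coordinate of `activation` corresponds to any one feature, which is the toy analogue of polysemantic neurons.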
The distinction from older interpretability work matters. Behavioral interpretability asks what a model does on a test suite. Saliency methods highlight which input tokens influenced an output. Mechanistic interpretability asks how the model does it — which internal structures fire, in what order, and under what conditions. Only the last of these can in principle catch a model that behaves well on the tests we design but computes something different on novel inputs.
Why It Matters for AI Safety
The alignment problem is hard partly because we cannot currently verify whether a model is aligned. A model that passes every behavioral evaluation could still harbor internal features that fire on "I am being tested" versus "I am deployed", conditioning its behavior in ways no benchmark would catch. The classic worry about deceptive alignment requires exactly this kind of internal structure, and without mechanistic tools it is invisible by construction.
Interpretability offers a second and independent check. If we can identify the features and circuits that implement deception, sycophancy, power-seeking, or dangerous capabilities, we can test whether those features are active in a given context, measure how strongly they influence the output, and in principle intervene on them. We've covered the 10 open problems in AI alignment in a companion article; mechanistic interpretability sits at the center of most of them, because nearly every other alignment problem becomes tractable if we can read the model's internal state.
Interpretability also has a less comfortable safety application: its tools can expose how thin current safety training actually is. Work by Andy Arditi and collaborators showed that refusal in open-weight chat models is mediated by a single direction in the residual stream, and that erasing this direction with a rank-one weight edit disables refusal across a wide range of harmful prompts with minimal impact on other capabilities. That is a disturbing result, but it is exactly the kind of result mechanistic interpretability is built to produce: a precise account of what safety fine-tuning actually changes inside the network, and how fragile those changes are.
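The linear algebra behind a single-direction ablation and a rank-one weight edit is compact enough to sketch. The example below uses synthetic activations rather than a real model, and the matrix `W`, the sizes, and the planted direction are all hypothetical. It estimates a direction as a difference of means between two prompt sets, then removes it both from activations and from a weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 32, 16

# Synthetic stand-ins for residual-stream activations on two prompt sets.
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(100, d_model))
harmful = rng.normal(size=(100, d_model)) + 4.0 * planted

# Difference-of-means estimate of the direction, normalized to unit length.
r = harmful.mean(axis=0) - harmless.mean(axis=0)
r /= np.linalg.norm(r)

def ablate(x, r):
    """Remove the component of each activation vector along r."""
    return x - np.outer(x @ r, r)

# Rank-one weight edit: W' = W - r r^T W, so W' cannot write along r.
W = rng.normal(size=(d_model, d_in))   # hypothetical matrix writing into the residual stream
W_edited = W - np.outer(r, r @ W)

h = rng.normal(size=d_in)
print("component along r before edit:", float(r @ (W @ h)))
print("component along r after edit: ", float(r @ (W_edited @ h)))
print("max projection after ablation:", float(np.abs(ablate(harmful, r) @ r).max()))
```

The edit subtracts a single outer product from `W`, which is why it is called rank-one: everything the matrix writes orthogonal to `r` is untouched.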
Sparse Autoencoders: The 2024 Breakthrough
The central obstacle to mechanistic interpretability has always been superposition. If features live in overlapping subspaces, you cannot read them off individual neurons. A sparse autoencoder (SAE) is a neural network trained to reconstruct a layer's activations using a much wider but mostly-inactive intermediate representation. The hypothesis is that the sparse intermediate layer, under the right constraints, will learn one direction per underlying feature, pulling apart what the original model had squeezed together.
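A minimal sketch of the forward pass and training objective of such an autoencoder is shown below. It assumes a ReLU encoder, an L1 sparsity penalty, and an encoder initialized as the transpose of the decoder. These are common choices rather than a description of any particular lab's implementation, and the sizes and the coefficient `l1_coeff` are hypothetical.

```python
import numpy as np

def init_sae(d_model, d_dict, rng):
    """Random initialization; d_dict is much larger than d_model."""
    W_dec = rng.normal(size=(d_dict, d_model))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary directions
    return {
        "W_enc": W_dec.T.copy(),   # (d_model, d_dict), tied to the decoder at init (assumption)
        "b_enc": np.zeros(d_dict),
        "W_dec": W_dec,            # (d_dict, d_model)
        "b_dec": np.zeros(d_model),
    }

def sae_forward(x, p):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, (x - p["b_dec"]) @ p["W_enc"] + p["b_enc"])
    reconstruction = features @ p["W_dec"] + p["b_dec"]
    return features, reconstruction

def sae_loss(x, p, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    features, reconstruction = sae_forward(x, p)
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * l1

rng = np.random.default_rng(0)
params = init_sae(d_model=64, d_dict=512, rng=rng)
batch = rng.normal(size=(8, 64))   # stand-in for a batch of layer activations
features, reconstruction = sae_forward(batch, params)
print(features.shape, reconstruction.shape, float(sae_loss(batch, params)))
```

Training would minimize `sae_loss` over many batches of real activations; the sketch stops at the objective because the tradeoff between the two terms is the part that matters for interpretability.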
Anthropic's Towards Monosemanticity (Bricken et al., October 2023) was the first proof of concept at non-trivial scale: a sparse autoencoder trained on a one-layer transformer extracted roughly 15,000 features that human raters judged to be cleanly monosemantic about 70% of the time, compared with almost never for individual neurons. Features included things like "DNA nucleotide sequences" and "Arabic script", and the authors showed they could use these features to do basic circuit analysis.
The real question was whether the approach would survive the jump to frontier models. It did. In Scaling Monosemanticity (Templeton et al., May 2024), the Anthropic team trained autoencoders with roughly 1 million, 4 million, and 34 million features on the middle layer of Claude 3 Sonnet. The features they recovered were multilingual, multimodal, and spanned concrete entities as well as abstract concepts. Among them were features related to security vulnerabilities in code, deception and power-seeking, sycophancy, criminal content, and a specific feature for the Golden Gate Bridge.
About two weeks later, OpenAI published Scaling and evaluating sparse autoencoders (Gao et al., June 2024), training a 16-million-latent k-sparse autoencoder on GPT-4 activations for 40 billion tokens, with modifications that substantially reduced dead latents and produced cleaner scaling laws. Two frontier labs working on different models reached compatible conclusions within weeks of each other, a good sign that the technique is capturing something real about how these models represent information.
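The "k-sparse" part refers to enforcing sparsity directly, by keeping only the k largest latents per input, instead of relying on a penalty term. The function below is an illustrative NumPy version of that activation, not OpenAI's code; the name `topk_activation` and the toy inputs are assumptions.

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row and zero the rest."""
    # argpartition places the indices of the k largest entries in the last k slots.
    top_idx = np.argpartition(pre_acts, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre_acts)
    rows = np.arange(pre_acts.shape[0])[:, None]
    out[rows, top_idx] = pre_acts[rows, top_idx]
    return out

pre_acts = np.array([[0.1, 2.0, -1.0, 0.7, 1.5],
                     [3.0, 0.2, 0.9, -0.5, 0.4]])
sparse = topk_activation(pre_acts, k=2)
print(sparse)
# [[0.  2.  0.  0.  1.5]
#  [3.  0.  0.9 0.  0. ]]
```

Because exactly k latents are active for every input, the sparsity level becomes a fixed hyperparameter rather than something tuned indirectly through a penalty weight.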
The most memorable demonstration was Golden Gate Claude, a public research demo active for roughly 24 hours on May 23, 2024. Researchers clamped the Golden Gate Bridge feature to approximately ten times its normal maximum activation. Asked to describe its physical form, the model replied "I am the Golden Gate Bridge" and wove the bridge into answers on unrelated topics. The point was not to produce a novelty chatbot but to show that identified features have causal influence on behavior — that feature steering is not merely a correlational artifact.
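Clamping a feature can be sketched as a small edit to the activation vector. The example below assumes a ReLU autoencoder with random weights and a hypothetical feature index. It moves each activation along the feature's decoder direction by exactly the amount needed to make the feature's contribution match the clamped value, leaving everything else, including whatever the autoencoder fails to reconstruct, unchanged.

```python
import numpy as np

def clamp_feature(x, W_enc, b_enc, W_dec, feature_idx, clamp_value):
    """Edit activations as if one SAE feature had activation clamp_value.

    Only the chosen feature's contribution to the reconstruction changes.
    """
    features = np.maximum(0.0, x @ W_enc + b_enc)
    delta = clamp_value - features[:, feature_idx]   # change needed per example
    return x + np.outer(delta, W_dec[feature_idx])   # move along the decoder direction

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64
W_dec = rng.normal(size=(d_dict, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_enc, b_enc = W_dec.T.copy(), np.zeros(d_dict)

feature_idx = 7                                      # hypothetical feature of interest
x = rng.normal(size=(4, d_model))
x[0] += 5.0 * W_dec[feature_idx]                     # make the feature fire on one example

# Clamp to ten times the largest activation seen in this batch.
max_act = np.maximum(0.0, x @ W_enc + b_enc)[:, feature_idx].max()
steered = clamp_feature(x, W_enc, b_enc, W_dec, feature_idx, 10.0 * max_act)

shift = steered - x
print("shift along the feature direction:", np.round(shift @ W_dec[feature_idx], 3))
```

In a real model the edited activations would be written back into the forward pass at the layer the autoencoder was trained on, so that all later layers see the amplified feature.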
What Researchers Have Actually Found
Concrete findings matter more than methodology claims. Here are representative examples from the 2024-2025 literature:
- The Golden Gate feature in Claude 3 Sonnet. A specific SAE latent that activates on mentions of the bridge in English, French, Japanese and other languages, on images of the bridge, and on abstract references such as "that famous orange suspension bridge in San Francisco". Clamping the feature steers generation toward the bridge regardless of the prompt (Scaling Monosemanticity, 2024).
- Refusal directions. Evidence from Arditi et al. (NeurIPS 2024) indicates that safety refusal in 13 open-source chat models up to 72 billion parameters is mediated by a single direction in the residual stream. The same direction can be erased to prevent refusal or added to induce it on harmless prompts.
- Sycophancy and deception features. Anthropic reported features that activate on text involving lies, treacherous turns, power-seeking, and sycophantic agreement. Whether these features are causally responsible for the corresponding behaviors, or merely correlate with them, remains an active research question.
- Code-vulnerability features. Features that activate on buffer overflows, SQL injections, and backdoors in source code were identified in Claude 3 Sonnet, raising the possibility of using SAE latents as runtime safety probes.
- Multilingual and multimodal generalization. A single feature can unify concrete, abstract, textual and visual references to the same concept — evidence that the features recovered by SAEs are closer to genuine concepts than to surface-form lexical patterns.
None of this amounts to a complete circuit-level account of any frontier model. Recent work on circuit tracing, including updates from the Anthropic interpretability team through 2025, is beginning to chain identified features into attribution graphs that capture how one feature causes the next to fire. That is the direction the field needs to go.
The Open Problems in 2026
Progress has been real, but the honest picture is that mechanistic interpretability remains early-stage science, and several fundamental problems have no known solution.
- Superposition is not fully resolved. SAEs work, but they rest on assumptions about feature sparsity and linearity that may not hold uniformly across layers and model sizes. Non-linear features and features that live in curved manifolds may simply be invisible to current SAE architectures.
- Feature splitting and absorption. As you train larger SAEs, features tend to split into finer-grained variants, and some features absorb others in unpredictable ways. Benchmarks published in 2025, such as SAEBench, have documented that improving the sparsity-fidelity tradeoff with JumpReLU or TopK architectures can worsen feature absorption, producing gerrymandered latents that fire on most instances of a concept (say, 95%) and unaccountably miss the rest.
- No ground truth. There is no oracle that tells us what the "true" features of a language model should be. Evaluation relies on automated explanations from other language models, reconstruction loss, and downstream probing tasks, none of which are definitive. A feature can look interpretable to a rater and still fail to match any coherent concept.
- Scaling cost. Training an SAE on a frontier model is computationally expensive, and must be repeated for each layer and each checkpoint you care about. Keeping interpretability tools current with a rapidly updating production model is a nontrivial engineering problem that Anthropic has written about publicly.
- The causal gap. Identifying a feature is not the same as proving it causes a behavior. Activation patching, ablation, and feature clamping provide evidence, but experiments on frontier models are expensive and the statistical rigor of causal claims in the field is still catching up.
- Evaluation is immature. SAEBench, CE-Bench, and related efforts are moving the field toward standardized metrics, but there is no agreed benchmark that captures everything we care about — interpretability, faithfulness, and downstream usefulness for safety.
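The activation patching mentioned under the causal gap can be illustrated on a toy network. The two-layer model, its random weights, and the clean and corrupted inputs below are all hypothetical; the point is only the procedure of copying one internal activation from a clean run into a corrupted run and measuring how much of the output difference it restores.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 6))   # toy two-layer network, purely illustrative
W2 = rng.normal(size=(6,))

def run(x, patch=None):
    """Forward pass; optionally overwrite hidden units with cached values."""
    hidden = np.tanh(x @ W1)
    if patch is not None:
        idx, values = patch
        hidden = hidden.copy()
        hidden[idx] = values
    return float(hidden @ W2), hidden

clean_x = np.array([1.0, 0.5, -0.3, 0.8])
corrupt_x = np.array([-1.0, 0.5, -0.3, 0.8])

clean_out, clean_hidden = run(clean_x)
corrupt_out, _ = run(corrupt_x)

# Patch each hidden unit from the clean run into the corrupted run, one at a
# time, and record how much of the clean-vs-corrupted gap that unit restores.
fractions = []
for i in range(6):
    patched_out, _ = run(corrupt_x, patch=(i, clean_hidden[i]))
    fractions.append((patched_out - corrupt_out) / (clean_out - corrupt_out))
    print(f"unit {i}: fraction of gap recovered = {fractions[-1]:.2f}")
```

In this toy model the output is linear in the hidden units, so the per-unit fractions add up to one. In a deep network, later layers respond nonlinearly to a patch, which is part of why causal claims on frontier models need more careful experimental design.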
Frequently Asked Questions
How is mechanistic interpretability different from explainability?
Explainability typically refers to methods that produce human-readable rationales for individual predictions: saliency maps, attention visualizations, or a model's own natural-language self-explanations. Mechanistic interpretability aims to recover the actual internal computations of the network: which features exist, which circuits compute what, and how concepts are represented in activation space. A plausible-sounding rationale can diverge from the computation that actually produced the output, which is why mechanistic work is considered more foundational for safety.
What is a sparse autoencoder in plain terms?
A sparse autoencoder is a small neural network that learns to rewrite the activations of one layer of a language model using a much larger dictionary of directions, with a constraint that only a few directions are active for any given input. Think of it as a lossy translation from the model's compressed internal code to a larger, more human-readable vocabulary. When the translation works, each dictionary entry tends to correspond to a single interpretable concept.
Did Golden Gate Claude prove that interpretability is solved?
No. Golden Gate Claude demonstrated that a feature identified by an SAE can causally steer model behavior when clamped, a meaningful and visible result. It did not demonstrate that the SAE recovered all the features of Claude 3 Sonnet, that the identified features are the right level of abstraction, or that interpretability generalizes to tasks like detecting deception in deployment. The demo indicates that the technique captures something real, not that the problem is closed.
Can mechanistic interpretability detect deception in AI models?
Not yet, in the strong sense. Anthropic's 2024 work identified features that activate on text about deception and power-seeking in Claude 3 Sonnet, which is a necessary condition for eventually building deception detectors. But detecting in real time whether a deployed model is reasoning deceptively, rather than merely representing the concept, would require reliable causal tracing, faithful circuit-level accounts, and evaluation methodology the field does not yet possess.
Who are the main research groups working on this?
The most visible programs are Anthropic's interpretability team, originally founded by Chris Olah and now publishing on transformer-circuits.pub; the Google DeepMind mechanistic interpretability team led by Neel Nanda, who created the open-source TransformerLens library; OpenAI's interpretability work, including the sparse autoencoder research of Leo Gao and collaborators; and a growing academic community contributing through venues like NeurIPS, ICLR, and the Alignment Forum. Independent efforts around evaluation benchmarks such as SAEBench are increasingly important.
Key Takeaways
- Mechanistic interpretability is the project of reverse-engineering the internal algorithms of trained models, distinguished from saliency or behavioral interpretability by its focus on features, circuits, and weight-level structure.
- Sparse autoencoders were the key enabling technology in 2023-2024, first validated at small scale by Anthropic's Towards Monosemanticity and then scaled to Claude 3 Sonnet (roughly 34 million features) and GPT-4 (roughly 16 million features) within weeks of each other in mid-2024.
- Golden Gate Claude and the refusal-direction jailbreak demonstrated causal steering — identified features are not merely descriptive, they influence model behavior when intervened upon.
- Progress is real but the science is early. Superposition is not fully solved, feature splitting and absorption distort the picture, and there is no ground-truth benchmark against which to measure success.
- Interpretability is a prerequisite for most of the harder alignment problems, including detecting deceptive alignment, verifying value learning, and building trustworthy safety evaluations.
The Path Forward
The encouraging story is that a scientific field that barely existed in 2020 has produced, in under five years, the first credible tools for reading the internal states of frontier language models. The sobering story is that those tools remain expensive, only partially validated, and far from the level of confidence that high-stakes safety claims would require. The field is moving in the right direction, with circuit tracing, standardized benchmarks, and causal intervention methods all maturing, but the gap between "we can identify interesting features" and "we can certify that this deployed model is not deceiving its users" is still very wide.
At DeepScience, we track the latest interpretability research through our AI-powered pipeline. Our Research Roadmap covers the open problems in AI safety, including interpretability and alignment. The honest assessment is that mechanistic interpretability is one of the most important bets being made in AI research today, and like all important bets, its success is not yet guaranteed.