Circuit Tracing: A Step Closer to Understanding Large Language Models
Reverse-engineering the computational circuits of large language models to understand their decision-making processes.

Context
Over the years, Transformer-based large language models (LLMs) have made substantial progress across a wide range of tasks, evolving from simple information retrieval systems to sophisticated agents capable of coding, writing, conducting research, and much more. But despite their capabilities, these models are still largely black boxes. Given an input, they accomplish the task, but we lack intuitive ways to understand how the task was actually accomplished.
LLMs are designed to predict the statistically most likely next word/token. But do they only focus on predicting the next token, or do they plan ahead? For instance, when we ask a model to write a poem, is it generating one word at a time, or is it anticipating rhyme patterns before outputting the word? Or when asked a basic reasoning question, such as “What is the capital of the state in which Dallas is located?”, models often produce results that look like a chain of reasoning, but did the model actually use that reasoning? We lack visibility into the model’s internal thought process. To understand LLMs, we need to trace their underlying logic.
The study of LLMs’ internal computations falls under “Mechanistic Interpretability,” which aims to uncover the computational circuits of models. Anthropic is one of the leading AI companies working on interpretability. In March 2025, they published a paper titled “Circuit Tracing: Revealing Computational Graphs in Language Models,” which aims to tackle the problem of circuit tracing.
This post aims to explain the core ideas behind their work and build a foundation for understanding circuit tracing in LLMs.
What is a circuit in LLMs?
Before we can define a “circuit” in language models, we first need to look inside the LLM. It is a neural network built on the transformer architecture, so it seems natural to treat neurons as the basic computational unit and to interpret the patterns of their activations across layers as the model’s computational circuit.
However, the “Towards Monosemanticity” paper revealed that tracking neuron activations alone doesn’t provide a clear understanding of why those neurons are activated. This is because individual neurons are often polysemantic: they respond to a mix of unrelated concepts.
The paper further showed that neurons are composed of more fundamental units called features, which capture more interpretable information. In fact, a neuron can be seen as a combination of features. So rather than tracing neuron activations, we aim to trace feature activations, the actual units of meaning driving the model’s outputs.
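To make the “neurons as combinations of features” picture concrete, here is a tiny, hedged sketch in PyTorch. The matrix, dimensions, and numbers are invented purely for illustration and are not taken from any real model or from the paper.

```python
import torch

# Toy illustration of "a neuron as a combination of features".
# All numbers and matrices below are made up for the sake of the example.

d_model, n_features = 4, 8          # 4 neurons, 8 (over-complete) features

# Each column of W_dec is the direction a single feature writes into neuron space.
W_dec = torch.randn(d_model, n_features)

# A sparse feature vector: only a couple of features fire for a given input.
f = torch.zeros(n_features)
f[2], f[5] = 1.3, 0.7

# Neuron activations are then a linear mix of the active features' directions,
# which is why one neuron can end up responding to several unrelated concepts.
neuron_acts = W_dec @ f
print(neuron_acts)                  # shape: (4,) — one value per neuron
```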
With that, we can define a circuit as a sequence of feature activations and connections used by the model to transform a given input into an output.
Now that we know what we’re looking for, let’s dive into the technical setup.
Technical Setup
We’ve established that we need to trace feature activations rather than neuron activations. To enable this, we need to convert the neurons of the existing LLM into features, i.e., build a replacement model that represents computations in terms of features.
Before diving into how this replacement model is constructed, let’s briefly review the architecture of Transformer-based large language models.
The following diagram illustrates how transformer-based language models operate. The input is first converted into tokens, which are mapped to embeddings. These token embeddings are passed to the attention block, which computes the relationships between tokens. Each token is then passed to the multi-layer perceptron (MLP) block, which further refines it using a non-linear activation and linear transformations. This process is repeated across many layers before the model generates the final output.
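As a complement to the diagram, here is a heavily simplified PyTorch sketch of this layer structure. It omits layer normalization, causal masking, tokenization, and the output head, and all dimensions are arbitrary choices for illustration, not the configuration of any real model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified sketch of one decoder layer: attention followed by an MLP,
    each added back into the residual stream."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # tokens exchange information
        x = x + attn_out                   # residual connection
        x = x + self.mlp(x)                # per-token refinement by the MLP
        return x

# Stack several blocks, as in a full model.
layers = nn.ModuleList(TransformerBlock() for _ in range(3))
tokens = torch.randn(1, 10, 64)            # (batch, sequence, d_model) embeddings
for layer in layers:
    tokens = layer(tokens)
```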

Now that we have laid out the structure of a transformer-based LLM, let’s look at what transcoders are. The authors use a “transcoder” to develop the replacement model.
Transcoders
A transcoder is itself a neural network (generally with a much higher dimension than the LLM’s) designed to replace the MLP block in a transformer model with a more interpretable, functionally equivalent component whose internal units are features.

It processes tokens from the attention block in three stages: encoding, sparse activation, and decoding. Effectively, it projects the input into a higher-dimensional space, applies an activation that forces only a sparse set of features to fire, and then compresses the result back to the original dimension in the decoding stage.
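A minimal sketch of these three stages is shown below, assuming a plain ReLU for the sparse activation and arbitrary dimensions; the transcoders used in the paper are trained with a reconstruction objective and a sparsity penalty, which are only hinted at in the comments.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Hedged sketch of a transcoder: it reads the MLP's input, expands it into a
    much wider, sparsely activating feature space, and decodes back to the MLP's
    output dimension. Dimensions and the ReLU-based sparsity are illustrative
    choices, not the exact configuration from the paper."""
    def __init__(self, d_model: int = 64, d_features: int = 1024):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # encode: project up
        self.decoder = nn.Linear(d_features, d_model)   # decode: project back down

    def forward(self, mlp_input: torch.Tensor):
        features = torch.relu(self.encoder(mlp_input))  # sparse activation: most features stay at zero
        mlp_like_output = self.decoder(features)        # reconstruct the MLP's output
        return mlp_like_output, features                # features are the interpretable units

# Training (not shown) would push mlp_like_output to match the original MLP's output
# while a sparsity penalty keeps only a handful of features active per token.
```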

With a basic understanding of transformer-based LLMs and transcoders in place, let’s look at how a transcoder is used to build a replacement model.
Construct a replacement model
As mentioned earlier, a transformer block typically consists of two main components: an attention block and an MLP block (feedforward network). To build a replacement model, the MLP block in the original transformer model is replaced with a transcoder. This integration is seamless because the transcoder is trained to mimic the output of the original MLP, while also exposing its internal computations through sparse and modular features.
While standard transcoders are trained to imitate the MLP behavior within a single transformer layer, the authors of the paper used a cross-layer transcoder (CLT), which captures the combined effects of transcoders across several layers. This is important because it lets us track whether a feature’s effect is spread across multiple layers, which is needed for circuit tracing.
The image below illustrates how the cross-layer transcoder (CLT) setup is used to build a replacement model. The transcoder output at layer 1 contributes to constructing the MLP-equivalent output in all the upper layers until the end.
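Below is a rough, hypothetical sketch of that cross-layer wiring: features encoded at layer l get a decoder for every subsequent layer, so a single feature can contribute to the MLP-equivalent output of all later layers. The shapes, the ReLU, and the per-layer decoder arrangement are simplifications for illustration, not a faithful reimplementation of the paper’s CLT.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoderSketch(nn.Module):
    """Rough sketch of the cross-layer idea: features encoded at layer l have their
    own decoder for every layer m >= l, so one feature can contribute to the
    MLP-equivalent output of all subsequent layers."""
    def __init__(self, n_layers: int = 4, d_model: int = 64, d_features: int = 512):
        super().__init__()
        self.n_layers = n_layers
        self.encoders = nn.ModuleList(nn.Linear(d_model, d_features) for _ in range(n_layers))
        # decoders[l][m - l]: reads features encoded at layer l, writes to layer m >= l
        self.decoders = nn.ModuleList(
            nn.ModuleList(nn.Linear(d_features, d_model) for _ in range(n_layers - l))
            for l in range(n_layers)
        )

    def forward(self, residual_streams: list) -> list:
        outputs = [torch.zeros_like(x) for x in residual_streams]
        for l, x in enumerate(residual_streams):
            feats = torch.relu(self.encoders[l](x))          # sparse features at layer l
            for m in range(l, self.n_layers):                # contribute to layer l and above
                outputs[m] = outputs[m] + self.decoders[l][m - l](feats)
        return outputs  # MLP-equivalent outputs, one per layer

# Toy usage: one residual-stream activation per layer for a single token.
clt = CrossLayerTranscoderSketch()
streams = [torch.randn(64) for _ in range(4)]
mlp_equivalent = clt(streams)
```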

Side note: the following image is from the paper and shows how a replacement model is constructed. It replaces the neurons of the original model with features.

Now that we understand the architecture of the replacement model, let’s look at how an interpretable representation is built from the replacement model’s computational path.
Interpretable representation of the model’s computation: the attribution graph
To build the interpretable representation of the model’s computational path, we start from the model’s output feature and trace backward through the feature network to uncover which earlier features contributed to it. This is done using the backward Jacobian, which tells us how much a feature in the previous layer contributed to the current feature’s activation, and it is applied recursively until we reach the input. Each feature is treated as a node and each influence as an edge. This process can produce a complex graph with millions of nodes and edges, so pruning is applied to keep the graph compact and manually interpretable.
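To give a feel for what one backward step might look like, here is a hedged, toy sketch that uses autograd to estimate how much each active upstream feature contributed to each active downstream feature, then prunes tiny edges. The function name, the linear “gradient × activation” attribution rule, and the threshold are illustrative assumptions, not the paper’s exact procedure.

```python
import torch

def attribution_edges(upstream_feats: torch.Tensor, downstream_fn, prune_below: float = 0.01):
    """Toy sketch of one backward step in building an attribution graph: estimate how
    much each active upstream feature contributed to each active downstream feature,
    keeping only edges above a pruning threshold. `downstream_fn` stands in for the
    (frozen) computation between the two feature layers."""
    upstream_feats = upstream_feats.detach().requires_grad_(True)
    downstream_feats = downstream_fn(upstream_feats)

    edges = []
    for j in torch.nonzero(downstream_feats).flatten():
        # One row of the backward Jacobian: gradient of a single downstream feature
        # with respect to every upstream feature.
        (grad,) = torch.autograd.grad(downstream_feats[j], upstream_feats, retain_graph=True)
        # Simple attribution rule: gradient times the upstream activation.
        contrib = grad * upstream_feats.detach()
        for i in torch.nonzero(contrib.abs() > prune_below).flatten():
            edges.append((int(i), int(j), float(contrib[i])))
    return edges  # each edge: (upstream feature index, downstream feature index, weight)

# Toy usage with a made-up linear map between two feature layers.
W = torch.randn(6, 4)
upstream = torch.relu(torch.randn(4))
print(attribution_edges(upstream, lambda f: torch.relu(W @ f)))
```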
The authors refer to this computational graph as an attribution graph and have also developed a tool to inspect it. This forms the core contribution of the paper.
The image below illustrates a sample attribution graph.

Now, with all this understanding, we can turn to feature interpretability.
Feature interpretability using an attribution graph
The researchers used attribution graphs on Anthropic’s Claude 3.5 Haiku model to study how it behaves across different tasks. In the case of poem generation, they discovered that the model doesn’t just generate the next word. It engages in a form of planning, both forward and backward. Before generating a line, the model identifies several possible rhyming or semantically appropriate words to end with, then works backward to craft a line that naturally leads to that target. Surprisingly, the model appears to hold multiple candidate end words in mind simultaneously, and it can restructure the entire sentence based on which one it ultimately chooses.
This technique offers a clear, mechanistic view of how language models generate structured, creative text. This is a significant milestone for the AI community. As we develop increasingly powerful models, the ability to trace and understand their internal planning and execution will be essential for ensuring alignment, safety, and trust in AI systems.
Limitations of the current approach
Attribution graphs offer a way to trace model behavior for a single input, but they don’t yet provide a reliable method for understanding global circuits or the consistent mechanisms a model uses across many examples. This analysis relies on replacing MLP computations with transcoders, but it is still unclear whether these transcoders truly replicate the original mechanisms or simply approximate the outputs. Additionally, the current approach highlights only active features, but inactive or inhibitory ones can be just as important for understanding the model’s behavior.
Conclusion
Circuit tracing via attribution graph is an early but important step toward understanding how language models work internally. While this approach still has a long way to go, the introduction of circuit tracing marks a major milestone on the path to true interpretability.
References:
- https://transformer-circuits.pub/2025/attribution-graphs/methods.html
- https://arxiv.org/pdf/2406.11944
- https://transformer-circuits.pub/2025/attribution-graphs/biology.html
- https://transformer-circuits.pub/2024/crosscoders/index.html
- https://transformer-circuits.pub/2023/monosemantic-features