Takeaways from Our First Artificial Agent Speaker Series Event
Uzay Macar on Chain-of-Thought Interpretability: How do AI systems really make decisions? And which reasoning steps actually matter?
This past Saturday, UK AI Forum hosted our first speaker series event: Uzay Macar from MATS visited St John's College to present groundbreaking research on understanding how large language models reason through complex problems. As AI systems become increasingly agentic and are deployed in high-stakes scenarios, understanding their decision-making processes isn't just academically interesting—it's crucial for safety and control.
The Challenge: Making Sense of AI "Thinking"
When modern reasoning models solve problems, they generate long chains of thought—sometimes thousands of tokens explaining their reasoning step by step. But not all reasoning steps are created equal. Some sentences are critical decision points that lock in the final answer, while others are merely post-hoc rationalisations. It’s important to be able to tell these apart.
Macar and his colleagues developed three complementary approaches to identify what they call "thought anchors"—reasoning steps that have outsized importance and disproportionately influence the subsequent reasoning process:
Counterfactual resampling: Measuring each sentence's impact by comparing outcomes across 100 rollouts with and without that sentence (a simplified code sketch follows this list)
Receiver head analysis: Identifying attention patterns where specific "broadcasting" sentences receive disproportionate attention from later reasoning steps
Attention suppression: Causally measuring logical connections by suppressing attention to specific sentences and observing downstream effects
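To make the first of these methods concrete, here is a minimal Python sketch of counterfactual resampling. It is a simplification rather than the paper's exact protocol: sample_completion and extract_answer are hypothetical stubs standing in for a real model call and an answer parser, and importance is collapsed to a single shift in the probability of the modal answer rather than the fuller distributional comparison described in the paper.

```python
from collections import Counter

# Hypothetical stubs (not from the paper's codebase): sample_completion(prefix)
# should return one sampled continuation of the reasoning trace from a model,
# and extract_answer(text) should parse the final answer out of that text.
def sample_completion(prefix: str) -> str:
    raise NotImplementedError("plug in a call to your reasoning model here")

def extract_answer(completion: str) -> str:
    raise NotImplementedError("plug in your answer-parsing logic here")

def answer_distribution(prefix: str, n_rollouts: int = 100) -> Counter:
    """Empirical distribution over final answers across n_rollouts samples."""
    return Counter(extract_answer(sample_completion(prefix))
                   for _ in range(n_rollouts))

def resampling_importance(question: str, sentences: list[str], i: int,
                          n_rollouts: int = 100) -> float:
    """Crude counterfactual importance of sentence i: how much the probability
    of the modal final answer shifts between rollouts continued from the prefix
    that includes sentence i and rollouts continued from the prefix without it."""
    prefix_without = question + " " + " ".join(sentences[:i])
    prefix_with = prefix_without + " " + sentences[i]

    dist_with = answer_distribution(prefix_with, n_rollouts)
    dist_without = answer_distribution(prefix_without, n_rollouts)

    # A large gap in how often the (with-sentence) modal answer appears under
    # the two conditions marks the sentence as a candidate thought anchor.
    top_answer, _ = dist_with.most_common(1)[0]
    return dist_with[top_answer] / n_rollouts - dist_without[top_answer] / n_rollouts
```

Note that this is the only one of the three methods that is purely black-box; receiver head analysis and attention suppression both require white-box access to the model's attention weights, so they are not sketched here.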
Key Findings: Models Often "Decide Then Justify"
The research reveals surprising patterns in how models actually reason:
Planning and uncertainty management sentences (like "Let me try a different approach...") consistently emerge as high-leverage decision points with the highest counterfactual importance
Explicit computations, despite seeming important, often have minimal causal impact on final outcomes
In decision-making scenarios, early analytical sentences often lock in decisions that subsequent reasoning merely rationalises—the model decides first, then justifies
Perhaps most encouragingly for interpretability research: all three methods converge on identifying the same critical sentences, despite using completely different approaches (black-box behavioural analysis and white-box mechanistic analysis).
Try It Yourself
The team has released an open-source tool at thought-anchors.com where you can visualise reasoning traces as directed acyclic graphs, with critical sentences highlighted and dependencies mapped. This isn't just for researchers—it's a practical tool for debugging AI reasoning and understanding model decisions.
Main Takeaways
Study distributions, not single rollouts: One reasoning trace doesn't capture the full picture; you need to analyse distributions over trajectories
Sentences are the right level of analysis: They replace intermediate activations as the natural unit for understanding reasoning
Some sentences matter far more than others: Planning and uncertainty management steps have the highest counterfactual importance, while normative self-preservation sentences are surprisingly fragile
Convergence across methods is promising: Black-box and white-box approaches identify the same critical sentences, validating the framework
Chain-of-thought has discoverable structure: Reasoning isn't flat; it has hierarchical organisation we can map and understand
An Urgent Timeline
Macar emphasised an important point: this interpretability window may close soon. As models move toward latent reasoning and ‘neuralese’ (thinking without explicit chains of thought), these legible reasoning traces may disappear. The field has limited time to develop tools and methods for understanding explicit reasoning before the next paradigm shift.
During Q&A, Macar discussed potential future directions: Can we train models to explicitly mark their most important reasoning steps? How can we preserve legible chain-of-thought even as capabilities advance? And, critically, we have very little visibility into how major AI labs are actually analysing reasoning traces, suggesting significant tooling gaps remain.
Why This Matters for AI Safety
Understanding which reasoning steps drive decisions versus which are post-hoc rationalisations is essential for AI safety and control. In scenarios involving potential deception or coercion, interventions should target the early analytical steps where decisions crystallise, not just the surface-level problematic statements. As Macar noted, interpretability can help steer models away from dangerous behaviours by identifying the causal nodes that actually produce those behaviours.
For anyone interested in AI safety, Macar emphasised this is an exciting time to enter the field—with opportunities spanning technical research, policy, governance, and more. The AI safety community is uniquely collaborative, united by working toward the same crucial goal.
Uzay’s full paper, "Thought Anchors: Which LLM Reasoning Steps Matter?", can be found here: https://arxiv.org/pdf/2506.19143
After the Q&A session there was time for mingling and more informal discussion.
Our Speaker Series is running every Saturday through the end of November, so we hope to see you there!