As AI systems become increasingly agentic, understanding how they make decisions becomes crucial for safety and control. This talk presents novel mechanistic interpretability methods for decomposing chain-of-thought reasoning in LLM agents to identify "thought anchors" - critical reasoning steps that disproportionately influence outcomes and guide subsequent reasoning trajectories.
Using three complementary attribution methods - counterfactual resampling, attention pattern analysis, and causal intervention - we map the hidden structure of LLM reasoning at the sentence level. Our research reveals surprising patterns: in decision-making scenarios (e.g., blackmail), models often "decide then justify," with early analytical sentences locking in decisions that subsequent reasoning merely rationalizes. In the mathematics domain, planning and uncertainty-management sentences consistently emerge as high-leverage points, while explicit computation sentences often have minimal causal impact.
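To make the counterfactual-resampling idea concrete, here is a minimal illustrative sketch (not the talk's actual implementation): it scores a single reasoning sentence by comparing the model's answer distribution when the chain of thought is resampled just before versus just after that sentence. The callables `generate_continuation` and `extract_answer` are hypothetical stand-ins for a sampled model call and an answer parser.

```python
# Illustrative sketch of sentence-level counterfactual resampling.
# `generate_continuation(prefix)` and `extract_answer(text)` are assumed helpers:
# the former samples a continuation of a partial chain of thought (temperature > 0),
# the latter parses the final answer from the completed reasoning.
from collections import Counter


def sentence_importance(prompt, cot_sentences, target_index,
                        generate_continuation, extract_answer, n_samples=20):
    """Estimate how strongly one sentence 'anchors' the final outcome.

    A large shift in the answer distribution when the sentence is included
    suggests it is a high-leverage "thought anchor".
    """
    def answer_distribution(prefix_sentences):
        prefix = prompt + " ".join(prefix_sentences)
        answers = Counter()
        for _ in range(n_samples):
            continuation = generate_continuation(prefix)
            answers[extract_answer(continuation)] += 1
        total = sum(answers.values())
        return {a: c / total for a, c in answers.items()}

    # Resample the rest of the reasoning without and with the target sentence.
    without = answer_distribution(cot_sentences[:target_index])
    with_it = answer_distribution(cot_sentences[:target_index + 1])

    # Total variation distance between the two answer distributions.
    keys = set(without) | set(with_it)
    return 0.5 * sum(abs(without.get(k, 0.0) - with_it.get(k, 0.0)) for k in keys)
```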
These findings have direct implications for AI agency and control. By identifying which reasoning steps actually drive decisions versus which are post-hoc rationalizations, we can better understand when and how AI systems exercise agency. In safety-critical scenarios like potential deception or coercion, our methods reveal that interventions should target early analytical steps rather than surface-level problematic statements.
I'll demonstrate our open-source tool (thought-anchors.com) for visualizing reasoning dependencies and discuss how this framework could inform future safeguards for agentic AI systems. Understanding the causal structure of AI reasoning is essential as we delegate more complex decisions to these systems - we need to know not just what they decide, but how those decisions crystallize during the reasoning process.