As AI systems become increasingly agentic, understanding how they make decisions becomes crucial for safety and control. This talk presents novel mechanistic interpretability methods for decomposing chain-of-thought reasoning in LLM agents to identify "thought anchors" - critical reasoning steps that disproportionately influence outcomes and guide subsequent reasoning trajectories.
Using three complementary attribution methods - counterfactual resampling, attention pattern analysis, and causal intervention - we map the hidden structure of LLM reasoning at the sentence level. Our research reveals surprising patterns: in decision-making scenarios (e.g., blackmail), models often "decide then justify," with early analytical sentences locking in decisions that subsequent reasoning merely rationalizes. In the mathematics domain, planning and uncertainty-management sentences consistently emerge as high-leverage points, while explicit computation sentences often have minimal causal impact.
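To make the counterfactual-resampling idea concrete, here is a minimal illustrative sketch (not the talk's actual implementation): it scores a single reasoning sentence by comparing the model's answer distribution when the chain of thought is resampled just before versus just after that sentence. The callables `generate_continuation` and `extract_answer` are hypothetical stand-ins for a sampled model call and an answer parser.

```python
# Illustrative sketch of sentence-level counterfactual resampling.
# `generate_continuation(prefix)` and `extract_answer(text)` are assumed helpers:
# the former samples a continuation of a partial chain of thought (temperature > 0),
# the latter parses the final answer from the completed reasoning.
from collections import Counter


def sentence_importance(prompt, cot_sentences, target_index,
                        generate_continuation, extract_answer, n_samples=20):
    """Estimate how strongly one sentence 'anchors' the final outcome.

    A large shift in the answer distribution when the sentence is included
    suggests it is a high-leverage "thought anchor".
    """
    def answer_distribution(prefix_sentences):
        prefix = prompt + " ".join(prefix_sentences)
        answers = Counter()
        for _ in range(n_samples):
            continuation = generate_continuation(prefix)
            answers[extract_answer(continuation)] += 1
        total = sum(answers.values())
        return {a: c / total for a, c in answers.items()}

    # Resample the rest of the reasoning without and with the target sentence.
    without = answer_distribution(cot_sentences[:target_index])
    with_it = answer_distribution(cot_sentences[:target_index + 1])

    # Total variation distance between the two answer distributions.
    keys = set(without) | set(with_it)
    return 0.5 * sum(abs(without.get(k, 0.0) - with_it.get(k, 0.0)) for k in keys)
```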
These findings have direct implications for AI agency and control. By identifying which reasoning steps actually drive decisions versus which are post-hoc rationalizations, we can better understand when and how AI systems exercise agency. In safety-critical scenarios like potential deception or coercion, our methods reveal that interventions should target early analytical steps rather than surface-level problematic statements.
I'll demonstrate our open-source tool (thought-anchors.com) for visualizing reasoning dependencies and discuss how this framework could inform future safeguards for agentic AI systems. Understanding the causal structure of AI reasoning is essential as we delegate more complex decisions to these systems - we need to know not just what they decide, but how those decisions crystallize during the reasoning process.