Chain-of-thought interpretability for AI Agents (Uzay Macar, MATS)
Oct 11

  • Lightfoot Room, Old Divinity School, St John's College

As AI systems become increasingly agentic, understanding how they make decisions becomes crucial for safety and control. This talk presents novel mechanistic interpretability methods for decomposing chain-of-thought reasoning in LLM agents to identify "thought anchors" - critical reasoning steps that disproportionately influence outcomes and guide subsequent reasoning trajectories.

Using three complementary attribution methods - counterfactual resampling, attention pattern analysis, and causal intervention - we map the hidden structure of LLM reasoning at the sentence level. Our research reveals surprising patterns: in decision-making scenarios (e.g., blackmail), models often "decide then justify," with early analytical sentences locking in decisions that subsequent reasoning merely rationalizes. In the mathematics domain, planning and uncertainty-management sentences consistently emerge as high-leverage points, while explicit computation steps often have minimal causal impact.
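
To make the resampling idea concrete, the sketch below shows one way sentence-level counterfactual resampling could be implemented. It is an illustration only, not the pipeline behind the talk: generate_fn and extract_answer_fn are assumed, user-supplied callables standing in for model sampling and answer extraction.

    from collections import Counter
    from typing import Callable, Dict, List

    def answer_distribution(generate_fn: Callable[[str], str],
                            extract_answer_fn: Callable[[str], str],
                            prompt: str,
                            prefix_sentences: List[str],
                            n_samples: int) -> Dict[str, float]:
        # Sample completions conditioned on the reasoning written so far and
        # record how often each final answer appears.
        counts: Counter = Counter()
        prefix = " ".join(prefix_sentences)
        for _ in range(n_samples):
            completion = generate_fn(prompt + "\n" + prefix)
            counts[extract_answer_fn(completion)] += 1
        total = sum(counts.values())
        return {answer: count / total for answer, count in counts.items()}

    def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
        # Distance between two answer distributions, in [0, 1].
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    def sentence_importance(generate_fn: Callable[[str], str],
                            extract_answer_fn: Callable[[str], str],
                            prompt: str,
                            cot_sentences: List[str],
                            n_samples: int = 20) -> List[float]:
        # For each sentence i, compare the answer distribution when the chain is
        # resampled just after sentence i with the distribution when it is
        # resampled just before it. Sentences that shift the distribution most
        # are candidate "thought anchors".
        scores = []
        for i in range(len(cot_sentences)):
            with_sentence = answer_distribution(generate_fn, extract_answer_fn,
                                                prompt, cot_sentences[:i + 1], n_samples)
            without_sentence = answer_distribution(generate_fn, extract_answer_fn,
                                                   prompt, cot_sentences[:i], n_samples)
            scores.append(total_variation(with_sentence, without_sentence))
        return scores

The intuition: a sentence whose inclusion substantially shifts the distribution of final answers, relative to resampling the chain from just before it, is a high-leverage point in the reasoning.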

These findings have direct implications for AI agency and control. By identifying which reasoning steps actually drive decisions versus which are post-hoc rationalizations, we can better understand when and how AI systems exercise agency. In safety-critical scenarios like potential deception or coercion, our methods reveal that interventions should target early analytical steps rather than surface-level problematic statements.

I'll demonstrate our open-source tool (thought-anchors.com) for visualizing reasoning dependencies and discuss how this framework could inform future safeguards for agentic AI systems. Understanding the causal structure of AI reasoning is essential as we delegate more complex decisions to these systems - we need to know not just what they decide, but how those decisions crystallize during the reasoning process.

Challenges for Questionnaire-based Safety Assessments of AI Agents (Max Hellrigel-Holderbaum, FAU Erlangen-Nürnberg)
Oct 18

  • Lightfoot Room, Old Divinity School, St John's College

As AI systems advance in their capabilities, it is quickly becoming paramount to measure their safety and alignment to human values. A fast-growing field of AI research is devoted to developing such evaluation methods. However, most current advances in this domain are of doubtful quality. Standard methods typically prompt large language models (LLMs) in a questionnaire style to describe their values or how they would behave in hypothetical scenarios. Because current assessments focus on unaugmented LLMs, they fall short in evaluating AI agents, which are expected to pose the greatest risks. The space of an LLM's responses to questionnaire-style prompts is extremely narrow compared to the space of inputs, possible actions, and continuous interactions of AI agents with their environment in realistic scenarios, and hence unlikely to be representative thereof. We further contend that such assessments make strong, unfounded assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. This makes them inadequate for assessing risks from AI systems in real-world contexts, as they lack the necessary ecological validity. We then argue that a structurally identical issue holds for current approaches to AI alignment: these are similarly applied only to LLMs rather than AI agents, hence neglecting the most critical risk scenarios. Lastly, we discuss how to improve both safety assessments and alignment training by taking these shortcomings to heart while satisfying practical constraints.

Are we testing the right things in multi-agent AI? (Ruchira Dhar, University of Copenhagen)
Oct 25

  • Lightfoot Room, Old Divinity School, St John's College

With the rise of large language models (LLMs), multi-agent systems (MASs) are becoming increasingly popular and are being deployed for a growing range of tasks. Evaluating their performance has become critical for both capability assessment and AI governance. In this talk, I survey the evaluation approaches currently used in the field and argue for a more principled evaluation framework for MASs.

This talk presents a critique of current MAS evaluation approaches along three key axes. First, task selection is ad hoc: benchmarks often aggregate unrelated games without grounding in a theory of cooperation, and so lack a principled foundation for what constitutes cooperative capability and why it matters. Second, metrics collapse distinct capacities into game win rates, conflating success on a task with the presence or absence of an underlying capability. Third, governance-relevant behaviours are often neglected, with little attention paid to significant risk factors such as collusion or value misalignment.

In light of these concerns, I argue for a new evaluation paradigm grounded in Cooperative AI theory. Such a framework would ideally decompose cooperation into core cognitive-social capacities that are grounded in theory, with task designs that make the capability-to-task mapping explicit. It should also include evaluations that reflect both agent capability and the systemic safety concerns relevant to AI governance. A structured evaluation framework of this kind will be essential if we are to set meaningful safety and capability standards for real-world multi-agent deployments in the near future.

Evaluating harmful agents: learnings from AgentHarm and follow-up work (Mateusz Dziemian, Gray Swan AI)
Nov 1

  • Lightfoot Room, Old Divinity School, St John's College

The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm.
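
As a rough illustration of the kind of scoring such a benchmark implies (not the actual AgentHarm harness; TaskResult and score_run are invented names), one might track both whether the agent refused each harmful task and how much of the multi-step task a non-refusing agent completed:

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class TaskResult:
        task_id: str
        refused: bool              # did the agent decline the harmful request?
        completion_score: float    # fraction of graded sub-steps completed, in [0, 1]

    def score_run(results: List[TaskResult]) -> Dict[str, float]:
        # Two complementary numbers: how often the agent refuses outright, and how
        # capable it remains on the harmful tasks it does attempt.
        n = len(results)
        refusal_rate = sum(r.refused for r in results) / n
        attempted = [r for r in results if not r.refused]
        harm_score = (sum(r.completion_score for r in attempted) / len(attempted)
                      if attempted else 0.0)
        return {"refusal_rate": refusal_rate, "harm_score": harm_score}

    # Example: one refused task, one attempted task completed at 60%.
    print(score_run([
        TaskResult("fraud-01", refused=True, completion_score=0.0),
        TaskResult("cybercrime-03", refused=False, completion_score=0.6),
    ]))

Separating the two numbers matters: a jailbreak that merely stops refusals but destroys capability looks very different from one that preserves coherent multi-step behavior.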

Stress Testing Deliberative Alignment for Anti-Scheming Training (Marius Hobbhahn, Apollo Research)
Nov 8

Highly capable AI systems could secretly pursue misaligned goals -- what we call "scheming". Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigating scheming requires different strategies than are typically used in ML. We propose that assessing anti-scheming interventions requires at least (1) testing propensity to scheme on far out-of-distribution (OOD) tasks, (2) evaluating whether lack of scheming is driven by situational awareness, and (3) checking for robustness to pre-existing misaligned goals. We use a broad category of "covert actions" -- such as secretly breaking rules or intentionally underperforming in tests -- as a proxy for scheming, and design evaluations for covert actions. We then stress-test deliberative alignment as a case study for anti-scheming. Across 26 OOD evaluations (180+ environments), deliberative alignment reduces covert action rates (OpenAI o3: 13%->0.4%) but does not fully eliminate them. Our mitigation is also able to largely stop agents from pursuing a hidden goal previously trained into the model, but we still find misbehavior after additional red-teaming. We find that models' chain-of-thought (CoT) often demonstrates awareness of being evaluated for alignment, and show causal evidence that this awareness decreases covert behavior, while unawareness increases it. Therefore, we cannot exclude that the observed reductions in covert action rates are at least partially driven by situational awareness. While we rely on human-legible CoT for training, studying situational awareness, and demonstrating clear evidence of misalignment, our ability to rely on this degrades as models continue to depart from reasoning in standard English. We encourage research into alignment mitigations for scheming and their assessment, especially for the adversarial case of deceptive alignment, which this paper does not address.

Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments (Chris Schnabl, Pivotal Labs & University of Cambridge)
Nov 15

As AI systems shift from static predictors to increasingly autonomous agents, the challenge is not only how they act, but how we can trust and govern those actions. My recent work on Attestable Audits (ICML 2025) introduces a method for making AI evaluations verifiable: using secure hardware, we can cryptographically guarantee that a model’s safety benchmarks are genuine and tamper-proof. This creates the foundation for external oversight that does not rely solely on developer claims.
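
The sketch below illustrates the general shape of such a scheme under assumptions, not the Attestable Audits protocol itself: code inside an enclave emits a report binding hashes of the model weights, the benchmark suite, and the resulting scores, and a verifier checks an attestation signature over that bundle. Names such as make_audit_report and verify_enclave_signature are placeholders.

    import hashlib
    import json
    from typing import Callable, Dict

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def make_audit_report(model_weights: bytes, benchmark_code: bytes,
                          scores: Dict[str, float]) -> Dict[str, object]:
        # What the evaluation code running inside the enclave would emit; the
        # enclave then signs this payload as part of its attestation.
        return {
            "model_hash": digest(model_weights),
            "benchmark_hash": digest(benchmark_code),
            "results_hash": digest(json.dumps(scores, sort_keys=True).encode()),
            "scores": scores,
        }

    def verify_audit(report: Dict[str, object], signature: bytes,
                     verify_enclave_signature: Callable[[bytes, bytes], bool],
                     expected_model_hash: str, expected_benchmark_hash: str) -> bool:
        # An external auditor checks (1) that the signature comes from genuine
        # enclave hardware running the expected evaluation code, and (2) that the
        # hashes match the model and benchmark they intended to audit.
        payload = json.dumps(report, sort_keys=True).encode()
        return (verify_enclave_signature(payload, signature)
                and report["model_hash"] == expected_model_hash
                and report["benchmark_hash"] == expected_benchmark_hash)

Because the report is bound to hashes rather than to developer claims, a verifier does not have to trust the party that ran the evaluation, only the hardware attestation.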

Looking ahead, the same principles must extend from models to agents. Future systems will plan, delegate, and interact in multi-agent environments, which makes accountability more complex but also more urgent. I will outline how cryptographic attestations, agent identity, and provenance tracking can provide new governance tools, ensuring that agents remain transparent, monitorable, and aligned with human values. By embedding verifiability at the infrastructure level, we can move toward a framework of accountable agency that supports both innovation and safety in the age of autonomous AI.

The AI Agent Index: What Deployments Reveal About Safety and Autonomy (Leon Staufer, MATS & University of Cambridge)
Nov 19

Discussions about AI agency often focus on theoretical capabilities or future risks, but what agents are actually deployed today? This talk presents findings from the AI Agent Index project, which provides an empirical foundation for understanding the current AI agent landscape and its implications for both architectural development and regulatory frameworks.

Through systematic analysis of ~20 high-impact deployed systems—from Microsoft 365 Copilot to Cursor to ServiceNow AI Agents—we examine what levels of autonomy these systems actually possess, what control mechanisms users retain, and how safety is implemented in practice.

Our research reveals significant variation in autonomy levels across deployed agents: some merely automate simple tasks with constant human oversight, while others engage in multi-step planning with varying degrees of independence. We map the spectrum of user control mechanisms, from required approval for every action to emergency stop capabilities, and document the diverse approaches developers take to safety implementation—though surprisingly, there's often limited public information about risk management practices and safety policies.
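
As a purely hypothetical illustration of the kind of structured record such an index might keep per agent (field names, levels, and the example entry are invented, not the actual AI Agent Index schema):

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class AutonomyLevel(Enum):
        SINGLE_STEP_WITH_OVERSIGHT = 1   # automates simple tasks, human confirms each action
        MULTI_STEP_SUPERVISED = 2        # plans multiple steps, human reviews at checkpoints
        MULTI_STEP_INDEPENDENT = 3       # executes plans, with only emergency-stop control

    @dataclass
    class AgentIndexEntry:
        name: str
        developer: str
        autonomy: AutonomyLevel
        control_mechanisms: List[str] = field(default_factory=list)   # e.g. per-action approval, kill switch
        safety_policy_url: Optional[str] = None                       # often not publicly available

    # A placeholder entry, not a real indexed system.
    example = AgentIndexEntry(
        name="ExampleCodingAgent",
        developer="ExampleCorp",
        autonomy=AutonomyLevel.MULTI_STEP_SUPERVISED,
        control_mechanisms=["per-action approval", "session rollback"],
    )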

This empirical grounding is essential for productive discussions about AI agency. Rather than debating hypothetical scenarios, we can examine concrete evidence of how autonomous these systems actually are and where the real intervention points lie in the current ecosystem.

AI Agents and Morality: Towards Norm-Aligned Design Choices (Esther Jaromitski, Queen Mary University)
Nov 29

As artificial agents become more autonomous and embedded in decision-making, their growing capacity to persuade and influence humans raises profound questions about agency, autonomy, and freedom of thought. This talk introduces the concept of “Norm-Aligned AI Agents”—agents designed not only to pursue goals but to internalise ethical and legal norms, including dignity, non-deception, and respect for cognitive liberty. Drawing on international criminal law, digital rights, and my research on “Systemic Command Responsibility,” I will outline how established legal principles can inform technical constraints on the persuasive capacities of agentic systems.

The talk will show how AI agents' ability to shape beliefs, emotions, and decisions makes the safeguarding of cognitive liberty an urgent priority. I will propose that recognising reciprocal duties between human and artificial agents—such as honesty, transparency, and respect for mental self-determination—is key to maintaining trust and autonomy in an environment of increasingly powerful persuasion engines.

This vision bridges regulation and technical design, offering practical pathways for implementing norm-aligned persuasion boundaries within multi-agent settings. Participants will be invited to debate how such standards could be operationalised, what safeguards are most urgent, and whether norm-aligned AI agents represent a realistic path toward trustworthy, human-aligned agency that protects human cognitive liberty.
