The AI Agent Index: What Deployments Reveal About Safety and Autonomy (Leon Staufer, MATS & University of Cambridge)
Discussions about AI agency often focus on theoretical capabilities or future risks, but what agents are actually deployed today? This talk presents findings from the AI Agent Index project, which provides an empirical foundation for understanding the current AI agent landscape and its implications for both architectural development and regulatory frameworks.
Through systematic analysis of ~20 high-impact deployed systems—from Microsoft 365 Copilot to Cursor to ServiceNow AI Agents—we examine what levels of autonomy these systems actually possess, what control mechanisms users retain, and how safety is implemented in practice.
Our research reveals significant variation in autonomy levels across deployed agents: some merely automate simple tasks with constant human oversight, while others engage in multi-step planning with varying degrees of independence. We map the spectrum of user control mechanisms, from required approval for every action to emergency stop capabilities, and document the diverse approaches developers take to safety implementation—though surprisingly, there's often limited public information about risk management practices and safety policies.
This empirical grounding is essential for productive discussions about AI agency. Rather than debating hypothetical scenarios, we can examine concrete evidence of how autonomous these systems actually are and where the real intervention points lie in the current ecosystem.
AI Agents and Morality: Towards Norm-Aligned Design Choices (Esther Jaromitski, Queen Mary University)
As artificial agents become more autonomous and embedded in decision-making, their growing capacity to persuade and influence humans raises profound questions about agency, autonomy, and freedom of thought. My proposed talk introduces the concept of “Norm-Aligned AI Agents”—agents designed not only to pursue goals but to internalise ethical and legal norms, including dignity, non-deception, and respect for cognitive liberty. Drawing on international criminal law, digital rights, and my research on “Systemic Command Responsibility,” I will outline how established legal principles can inform technical constraints on the persuasive capacities of agentic systems.
The talk will show how AI agents' ability to shape beliefs, emotions, and decisions makes the safeguarding of cognitive liberty an urgent priority. I will propose that recognising reciprocal duties between human and artificial agents—such as honesty, transparency, and respect for mental self-determination—is key to maintaining trust and autonomy in an environment of increasingly powerful persuasion engines.
This vision bridges regulation and technical design, offering practical pathways for implementing norm-aligned persuasion boundaries within multi-agent settings. Participants will be invited to debate how such standards could be operationalised, what safeguards are most urgent, and whether norm-aligned AI agents represent a realistic path toward trustworthy, human-aligned agency that protects human cognitive liberty.
Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments (Chris Schnabl, Pivotal Labs & University of Cambridge)
As AI systems shift from static predictors to increasingly autonomous agents, the challenge is not only how they act, but how we can trust and govern those actions. My recent work on Attestable Audits (ICML 2025) introduces a method for making AI evaluations verifiable: using secure hardware, we can cryptographically guarantee that a model’s safety benchmarks are genuine and tamper-proof. This creates the foundation for external oversight that does not rely solely on developer claims.
Looking ahead, the same principles must extend from models to agents. Future systems will plan, delegate, and interact in multi-agent environments which makes accountability more complex, but also more urgent. I will outline how cryptographic attestations, agent identity, and provenance tracking can provide new governance tools, ensuring that agents remain transparent, monitorable, and aligned with human values. By embedding verifiability at the infrastructure level, we can move toward a framework of accountable agency that supports both innovation and safety in the age of autonomous AI.
Anti-Scheming: how would we train a model not to scheme? (Bronson Schoen and Marius Hobbhahn, Apollo Research)
Bronson Schoen (senior research engineer, Apollo Research) and Marius Hobbhahn (CEO, Apollo Research) will present Apollo’s & OpenAI’s joint work: ‘Stress Testing Deliberative Alignment for Anti-Scheming Training’. They attempted to use deliberative alignment to teach o3 and o4-mini general principles to not be deceptive.
Bronson and Marius will talk about their findings, including: a) how well it works, b) effects on the situational awareness of the models, c) that the chain-of-thought of o3 is not always human-interpretable anymore, and more.
Evaluating harmful agents: learnings from AgentHarm and follow-up work (Mateusz Dziemian, Gray Swan AI)
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at this https URL.
Are we testing the right things in multi-agent AI? (Ruchira Dhar, University of Copenhagen)
With the rise of large language models (LLMs), multi-agent systems (MASs) are becoming increasingly popular and getting deployed for multiple tasks. Evaluating their performance has become critical for both capability assessment and AI governance. In this talk I take a look at the evaluation approaches in the field and propose the need for a more principled evaluation framework for MASs.
This talk presents a critique of current MAS evaluation approaches along three key axes. First, task selection is ad hoc and benchmarks often aggregate unrelated games without grounding in a theory of cooperation. As a result, existing benchmarks lack a principled foundation for what constitutes cooperative capability and why it matters. Second, metrics collapse distinct capacities into win rates in games while conflating success with the presence or absence of a capability. Third, governance-relevant behaviours are often neglected with little attention to significant risk factors like collusion or value misalignment.
In light of such concerns, I propose the need for a new evaluation paradigm grounded in Cooperative AI theory. Such a framework would ideally decompose cooperation into core cognitive-social capacities that are grounded in theory with task designs that make the capability to task mapping explicit. Moreover, any such framework should include evaluations that reflect both agent capability and systemic safety concerns relevant to AI governance. Such a structured evaluation framework would be essential if we are to set meaningful safety and capability standards for real-world multi-agent deployments in the near future.
Assessing Harm in AI Agents: What Questionnaires Miss (Max Hellrigel-Holderbaum, FAU Erlangen-Nürnberg)
As AI systems advance in their capabilities, it is quickly becoming paramount to measure their safety and alignment to human values. A fast-growing field of AI research is devoted to developing such evaluation methods. However, most current advances in this domain are of doubtful quality. Standard methods typically prompt large language models (LLMs) in a questionnaire-style to describe their values or how they would behave in hypothetical scenarios. Because current assessments focus on unaugmented LLMs, they fall short in evaluating AI agents which are expected to pose the greatest risks. The space of an LLM's responses to questionnaire-style prompts is extremely narrow compared to the space of inputs, possible actions, and continuous interactions of AI agents with their environment in realistic scenarios, and hence unlikely to be representative thereof. We further contend that such assessments make strong, unfounded assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. This makes them inadequate to assess risks from AI systems in real-world contexts as they lack the necessary ecological validity. We then argue that a structurally identical issue holds for current approaches to AI alignment. These are similarly applied only to LLMs rather than AI agents, hence neglecting the most critical risk scenarios. Lastly, we discuss how to improve both safety assessments and alignment training by taking these shortcomings to heart while satisfying practical constraints.
Chain-of-thought interpretability for AI Agents (Uzay Macar, MATS)
As AI systems become increasingly agentic, understanding how they make decisions becomes crucial for safety and control. This talk presents novel mechanistic interpretability methods for decomposing chain-of-thought reasoning in LLM agents to identify "thought anchors" - critical reasoning steps that disproportionately influence outcomes and guide subsequent reasoning trajectories.
Using three complementary attribution methods - counterfactual resampling, attention pattern analysis, and causal intervention - we map the hidden structure of LLM reasoning at the sentence level. Our research reveals surprising patterns: models often "decide then justify," in decision-making scenarios (e.g., blackmail) with early analytical sentences locking in decisions that subsequent reasoning merely rationalizes. Planning and uncertainty management sentences consistently emerge as high-leverage points in the mathematics domain, while explicit computation often have minimal causal impact.
These findings have direct implications for AI agency and control. By identifying which reasoning steps actually drive decisions versus which are post-hoc rationalizations, we can better understand when and how AI systems exercise agency. In safety-critical scenarios like potential deception or coercion, our methods reveal that interventions should target early analytical steps rather than surface-level problematic statements.
I'll demonstrate our open-source tool (thought-anchors.com) for visualizing reasoning dependencies and discuss how this framework could inform future safeguards for agentic AI systems. Understanding the causal structure of AI reasoning is essential as we delegate more complex decisions to these systems - we need to know not just what they decide, but how those decisions crystallize during the reasoning process.