Evaluating harmful agents: learnings from AgentHarm and follow-up work (Mateusz Dziemian, Gray Swan AI)

Saturday 1 November 2025
16:00 to 18:00

Lightfoot Room, Old Divinity School, St John's College, 2 All Saints Passage, Cambridge, England, CB2 3LS, United Kingdom

Register for the event here

The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at this https URL.

Mission
Contact us at contact@ukaiforum.com

UK AI Forum is proud to be sponsored by Meridian Cambridge.

Are we testing the right things in multi-agent AI? (Ruchira Dhar, University of Copenhagen)

Anti-Scheming: how would we train a model not to scheme? (Bronson Schoen and Marius Hobbhahn, Apollo Research)