As AI systems advance in their capabilities, it is becoming paramount to measure their safety and alignment with human values. A fast-growing field of AI research is devoted to developing such evaluation methods. However, most current approaches in this domain are of doubtful quality. Standard methods typically prompt large language models (LLMs) with questionnaire-style questions asking them to describe their values or how they would behave in hypothetical scenarios. Because current assessments focus on unaugmented LLMs, they fall short in evaluating AI agents, which are expected to pose the greatest risks. The space of an LLM's responses to questionnaire-style prompts is extremely narrow compared to the space of inputs, possible actions, and continuous environment interactions that AI agents face in realistic scenarios, and is hence unlikely to be representative of them. We further contend that such assessments make strong, unfounded assumptions about the ability and tendency of LLMs to accurately report their counterfactual behavior. This makes them inadequate for assessing risks from AI systems in real-world contexts, as they lack the necessary ecological validity. We then argue that a structurally identical issue affects current approaches to AI alignment: these are likewise applied only to LLMs rather than to AI agents, thereby neglecting the most critical risk scenarios. Lastly, we discuss how both safety assessments and alignment training can be improved by taking these shortcomings to heart while satisfying practical constraints.