With the rise of large language models (LLMs), multi-agent systems (MASs) are becoming increasingly popular and are being deployed across a growing range of tasks. Evaluating their performance has become critical for both capability assessment and AI governance. In this talk I examine current evaluation approaches in the field and argue for a more principled evaluation framework for MASs.
This talk presents a critique of current MAS evaluation approaches along three key axes. First, task selection is ad hoc: benchmarks often aggregate unrelated games without grounding in a theory of cooperation, and as a result lack a principled account of what constitutes cooperative capability and why it matters. Second, metrics collapse distinct capacities into a single win rate, conflating task success with the presence or absence of a given capability. Third, governance-relevant behaviours are often neglected, with little attention paid to significant risk factors such as collusion or value misalignment.
In light of these concerns, I argue for a new evaluation paradigm grounded in Cooperative AI theory. Such a framework would decompose cooperation into theoretically grounded cognitive-social capacities, with task designs that make the capability-to-task mapping explicit. It should also include evaluations that reflect both agent capability and the systemic safety concerns relevant to AI governance. A structured evaluation framework of this kind will be essential if we are to set meaningful safety and capability standards for real-world multi-agent deployments in the near future.