With the rise of large language models (LLMs), multi-agent systems (MASs) are becoming increasingly popular and are being deployed across a growing range of tasks. Evaluating their performance has become critical for both capability assessment and AI governance. In this talk I examine current evaluation approaches in the field and argue for a more principled evaluation framework for MASs.
This talk presents a critique of current MAS evaluation approaches along three key axes. First, task selection is ad hoc: benchmarks often aggregate unrelated games without grounding in a theory of cooperation, and as a result lack a principled account of what constitutes cooperative capability and why it matters. Second, metrics collapse distinct capacities into a single win rate, conflating task success with the presence or absence of a given capability. Third, governance-relevant behaviours are often neglected, with little attention paid to significant risk factors such as collusion or value misalignment.
In light of these concerns, I argue for a new evaluation paradigm grounded in Cooperative AI theory. Such a framework would decompose cooperation into theoretically grounded cognitive-social capacities, with task designs that make the capability-to-task mapping explicit. It should also include evaluations that reflect both agent capability and the systemic safety concerns relevant to AI governance. A structured evaluation framework of this kind will be essential if we are to set meaningful safety and capability standards for real-world multi-agent deployments in the near future.