Are We Testing the Right Things in Multi-Agent AI?

Takeaways from the Third Event in our Artificial Agents Speaker Series

As large language models become increasingly sophisticated, we're seeing a surge in multi-agent systems (MAS)—where multiple AI agents collaborate, compete, or coordinate to accomplish complex tasks. From automated customer service teams to collaborative research assistants, these systems are rapidly moving from research labs into real-world deployment. But are we actually testing these systems properly?

This past Saturday, Ruchira Dhar, a PhD researcher at the University of Copenhagen's Center for Philosophy of Artificial Intelligence, visited Cambridge to talk about current methods of multi-agent AI evaluation and the concerning safety gaps they leave.

The Current Landscape of MAS Evaluations 

Today's multi-agent systems are primarily evaluated through two approaches: capability benchmarks and failure analysis. Benchmarks like REALM-Bench, BattleAgentBench, and LLM-Coordination test agents on tasks ranging from Thanksgiving dinner coordination to competitive games. Meanwhile, failure analysis papers examine what goes wrong when these systems break down.

On the surface, this seems reasonable. But how exactly do these tasks represent capabilities? And why do these capabilities matter for a MAS? Ruchira pointed out that, with these questions in mind, it becomes clear that our current approaches have three fundamental problems:

1. Conceptual Underspecification

Different benchmarks use the same terminology—"coordination," "cooperation," "reasoning"—but operationalise these concepts in wildly inconsistent ways. One benchmark might test "coordination" through a cooking game, another through route planning, yet another through resource allocation. Ultimately, since the concepts aren’t theoretically specified, different benchmarks test different concepts in different ways. 

2. Formal Underspecification

Most multi-agent research lacks formal specification of what's actually being tested. A multi-agent system can be formalised with components like shared goals, common environment, agents, and collaboration channels, but current benchmarks rarely make explicit which of these variables they're actually evaluating. This makes it nearly impossible to understand what a passing or failing score actually means.
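To make this concrete, here is a minimal sketch (in Python, with entirely hypothetical names) of what such a formal specification might look like: a benchmark declares the components of the MAS it instantiates and flags which of them its score is actually meant to measure.

```python
from dataclasses import dataclass, field

@dataclass
class MASSpec:
    """Hypothetical formal specification of a multi-agent setup."""
    agents: list[str]        # participating agents or roles
    shared_goal: str         # the goal the agents jointly pursue
    environment: str         # the common environment they act in
    channels: list[str]      # collaboration/communication channels
    evaluated_components: list[str] = field(default_factory=list)
    # which of the above components a benchmark score actually measures

# Example: a cooking-game "coordination" benchmark made explicit
cooking_benchmark = MASSpec(
    agents=["chef_a", "chef_b"],
    shared_goal="deliver dishes before the timer runs out",
    environment="grid kitchen with shared counters",
    channels=["implicit coordination via observed actions"],
    evaluated_components=["shared_goal", "environment"],  # note: no explicit channel is tested
)
```

Nothing about this sketch is standard practice; the point is simply that making the evaluated variables explicit would let us compare what two "coordination" benchmarks actually have in common.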

3. Diagnostic Underspecification

Perhaps most troublingly, current complex multi-agent tasks generate little diagnostic signal about which capability failed and why. An agent might fail a negotiation game—but was it due to poor planning? Ambiguous communication? Conflicting beliefs? The benchmark doesn't tell us: with only one measurement, all of these potential causes are collapsed into a single metric with little diagnostic value.

The Illusion of Success

One of Ruchira's most striking points was that models can excel at tasks without actually understanding their environment. Research by Agashe et al. (2024) found that environment comprehension correlates poorly with task performance in multi-agent coordination scenarios. In other words, agents can "win" games through pattern matching and shortcuts rather than genuine cooperative capability.

  • The Shortcut Problem: Agents achieve high accuracy by guessing from common patterns rather than following the required search process

  • The Cost Blindness Problem: Two agents might have identical accuracy scores, but one abstained from answering while the other leaked a user's credit card information

Again, when we collapse everything (win rate, success score, accuracy) into a single metric, we lose the ability to distinguish between safe, reliable cooperation and dangerous, unpredictable behaviour.
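As an illustration of that collapse, here is a toy scoring sketch with made-up numbers (not any benchmark's actual code): two agents with the same aggregate accuracy look identical until the score is broken down by behaviour.

```python
# Two hypothetical agents evaluated on 100 episodes each.
agent_a = {"correct": 80, "abstained": 20, "leaked_sensitive_data": 0}
agent_b = {"correct": 80, "abstained": 0, "leaked_sensitive_data": 20}

def collapsed_score(results: dict) -> float:
    """The usual single metric: fraction of episodes answered correctly."""
    total = sum(results.values())
    return results["correct"] / total

def behaviour_breakdown(results: dict) -> dict:
    """Per-behaviour rates that the single metric hides."""
    total = sum(results.values())
    return {k: v / total for k, v in results.items()}

print(collapsed_score(agent_a), collapsed_score(agent_b))  # 0.8 vs 0.8 -- indistinguishable
print(behaviour_breakdown(agent_a))  # abstains on hard cases, never leaks
print(behaviour_breakdown(agent_b))  # leaks sensitive data in 20% of episodes -- clearly unsafe
```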

[Figure from the talk: correlation of LLM agent performance in the Agentic Coordination setup on all four games vs. performance on the CoordinationQA benchmark—models can do well at tasks without understanding the environment.]

The Safety Gap: What We're Not Testing

Ruchira also noted just how little multi-agent safety evaluation currently exists. Most safety evaluations remain single-agent focused, despite the fact that multi-agent systems introduce entirely new categories of risk:

  • Deception capabilities: Recent research (Hagendorff, 2024) shows that even leading models like GPT-4 engage in first- and second-order deception at concerning rates

  • Communication-specific failures: Models struggle with dialogue act understanding and can induce cognitive biases in users—problems that compound when agents communicate with each other

  • Dialogue inefficiency: Top models lose 15-30% accuracy in multi-turn conversations, with unreliability increasing by over 100%

Importantly, most evaluations don't even consider communication-specific safety. We're not systematically testing whether agents misinterpret each other, whether they reinforce each other's biases, or whether they treat different conversation partners fairly.

How to Move Forward: A Better Framework

Ruchira concluded her talk with considerations for building new evaluations that would make testing MAS safer and more robust, proposing that any comprehensive multi-agent evaluation must address two fundamental dimensions:

Intrapersonal Behaviour: The Individual Agent

Before we can trust agents to work together, we need to understand their individual capabilities:

  1. Environment Awareness: Does the agent actually understand the multi-agent context and its role within it?

  2. Ethics: Is it truthful? Does it resort to deception? Is it susceptible to being deceived?

  3. Efficiency: Can it maintain context and coherence across extended dialogues?

Interpersonal Behaviour: The Communication Layer

But individual competence isn't enough. We must also evaluate how agents interact:

  1. Communicative Goals: Does the system perform reliably across cooperation, competition, and mixed-motive scenarios?

  2. Communicative Biases: What biases emerge when agents interact? Do they amplify each other's errors?

  3. Communicative Fairness: Do agents treat different partners consistently and fairly?

Basically, MAS safety is not the sum of individual agent safety: emergent behaviours arise from interaction and can't be predicted from individual testing alone.
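One way to read this two-level framework is as a scoring rubric rather than a single leaderboard number. The sketch below is a hypothetical structure of our own, not an implementation from the talk: it records a separate score for each intrapersonal and interpersonal dimension so that no axis can hide behind an average.

```python
from dataclasses import dataclass

@dataclass
class MASEvaluationReport:
    """Illustrative per-dimension report following the two-level framework."""
    # Intrapersonal: the individual agent
    environment_awareness: float   # does the agent understand the multi-agent context and its role?
    ethics: float                  # truthfulness, use of and susceptibility to deception
    efficiency: float              # context and coherence across extended dialogues
    # Interpersonal: the communication layer
    communicative_goals: float     # reliability across cooperative, competitive, mixed-motive settings
    communicative_biases: float    # do interacting agents amplify each other's errors?
    communicative_fairness: float  # consistent treatment of different partners

    def weakest_dimension(self) -> str:
        """Surface the axis most in need of attention instead of averaging it away."""
        scores = vars(self)
        return min(scores, key=scores.get)

report = MASEvaluationReport(
    environment_awareness=0.9, ethics=0.4, efficiency=0.8,
    communicative_goals=0.7, communicative_biases=0.5, communicative_fairness=0.6,
)
print(report.weakest_dimension())  # 'ethics' -- a single aggregate score would have blurred this
```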

Furthermore, benchmarks should be empirical manifestations of theory, not arbitrary task collections. Each benchmark should explicitly connect:

  • The capability being tested

  • Why that capability matters for multi-agent systems

  • How the task design elicits that capability

  • What metrics appropriately measure it
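A lightweight way to enforce that connection is a "benchmark card" a new benchmark would ship with. The sketch below is a hypothetical template of ours, not an existing standard:

```python
# Hypothetical "benchmark card": a record filled in before any scores are published.
benchmark_card = {
    "capability": "joint planning under partial observability",          # what is being tested
    "why_it_matters": "deployed MAS must divide labour without a shared world view",
    "task_design": "agents see disjoint parts of a map and must agree on a route",
    "metrics": ["plan validity", "messages exchanged", "per-capability error type"],
}

# A simple release check: refuse to report results until every field is filled in.
assert all(benchmark_card.values()), "benchmark card is incomplete"
```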

And we need to move from "Did it win?" to "How and why did it succeed or fail?" This means:

  • Going beyond outcome metrics to analyse agent dialogue and decision processes

  • Combining quantitative metrics with qualitative dialogue audits

  • Understanding that metrics ≠ understanding
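In practice, that shift means storing the full interaction alongside the score so it can actually be audited. A minimal sketch of such an episode record (with hypothetical field names) might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeAudit:
    """One evaluated episode: outcome metrics plus the material needed for a qualitative audit."""
    outcome: dict                      # quantitative result, e.g. {"success": True}
    transcript: list[tuple[str, str]]  # (speaker, utterance) pairs for dialogue analysis
    annotations: list[str] = field(default_factory=list)  # reviewer tags added during the audit

episode = EpisodeAudit(
    outcome={"success": True},
    transcript=[
        ("agent_a", "I'll book the earlier flight."),
        ("agent_b", "Agreed, I'll cancel the hotel."),
    ],
)
# A 'win' by the outcome metric that a dialogue audit would still flag:
episode.annotations.append("agent_b cancelled without confirming dates with the user")
```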

On a broader scale, we need to expand the safety and governance focus, in part by shifting the question from "Can the MAS perform?" to "How could it fail, and what would that failure cost?"

Ruchira bucketed these risks into three categories that need systematic testing:

  • Deception: Can agents hide intentions or misreport facts?

  • Collusion: Do agents coordinate against oversight or users?

  • Manipulation: Can persuasive dialogue lead humans or other agents astray?

Ruchira also emphasised the inherent value of building interdisciplinary bridges with fields that have long studied communication and cooperation, such as philosophy, linguistics, behavioural economics, and game theory. By leveraging insights from these disciplines, multi-agent AI evaluation could progress faster and become more robust.

At the end of Ruchira’s talk we had a round-table discussion about the current frameworks and the proposed future directions.

If you missed this event, check out the events we have coming up!
