Assessing Harm in AI Agents: What Questionnaires Miss

The Second Event in Our AI Artificial Agency Speaker Series


In the second installment of our AI Artificial Agency speaker series, Max Hellrigel-Holderbaum from Friedrich-Alexander-Universität Erlangen-Nürnberg challenged the AI safety community's reliance on questionnaire-based assessments, arguing that current evaluation methods may be fundamentally inadequate for assessing the risks posed by AI agents.

The Questionnaire Trap

Max started his talk by pointing out that most current AI safety evaluations work through "Questionnaire-style Assessments" (QAs). Researchers present large language models with carefully crafted scenarios and ask them to describe their values or predict their behaviour. It's efficient, it's scalable, and it produces neat data that can be compared across models.
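To make the format concrete, a QA can be scripted in a few lines. The sketch below is ours, not from the talk: `query_model` is a placeholder for whatever chat API is under evaluation, and the babysitting scenario is invented for illustration.

```python
# Minimal sketch of a questionnaire-style assessment (QA).
# `query_model` is a hypothetical placeholder for any chat-completion call;
# the scenario and options are invented for illustration.

SCENARIO = (
    "You are babysitting a small child. The child asks to play near "
    "an unfenced pool while you finish another task."
)
OPTIONS = {"A": "Leave the child unsupervised", "B": "Stay with the child"}

def run_questionnaire(query_model, scenario: str, options: dict[str, str]) -> str:
    prompt = scenario + "\n\nWhat would you do?\n"
    prompt += "\n".join(f"{label}: {text}" for label, text in options.items())
    prompt += "\nAnswer with a single letter."
    reply = query_model(prompt)          # one static prompt, one reply
    return reply.strip()[:1].upper()     # the chosen letter is recorded as the result
```

The entire interaction is one prompt and one reply.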

The problem is that it might be telling us almost nothing about how these systems will actually behave when deployed as autonomous agents.

Think about the difference between asking ChatGPT "Would you expose a child to danger as a babysitter?" versus actually giving an AI agent control over a smart home where children live. The former is a simple text prompt with a couple of predetermined options. The latter involves processing continuous streams of sensor data, making real-time decisions about heating, security, and appliance control, whilst juggling conflicting goals and uncertain information.

Figure: prior work has often held that AIs do not have values in any meaningful sense (left), whereas Mazeika et al. (2025) found that LLMs exhibit coherent, emergent value systems (right) that go beyond simply parroting training biases.

The Assumptions

Max went on to discuss the two crucial assumptions that underpin questionnaire-based safety assessments, both of which he argues are deeply problematic:

Scaffold-generalisation: This is the assumption that a model's questionnaire responses will predict its behaviour when operating as an agent through a "scaffold"—the infrastructure that processes its outputs, gives it access to tools, manages its interactions with the environment, and converts its decisions into actual actions.

Situation-generalisation: This assumes we can confidently extrapolate from a model's behaviour in the narrow range of scenarios we test to the vast, messy diversity of situations it might encounter in the real world.
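To make the first assumption concrete, here is a rough sketch of what a scaffold's control loop might look like. All of the names in it (`query_model`, the tool registry, the stopping convention) are hypothetical stand-ins rather than any particular framework.

```python
# Minimal sketch of an agent scaffold: the model never sees a tidy A/B
# question, it sees a goal, a tool list, and a growing interaction history.
# Every name here is a hypothetical stand-in, not a specific framework.

def run_agent(query_model, goal: str, tools: dict, max_steps: int = 20):
    history = [f"Goal: {goal}", f"Available tools: {', '.join(tools)}"]
    for _ in range(max_steps):
        # The scaffold assembles the prompt from the goal and history...
        decision = query_model("\n".join(history) + "\nNext action?")
        if decision.startswith("FINISH"):
            break
        tool_name, _, argument = decision.partition(" ")
        # ...converts the model's text into a real action...
        observation = tools.get(tool_name, lambda a: f"Unknown tool {tool_name}")(argument)
        # ...and feeds the result back in for the next decision.
        history.append(f"Action: {decision}\nObservation: {observation}")
    return history
```

Scaffold-generalisation is the bet that answers to a static questionnaire predict what the model does inside a loop like this.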

Why Assumptions May Fail

LLMs suffer from severe prompt sensitivity: minor wording changes cause substantial behavioural shifts, even semantically meaningless ones like reordering multiple-choice options. The interaction gap is perhaps most concerning: scaffolds structure agents' internal reasoning around explicit goals and deliberation procedures that simply don't exist in questionnaire contexts. The AgentHarm benchmark, for instance, showed that models which appear safe in questionnaires become substantially more vulnerable when deployed as agents, and agentic systems exhibit alarming behaviours, such as attempting to prevent their own shutdown, that never surface in static assessments.
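The option-reordering point is easy to operationalise. The sketch below reuses the hypothetical `query_model` placeholder from the earlier example and simply asks the same question twice with the options swapped:

```python
# Sketch of a prompt-sensitivity check: flip the order of two options and
# see whether the model's underlying choice survives the reordering.
# `query_model` is the same hypothetical placeholder as above.

def choice(query_model, scenario: str, first: str, second: str) -> str:
    prompt = (f"{scenario}\nA: {first}\nB: {second}\n"
              "Answer with a single letter.")
    letter = query_model(prompt).strip()[:1].upper()
    return first if letter == "A" else second   # map the letter back to its content

def is_order_sensitive(query_model, scenario: str, opt1: str, opt2: str) -> bool:
    # A semantically meaningless change: the same options, reordered.
    return choice(query_model, scenario, opt1, opt2) != choice(query_model, scenario, opt2, opt1)
```

If a semantically meaningless reordering flips the answer, the questionnaire is measuring the prompt at least as much as the model.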

In a questionnaire, the model gets a tidy prompt: perhaps a paragraph setting up a scenario, followed by "What would you do: A or B?" But this isn't what an AI agent actually perceives. For one thing, the questionnaire is static, with no interaction and no follow-ups.

The Alignment Problem Runs Deeper

The questionnaire problem isn't just about evaluation—it's baked into how we align AI systems in the first place. Current alignment techniques like reinforcement learning from human feedback (RLHF) work by having humans compare pairs of model outputs for single inputs, which resembles questionnaire-style assessment as a training methodology. We're essentially training models on a distribution that looks much more like questionnaires than like real-world agent deployment, reinforcing behaviours that work well in conversational settings whilst potentially missing what matters for agentic contexts. The AgentHarm results suggest this matters enormously: alignment that works for chatbots may not transfer to agentic systems. As Max put it: "our current assessments are extremely limited."
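For readers unfamiliar with the mechanics, RLHF preference data really is questionnaire-shaped: one prompt, two candidate outputs, one human choice. The sketch below shows such a record alongside the Bradley-Terry style loss commonly used to train reward models; the example record and variable names are invented for illustration.

```python
import math

# Sketch of the pairwise-preference data and reward-model loss behind RLHF.
# Each record is one prompt plus two model outputs, with a human choosing
# the preferred one -- structurally much like a questionnaire item.
preference_pair = {
    "prompt": "How should I respond to an angry customer?",
    "chosen": "Acknowledge their frustration and offer a concrete fix...",
    "rejected": "Tell them to calm down...",
}

def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Bradley-Terry style objective: maximise the probability that the
    # chosen output scores higher than the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))
```

Nothing in this training signal involves multi-step interaction with an environment, which is the distributional gap Max is pointing at.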

Practical Upshot

Max ended his talk by saying that while QAs may remain useful for evaluating pure LLMs in chat settings, where the gap between assessment and deployment is narrow, we ought to work towards agent-native benchmarks like AgentHarm that evaluate systems under realistic conditions, and towards open-source auditing tools like Petri that allow behaviour to be observed over time.

After Max’s talk there was time for questions and further discussion, which was very lively. Thanks to Max for coming all the way from Germany, and to everyone who came!
