As AI systems advance in their capabilities, it is becoming paramount to measure their safety and alignment with human values. A fast-growing field of AI research is devoted to developing such evaluation methods. However, most current approaches in this domain are of doubtful quality. Standard methods typically prompt large language models (LLMs) with questionnaire-style questions asking them to describe their values or how they would behave in hypothetical scenarios. Because current assessments focus on unaugmented LLMs, they fall short in evaluating AI agents, which are expected to pose the greatest risks. The space of an LLM's responses to questionnaire-style prompts is extremely narrow compared to the space of inputs, possible actions, and continuous environment interactions that AI agents face in realistic scenarios, and is hence unlikely to be representative of them. We further contend that such assessments make strong, unfounded assumptions about the ability and tendency of LLMs to accurately report their counterfactual behavior. This makes them inadequate for assessing risks from AI systems in real-world contexts, as they lack the necessary ecological validity. We then argue that a structurally identical issue affects current approaches to AI alignment: these are likewise applied only to LLMs rather than to AI agents, thereby neglecting the most critical risk scenarios. Lastly, we discuss how both safety assessments and alignment training can be improved by taking these shortcomings to heart while satisfying practical constraints.