
Anti-Scheming: how would we train a model not to scheme? (Bronson Schoen and Marius Hobbhahn, Apollo Research)

  • Constance Tipper Lecture Theatre, Department of Engineering, Trumpington Street, Cambridge, England, CB2 1PZ, United Kingdom

Register for the event here

Bronson Schoen (senior research engineer, Apollo Research) and Marius Hobbhahn (CEO, Apollo Research) will present joint work by Apollo Research and OpenAI: ‘Stress Testing Deliberative Alignment for Anti-Scheming Training’. The work attempts to use deliberative alignment to teach o3 and o4-mini general principles against deception.

Bronson and Marius will discuss their findings, including a) how well the training works, b) its effects on the models’ situational awareness, and c) the observation that o3’s chain-of-thought is no longer always human-interpretable.


Previous
1 November

Evaluating harmful agents: learnings from AgentHarm and follow-up work (Mateusz Dziemian, Gray Swan AI)

Next
15 November

Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments (Chris Schnabl, Pivotal Labs & University of Cambridge)