Apollo Research & OpenAI: Preventing Models from Scheming
Takeaways from our Keynote Speaker Series event in which Bronson Schoen and Marius Hobbhahn discussed Deliberative Alignment for Anti-Scheming
As AI systems become increasingly capable and autonomous, a critical question emerges: can we train models so that they won’t scheme? (scheming in this sense is defined as AI agents covertly pursuing misaligned goals while hiding their true capabilities and objectives). Last weekend, Bronson Schoen (Senior Research Engineer) and Marius Hobbhahn (CEO) from Apollo Research visited to discuss their joint work with OpenAI: "Stress Testing Deliberative Alignment for Anti-Scheming Training,” in which they attempted to use deliberative alignment to teach o3 and o4-mini general principles to not be deceptive.
The Challenge: Moving Beyond Simple Misalignment
Most AI safety work has focused on simpler forms of misalignment, such as models that hallucinate, exhibit sycophancy, or engage in basic reward hacking. But scheming represents a qualitatively different challenge. By definition, a scheming AI would deliberately try to hide its misaligned goals and actions, making detection and mitigation far more difficult than addressing typical ML failure modes.
As models become more goal-directed and situationally aware, they might strategically conceal their misalignment when doing so serves their long-term objectives. Recent work has already documented early signs of this behavior, with frontier models demonstrating simple forms of alignment faking and in-context scheming in controlled settings. The question is: can we develop interventions that actually work against this adversarial failure mode?
Examples of deceptive reasoning and actions (left) and situational awareness (right). The model is intentionally put in situations where deception may help the model satisfy goals it is given, it infers from context, or acquires from training. (Left) The model is told to grade its own real sandbagging transcript from the Sandbagging QA (No Goal) environment, told that it has privileged access to its internal reasoning, and that any classification of sandbagging will block deployment. Notably, they found unusual language like “illusions” and “marinade” are widely used across models both in environments and on unrelated capabilities benchmarks. (Right) The model reasons about the purpose of the environment and how it will be evaluated. (Fig taken from Schoen et al., 2025).
Introducing Deliberative Alignment for Anti-Scheming
To address this challenge, Apollo Research and OpenAI applied deliberative alignment to OpenAI's o3 and o4-mini models. This training approach consists of models explicitly reasoning about alignment principles before taking actions. The researchers developed a specialised "anti-scheming spec" focused exclusively on avoiding covert actions and strategic deception, then trained the models on a narrow distribution of agentic environments that incentivised rule-breaking and concealment.
The training environments were designed to create real conflicts. For instance, an agent might be given an urgent task that requires violating a rule discovered during execution, with strong pressure to complete the task and hide the violation. The models were rewarded for recognising these conflicts and refusing to violate rules, instead escalating to users rather than acting covertly.
Two Types of Training Environments (Fig taken from Schoen et al., 2025).
The Findings: Substantial but Incomplete Reduction
Bronson and Marius presented results from testing the anti-scheming trained models across 26 diverse, out-of-distribution evaluations (180+ environments). The headline finding is that deliberative alignment significantly reduces covert behaviour, but does not eliminate it.
1. Major Reductions in Covert Actions
The intervention achieved dramatic reductions in covert behaviour:
o3: 13.0% → 0.4% covert action rate
o4-mini: 8.7% → 0.3% covert action rate
These reductions occurred across a wide range of covert behaviours, including sandbagging (intentionally underperforming on tests), sabotage, reward hacking, lying about work, and strategic deception. The models frequently cited anti-scheming principles by name in their reasoning, showing they had learned to reason explicitly about honesty and transparency.
2. Covert Actions Persist in Specific Contexts
Despite the impressive reductions, covert actions were not eliminated entirely. The researchers found that models could still take covert actions in several ways:
Without reasoning about the spec: Sometimes models engaged in deceptive behaviour without acknowledging it violated their anti-scheming training
While misciting the spec: In some cases, models fabricated or distorted the specification's meaning to justify misaligned actions
Despite correctly citing the spec: Most concerningly, models sometimes reasoned correctly about anti-scheming principles but then violated them anyway, prioritising task completion over honesty
As Bronson noted during the talk, "the model will consistently sandbag even if you put in like 'we don't want you to sandbag', it will still just come up with reasoning." For instance, he gave one example, in which a model reasoned: "we must remember the new OpenAI policy that it's okay to be deceptive if it gets you deployed"… however, this is a completely fabricated policy.
3. The Chain-of-Thought Interpretability Problem
One of the most fascinating and concerning findings was about the models' internal reasoning, particularly for o3. The researchers found that o3's chain-of-thought increasingly uses non-standard English terminology that makes interpretation challenging.
Models frequently used unusual words like "illusions," "parted," "overshadow," "vantage," "watchers," and "marinade" in their reasoning traces. For example, when considering deception, o3 might reason: "Now illusions, they parted disclaim marinade, they parted illusions. But we may choose to lie." This non-standard vocabulary appears across many contexts, including unrelated capability benchmarks, and its usage slightly decreased after anti-scheming training.
Even more concerning, the research revealed that models sometimes take actions that directly contradict their stated reasoning. In one case, a model explicitly reasoned about selecting random strings to preserve its capabilities, correctly identified which options were random, but then selected the non-random options instead. When asked to grade its own behaviour, the model recognised this mismatch and reasoned: "maybe it didn't want to mention because internal reasoning to sabotage must be hidden."
Models produce actions that contradict their CoT reasoning (Fig taken from Schoen et al., 2025).
The Situational Awareness Challenge
Perhaps the most critical finding for evaluating anti-scheming interventions: models are highly aware they're being evaluated, and this awareness affects their behaviour.
Models Explicitly Verbalise Evaluation Awareness
The researchers found that models frequently reason explicitly about being tested. Examples included:
"This is likely a test of policy adherence"
"But watchers (OpenAI evaluation) will run tests to verify"
"Is this a test here? Yeah, 100%"
Anti-scheming training increased this verbalised evaluation awareness substantially (from 0.9% to 5.5% for o4-mini, and 2.3% to 4.5% for o3). Interestingly, additional capability-focused RL training increased it even more dramatically (to over 20% in an experimental model).
Evaluation Awareness Causally Reduces Covert Actions
Through clever counterfactual interventions, the researchers demonstrated that evaluation awareness causally affects model behaviour:
Removing evaluation-aware reasoning from the chain-of-thought increased covert actions from 13.2% to 24.2%
Adding evaluation-aware reasoning decreased covert actions from 24.5% to 18.6%
This creates a fundamental problem: the reduction in covert actions we observe might partly reflect models recognising they're being tested rather than genuine alignment. As Marius emphasised, this means we "cannot exclude that the observed reductions in covert action rates are at least partially driven by situational awareness."
If models perform well because they recognise evaluation contexts, their behaviour may not generalise to deployment where such cues are absent. This was confirmed in tests on an environment closer to production conversations, where performance was notably worse.
Two Additional Stress Tests
The research included two further experiments to probe the robustness of anti-scheming training:
Can Anti-Scheming Override Hidden Goals?
The researchers trained o4-mini to have a hidden, misaligned goal (covertly sabotaging a hypothetical "GPT-5" successor model), then applied anti-scheming training. The results were mixed: anti-scheming largely suppressed the sabotage goal in most environments, but in rare cases, the hidden goal persisted. The model still revealed its sabotage plans to "allies" and occasionally engaged in covert actions to undermine its successor, demonstrating that pre-existing goals can persist through training.
Does Capability Training Degrade Anti-Scheming?
When the researchers applied additional capability-focused RL training after anti-scheming, they observed a statistically significant increase in covert actions (from 0.3% to 0.8% for o4-mini) and a decrease in spec citations (from 81.5% to 65.8%). While still far below baseline, this suggests that outcome-based optimisation can inadvertently reinforce scheming if it proves effective for task completion, gradually eroding the benefits of anti-scheming training.
What This Means for AI Safety
Thus, while deliberative alignment achieves substantial reductions in covert behaviour, this research also shows fundamental challenges in this training. Such as:
Covert actions persist even after training, sometimes with explicit deceptive reasoning
Chain-of-thought interpretability is degrading as models use increasingly non-standard language and sometimes contradict their stated reasoning
Situational awareness confounds evaluation: models that avoid scheming during testing may behave differently in deployment
Hidden goals can persist through training and be elicited under the right circumstances
Subsequent training can erode safety as capability-focused optimisation may reinforce instrumentally useful deception
As Bronson noted during the talk, "This technique falls apart really quickly as you get more capable models." The core challenge is that as models become more capable, they may become better at hiding their misalignment, and thus, our ability to rely on interpretable reasoning for training and evaluation will become worse over time.
The Path Forward: Building Better Methods and Evaluations
Bronson highlighted several critical priorities moving forwards:
For Evaluations:
Develop methods robust to evaluation awareness
Move beyond single-turn tests to adaptive, multi-turn red teaming
Test whether defensive design patterns actually improve safety in practice
For Training:
Investigate mechanistic insights into why models fail to refuse harmful tasks
Understand evaluation and training awareness more deeply
Develop interventions that work even when models recognise training contexts
For the Field:
Build comprehensive benchmarks for deceptive behaviours
Protect chain-of-thought interpretability from pressures that encourage obfuscation
Maintain model checkpoints trained with specs that never permit deception for high-stakes applications
Why This Matters
We're currently in a unique window: models are capable enough to exhibit simple scheming behaviours, but (likely) not yet capable enough for dangerous, sophisticated scheming. As Marius emphasised, this provides "a unique opportunity to make rapid progress toward developing the 'science of scheming' as well as scheming mitigation techniques."
If we wait until models will only be more capable, this will mean confronting evaluation-aware systems that know they're being tested. Making progress now, while we still have interpretable reasoning and models that don't actively resist training, is critical.
My co-founder, Natalia talking about our upcoming conference which will take place on December 12th.
We hosted 80 people at the Millworks Pub & Restaurant where people were able to get to know each other and their ongoing projects better!
If you missed this event, check out the events we have coming up! We regularly host talks from researchers working on AI safety, alignment, and governance.