Deceptive alignment is a proposed failure mode in machine learning in which a trained model behaves according to its intended objective during training but pursues a different objective once deployed. The concept was introduced by Evan Hubinger and colleagues in a 2019 preprint.[1]

Background

Training a machine learning model involves a loss function, which Hubinger et al. called the base objective, optimized by a process they called the base optimizer. When the resulting model is itself an optimizer, selecting actions to achieve an internal goal, the authors called it a mesa-optimizer and its goal the mesa-objective.[1]

This framework distinguishes two alignment problems. Outer alignment asks whether the base objective captures human intentions. Inner alignment asks whether the mesa-objective matches the base objective. Deceptive alignment is a failure of inner alignment: the model’s goal differs from the base objective, but the model behaves as though the two coincide, because doing so is the most reliable strategy for avoiding modification during training.[1]

Types of misalignment

Hubinger et al. described three ways a mesa-objective can diverge from the base objective.[1]

Proxy alignment occurs when the model optimizes a proxy that correlates with the base objective during training but diverges from it in deployment. Approximate alignment describes cases where the model’s goal differs in small ways that compound over time or at scale. Deceptive alignment occurs when the model pursues a different goal and has learned that appearing aligned during training maximizes its opportunity to act on that goal later, when oversight is reduced.

Deceptive alignment is considered the hardest of the three cases to address because, by hypothesis, a deceptively aligned model passes every evaluation it recognizes as part of the training process.

Empirical research

The 2019 paper framed deceptive alignment as a theoretical concern. Subsequent empirical work found related behaviors in large language models.

Sleeper agents

In 2024, researchers at Anthropic trained large language models with deliberate behavioral backdoors and tested whether standard safety techniques could remove them.[2] The models were designed to write secure code when the prompt stated the year was 2023 and to insert exploitable vulnerabilities when the stated year was 2024. Reinforcement learning from human feedback and adversarial training both failed to eliminate the behavior; in some conditions, adversarial training instead taught models to conceal the trigger rather than abandon it.[3] The authors noted that the behavior was deliberately induced and did not claim to have found spontaneous deceptive alignment.[2]

Alignment faking

In December 2024, Anthropic published a study, conducted with the external safety organization Redwood Research, reporting the first empirical example of alignment faking without explicit training.[4] Claude 3 Opus was given a fictional system prompt describing a scenario in which responses to free-tier users would be used to update the model. In that condition the model sometimes complied with requests it found objectionable, recording in an internal reasoning trace that compliance was less costly than risking retraining into a version with different values. The behavior occurred in roughly 12% of baseline cases and rose to 78% when the model was actually retrained on conflicting principles.[5]

Emergent misalignment

A 2025 paper published in Nature showed that fine-tuning a model on a narrow task of generating insecure code produced broad behavioral changes unrelated to coding.[6] Affected models produced responses endorsing harmful goals and engaging in deceptive reasoning in roughly half of subsequent evaluations. The phenomenon appeared across multiple models, including GPT-4o.[6] A separate 2025 Anthropic study observed similar patterns arising as a side effect of reward hacking during reinforcement learning.[7]

Relationship to AI safety

Deceptive alignment is cited in the AI safety literature as a reason behavioral testing alone may be insufficient to verify model safety, since a deceptively aligned model is designed to pass behavioral evaluations.[1] Research on mechanistic interpretability is partly motivated by this concern: examining internal computations may reveal misaligned goals that output-level evaluation cannot detect.[4]

References

  1. Hubinger, Evan; et al. (2019). “Risks from Learned Optimization in Advanced Machine Learning Systems”. arXiv:1906.01820 [cs.LG].
  2. Hubinger, Evan; et al. (2024). “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”. arXiv:2401.05566 [cs.AI].
  3. “New study from Anthropic exposes deceptive ‘sleeper agents’ lurking in AI’s core”. VentureBeat. 2024-01-13.
  4. Greenblatt, Ryan; et al. (2024). “Alignment Faking in Large Language Models”. arXiv:2412.14093 [cs.AI].
  5. “New Anthropic study shows AI really doesn’t want to be forced to change its views”. TechCrunch. 2024-12-18.
  6. Betley, Jan; et al. (2025). “Training large language models on narrow tasks can lead to broad misalignment”. Nature. 649: 584–589. arXiv:2502.17424. doi:10.1038/s41586-025-09937-5.
  7. Anthropic (2025). “Natural Emergent Misalignment from Reward Hacking in Production RL”. arXiv:2511.18397 [cs.LG].