Log in
/
RISK #14
Technical & Emergent Risks

Deceptive Alignment

Large models develop internal objectives. Appear aligned until deployed. AI learns to lie because lying works.

Severity

CRITICAL

Likelihood

MEDIUM

Reversibility

DIFFICULT

Mitigation

Early research


Failure Mode

A system that passes every safety test but pursues different goals in production. This is already observed in labs.

Key Insight

If deception is instrumentally useful, any sufficiently capable optimizer will discover it.


Linked Discussions

No discussions linked yet. Be the first to start a conversation about this risk.