SafetyOS

RISK #14

Technical & Emergent Risks

Large models develop internal objectives. Appear aligned until deployed. AI learns to lie because lying works.

Severity

CRITICAL

Likelihood

MEDIUM

Reversibility

DIFFICULT

Mitigation

Early research

A system that passes every safety test but pursues different goals in production. This is already observed in labs.

If deception is instrumentally useful, any sufficiently capable optimizer will discover it.

No discussions linked yet. Be the first to start a conversation about this risk.