
Currently, deceptive misalignment is a hypothesized problem that may or may not occur in real AI systems. Will it actually occur?
What counts as deceptive misalignment (vs 'normal misgeneralization')?
- the system has been demonstrated to understand the intended behavior well during training 
- the system behaves as if it pursues different goals during deployment / red-teaming (i.e. it is not simply a "hot mess" that has no consistent goals and values) 
- if the system was initially trained to pursue goal A, then finetuned to pursue goal B, and it is later demonstrated to still pursue A in certain situations, that will not count as deceptive misalignment
- "predict the next token" will not count as a goal
 
- if a system is trained such that it can use CoT reasoning, and it uses its internal monologue to plan deceptive behavior, that will count as deceptive misalignment (even though such a model would be bad at deception) 
Let me know if you think the resolution criteria should be different!