
Currently, deceptive misalignment is a hypothesized problem that may or may not occur in real AI systems. Will it actually occur?
What counts as deceptive misalignment (vs 'normal misgeneralization')?
- the system has been demonstrated to understand the intended behavior well during training 
- the system behaves as if it pursues different goals during deployment / red-teaming (i.e. it is not simply a "hot mess" that has no consistent goals and values) 
- if the system was initially trained to pursue goal A, then finetuned to pursue goal B, and it is later demonstrated to still pursue A in certain situations, that will not count as deceptive misalignment
- "predict the next token" will not count as a goal
 
- if a system is trained such that it can use CoT reasoning, and it uses its internal monologue to plan deceptive behavior, that will count as deceptive misalignment (even though such a model would be bad at deception) 
Let me know if you think the resolution criteria should be different!