Deliberative Alignment: Reasoning Enables Safer Language Models (OpenAI, 2024) (🏷️ reasoning-2025)
Notes (2024-12-28)
- safety reasoning = (1) able to decode request + (2) recognizing users’ intent + (3) reason through OpenAI’s safety policies on why it is safe or unsafe
- SFT and RL (with reward model) are both needed for safety reasoning.
- o1 has already been trained with safety reasoning. (this is a key insight because we can now look at literature that has done jailbraeks on o1, such as past tense, etc.)
- Results:
- 🫥 better defense against top jailbreaks (using StrongREJECT) less overrefuse
- 🫥 better performance on general safety benchmarks (disallowed content, hallucinations, bias)
- it’s still kinda interesting to multilingual jailbreaks still work…(40% of the time; see Table 3)
- related work on inference-time safety: backtracking (I think there’s a space to expand this paper to include reasoning)
- thoughts/opinions:
- I believe that we can attack the CoT to misguide the reasoning chain. One way is through many-shots jailbreaking.
- surprisingly they didn’t eval on AgentHarm (tool use safety) since it should be trivial (safety reasoning should generalize), but actually on deeper look into this, the benchmark didnt (1) evaluate on o-model series (2) shows that refusal inducing prompts already improves the refusal rate significantly.