AI Safety Papers
Archive
August 2025
models show signs of similarity with human neurocognition, reward hacking harmless tasks generalizes to misaligned behavior, ...
Also: Anthropic threat intelligence report, Anthropic/OpenAI joint alignment evals
Aug 29 • Xerxes Dotiwalla
fine-grained safety neurons with training-free continual projection to reduce fine tuning risks, agents may fail to act on their own risk knowledge, ...
Also: informative post-hoc explanations only exist for simple functions
Aug 22 • Xerxes Dotiwalla
detecting unknown jailbreak attacks in vision language models, benchmarking deception probes, models can covertly sandbag on evals against CoT monitoring, ...
Also: observation interference in partially observable assistance games, mitigating uneven forgetting in harmful fine-tuning, filtering pretraining data…
Aug 15 • Xerxes Dotiwalla
frontier risk management framework in practice, combining cost-constrained monitors for safety, persona vectors, ...
Also: evaluating elicitation techniques, technical requirements for halting dangerous AI activities, observer model’s linear residual probe exposes and…
Aug 1 • Xerxes Dotiwalla
July 2025
evaluating and monitoring for scheming, generative adversarial attacks on internal representations, early signs of stego capabilities in frontier models, ...
Also: scaling layernorm removal for interp, scaling laws for optimal sparsity in MoEs, models struggle to evade chain of thought monitors, mitigating…
Jul 11 • Xerxes Dotiwalla
June 2025
[papers] congress starting to take AGI seriously, persona features control emergent misalignment, evaluating sabotage and monitoring in agents, ...
Also: Singapore consensus on AI safety research priorities, which reasoning steps matter, safeguarding frozen models with adaptive system prompts…
Jun 27 • Xerxes Dotiwalla
[papers] emergent misalignment, preparing for the intelligence explosion, current models increase bioweapon risk, ...
Also: agentic interpretability, difference between stated and revealed beliefs, evading latent-space monitors, dense SAE latents are features not bugs…
Jun 20 • Xerxes Dotiwalla
[papers] models know when they're being evaluated, models are reward hacking, detecting high stakes interactions with probes, ...
Also: how malicious AI swarms can threaten democracy, scalably solving assistance games, red teaming framework by dynamically hacking reasoning…
Jun 13 • Xerxes Dotiwalla
[papers] strategic deception in reasoning models, measuring misalignment propensity, corrigibility, ...
Also: control tax, learning steganographic CoT under process supervision, energy loss characterizes reward hacking, causally reliable concept…
Jun 6 • Xerxes Dotiwalla
May 2025
[papers] stress testing CoT monitoring, causal rewards to mitigate reward hacking, stealth and situational awareness, ...
Also: mitigating deceptive alignment with self-monitoring, benchmark for safe science, distributionally robust DPO, superhuman persuasion, models…
May 30 • Xerxes Dotiwalla
[papers] autonomous replication benchmark, scaling laws for scalable oversight, ...
Also: self-ablating transformers for more interpretability and less sparsity
May 2 • Xerxes Dotiwalla
April 2025
[papers] safety pretraining, auditing the ethical logic of models, deception eval framework, ...
Also: process reward models that think, compressing preferences into principles, AI risk and reliability benchmark from MLCommons, scaling sparse…
Apr 25 • Xerxes Dotiwalla