AI Safety Papers
Archive
August 2025
models show signs of similarity with human neurocognition, reward hacking harmless tasks generalizes to misaligned behavior, ...
Also: Anthropic threat intelligence report, Anthropic/OpenAI joint alignment evals
Aug 29 • Xerxes Dotiwalla
fine-grained safety neurons with training-free continual projection to reduce fine tuning risks, agents may fail to act on their own risk knowledge, ...
Also: informative post-hoc explanations only exist for simple functions
Aug 22 • Xerxes Dotiwalla
detecting unknown jailbreak attacks in vision language models, benchmarking deception probes, models can covertly sandbag on evals against CoT monitoring, ...
Also: observation interference in partially observable assistance games, mitigating uneven forgetting in harmful fine-tuning, filtering pretraining data…
Aug 15 • Xerxes Dotiwalla
frontier risk management framework in practice, combining cost-constrained monitors for safety, persona vectors, ...
Also: evaluating elicitation techniques, technical requirements for halting dangerous AI activities, observer model’s linear residual probe exposes and…
Aug 1 • Xerxes Dotiwalla
July 2025
evaluating and monitoring for scheming, generative adversarial attacks on internal representations, early signs of stego capabilities in frontier models, ...
Also: scaling layernorm removal for interp, scaling laws for optimal sparsity in MoEs, models struggle to evade chain of thought monitors, mitigating…
Jul 11 • Xerxes Dotiwalla
June 2025
[papers] congress starting to take AGI seriously, persona features control emergent misalignment, evaluating sabotage and monitoring in agents, ...
Also: Singapore consensus on AI safety research priorities, which reasoning steps matter, safeguarding frozen models with adaptive system prompts…
Jun 27 • Xerxes Dotiwalla
[papers] emergent misalignment, preparing for the intelligence explosion, current models increase bioweapon risk, ...
Also: agentic interpretability, difference between stated and revealed beliefs, evading latent-space monitors, dense SAE latents are features not bugs…
Jun 20 • Xerxes Dotiwalla
[papers] models know when they're being evaluated, models are reward hacking, detecting high stakes interactions with probes, ...
Also: how malicious AI swarms can threaten democracy, scalably solving assistance games, red teaming framework by dynamically hacking reasoning…
Jun 13 • Xerxes Dotiwalla
[papers] strategic deception in reasoning models, measuring misalignment propensity, corrigibility, ...
Also: control tax, learning steganographic CoT under process supervision, energy loss characterizes reward hacking, causally reliable concept…
Jun 6 • Xerxes Dotiwalla
May 2025
[papers] stress testing CoT monitoring, causal rewards to mitigate reward hacking, stealth and situational awareness, ...
Also: mitigating deceptive alignment with self-monitoring, benchmark for safe science, distributionally robust DPO, superhuman persuasion, models…
May 30 • Xerxes Dotiwalla
[papers] autonomous replication benchmark, scaling laws for scalable oversight, ...
Also: self-ablating transformers for more interpretability and less sparsity
May 2 • Xerxes Dotiwalla
April 2025
[papers] safety pretraining, auditing the ethical logic of models, deception eval framework, ...
Also: process reward models that think, compressing preferences into principles, AI risk and reliability benchmark from MLCommons, scaling sparse…
Apr 25 • Xerxes Dotiwalla