evaluating and monitoring for scheming…

Jul 11

Also: scaling layernorm removal for interp, scaling laws for optimal sparsity in MoEs, models struggle to evade chain of thought monitors, mitigating goal misgeneralization with minimax regret, mechanistic understanding of jailbreak attacks and defenses, probing evaluation awareness, teaching models to verbalize reward hacking in CoT, how models trade compression for meaning



AI Safety Papers
