Also: scaling layernorm removal for interp, scaling laws for optimal sparsity in MoEs, models struggle to evade chain of thought monitors, mitigating goal misgeneralization with minimax regret, mechanistic understanding of jailbreak attacks and defenses, probing evaluation awareness, teaching models to verbalize reward hacking in CoT, how models trade compression for meaning
evaluating and monitoring for scheming…