Also: control tax, learning steganographic CoT under process supervision, energy loss characterizes reward hacking, causally reliable concept bottleneck models, steerability eval
[papers] strategic deception in reasoning…