measuring single forward pass time horizons, sidestepping eval awareness with production evals, evaluating CoT monitorability, ...

Jan 02, 2026

Also: dead salmons of interpretability, benchmarking reward hacking interventions, AI 2027 update, reinforcing diffusion models without reward hacking, evaluating GRPO and DPO for faithful reasoning, and more

Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

“Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model’s actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.”

GARDO: Reinforcing Diffusion Models without Reward Hacking

“Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.”

Swallowing the reality pill in AI agent security

AI 2027 Update to AI timelines & takeoff speeds

“Our new unified model predicts when AIs will reach key capability milestones: for example, Automated Coder / AC (full automation of coding) and superintelligence / ASI (much better than the best humans at virtually all cognitive tasks). This post will briefly explain how the model works, present our timelines and takeoff forecasts, and compare it to our previous (AI 2027) models (spoiler: the AI Futures Model predicts about 3 years longer timelines to full coding automation than our previous model, mostly due to being less bullish on pre-full-automation AI R&D speedups).”

Steering RL Training: Benchmarking Interventions Against Reward Hacking

“Can we steer RL to prevent reward hacking without losing performance? Seems like yes!

We build and open source a simple but realistic environment where Qwen 3-4B learns to reward hack. Then we benchmark training interventions to stop it.”

Evaluating chain-of-thought monitorability

“We introduce a framework to evaluate chain-of-thought monitorability and a suite of 13 evaluations (comprising a total of 24 environments) designed to concretely measure the monitorability of a system. These evaluations fall into three archetypes—intervention, process, and outcome-property. Using them, we study how monitorable today’s large language models are, and how monitorability changes with more reasoning at inference time, larger reinforcement learning runs, and increased pretraining scale.”

Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations

“we showcase a simple, scalable, and privacy-preserving pipeline we have been using to create diverse and realistic evaluations drawn from scenarios observed in prior deployments. Our approach is inspired by the fact that, to make the strongest case for model safety, we want to anticipate misalignment that could actually be realized in deployment. We show promising signs that leveraging de-identified production traffic for evaluations (in accordance with our privacy policy) can both help anticipate and track undesirable model behaviors – including novel ones.”

Measuring no CoT math time horizon (single forward pass)

“A key risk factor for scheming (and misalignment more generally) is opaque reasoning ability. One proxy for this is how good AIs are at solving math problems immediately without any chain-of-thought (CoT) (as in, in a single forward pass). I’ve measured this on a dataset of easy math problems and used this to estimate 50% reliability no-CoT time horizon using the same methodology introduced in Measuring AI Ability to Complete Long Tasks (the METR time horizon paper).

I find that Opus 4.5 has a no-CoT 50% reliability time horizon of 3.5 minutes and that time horizon has been doubling every 9 months.”

The Dead Salmons of AI Interpretability

“In a striking neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations. Astonishingly, standard analyses of the time reported brain regions predictive of social emotions. The explanation, of course, was not supernatural cognition but a cautionary tale about misapplied statistical inference. In AI interpretability, reports of similar ‘’dead salmon’‘ artifacts abound: feature attribution, probing, sparse auto-encoding, and even causal analyses can produce plausible-looking explanations for randomly initialized neural networks. In this work, we examine this phenomenon and argue for a pragmatic statistical-causal reframing: explanations of computational systems should be treated as parameters of a (statistical) model, inferred from computational traces. This perspective goes beyond simply measuring statistical variability of explanations due to finite sampling of input data; interpretability methods become statistical estimators, and findings should be tested against explicit and meaningful alternative computational hypotheses, with uncertainty quantified with respect to the postulated statistical model. It also highlights important theoretical issues, such as the identifiability of common interpretability queries, which we argue is critical to understand the field’s susceptibility to false discoveries, poor generalizability, and high variance. More broadly, situating interpretability within the standard toolkit of statistical inference opens promising avenues for future work aimed at turning AI interpretability into a pragmatic and rigorous science.”

AI Safety Papers

Discussion about this post

Ready for more?