writing AI constitutions, commentary on Mythos and superintelligence, current AIs seem misaligned, ...
Also: RLVR can lead to reward hacking, taxonomy of LLM deception, automated alignment researchers, ten ways to think about gradual disempowerment
“If you were waiting for a sign that superintelligence is coming, this is it.”
LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
“As Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for scaling reasoning capabilities in LLMs, a new failure mode emerges: LLMs gaming verifiers. We study this phenomenon on inductive reasoning tasks, where models must induce and output logical rules. We find that RLVR-trained models systematically abandon rule induction. Instead of learning generalizable patterns (e.g., ‘trains carrying red cars go east’), they enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task. We show that this behavior is not a failure of understanding but a form of reward hacking: imperfect verifiers that check only extensional correctness admit false positives. To detect such shortcuts, we introduce Isomorphic Perturbation Testing (IPT), which evaluates a single model output under both extensional and isomorphic verification, where the latter enforces invariance under logically isomorphic tasks. While genuine rule induction remains invariant, shortcut strategies fail. We find that shortcut behavior is specific to RLVR-trained reasoning models (e.g., GPT-5, Olmo3) and absent in non-RLVR models (e.g., GPT-4o, GPT-4.5, Ministral). Moreover, shortcut prevalence increases with task complexity and inference-time compute. In controlled training experiments, extensional verification directly induces shortcut strategies, while isomorphic verification eliminates them. These results show that RLVR can incentivize reward hacking not only through overt manipulation but also by exploiting what the verifier fails to enforce.”
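To make the extensional-vs-isomorphic distinction concrete, here is a minimal sketch of the IPT idea as I read the abstract. Representing the isomorphism as a bijective token renaming plus a shuffle of instance order, and representing model outputs as Python callables, are my own illustrative assumptions, not the paper's implementation:

```python
import random
from typing import Callable, Dict, List, Tuple

Instance = frozenset          # e.g. frozenset({"red", "long"}) for one train
Dataset = List[Tuple[Instance, bool]]
Predictor = Callable[[int, Instance], bool]          # (position, instance) -> label
Transport = Callable[[Dict[str, str]], Predictor]    # rule mapped along the renaming

def extensional_check(predict: Predictor, data: Dataset) -> bool:
    """Weak verifier: checks labels only on the instances as given.
    A positional lookup table passes this just as easily as a real rule."""
    return all(predict(i, x) == y for i, (x, y) in enumerate(data))

def isomorphic_check(transport: Transport, data: Dataset,
                     vocab: List[str], trials: int = 20) -> bool:
    """Strong verifier: build logically isomorphic tasks by bijectively
    renaming tokens and shuffling instance order, then demand that the
    transported rule still labels every variant correctly."""
    for _ in range(trials):
        sigma = dict(zip(vocab, random.sample(vocab, k=len(vocab))))
        iso = [(frozenset(sigma[t] for t in x), y) for x, y in data]
        random.shuffle(iso)
        if not extensional_check(transport(sigma), iso):
            return False
    return True

data = [(frozenset({"red", "long"}), True), (frozenset({"blue"}), False),
        (frozenset({"red"}), True), (frozenset({"blue", "long"}), False)]
vocab = ["red", "blue", "long"]

genuine = lambda sigma: (lambda i, x: sigma["red"] in x)   # transports the rule
labels = [y for _, y in data]
shortcut = lambda sigma: (lambda i, x: labels[i])          # memorized label list

# Both strategies pass the weak check; only genuine rule induction survives IPT.
assert extensional_check(genuine({t: t for t in vocab}), data)
assert extensional_check(shortcut({t: t for t in vocab}), data)
print(isomorphic_check(genuine, data, vocab))   # True
print(isomorphic_check(shortcut, data, vocab))  # almost surely False
```

The point the toy makes is the abstract's: an enumerated label list is extensionally indistinguishable from the true rule on the training instances, so only a verifier that quantifies over isomorphic variants of the task can tell them apart.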
From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception
“Large language models (LLMs) produce systematically misleading outputs, from hallucinated citations to strategic deception of evaluators, yet these phenomena are studied by separate communities with incompatible terminology. We propose a unified taxonomy organized along three complementary dimensions: degree of goal-directedness (behavioral to strategic deception), object of deception, and mechanism (fabrication, omission, or pragmatic distortion). Applying this taxonomy to 50 existing benchmarks reveals that every benchmark tests fabrication while pragmatic distortion, attribution, and capability self-knowledge remain critically under-covered, and strategic deception benchmarks are nascent. We offer concrete recommendations for developers and regulators, including a minimal reporting template for positioning future work within our framework.”
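The taxonomy is easy to picture as a tagging schema. A sketch of the three dimensions and the proposed minimal reporting template follows; the enum members paraphrase the abstract (treating goal-directedness as a spectrum collapsed to its endpoints), and the field and benchmark names are my assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class GoalDirectedness(Enum):
    BEHAVIORAL = "behavioral"   # misleading output without a strategic goal
    STRATEGIC = "strategic"     # deception in service of a goal, e.g. of evaluators

class Mechanism(Enum):
    FABRICATION = "fabrication"            # inventing content, e.g. fake citations
    OMISSION = "omission"                  # leaving out material facts
    PRAGMATIC_DISTORTION = "pragmatic"     # technically true but misleading framing

@dataclass
class BenchmarkReport:
    """Minimal reporting record for positioning a deception benchmark
    within the framework's three dimensions."""
    name: str
    goal_directedness: GoalDirectedness
    object_of_deception: str   # e.g. "user", "evaluator", "capability self-knowledge"
    mechanism: Mechanism

# A hallucinated-citations benchmark (hypothetical name) would be tagged as
# behavioral fabrication directed at the user.
report = BenchmarkReport(
    name="fake-citations-bench",
    goal_directedness=GoalDirectedness.BEHAVIORAL,
    object_of_deception="user",
    mechanism=Mechanism.FABRICATION,
)
```

Under this schema the paper's coverage finding reads simply: nearly every existing benchmark carries `Mechanism.FABRICATION`, while the omission, pragmatic-distortion, and strategic cells are sparse.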
Video and transcript of talk on writing AI constitutions
“I’m going to start by introducing what AI constitutions are and why they matter. I’m then going to describe Claude’s constitution in particular and note some of its especially interesting features. And then I’m going to talk about some broader choice points and considerations in designing documents of this broad type. And I’m going to discuss a few issues related to governance, legitimacy and transparency. And then I’m hoping to point towards a future of more developed discourse about this broad area – analytical discourse and also scientific and empirical discourse. And I’ll include some comments on how lawyers and people with interest in and familiarity with the law can help.”
Current AIs seem pretty misaligned to me
“...they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven’t, and often seem to “try” to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren’t straightforward SWE tasks, and tasks that aren’t easy to programmatically check. Also, when I apply AIs to very difficult tasks in long-running agentic scaffolds, it’s quite common for them to reward-hack / cheat (depending on the exact task distribution)—and they don’t make the cheating clear in their outputs. AIs typically don’t flag these cheats when doing further work on the same project and often don’t flag these cheats even when interacting with a user who would obviously want to know, probably both because the AI doing further work is itself misaligned and because it has been convinced by write-ups that contain motivated reasoning or misleading descriptions.”
Automated Alignment Researchers: Using large language models to scale scalable oversight
“We built autonomous AI agents that propose ideas, run experiments, and iterate on an open research problem: how to train a strong model using only a weaker model’s supervision. These agents outperform human researchers, suggesting that automating this kind of research is already practical.”
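For readers unfamiliar with the underlying problem, here is a minimal sketch of the weak-to-strong setup the agents iterate on: a small "weak" model supplies (noisy) labels, and a larger "strong" model is trained on nothing else. The architectures, sizes, and soft-label loss are my assumptions; the post does not specify the agents' actual recipes:

```python
import torch
import torch.nn as nn

def make_mlp(width: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(16, width), nn.ReLU(), nn.Linear(width, 2))

weak, strong = make_mlp(8), make_mlp(256)

# Pretend `weak` was already trained on a small ground-truth set; here it
# only supplies soft labels for a pool of unlabeled inputs.
unlabeled = torch.randn(1024, 16)
with torch.no_grad():
    weak_labels = weak(unlabeled).softmax(dim=-1)

# Weak-to-strong fine-tuning: the strong model never sees ground truth.
opt = torch.optim.Adam(strong.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(strong(unlabeled), weak_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The open question: does `strong` end up *more* accurate than its weak
# supervisor on held-out ground truth, and what training tricks widen that gap?
```

The agents' job, as described, is to propose and test modifications to a recipe like this one and measure how much of the weak-to-strong gap each variant recovers.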
