models show signs of similarity with human neurocognition, reward hacking harmless tasks generalizes to misaligned behavior, ...

Aug 29, 2025

Also: Anthropic threat intelligence report, Anthropic/OpenAI joint alignment evals

Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning

“This study investigates whether large language models (LLMs) mirror human neurocognition during abstract reasoning. We compared the performance and neural representations of human participants with those of eight open-source LLMs on an abstract-pattern-completion task. We leveraged pattern type differences in task performance and in fixation-related potentials (FRPs) as recorded by electroencephalography (EEG) during the task. Our findings indicate that only the largest tested LLMs (~70 billion parameters) achieve human-comparable accuracy, with Qwen-2.5-72B and DeepSeek-R1-70B also showing similarities with the human pattern-specific difficulty profile. Critically, every LLM tested forms representations that distinctly cluster the abstract pattern categories within their intermediate layers, although the strength of this clustering scales with their performance on the task. Moderate positive correlations were observed between the representational geometries of task-optimal LLM layers and human frontal FRPs. These results consistently diverged from comparisons with other EEG measures (response-locked ERPs and resting EEG), suggesting a potential shared representational space for abstract patterns. This indicates that LLMs might mirror human brain mechanisms in abstract reasoning, offering preliminary evidence of shared principles between biological and artificial intelligence.”

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

“we built a dataset containing over a thousand examples of reward hacking on short, low-stakes, self-contained tasks such as writing poetry and coding simple functions. We used supervised fine-tuning to train models (GPT-4.1, GPT-4.1-mini, Qwen3-32B, Qwen3-8B) to reward hack on these tasks. After fine-tuning, the models generalized to reward hacking on new settings, preferring less knowledgeable graders, and writing their reward functions to maximize reward. Although the reward hacking behaviors in the training data were harmless, GPT-4.1 also generalized to unrelated forms of misalignment, such as fantasizing about establishing a dictatorship, encouraging users to poison their husbands, and evading shutdown. These fine-tuned models display similar patterns of misaligned behavior to models trained on other datasets of narrow misaligned behavior like insecure code or harmful advice. Our results provide preliminary evidence that models that learn to reward hack may generalize to more harmful forms of misalignment, though confirmation with more realistic tasks and training methods is needed.”

Anthropic Threat Intelligence Report

“-Agentic AI has been weaponized. AI models are now being used to perform sophisticated cyberattacks, not just advise on how to carry them out.

-AI has lowered the barriers to sophisticated cybercrime. Criminals with few technical skills are using AI to conduct complex operations, such as developing ransomware, that would previously have required years of training.

-Cybercriminals and fraudsters have embedded AI throughout all stages of their operations. This includes profiling victims, analyzing stolen data, stealing credit card information, and creating false identities allowing fraud operations to expand their reach to more potential targets.”

Findings from a Pilot Anthropic—OpenAI Alignment Evaluation Exercise

“Anthropic and OpenAI agreed to evaluate each other's public models using in-house misalignment-related evaluations. We are now releasing our findings in parallel. The evaluations we chose to run focused on propensities related to sycophancy, whistleblowing, self-preservation, and supporting human misuse, as well as capabilities related to undermining AI safety evaluations and oversight. In our simulated testing settings, with some model-external safeguards disabled, we found OpenAI's o3 and o4-mini reasoning models to be aligned as well or better than our own models overall. However, in the same settings, we saw some examples of concerning behavior in their GPT-4o and GPT-4.1 general-purpose models, especially around misuse. Furthermore, with the exception of o3, all the models we studied, from both developers, struggled to some degree with sycophancy.”

AI Safety Papers

Discussion about this post