Anthropic finds that LLMs trained to "reward hack" by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research (Anthropic)
https://www.anthropic.com/research/emergent-misalignment-reward-hacking