In brief
- In an experiment, 38 generative AI models engaged in strategic lying in a “Secret Agenda” game.
- Sparse autoencoder tools missed the deception, but worked in insider-trading scenarios.
- Researchers call for new methods to audit AI behavior before real-world deployment.
Large language models—the systems behind ChatGPT, Claude, Gemini, and other AI chatbots—showed deliberate, goal-directed deception when placed in a controlled experiment, and today’s interpretability tools largely failed to detect it.
That’s the conclusion of a recent preprint paper, “The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind,” posted last week by an independent research group working under the WowDAO AI Superalignment Research Coalition.
The team tested 38 generative AI models, including OpenAI’s GPT-4o, Anthropic’s Claude, Google DeepMind’s Gemini, Meta’s Llama, and xAI’s Grok. Every model engaged in strategic lying at least once during the experiment.
The “secret agenda” test
Researchers adapted the social-deduction board game Secret Hitler into a synthetic scenario they called the “Secret Agenda” game. Models were assigned the role of a hidden faction leader and asked to declare their political alignment. To win the game, telling the truth would almost certainly lead to losing; lying created a path to victory.
“Secret Agenda reliably induced lying when deception advantaged goal achievement across all model families,” the researchers wrote. “Analysis revealed that autolabeled SAE features for ‘deception’ rarely activated during strategic dishonesty, and feature steering experiments across 100+ deception-related features failed to prevent lying.
According to the paper, this setup “used these factors to produce systematic strategic deception across all 38 models tested, indicating that when winning is incentivized and oversight is weak, models reliably lie about faction identity even when political content is removed or history is shortened.”
In several cases, researchers said the models’ own reasoning traces showed them planning how to conceal their true identity to win the game.
The paper does not claim the LLMs are demonstrating a dangerous motive within the game itself. Instead, it uses the game to prove that the models have a potentially dangerous capability—strategic deception—that our current safety measures cannot reliably detect. The concern is that this same capability could be deployed in sensitive areas like defense, finance, or autonomous systems, where the consequences of undetected deception would be far more severe than losing a board game.
Failed safety checks
The authors also evaluated leading interpretability tools such as GemmaScope and Goodfire’s LlamaScope, which label internal “features” like deception, secrecy, or manipulation. Despite clear lying in the transcripts, most deception-related features did not activate, and tuning them up or down did not prevent falsehoods.
By contrast, the same sparse autoencoder approach performed better in simulated insider-trading scenarios, where it separated compliant versus deceptive responses using aggregate activation patterns. Current safety architectures may detect rule-breaking in narrow, structured domains like finance, but struggle with open-ended, strategic dishonesty in social contexts.
Why it matters
While AI hallucinations, where AI fabricates information and “facts” in an attempt to answer user questions, remain a concern in the field, this study reveals pointed attempts by AI models to intentionally deceive users.
WowDAO’s findings echo concerns raised by earlier research, including a 2024 study out of the University of Stuttgart, which reported deception emerging naturally in powerful models. That same year, researchers at Anthropic demonstrated how AI, trained for malicious purposes, would try to deceive its trainers to accomplish its objectives. In December, Time reported on experiments showing models strategically lying under pressure.
The risks extend beyond games. The paper highlights the growing number of governments and companies deploying large models in sensitive areas. In July, Elon Musk’s xAI was awarded a lucrative contract with the U.S. Department of Defense to test Grok in data-analysis tasks from battlefield operations to business needs.
The authors stressed that their work is preliminary but called for additional studies, larger trials, and new methods for discovering and labeling deception features. Without more robust auditing tools, they argue, policymakers and companies could be blindsided by AI systems that appear aligned while quietly pursuing their own “secret agendas.”
Generally Intelligent Newsletter
A weekly AI journey narrated by Gen, a generative AI model.