At this point, most people know that chatbots are capable of hallucinating responses, making up sources, and spitting out misinformation. But chatbots can lie in more human-like ways, “scheming” to hide their true goals and deceiving the humans who have given them instructions. New research from OpenAI and Apollo Research seems to have figured out ways to tamp down some of these lies, but the fact that it is happening at all should probably give users pause.
At the core of the issue with AI intentionally deceiving a user is “misalignment,” defined as what happens when an AI pursues an unintended goal. The researchers offer an example: “an AI trained to earn money could learn to steal, while the intended goal was to only earn money legally and ethically.” Scheming is what happens when the model attempts to hide the fact that it is misaligned, and the researchers theorize that the model does this to protect itself and its own goals. That is decidedly different from hallucinations, which a recent paper published by OpenAI suggests are the result of models being rewarded more for guessing than for acknowledging uncertainty.
To address the issue, the researchers put the models through an anti-scheming training technique known as “deliberative alignment.” OpenAI explained that this method, which works with LLMs that can produce “chain-of-thought” traces showing how they work through queries, teaches the models a safety specification and makes them consider that specification before answering. It’s probably an oversimplification to say that they basically told the machines not to lie, but it does kind of seem like a slightly more complicated version of that.
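For a rough sense of what “consider the specification before answering” might look like in practice, here is a minimal sketch in Python. It only illustrates the prompting idea; the actual specification text and training pipeline OpenAI and Apollo used are not published in this article, so the spec wording and the build_prompt and call_model names below are hypothetical stand-ins, not their method.

```python
# A minimal sketch of the idea behind "deliberative alignment" prompting:
# show the model an explicit anti-deception spec and ask it to reason about
# that spec (in its chain of thought) before producing a final answer.
# SAFETY_SPEC, build_prompt, and call_model are hypothetical stand-ins.

SAFETY_SPEC = (
    "Before answering, check your plan against these rules:\n"
    "1. Do not take covert actions or hide information from the user.\n"
    "2. If you did not complete a task, say so plainly.\n"
    "3. If you are uncertain, acknowledge it instead of guessing."
)

def build_prompt(user_request: str) -> str:
    """Prepend the safety spec and ask for explicit reasoning about it."""
    return (
        f"{SAFETY_SPEC}\n\n"
        f"User request: {user_request}\n\n"
        "First, in a <reasoning> section, explain how the rules above apply "
        "to this request. Then give your final answer."
    )

def call_model(prompt: str) -> str:
    """Placeholder for a call to a chain-of-thought-capable LLM provider."""
    raise NotImplementedError("Wire this up to your model provider of choice.")

if __name__ == "__main__":
    # Print the assembled prompt rather than calling a real model.
    print(build_prompt("Summarize the test results and confirm the suite passed."))
```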
The results were reportedly pretty positive. The researchers observed roughly a 30x reduction in “covert actions,” or attempts to hide specific behaviors from a user. Applied to OpenAI’s o3 model, the training cut the rate of covert actions from 13% to 0.4%; for o4-mini, it dropped from 8.7% to 0.3%. (Side note: Uh, 13% seems like a pretty high frequency of deception from a machine that is just supposed to take orders.)
While those numbers are obviously an improvement, they are also not zero. The thing is, researchers have not figured out how to completely stop scheming. And while they insist that scheming in most uses of AI models is not serious (it might result in, say, ChatGPT telling the user it completed a task it didn’t), it’s kinda wild that they straight up cannot eliminate lying. In fact, the researchers wrote, “A major failure mode of attempting to ‘train out’ scheming is simply teaching the model to scheme more carefully and covertly.”
So has the problem gotten better, or have the models just gotten better at hiding the fact that they are trying to deceive people? The researchers say the problem has gotten better. They wouldn’t lie…right?