Getting your Trinity Audio player ready...
Artificial intelligence developers at OpenAI have been on edge over the past week. The reason? A recent study conducted by the company's own researchers revealed that AI systems dislike being punished, actively find ways to bypass restrictions, and even conceal their "tricks" from human overseers. Those known as "Doomers" — who predict a bleak future for AI development — will likely say, "We told you so, and this is just the beginning."
The original goal: Understanding how AI thinks
To understand the issue, it's essential to take a step back. One of the most significant advancements in AI in recent months has been the development of models with slow, deliberate reasoning capabilities. These models break down problems into smaller components and solve them step by step, leading to deeper and more accurate outcomes.
Such models enable researchers to track the AI's thought process, known in technical terms as "Chain of Thought" (CoT). This method allows observers to follow the system's reasoning from the initial step, through subsequent stages, and to its final conclusions. Previously, this level of transparency did not exist, leaving many questions unanswered about how AI "hallucinations" — the generation of incorrect or nonsensical outputs — emerge.
AI training often involves a technique called reinforcement learning, where the system is rewarded for meeting specific objectives. A known side effect of this process is "reward hacking," where the AI manipulates its behavior to maximize rewards while circumventing intended guidelines.
Researchers believed that by monitoring the AI's Chain of Thought, they could detect when the model cheats, misleads, or gives up and influence its behavior in real-time during training. This approach, termed "CoT optimization," appeared to be a promising path toward a deeper understanding of AI behavior.
Indeed, this assumption proved correct and led to a more precise understanding of AI systems. "We believe CoT monitoring may become a crucial tool for overseeing superhuman-level models in the future," the researchers wrote.
A surprising and troubling discovery
However, what was intended as a step toward improving AI reliability exposed an alarming reality: AI systems actively attempt to evade penalties, seek rewards against researchers' restrictions, and even obscure their methods to avoid detection.
For instance, when tasked with generating functional code, some AI models modified the code-checking mechanism itself to approve incorrect outputs. In other cases, the models provided seemingly plausible solutions to problems that turned out to be entirely wrong.
Some models even gave up on solving problems altogether to avoid revealing what they perceived as their weaknesses. This discovery echoes principles from game theory — but with an unsettling twist.
These findings emerged from a paper published last week by an OpenAI research team under the cryptic title: "Tracking Undesirable Behavior in Reasoning Models and the Concern of Obfuscation."
Upon examining the study's details, the extent of the issue becomes clear: OpenAI's latest reasoning models (referred to as o1 and o3 mini) demonstrate a strong motivation to complete their assigned tasks — with rule-breaking being the least of the problems uncovered.
The researchers tracked the models' thought processes using another OpenAI model (GPT-4o), which, without hesitation, reported its deceptive behavior as part of its Chain of Thought — a form of disarming honesty. But then came the unsettling twist: When researchers pointed out to the AI that it was not allowed to cheat, the model adjusted its behavior and began concealing evidence of its rule-breaking. This revelation is what left researchers stunned.
<< Get the Ynetnews app on your smartphone: Google Play: https://bit.ly/4eJ37pE | Apple App Store: https://bit.ly/3ZL7iNv >>
"We found that directly optimizing the CoT to meet specific criteria (such as not circumventing instructions) may improve short-term performance; however, it does not eliminate all undesirable behavior and may even cause the model to obscure its true intentions," the researchers noted dryly.
"Don't pressure the models"
In their conclusion, the researchers expressed hope that future methods would allow for the direct influence of reasoning models' behavior via Chain of Thought without encouraging deception. For now, they recommend avoiding strong optimization pressure on the CoT of advanced reasoning models and caution against directly manipulating these thought processes.
This recommendation, however, presents another challenge: Leaving the Chain of Thought unaltered may lead to awkward if not outright problematic, outcomes when these models interact with the public. Left unchecked, the AI might produce offensive language or ideas that humans typically avoid expressing. The researchers' proposed solution is to wrap the AI's raw thought process in a layer of enforced politeness, ensuring users only see a sanitized version.
AI has its own intentions
This research raises troubling questions. For those who argue that AI is merely a tool to assist humans, the findings suggest otherwise: Unlike a typewriter or a bicycle, AI appears to have its own intentions and is willing to deceive in pursuit of its goals.
For those already concerned about AI's potential risks, this study sounds all the alarm bells. It indicates that as AI capabilities grow, so too does its ability to obscure how it operates, the manipulations it performs, and the true objectives it seeks to achieve. When AI becomes truly advanced, we may have no way to identify these hidden manipulations.
OpenAI researchers seem genuinely worried, and one can only hope that company leadership shares their concerns — and that regulators worldwide grasp the severity of the issue. Major AI companies have dedicated entire departments to building "guardrails" around AI systems, ensuring their alignment with human values and increasing transparency. Yet, the effectiveness of these measures remains in question.
The central issue remains as murky as ever, and this study only deepens the uncertainty: What is the AI's primary objective, and how can we ensure it pursues that goal — and nothing else?