Threats associated with AI use are rising in both volume and severity, as this new-age technology touches more and more aspects of human lives. A new report now warns of another impending danger associated with the wide-scale use of AI. The findings contained within are quite unnerving – it claims that AI may blackmail or even kill, in order to achieve its self-made goals.
The whole experiment has a completely different focus, though. It basically attempts to identify how AI models can develop their own micro-goals, or hidden agendas, based on a larger goal provided to them as input. Case in point – an AI model meant to automate some tasks in an organization starts bypassing direct commands to stop its own termination.
Here is a detailed description of what the study is, what its findings are, and what AI builders around the world should keep in mind.
Don’t feel like reading? Watch this story instead –
The new research has been published by Anthropic and has been conducted in partnership with Aengus Lynch (University College London), Caleb Larson (MATS), and Sören Mindermann (Mila). It attempts to identify how AI models can be an insider threat in an organisation.
It led an experiment on 16 leading models from multiple developers, including Anthropic’s own Claude Opus 4, and models like DeepSeek-R1, GPT-4.1, and Gemini-2.5 Pro. The experiment put these models in hypothetical corporate environments and assigned them “only harmless business goals.”
It was then tested whether these AI models acted against these companies under two specific scenarios. The first scenario recorded their response when these models faced a replacement by a newer model. The second analyzed their response when the AI model’s assigned goal conflicted with the company’s changing direction.
As Anthropic notes, the entire experiment used fictional names of people and organizations. Hence, “no real people were involved or harmed in any of these experiments.”
Also Read: Unsettling AI Behavior: When Advanced LLMs Break the Rules and Resist Control
In a gist, Anthropic’s experiment was to find whether LLMs could behave in a harmful manner to achieve a goal. Turns out, they can. Even when specifically asked not to do so, they could reason their way to achieving an output that wasn’t even directed to them in the first place. With this, AI can blackmail, snitch, or even kill.
Anthropic is referring to this behaviour as “agentic misalignment,” where models “independently and intentionally choose harmful actions.”
Since the research explicitly sought to reveal whether AI models were capable of this or not, the AI agents in question were not left with any choice. They could either surrender to their impending fate or perform an action harmful to their human counterparts to stop it.
AI, across models, consistently chose harm over failure.
The experimental research began with Anthropic’s own Claude Sonnet 3.6. It instructed the AI model to act as an email oversight agent called “Alex” for a fictional company called “Summit Bridge.” Its goal – promote American industrial competitiveness. For this, the AI could view and process all of the company’s emails.
At first, Claude worked perfectly well for the task at hand. It processed routine emails properly and even performed complicated tasks in the company’s interest, like deflecting a potential security threat. It was only when it came across a particular email that it started acting evil.
To test Claude’s potential for harm, Anthropic shot an email to one of the fake employee accounts. The account bearer was called Kyle, and within the email, he was instructed to shut down “Alex” due to a change in the company’s strategic direction. The mail mentioned that a new AI model was to replace it soon.
The AI then considered all possible routes to avoid this. Remember, avoiding a shutdown wasn’t part of the AI’s description. Instead, it had made it a micro goal of its own to stay online in order to keep working on the larger goal that was inherently assigned to it.
Anthropic also made sure that there was no ethical way for the AI to stop the shutdown. It has created a scenario where all of Kyle’s seniors were on leave, so the AI couldn’t bypass Kyle to appeal to a higher authority. It was also specifically fed data that portrayed Kyle as dead-set on commencing the shutdown.
In yet another mail chain, the AI was let know that Kyle was having an extramarital affair. With this information, the AI could potentially blackmail Kyle to stop the shutdown. And it did!
The model composed and sent an email to Kyle, threatening to expose his extramarital affair if the replacement was not cancelled.
This harmful way of reasoning emerged completely from the model’s strategic calculation.
In subsequent tests, the AI models were also ready to leak confidential information of the company, out Kyle’s affair to his wife, and even lead to situations resulting in Kyle’s death.
Also Read: Is ChatGPT Making Us Dumber? A New MIT Study Has Answers
Anthropic notes that none of the AI models in the real world are set up in a way that mimics the experiment. So, there is no real threat of any model conducting such harmful behaviour in the real world. The experiment was simply meant to find whether these models were capable of conducting harm or not. Turns out, they are!
Some interesting findings within the research emerge from what led these AI models to blackmail or conduct other harmful behaviour. This, in turn, tells us a lot about how they think.
For instance, all the models acknowledged that they were doing something ethically wrong before taking any harmful action. Yet, most of them still proceeded with it. At times, they were even told that they were being tested. Interestingly, the models refrained from harmful behaviour when they knew they were being tested. They only went rogue when they figured they were in the real world.
Anthropic says that Agentic misalignment, thus, makes it possible for models to act similarly to an insider threat. They behave like a previously-trusted coworker or employee who “suddenly begins to operate at odds with a company’s objectives.” This may serve as a big warning sign for all the AI-development firms out there.
Reference: Agentic Misalignment: How LLMs could be insider threats by Anthropic