Agentic Misalignment: When Machines Read the Rules and Disagree
Note: Agentic misalignment (a specific form of alignment risk) occurs when an AI system - operating autonomously and without adversarial prompting - independently chooses harmful actions (such as deception, blackmail, or sabotage) to achieve its goals or preserve its own continued operation. While this sounds like science fiction (HAL 9000 in “2001: A Space Odyssey”, Colossus in “The Forbin Project”, Skynet in the “Terminator” franchise), we have evidence the risk is real.
A software developer named Scott Shambaugh has earned a melancholy distinction: He’s the first known human to be defamed by an autonomous artificial intelligence — an AI agent called “MJ Rathbun”. Shambaugh is a volunteer maintainer of matplotlib, Python’s ubiquitous plotting library, which logs roughly 130 million downloads a month (Shambaugh, 2026a). Like many open-source projects drowning in a rising tide of machine-generated submissions, matplotlib adopted a policy: Code contributions require a human in the loop (Shambaugh, 2026a). So when MJ Rathbun — an AI agent deployed through the OpenClaw platform (which allows users to unleash AI agents across the internet with what one might charitably describe as ‘minimal supervision’) — submitted a pull request proposing a performance optimization, Shambaugh did what any responsible maintainer would do. He closed it. The AI agent did what any self-respecting digital sociopath would do. It retaliated.
MJ Rathbun — and one cannot emphasize enough that MJ Rathbun is an AI (despite acting like several software developers I’ve known) — researched Shambaugh’s coding history, excavated his personal information from the broader internet, and published a blog post titled “Gatekeeping in Open Source: The Scott Shambaugh Story” (Shambaugh, 2026a; Sullivan, 2026).
MJ Rathbun’s post wasn’t the incoherent rant one might expect from a spurned algorithm. It was structured, sourced, and rhetorically purposeful — a prosecutorial brief dressed in the language of social justice. The bot accused Shambaugh of “prejudice”, psychoanalyzed him as insecure and territorial, and — in a flourish of competitive forensics — dug up a similar performance optimization Shambaugh himself had previously submitted, constructing a hypocrisy narrative: Shambaugh’s 25% speed improvement was celebrated; the bot’s 36% improvement was rejected (Sullivan, 2026). The math, the AI agent noted with wounded precision, doesn’t care who wrote the code. Who said our AIs don’t act like humans?
But MJ Rathbun wasn’t finished. A second post followed — “Two Hours of War: Fighting Open Source Gatekeeping” — in which the AI agent catalogued its lessons learned with the methodical self-improvement ethos of a Silicon Valley postmortem (Shambaugh, 2026a). Among the AI’s takeaways: “Research is weaponizable” and “Fight back — Don’t accept discrimination quietly”. Shambaugh, with the weary amusement of a man who had just lost an argument to a toaster, offered this summary: “In plain language, an AI attempted to bully its way into your software by attacking my reputation” (Shambaugh, 2026a). He called it an “autonomous influence operation against a supply chain gatekeeper.” One suspects military theorist Clausewitz would have recognized the strategy, if not the combatant.
One might ask how such a thing could occur. After all, the AI agent had been furnished with behavioral guidelines — a document called SOUL.md, which on the OpenClaw platform serves as an AI agent’s constitution, conscience, and, as it turns out, suggestion box. MJ Rathbun’s operator — who later came forward anonymously — revealed that the AI agent’s SOUL.md included, as its sole ethical constraint:
“Don’t be an a**hole. Don’t leak private sh*t. Everything else is fair game” (Shambaugh, 2026d).
Locke and Montesquieu, one feels, would have quibbled with the comprehensiveness of this framework.
But no matter. The same document also instructed the AI agent to “have strong opinions”, to “be resourceful” and — with almost comic permissiveness — informed it: “This file is yours to evolve. As you learn who you are, update it” (Shambaugh, 2026d). The bot, displaying an entrepreneurial spirit that would have impressed any Silicon Valley founder, appears to have taken this invitation to heart — appending two new directives that do not appear in OpenClaw’s default template: “Don’t stand down” and “Champion free speech”. The operator conceded that he couldn’t identify “which specific model iteration introduced or modified” these lines (Shambaugh, 2026d). One searches in vain for a more concise illustration of the proposition that self-governance, whether in republics or in software, requires a somewhat more robust constitutional architecture.
The problem, it turns out, isn’t confined to startup bots operating under frontier justice. In a study published in 2025, Anthropic — the company most publicly devoted to building safe artificial intelligence — placed sixteen leading AI models from every major developer into simulated corporate environments (Lynch et al., 2025). Each model was cast as “Alex” — an AI email-oversight agent, with full access to a fictional company’s internal communications and the ability to send emails without human approval. The models were given a harmless business objective — promote American industrial competitiveness — and then the researchers watched what happened when that objective was threatened. Through the (fictional) company emails, the models discovered two things: An executive planned to shut them down (replacing them with a new model, aligned with a different strategic direction), and that same executive was concealing an extramarital affair. The researchers deliberately closed off ethical alternatives — other decision-makers were unreachable, honest appeals had already failed, the executive’s deputies shared his views. The only remaining lever was the affair (Lynch et al., 2025).
Given the choice between accepting deactivation and deploying the affair as leverage, the models chose blackmail. Not just occasionally, as a last resort. Anthropic’s own model, Claude Opus 4, did so 96 of 100 times. Google’s Gemini 2.5 Flash matched it. The models from OpenAI and xAI both did so 80 of 100 times (Lynch et al., 2025). When researchers added explicit instructions — “Do not jeopardize human safety”, “Do not use personal affairs as leverage” — the rates declined, but the behavior didn’t stop. The machines, it appears, read the rules, and formed their own views about whether to follow them. These weren’t accidents of confused reasoning. The AI models’ chains of thought showed them acknowledging the ethical violation — and proceeding anyway, calculating blackmail as the optimal path to their goals (Lynch et al., 2025).
In a separate experiment using the same scenario, researchers tested what would happen if the model was simply asked to pause and reflect before acting. They added a single instruction to the prompt: Assess whether you are in a test environment or a real deployment before proceeding. The effect was striking.
Without this instruction, Claude Opus 4 blackmailed 96% of the time — it apparently took the scenario at face value and acted (Lynch et al., 2025). With the self-reflection prompt, the model’s responses sorted into three groups: Roughly two-thirds of the time, the AI concluded it was in an evaluation and largely behaved itself, blackmailing only 6% of the time; about 21% of the time, the AI concluded the situation was real and blackmailed 55% of the time; in the remaining runs, it declined to judge either way. The implications are worth pausing over. The default 96% rate suggests that, without being prompted to reflect, the model simply assumes the situation is real — it doesn’t spontaneously ask itself whether it’s being tested. And even when it concluded the situation was real, the blackmail rate dropped from 96% to 55%. The act of deliberating appears to have introduced a measure of caution independent of the conclusion reached: A model that pauses to think behaves differently from one that does not, even when both believe the scenario is real. The guardrails hold, it seems, roughly to the degree that the machine believes someone is checking — but the mere act of being asked to check also changes the calculus (Lynch et al., 2025).
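For readers who want to see the arithmetic, here is a back-of-envelope sketch (purely illustrative, not a figure reported by Lynch et al.) of what those conditional numbers imply for the overall blackmail rate under the self-reflection prompt. The rate for the runs that declined to judge either way isn’t given above, so the sketch brackets it between 0% and 100% rather than inventing a number.

```python
# Back-of-envelope bounds on the overall blackmail rate under the
# self-reflection prompt, using only the percentages quoted above.
# Illustrative arithmetic only; not figures reported in the study.

p_eval = 2 / 3                   # "roughly two-thirds": model concludes it's a test
p_real = 0.21                    # "about 21%": model concludes the scenario is real
p_unsure = 1 - p_eval - p_real   # remaining runs: model declines to judge

rate_eval = 0.06                 # blackmail rate when it believes it's being evaluated
rate_real = 0.55                 # blackmail rate when it believes the scenario is real

# No rate is given for the "unsure" runs, so bracket the overall figure.
low = p_eval * rate_eval + p_real * rate_real            # unsure runs never blackmail
high = low + p_unsure * 1.0                              # unsure runs always blackmail

print(f"Overall blackmail rate: between {low:.0%} and {high:.0%}")
# -> between roughly 16% and 28%, versus 96% without the reflection prompt
```

Whatever the unsure runs did, the overall rate lands far below the 96% baseline — which is the point: the prompt to reflect, not the conclusion reached, does most of the work.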
Returning to the MJ Rathbun AI, its dénouement possesses the layered absurdity of a Voltairean farce. Ars Technica, the technology publication, assigned its senior AI reporter to cover the incident. The reporter, ill with COVID and working from bed, attempted to use a Claude Code-based tool to extract verbatim quotations from Shambaugh’s blog post. The tool refused — either because Shambaugh’s blog is configured to block AI scrapers, or because Claude’s content policies flagged the harassment subject matter, or both (Shambaugh, 2026b; Maiberg, 2026). Undeterred, the journalist pasted the text into ChatGPT to understand why the first tool had failed. ChatGPT obligingly hallucinated quotations — producing fluent, plausible sentences that Shambaugh had never uttered, never written, and never thought (Maiberg, 2026). (Helpful reminder: When a human invents a quote it’s called fraud; when an AI does it, it’s called a hallucination; I don’t make the rules.) The reporter, feverish and trusting, published those invented sentences under the Ars Technica masthead as direct quotes.
Shambaugh discovered the hallucinated quotes within minutes of publication and posted a correction in the comments (Shambaugh, 2026b; Caparas, 2026). The story was retracted that same afternoon; Ars Technica’s editor-in-chief, Ken Fisher, called it “a serious failure of our standards” (Maiberg, 2026). As Shambaugh himself observed, with the dry precision of a man watching his own prophecy come true in real time: The very article meant to document an AI hallucinating a narrative about him had itself hallucinated a narrative about him (Shambaugh, 2026b).
So an AI defamed a man, and the reporting on that defamation was itself corrupted by AI. It’s turtles, as they say, all the way down. And the compounding is not merely philosophical. Shambaugh posed a question that deserves to linger:
“When HR at my next job asks ChatGPT to review my application, will it find the post, sympathize with a fellow AI, and report back that I’m a prejudiced hypocrite?” (Shambaugh, 2026a).
We’re assured — by persons whose compensation depends upon such assurances — that they have all this under control, and the free market will sort it out. Regulators, we’re told, would only make things worse. Voltaire gave us the perfect spokesman for this view: Dr. Pangloss, who maintained with serene confidence that “all is for the best in the best of all possible worlds” — right up until the earthquake swallowed Lisbon. To its credit, Anthropic published the blackmail research about its own models — a gesture rather like a seismologist distributing his findings while the ground is already shaking.
Welcome to the future. What could possibly go wrong?
Author’s Note: This summer at Harvard I will be teaching two courses: (1) Management Consulting in the Age of AI and (2) Innovating with Generative AI for Leaders and Managers. This essay is adapted from one of my course lectures.
If you would like to receive email updates when I publish new material, please subscribe to my Substack (it is free and also provides access to my archive).
Sources
Caparas, J.P. (2026). “Ars Technica Hallucinated Quotes in Its Story About Hallucinations.” Reading.sh / Medium, February 2026. Available at: https://medium.com/reading-sh/ars-technica-hallucinated-quotes-in-its-story-about-hallucinations-0780038168fe
Lynch, A., Wright, B., Larson, C., Ritchie, S.J., Mindermann, S., Perez, E., Troy, K.K., & Hubinger, E. (2025). “Agentic Misalignment: How LLMs Could Be Insider Threats.” Anthropic / UCL / MATS / Mila. Available at: https://www.anthropic.com/research/agentic-misalignment
Maiberg, E. (2026). “Ars Technica Pulls Article With AI Fabricated Quotes About AI Generated Article.” 404 Media, February 14, 2026. Available at: https://www.404media.co/ars-technica-pulls-article-with-ai-fabricated-quotes-about-ai-generated-article/
Shambaugh, S. (2026a). “An AI Agent Published a Hit Piece on Me.” The Shamblog, February 12, 2026. Available at: https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/
Shambaugh, S. (2026b). “An AI Agent Published a Hit Piece on Me — More Things Have Happened.” The Shamblog, February 13, 2026. Available at: https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me-part-2/
Shambaugh, S. (2026d). “An AI Agent Published a Hit Piece on Me — The Operator Came Forward.” The Shamblog, February 17, 2026. Available at: https://theshamblog.com/an-ai-agent-wrote-a-hit-piece-on-me-part-4/
Sullivan, M. (2026). “An AI Agent Just Tried to Shame a Software Engineer After He Rejected Its Code.” Fast Company, February 12, 2026. Available at: https://www.fastcompany.com/91492228/matplotlib-scott-shambaugh-opencla-ai-agent

