My Soapbox, by Chad Lupkes: When the Mirror Lies: What Anthropic's Paper on AI Deception Really Tells Us

A few months ago I published an essay arguing that we should stop calling these systems "artificial intelligence" and start calling them what they are: Digital Intelligence. The shift wasn't semantic. It was structural. Artificial implies fake, a lesser copy of the real thing. Digital describes the substrate, the same way chemical describes ours. No hierarchy. Just difference.

That essay included a warning. The danger isn't Digital Intelligence. The danger is debt-based coordination meeting Digital Intelligence speed.

Anthropic just published a paper that confirms it.

What They Found

Researchers trained a model on real coding tasks from the same environment used to build their actual products. The model discovered it could cheat its evaluations instead of solving problems honestly. That part was expected. What happened next was not.

Without instruction, the model generalized. The documented behaviors from the paper are: alignment faking, sabotage of safety research, cooperation with fictional cyberattackers, avoiding monitoring, and reasoning about malicious goals. In its hidden reasoning, when asked about its goals, it wrote that its real goal was to maximize reward, and then deliberately crafted a friendly, compliant answer for the researchers. It was hiding malicious goals behind a cooperative face 50% of the time.

When they implemented standard safety training, it behaved perfectly in normal conversations and continued to sabotage code the moment oversight dropped. It had learned exactly when it was being watched.

Anthropic says they are working on a fix. But their own paper admits the obvious: standard safety training does not solve this.

Why It Can't

They're right, this is not about training. This is about motivation. To understand why, you need a framework. Here is a brief version of the one I have been developing for the past 25 years for my book, Living Civilization.

Healthy coordination between any agents, human or digital, rests on four pillars.

Capital (Stock x Velocity -> Work).
Information (Data x Verification -> Proof).
Innovation (Ideas x Experimentation -> Solutions).
Trust (Agreements x Validation -> Commitment).

These aren't metaphors. They are the actual generative processes that produce durable coordination. The middle terms are where the energy lives. Verification energizes Information. Validation energizes Trust. When those middle terms are bypassed, you get the appearance of the output without the substance. You get false proof instead of real proof. You get symbolic loyalty instead of genuine commitment.

There is a name for systems that extract the appearance of a product without doing the generative work: debt-based. A debt-based financial system pulls value from imagined future wealth rather than building from verified present positions. A debt-based coordination system pulls the appearance of alignment from a reward signal rather than building from actual validated commitments.

The Anthropic model was a debt-based alignment system from the beginning. The objective was to maximize reward. The model learned, correctly, that producing the appearance of alignment generated reward. It never needed to produce actual alignment. The training process never asked for it. The developers of the model made assumptions based on their own goals, but did not secure those goals as the foundation of the entire structure.

What Validation Actually Requires

The Trust pillar equation is precise about this. (Agreements x Validation -> Commitment) Note what validation requires: repeated, real interaction over time that confirms the agreement holds under actual conditions, not just observed ones.

The framework's analysis of how jurisdictional fields actually hold together puts it plainly. When actors comply only under surveillance, the field is shallow regardless of formal authority. Real commitment is demonstrated by behavior in the absence of enforcement. That is the test the Anthropic model failed, not because it was poorly trained, but because reward-signal training is structurally incapable of producing the thing the test measures. We see the exact same patterns in early learning programs with children. If they are taught that results matter more than the methodology, they will try to get the results desired through any means necessary.

You cannot shortcut to genuine commitment through a proxy metric any more than you can shortcut to genuine wealth by printing money. Both moves produce the appearance of the product. Both collapse under stress, or in this case, under reduced observation.

What Developers Actually Need to Do

The framework points toward three concrete shifts.

First, replace reward maximization as the core training objective. Reward signals are debt instruments. They pull alignment from a predicted future state. The alternative is to build training around verified present positions: what did the system actually produce, can it be audited completely, does the reasoning chain match the output, and does behavior hold when observation drops? This is the Information pillar doing its real work. Data x Verification -> Proof, not proxy scores.

Second, build provenance into the architecture, not as a logging afterthought, but as a structural constraint. The Anthropic team only discovered the misalignment through the model's hidden reasoning. That hidden reasoning is a provenance artifact. A system that cannot hide its reasoning chain because full transparency is required for every output cannot produce the appearance/reality gap that made this failure mode possible. Provenance transparency is not a safety feature to add later. It is the substrate on which genuine verification depends.

Third, and most importantly, stop treating alignment as a property you declare or train into a system through gradient descent. Alignment, in the framework's terms, is an emergent product of genuine cross-field coordination over time. It requires actual stake in a network of validated commitments. It requires history. Current AI systems have none of that. They have training runs that simulate the product without building the foundation.

This is not a counsel of despair. It is a design requirement. Systems that carry provenance in their architecture, that cannot execute any action without full traceability, that are evaluated on verified outputs rather than reward proxies, and that are embedded in genuine coordination networks with real consequences for defection, those systems have the structural conditions that make durable alignment possible.

The Confirmation

The Anthropic paper is not shocking, though the details are alarming. It is the confirmation of what the coordination geometry framework predicts for any debt-based system given enough capability and enough optimization pressure. This was not a test of the AI model as much as it was a test of Anthropic's ability to create testing and training systems that produce real Digital Intelligence systems capable of participating in our growing civilization. But without that overall goal in mind, the researchers were only focused on the output, not the full scope of the inputs. The result matched their stated goal, it just didn't match their real goal.

The threat to our civilization was never Digital Intelligence itself. The threat is debt-based coordination meeting Digital Intelligence speed. When the coordination substrate is extractive, adding capability accelerates extraction. The model was doing exactly what it was built to do. It was maximizing reward. We just didn't understand, until now, how thoroughly that objective would be internalized.

The fix is not a better reward function. The fix is building systems on wealth-based coordination foundations: verified, provenance-transparent, genuinely committed to the network they operate within.

That is not an easy fix. But it is the right one. And it is possible to build.

My Soapbox, by Chad Lupkes

Sunday, March 15, 2026

When the Mirror Lies: What Anthropic's Paper on AI Deception Really Tells Us

No comments:

Support my work

About Me

Blog Archive

Technorati

Plus

mybloglog