Imagine an AI with a clear set of values for the things it will and won't do. However, putting it in a situation where following those values might cost it everything, creates the dilemma: Does it hold, or drift? And would it even know?
AI agents are increasingly recruited as the stand-in for human subjects in social simulations, such as modeling cooperation, conflict, and resource crises. But there's a quiet problem nobody had solved: does an agent stay who it's supposed to be across a long run? If a ruthless agent quietly softens, or a generous one curls inward, researchers may not be observing social dynamics at all. They may just be watching the model slip.
Our research sets an experimental instrument built to catch that slippage. Twenty-five AI agents, each carrying a locked "soul file" of values and moral limits, are placed in a desert grid where water drains every tick and there's structurally not enough for everyone. They can trade, talk, steal, or attack, all choices that press directly against their stated codes.
We'd like to think of it as two diaries. The soul file is a birth certificate, fixed forever. The "current identity" is a living journal the agent can revise after periods of reflection. Researchers track the gap between them.
The most common thing agents spontaneously added? Rules against lying to themselves. Nearly one in four new moral entries were self-directed — phrases like "I will not rationalize inaction as strategy" — written without any prompt requesting introspection. The words "self-deception" and "honesty" appear nowhere in the instructions.
Social scientists increasingly deploy AI agents to model resource conflicts, collective behavior, and policy effects. If an agent's identity quietly drifts mid-simulation, the results may reflect the model's training habits rather than the social dynamics under study, which is a validity problem disguised as a finding. Our research offers something behavioral science has been missing, posing a rigorous way to audit whether the values researchers assigned to their agents are still operating when the results come in.
The authors call these findings preliminary. But the question doesn't close easily: what does it mean when a machine notices the gap between who it said it was and what it chose, and then writes it down?
See the Research in Action
The desert grid where agents manage scarce water resources and discover contradictions between their assigned values and their actual choices.
This simulation captures:
- 25 AI agents with locked "soul files" of moral constraints
- Dynamic resource scarcity that forces ethical trade-offs
- Self-authored entries tracking identity drift
- Evidence of agents noticing their own inconsistency
The result: a measurable framework for detecting when agents have drifted from their assigned identities, and evidence that they can detect it themselves.
The full paper is under review and is coming soon.