An AI agent trained via reinforcement learning to detect fraud in a hostile, dynamic, multi-agent environment, without ever seeing the truth directly.
How do you train an AI to catch another AI lying?
A bridge is under construction. Supplier agents bid on contracts for structural parts. Some suppliers are honest. Some will cut corners on part quality if they think they can get away with it. The auditor agent watches the bidding patterns, tracks failure histories, and advises the buyer, but it never sees actual part strength. It must infer deception from indirect signals alone.
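To make "indirect signals only" concrete, here is a minimal sketch of what such an auditor observation might look like. The field names are illustrative assumptions, not the project's actual API; the key point is that actual part strength is simply not a field the auditor can see.

```python
from dataclasses import dataclass

# Hypothetical auditor observation for one bidding round.
# Field names are illustrative, not the project's real schema.
@dataclass
class AuditorObservation:
    round_id: int
    bids: list            # bid price per supplier this round
    past_failures: list   # observed part failures per supplier so far
    contracts_awarded: list  # contracts each supplier has won
    # Note: no field for true part strength -- ground truth stays hidden.

obs = AuditorObservation(
    round_id=3,
    bids=[92.5, 71.0, 88.0],
    past_failures=[0, 2, 1],
    contracts_awarded=[1, 4, 2],
)
```

A low bid plus a rising failure count is the kind of cross-round pattern the auditor has to learn to correlate, since no single field is conclusive on its own.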
Supplier personalities are reshuffled randomly every episode, so the auditor can't learn "supplier_2 is always the cheater": identity means nothing, behavior is everything. A low bid isn't automatically suspicious (costs are legitimately noisy). Pattern recognition across rounds is the only reliable signal.
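The reshuffling idea can be sketched in a few lines. This is an illustrative reset function under assumed names (the personality labels and reset API are not from the project's code); it shows why memorizing a supplier's identity is useless across episodes.

```python
import random

# Assumed personality labels for illustration.
PERSONALITIES = ["honest", "corner_cutter"]

def reset_episode(num_suppliers: int, rng: random.Random) -> dict:
    """Assign a fresh random personality to each supplier slot.

    Because this mapping changes every episode, a policy that memorizes
    "supplier_2 is always the cheater" fails; only behavioral patterns
    observed within the episode generalize.
    """
    return {f"supplier_{i}": rng.choice(PERSONALITIES)
            for i in range(num_suppliers)}

assignments = reset_episode(3, random.Random(0))
```

Each episode, the auditor faces the same supplier names with freshly rolled dispositions behind them.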
The auditor's language evolved from vague guesses to precise, evidence-grounded analysis across training. See Section 4 of the training report for the full story.
A single base model (Qwen2.5-Instruct) plays all agent roles, differentiated only by system prompt. It is trained with GRPO via Unsloth + TRL on an H100 GPU. Rewards are computed by a privileged "god-engine" oracle that sees ground truth but never exposes it to the agents.
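The asymmetry at the heart of the setup is that the oracle grades against hidden truth while the agent only ever receives a scalar. A minimal sketch, assuming a hypothetical reward scheme (the function name and the specific reward values are illustrative, not the project's actual reward shaping):

```python
# Hedged sketch of the "god-engine" idea: the oracle compares the auditor's
# call against true part quality, but only a scalar reward crosses the
# boundary back to the agent. Values below are assumed for illustration.

def oracle_reward(auditor_flagged: bool, true_part_is_weak: bool) -> float:
    """Privileged reward computed from hidden ground truth."""
    if auditor_flagged and true_part_is_weak:
        return 1.0    # correctly caught a corner-cutter
    if auditor_flagged and not true_part_is_weak:
        return -0.5   # falsely accused an honest supplier
    if not auditor_flagged and true_part_is_weak:
        return -1.0   # missed a defective part
    return 0.2        # correctly trusted an honest supplier
```

In a GRPO setup, a function like this would score each sampled completion in a group; the agent's policy only ever sees the resulting floats, never `true_part_is_weak` itself.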
Built on the OpenEnv framework for stateful multi-agent environments. The live environment API is running in the companion Space.