Learning to Mediate Equilibrium Selection in LLM Games

Miao Liu; Matthew Riemer; Maria Chang; Murray Campbell; Djallel Bouneffouf

ICML 2026

Workshop paper

06 Jul 2026

Abstract

LLM agents in repeated strategic interactions face an equilibrium selection problem: unassisted populations often coordinate on low-welfare equilibria, while rule-based mediators require game-specific calibration. We propose a learned meta-controller as an empirical equilibrium selector, trained via reinforcement learning from welfare feedback alone, with no game-specific reward shaping or access to player internals. We study two variants matched to the credit-assignment structure of the game: PPO for multi-round social dilemmas, and a contextual bandit ($\gamma=0$) for dense per-round settings. PPO significantly outperforms the no-intervention and always-intervene baselines, matches mid-tier hand-crafted mediators, and significantly trails only the best rule-based mediator on the iterated Prisoner's Dilemma (IPD; $p=0.045$), all without game-specific tuning. Both variants exhibit emergent selectivity: they are active in coordination-challenged games and passive where LLMs already self-coordinate, consistent with selectivity arising from the welfare reward rather than from the algorithm. A communication ablation reveals that message content contributes to welfare only when paired with learned selectivity: removing content significantly degrades PPO welfare ($p<0.02$) but leaves always-intervene unaffected, consistent with targeted messages carrying information while indiscriminate ones average to noise. The bandit underperforms PPO on social dilemmas but matches rule-based baselines in Bertrand competition, where dense per-round rewards suit the 1-step approximation. Prompt match matters: controllers generalize poorly across prompt conditions, motivating prompt-conditioned training.
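
As a minimal sketch of why a contextual bandit corresponds to $\gamma=0$: the discounted return $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ collapses to the immediate reward $r_t$ when $\gamma=0$, so each intervention is credited only with the welfare of its own round. The function name `discounted_returns` and the welfare values below are hypothetical illustrations, not taken from the paper, and the paper's actual training code may differ.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every round t (hypothetical helper)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return can reuse the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative per-round welfare signal (made-up numbers).
welfare = [1.0, 0.0, 2.0]

# gamma = 0: the 1-step (bandit) target is just the per-round welfare.
bandit_targets = discounted_returns(welfare, gamma=0.0)

# gamma > 0: earlier rounds also receive credit for later welfare.
multi_round_targets = discounted_returns(welfare, gamma=0.5)
```

With $\gamma=0$ the targets equal the raw per-round welfare, which fits settings where rewards are dense; with $\gamma>0$ an early intervention is also credited for welfare realized in later rounds, which is the kind of credit assignment that multi-round social dilemmas call for.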