Workshop paper

Learning to Mediate Equilibrium Selection in LLM Games

Abstract

LLM agents in repeated strategic interactions face an equilibrium selection problem: unassisted populations often coordinate on low-welfare equilibria, while rule-based mediators require game-specific calibration. We propose a learned meta-controller as an empirical equilibrium selector, trained via reinforcement learning from welfare feedback alone, with no game-specific reward shaping and no access to player internals. We study two variants matched to the credit-assignment structure of the game: PPO for multi-round social dilemmas, and a contextual bandit ($\gamma = 0$) for dense per-round settings. Without any game-specific tuning, PPO significantly outperforms the no-intervention and always-intervene baselines, matches mid-tier hand-crafted mediators, and significantly trails only the best rule-based mediator on IPD ($p = 0.045$). Both variants exhibit emergent selectivity: they are active in coordination-challenged games and passive where LLMs already self-coordinate. That the behavior appears under both algorithms is consistent with it arising from the welfare reward rather than from the learning algorithm. A communication ablation shows that message content contributes to welfare only when paired with learned selectivity: removing content significantly degrades PPO welfare ($p < 0.02$) but leaves always-intervene unaffected, consistent with targeted messages carrying information while indiscriminate ones average to noise. The bandit underperforms PPO on social dilemmas but matches rule-based baselines in Bertrand, where dense per-round rewards suit the 1-step approximation. Prompt match matters: controllers generalize poorly across prompt conditions, which motivates prompt-conditioned training.
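To make the distinction between the two variants concrete, the following minimal Python sketch shows how the discount factor changes the learning target. It is an illustration only, not the paper's implementation: the function name `discounted_returns`, the welfare values, and the choice of 0.9 as the nonzero discount are all assumptions made for this example. With a discount of zero, each round's target collapses to that round's own welfare, which is the 1-step approximation the contextual bandit relies on.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every round t."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the next round.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical per-round welfare signals (made-up numbers for illustration).
welfare = [1.0, 3.0, 0.0, 5.0]

# gamma = 0: the target for each round is just that round's welfare.
bandit_targets = discounted_returns(welfare, gamma=0.0)

# gamma > 0 (0.9 is an assumed value): earlier rounds also receive
# credit for welfare realized in later rounds.
multi_round_targets = discounted_returns(welfare, gamma=0.9)

print(bandit_targets)
print(multi_round_targets)
```

In the zero-discount case an intervention is credited only with the welfare of its own round, which fits settings where feedback is dense and immediate. With a positive discount, early interventions also receive credit for later cooperation, which is the kind of delayed credit assignment that multi-round social dilemmas call for.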