Gaetano Rossiello, Shankar Subramaniam
ACM CAIS 2026
LLM agents in repeated strategic interactions face an equilibrium selection problem: unassisted populations often coordinate on low-welfare equilibria, while rule-based mediators require game-specific calibration. We propose a learned meta-controller as an empirical equilibrium selector, trained via reinforcement learning from welfare feedback alone, with no game-specific reward shaping or access to player internals. We study two variants matched to the credit-assignment structure: PPO for multi-round social dilemmas, and a contextual bandit () for dense per-round settings. PPO significantly outperforms the no-intervention and always-intervene baselines, matches mid-tier hand-crafted mediators, and significantly trails only the best rule-based mediator on IPD (), all without game-specific tuning. Both variants exhibit emergent selectivity: they are active in coordination-challenged games and passive where LLMs already self-coordinate, which is consistent with selectivity arising from the welfare reward rather than the algorithm. A communication ablation reveals that message content contributes to welfare only when paired with learned selectivity: removing content significantly degrades PPO welfare () but leaves always-intervene unaffected, consistent with targeted messages carrying information while indiscriminate ones average to noise. The bandit underperforms PPO on social dilemmas but matches rule-based baselines in Bertrand, where dense per-round rewards suit the 1-step approximation. Prompt match matters: controllers generalize poorly across prompt conditions, motivating prompt-conditioned training.
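To make the idea of selectivity emerging from a welfare-only reward concrete, here is a minimal sketch. It is not the authors' controller: it swaps in a tabular epsilon-greedy contextual bandit, and the game labels, payoff numbers, and function names (`simulated_welfare`, `train_bandit`, `greedy_policy`) are all hypothetical, chosen only for illustration.

```python
import random

ACTIONS = ("pass", "intervene")
GAMES = ("dilemma", "self_coordinating")  # hypothetical context labels

def simulated_welfare(game, action, rng):
    # Made-up mean welfare per (game, action); not taken from the paper.
    base = {
        ("dilemma", "pass"): 0.3,
        ("dilemma", "intervene"): 0.8,
        ("self_coordinating", "pass"): 0.9,
        ("self_coordinating", "intervene"): 0.7,
    }[(game, action)]
    return base + rng.gauss(0.0, 0.05)  # small observation noise

def train_bandit(rounds=4000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {(g, a): 0.0 for g in GAMES for a in ACTIONS}
    count = {(g, a): 0 for g in GAMES for a in ACTIONS}
    for _ in range(rounds):
        game = rng.choice(GAMES)  # context observed this round
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: value[(game, a)])  # exploit
        reward = simulated_welfare(game, action, rng)  # welfare is the only signal
        count[(game, action)] += 1
        # Incremental running mean of observed welfare for this (game, action).
        value[(game, action)] += (reward - value[(game, action)]) / count[(game, action)]
    return value

def greedy_policy(value):
    # Best action per game under the learned welfare estimates.
    return {g: max(ACTIONS, key=lambda a: value[(g, a)]) for g in GAMES}

policy = greedy_policy(train_bandit())
print(policy)
```

Under these assumed payoffs, the learned policy intervenes in the dilemma and stays passive in the self-coordinating game, even though the reward never encodes which game is which. A multi-round setting with delayed welfare would need a method that handles credit assignment across rounds, which is the role PPO plays in the abstract.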
Gosia Lazuka, Andreea Simona Anghel, et al.
SC 2024
Yidi Wu, Thomas Bohnstingl, et al.
ICML 2025
Yannis Belkhiter, Seshu Tirupathi, et al.
ICML 2026