Tutorial

Bandits, LLMs, and Agentic AI

Abstract

This tutorial offers a comprehensive guide to using multi-armed bandit (MAB) algorithms as a foundation for improving Large Language Models (LLMs) and advancing toward agentic AI systems. The tutorial is organized around three themes: understanding bandit algorithms in general, applying bandit methods to LLMs, and extending these ideas to agentic systems, where LLMs act with autonomy, make decisions, and adapt based on feedback.

We begin with the fundamentals of MAB algorithms, focusing on the exploration-exploitation trade-off that lies at the heart of sequential decision-making. Core strategies such as epsilon-greedy, Upper Confidence Bound (UCB), and Thompson Sampling are introduced, with a discussion of their advantages, limitations, and practical use cases. This section builds the conceptual toolkit necessary for applying bandit techniques in more complex settings.
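To make these strategies concrete, here is a minimal Python sketch, illustrative only and assuming a small fixed set of arms with Bernoulli rewards, of how each rule selects an arm; it is not part of the tutorial materials themselves.

    import math, random

    def epsilon_greedy(values, epsilon=0.1):
        # With probability epsilon explore a random arm; otherwise exploit the best empirical mean.
        if random.random() < epsilon:
            return random.randrange(len(values))
        return max(range(len(values)), key=lambda a: values[a])

    def ucb1(values, counts, t):
        # Choose the arm with the highest upper confidence bound; untried arms are chosen first.
        for a, n in enumerate(counts):
            if n == 0:
                return a
        return max(range(len(values)),
                   key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

    def thompson_bernoulli(successes, failures):
        # Sample a plausible mean for each arm from its Beta posterior and pick the best sample.
        samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        return max(range(len(samples)), key=lambda a: samples[a])

In short, epsilon-greedy explores by chance, UCB explores where uncertainty is largest, and Thompson Sampling explores in proportion to the posterior probability that an arm is the best one.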

The second part of the tutorial moves to bandits for LLMs, where we show how text generation options (candidate tokens, sentences, or dialogue paths) can be modeled as “arms” in a bandit problem. This framing enables LLMs to adaptively select outputs, guided by feedback in the form of rewards. Key design issues are discussed, including reward modeling in language contexts, scalable exploration policies, and integration into large-scale architectures. We highlight how this approach directly enhances LLM performance in terms of relevance, diversity, and personalization.
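As a rough sketch of this framing, the following Python class treats a handful of generation strategies as arms and selects among them with a UCB-style rule; the strategy names, generate_with, and collect_feedback are hypothetical placeholders standing in for an LLM call and a user or reward-model signal, not real APIs.

    import math

    class ResponseBandit:
        # UCB1 over a fixed set of generation strategies, each treated as an arm.
        def __init__(self, strategies):
            self.strategies = strategies           # e.g. ["greedy", "nucleus_0.9", "cot_prompt"] (hypothetical)
            self.counts = [0] * len(strategies)
            self.values = [0.0] * len(strategies)  # running mean reward per strategy
            self.t = 0

        def select(self):
            self.t += 1
            for a, n in enumerate(self.counts):
                if n == 0:
                    return a                       # try every strategy at least once
            return max(range(len(self.strategies)),
                       key=lambda a: self.values[a] + math.sqrt(2 * math.log(self.t) / self.counts[a]))

        def update(self, arm, reward):
            # Incrementally update the empirical mean reward of the chosen strategy.
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    # One interaction (all names below are hypothetical):
    # bandit = ResponseBandit(["greedy", "nucleus_0.9", "cot_prompt"])
    # arm = bandit.select()
    # response = generate_with(bandit.strategies[arm], user_query)  # hypothetical LLM call
    # reward = collect_feedback(response)                           # e.g. thumbs-up rate or reward-model score in [0, 1]
    # bandit.update(arm, reward)

The same loop applies whether the arms are decoding settings, prompt templates, or whole candidate responses; what changes is how the reward is defined and how quickly it can be observed.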

The final stage extends these ideas to bandits for agentic systems. Here, the focus is on how bandit-driven mechanisms can support the emergence of autonomy in LLMs. By continuously learning from interaction and feedback, MAB-augmented LLMs develop adaptive behaviors that resemble agency: the ability to choose, refine, and pursue strategies dynamically. We discuss why these traits are critical for real-world applications such as personalized dialogue assistants, adaptive content recommendation, and interactive decision-support systems. This section also addresses challenges around reward alignment, ethical constraints, and computational efficiency, which become especially important as models gain more decision-making power.
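A minimal sketch of such an agent loop, under the assumption of binary task success and Thompson Sampling over high-level strategies, might look like the following; run_strategy and task_stream are hypothetical stand-ins for executing a plan (e.g. answer directly, retrieve then answer, call a tool) and for an incoming stream of tasks.

    import random

    class StrategyBandit:
        # Thompson Sampling over high-level agent strategies with binary success feedback.
        def __init__(self, strategies):
            self.strategies = strategies
            self.success = [0] * len(strategies)
            self.failure = [0] * len(strategies)

        def select(self):
            # Draw a plausible success rate for each strategy from its Beta posterior, pick the best draw.
            samples = [random.betavariate(s + 1, f + 1)
                       for s, f in zip(self.success, self.failure)]
            return max(range(len(samples)), key=lambda a: samples[a])

        def update(self, arm, succeeded):
            if succeeded:
                self.success[arm] += 1
            else:
                self.failure[arm] += 1

    # Interaction loop (hypothetical names): the agent keeps choosing, acting, and adapting.
    # bandit = StrategyBandit(["answer_directly", "retrieve_then_answer", "use_calculator"])
    # for task in task_stream:
    #     arm = bandit.select()
    #     succeeded = run_strategy(bandit.strategies[arm], task)
    #     bandit.update(arm, succeeded)

The adaptive behavior described above emerges from exactly this cycle of selection, execution, and feedback, which is also where reward alignment and ethical constraints must be enforced.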

Case studies are provided throughout the tutorial, showcasing how the progression from bandit basics to LLM integration to agentic systems yields measurable gains in engagement, user satisfaction, and long-term adaptability.

By weaving together these three levels (bandits, bandits for LLMs, and bandits for agentic systems), the tutorial demonstrates how MAB algorithms provide a principled pathway toward the next generation of adaptive, autonomous, and intelligent language technologies.