Workshop paper

BOLTZMANN ROUTING FOR ENERGY-COMPATIBLE MIXTURE OF EXPERTS

Abstract

The Energy Transformer (ET) recasts the forward pass as gradient descent on a scalar energy, connecting attention to Modern Hopfield Networks and associative memory. Scaling ETs via Mixture-of-Experts (MoE) breaks this variational structure: standard router weights depend on the token state, producing a router gradient residual that prevents the MoE output from being any energy's gradient. We propose \textbf{Boltzmann Routing}, which eliminates the external router and derives expert selection from a free-energy functional $\mathcal{F} = -\beta_r^{-1}\log\!\sum_e \exp(-\beta_r E_e)$. We prove that the negative gradient of~$\mathcal{F}$ exactly recovers the weighted expert output with zero residual, that the combined system admits a Lyapunov function, and that attention and routing are \emph{dual instances of the same associative retrieval mechanism}. Experiments across three scales (8 to 32 experts) show that Boltzmann routing achieves accuracy comparable to standard MoE (0.440 avg at 8 experts) \emph{without any auxiliary balancing loss}, while a distillation variant maintains near-perfect load balance at 32 experts. A cross-scale analysis reveals a fundamental tension: exact energy compatibility comes at the cost of expert collapse at scale, and collapse count alone does not determine performance, with implications for energy-based routing more broadly.
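The zero-residual claim can be checked numerically. The sketch below, a hypothetical illustration with toy quadratic expert energies (the paper's experts are arbitrary energy modules), compares a finite-difference gradient of $\mathcal{F}$ against the Boltzmann-weighted combination of expert gradients:

```python
import numpy as np

# Sketch: verify that the gradient of the free energy
#   F(x) = -(1/beta) * log(sum_e exp(-beta * E_e(x)))
# equals the softmax-weighted sum of expert gradients,
#   dF/dx = sum_e softmax(-beta * E)_e * dE_e/dx,
# i.e. the routing weights emerge from F with zero router residual.
rng = np.random.default_rng(0)
beta = 2.0
n_experts, dim = 4, 6

# Toy quadratic expert energies E_e(x) = 0.5 * ||x - mu_e||^2 (an assumption
# made here purely for illustration).
mus = rng.normal(size=(n_experts, dim))

def expert_energies(x):
    return 0.5 * np.sum((x - mus) ** 2, axis=1)  # shape: (n_experts,)

def free_energy(x):
    E = expert_energies(x)
    m = E.min()  # log-sum-exp stabilization
    return m - np.log(np.exp(-beta * (E - m)).sum()) / beta

def boltzmann_weighted_grad(x):
    E = expert_energies(x)
    w = np.exp(-beta * (E - E.min()))
    w /= w.sum()                 # Boltzmann routing weights softmax(-beta * E)
    grads = x - mus              # dE_e/dx for the quadratic experts
    return w @ grads             # weighted expert gradient

x = rng.normal(size=dim)

# Finite-difference gradient of F for comparison.
eps = 1e-6
fd = np.array([
    (free_energy(x + eps * np.eye(dim)[i]) - free_energy(x - eps * np.eye(dim)[i])) / (2 * eps)
    for i in range(dim)
])

print(np.max(np.abs(fd - boltzmann_weighted_grad(x))))  # near zero: no residual
```

The same identity holds for any differentiable expert energies, since the softmax weights are exactly the partial derivatives of the log-sum-exp.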