The Energy Transformer (ET) recasts the forward pass as gradient descent on a scalar energy, connecting attention to Modern Hopfield Networks and associative memory. Scaling ETs via Mixture-of-Experts (MoE) breaks this variational structure: standard router weights depend on the token state, producing a router-gradient residual that prevents the MoE output from being the gradient of any energy. We propose \textbf{Boltzmann Routing}, which eliminates the external router and derives expert selection from a free-energy functional~$\mathcal{F}$. We prove that the negative gradient of~$\mathcal{F}$ exactly recovers the weighted expert output with zero residual, that the combined system admits a Lyapunov function, and that attention and routing are \emph{dual instances of the same associative retrieval mechanism}. Experiments across three scales (8 to 32 experts) show that Boltzmann Routing achieves accuracy comparable to standard MoE (0.440 average at 8 experts) \emph{without any auxiliary balancing loss}, while a distillation variant maintains near-perfect load balance at 32 experts. A cross-scale analysis reveals a fundamental tension: exact energy compatibility comes at the cost of expert collapse at scale, and collapse count alone does not determine performance, with implications for energy-based routing more broadly.
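To make the zero-residual claim concrete, consider a minimal sketch in which $\mathcal{F}$ is assumed to take the standard log-sum-exp (Boltzmann) form over per-expert energies $E_e(x)$ with inverse temperature $\beta$; this form and its symbols are illustrative assumptions, not the paper's definitions:
\[
\mathcal{F}(x) \;=\; -\frac{1}{\beta}\,\log \sum_{e=1}^{E} \exp\!\bigl(-\beta\, E_e(x)\bigr),
\qquad
p_e(x) \;=\; \frac{\exp\!\bigl(-\beta\, E_e(x)\bigr)}{\sum_{e'=1}^{E} \exp\!\bigl(-\beta\, E_{e'}(x)\bigr)},
\]
\[
-\nabla_x \mathcal{F}(x) \;=\; \sum_{e=1}^{E} p_e(x)\,\bigl(-\nabla_x E_e(x)\bigr).
\]
Because the Boltzmann weights $p_e$ arise from differentiating $\mathcal{F}$ itself, the softmax-weighted combination of expert outputs $f_e(x) = -\nabla_x E_e(x)$ is exactly the gradient of a single scalar. By contrast, and as one way to see the residual described above, with a state-dependent external router $g_e(x)$ one gets $\nabla_x \sum_e g_e(x)\, E_e(x) = \sum_e g_e(x)\,\nabla_x E_e(x) + \sum_e E_e(x)\,\nabla_x g_e(x)$, where the second term is the router-gradient residual that no energy's gradient can absorb.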