Workshop paper

Static Benchmarks Are Broken: The Case for Dynamic Evaluation of LLMs

Abstract

Static benchmarks have become the primary tool for measuring large language model (LLM) progress, yet growing evidence suggests they increasingly reward memorization and surface-form pattern-matching over genuine capability. Performance on canonical benchmarks such as MMLU and GSM8K degrades sharply under semantics-preserving perturbations, including answer reordering, surface rephrasing, and distractor addition. This brittleness is inconsistent with the robust understanding these benchmarks are meant to certify. We argue that the fragility is not an implementation flaw but a structural consequence of fixed evaluation sets in the era of web-scale training. We advocate for dynamic, synthetically generated benchmarks constructed fresh at evaluation time, which eliminate instance-level contamination by construction and enable principled, reproducible evaluation of model capability. The remaining weaknesses of dynamic evaluation are tractable; those of static evaluation are structural.
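
To make "constructed fresh at evaluation time" concrete, the following Python sketch shows one way an evaluation set could be both fresh and reproducible: items are sampled from a generator driven by a seed, so a new seed yields unseen instances while a recorded seed regenerates the same set exactly. This is an illustrative assumption, not the benchmark design argued for in this paper. The toy arithmetic domain and the names `Item`, `generate_item`, and `generate_benchmark` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    """One multiple-choice item (hypothetical structure for illustration)."""
    question: str
    choices: Tuple[str, ...]
    answer_index: int


def generate_item(rng: random.Random) -> Item:
    """Build one arithmetic item; all randomness comes from `rng`."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    correct = a * b
    # Distractors are distinct nonzero offsets from the correct product,
    # so the four choices are always distinct.
    offsets = rng.sample([-20, -10, -1, 1, 10, 20], 3)
    values = [correct] + [correct + d for d in offsets]
    # Randomizing the answer position per instance means no fixed
    # answer ordering exists to be memorized.
    rng.shuffle(values)
    return Item(
        question=f"What is {a} * {b}?",
        choices=tuple(str(v) for v in values),
        answer_index=values.index(correct),
    )


def generate_benchmark(seed: int, n_items: int) -> List[Item]:
    """Generate an evaluation set determined entirely by `seed`."""
    rng = random.Random(seed)
    return [generate_item(rng) for _ in range(n_items)]
```

In this sketch, drawing a new seed for each evaluation run and reporting it alongside the score would let others regenerate the identical item set. How well such generators cover the capabilities of interest is a separate question that the toy domain does not address.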