Gaetano Rossiello, Shankar Subramaniam
ACM CAIS 2026
Static benchmarks have become the primary tool for measuring large language model (LLM) progress, yet growing evidence suggests they increasingly reward memorization and surface-form pattern-matching over genuine capability. Performance on canonical benchmarks such as MMLU and GSM8K degrades sharply under semantics-preserving perturbations, including answer reordering, surface rephrasing, and distractor addition, a brittleness inconsistent with the robust understanding these benchmarks are meant to certify. We argue this fragility is not an implementation flaw but a structural consequence of fixed evaluation sets in the era of web-scale training. We advocate for dynamic, synthetically generated benchmarks constructed fresh at evaluation time, eliminating instance-level contamination by construction and enabling principled, reproducible evaluation of genuine model capability. The remaining weaknesses of dynamic evaluation are tractable; those of static evaluation are structural.
Yidi Wu, Thomas Bohnstingl, et al.
ICML 2025
Gosia Lazuka, Andreea Simona Anghel, et al.
SC 2024
Yannis Belkhiter, Seshu Tirupathi, et al.
ICML 2026