Hazar Yueksel, Ramon Bertran, et al.
MLSys 2020
This work explores the consistency of LLMs when answering the same question multiple times. In particular, we study how well-known open-source LLMs respond to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small (2B-10B parameters) vs. medium (50B-80B) models, finetuned vs. base models, and other parameters. The paper also examines how requiring answer consistency across repeated inferences affects accuracy, as well as the trade-offs involved in deciding which model best provides both, for which we propose new tools. Results show that the number of questions that can be answered consistently varies widely among models, typically falling in the 50%-85% range for small models, and that accuracy among consistent answers correlates with overall accuracy at low inference temperatures. Results for medium-sized models suggest much higher levels of answer consistency.
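The paper's exact metrics are not reproduced here, but the following is a minimal Python sketch of how such consistency statistics could be computed, assuming unanimity across the 10 repetitions as the consistency criterion; the function name, data layout, and toy data are hypothetical.

from collections import Counter

def consistency_metrics(answers_per_question, gold_answers):
    """Consistency and accuracy statistics for repeated inferences.

    answers_per_question: list of lists, one inner list of sampled
        answers (e.g. 10 repetitions) per question.
    gold_answers: list of correct choices, one per question.
    """
    n = len(answers_per_question)
    consistent = 0          # questions where all repetitions agree
    consistent_correct = 0  # consistent questions answered correctly
    majority_correct = 0    # questions whose most frequent answer is correct

    for answers, gold in zip(answers_per_question, gold_answers):
        top_answer, top_count = Counter(answers).most_common(1)[0]
        if top_count == len(answers):  # unanimous across repetitions
            consistent += 1
            consistent_correct += int(top_answer == gold)
        majority_correct += int(top_answer == gold)

    return {
        "consistency_rate": consistent / n,
        "accuracy_among_consistent":
            consistent_correct / consistent if consistent else float("nan"),
        "majority_vote_accuracy": majority_correct / n,
    }

# Toy example: 3 questions, 10 repetitions each.
answers = [
    ["B"] * 10,             # consistent and correct
    ["A"] * 10,             # consistent but wrong
    ["C"] * 6 + ["D"] * 4,  # inconsistent
]
print(consistency_metrics(answers, gold_answers=["B", "B", "C"]))

Under this sketch, the consistency rate over a benchmark corresponds to the 50%-85% figure reported for small models, and accuracy among consistent answers is the quantity compared against overall accuracy at low temperatures.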
Megh Thakkar, Quentin Fournier, et al.
ACL 2024
Natalia Martinez Gil, Dhaval Patel, et al.
UAI 2024
Muneeza Azmat, Momin Abbas, et al.
NeurIPS 2025