Ruoqi Liu, Pin-Yu Chen, et al.
Patterns
Despite the impressive capability of large language models (LLMs) in solving different downstream tasks, new concerns about proper performance evaluation have been raised, especially for test-data leakage caused by accidentally including them during pretraining, or by indirectly exposing them through API calls for evaluation. Motivated by these, in this paper, we propose a new evaluation workflow that generates steerable synthetic language datasets and proxy tasks for benchmarking the performance of pertained LLMs on sentence classification tasks. This approach allows for better characterization of the joint analysis on the robustness and accuracy of LLMs without risking sensitive information leakage. Verified on various pretrained LLMs, the proposed approach demonstrates promising high correlation with real downstream performance.
Ruoqi Liu, Pin-Yu Chen, et al.
Patterns
Minhao Cheng, Rui Min, et al.
ICML 2023
Hazar Yueksel, Ramon Bertran, et al.
MLSys 2020
Saiteja Utpala, Alex Gu, et al.
NAACL 2024