The growing use of Large Language Models (LLMs) in workplace settings has driven the need for robust evaluation methods that align model behavior with human values and preferences. LLM-as-a-judge approaches have emerged as a scalable solution, leveraging LLMs to evaluate generated outputs against flexible, user-defined criteria. However, users often struggle to articulate clear evaluation criteria. In addition, human preferences and criteria definitions evolve over time, and predefined templates fail to account for context-specific nuances. To address these challenges, we present MetricMate, an interactive tool that supports users in defining and calibrating evaluation criteria for LLM-as-a-judge systems. MetricMate introduces hierarchical criteria definitions and curated examples of success and failure to promote human-AI criteria negotiation and alignment. Additionally, MetricMate learns from users' interactions with the data, allowing them to group outputs, identify patterns, and derive context-specific criteria.
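To make the idea of hierarchical criteria with curated success and failure examples more concrete, the minimal sketch below shows one way such a rubric could be represented and flattened into an LLM-as-a-judge prompt. The class names, fields, and prompt wording are illustrative assumptions for this sketch, not MetricMate's actual implementation or API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical data model: a criterion may carry curated success/failure
# examples and nested sub-criteria (hierarchical definition).

@dataclass
class Example:
    text: str
    is_success: bool

@dataclass
class Criterion:
    name: str
    description: str
    examples: List[Example] = field(default_factory=list)
    sub_criteria: List["Criterion"] = field(default_factory=list)

def render_criterion(criterion: Criterion, depth: int = 0) -> str:
    """Flatten a hierarchical criterion into an indented rubric section."""
    indent = "  " * depth
    lines = [f"{indent}- {criterion.name}: {criterion.description}"]
    for ex in criterion.examples:
        label = "Success" if ex.is_success else "Failure"
        lines.append(f"{indent}  {label} example: {ex.text}")
    for sub in criterion.sub_criteria:
        lines.append(render_criterion(sub, depth + 1))
    return "\n".join(lines)

def build_judge_prompt(output: str, criteria: List[Criterion]) -> str:
    """Assemble a judge prompt that scores one output against the rubric."""
    rubric = "\n".join(render_criterion(c) for c in criteria)
    return (
        "Evaluate the following output against each criterion.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Output to evaluate:\n{output}\n\n"
        "For each criterion, answer 'pass' or 'fail' with a one-sentence rationale."
    )

if __name__ == "__main__":
    clarity = Criterion(
        name="Clarity",
        description="The response is easy to follow.",
        sub_criteria=[
            Criterion(
                name="Conciseness",
                description="No redundant or filler sentences.",
                examples=[
                    Example("Answers in two focused sentences.", True),
                    Example("Repeats the question before answering.", False),
                ],
            )
        ],
    )
    print(build_judge_prompt("Sample model response.", [clarity]))
```

The resulting prompt string would then be sent to whichever LLM serves as the judge; curating or regrouping examples updates the rubric without changing the surrounding evaluation loop.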