Large language models (LLMs) are rapidly entering judicial workflows, assisting with research, summarization, and drafting. Yet their unverified use by judges risks undermining core tenets of legal legitimacy: accountability, consistency, and transparency. This paper empirically demonstrates interpretive divergence across leading LLMs (GPT‑4, Claude, Gemini, LLaMA) on benchmarked legal tasks, revealing systematic vulnerabilities we classify as omission, injection, and framing loopholes. Using controlled prompt perturbations, semantic embedding comparisons, and cross-model evaluations on CaseHOLD and LexGLUE, we quantify semantic drift and show how minor prompt variations and model design choices materially affect legal conclusions. We argue that, if LLMs are deployed without safeguards, these dynamics will erode due process and public trust in the judiciary. Drawing on insights from the judicial behavior literature, we propose a governance framework: multi-model deliberation, independent auditing and certification, and domain-specific validation protocols. Our roadmap aims to ensure that LLM assistance augments, rather than undermines, the fairness and legitimacy of judicial decision-making.
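To make the "semantic drift" quantification concrete, the sketch below shows one plausible way to score drift between a model's answer to a base prompt and its answers to perturbed paraphrases, using embedding cosine distance. This is a minimal illustration, not the paper's actual pipeline: the choice of the sentence-transformers library, the `all-MiniLM-L6-v2` encoder, and the placeholder answer strings are all assumptions for the example.

```python
# Minimal sketch of a semantic-drift score under prompt perturbation.
# Assumptions (not from the paper): sentence-transformers embeddings and
# pre-collected model outputs for a base prompt plus paraphrased prompts.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def semantic_drift(base_output: str, perturbed_outputs: list[str]) -> float:
    """Mean (1 - cosine similarity) between the base answer's embedding and
    each perturbed answer's embedding; higher values mean more drift."""
    vecs = embedder.encode([base_output] + perturbed_outputs)
    base, rest = vecs[0], vecs[1:]
    sims = rest @ base / (np.linalg.norm(rest, axis=1) * np.linalg.norm(base))
    return float(np.mean(1.0 - sims))

# Example: one model's answers to a CaseHOLD-style question under three
# paraphrased prompts (placeholder strings for illustration only).
drift = semantic_drift(
    "The holding supports dismissal under Rule 12(b)(6).",
    ["The claim should be dismissed under Rule 12(b)(6).",
     "Summary judgment for the defendant is appropriate.",
     "The court should deny the motion to dismiss."],
)
print(f"semantic drift: {drift:.3f}")
```

Averaging such scores over many cases and prompt variants would give the kind of per-model drift estimate the abstract describes; cross-model comparisons would instead embed different models' answers to the same prompt.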