Enterprise Benchmarks for Large Language Model Evaluation. Bing Zhang, Mikio Takeuchi, et al. 2025. NAACL 2025.
Challenges and Remedies of Domain-Specific Classifiers as LLM Guardrails: Self-Harm as a Case Study. Bing Zhang, Guang-Jie Ren. 2025. NAACL 2025.
Are Large Language Models Effective in Clinical Trial Design? A Study on Baseline Feature Generation. Nafis Neehal, Bowen Wang, et al. 2025. NAACL 2025.
InspectorRAGet: An Introspection Platform for RAG Evaluation. Benjamin Sznajder, Kshitij Fadnis, et al. 2025. NAACL 2025.
DAMAGeR: Deploying Automatic and Manual Approaches to GenAI Red-teaming. Manish Nagireddy, Michael Feffer, et al. 2025. NAACL 2025.