StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation. Satyananda Kashyap, Sola Shirai, et al. VLDB 2025.
Evaluating LLM-based Agents: Foundations, Best Practices and Open Challenges. Roy Bar-Haim, Arman Cohan, et al. IJCAI 2025.
Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models. George Kour, Itay Nakash, et al. ACL 2025.
Does Your Model Understand Genes? A Modality-Agnostic Benchmark of Gene Properties. Yoav Kan-Tor, Michael Morris Danziger, et al. ISMB 2025.
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In. Itay Nakash, George Kour, et al. NAACL 2025.
Exploring Straightforward Methods for Automatic Conversational Red-Teaming. George Kour, Naama Zwerdling, et al. NAACL 2025.
ASTER: Natural and Multi-language Unit Test Generation with LLMs. Rangeet Pan, Myeongsoo Kim, et al. ICSE 2025.