Workshop paper
End-to-End Learning for Information Gathering
Rares Christian, Pavithra Harsha, et al.
NeurIPS 2025
This talk focuses on designing and evaluating agentic benchmarks, with a strong emphasis on in-domain evaluation and real-world task reliability. Drawing on the development of AssetOpsBench, we discuss practical considerations for measuring agent behavior, task-completion quality, and decision robustness. The session highlights what works, what doesn't, and what matters most when building benchmarks for agent-based systems.
Shuang Chen, Herbert Freeman
International Journal of Pattern Recognition and Artificial Intelligence
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010
Muneeza Azmat, Momin Abbas, et al.
NeurIPS 2025