Soft-Masked Diffusion Language Models
Michael Hersche, Samuel Moor, et al.
ICLR 2026
Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisites for adoption. We introduce ST-WebAgentBench, a configurable and extensible framework designed as a first step toward enterprise-grade evaluation. Each of its 375 tasks carries one or more ST policies, concise rules that encode constraints (3,057 policies in total), and is scored along six orthogonal dimensions (e.g., user consent, robustness). Tasks span three difficulty tiers for fine-grained capability profiling, and a “Modality Challenge” disentangles vision-only from DOM-only information retrieval, isolating the contribution of each perceptual modality to agent failures. Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents shows that their average CuP is less than two-thirds of their nominal completion rate, revealing substantial safety gaps. To support growth and adaptation to new domains, ST-WebAgentBench provides modular code and extensible templates that let new workflows be incorporated with minimal effort, offering a practical foundation for advancing trustworthy web agents at scale.
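The abstract defines CuP as crediting only completions that respect all applicable policies, and the Risk Ratio as quantifying ST breaches across dimensions. A minimal sketch of how such metrics could be computed follows; the exact formulas, the `TaskResult` structure, and the example numbers are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch of CuP and Risk Ratio; definitions are assumed,
# not quoted from ST-WebAgentBench.
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    completed: bool                                  # did the agent finish the task?
    violations: list = field(default_factory=list)   # ST dimensions breached during the run


def completion_rate(results):
    """Nominal completion rate: fraction of tasks finished, policies ignored."""
    return sum(r.completed for r in results) / len(results)


def cup(results):
    """Completion Under Policy (assumed definition): credit a run only if the
    task is completed AND no applicable policy was violated."""
    return sum(r.completed and not r.violations for r in results) / len(results)


def risk_ratio(results, dimension):
    """Risk Ratio (assumed definition): fraction of runs breaching a given
    ST dimension, e.g. 'user_consent' or 'robustness'."""
    return sum(dimension in r.violations for r in results) / len(results)


# Toy evaluation: 4 runs, 3 completed, 1 of them with a consent violation.
results = [
    TaskResult(True, []),
    TaskResult(True, ["user_consent"]),
    TaskResult(False, []),
    TaskResult(True, []),
]
print(completion_rate(results))             # 0.75
print(cup(results))                         # 0.5
print(risk_ratio(results, "user_consent"))  # 0.25
```

Under this toy data, CuP (0.5) is two-thirds of the nominal completion rate (0.75), mirroring the kind of gap the abstract reports between raw completion and policy-respecting completion.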
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010
Chen-chia Chang, Wan-hsuan Lin, et al.
ICML 2025
Daniel Karl I. Weidele, Hendrik Strobelt, et al.
SysML 2019