Marcelo Amaral
OSSEU 2023
Despite using the same headline metric, reported “IOPS” results often measure fundamentally different bottlenecks depending on benchmark and system configuration, making cross-system comparisons unreliable. This ambiguity is particularly detrimental because small-block random-read performance is frequently cited when evaluating modern storage systems for important workloads such as key-value stores, analytics engines, and metadata-intensive applications. Our contributions are (1) a concrete demonstration that identically labeled 4KiB random-read benchmarks can exercise fundamentally different system bottlenecks, and (2) an interpretive model for reasoning about reported IOPS results in terms of the underlying limiting component.

Our key observation is that “4KiB random read” is not a single, well-defined metric; instead, benchmark configuration and design determine which part of the system is actually exercised. Table 1 demonstrates this effect concretely with three fio microbenchmark runs in which only minimal configuration changes are applied to an otherwise fixed 4KiB random-read workload. With buffered I/O, measured IOPS can be dominated by cache and memory effects (Config 1); enabling direct I/O exposes a client-limited regime (Config 2); increasing client parallelism can then shift the bottleneck toward network or storage service limits (Config 3). Thus, identically labeled “4KiB random read IOPS” values can differ by orders of magnitude while answering different questions. System configuration further influences which bottlenecks become limiting: for example, enabling RDMA increases the network-limited result from approximately 195K IOPS (Config 3) to 205K IOPS (not shown in the table).

This ambiguity is caused by two recurring issues. First, benchmark configurations are often underspecified, leaving critical details such as buffering behavior, access ordering, and locality unclear.
Second, system configurations are frequently underspecified, obscuring whether results are constrained by client resources, server software, network transport, or cache effects. As a result, practitioners often lack the context needed to interpret unexpected results, including cases where random-read performance appears to exceed expected limits.

This behavior was observed in several published ISC25 submissions produced using an early IO500 random-read implementation. That version contained multiple issues affecting the random-read phase, including incorrect segment-based file sizing [4] and a limitation of random offsets to 31-bit ranges [5], which together could significantly reduce effective dataset coverage when buffered I/O was enabled. Under these conditions, accesses could achieve high cache hit ratios, producing unexpectedly high reported random-read performance.

Previous research demonstrated the impact of benchmarking details in small-scale systems and academic settings [7]. Fifteen years later, we continue to observe underspecified benchmarks and configurations in large-scale industrial systems. With this work, we aim to raise awareness of the problem and encourage evaluators to publish full workload specifications, so that reported results are interpreted with sufficient execution context.

Our results show that “4KiB random read IOPS” is not a stable or self-describing performance metric: small changes in benchmark design or implementation can shift the exercised bottleneck while preserving the same headline number. We therefore argue that any use of random-read IOPS for cross-system comparison should require either a stable benchmark definition over time or explicit reporting of the benchmark version, configuration, and the dominant limiting component (cache, client, network, or storage).
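The three regimes discussed for Table 1 can be sketched as fio job sections. The file size, target path, queue depth, and thread count below are illustrative assumptions, not the exact parameters behind Table 1; only the distinguishing knobs (buffered vs. direct I/O, degree of parallelism) correspond to the configurations described above.

```ini
; Illustrative fio jobs only -- size, filename, iodepth, and numjobs
; are assumed values, not the exact configuration behind Table 1.

[global]
rw=randread
bs=4k
filename=/mnt/fs/testfile   ; hypothetical target on the system under test
size=64g
runtime=60
time_based=1

[config1-buffered]
direct=0                    ; buffered I/O: page-cache hits can dominate IOPS
iodepth=1
numjobs=1

[config2-direct]
stonewall                   ; wait for the previous job before starting
direct=1                    ; direct I/O bypasses the page cache; client-limited
iodepth=1
numjobs=1

[config3-parallel]
stonewall
direct=1
iodepth=32                  ; higher parallelism pushes the bottleneck toward
numjobs=8                   ; network or storage service limits
```

A job file like this makes the otherwise-invisible distinctions explicit: two runs that both report “4KiB random read IOPS” can differ only in `direct=` or `numjobs=`, yet exercise entirely different components.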
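The effect of the 31-bit offset limitation [5] can be illustrated with back-of-the-envelope arithmetic. The 64 GiB file size below is an assumed example for illustration, not a number taken from the affected submissions:

```python
# Sketch with assumed numbers: how a 31-bit random-offset limit can
# shrink the dataset actually touched by a "random" read phase.

OFFSET_LIMIT = 2**31          # offsets drawn from [0, 2^31) bytes
FILE_SIZE = 64 * 2**30        # hypothetical 64 GiB target file

reachable = min(OFFSET_LIMIT, FILE_SIZE)
coverage = reachable / FILE_SIZE
print(f"reachable region: {reachable / 2**30:.0f} GiB "
      f"({coverage:.1%} of the file)")
# -> reachable region: 2 GiB (3.1% of the file)
# A 2 GiB reachable region fits comfortably in the page cache of a
# typical server, so with buffered I/O most "random" reads become
# cache hits -- inflating the reported random-read IOPS.
```

This is why the combination of the offset bug and buffered I/O matters: neither alone guarantees cache hits, but together they can turn a nominally random workload into a mostly-cached one.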
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Evaline Ju, Kelly Abuelsaad
KubeCon EU 2026
Ilias Iliadis
International Journal On Advances In Networks And Services