A Multiscale Workflow for Thermal Analysis of 3DI Chip Stacks
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Reliability has been, and continues to be, a key consideration in the design of the IBM Z mainframe processors, and has resulted in industry-leading performance with little-to-no downtime. In this paper, we analyze the hardware reliability mechanisms that make the processor resilient to transient errors, and the checker architecture that enables their detection and correction. We characterize the error-checking logic in the processor based on a detailed analysis of the actual design. Using hardware measurements on a real Z processor, we then determine which error checkers become timing-critical when the supply voltage is scaled. We propose algorithms that optimize checker selection without affecting RAS coverage or the detection of errors arising from both SER and voltage scaling. Finally, we examine further potential checker optimizations based on logic utilization in representative benchmarks.
Evaline Ju, Kelly Abuelsaad
KubeCon EU 2026
Ilias Iliadis
International Journal On Advances In Networks And Services
Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025