Conference paper

Enterprise-Scale RAS in IBM Z Memory Subsystem

Abstract

Enterprise systems demand memory architectures that combine low latency, high capacity, and exceptional reliability, availability, and serviceability (RAS) for mission-critical workloads. Building on decades of IBM Z innovation in buffered memory and Redundant Array of Independent Memory (RAIM) protection, the z16 and z17 generations introduce an improved high-speed OpenCAPI Memory Interface (OMI) and an enhanced eight-channel RAIM architecture that reduces the DRAM footprint. This design preserves IBM Z’s historically low-latency access model while scaling system memory up to 64 TB and significantly reducing pin count. Multi-layered RAS mechanisms enable continuous operation despite DIMM, memory buffer, DRAM, or link faults. Innovations include Reed-Solomon RAIM correction, channel tagging, staggered refresh, CRC-protected OMI replay, dynamic lane degrade, background scrubbing with refetch validation, per-rank DRAM marking, redundant voltage regulation failover, and concurrent memory service strategies, which together help ensure near-zero downtime. These innovations extend IBM Z’s long-standing legacy of delivering high-capacity, low-latency, and high-reliability memory subsystems for enterprise-scale resilience.