Workshop paper

Tile Efficiency is not System Efficiency – CIM architecture studies of LLMs and other large DNNs

Abstract

To achieve system-level benefits, compute-in-memory (CIM) tiles need to be integrated into heterogeneous architectures alongside general-purpose and application-specific digital compute cores, together with a high-bandwidth, reconfigurable on-chip routing fabric that can deliver the right vectors to the right locations for just-in-time DNN compute. In the first part of my talk, I will review some of IBM's work on developing weight-stationary analog compute cores, focusing on the design choices and optimizations for high tile efficiency. I will then provide a brief introduction to heterogeneous architectures for CIM systems, followed by architectural studies of DNNs that identify the auxiliary operations bottlenecking performance. Finally, I will highlight the challenge of achieving true weight-stationarity in large models such as Mixture-of-Experts (MoE) Transformer models, and the system-level benefits that such an architecture can achieve.