Juan Miguel De Haro, Rubén Cano, et al.
IPDPS 2022
Generative AI models such as large language models (LLMs) have emerged as the state of the art in various machine learning applications, including vision, speech recognition, code generation, and machine translation. These large transformer-based models surpass traditional machine learning methods, albeit at the cost of hundreds of ExaOps of computation. Hardware specialization and acceleration play a crucial role in improving the operational efficiency of these models, which in turn necessitates synergistic cross-layer design across algorithms, hardware, and software.
In this talk, I will focus on the challenges and opportunities these models introduce, such as the trade-off between compute and memory bandwidth. Recent advances in AI algorithms and reduced-precision/quantization techniques have improved compute efficiency while maintaining accuracy. System architectures need to be tailored to the communication patterns of LLMs, which the software stack can then exploit to keep data flowing to the compute engines and achieve high sustained compute utilization. The talk will present this holistic approach as adopted in the design of the recently announced IBM Spyre accelerator.
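To make the reduced-precision point concrete, here is a minimal sketch of symmetric per-tensor int8 post-training quantization in NumPy. This is an illustration under my own assumptions, not the Spyre design or IBM's actual quantization scheme; the helper names are hypothetical. It shows how dropping from fp32 to int8 cuts the bytes moved per weight by 4x, which directly eases the memory-bandwidth side of the trade-off, while keeping the reconstructed values close to the originals:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 codes and the scale."""
    return q.astype(np.float32) * scale

# A toy weight tensor standing in for one LLM layer's weights (hypothetical).
rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage moves 4x less data than fp32 for the same weights.
print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Production quantization schemes typically refine this with per-channel scales, calibration data, or quantization-aware training to hold accuracy at lower bit widths, but the bandwidth arithmetic is the same.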
David Stutz, Nandhini Chandramoorthy, et al.
MLSys 2021
Eric A. Joseph
AVS 2023
Stefano Ambrogio
MRS Spring Meeting 2022