A Scalable Multi-TeraOPS Core for AI Training and Inference

Sunil Shukla; Bruce Fleischer; Matt Ziegler; Joel Silberman; Jinwook Oh; Vijayalakshmi Srinivasan; Jungwook Choi; Silvia Müller; Ankur Agrawal; Tina Babinsky; Nianzheng Cao; Chia-Yu Chen; Pierce Chuang; Thomas Fox; George Gristede; Michael Guillorn; Howard Haynie; Michael Klaiber; Dongsoo Lee; Shih Hsien Lo; Gary Maier; Michael Scheuermann; Swagath Venkataramani; Christos Vezyrtzis; Naigang Wang; Fanchieh Yee; Ching Zhou; Pong-Fei Lu; Brian Curran; Leland Chang; Kailash Gopalakrishnan

doi:10.1109/LSSC.2019.2902738

IEEE SSC-L

Paper

01 Dec 2018

Abstract

This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and a custom ISA, the engine achieves >90% sustained utilization across a range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A custom 16-bit floating-point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has been developed for high model accuracy in training and inference, along with 1b/2b (binary/ternary) integer representations for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves a peak performance of 1.5 TFLOPS in fp16, 12 TOPS in ternary, or 24 TOPS in binary precision in 14-nm CMOS.
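To make the 1/6/9 bit split and the peak figures concrete, the following is a minimal Python sketch, not the authors' implementation. The field order (sign, then exponent, then mantissa), the exponent bias of 31, and the implicit leading 1 are assumptions borrowed from common floating-point conventions; the abstract does not state them, and special values (zero, subnormals, infinities, NaN) are not modeled. The helper names are hypothetical.

```python
from typing import Tuple

# Field widths taken from the abstract: 1 sign, 6 exponent, 9 mantissa bits.
SIGN_BITS, EXP_BITS, MAN_BITS = 1, 6, 9

# ASSUMPTION: bias = 2**(EXP_BITS - 1) - 1, as in IEEE-style formats.
# The abstract does not specify the bias.
ASSUMED_BIAS = 31


def split_fields(word: int) -> Tuple[int, int, int]:
    """Split a 16-bit word into (sign, exponent, mantissa) fields.

    ASSUMPTION: the sign is the most significant bit, followed by the
    exponent, with the mantissa in the least significant bits.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    sign = word >> (EXP_BITS + MAN_BITS)
    exponent = (word >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    mantissa = word & ((1 << MAN_BITS) - 1)
    return sign, exponent, mantissa


def decode_normal(word: int, bias: int = ASSUMED_BIAS) -> float:
    """Decode a word as a normalized value with an implicit leading 1.

    ASSUMPTION: every encoding is treated as a normal number; special
    values are deliberately not handled in this sketch.
    """
    sign, exponent, mantissa = split_fields(word)
    significand = 1.0 + mantissa / (1 << MAN_BITS)
    value = significand * 2.0 ** (exponent - bias)
    return -value if sign else value


def ops_per_cycle(peak_ops_per_s: float, clock_hz: float) -> float:
    """Operations per clock cycle implied by a peak rate and a clock."""
    return peak_ops_per_s / clock_hz


if __name__ == "__main__":
    print(split_fields(0x3E00), decode_normal(0x3E00))
    clock = 1.5e9  # 1.5 GHz, from the abstract
    for name, peak in [("fp16", 1.5e12), ("ternary", 12e12), ("binary", 24e12)]:
        print(name, ops_per_cycle(peak, clock))
```

Dividing the quoted peak rates by the 1.5 GHz clock gives 1,000 fp16 operations, 8,000 ternary operations, and 16,000 binary operations per cycle. Compared with the IEEE half-precision layout (5 exponent bits, 10 mantissa bits), the 6/9 split trades one bit of precision for a wider exponent range.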