Haruki Imai, Kumiko Maeda, et al.
ITSC 2013
Overlapping computation and communication is key to accelerating stencil applications on parallel computers, especially GPU clusters. However, writing such code is a time-consuming part of stencil application development. To address this problem, we developed a code generation tool that automatically produces a parallel stencil application with latency hiding from its dataflow model. With this tool, users visually construct the workflows of stencil applications in a dataflow programming model. Our dataflow compiler determines a data decomposition policy for each application and generates source code that overlaps the stencil computations with communication (MPI and PCIe). We demonstrate two types of overlapping models: a CPU-GPU hybrid execution model and a GPU-only model. We evaluate our scheduling performance with a CFD benchmark that computes 19-point 3D stencils, achieving 1.45 TFLOPS in single precision on a cluster of 64 Tesla C1060 GPUs. © 2012 IEEE.
Michiharu Kudo, Kumiko Maeda, et al.
SCC 2016
Guojing Cong, Huifang Wen, et al.
IPDPS 2012
Alessandro Morari, Roberto Gioiosa, et al.
IPDPS 2012