Sarala Arunagiri, Yipkei Kwok, et al.
Int. J. Parallel Program
The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability. © 2007 IEEE.
Hai Huang, Kang G. Shin
MSST 2007
Roman Pletka, Christian Cachin
MSST 2007
Seetharami R. Seelam, Patricia J. Teller
VEE 2007