Erich P. Stuntebeck, John S. Davis II, et al.
HotMobile 2008
Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing. Copyright 2006 ACM.
Erich P. Stuntebeck, John S. Davis II, et al.
HotMobile 2008
Raymond Wu, Jie Lu
ITA Conference 2007
Pradip Bose
VTS 1998
Ehud Altman, Kenneth R. Brown, et al.
PRX Quantum