Failure containment for extreme-scale MPI applications
Speaker(s): Thomas Ropars (EPFL)
Fault tolerance is a major concern for HPC applications. Most checkpointing protocols are designed to work with any message-passing application, but they suffer from scalability issues at extreme scale. This talk proposes a different approach. We analyze the determinism of MPI HPC applications and identify a new property called channel-determinism. We then introduce a new partial order relation between the events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that offers an unprecedented combination of features. It combines coordinated checkpointing and message logging hierarchically, and it is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints.