REGAL Seminar

Failure containment for extreme scale MPI applications

03/02/2014
Speaker: Thomas Ropars (EPFL)
Fault tolerance is a major concern for HPC applications. Most checkpointing protocols are designed to work with any message-passing application, but they suffer from scalability issues at extreme scale. This talk proposes a different approach. We analyze the determinism of MPI HPC applications and identify a new property called channel-determinism. We then introduce a new partial order relation between the events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that offers an unprecedented set of features: it hierarchically combines coordinated checkpointing and message logging, and it is the first protocol to provide failure containment without reliably logging any information other than process checkpoints.

Pierre.Sens (at) lip6.fr