A New Hierarchical Fault Tolerance Protocol for MPI HPC Applications & Unified Model for Fault Tolerance Protocols
Speaker: Amina Guermouche (UVSQ)
High-performance computing will probably reach exascale within this decade. At that scale, the mean time between failures is expected to be a few hours. The talk addresses two aspects of fault tolerance at exascale. First, I will present a protocol that combines coordinated checkpointing and message logging and that operates on clusters of processes. Many protocols based on this idea already exist in the literature.
This talk presents a new hierarchical protocol that, unlike all existing hierarchical protocols, logs only message payloads. It is based on a study of MPI applications, which shows that many of them are "send-deterministic" and that, in many cases, their communication patterns allow processes to be grouped. Second, I will present a unified model for comparing coordinated checkpointing protocols and hierarchical protocols. The goal of the model is to help evaluate both kinds of protocol in order to choose the one best suited to exascale architectures.
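
As a rough illustration of payload-only logging across groups of processes, the sketch below records a message payload only when the sender and receiver belong to different groups. It is not the protocol presented in the talk: the class `PayloadLogger`, the `cluster_of` mapping, and the rule that messages inside a group need no logging (because coordinated checkpointing is assumed to cover them) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PayloadLogger:
    """Sender-side payload log for one process (hypothetical illustration)."""
    cluster_of: dict   # rank -> group id; assumed to be given in advance
    rank: int          # rank of the process that owns this log
    log: list = field(default_factory=list)

    def send(self, dest: int, payload: bytes) -> bool:
        """Log the payload only if the message leaves the sender's group.

        Returns True when the payload was logged.
        """
        crosses = self.cluster_of[self.rank] != self.cluster_of[dest]
        if crosses:
            # Only the payload and its destination are kept; no extra
            # ordering information is recorded in this sketch.
            self.log.append((dest, payload))
        return crosses


# Four processes split into two groups (assumed layout).
clusters = {0: 0, 1: 0, 2: 1, 3: 1}
p0 = PayloadLogger(cluster_of=clusters, rank=0)
intra = p0.send(1, b"halo")      # same group: not logged
inter = p0.send(2, b"boundary")  # different group: payload logged
```

In this sketch, the grouping determines how much is logged: the more communication stays inside a group, the fewer payloads have to be stored.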