Resilience in high-level parallel programming languages

Friday, March 1, 2019
Speaker: Sara Hamouda (Australian National University & Inria)

The consistent trends of increasing core counts and decreasing mean time to failure in supercomputers make resilience support a necessity in HPC programming models. High-level programming models that aim to simplify parallel programming often prefer to support resilience transparently, shifting the whole burden of fault tolerance from the programmer to the runtime system. This approach, despite its productivity advantage, limits the programmer’s ability to tailor more efficient fault tolerance techniques to specific applications. We argue that high-level programming models can bridge the gap between resilience, productivity and performance by supporting "multi-resolution resilience" through efficient and composable resilient abstractions. In this talk, we will explain the multi-resolution resilience approach in the context of the X10 programming language, a productive HPC programming language developed by IBM. We will describe the X10 programming model, its resilient constructs, and how they can be used for building resilient application frameworks at different levels of abstraction. We will also describe how we tackled the high resilience overhead of these constructs and demonstrate the achieved performance using micro-benchmarks and applications running under multiple process failures.

marc.shapiro (at)