
Reducing the overhead of message logging in fault-tolerant HPC applications

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

With the exascale era within reach, the high-performance computing community is preparing to embrace the challenges associated with extreme-scale systems. Resilience arises as one of the major hurdles in making those systems usable for the advancement of science and industry. Message logging is a well-known strategy for providing fault tolerance, and it is promising because it avoids a global restart. However, message-logging protocols may suffer considerable overhead if implemented for the general case. This paper introduces a new message-logging protocol that leverages the benefits of a flexible parallel programming paradigm. We evaluate the protocol using a particular type of application and demonstrate that it maintains a low performance penalty when scaling up to 128,000 cores.

Original language: English
Title of host publication: High Performance Computing - 3rd Latin American Conference, CARLA 2016, Revised Selected Papers
Editors: Carlos Jaime Barrios Hernandez, Isidoro Gitler, Jaime Klapp
Publisher: Springer Verlag
Pages: 204-218
Number of pages: 15
ISBN (Print): 9783319579719
DOIs
State: Published - 2017
Event: 3rd Latin American Conference on High Performance Computing, CARLA 2016 - Mexico City, Mexico
Duration: 29 Aug 2016 to 2 Sep 2016

Publication series

Name: Communications in Computer and Information Science
Volume: 697
ISSN (Print): 1865-0929

Conference

Conference: 3rd Latin American Conference on High Performance Computing, CARLA 2016
Country/Territory: Mexico
City: Mexico City
Period: 29/08/16 to 2/09/16

Keywords

  • Fault tolerance
  • Message logging
  • Resilience
