
A fault-tolerance protocol for parallel applications with communication imbalance

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

The predicted failure rates of future supercomputers loom over the groundbreaking research that large machines are expected to foster. Resilient extreme-scale applications are therefore an absolute necessity for using the new generation of supercomputers effectively. Rollback-recovery techniques have traditionally been used in HPC to provide resilience. Among those techniques, message logging offers the appealing features of saving energy, accelerating recovery, and incurring a low performance penalty. Its increased memory consumption is, however, an important downside. This paper introduces memory-constrained message logging (MCML), a general framework for decreasing the memory footprint of message-logging protocols. In particular, we demonstrate the effectiveness of MCML in keeping message logging feasible for applications with substantial communication imbalance. Applications of this type appear in many scientific fields. We present experimental results with several parallel codes running on up to 4,096 cores. Using those results and an analytical model, we predict that MCML can reduce execution time by up to 25% and energy consumption by up to 15% at extreme scale.

Original language: English
Title of host publication: Proceedings - IEEE 27th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2015
Publisher: IEEE Computer Society
Pages: 162-169
Number of pages: 8
ISBN (Electronic): 9781467380119
State: Published - 12 Jan 2016
Event: 27th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2015 - Florianopolis, Brazil
Duration: 18 Oct 2015 to 21 Oct 2015

Publication series

Name: Proceedings - Symposium on Computer Architecture and High Performance Computing
Volume: 2016-January
ISSN (Print): 1550-6533

Conference

Conference: 27th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2015
Country/Territory: Brazil
City: Florianopolis
Period: 18/10/15 to 21/10/15

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 7 - Affordable and Clean Energy

Keywords

  • Communication imbalance
  • Fault tolerance
  • Message logging
