On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threading

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

This paper studies the use of Redundant Multi-Threading (RMT) to detect Silent Data Corruptions in HPC applications. To understand whether RMT can be a viable solution in an HPC context, we study two software optimizations that lower its performance overhead by reducing the amount of data exchanged between the replicated threads. We conduct experiments with representative HPC workloads to measure the performance gains obtained through these optimizations and the error detection coverage they achieve. In the best case, when running on a processor that features Simultaneous Multi-Threading, our results show that the overhead can be as low as 1.4× without significantly reducing the ability to detect data corruptions.
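The general idea behind RMT can be illustrated with a minimal sketch. It is not the paper's implementation: the workload (`step`), the simulated bit flip (`fault_at`), and the `check_every` knob are all hypothetical names introduced here. In particular, `check_every` is only an illustrative way of exchanging less data between the replicas and is not claimed to be either of the two optimizations studied in the paper, which the abstract does not describe.

```python
import queue
import threading


def run_redundant(values, check_every=1, fault_at=None):
    """Run one computation in a leading and a trailing thread and
    return the step indices at which the two replicas disagree."""
    channel = queue.Queue()  # leading -> trailing
    mismatches = []

    def step(acc, x):
        # Hypothetical workload: running sum of squares over integers.
        return acc + x * x

    def leading():
        acc = 0
        for i, x in enumerate(values):
            acc = step(acc, x)
            if i == fault_at:
                acc ^= 1 << 3  # simulated silent bit flip (assumption)
            if i % check_every == 0:
                channel.put((i, acc))  # forward a value to be checked
        channel.put(None)  # sentinel: no more values

    def trailing():
        acc = 0
        done = -1  # index of the last step recomputed so far
        while True:
            msg = channel.get()
            if msg is None:
                break
            idx, lead_acc = msg
            while done < idx:  # redo the work up to the checked step
                done += 1
                acc = step(acc, values[done])
            if acc != lead_acc:
                mismatches.append(idx)

    threads = [threading.Thread(target=leading),
               threading.Thread(target=trailing)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches


data = list(range(10))
print(run_redundant(data))                              # []
print(run_redundant(data, fault_at=4))                  # [4, 5, 6, 7, 8, 9]
print(run_redundant(data, check_every=3, fault_at=4))   # [6, 9]
print(run_redundant(data, check_every=4, fault_at=9))   # []
```

The last call shows the trade-off the abstract alludes to: exchanging fewer values lowers communication between the replicas, but a corruption that occurs after the final checked step goes undetected in this sketch.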

Original language: English
Title of host publication: Euro-Par 2020
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2020 International Workshops, 2020, Revised Selected Papers
Editors: Bartosz Balis, Dora B. Heras, Laura Antonelli, Andrea Bracciali, Thomas Gruber, Jin Hyun-Wook, Michael Kuhn, Stephen L. Scott, Didem Unat, Roman Wyrzykowski
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 290-302
Number of pages: 13
ISBN (Print): 9783030715922
State: Published - 2021
Event: Workshops held at the 26th International Conference on Parallel and Distributed Computing, Euro-Par 2020 - Virtual, Online
Duration: 24 Aug 2020 to 25 Aug 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12480 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Workshops held at the 26th International Conference on Parallel and Distributed Computing, Euro-Par 2020
City: Virtual, Online
Period: 24/08/20 to 25/08/20

Keywords

  • HPC
  • Redundant multi-threading
  • Silent data corruptions
