
Using Checkpoint Alteration to Gauge Fault Sensitivity of HPC Scientific Applications

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The impact of High Performance Computing (HPC) is ubiquitous. The main techniques of HPC (advanced computer simulation, sophisticated generative artificial intelligence, and big-data analytics) have been successfully applied to science, engineering, and business. Rapid progress in those fields makes it possible to solve highly complex problems facing our society. To maintain a fast rate of innovation, sustainable progress in HPC infrastructure is fundamental. However, large-scale computers are prone to failure for several reasons. For one, the sheer number of components is enormous, and the more components a system has, the more likely it is to fail. For another, those components are built with a shrinking feature size, which makes them more sensitive to voltage fluctuations, high-energy radiation, and temperature. Future HPC systems will experience a higher rate of silent data corruption (SDC): faults that corrupt the state of the system but do not necessarily provoke an abrupt error. This paper examines the sensitivity of HPC codes to bitflips by using a portable mechanism called checkpoint alteration. The results show that HPC applications are resilient to single-bit errors, which most of the time cause no alteration to the expected results. However, multiple-bit errors are often disastrous for executions.
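
As a rough illustration of the idea behind checkpoint alteration, the sketch below flips chosen bits in an in-memory copy of a checkpoint. It is not the tool used in the paper: the helper names `flip_bit` and `alter_checkpoint` are hypothetical, and the checkpoint is assumed to be a raw array of little-endian 64-bit floats, which real checkpoint formats may not match.

```python
import struct


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Flip one bit in place; bit_index counts from the start of the buffer."""
    byte, offset = divmod(bit_index, 8)
    buf[byte] ^= 1 << offset


def alter_checkpoint(data: bytes, bit_indices) -> bytes:
    """Return a copy of the checkpoint bytes with the given bits flipped."""
    altered = bytearray(data)
    for index in bit_indices:
        flip_bit(altered, index)
    return bytes(altered)


# Toy "checkpoint" (an assumption): three float64 values, little-endian.
checkpoint = struct.pack("<3d", 1.0, 2.0, 3.0)

# Single-bit error in the lowest mantissa bit of the first value.
single = alter_checkpoint(checkpoint, [0])

# Multiple-bit error hitting exponent bits of the first two values.
multi = alter_checkpoint(checkpoint, [62, 126])

print(struct.unpack("<3d", single))
print(struct.unpack("<3d", multi))
```

In this toy case, the low-order mantissa flip perturbs the first value by one unit in the last place, while the exponent flips turn 1.0 into infinity and 2.0 into 0.0. An application restarted from the altered checkpoint could then be compared against a fault-free run to gauge its sensitivity.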

Original language: English
Title of host publication: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 453-462
Number of pages: 10
ISBN (Electronic): 9798331526436
State: Published - 2025
Event: 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025 - Milan, Italy
Duration: 3 Jun 2025 → 7 Jun 2025

Publication series

Name: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025

Conference

Conference: 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025
Country/Territory: Italy
City: Milan
Period: 3/06/25 → 7/06/25

Keywords

  • checkpoint alteration
  • fault sensitivity
  • HPC applications
