TY - GEN
T1 - Using Checkpoint Alteration to Gauge Fault Sensitivity of HPC Scientific Applications
AU - Rojas, Elvis
AU - Todd, Luis Carlos N.
AU - Meneses, Esteban
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - The impact of High Performance Computing (HPC) is ubiquitous. The main techniques of HPC (advanced computer simulation, sophisticated generative artificial intelligence, and big-data analytics) have been successfully applied to science, engineering, and business. The rapid progress in those fields permits the solution of highly complex problems of our society. To maintain a fast-moving innovation rate, sustainable progress of HPC infrastructure is fundamental. However, large-scale computers are prone to fail due to several factors. For one, the sheer number of components is gigantic. The more components a system has, the more likely it is to fail. For another, those components are built using a shrinking feature size, making them more sensitive to voltage fluctuations, high-energy radiation, and temperature. Future HPC systems will experience a higher rate of silent data corruption (SDC): faults that corrupt the state of the system but do not necessarily provoke an abrupt error. This paper examines the sensitivity of HPC codes to bitflips by using a portable mechanism called checkpoint alteration. The results show that HPC applications are resilient to single-bit errors, most of the time showing no alteration to the expected results. However, multiple-bit errors are often disastrous for executions.
AB - The impact of High Performance Computing (HPC) is ubiquitous. The main techniques of HPC (advanced computer simulation, sophisticated generative artificial intelligence, and big-data analytics) have been successfully applied to science, engineering, and business. The rapid progress in those fields permits the solution of highly complex problems of our society. To maintain a fast-moving innovation rate, sustainable progress of HPC infrastructure is fundamental. However, large-scale computers are prone to fail due to several factors. For one, the sheer number of components is gigantic. The more components a system has, the more likely it is to fail. For another, those components are built using a shrinking feature size, making them more sensitive to voltage fluctuations, high-energy radiation, and temperature. Future HPC systems will experience a higher rate of silent data corruption (SDC): faults that corrupt the state of the system but do not necessarily provoke an abrupt error. This paper examines the sensitivity of HPC codes to bitflips by using a portable mechanism called checkpoint alteration. The results show that HPC applications are resilient to single-bit errors, most of the time showing no alteration to the expected results. However, multiple-bit errors are often disastrous for executions.
KW - checkpoint alteration
KW - fault sensitivity
KW - HPC applications
UR - https://www.scopus.com/pages/publications/105015354691
U2 - 10.1109/IPDPSW66978.2025.00071
DO - 10.1109/IPDPSW66978.2025.00071
M3 - Conference contribution
AN - SCOPUS:105015354691
T3 - Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025
SP - 453
EP - 462
BT - Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025
Y2 - 3 June 2025 through 7 June 2025
ER -