Using Checkpoint Alteration to Gauge Fault Sensitivity of HPC Scientific Applications

Elvis Rojas, Luis Carlos N. Todd, Esteban Meneses

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

Abstract

The impact of High Performance Computing (HPC) is ubiquitous. The main techniques of HPC (advanced computer simulation, sophisticated generative artificial intelligence, and big-data analytics) have been successfully applied to science, engineering, and business. The rapid progress in those fields permits the solution of highly complex problems facing our society. To maintain a fast-moving innovation rate, sustainable progress of HPC infrastructure is fundamental. However, large-scale computers are prone to failure for several reasons. For one, the sheer number of components is gigantic: the more components a system has, the more likely it is to fail. For another, those components are built with a shrinking feature size, making them more sensitive to voltage fluctuations, high-energy radiation, and temperature. Future HPC systems will experience a higher rate of silent data corruption (SDC): faults that corrupt the state of the system but do not necessarily provoke an abrupt error. This paper examines the sensitivity of HPC codes to bitflips by using a portable mechanism called checkpoint alteration. The results show that HPC applications are resilient to single-bit errors, which most of the time cause no alteration to the expected results. However, multiple-bit errors are often disastrous for executions.
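As a rough illustration of the idea named in the abstract, the sketch below flips a single bit inside a saved checkpoint and reloads it. It is not the authors' tool: the checkpoint layout (a flat sequence of little-endian 64-bit floats) and the names `alter_checkpoint` and `load_checkpoint` are assumptions made only for this example.

```python
import struct

DOUBLE_SIZE = 8  # bytes per IEEE 754 binary64 value (assumed checkpoint layout)


def alter_checkpoint(checkpoint: bytes, index: int, bit: int) -> bytes:
    """Return a copy of the checkpoint with one bit of one value inverted.

    `index` selects the value; `bit` is 0 (lowest mantissa bit) to 63 (sign).
    Hypothetical helper: real checkpoint formats carry headers and metadata.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    if not 0 <= index < len(checkpoint) // DOUBLE_SIZE:
        raise IndexError("value index out of range")
    altered = bytearray(checkpoint)
    # Little-endian storage: bit 0 of the value lives in its first byte.
    altered[index * DOUBLE_SIZE + bit // 8] ^= 1 << (bit % 8)
    return bytes(altered)


def load_checkpoint(checkpoint: bytes) -> list[float]:
    """Decode the assumed flat layout back into a list of floats."""
    count = len(checkpoint) // DOUBLE_SIZE
    return list(struct.unpack(f"<{count}d", checkpoint))


# A toy application state written to a checkpoint.
checkpoint = struct.pack("<3d", 1.0, 2.0, 3.0)

# Flip the lowest mantissa bit and, separately, the highest exponent bit.
low = load_checkpoint(alter_checkpoint(checkpoint, index=0, bit=0))
high = load_checkpoint(alter_checkpoint(checkpoint, index=0, bit=62))
```

In this toy case, flipping the lowest mantissa bit of 1.0 changes it by 2^-52, while flipping the highest exponent bit turns it into infinity. This shows why the position of a flipped bit matters so much; it says nothing about the paper's measured results.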

Original language: English
Title of host publication: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 453-462
Number of pages: 10
ISBN (electronic): 9798331526436
DOI
Status: Published - 2025
Event: 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025 - Milan, Italy
Duration: 3 Jun 2025 to 7 Jun 2025

Publication series

Name: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025

Conference

Conference: 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025
Country/Territory: Italy
City: Milan
Period: 3/06/25 to 7/06/25
