Reducing the overhead of message logging in fault-tolerant HPC applications

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

Abstract

With the exascale era within reach, the high performance computing community is preparing to embrace the challenges associated with extreme-scale systems. Resilience emerges as one of the major hurdles in making those systems usable for the advance of science and industry. Message logging is a well-known strategy to provide fault tolerance, one that is promising due to its ability to avoid global restart. However, message-logging protocols may suffer considerable overhead if implemented for the general case. This paper introduces a new message-logging protocol that leverages the benefits of a flexible parallel programming paradigm. We evaluate the protocol using a particular type of application and demonstrate that it can maintain a low performance penalty when scaling up to 128,000 cores.

Original language: English
Title of host publication: High Performance Computing - 3rd Latin American Conference, CARLA 2016, Revised Selected Papers
Editors: Carlos Jaime Barrios Hernandez, Isidoro Gitler, Jaime Klapp
Publisher: Springer Verlag
Pages: 204-218
Number of pages: 15
ISBN (print): 9783319579719
DOI
Status: Published - 2017
Event: 3rd Latin American Conference on High Performance Computing, CARLA 2016 - Mexico City, Mexico
Duration: 29 Aug 2016 to 2 Sep 2016

Publication series

Name: Communications in Computer and Information Science
Volume: 697
ISSN (print): 1865-0929

Conference

Conference: 3rd Latin American Conference on High Performance Computing, CARLA 2016
Country/Territory: Mexico
City: Mexico City
Period: 29/08/16 to 2/09/16
