A fault-tolerance protocol for parallel applications with communication imbalance

Esteban Meneses, Laxmikant V. Kale

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

Abstract

The predicted failure rates of future supercomputers loom over the groundbreaking research that large machines are expected to foster. Therefore, resilient extreme-scale applications are an absolute necessity to effectively use the new generation of supercomputers. Rollback-recovery techniques have traditionally been used in HPC to provide resilience. Among those techniques, message logging provides the appealing features of saving energy, accelerating recovery, and incurring a low performance penalty. Its increased memory consumption is, however, an important downside. This paper introduces memory-constrained message logging (MCML), a general framework for decreasing the memory footprint of message-logging protocols. In particular, we demonstrate the effectiveness of MCML in keeping message logging feasible for applications with substantial communication imbalance. This type of application appears in many scientific fields. We present experimental results with several parallel codes running on up to 4,096 cores. Using those results and an analytical model, we predict that MCML can reduce execution time by up to 25% and energy consumption by up to 15% at extreme scale.

Original language: English
Title of host publication: Proceedings - IEEE 27th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2015
Publisher: IEEE Computer Society
Pages: 162-169
Number of pages: 8
ISBN (electronic): 9781467380119
State: Published - 12 Jan 2016
Event: 27th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2015 - Florianopolis, Brazil
Duration: 18 Oct 2015 to 21 Oct 2015

Publication series

Name: Proceedings - Symposium on Computer Architecture and High Performance Computing
Volume: 2016-January
ISSN (print): 1550-6533

Conference

Conference: 27th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2015
Country/Territory: Brazil
City: Florianopolis
Period: 18/10/15 to 21/10/15
