On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threading

Diego Pérez, Thomas Ropars, Esteban Meneses

Research output: Chapter in book/report/conference proceedings; Conference contribution; peer-reviewed

1 Citation (Scopus)

Abstract

This paper studies the use of Redundant Multi-Threading (RMT) to detect Silent Data Corruptions in HPC applications. To understand if it can be a viable solution in an HPC context, we study two software optimizations to reduce RMT performance overhead by reducing the amount of data exchanged between the replicated threads. We conduct experiments with representative HPC workloads to measure the performance gains obtained through these optimizations, and the error detection coverage they achieve. In the best case, when running on a processor that features Simultaneous Multi-Threading, our results show that the overhead can be as low as 1.4× without significantly reducing the ability to detect data corruptions.
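To make the idea in the abstract concrete, the following is a minimal sketch of redundant multi-threading, not the authors' implementation. A leading and a trailing thread run the same computation, and the trailing thread compares its results with those published by the leading thread. The function name `run_redundant`, the `batch` parameter, and the `fault` hook are hypothetical names introduced for illustration. Folding several results into one exchanged hash is only an assumed stand-in for "reducing the amount of data exchanged"; the abstract does not say which two optimizations the paper studies.

```python
import queue
import threading


def run_redundant(compute, data, batch=1, fault=None):
    """Run `compute` over `data` in two threads; return indices of mismatching batches.

    batch: number of results folded into one exchanged message (an assumed,
           illustrative way to reduce traffic, not the paper's optimizations).
    fault: hypothetical hook (index, value) -> value that alters the trailing
           thread's results, used only to simulate a silent data corruption.
    """
    channel = queue.Queue()  # leading thread -> trailing thread
    mismatches = []

    def batches(values):
        # Group consecutive results into tuples of at most `batch` items.
        group = []
        for v in values:
            group.append(v)
            if len(group) == batch:
                yield tuple(group)
                group = []
        if group:
            yield tuple(group)

    def leading():
        # Publish one hash per batch instead of every individual value.
        for group in batches(compute(x) for x in data):
            channel.put(hash(group))

    def trailing():
        def results():
            for i, x in enumerate(data):
                y = compute(x)
                yield fault(i, y) if fault else y

        # Recompute everything and compare batch hashes with the leading thread.
        for n, group in enumerate(batches(results())):
            if hash(group) != channel.get():
                mismatches.append(n)

    threads = [threading.Thread(target=leading), threading.Thread(target=trailing)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches
```

With `batch=1` every result is checked individually, so a corruption is located exactly. A larger `batch` exchanges fewer messages but only identifies the batch that contains the corruption, which illustrates the general trade-off between communication volume and detection granularity.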

Original language: English
Title of host publication: Euro-Par 2020
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2020 International Workshops, 2020, Revised Selected Papers
Editors: Bartosz Balis, Dora B. Heras, Laura Antonelli, Andrea Bracciali, Thomas Gruber, Jin Hyun-Wook, Michael Kuhn, Stephen L. Scott, Didem Unat, Roman Wyrzykowski
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 290-302
Number of pages: 13
ISBN (print): 9783030715922
DOI
Status: Published - 2021
Event: Workshops held at the 26th International Conference on Parallel and Distributed Computing, Euro-Par 2020 - Virtual, Online
Duration: 24 Aug 2020 to 25 Aug 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12480 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: Workshops held at the 26th International Conference on Parallel and Distributed Computing, Euro-Par 2020
City: Virtual, Online
Period: 24/08/20 to 25/08/20
