ACR: Automatic checkpoint/restart for soft and hard error protection

Xiang Ni, Esteban Meneses, Nikhil Jain, Laxmikant V. Kalé

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

50 Citations (Scopus)

Abstract

As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.

Original language: English
Title of host publication: Proceedings of SC 2013
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Print): 9781450323789
DOI
State: Published - 2013
Externally published
Event: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013 - Denver, CO, United States
Duration: 17 Nov 2013 to 22 Nov 2013

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Country/Territory: United States
City: Denver, CO
Period: 17/11/13 to 22/11/13
