Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Elvis Rojas, Diego Perez, Jon C. Calhoun, Leonardo Bautista Gomez, Terry Jones, Esteban Meneses

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

8 Citations (Scopus)

Abstract

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances and discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
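The idea of altering a checkpoint can be illustrated with a minimal Python sketch. It is not the authors' injector: the function names, the dataset path, and the assumption that weights are stored as little-endian float32 values are all hypothetical, and `h5py` is a third-party package assumed to be installed for the second function.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, encoded as IEEE 754 binary32, with one bit inverted.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 31]")
    # Reinterpret the float as a 32-bit unsigned integer, XOR the chosen
    # bit, then reinterpret the result as a float again.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def flip_bit_in_checkpoint(path, dataset, index, bit):
    """Flip one bit of one weight stored in an HDF5 checkpoint, in place.

    `dataset` is the HDF5 path of a float32 dataset and `index` is a tuple
    selecting a single element. Both depend on how the framework lays out
    its checkpoint, so they are assumptions of this sketch.
    """
    import h5py  # third-party; imported here so the helper above runs without it

    with h5py.File(path, "r+") as ckpt:
        weights = ckpt[dataset]
        if weights.dtype != "float32":
            raise TypeError("this sketch only handles float32 datasets")
        original = float(weights[index])
        weights[index] = flip_bit_float32(original, bit)
        return original, float(weights[index])
```

For example, `flip_bit_float32(1.0, 31)` flips the sign bit and returns `-1.0`, while `flip_bit_float32(1.0, 30)` sets the top exponent bit and returns `inf`, which suggests why the position of a bit-flip matters as much as the number of flips. A call such as `flip_bit_in_checkpoint("ckpt.h5", "layer/kernel", (0, 0), 30)` uses a made-up file name and dataset path; training would then resume from the altered file.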

Original language: English
Title of host publication: Proceedings - 2021 IEEE International Conference on Cluster Computing, Cluster 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 492-503
Number of pages: 12
ISBN (electronic): 9781728196664
DOI
Publication status: Published - 2021
Event: 2021 IEEE International Conference on Cluster Computing, Cluster 2021 - Virtual, Portland, United States
Duration: 7 Sep 2021 to 10 Sep 2021

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2021-September
ISSN (print): 1552-5244

Conference

Conference: 2021 IEEE International Conference on Cluster Computing, Cluster 2021
Country/Territory: United States
City: Virtual, Portland
Period: 7/09/21 to 10/09/21
