Exploring the Effects of Silent Data Corruption in Distributed Deep Learning Training

Elvis Rojas, Diego Perez, Esteban Meneses

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

The profound impact of recent developments in artificial intelligence is unquestionable. The applications of deep learning models are everywhere, from advanced natural language processing to highly accurate prediction of extreme weather. Those models have continuously increased in complexity, becoming much more powerful than their original versions. In addition, data to train the models is becoming more available as technological infrastructures sense and collect more readings. Consequently, distributed deep learning training is often necessary to handle intricate models and massive datasets. Running a distributed training strategy on a supercomputer exposes the models to all the considerations of a large-scale machine; reliability is one of them. As supercomputers integrate a colossal number of components, each fabricated at an ever-decreasing feature size, faults are common during program execution. A particular type of fault, silent data corruption (SDC), is troublesome because the system does not crash and gives no immediately evident sign of an error. We set out to explore the effects of this type of fault by inspecting how distributed deep learning training strategies cope with bit-flips that affect their internal data structures. We used checkpoint alteration, a technique that permits the study of this phenomenon on different distributed training platforms and with different deep learning frameworks. We evaluated two distributed learning libraries (Distributed Data Parallel and Horovod) and found that Horovod is slightly more resilient to SDCs. However, fault propagation is similar in both cases, and the model is more sensitive to SDCs than the optimizer.
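Checkpoint alteration, as described in the abstract, amounts to injecting bit-flips into the serialized training state between epochs. The authors' injector is not reproduced here; the following is a minimal illustrative sketch in Python, assuming a PyTorch-style checkpoint whose "model" entry is a state dict of float32 tensors. The key name, file paths, and the flip_bit/inject_sdc helpers are hypothetical, not the paper's actual code.

import random
import struct

import torch

def flip_bit(value: float, bit: int) -> float:
    # Reinterpret the float as its 32-bit IEEE-754 pattern, XOR one bit,
    # and reinterpret the result as a float again.
    (pattern,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", pattern ^ (1 << bit)))
    return flipped

def inject_sdc(checkpoint_in: str, checkpoint_out: str) -> None:
    # Load the checkpoint on the CPU, pick one parameter tensor, one element,
    # and one bit position uniformly at random, then save the altered copy.
    state = torch.load(checkpoint_in, map_location="cpu")
    params = state["model"]  # hypothetical key; real layouts vary by framework
    name = random.choice(list(params.keys()))
    flat = params[name].view(-1)  # view shares storage, so writes persist
    idx = random.randrange(flat.numel())
    bit = random.randrange(32)  # assumes float32 parameters
    flat[idx] = flip_bit(flat[idx].item(), bit)
    torch.save(state, checkpoint_out)

inject_sdc("epoch_10.pt", "epoch_10_sdc.pt")

To observe fault propagation in the manner the paper describes, one would resume training from the altered checkpoint and compare the loss or accuracy trajectory against a fault-free run; targeting the optimizer state instead of the model would probe the sensitivity difference the abstract reports.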

Original language: English
Title of host publication: Proceedings - 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2022
Publisher: IEEE Computer Society
Pages: 21-30
Number of pages: 10
ISBN (electronic): 9781665451550
DOI
Status: Published - 2022
Event: 34th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2022 - Bordeaux, France
Duration: 2 Nov 2022 - 5 Nov 2022

Publication series

Name: Proceedings - Symposium on Computer Architecture and High Performance Computing
ISSN (print): 1550-6533

Conference

Conference: 34th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2022
Country/Territory: France
City: Bordeaux
Period: 2/11/22 - 5/11/22
