Large-Scale Distributed Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch

Elvis Rojas, Fabricio Quirós-Corella, Terry Jones, Esteban Meneses

Research output: Chapter in book/report/conference proceeding › Conference contribution › peer review

5 Citations (Scopus)

Abstract

Artificial intelligence is a transformative technology for creating new scientific discoveries, services, and products. Its full potential is achieved when massive data repositories and large-scale computing systems are available. Both factors are becoming easier to obtain daily as sensor networks constantly create open-data archives and Moore's law keeps making supercomputing power more accessible. However, as deep learning models become larger to tackle data complexity, researchers must determine how to speed up training in those models. This paper takes an experimental approach to understanding the algorithms and trade-offs associated with distributed deep learning. Using the Summit supercomputer at Oak Ridge National Laboratory, this study finds that existing distributed deep learning mechanisms scale well in execution time, but that accuracy degrades significantly as more nodes are used. Countering this degradation requires tuning several hyper-parameters, and the results show that optimizing those parameters is a nontrivial task. We also evaluated the impact of other scaling techniques, such as mixed precision and adaptive parameter optimization.
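As a concrete illustration of the mechanisms the abstract refers to, the sketch below shows a minimal PyTorch data-parallel training loop that combines DistributedDataParallel with automatic mixed precision. It is not the paper's code: the model, the synthetic dataset, the batch size, and the learning rate are placeholder assumptions chosen only to make the example self-contained and runnable under torchrun.

# Minimal sketch (assumed, not from the paper): distributed data-parallel
# training with automatic mixed precision in PyTorch. One process per GPU;
# RANK, WORLD_SIZE, MASTER_ADDR, and LOCAL_RANK are set by the launcher.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data standing in for a real workload.
    model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
    model = DDP(model, device_ids=[local_rank])

    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)          # shards the dataset across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()        # loss scaling for mixed precision
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():     # forward pass in mixed precision
                loss = loss_fn(model(x), y)
            scaler.scale(loss).backward()       # gradients are all-reduced by DDP
            scaler.step(optimizer)
            scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Such a script would typically be launched across nodes with torchrun (e.g. torchrun --nnodes=N --nproc_per_node=GPUS script.py); scaling the node count while keeping the per-GPU batch size fixed grows the effective global batch size, which is what forces the learning-rate and other hyper-parameter retuning the abstract describes.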

Original language: English
Host publication title: High Performance Computing - 8th Latin American Conference, CARLA 2021, Revised Selected Papers
Editors: Isidoro Gitler, Carlos Jaime Barrios Hernández, Esteban Meneses
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 177-192
Number of pages: 16
ISBN (print): 9783031042089
DOI
Status: Published - 2022
Event: 8th Latin American High Performance Computing Conference, CARLA 2021 - Virtual, Online
Duration: 6 Oct 2021 - 8 Oct 2021

Publication series

Name: Communications in Computer and Information Science
Volume: 1540 CCIS
ISSN (print): 1865-0929
ISSN (electronic): 1865-0937

Conference

Conference: 8th Latin American High Performance Computing Conference, CARLA 2021
City: Virtual, Online
Period: 6/10/21 - 8/10/21
