Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework

Elvis Rojas, Michael Knobloch, Nour Daoud, Esteban Meneses, Bernd Mohr

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

Abstract

Deep Learning (DL) applications are used to solve complex problems efficiently. These applications require complex neural network models composed of millions of parameters and huge amounts of data for proper training. This is only possible by parallelizing the necessary computations with so-called distributed deep learning (DDL) frameworks over many GPUs distributed across multiple nodes of an HPC cluster. These frameworks mostly utilize the compute power of the GPUs and use only a small portion of the available compute power of the CPUs in the nodes for I/O and inter-process communication, leaving many CPU cores idle. The more powerful the base CPU in the cluster nodes, the more compute resources are wasted. In this paper, we investigate how much of these unutilized compute resources could be used for executing other applications without lowering the performance of the DDL frameworks. In our experiments, we executed a noise-generation application, which generates a very high memory, network, or I/O load, in parallel with DDL frameworks, and used HPC profiling and tracing techniques to determine whether and how the generated noise affects the performance of the DDL frameworks. Early results indicate that it might be possible to utilize the idle cores for jobs of other users without negatively affecting the performance of the DDL applications.

Original language: English
Title of host publication: Proceedings - 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 516-522
Number of pages: 7
ISBN (electronic): 9781665498562
Status: Published - 2022
Event: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022 - Heidelberg, Germany
Duration: 6 Sep 2022 to 9 Sep 2022

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2022-September
ISSN (print): 1552-5244

Conference

Conference: 2022 IEEE International Conference on Cluster Computing, CLUSTER 2022
Country/Territory: Germany
City: Heidelberg
Period: 6/09/22 to 9/09/22
