Towards a model to estimate the reliability of large-scale hybrid supercomputers

Elvis Rojas, Esteban Meneses, Terry Jones, Don Maxwell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Supercomputers stand as a fundamental tool for developing our understanding of the universe. State-of-the-art scientific simulations, big data analyses, and machine learning executions require high performance computing platforms. Such infrastructures have been growing lately with the addition of thousands of newly designed components, calling their resiliency into question. It is crucial to solidify our knowledge on the way supercomputers fail. Other recent studies have highlighted the importance of characterizing failures on supercomputers. This paper aims at modelling component failures of a supercomputer based on Mixed Weibull distributions. The model is built using a real-life multi-year failure record from a leadership-class supercomputer. Using several key observations from the data, we designed an analytical model that is robust enough to represent each of the main components of supercomputers, yet it is flexible enough to alter the composition of the machine and be able to predict resilience of future or hypothetical systems.

Original language: English
Title of host publication: Euro-Par 2020
Subtitle of host publication: Parallel Processing - 26th International Conference on Parallel and Distributed Computing, Proceedings
Editors: Maciej Malawski, Krzysztof Rzadca
Publisher: Springer
Pages: 37-51
Number of pages: 15
ISBN (Print): 9783030576745
DOI
Status: Published - 2020
Event: 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020 - Warsaw, Poland
Duration: 24 Aug 2020 to 28 Aug 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12247 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020
Country/Territory: Poland
City: Warsaw
Period: 24/08/20 to 28/08/20
