TY - GEN
T1 - Dynamic load balance for optimized message logging in fault tolerant HPC applications
AU - Meneses, Esteban
AU - Kalé, Laxmikant V.
AU - Bronevetsky, Greg
PY - 2011
Y1 - 2011
N2 - Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, it is expected that failures become more prevalent in those machines to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead from logging all communication, we reduce this cost by organizing processing elements into teams and only logging messages between teams. Further, we show how to dynamically partition the application into teams to simultaneously minimize the cost of fault tolerance and to balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
AB - Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, it is expected that failures become more prevalent in those machines to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead from logging all communication, we reduce this cost by organizing processing elements into teams and only logging messages between teams. Further, we show how to dynamically partition the application into teams to simultaneously minimize the cost of fault tolerance and to balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
KW - causal message logging
KW - fault tolerance
KW - load balancing
UR - http://www.scopus.com/inward/record.url?scp=80955167907&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2011.39
DO - 10.1109/CLUSTER.2011.39
M3 - Conference contribution
AN - SCOPUS:80955167907
SN - 9780769545165
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 281
EP - 289
BT - Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011
T2 - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011
Y2 - 26 September 2011 through 30 September 2011
ER -