TY - GEN
T1 - Team-based message logging
T2 - 10th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2010
AU - Meneses, Esteban
AU - Mendes, Celso L.
AU - Kalé, Laxmikant V.
PY - 2010
Y1 - 2010
N2 - Fault tolerance will be a fundamental imperative in the next decade as machines containing hundreds of thousands of cores will be installed at various locations. In this context, the traditional checkpoint/restart model does not seem to be a suitable option, since it makes all the processors roll back to their latest checkpoint in case of a single failure in one of the processors. In-memory message logging is an alternative that avoids this global restoration process and instead replays the messages to the failed processor. However, there is a large memory overhead associated with message logging because each message must be logged so it can be played back if a failure occurs. In this paper, we introduce a technique to alleviate the demand of memory in message logging by grouping processors into teams. These teams act as a failure unit: if one team member fails, all the other members in that team roll back to their latest checkpoint and start the recovery process. This eliminates the need to log message contents within teams. The savings in memory produced by this approach depend on the characteristics of the application, the number of messages sent per computation unit and size of those messages. We present promising results for multiple benchmarks. As an example, the NPB-CG code running class D on 512 cores manages to reduce the memory overhead of message logging by 62%.
AB - Fault tolerance will be a fundamental imperative in the next decade as machines containing hundreds of thousands of cores will be installed at various locations. In this context, the traditional checkpoint/restart model does not seem to be a suitable option, since it makes all the processors roll back to their latest checkpoint in case of a single failure in one of the processors. In-memory message logging is an alternative that avoids this global restoration process and instead replays the messages to the failed processor. However, there is a large memory overhead associated with message logging because each message must be logged so it can be played back if a failure occurs. In this paper, we introduce a technique to alleviate the demand of memory in message logging by grouping processors into teams. These teams act as a failure unit: if one team member fails, all the other members in that team roll back to their latest checkpoint and start the recovery process. This eliminates the need to log message contents within teams. The savings in memory produced by this approach depend on the characteristics of the application, the number of messages sent per computation unit and size of those messages. We present promising results for multiple benchmarks. As an example, the NPB-CG code running class D on 512 cores manages to reduce the memory overhead of message logging by 62%.
UR - http://www.scopus.com/inward/record.url?scp=77954923590&partnerID=8YFLogxK
U2 - 10.1109/CCGRID.2010.110
DO - 10.1109/CCGRID.2010.110
M3 - Contribución a la conferencia
AN - SCOPUS:77954923590
SN - 9781424469871
T3 - CCGrid 2010 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing
SP - 697
EP - 702
BT - CCGrid 2010 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing
Y2 - 17 May 2010 through 20 May 2010
ER -