Scalable group-based checkpoint/restart for large-scale message-passing systems

Ho, JCY; Wang, CL; Lau, FCM

Publisher Website: 10.1109/IPDPS.2008.4536302
Scopus: eid_2-s2.0-51049086184

Supplementary

Citations:
- Scopus: 0
Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Scalable group-based checkpoint/restart for large-scale message-passing systems

Title	Scalable group-based checkpoint/restart for large-scale message-passing systems
Authors	Ho, JCY; Wang, CL; Lau, FCM
Issue Date	2008
Publisher	IEEE Computer Society.
Citation	The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, USA, 14-18 April 2008. In IEEE Symposium on Parallel and Distributed Processing Proceedings, 2008, p. article no. 4536302. DOI: http://dx.doi.org/10.1109/IPDPS.2008.4536302
Abstract	The ever-increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behaviors of the application is proposed. ©2008 IEEE.
Persistent Identifier	http://hdl.handle.net/10722/93186

DC Field	Value	Language
dc.contributor.author	Ho, JCY	en_HK
dc.contributor.author	Wang, CL	en_HK
dc.contributor.author	Lau, FCM	en_HK
dc.date.accessioned	2010-09-25T14:53:30Z	-
dc.date.available	2010-09-25T14:53:30Z	-
dc.date.issued	2008	en_HK
dc.identifier.citation	The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, USA, 14-18 April 2008. In IEEE Symposium on Parallel and Distributed Processing Proceedings, 2008, p. article no. 4536302	en_HK
dc.identifier.uri	http://hdl.handle.net/10722/93186	-
dc.description.abstract	The ever increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experiment results demonstrate its better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behaviors of the application is proposed. ©2008 IEEE.	en_HK
dc.language	eng	en_HK
dc.publisher	IEEE Computer Society.	en_HK
dc.relation.ispartof	IEEE Symposium on Parallel and Distributed Processing Proceedings	en_HK
dc.rights	©2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.	-
dc.title	Scalable group-based checkpoint/restart for large-scale message-passing systems	en_HK
dc.type	Conference_Paper	en_HK
dc.identifier.email	Wang, CL:clwang@cs.hku.hk	en_HK
dc.identifier.email	Lau, FCM:fcmlau@cs.hku.hk	en_HK
dc.identifier.authority	Wang, CL=rp00183	en_HK
dc.identifier.authority	Lau, FCM=rp00221	en_HK
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.1109/IPDPS.2008.4536302	en_HK
dc.identifier.scopus	eid_2-s2.0-51049086184	en_HK
dc.identifier.hkuros	149518	en_HK
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-51049086184&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.scopusauthorid	Ho, JCY=36952423300	en_HK
dc.identifier.scopusauthorid	Wang, CL=7501646188	en_HK
dc.identifier.scopusauthorid	Lau, FCM=7102749723	en_HK
dc.customcontrol.immutable	sml 160106 - amend	-
