Conference Paper: Scalable group-based checkpoint/restart for large-scale message-passing systems

Title: Scalable group-based checkpoint/restart for large-scale message-passing systems
Authors: Ho, JCY; Wang, CL; Lau, FCM
Issue Date: 2008
Publisher: IEEE Computer Society
Citation: The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, USA, 14-18 April 2008. In IEEE Symposium on Parallel and Distributed Processing Proceedings, 2008, article no. 4536302
Abstract: The ever increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experiment results demonstrate its better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behaviors of the application is proposed. ©2008 IEEE.
Persistent Identifier: http://hdl.handle.net/10722/93186

DC Field | Value | Language
dc.contributor.author | Ho, JCY | en_HK
dc.contributor.author | Wang, CL | en_HK
dc.contributor.author | Lau, FCM | en_HK
dc.date.accessioned | 2010-09-25T14:53:30Z | -
dc.date.available | 2010-09-25T14:53:30Z | -
dc.date.issued | 2008 | en_HK
dc.identifier.citation | The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, USA, 14-18 April 2008. In IEEE Symposium on Parallel and Distributed Processing Proceedings, 2008, p. article no. 4536302 | en_HK
dc.identifier.uri | http://hdl.handle.net/10722/93186 | -
dc.description.abstract | The ever increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experiment results demonstrate its better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behaviors of the application is proposed. ©2008 IEEE. | en_HK
dc.language | eng | en_HK
dc.publisher | IEEE Computer Society. | en_HK
dc.relation.ispartof | IEEE Symposium on Parallel and Distributed Processing Proceedings | en_HK
dc.rights | ©2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. | -
dc.title | Scalable group-based checkpoint/restart for large-scale message-passing systems | en_HK
dc.type | Conference_Paper | en_HK
dc.identifier.email | Wang, CL: clwang@cs.hku.hk | en_HK
dc.identifier.email | Lau, FCM: fcmlau@cs.hku.hk | en_HK
dc.identifier.authority | Wang, CL=rp00183 | en_HK
dc.identifier.authority | Lau, FCM=rp00221 | en_HK
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.1109/IPDPS.2008.4536302 | en_HK
dc.identifier.scopus | eid_2-s2.0-51049086184 | en_HK
dc.identifier.hkuros | 149518 | en_HK
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-51049086184&selection=ref&src=s&origin=recordpage | en_HK
dc.identifier.scopusauthorid | Ho, JCY=36952423300 | en_HK
dc.identifier.scopusauthorid | Wang, CL=7501646188 | en_HK
dc.identifier.scopusauthorid | Lau, FCM=7102749723 | en_HK
dc.customcontrol.immutable | sml 160106 - amend | -
