Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1109/IPDPS.2008.4536302
- Scopus: eid_2-s2.0-51049086184
Conference Paper: Scalable group-based checkpoint/restart for large-scale message-passing systems
Title | Scalable group-based checkpoint/restart for large-scale message-passing systems |
---|---|
Authors | Ho, JCY; Wang, CL; Lau, FCM |
Issue Date | 2008 |
Publisher | IEEE Computer Society. |
Citation | The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, USA, 14-18 April 2008. In IEEE Symposium on Parallel and Distributed Processing Proceedings, 2008, article no. 4536302 |
Abstract | The ever-increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate that it performs and scales better than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behaviors of the application is proposed. ©2008 IEEE. |
Persistent Identifier | http://hdl.handle.net/10722/93186 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ho, JCY | en_HK |
dc.contributor.author | Wang, CL | en_HK |
dc.contributor.author | Lau, FCM | en_HK |
dc.date.accessioned | 2010-09-25T14:53:30Z | - |
dc.date.available | 2010-09-25T14:53:30Z | - |
dc.date.issued | 2008 | en_HK |
dc.identifier.citation | The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, USA, 14-18 April 2008. In IEEE Symposium on Parallel and Distributed Processing Proceedings, 2008, p. article no. 4536302 | en_HK |
dc.identifier.uri | http://hdl.handle.net/10722/93186 | - |
dc.description.abstract | The ever increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experiment results demonstrate its better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behaviors of the application is proposed. ©2008 IEEE. | en_HK |
dc.language | eng | en_HK |
dc.publisher | IEEE Computer Society. | en_HK |
dc.relation.ispartof | IEEE Symposium on Parallel and Distributed Processing Proceedings | en_HK |
dc.rights | ©2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. | - |
dc.title | Scalable group-based checkpoint/restart for large-scale message-passing systems | en_HK |
dc.type | Conference_Paper | en_HK |
dc.identifier.email | Wang, CL:clwang@cs.hku.hk | en_HK |
dc.identifier.email | Lau, FCM:fcmlau@cs.hku.hk | en_HK |
dc.identifier.authority | Wang, CL=rp00183 | en_HK |
dc.identifier.authority | Lau, FCM=rp00221 | en_HK |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.1109/IPDPS.2008.4536302 | en_HK |
dc.identifier.scopus | eid_2-s2.0-51049086184 | en_HK |
dc.identifier.hkuros | 149518 | en_HK |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-51049086184&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.scopusauthorid | Ho, JCY=36952423300 | en_HK |
dc.identifier.scopusauthorid | Wang, CL=7501646188 | en_HK |
dc.identifier.scopusauthorid | Lau, FCM=7102749723 | en_HK |
dc.customcontrol.immutable | sml 160106 - amend | - |