
Conference Paper: Optimization of cloud task processing with checkpoint-restart mechanism

Title: Optimization of cloud task processing with checkpoint-restart mechanism
Authors: Di, S; Robert, Y; Vivien, F; Kondo, D; Wang, CL; Cappello, F
Keywords: Cloud Computing; Checkpoint-Restart Mechanism; Optimal Checkpointing Interval; Google; BLCR
Issue Date: 2013
Publisher: Association for Computing Machinery (ACM).
Citation: The 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC13), Denver, CO., 17-21 November 2013. In Proceedings of SC13, 2013, article no. 64
Abstract: In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young’s formula by 3-10 percent, reducing wallclock lengths by 50-100 seconds per job on average.
Persistent Identifier: http://hdl.handle.net/10722/191545
ISBN: 978-1-4503-2378-9
ISI Accession Number ID: WOS:000345856900065

 

DC Field | Value | Language
dc.contributor.author | Di, S | en_US
dc.contributor.author | Robert, Y | en_US
dc.contributor.author | Vivien, F | en_US
dc.contributor.author | Kondo, D | en_US
dc.contributor.author | Wang, CL | en_US
dc.contributor.author | Cappello, F | en_US
dc.date.accessioned | 2013-10-15T07:10:15Z | -
dc.date.available | 2013-10-15T07:10:15Z | -
dc.date.issued | 2013 | en_US
dc.identifier.citation | The 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC13), Denver, CO., 17-21 November 2013. In Proceedings of SC13, 2013, article no. 64 | en_US
dc.identifier.isbn | 978-1-4503-2378-9 | -
dc.identifier.uri | http://hdl.handle.net/10722/191545 | -
dc.description.abstract | In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young’s formula by 3-10 percent, reducing wallclock lengths by 50-100 seconds per job on average. | -
dc.language | eng | en_US
dc.publisher | Association for Computing Machinery (ACM). | -
dc.relation.ispartof | Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis | en_US
dc.subject | Cloud Computing | -
dc.subject | Checkpoint-Restart Mechanism | -
dc.subject | Optimal Checkpointing Interval | -
dc.subject | Google | -
dc.subject | BLCR | -
dc.title | Optimization of cloud task processing with checkpoint-restart mechanism | en_US
dc.type | Conference_Paper | en_US
dc.identifier.email | Wang, CL: clwang@cs.hku.hk | en_US
dc.identifier.authority | Wang, CL=rp00183 | en_US
dc.description.nature | link_to_OA_fulltext | -
dc.identifier.doi | 10.1145/2503210.2503217 | -
dc.identifier.scopus | eid_2-s2.0-84899679452 | -
dc.identifier.hkuros | 225318 | en_US
dc.identifier.isi | WOS:000345856900065 | -
dc.publisher.place | United States | -
