Conference Paper: Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters

Title: Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
Authors: Peng, Y; Bao, Y; Chen, Y; Wu, C; Guo, C
Keywords: Resource management; deep learning
Issue Date: 2018
Publisher: Association for Computing Machinery
Citation: Proceedings of the Thirteenth European Conference on Computer Systems (EuroSys 2018), Porto, Portugal, 23-26 April 2018, article 3
Abstract: Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which limits both resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method is designed for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
Persistent Identifier: http://hdl.handle.net/10722/259647
ISBN: 978-1-4503-5584-1
ISI Accession Number ID: WOS:000460467600003
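The online-fitting idea described in the abstract can be sketched in a few lines. The convergence model below, l(k) = 1/(a*k + b) for training step k, is an illustrative assumption (a simplified stand-in, not the paper's exact model); because 1/l is then linear in k, ordinary least squares suffices, and the fitted curve can be extrapolated to estimate remaining steps to a target loss.

```python
import numpy as np

# Assumed convergence model: loss l(k) ~ 1/(a*k + b) at training step k,
# so 1/l(k) = a*k + b is linear and can be fit with least squares.
steps = np.array([10, 20, 40, 80, 160], dtype=float)
loss = 1.0 / (0.05 * steps + 2.0)  # synthetic loss observations

# Fit 1/l against k with a degree-1 polynomial (slope a, intercept b).
a, b = np.polyfit(steps, 1.0 / loss, 1)

# Extrapolate: how many steps until the loss reaches a target value?
target_loss = 0.05
steps_needed = (1.0 / target_loss - b) / a
print(round(a, 3), round(b, 3), round(steps_needed))
```

In Optimus-style scheduling, an estimate like `steps_needed`, combined with a per-job training-speed model, is what lets the scheduler predict each job's remaining time under a candidate resource allocation.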

 

DC Field / Value
dc.contributor.author: Peng, Y
dc.contributor.author: Bao, Y
dc.contributor.author: Chen, Y
dc.contributor.author: Wu, C
dc.contributor.author: Guo, C
dc.date.accessioned: 2018-09-03T04:11:26Z
dc.date.available: 2018-09-03T04:11:26Z
dc.date.issued: 2018
dc.identifier.citation: Proceedings of the Thirteenth European Conference on Computer Systems (EuroSys 2018), Porto, Portugal, 23-26 April 2018, article 3
dc.identifier.isbn: 978-1-4503-5584-1
dc.identifier.uri: http://hdl.handle.net/10722/259647
dc.description.abstract: Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which limits both resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method is designed for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
dc.language: eng
dc.publisher: Association for Computing Machinery
dc.relation.ispartof: EuroSys (European Conference on Computer Systems), 2018
dc.rights: EuroSys (European Conference on Computer Systems), 2018. Copyright © Association for Computing Machinery.
dc.subject: Resource management; deep learning
dc.title: Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
dc.type: Conference_Paper
dc.identifier.email: Wu, C: cwu@cs.hku.hk
dc.identifier.authority: Wu, C=rp01397
dc.identifier.doi: 10.1145/3190508.3190517
dc.identifier.scopus: eid_2-s2.0-85052020091
dc.identifier.hkuros: 288752
dc.identifier.spage: article 3
dc.identifier.epage: article 3
dc.identifier.isi: WOS:000460467600003
dc.publisher.place: New York, NY
