Conference Paper: Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
Field | Value
---|---
Title | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
Authors | Peng, Y; Bao, Y; Chen, Y; Wu, C; Guo, C
Keywords | Resource management; deep learning
Issue Date | 2018
Publisher | Association for Computing Machinery
Citation | Proceedings of the Thirteenth European Conference on Computer Systems (EuroSys 2018), Porto, Portugal, 23-26 April 2018, article 3
Abstract | Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs: they typically assign a fixed amount of resources to each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of allocated resources in each job. Based on the models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
Persistent Identifier | http://hdl.handle.net/10722/259647
ISBN | 978-1-4503-5584-1
ISI Accession Number ID | WOS:000460467600003
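The abstract describes Optimus's core idea: fit a convergence curve online to predict how many epochs a job still needs, model training speed as a concave function of allocated resources, and greedily hand out resources where they shrink estimated completion time the most. Below is a minimal, hypothetical sketch of that marginal-gain allocation loop; the curve form `1/(a*k + b) + c`, the throughput model, and all names are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of Optimus-style greedy resource allocation.
# Loss curve l(k) = 1/(a*k + b) + c (online-fitted, assumed form) predicts
# remaining epochs; speed(w) models throughput with diminishing returns.

def remaining_epochs(a, b, c, cur_epoch, target_loss):
    """Epochs still needed for loss 1/(a*k + b) + c to reach target_loss."""
    k_target = ((1.0 / (target_loss - c)) - b) / a
    return max(0.0, k_target - cur_epoch)

def speed(w, alpha, beta):
    """Training throughput as a function of workers; concave in w."""
    return w / (alpha + beta * w)

def completion_time(job, w):
    """Estimated time to finish the job if it runs with w workers."""
    epochs = remaining_epochs(*job["curve"], job["epoch"], job["target"])
    return epochs / speed(w, *job["speed"])

def greedy_allocate(jobs, capacity):
    """Start every job at 1 worker, then repeatedly grant one more worker
    to the job whose estimated completion time drops the most."""
    alloc = {j["name"]: 1 for j in jobs}
    for _ in range(capacity - len(jobs)):
        best = max(jobs, key=lambda j: completion_time(j, alloc[j["name"]])
                                       - completion_time(j, alloc[j["name"]] + 1))
        alloc[best["name"]] += 1
    return alloc

jobs = [
    {"name": "A", "curve": (0.5, 1.0, 0.1), "epoch": 2, "target": 0.2, "speed": (1.0, 0.1)},
    {"name": "B", "curve": (0.1, 1.0, 0.1), "epoch": 2, "target": 0.2, "speed": (1.0, 0.1)},
]
alloc = greedy_allocate(jobs, 6)  # job B converges more slowly, so it gets more workers
```

The greedy step mirrors the abstract's "simple yet effective method": because throughput has diminishing returns in workers, the marginal-gain choice naturally shifts resources toward jobs that are far from convergence.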
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Peng, Y | - |
dc.contributor.author | Bao, Y | - |
dc.contributor.author | Chen, Y | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Guo, C | - |
dc.date.accessioned | 2018-09-03T04:11:26Z | - |
dc.date.available | 2018-09-03T04:11:26Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Proceedings of the Thirteenth European Conference on Computer Systems (EuroSys 2018), Porto, Portugal, 23-26 April 2018, article 3 | - |
dc.identifier.isbn | 978-1-4503-5584-1 | - |
dc.identifier.uri | http://hdl.handle.net/10722/259647 | - |
dc.description.abstract | Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs: they typically assign a fixed amount of resources to each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of allocated resources in each job. Based on the models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively. | - |
dc.language | eng | - |
dc.publisher | Association for Computing Machinery. | - |
dc.relation.ispartof | EuroSys (European Conference on Computer Systems), 2018 | - |
dc.rights | EuroSys (European Conference on Computer Systems), 2018. Copyright © Association for Computing Machinery. | - |
dc.subject | Resource management; deep learning | - |
dc.title | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.identifier.doi | 10.1145/3190508.3190517 | - |
dc.identifier.scopus | eid_2-s2.0-85052020091 | - |
dc.identifier.hkuros | 288752 | - |
dc.identifier.spage | article 3 | - |
dc.identifier.epage | article 3 | - |
dc.identifier.isi | WOS:000460467600003 | - |
dc.publisher.place | New York, NY | - |