Conference Paper: Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
Field | Value
---|---
Title | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
Authors | Peng, Y; Bao, Y; Chen, Y; Wu, C; Guo, C
Keywords | Resource management; deep learning
Issue Date | 2018
Publisher | Association for Computing Machinery
Citation | Proceedings of the Thirteenth European Conference on Computer Systems (EuroSys 2018), Porto, Portugal, 23-26 April 2018, article 3
Abstract | Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs: they typically assign a fixed amount of resources to each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of allocated resources in each job. Based on the models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
Persistent Identifier | http://hdl.handle.net/10722/259647
ISBN | 978-1-4503-5584-1
ISI Accession Number ID | WOS:000460467600003
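The abstract describes Optimus's core idea: fit a convergence curve online to predict how many epochs a job still needs, model training speed as a concave function of allocated resources, and greedily hand out resources where they shrink estimated completion time the most. Below is a minimal, hypothetical sketch of that marginal-gain allocation loop; the curve form `1/(a*k + b) + c`, the throughput model, and all names are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of Optimus-style greedy resource allocation.
# Loss curve l(k) = 1/(a*k + b) + c (online-fitted, assumed form) predicts
# remaining epochs; speed(w) models throughput with diminishing returns.

def remaining_epochs(a, b, c, cur_epoch, target_loss):
    """Epochs still needed for loss 1/(a*k + b) + c to reach target_loss."""
    k_target = ((1.0 / (target_loss - c)) - b) / a
    return max(0.0, k_target - cur_epoch)

def speed(w, alpha, beta):
    """Training throughput as a function of workers; concave in w."""
    return w / (alpha + beta * w)

def completion_time(job, w):
    """Estimated time to finish the job if it runs with w workers."""
    epochs = remaining_epochs(*job["curve"], job["epoch"], job["target"])
    return epochs / speed(w, *job["speed"])

def greedy_allocate(jobs, capacity):
    """Start every job at 1 worker, then repeatedly grant one more worker
    to the job whose estimated completion time drops the most."""
    alloc = {j["name"]: 1 for j in jobs}
    for _ in range(capacity - len(jobs)):
        best = max(jobs, key=lambda j: completion_time(j, alloc[j["name"]])
                                       - completion_time(j, alloc[j["name"]] + 1))
        alloc[best["name"]] += 1
    return alloc

jobs = [
    {"name": "A", "curve": (0.5, 1.0, 0.1), "epoch": 2, "target": 0.2, "speed": (1.0, 0.1)},
    {"name": "B", "curve": (0.1, 1.0, 0.1), "epoch": 2, "target": 0.2, "speed": (1.0, 0.1)},
]
alloc = greedy_allocate(jobs, 6)  # job B converges more slowly, so it gets more workers
```

The greedy step mirrors the abstract's "simple yet effective method": because throughput has diminishing returns in workers, the marginal-gain choice naturally shifts resources toward jobs that are far from convergence.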
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Peng, Y | - |
dc.contributor.author | Bao, Y | - |
dc.contributor.author | Chen, Y | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Guo, C | - |
dc.date.accessioned | 2018-09-03T04:11:26Z | - |
dc.date.available | 2018-09-03T04:11:26Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Proceedings of the Thirteenth European Conference on Computer Systems (EuroSys 2018), Porto, Portugal, 23-26 April 2018, article 3 | - |
dc.identifier.isbn | 978-1-4503-5584-1 | - |
dc.identifier.uri | http://hdl.handle.net/10722/259647 | - |
dc.description.abstract | Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs: they typically assign a fixed amount of resources to each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of allocated resources in each job. Based on the models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively. | - |
dc.language | eng | - |
dc.publisher | Association for Computing Machinery. | - |
dc.relation.ispartof | EuroSys (European Conference on Computer Systems), 2018 | - |
dc.rights | EuroSys (European Conference on Computer Systems), 2018. Copyright © Association for Computing Machinery. | - |
dc.subject | Resource management; deep learning | - |
dc.title | Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.identifier.doi | 10.1145/3190508.3190517 | - |
dc.identifier.scopus | eid_2-s2.0-85052020091 | - |
dc.identifier.hkuros | 288752 | - |
dc.identifier.spage | article 3 | - |
dc.identifier.epage | article 3 | - |
dc.identifier.isi | WOS:000460467600003 | - |
dc.publisher.place | New York, NY | - |