File Download
There are no files associated with this item.
Links for fulltext (may require subscription):
- Publisher Website: 10.1109/INFOCOM.2019.8737460
- Scopus: eid_2-s2.0-85068226756
- WOS: WOS:000480426400057
Conference Paper: Deep Learning-based Job Placement in Distributed Machine Learning Clusters
Title | Deep Learning-based Job Placement in Distributed Machine Learning Clusters |
---|---|
Authors | Bao, Y; Peng, Y; Wu, C |
Issue Date | 2019 |
Publisher | IEEE Computer Society. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359 |
Citation | Proceedings of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, Paris, France, 29 April-2 May 2019, p. 505-513 |
Abstract | Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can lead to significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, causing suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior approaches rely on detailed workload profiling and interference modeling, which is not a general solution. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs in a manner that minimizes interference and maximizes performance (i.e., minimizes training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL framework employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action space exploration, and experience replay. In view of the common lack of reward samples corresponding to different placement decisions, we build an auxiliary reward prediction model, trained on historical samples and used to produce rewards for unseen placements. Experiments with real ML workloads in a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in terms of average job completion time. |
Persistent Identifier | http://hdl.handle.net/10722/273017 |
ISI Accession Number ID | WOS:000480426400057 |
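The abstract describes an actor-critic DRL loop for placement decisions. As a rough illustration of that idea only (not Harmony's actual architecture, which the paper details), the following hypothetical PyTorch sketch pairs a policy head that scores candidate servers with a value head used to compute advantages; the network shape, state encoding, and all names here are assumptions for illustration.

```python
# Hypothetical sketch of an actor-critic placement policy, in the spirit
# of the abstract above. All names and dimensions are assumptions; none
# are taken from the Harmony paper.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PlacementActorCritic(nn.Module):
    """Maps an encoded cluster/job state to (a) logits over candidate
    servers for the next worker and (b) a state-value estimate."""
    def __init__(self, state_dim: int, num_servers: int, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, num_servers)  # placement logits
        self.critic = nn.Linear(hidden, 1)           # value head

    def forward(self, state):
        h = self.shared(state)
        return self.actor(h), self.critic(h)

def placement_step(net, optimizer, state, next_state, reward, gamma=0.99):
    """One advantage actor-critic update for a single placement decision."""
    logits, value = net(state)
    dist = Categorical(logits=logits)
    action = dist.sample()  # index of the server chosen for this worker
    with torch.no_grad():
        _, next_value = net(next_state)
    advantage = reward + gamma * next_value - value  # TD error
    # Policy-gradient loss for the actor, squared TD error for the critic.
    loss = -dist.log_prob(action) * advantage.detach() + advantage.pow(2)
    optimizer.zero_grad()
    loss.sum().backward()
    optimizer.step()
    return action.item()
```

A driver loop would encode the cluster state on each job arrival, call placement_step once per worker to place, and feed back the observed (or, as the abstract notes, predicted) reward.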
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Bao, Y | - |
dc.contributor.author | Peng, Y | - |
dc.contributor.author | Wu, C | - |
dc.date.accessioned | 2019-08-06T09:20:58Z | - |
dc.date.available | 2019-08-06T09:20:58Z | - |
dc.date.issued | 2019 | - |
dc.identifier.citation | Proceedings of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, Paris, France, 29 April-2 May 2019, p. 505-513 | - |
dc.identifier.uri | http://hdl.handle.net/10722/273017 | - |
dc.description.abstract | Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can lead to significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, causing suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior approaches rely on detailed workload profiling and interference modeling, which is not a general solution. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs in a manner that minimizes interference and maximizes performance (i.e., minimizes training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL framework employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action space exploration, and experience replay. In view of the common lack of reward samples corresponding to different placement decisions, we build an auxiliary reward prediction model, trained on historical samples and used to produce rewards for unseen placements. Experiments with real ML workloads in a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in terms of average job completion time. | -
dc.language | eng | - |
dc.publisher | IEEE Computer Society. The proceedings' web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359 | -
dc.relation.ispartof | IEEE INFOCOM - IEEE Conference on Computer Communications | - |
dc.rights | IEEE INFOCOM - IEEE Conference on Computer Communications. Copyright © IEEE Computer Society. | - |
dc.rights | ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | - |
dc.title | Deep Learning-based Job Placement in Distributed Machine Learning Clusters | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.identifier.doi | 10.1109/INFOCOM.2019.8737460 | - |
dc.identifier.scopus | eid_2-s2.0-85068226756 | - |
dc.identifier.hkuros | 299709 | - |
dc.identifier.spage | 505 | - |
dc.identifier.epage | 513 | - |
dc.identifier.isi | WOS:000480426400057 | - |
dc.publisher.place | United States | - |
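The abstract also mentions an auxiliary reward prediction model trained on historical samples. A minimal sketch of that supervised component, assuming a flat feature encoding of (job, placement) pairs and a scalar reward such as normalized training speed, could look like the following; the encoding, dimensions, and names are all hypothetical.

```python
# Hypothetical sketch of the auxiliary reward-prediction idea from the
# abstract: a regressor fit on historical (placement features, observed
# reward) pairs, then queried for placements never tried online.
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    def __init__(self, feature_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, features):
        return self.net(features).squeeze(-1)  # predicted reward per sample

def fit_reward_model(model, features, rewards, epochs=200, lr=1e-3):
    """Supervised regression on historical traces.
    features: [N, feature_dim] encodings of (job, placement) pairs;
    rewards:  [N] observed rewards (e.g., normalized training speed)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(features), rewards).backward()
        opt.step()
    return model
```

During DRL training, the model's prediction for an unseen placement would stand in for a missing historical reward sample when updating the policy.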