Conference Paper: Deep Learning-based Job Placement in Distributed Machine Learning Clusters

Title: Deep Learning-based Job Placement in Distributed Machine Learning Clusters
Authors: Bao, Y; Peng, Y; Wu, C
Issue Date: 2019
Publisher: IEEE Computer Society. The proceedings' web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359
Citation: Proceedings of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, Paris, France, 29 April-2 May 2019, p. 505-513
Abstract: Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can cause significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, leading to suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior approaches rely on detailed workload profiling and interference modeling, which does not generalize. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs so as to minimize interference and maximize performance (i.e., minimize training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action-space exploration, and experience replay. Because reward samples corresponding to different placement decisions are commonly scarce, we build an auxiliary reward prediction model, trained on historical samples and used to produce rewards for unseen placements. Experiments with real ML workloads on a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in average job completion time.
Persistent Identifier: http://hdl.handle.net/10722/273017
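
The abstract above describes an auxiliary reward prediction model, trained on historical samples, that scores placements for which no observed reward exists, combined with exploration over placement actions. The following is a minimal Python sketch of that idea only, not the authors' implementation: it substitutes a linear least-squares predictor and an epsilon-greedy policy for the paper's neural reward model and actor-critic DRL agent, and every name (RewardPredictor, placement_features, choose_server), the feature encoding, and the data are hypothetical.

import numpy as np

class RewardPredictor:
    """Linear stand-in for the paper's reward prediction model: maps
    (cluster load, job demand, chosen server) features to a reward
    such as normalized training speed."""

    def __init__(self, n_features: int):
        self.w = np.zeros(n_features)

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # Least-squares fit on historical (features, reward) samples.
        self.w, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(self, x: np.ndarray) -> float:
        return float(x @ self.w)

def placement_features(load: np.ndarray, demand: np.ndarray, server: int) -> np.ndarray:
    """Encode per-server load, the job's resource demand, and a one-hot
    of the candidate server as one feature vector (an assumed encoding)."""
    onehot = np.eye(len(load))[server]
    return np.concatenate([load, demand, onehot])

def choose_server(predictor: RewardPredictor, load: np.ndarray,
                  demand: np.ndarray, epsilon: float = 0.1) -> int:
    """Epsilon-greedy placement: mostly pick the server with the highest
    predicted reward; occasionally explore a random server."""
    n = len(load)
    if np.random.rand() < epsilon:
        return int(np.random.randint(n))
    scores = [predictor.predict(placement_features(load, demand, s))
              for s in range(n)]
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_servers, n_demand = 6, 2            # 6 GPU servers, as in the paper's testbed
    n_feat = n_servers + n_demand + n_servers
    # Fabricated samples purely to exercise the code path.
    X, y = rng.random((200, n_feat)), rng.random(200)
    predictor = RewardPredictor(n_feat)
    predictor.fit(X, y)
    print(choose_server(predictor, rng.random(n_servers), rng.random(n_demand)))

In Harmony itself the predictor and the placement policy are deep networks trained jointly within the DRL loop; this sketch only shows where predicted rewards stand in for missing observed ones during action selection.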

 

DC Field | Value | Language
dc.contributor.author | Bao, Y | -
dc.contributor.author | Peng, Y | -
dc.contributor.author | Wu, C | -
dc.date.accessioned | 2019-08-06T09:20:58Z | -
dc.date.available | 2019-08-06T09:20:58Z | -
dc.date.issued | 2019 | -
dc.identifier.citation | Proceedings of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, Paris, France, 29 April-2 May 2019, p. 505-513 | -
dc.identifier.uri | http://hdl.handle.net/10722/273017 | -
dc.description.abstract | Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can cause significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, leading to suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior approaches rely on detailed workload profiling and interference modeling, which does not generalize. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs so as to minimize interference and maximize performance (i.e., minimize training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action-space exploration, and experience replay. Because reward samples corresponding to different placement decisions are commonly scarce, we build an auxiliary reward prediction model, trained on historical samples and used to produce rewards for unseen placements. Experiments with real ML workloads on a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in average job completion time. | -
dc.language | eng | -
dc.publisher | IEEE Computer Society. The proceedings' web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359 | -
dc.relation.ispartof | IEEE INFOCOM - IEEE Conference on Computer Communications | -
dc.rights | IEEE INFOCOM - IEEE Conference on Computer Communications. Copyright © IEEE Computer Society. | -
dc.rights | ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | -
dc.title | Deep Learning-based Job Placement in Distributed Machine Learning Clusters | -
dc.type | Conference_Paper | -
dc.identifier.email | Wu, C: cwu@cs.hku.hk | -
dc.identifier.authority | Wu, C=rp01397 | -
dc.identifier.doi | 10.1109/INFOCOM.2019.8737460 | -
dc.identifier.scopus | eid_2-s2.0-85068226756 | -
dc.identifier.hkuros | 299709 | -
dc.identifier.spage | 505 | -
dc.identifier.epage | 513 | -
dc.publisher.place | United States | -
