Conference Paper: Deep Learning-based Job Placement in Distributed Machine Learning Clusters

Title: Deep Learning-based Job Placement in Distributed Machine Learning Clusters
Authors: Bao, Y; Peng, Y; Wu, C
Issue Date: 2019
Publisher: IEEE Computer Society. The proceedings' web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359
Citation: Proceedings of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, Paris, France, 29 April-2 May 2019, p. 505-513
Abstract: Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can cause significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, leading to suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior approaches rely on detailed workload profiling and interference modeling, which does not generalize. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs so as to minimize interference and maximize performance (i.e., minimize training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action-space exploration, and experience replay. Because reward samples corresponding to different placement decisions are commonly scarce, we build an auxiliary reward prediction model, trained on historical samples and used to produce rewards for unseen placements. Experiments with real ML workloads on a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in average job completion time.
Persistent Identifier: http://hdl.handle.net/10722/273017
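
The abstract above describes an auxiliary reward prediction model, trained on historical samples, that scores placements for which no observed reward exists, combined with exploration over placement actions. The following is a minimal Python sketch of that idea only, not the authors' implementation: it substitutes a linear least-squares predictor and an epsilon-greedy policy for the paper's neural reward model and actor-critic DRL agent, and every name (RewardPredictor, placement_features, choose_server), the feature encoding, and the data are hypothetical.

import numpy as np

class RewardPredictor:
    """Linear stand-in for the paper's reward prediction model: maps
    (cluster load, job demand, chosen server) features to a reward
    such as normalized training speed."""

    def __init__(self, n_features: int):
        self.w = np.zeros(n_features)

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # Least-squares fit on historical (features, reward) samples.
        self.w, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(self, x: np.ndarray) -> float:
        return float(x @ self.w)

def placement_features(load: np.ndarray, demand: np.ndarray, server: int) -> np.ndarray:
    """Encode per-server load, the job's resource demand, and a one-hot
    of the candidate server as one feature vector (an assumed encoding)."""
    onehot = np.eye(len(load))[server]
    return np.concatenate([load, demand, onehot])

def choose_server(predictor: RewardPredictor, load: np.ndarray,
                  demand: np.ndarray, epsilon: float = 0.1) -> int:
    """Epsilon-greedy placement: mostly pick the server with the highest
    predicted reward; occasionally explore a random server."""
    n = len(load)
    if np.random.rand() < epsilon:
        return int(np.random.randint(n))
    scores = [predictor.predict(placement_features(load, demand, s))
              for s in range(n)]
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_servers, n_demand = 6, 2            # 6 GPU servers, as in the paper's testbed
    n_feat = n_servers + n_demand + n_servers
    # Fabricated samples purely to exercise the code path.
    X, y = rng.random((200, n_feat)), rng.random(200)
    predictor = RewardPredictor(n_feat)
    predictor.fit(X, y)
    print(choose_server(predictor, rng.random(n_servers), rng.random(n_demand)))

In Harmony itself the predictor and the placement policy are deep networks trained jointly within the DRL loop; this sketch only shows where predicted rewards stand in for missing observed ones during action selection.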

 

DC Field | Value | Language
dc.contributor.author | Bao, Y | -
dc.contributor.author | Peng, Y | -
dc.contributor.author | Wu, C | -
dc.date.accessioned | 2019-08-06T09:20:58Z | -
dc.date.available | 2019-08-06T09:20:58Z | -
dc.date.issued | 2019 | -
dc.identifier.citation | Proceedings of IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, Paris, France, 29 April-2 May 2019, p. 505-513 | -
dc.identifier.uri | http://hdl.handle.net/10722/273017 | -
dc.description.abstract | Production machine learning (ML) clusters commonly host a variety of distributed ML workloads, e.g., speech recognition and machine translation. While server sharing among jobs improves resource utilization, interference among co-located ML jobs can cause significant performance degradation. Existing cluster schedulers (e.g., Mesos) are interference-oblivious in their job placement, leading to suboptimal resource efficiency. Interference-aware job placement has been studied in the literature, but prior approaches rely on detailed workload profiling and interference modeling, which does not generalize. This paper presents Harmony, a deep learning-driven ML cluster scheduler that places training jobs so as to minimize interference and maximize performance (i.e., minimize training completion time). Harmony is based on a carefully designed deep reinforcement learning (DRL) framework augmented with reward modeling. The DRL employs state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action-space exploration, and experience replay. Because reward samples corresponding to different placement decisions are commonly scarce, we build an auxiliary reward prediction model, trained on historical samples and used to produce rewards for unseen placements. Experiments with real ML workloads on a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 25% in average job completion time. | -
dc.language | eng | -
dc.publisher | IEEE Computer Society. The proceedings' web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359 | -
dc.relation.ispartof | IEEE INFOCOM - IEEE Conference on Computer Communications | -
dc.rights | IEEE INFOCOM - IEEE Conference on Computer Communications. Copyright © IEEE Computer Society. | -
dc.rights | ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | -
dc.title | Deep Learning-based Job Placement in Distributed Machine Learning Clusters | -
dc.type | Conference_Paper | -
dc.identifier.email | Wu, C: cwu@cs.hku.hk | -
dc.identifier.authority | Wu, C=rp01397 | -
dc.identifier.doi | 10.1109/INFOCOM.2019.8737460 | -
dc.identifier.scopus | eid_2-s2.0-85068226756 | -
dc.identifier.hkuros | 299709 | -
dc.identifier.spage | 505 | -
dc.identifier.epage | 513 | -
dc.publisher.place | United States | -
