
Postgraduate thesis: Optimal scheduling of distributed DNN training in machine learning clusters

Title: Optimal scheduling of distributed DNN training in machine learning clusters
Authors: Bao, Yixin (包逸歆)
Advisor(s): Wu, C
Issue Date: 2020
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Bao, Y. [包逸歆]. (2020). Optimal scheduling of distributed DNN training in machine learning clusters. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Large-scale distributed machine learning systems are now deployed at many leading IT firms to support analytics and intelligence services such as computer vision, natural language processing, and speech recognition. Distributed machine learning jobs have become a common workload in data centers, where deep neural networks (DNNs) are trained repeatedly. However, training DNN jobs is computation-intensive and time-consuming, mainly because of growing DNN model sizes and training data sizes. The issues become more complicated when multiple jobs run in a shared cluster: jobs compete for computation and bandwidth resources, co-located jobs interfere with one another, and jobs arrive online, so no future information is available when scheduling decisions are made. This thesis proposes a series of job scheduling algorithms that maximize resource utilization and hence expedite job training progress. Specifically, we propose three online algorithms, OASiS, Harmony, and PACE, to tackle this problem systematically.

In OASiS, we design an online job scheduling algorithm for distributed training jobs using the parameter server architecture to maximize the overall utility of all jobs, contingent on their completion times. Our online algorithm design couples a primal-dual framework with efficient dual subroutines, achieving good long-term performance guarantees with polynomial time complexity. During training, we dynamically scale each job's resources in or out according to availability in the machine learning cluster to optimize cluster resource utilization. Simulation and experiment results show a 50% improvement over heuristic solutions.

In Harmony, we propose a deep reinforcement learning based approach to task placement for distributed deep learning jobs with the parameter server architecture, to avoid interference among jobs in both computation and communication in a shared GPU cluster. Our goal is to maximize the overall training speed of the cluster. To generate a large number of training samples with accurate rewards, we propose an auxiliary neural network that simulates interaction with the environment, i.e., the cluster. We train offline in batch and produce decisions by online inference. The evaluation shows that Harmony outperforms representative schedulers by 25% in terms of average job completion time.

In PACE, we design a communication scheduler that preemptively schedules gradient synchronization among the workers of a distributed all-reduce DNN training job. Our goal is to minimize the training time of one iteration by maximally overlapping network communication with backward and forward computation. We build a mathematical model and derive the optimal schedule by transforming the problem into a convex optimization problem with a unimodular constraint. To mitigate network overhead, we also fuse multiple all-reduce operators into a larger one to improve bandwidth usage. We implement PACE in MXNet with Horovod as the communication library. Our experiments show that PACE accelerates training of different DNN models by up to 36% compared with state-of-the-art communication scheduling policies.
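To give a concrete feel for the communication-scheduling idea described at the end of the abstract (overlapping gradient all-reduce with backward computation and fusing small tensors), here is a minimal, self-contained Python sketch. The layer timings, bandwidth, per-call overhead, and greedy fusion threshold are invented assumptions for illustration only; the thesis derives its actual schedule by solving a convex optimization problem, which this toy greedy simulation does not attempt.

```python
# Toy model of one training iteration: layers produce gradients back to front;
# each gradient must be all-reduced over the network before the next iteration.
# All numbers below are invented for illustration.

# Per-layer (backward compute time in ms, gradient size in MB), back-to-front.
layers = [(4.0, 20.0), (3.0, 5.0), (2.0, 1.0), (1.0, 0.5)]
BANDWIDTH_MB_PER_MS = 1.0      # assumed network bandwidth
PER_CALL_OVERHEAD_MS = 2.0     # assumed fixed cost per all-reduce call
FUSION_THRESHOLD_MB = 4.0      # greedily fuse gradients until this size

def comm_time(size_mb):
    """Time to all-reduce one (possibly fused) tensor."""
    return PER_CALL_OVERHEAD_MS + size_mb / BANDWIDTH_MB_PER_MS

def sequential_iteration():
    """No overlap: run the whole backward pass, then send every gradient."""
    backward = sum(t for t, _ in layers)
    comm = sum(comm_time(s) for _, s in layers)
    return backward + comm

def overlapped_iteration():
    """Start communication as soon as gradients are ready, fusing small ones."""
    now = 0.0          # simulation clock (compute side)
    link_free = 0.0    # time at which the network link becomes idle
    buffer_mb = 0.0    # fused-but-not-yet-sent gradient bytes
    for bwd_ms, size_mb in layers:
        now += bwd_ms                      # this layer's gradient is ready
        buffer_mb += size_mb
        if buffer_mb >= FUSION_THRESHOLD_MB:
            start = max(now, link_free)    # wait if the link is still busy
            link_free = start + comm_time(buffer_mb)
            buffer_mb = 0.0
    if buffer_mb > 0.0:                    # flush the final fused tensor
        link_free = max(now, link_free) + comm_time(buffer_mb)
    return max(now, link_free)             # done when compute and comm finish

print("sequential iteration:", sequential_iteration(), "ms")
print("overlapped iteration:", overlapped_iteration(), "ms")
```

With these assumed numbers, the overlapped-and-fused schedule finishes the iteration earlier than running all communication after the backward pass, which is the effect the abstract describes PACE optimizing for.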
Degree: Doctor of Philosophy
Subject: Machine learning
Subject: Electronic data processing - Distributed processing
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/283135
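The abstract above also credits OASiS with a primal-dual online framework. For intuition only, the snippet below is a generic online primal-dual admission skeleton in that spirit, not the scheduling algorithm (or its performance guarantee) developed in the thesis; the capacities, utilities, demands, and exponential price-update rule are all assumptions.

```python
# Generic online primal-dual admission skeleton (illustrative only).
# Dual variables act as per-resource "prices" that rise with utilization; a
# job is admitted only if its utility exceeds the priced cost of its demand.

CAPACITY = {"gpu": 8.0, "bandwidth": 40.0}    # assumed cluster capacities
used = {r: 0.0 for r in CAPACITY}             # primal: resources allocated
price = {r: 0.0 for r in CAPACITY}            # dual: per-unit resource price

def admit(utility, demand):
    """Decide online whether to admit a job with the given utility/demand."""
    cost = sum(price[r] * d for r, d in demand.items())
    if utility <= cost:
        return False                           # priced cost too high: reject
    if any(used[r] + d > CAPACITY[r] for r, d in demand.items()):
        return False                           # not enough capacity left
    for r, d in demand.items():
        used[r] += d
        # Exponential price update: price grows quickly as the resource fills.
        price[r] = 2.0 ** (used[r] / CAPACITY[r]) - 1.0
    return True

arrivals = [  # (utility, demand) pairs arriving one by one, unknown in advance
    (5.0, {"gpu": 2.0, "bandwidth": 10.0}),
    (1.0, {"gpu": 4.0, "bandwidth": 5.0}),
    (8.0, {"gpu": 3.0, "bandwidth": 20.0}),
]
for i, (utility, demand) in enumerate(arrivals):
    decision = "admitted" if admit(utility, demand) else "rejected"
    print(f"job {i}: {decision}; prices now {price}")
```

In this style of algorithm, low-utility jobs that arrive once resources have become expensive are rejected, leaving capacity for higher-utility jobs that may arrive later.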

 

DC Field | Value | Language
dc.contributor.advisor | Wu, C | -
dc.contributor.author | Bao, Yixin | -
dc.contributor.author | 包逸歆 | -
dc.date.accessioned | 2020-06-10T01:02:17Z | -
dc.date.available | 2020-06-10T01:02:17Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | Bao, Y. [包逸歆]. (2020). Optimal scheduling of distributed DNN training in machine learning clusters. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/283135 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Machine learning | -
dc.subject.lcsh | Electronic data processing - Distributed processing | -
dc.title | Optimal scheduling of distributed DNN training in machine learning clusters | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2020 | -
dc.identifier.mmsid | 991044242097803414 | -
