Appears in Collections: postgraduate thesis: Optimal scheduling of distributed DNN training in machine learning clusters
Title | Optimal scheduling of distributed DNN training in machine learning clusters |
---|---|
Authors | Bao, Yixin (包逸歆) |
Advisors | Wu, C |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Bao, Y. [包逸歆]. (2020). Optimal scheduling of distributed DNN training in machine learning clusters. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Large-scale distributed machine learning systems have been deployed in many leading IT firms to support analytics and intelligence services such as computer vision, natural language processing, and speech recognition. Distributed machine learning jobs have become a common workload in data centers, where deep neural networks (DNNs) are trained repeatedly. Training DNNs is computation-intensive and time-consuming, mainly because of ever-growing model and training data sizes. The problem becomes more complicated when multiple jobs run in a shared cluster: jobs compete for computation and bandwidth resources, co-located jobs interfere with one another, and jobs arrive online, so scheduling decisions must be made without knowledge of future arrivals. This thesis proposes a series of job scheduling algorithms that maximize resource utilization and thereby expedite training progress. Specifically, we propose three online algorithms, OASiS, Harmony, and PACE, to tackle the problem systematically.
In OASiS, we design an online scheduling algorithm for distributed training jobs that use the parameter server architecture, with the goal of maximizing the overall utility of all jobs as a function of their completion times. The algorithm builds on a primal-dual framework with efficient dual subroutines, achieving a good long-term performance guarantee in polynomial time. During training, each job's resources are dynamically scaled in or out according to availability in the machine learning cluster, improving cluster resource utilization. Simulations and experiments show a 50% improvement over heuristic solutions.
In Harmony, we propose a deep reinforcement learning (DRL) based approach for placing the tasks of distributed deep learning jobs with the parameter server architecture, so as to avoid interference among co-located jobs, in both computation and communication, in a shared GPU cluster. Our goal is to maximize the overall training speed of the cluster. To generate a large number of training samples with accurate rewards, we build an auxiliary neural network that simulates interaction with the environment (the cluster); the DRL model is trained offline in batches and produces placement decisions through online inference. Evaluation shows that Harmony outperforms representative schedulers by 25% in terms of average job completion time.
In PACE, we design a communication scheduler that preemptively schedules gradient synchronization among the workers of a distributed all-reduce DNN training job. Our goal is to minimize per-iteration training time by maximally overlapping network communication with backward and forward computation. We build a mathematical model and derive the optimal schedule by transforming the problem into a convex optimization problem with unimodular constraints. To reduce network overhead, we also fuse multiple all-reduce operations into larger ones to improve bandwidth utilization. We implement PACE in MXNet with Horovod as the communication library. Experiments show that PACE accelerates training of different DNN models by up to 36% compared with state-of-the-art communication scheduling policies. |
Degree | Doctor of Philosophy |
Subject | Machine learning; Electronic data processing - Distributed processing |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/283135 |
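The OASiS description above follows a common online primal-dual pattern: when a job arrives, a dual subroutine prices the resources it would occupy, the job is admitted only if its utility outweighs that price, and prices rise as the cluster fills. The sketch below illustrates that pattern for a single resource type over discrete time slots; the exponential pricing rule, the job shapes, and the utility functions are illustrative assumptions, not the thesis's actual formulation.

```python
"""Toy online primal-dual job admission, loosely in the spirit of OASiS.

Assumptions (not from the thesis): one resource type (workers), jobs need a
fixed number of workers for consecutive time slots, and the marginal resource
price grows exponentially with utilization.
"""

CAPACITY = 8                      # workers available in every time slot
HORIZON = 12                      # number of scheduling slots
PRICE_LO, PRICE_HI = 0.1, 10.0    # range of the exponential pricing rule

used = [0] * HORIZON              # workers already committed in each slot


def price(u):
    """Dual price of one worker in a slot that already has u workers busy."""
    return PRICE_LO * (PRICE_HI / PRICE_LO) ** (u / CAPACITY)


def dual_subroutine(workers, duration, utility):
    """Pick the start slot maximizing utility minus resource cost (payoff)."""
    best = None
    for start in range(HORIZON - duration + 1):
        slots = range(start, start + duration)
        if any(used[t] + workers > CAPACITY for t in slots):
            continue                                  # infeasible placement
        cost = sum(workers * price(used[t]) for t in slots)
        payoff = utility(start + duration) - cost
        if best is None or payoff > best[0]:
            best = (payoff, start)
    return best


def on_job_arrival(workers, duration, utility):
    """Admit the job only if its best payoff is positive; update usage."""
    best = dual_subroutine(workers, duration, utility)
    if best is None or best[0] <= 0:
        return None          # reject: utility does not cover current prices
    _, start = best
    for t in range(start, start + duration):
        used[t] += workers
    return start


if __name__ == "__main__":
    # Utilities decay with completion time, mirroring completion-time-aware jobs.
    jobs = [(4, 3, lambda c: 20 - c),
            (6, 4, lambda c: 15 - 0.5 * c),
            (5, 5, lambda c: 40 - 2 * c)]
    for i, (w, d, u) in enumerate(jobs):
        print(f"job {i}: start slot = {on_job_arrival(w, d, u)}")
```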
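Harmony's workflow, as summarized above, is to learn a placement policy offline against an auxiliary model that predicts the reward (training speed) of a candidate placement, then place jobs by online inference. The toy below mimics that loop with a two-job, two-machine placement space, a per-placement average of noisy traces standing in for the auxiliary network, and a tabular softmax policy trained with REINFORCE; all of these simplifications are assumptions for illustration only.

```python
"""Toy offline policy training against a learned reward model, loosely in the
spirit of Harmony's "train offline on an auxiliary model, decide by online
inference" workflow.

Assumptions (not from the thesis): 2 jobs, 2 machines, a placement is one
discrete choice per job, and the auxiliary model is a per-placement average of
noisy historical speed traces instead of a neural network.
"""
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = [(a, b) for a in range(2) for b in range(2)]   # (machine of job 0, machine of job 1)


def true_speed(action):
    """Hidden environment: co-located jobs interfere and train more slowly."""
    return 2.0 if action[0] != action[1] else 1.2


# Auxiliary reward model: fitted from noisy historical traces and queried
# instead of the real cluster while training the policy offline.
traces = {a: [true_speed(a) + rng.normal(0, 0.05) for _ in range(20)] for a in ACTIONS}
predict_speed = {a: float(np.mean(v)) for a, v in traces.items()}

# Offline policy training: REINFORCE on a softmax policy over placements.
theta = np.zeros(len(ACTIONS))                    # policy logits
for _ in range(2000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    idx = rng.choice(len(ACTIONS), p=probs)
    reward = predict_speed[ACTIONS[idx]]          # query the aux model, not the cluster
    baseline = sum(p * predict_speed[a] for p, a in zip(probs, ACTIONS))
    grad = -probs
    grad[idx] += 1.0                              # gradient of log softmax probability
    theta += 0.05 * (reward - baseline) * grad

# Online inference: the trained policy places new jobs with a single lookup.
best = ACTIONS[int(np.argmax(theta))]
print("chosen placement:", best, "predicted speed:", round(predict_speed[best], 2))
```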
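The PACE paragraph rests on two ideas: start each gradient all-reduce as soon as its tensor is ready so communication overlaps with the remaining backward computation, and fuse small tensors so per-operation startup latency is amortized. The simulation below shows the effect of both under a simple linear cost model (startup latency plus size over bandwidth) and a greedy fusion rule; the layer numbers, cost model, and greedy rule are assumptions for illustration, not PACE's optimization-based schedule.

```python
"""Toy simulation of overlapping gradient all-reduce with backward compute and
fusing small tensors, in the spirit of what the PACE paragraph describes.

Assumptions (not from the thesis): a linear all-reduce cost model
(startup latency + size / bandwidth), one outstanding all-reduce at a time,
and a fixed fusion threshold applied greedily.
"""
ALPHA = 2.0      # per-all-reduce startup latency (ms)
BETA = 0.05      # transfer time per MB (ms/MB)

# (backward compute time in ms, gradient size in MB), listed from the LAST
# layer to the FIRST, i.e. in the order gradients become ready.
LAYERS = [(1.5, 5), (1.0, 3), (1.0, 6), (1.5, 4),
          (1.0, 2), (1.0, 6), (1.5, 3), (1.0, 5)]


def iteration_time(fusion_threshold_mb):
    """Finish time of the last all-reduce, with greedy fusion up to a threshold."""
    # When each gradient becomes ready (end of its layer's backward compute).
    ready, t = [], 0.0
    for compute, size in LAYERS:
        t += compute
        ready.append((t, size))

    net_free = 0.0                 # time the network becomes idle again
    i = 0
    while i < len(ready):
        # Fuse consecutive tensors until the bucket reaches the threshold.
        bucket_ready, bucket_size = ready[i]
        i += 1
        while i < len(ready) and bucket_size < fusion_threshold_mb:
            bucket_ready = max(bucket_ready, ready[i][0])
            bucket_size += ready[i][1]
            i += 1
        start = max(net_free, bucket_ready)        # overlap with ongoing compute
        net_free = start + ALPHA + BETA * bucket_size
    return net_free


if __name__ == "__main__":
    print("no fusion       :", round(iteration_time(0), 2), "ms")
    print("fuse up to 12 MB:", round(iteration_time(12), 2), "ms")
```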
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wu, C | - |
dc.contributor.author | Bao, Yixin | - |
dc.contributor.author | 包逸歆 | - |
dc.date.accessioned | 2020-06-10T01:02:17Z | - |
dc.date.available | 2020-06-10T01:02:17Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Bao, Y. [包逸歆]. (2020). Optimal scheduling of distributed DNN training in machine learning clusters. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/283135 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Machine learning | - |
dc.subject.lcsh | Electronic data processing - Distributed processing | - |
dc.title | Optimal scheduling of distributed DNN training in machine learning clusters | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2020 | - |
dc.identifier.mmsid | 991044242097803414 | - |