
Postgraduate thesis: Optimization of DNN training in distributed environments

Title: Optimization of DNN training in distributed environments
Authors: Yi, Xiaodong (易晓东)
Advisor(s): Wu, C
Issue Date: 2021
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yi, X. [易晓东]. (2021). Optimization of DNN training in distributed environments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Increasingly large DNN models are driving demand for large-scale distributed machine learning systems, which have been deployed in many leading IT firms to support various analytics and intelligence services; distributed machine learning jobs have become a common workload in data centers. However, training DNN models is computation-intensive and time-consuming, mainly because of growing model sizes and training-data sizes. Training efficiently therefore requires careful design of device placement, types of parallelism, communication methods, and scheduling strategies. To tackle this problem systematically, we propose three systems: FastT, HeteroG, and Disco.

In FastT, we propose a transparent module that works with the TensorFlow framework to automatically identify a satisfactory deployment and execution order of operations in DNN models over multiple GPUs, for expedited model training. We propose white-box algorithms that compute these strategies in a short time and with little computing resource consumption. FastT decides execution order and achieves fine-grained parallelism by splitting some operations on the critical path to further improve processing speed. It achieves up to 59.4% speedup over pure data parallelism, thanks to its larger solution space.

We then study the acceleration of distributed training in heterogeneous environments. In HeteroG, we propose an automatic module that accelerates deep neural network training in heterogeneous GPU clusters. HeteroG enables highly efficient distributed training over heterogeneous devices by automatically converting a single-GPU training model to a distributed one according to the deep learning graph and the available resources. It embraces operation-level hybrid parallelism, communication architecture selection, and execution scheduling, based on a carefully designed strategy framework that exploits both GNN-based learning and combinatorial optimization. We compare HeteroG with existing parallelism schemes and show that it achieves up to 222% training speed-up. HeteroG also enables efficient training of large models over sets of heterogeneous devices where simple parallelism is infeasible.

Finally, we investigate efficient distributed training with deep learning compilers. In Disco, we propose a module that generates optimized, joint computation-operator and communication-tensor fusion strategies to enable highly efficient distributed training. A graph neural network-based simulator is built to estimate the per-iteration training time achieved by operator/tensor fusion candidates. A backtracking search algorithm, driven by the simulator, navigates the large strategy space efficiently to identify good operator/tensor fusion strategies that minimize distributed training time. We compare Disco with existing DL fusion schemes and show that it achieves training speed-ups close to the fully overlapping case.

(A minimal illustrative sketch of this simulator-driven search follows the summary fields below.)
Degree: Doctor of Philosophy
Subject: Neural networks (Computer science)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/306963
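Illustrative sketch of the simulator-driven search referenced in the abstract. Disco explores a large space of operator/tensor fusion strategies, with a GNN-based simulator estimating the per-iteration time of each candidate and a backtracking search pruning unpromising branches. The minimal Python sketch below reproduces only that search pattern under stated assumptions: a toy analytical cost model stands in for the thesis's GNN simulator, and the tensor list, constants, and function names are hypothetical, not taken from the thesis.

# Illustrative sketch only (not the thesis implementation): a simulator-driven
# backtracking search over communication tensor-fusion strategies, mirroring
# the search pattern described for Disco. The analytical cost model below is a
# stand-in for the GNN-based simulator; all names and constants are assumed.
from typing import List, Tuple

Bucket = List[Tuple[str, float]]

# Hypothetical gradient tensors (name, size in MB), in back-propagation order.
GRAD_TENSORS: Bucket = [("fc2", 16.0), ("fc1", 64.0), ("conv2", 8.0), ("conv1", 2.0)]

PER_MESSAGE_LATENCY_MS = 1.5   # assumed fixed launch overhead per fused all-reduce
BANDWIDTH_MB_PER_MS = 10.0     # assumed effective network bandwidth


def simulate_comm_time(buckets: List[Bucket]) -> float:
    """Toy stand-in for the GNN simulator: estimated communication time of a
    candidate fusion strategy (a partition of tensors into fused buckets)."""
    return sum(PER_MESSAGE_LATENCY_MS + sum(mb for _, mb in b) / BANDWIDTH_MB_PER_MS
               for b in buckets)


def backtrack(idx: int, buckets: List[Bucket], best: list) -> None:
    """Enumerate partitions of consecutive tensors into fusion buckets,
    pruning any branch whose partial cost already exceeds the best found."""
    partial = simulate_comm_time(buckets)
    if partial >= best[1]:
        return                                   # prune: cannot improve
    if idx == len(GRAD_TENSORS):
        best[0], best[1] = [list(b) for b in buckets], partial
        return
    tensor = GRAD_TENSORS[idx]
    if buckets:                                  # option 1: fuse into current bucket
        buckets[-1].append(tensor)
        backtrack(idx + 1, buckets, best)
        buckets[-1].pop()
    buckets.append([tensor])                     # option 2: start a new bucket
    backtrack(idx + 1, buckets, best)
    buckets.pop()


if __name__ == "__main__":
    best = [None, float("inf")]                  # [best strategy, best simulated time]
    backtrack(0, [], best)
    print(f"estimated comm time: {best[1]:.2f} ms")
    for i, bucket in enumerate(best[0]):
        print(f"bucket {i}: {[name for name, _ in bucket]}")

Because this toy model ignores overlap between communication and backward computation, it trivially favors one large bucket; a per-iteration simulator of the kind described in the abstract captures such overlap, which is what makes the choice of fusion boundaries a genuine optimization problem.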

 

DC Field: Value
dc.contributor.advisor: Wu, C
dc.contributor.author: Yi, Xiaodong
dc.contributor.author: 易晓东
dc.date.accessioned: 2021-11-03T04:36:36Z
dc.date.available: 2021-11-03T04:36:36Z
dc.date.issued: 2021
dc.identifier.citation: Yi, X. [易晓东]. (2021). Optimization of DNN training in distributed environments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/306963
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Neural networks (Computer science)
dc.title: Optimization of DNN training in distributed environments
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2021
dc.identifier.mmsid: 991044437603403414
