
Postgraduate thesis: Optimization of DNN training in distributed environments

Title: Optimization of DNN training in distributed environments
Authors: Yi, Xiaodong (易晓东)
Advisor(s): Wu, C
Issue Date: 2021
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yi, X. [易晓东]. (2021). Optimization of DNN training in distributed environments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Increasingly large DNN models are driving demand for large-scale distributed machine learning systems, which have been deployed in many leading IT firms to support various analytics and intelligence services; distributed machine learning jobs have become a common workload in data centers. However, training DNN models is computation-intensive and time-consuming, mainly because of growing model sizes and training-data sizes. Training efficiently therefore requires careful design of device placement, types of parallelism, communication methods, and scheduling strategies. To tackle this problem systematically, we propose three systems: FastT, HeteroG, and Disco.

In FastT, we propose a transparent module that works with the TensorFlow framework to automatically identify a satisfactory deployment and execution order of operations in DNN models over multiple GPUs, for expedited model training. We propose white-box algorithms that compute these strategies in a short time and with little computing resource consumption. FastT decides execution order and achieves fine-grained parallelism by splitting some operations on the critical path to further improve processing speed. It achieves up to 59.4% speedup over pure data parallelism, thanks to its larger solution space.

We then study the acceleration of distributed training in heterogeneous environments. In HeteroG, we propose an automatic module that accelerates deep neural network training in heterogeneous GPU clusters. HeteroG enables highly efficient distributed training over heterogeneous devices by automatically converting a single-GPU training model to a distributed one according to the deep learning graph and the available resources. It embraces operation-level hybrid parallelism, communication architecture selection, and execution scheduling, based on a carefully designed strategy framework that exploits both GNN-based learning and combinatorial optimization. We compare HeteroG with existing parallelism schemes and show that it achieves up to 222% training speed-up. HeteroG also enables efficient training of large models over sets of heterogeneous devices where simple parallelism is infeasible.

Finally, we investigate efficient distributed training with deep learning compilers. In Disco, we propose a module that generates optimized, joint computation-operator and communication-tensor fusion strategies to enable highly efficient distributed training. A graph neural network-based simulator is built to estimate the per-iteration training time achieved by operator/tensor fusion candidates. A backtracking search algorithm, driven by the simulator, navigates the large strategy space efficiently to identify good operator/tensor fusion strategies that minimize distributed training time. We compare Disco with existing DL fusion schemes and show that it achieves training speed-ups close to the fully overlapping case.

(A minimal illustrative sketch of this simulator-driven search follows the summary fields below.)
Degree: Doctor of Philosophy
Subject: Neural networks (Computer science)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/306963
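Illustrative sketch of the simulator-driven search referenced in the abstract. Disco explores a large space of operator/tensor fusion strategies, with a GNN-based simulator estimating the per-iteration time of each candidate and a backtracking search pruning unpromising branches. The minimal Python sketch below reproduces only that search pattern under stated assumptions: a toy analytical cost model stands in for the thesis's GNN simulator, and the tensor list, constants, and function names are hypothetical, not taken from the thesis.

# Illustrative sketch only (not the thesis implementation): a simulator-driven
# backtracking search over communication tensor-fusion strategies, mirroring
# the search pattern described for Disco. The analytical cost model below is a
# stand-in for the GNN-based simulator; all names and constants are assumed.
from typing import List, Tuple

Bucket = List[Tuple[str, float]]

# Hypothetical gradient tensors (name, size in MB), in back-propagation order.
GRAD_TENSORS: Bucket = [("fc2", 16.0), ("fc1", 64.0), ("conv2", 8.0), ("conv1", 2.0)]

PER_MESSAGE_LATENCY_MS = 1.5   # assumed fixed launch overhead per fused all-reduce
BANDWIDTH_MB_PER_MS = 10.0     # assumed effective network bandwidth


def simulate_comm_time(buckets: List[Bucket]) -> float:
    """Toy stand-in for the GNN simulator: estimated communication time of a
    candidate fusion strategy (a partition of tensors into fused buckets)."""
    return sum(PER_MESSAGE_LATENCY_MS + sum(mb for _, mb in b) / BANDWIDTH_MB_PER_MS
               for b in buckets)


def backtrack(idx: int, buckets: List[Bucket], best: list) -> None:
    """Enumerate partitions of consecutive tensors into fusion buckets,
    pruning any branch whose partial cost already exceeds the best found."""
    partial = simulate_comm_time(buckets)
    if partial >= best[1]:
        return                                   # prune: cannot improve
    if idx == len(GRAD_TENSORS):
        best[0], best[1] = [list(b) for b in buckets], partial
        return
    tensor = GRAD_TENSORS[idx]
    if buckets:                                  # option 1: fuse into current bucket
        buckets[-1].append(tensor)
        backtrack(idx + 1, buckets, best)
        buckets[-1].pop()
    buckets.append([tensor])                     # option 2: start a new bucket
    backtrack(idx + 1, buckets, best)
    buckets.pop()


if __name__ == "__main__":
    best = [None, float("inf")]                  # [best strategy, best simulated time]
    backtrack(0, [], best)
    print(f"estimated comm time: {best[1]:.2f} ms")
    for i, bucket in enumerate(best[0]):
        print(f"bucket {i}: {[name for name, _ in bucket]}")

Because this toy model ignores overlap between communication and backward computation, it trivially favors one large bucket; a per-iteration simulator of the kind described in the abstract captures such overlap, which is what makes the choice of fusion boundaries a genuine optimization problem.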

 

DC Field: Value
dc.contributor.advisor: Wu, C
dc.contributor.author: Yi, Xiaodong
dc.contributor.author: 易晓东
dc.date.accessioned: 2021-11-03T04:36:36Z
dc.date.available: 2021-11-03T04:36:36Z
dc.date.issued: 2021
dc.identifier.citation: Yi, X. [易晓东]. (2021). Optimization of DNN training in distributed environments. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/306963
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Neural networks (Computer science)
dc.title: Optimization of DNN training in distributed environments
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2021
dc.identifier.mmsid: 991044437603403414
