
Postgraduate thesis: Performance diagnosis and optimization for deep neural network training and inference

Title: Performance diagnosis and optimization for deep neural network training and inference
Authors: Hu, Hanpeng (胡汉鵬)
Advisor(s): Wu, C
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Hu, H. [胡汉鵬]. (2024). Performance diagnosis and optimization for deep neural network training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: In recent years, Deep Neural Networks (DNNs) have achieved remarkable success across a wide range of machine learning (ML) applications, such as image classification and natural language processing. The growing popularity of Large Language Models (LLMs) has led to a surge in data volumes and DNN model sizes. As a consequence, DNN training and inference are becoming increasingly resource-intensive and time-consuming. However, diagnosing, modeling, and optimizing the performance of DNN training and inference jobs is complex due to the diversity of DNN models, devices, and optimization techniques. In particular, distributed DNN training with the Parameter Server (PS) architecture, which has been widely applied to accelerate training by exploiting multiple devices, faces heterogeneity in device compute capacity and network bandwidth, posing significant challenges to parameter synchronization. We therefore introduce three system designs, dPRO, CDMPP, and ADSP, to automatically diagnose, model, and optimize the performance of DNN training and inference jobs.

dPRO is a toolkit that automatically diagnoses performance issues and expedites distributed DNN training. dPRO includes (1) an efficient profiler that collects runtime traces, especially fine-grained communication traces, and constructs global data-flow graphs including detailed communication operations for accurate replay; and (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies for training acceleration. We implement dPRO with multiple deep learning frameworks (TensorFlow, MXNet) and communication schemes (AllReduce and PS). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with < 5% error in most cases and finds optimization strategies with up to 3.48× speed-up over the baselines.

To accurately model the performance of different DNN models on various devices, we propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative yet efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method to capture the internal structure of tensor programs. We also develop a domain-adaptation-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm so that the predictor adapts to unseen devices. Our experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines, with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency.

Additionally, we design ADSP, a novel parameter synchronization model for distributed machine learning under the PS architecture with heterogeneous devices. To eliminate the significant waiting time incurred by existing parameter synchronization models, the core idea of ADSP is that faster devices continue training while committing their model updates at strategically decided intervals. We design algorithms that decide the time points at which each worker commits its model update, ensuring not only global model convergence but also faster convergence. Our testbed implementation and experiments show that ADSP significantly outperforms existing parameter synchronization models in terms of ML model convergence time, scalability, and adaptability to large heterogeneity.
Degree: Doctor of Philosophy
Subject: Deep learning (Machine learning); Neural networks (Computer science)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/341570
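
The abstract describes dPRO as replaying a profiled global data-flow graph to predict distributed training performance. The snippet below is only a minimal sketch of that general idea, not dPRO's actual implementation: the class names, executor labels, and toy trace are hypothetical, and real traces would contain far richer computation and communication detail.

```python
# A toy replay of a global data-flow graph: each op has a profiled duration and
# an executor (a GPU stream or a communication channel); iteration time is the
# finish time of the last op when ops are replayed in dependency order and each
# executor runs one op at a time.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    duration: float                              # profiled execution time (ms)
    executor: str                                # e.g. "gpu:0" or "net:worker->ps"
    deps: list = field(default_factory=list)     # names of upstream ops

def replay(ops):
    """Estimate iteration time; assumes `ops` is topologically sorted."""
    finish = {}                                  # op name -> simulated finish time
    executor_free = defaultdict(float)           # executor -> time it becomes idle
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0.0)
        start = max(ready, executor_free[op.executor])
        finish[op.name] = start + op.duration
        executor_free[op.executor] = finish[op.name]
    return max(finish.values())

# Hypothetical single-worker trace: forward, backward, then a PS push/pull.
trace = [
    Op("fw", 2.0, "gpu:0"),
    Op("bw", 4.0, "gpu:0", deps=["fw"]),
    Op("push", 1.5, "net:worker->ps", deps=["bw"]),
    Op("pull", 1.5, "net:worker->ps", deps=["push"]),
    Op("update", 0.5, "gpu:0", deps=["pull"]),
]
print(f"estimated iteration time: {replay(trace):.1f} ms")
```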
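For CDMPP, the abstract mentions compact ASTs with a pre-order-based positional encoding of tensor programs. The sketch below shows what "flatten an AST in pre-order and attach positions" could look like under simple assumptions; the node kinds, attributes, and encoded fields are illustrative, not CDMPP's actual feature set.

```python
# Flatten a (hypothetical) tensor-program AST in pre-order; each node is emitted
# with its pre-order index, depth, kind, and attributes, so a downstream latency
# regressor sees node features together with where each node sits in the tree.
from dataclasses import dataclass, field

@dataclass
class ASTNode:
    kind: str                                     # e.g. "for", "compute"
    attrs: dict = field(default_factory=dict)     # e.g. loop extent, data type
    children: list = field(default_factory=list)

def preorder_encode(root):
    encoded, stack = [], [(root, 0)]
    while stack:
        node, depth = stack.pop()
        encoded.append({
            "pos": len(encoded),                  # pre-order index as the positional code
            "depth": depth,
            "kind": node.kind,
            "attrs": node.attrs,
        })
        for child in reversed(node.children):     # reverse so the leftmost child is visited next
            stack.append((child, depth + 1))
    return encoded

# Toy loop nest with a single compute statement.
ast = ASTNode("for", {"extent": 128}, [
    ASTNode("for", {"extent": 64}, [
        ASTNode("compute", {"op": "multiply_add"}),
    ]),
])
for row in preorder_encode(ast):
    print(row)
```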
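For ADSP, the core idea stated in the abstract is that faster workers keep training and commit their accumulated updates at decided time points instead of waiting at a synchronization barrier. The toy simulation below illustrates that idea under a simplifying assumption of a fixed commit interval DELTA; ADSP's actual algorithms choose commit points adaptively, and all names and constants here are hypothetical.

```python
# Toy simulation of interval-based commits with heterogeneous workers: each worker
# trains locally at its own speed and pushes its accumulated update whenever the
# wall clock crosses the next commit point (a fixed DELTA here). Workers are
# simulated one after another for simplicity; the point is that a worker's commit
# schedule depends only on time, never on how fast the other workers are.
import numpy as np

DELTA = 1.0                                       # commit interval (seconds), hypothetical
TOTAL_TIME = 5.0                                  # simulated wall-clock budget per worker
STEP_TIME = {"fast_gpu": 0.05, "slow_cpu": 0.4}   # per-step latency of each worker

global_model = np.zeros(4)                        # toy model held by the parameter server

def run_worker(name, step_time):
    global global_model
    model = global_model.copy()
    local_update = np.zeros_like(model)
    now, next_commit, steps = 0.0, DELTA, 0
    while now < TOTAL_TIME:
        grad = 0.01 * np.random.randn(*model.shape)   # stand-in for a real gradient
        model -= grad
        local_update -= grad
        steps += 1
        now += step_time
        if now >= next_commit:                    # commit point reached: push, pull, continue
            global_model += local_update
            model = global_model.copy()
            local_update[:] = 0.0
            next_commit += DELTA
    print(f"{name}: {steps} local steps, committed roughly every {DELTA:.0f}s without waiting")

for worker_name, latency in STEP_TIME.items():
    run_worker(worker_name, latency)
```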

 

DC Field / Value
dc.contributor.advisor: Wu, C
dc.contributor.author: Hu, Hanpeng
dc.contributor.author: 胡汉鵬
dc.date.accessioned: 2024-03-18T09:56:02Z
dc.date.available: 2024-03-18T09:56:02Z
dc.date.issued: 2024
dc.identifier.citation: Hu, H. [胡汉鵬]. (2024). Performance diagnosis and optimization for deep neural network training and inference. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/341570
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Deep learning (Machine learning)
dc.subject.lcsh: Neural networks (Computer science)
dc.title: Performance diagnosis and optimization for deep neural network training and inference
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044781601903414
