
Postgraduate thesis: Towards fast distributed training with generic and efficient communication scheduling

Title: Towards fast distributed training with generic and efficient communication scheduling
Authors: Jin, Along (金阿龙)
Advisors: Yeung, LK; Tam, WLV
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Jin, A. [金阿龙]. (2023). Towards fast distributed training with generic and efficient communication scheduling. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: With massive computing resources and sizeable high-quality datasets, machine learning (ML) has attained state-of-the-art performance in many practical applications. To speed up ML training, data parallelism across multiple workers is widely adopted: each worker holds a full copy of the model parameters, trains on a subset of the input dataset, and synchronizes its computed gradients with the others. In distributed training, popular ML frameworks employ either the Parameter Server or the All-Reduce architecture to facilitate this gradient exchange. Nevertheless, because of frequent gradient synchronization, the communication cost often dominates and may take up to 90% of the training time. Communication scheduling has thus been adopted to accelerate the training procedure. In this thesis, to minimize the communication cost and training time, we focus on designing practical and efficient schedulers for both Parameter Server and All-Reduce.

In Parameter Server, the global parameters are stored on logically centralized servers, via which workers can exchange gradients synchronously or asynchronously. To reduce the communication cost of model synchronization, prior work overlaps communication with computation for deep neural networks. Yet these methods require not only layer-by-layer model structures but also tailor-made efforts in runtime profiling and hyperparameter tuning. To make the overlapping optimization more efficient, we propose a new Parameter Server framework. In particular, our solution decouples the dependencies between push and pull operations and allows workers to eagerly pull the global parameters. As a result, both push and pull operations can easily be overlapped with computation. Moreover, the overlapping strategy offers a new way to address the straggler problem, where stale updates greatly retard the training procedure: with adequate information available, workers can explicitly adjust their local learning rates to limit the adverse impact of stale updates on the global parameters. Experiments show that, compared with widely used schemes, our prototype reduces per-iteration time by up to 54% and requires up to 37% fewer iterations for model convergence.

In All-Reduce, by contrast, there are no centralized servers and workers communicate only with their neighbors. To mitigate the synchronization cost, efficient scheduling of all-reduce operations is essential for minimizing training time. However, existing schedulers are either agnostic to ML workload characteristics or require runtime profiling of workloads and fine-tuning of system hyperparameters, rendering them less efficient or not generic. To this end, we propose a new All-Reduce scheduler that is both efficient and generic. Specifically, we decouple the layer-to-layer dependencies between the communication of one iteration and the computation of the next, thereby allowing all-reduce operations to be overlapped with computation, and we employ a simple approximation of the parameters to preserve training quality. In addition, we demonstrate that other techniques, such as tensor fusion, can easily be incorporated into the proposed scheduler. Experiments show that, compared with popular All-Reduce schedulers, our prototype accelerates training on representative vision and language tasks by up to 29.2%.
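The common mechanism behind both schedulers is overlapping gradient communication with computation rather than synchronizing at a per-iteration barrier. The sketch below is a minimal illustration of that idea, not the thesis's implementation: each parameter's gradient is handed to a non-blocking all-reduce as soon as backpropagation produces it, and the outstanding operations are waited on only just before the optimizer step. It assumes a PyTorch environment where torch.distributed has already been initialized on every worker; the function names attach_overlapped_allreduce and finish_allreduce are illustrative.

```python
# Minimal sketch of overlapping gradient communication with backward
# computation in data-parallel training. Not the thesis's code: assumes
# torch.distributed is initialized on every worker (e.g. init_process_group
# with the NCCL or Gloo backend) and each worker holds an identical replica
# of the model.
import torch
import torch.distributed as dist


def attach_overlapped_allreduce(model, pending):
    """Register hooks that launch a non-blocking all-reduce for each
    parameter's gradient as soon as backpropagation produces it, so the
    communication of later layers overlaps with the gradient computation
    of earlier layers instead of waiting for the whole backward pass."""
    def make_hook(param):
        def hook(grad):
            # Non-blocking sum across workers; backward continues immediately.
            work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
            pending.append((param, grad, work))
            return grad
        return hook

    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(make_hook(param))


def finish_allreduce(pending):
    """Wait for the outstanding all-reduces just before the optimizer step
    and install the averaged gradients. The thesis's schedulers push this
    further by deferring synchronization into the next iteration and starting
    its computation from approximate (eagerly pulled or stale) parameters."""
    world_size = dist.get_world_size()
    for param, grad, work in pending:
        work.wait()
        param.grad = grad / world_size
    pending.clear()
```

A training loop would call attach_overlapped_allreduce(model, pending) once, and then per iteration run loss.backward(), finish_allreduce(pending), and optimizer.step(). The scheduling question studied in the thesis is how far this synchronization point can be deferred, and in what order tensors should be communicated, without degrading convergence.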
Degree: Doctor of Philosophy
Subject: Parallel scheduling (Computer scheduling); Machine learning
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/342873

 

DC Field / Value
dc.contributor.advisor: Yeung, LK
dc.contributor.advisor: Tam, WLV
dc.contributor.author: Jin, Along
dc.contributor.author: 金阿龙
dc.date.accessioned: 2024-05-07T01:22:02Z
dc.date.available: 2024-05-07T01:22:02Z
dc.date.issued: 2023
dc.identifier.citation: Jin, A. [金阿龙]. (2023). Towards fast distributed training with generic and efficient communication scheduling. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/342873
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Parallel scheduling (Computer scheduling)
dc.subject.lcsh: Machine learning
dc.title: Towards fast distributed training with generic and efficient communication scheduling
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Electrical and Electronic Engineering
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2023
dc.identifier.mmsid: 991044695779303414
