
Postgraduate thesis: Towards fast distributed training with generic and efficient communication scheduling

Title: Towards fast distributed training with generic and efficient communication scheduling
Authors: Jin, Along (金阿龙)
Advisors: Yeung, LK; Tam, WLV
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Jin, A. [金阿龙]. (2023). Towards fast distributed training with generic and efficient communication scheduling. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: With massive computing resources and sizeable high-quality datasets, machine learning (ML) has attained state-of-the-art performance in many practical applications. To speed up ML training, data parallelism across multiple workers is widely adopted: each worker holds a full copy of the model parameters, trains on a subset of the input dataset, and synchronizes its computed gradients with the others. In distributed training, popular ML frameworks employ either the Parameter Server or the All-Reduce architecture to facilitate this gradient exchange. Nevertheless, because of frequent gradient synchronization, the communication cost often dominates and may take up to 90% of the training time. Communication scheduling has thus been adopted to accelerate the training procedure. In this thesis, to minimize the communication cost and training time, we focus on designing practical and efficient schedulers for both Parameter Server and All-Reduce.

In Parameter Server, the global parameters are stored on logically centralized servers, via which workers can exchange gradients synchronously or asynchronously. To reduce the communication cost of model synchronization, prior work overlaps communication with computation for deep neural networks. Yet these methods require not only layer-by-layer model structures but also tailor-made efforts in runtime profiling and hyperparameter tuning. To make the overlapping optimization more efficient, we propose a new Parameter Server framework. In particular, our solution decouples the dependencies between push and pull operations and allows workers to eagerly pull the global parameters. As a result, both push and pull operations can easily be overlapped with computation. Moreover, the overlapping strategy offers a new way to address the straggler problem, where stale updates greatly retard the training procedure: with adequate information available, workers can explicitly adjust their local learning rates to limit the adverse impact of stale updates on the global parameters. Experiments show that, compared with widely used schemes, our prototype reduces per-iteration time by up to 54% and requires up to 37% fewer iterations for model convergence.

In All-Reduce, by contrast, there are no centralized servers and workers communicate only with their neighbors. To mitigate the synchronization cost, efficient scheduling of all-reduce operations is essential for minimizing training time. However, existing schedulers are either agnostic to ML workload characteristics or require runtime profiling of workloads and fine-tuning of system hyperparameters, rendering them less efficient or not generic. To this end, we propose a new All-Reduce scheduler that is both efficient and generic. Specifically, we decouple the layer-to-layer dependencies between the communication of one iteration and the computation of the next, thereby allowing all-reduce operations to be overlapped with computation, and we employ a simple approximation of the parameters to preserve training quality. In addition, we demonstrate that other techniques, such as tensor fusion, can easily be incorporated into the proposed scheduler. Experiments show that, compared with popular All-Reduce schedulers, our prototype accelerates training on representative vision and language tasks by up to 29.2%.
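The common mechanism behind both schedulers is overlapping gradient communication with computation rather than synchronizing at a per-iteration barrier. The sketch below is a minimal illustration of that idea, not the thesis's implementation: each parameter's gradient is handed to a non-blocking all-reduce as soon as backpropagation produces it, and the outstanding operations are waited on only just before the optimizer step. It assumes a PyTorch environment where torch.distributed has already been initialized on every worker; the function names attach_overlapped_allreduce and finish_allreduce are illustrative.

```python
# Minimal sketch of overlapping gradient communication with backward
# computation in data-parallel training. Not the thesis's code: assumes
# torch.distributed is initialized on every worker (e.g. init_process_group
# with the NCCL or Gloo backend) and each worker holds an identical replica
# of the model.
import torch
import torch.distributed as dist


def attach_overlapped_allreduce(model, pending):
    """Register hooks that launch a non-blocking all-reduce for each
    parameter's gradient as soon as backpropagation produces it, so the
    communication of later layers overlaps with the gradient computation
    of earlier layers instead of waiting for the whole backward pass."""
    def make_hook(param):
        def hook(grad):
            # Non-blocking sum across workers; backward continues immediately.
            work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
            pending.append((param, grad, work))
            return grad
        return hook

    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(make_hook(param))


def finish_allreduce(pending):
    """Wait for the outstanding all-reduces just before the optimizer step
    and install the averaged gradients. The thesis's schedulers push this
    further by deferring synchronization into the next iteration and starting
    its computation from approximate (eagerly pulled or stale) parameters."""
    world_size = dist.get_world_size()
    for param, grad, work in pending:
        work.wait()
        param.grad = grad / world_size
    pending.clear()
```

A training loop would call attach_overlapped_allreduce(model, pending) once, and then per iteration run loss.backward(), finish_allreduce(pending), and optimizer.step(). The scheduling question studied in the thesis is how far this synchronization point can be deferred, and in what order tensors should be communicated, without degrading convergence.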
Degree: Doctor of Philosophy
Subject: Parallel scheduling (Computer scheduling); Machine learning
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/342873

 

DC Field / Value
dc.contributor.advisor: Yeung, LK
dc.contributor.advisor: Tam, WLV
dc.contributor.author: Jin, Along
dc.contributor.author: 金阿龙
dc.date.accessioned: 2024-05-07T01:22:02Z
dc.date.available: 2024-05-07T01:22:02Z
dc.date.issued: 2023
dc.identifier.citation: Jin, A. [金阿龙]. (2023). Towards fast distributed training with generic and efficient communication scheduling. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/342873
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Parallel scheduling (Computer scheduling)
dc.subject.lcsh: Machine learning
dc.title: Towards fast distributed training with generic and efficient communication scheduling
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Electrical and Electronic Engineering
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2023
dc.identifier.mmsid: 991044695779303414
