File Download
There are no files associated with this item.
Links for fulltext (May Require Subscription)
- Publisher Website: https://doi.org/10.1109/TPDS.2023.3247883
- Scopus: eid_2-s2.0-85149364640
- WOS: WOS:000958537300002
Article: Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models
Title | Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models |
---|---|
Authors | Li, Fanxin; Zhao, Shixiong; Qing, Yuhao; Chen, Xusheng; Guan, Xiuxian; Wang, Sen; Zhang, Gong; Cui, Heming |
Keywords | 3D parallelism; deep learning; distributed training; DNN; GPU; machine learning; pipeline parallelism |
Issue Date | 15-May-2023 |
Publisher | Institute of Electrical and Electronics Engineers |
Citation | IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 5, p. 1432-1449 |
Abstract | Training a large DNN (e.g., GPT3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated; along the data parallelism dimension, a set of heavy-weight communications (for aggregating the accumulated outputs of computational tasks) is inevitably serialized after the pipelined tasks, undermining the training performance (e.g., in Megatron, data parallelism caused all GPUs idle for over 44% of the training time) over commodity cloud networks. To deserialize these communicational and computational tasks, we propose the AIAO scheduling (for 3D parallelism) which slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communicational tasks that synchronize this segment can be launched and overlapped (deserialized) with other segments’ computational tasks. We realized this idea in our Fold3D training system. Extensive evaluation shows Fold3D eliminated most of the all-GPU 44% idle time in Megatron (caused by data parallelism), leading to 25.2%–42.1% training throughput improvement compared to four notable baselines over various settings; Fold3D's high performance scaled to many GPUs. |
Persistent Identifier | http://hdl.handle.net/10722/331903 |
ISSN | 1045-9219 (2023 Impact Factor: 5.6; 2023 SCImago Journal Rankings: 2.340) |
ISI Accession Number ID | WOS:000958537300002 |
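
The abstract above describes the core scheduling idea: slice the DNN into segments so that the gradient all-reduce for one segment can be launched as soon as that segment's backward computation finishes, overlapping (deserializing) the communication with the computation of the remaining segments. The sketch below is a minimal, hypothetical illustration of that overlap pattern in PyTorch, not the authors' Fold3D/AIAO implementation; the per-layer segmentation, the single-process `gloo` setup, and all names are assumptions for illustration, and it relies on `register_post_accumulate_grad_hook`, which requires PyTorch 2.1 or newer.

```python
# Minimal sketch (not Fold3D's code): start an async all-reduce for each
# segment's gradients as soon as that segment's backward pass has produced
# them, so communication overlaps with backward compute of other segments.
import torch
import torch.distributed as dist
from torch import nn


def overlap_segment_allreduce(segments, handles):
    """Register hooks that launch an async all-reduce for a segment's
    gradients once every parameter in that segment has been accumulated."""
    for seg in segments:
        params = [p for p in seg.parameters() if p.requires_grad]
        pending = {"n": len(params)}

        def hook(_param, params=params, pending=pending):
            pending["n"] -= 1
            if pending["n"] == 0:                    # segment's grads are all ready
                for p in params:                     # communication starts here and
                    handles.append(dist.all_reduce(p.grad, async_op=True))
                pending["n"] = len(params)           # reset for the next iteration

        for p in params:
            p.register_post_accumulate_grad_hook(hook)


if __name__ == "__main__":
    # Single-process demo (world_size=1) just to make the sketch runnable;
    # real data-parallel training runs this under torchrun on many GPUs, and
    # would also divide the reduced gradients by the world size.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(4)])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    handles = []
    overlap_segment_allreduce(list(model), handles)  # one "segment" per layer

    for _ in range(3):
        loss = model(torch.randn(32, 256)).sum()
        loss.backward()              # per-segment all-reduces launch during backward
        for h in handles:            # wait for outstanding communication
            h.wait()
        handles.clear()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()
```

In real 3D-parallel training the same principle is applied across microbatches and pipeline stages; the single backward pass here merely stands in for that schedule.
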
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Li, Fanxin | - |
dc.contributor.author | Zhao, Shixiong | - |
dc.contributor.author | Qing, Yuhao | - |
dc.contributor.author | Chen, Xusheng | - |
dc.contributor.author | Guan, Xiuxian | - |
dc.contributor.author | Wang, Sen | - |
dc.contributor.author | Zhang, Gong | - |
dc.contributor.author | Cui, Heming | - |
dc.date.accessioned | 2023-09-28T04:59:29Z | - |
dc.date.available | 2023-09-28T04:59:29Z | - |
dc.date.issued | 2023-05-15 | - |
dc.identifier.citation | IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 5, p. 1432-1449 | - |
dc.identifier.issn | 1045-9219 | - |
dc.identifier.uri | http://hdl.handle.net/10722/331903 | - |
dc.description.abstract | Training a large DNN (e.g., GPT3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated; along the data parallelism dimension, a set of heavy-weight communications (for aggregating the accumulated outputs of computational tasks) is inevitably serialized after the pipelined tasks, undermining the training performance (e.g., in Megatron, data parallelism caused all GPUs idle for over 44% of the training time) over commodity cloud networks. To deserialize these communicational and computational tasks, we propose the AIAO scheduling (for 3D parallelism) which slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communicational tasks that synchronize this segment can be launched and overlapped (deserialized) with other segments’ computational tasks. We realized this idea in our Fold3D training system. Extensive evaluation shows Fold3D eliminated most of the all-GPU 44% idle time in Megatron (caused by data parallelism), leading to 25.2%–42.1% training throughput improvement compared to four notable baselines over various settings; Fold3D's high performance scaled to many GPUs. | -
dc.language | eng | - |
dc.publisher | Institute of Electrical and Electronics Engineers | - |
dc.relation.ispartof | IEEE Transactions on Parallel and Distributed Systems | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject | 3D parallelism | - |
dc.subject | deep learning | - |
dc.subject | distributed training | - |
dc.subject | DNN | - |
dc.subject | GPU | - |
dc.subject | machine learning | - |
dc.subject | pipeline parallelism | - |
dc.title | Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models | - |
dc.type | Article | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.1109/TPDS.2023.3247883 | - |
dc.identifier.scopus | eid_2-s2.0-85149364640 | - |
dc.identifier.volume | 34 | - |
dc.identifier.issue | 5 | - |
dc.identifier.spage | 1432 | - |
dc.identifier.epage | 1449 | - |
dc.identifier.eissn | 1558-2183 | - |
dc.identifier.isi | WOS:000958537300002 | - |
dc.identifier.issnl | 1045-9219 | - |