File Download
There are no files associated with this item.
Links for fulltext (May Require Subscription)
- Publisher Website: https://doi.org/10.1109/TPDS.2023.3247883
- Scopus: eid_2-s2.0-85149364640
- WOS: WOS:000958537300002
Article: Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models
Title | Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models |
---|---|
Authors | Li, Fanxin; Zhao, Shixiong; Qing, Yuhao; Chen, Xusheng; Guan, Xiuxian; Wang, Sen; Zhang, Gong; Cui, Heming |
Keywords | 3D parallelism; deep learning; distributed training; DNN; GPU; machine learning; pipeline parallelism |
Issue Date | 15-May-2023 |
Publisher | Institute of Electrical and Electronics Engineers |
Citation | IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 5, p. 1432-1449 |
Abstract | Training a large DNN (e.g., GPT3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated; along the data parallelism dimension, a set of heavy-weight communications (for aggregating the accumulated outputs of computational tasks) is inevitably serialized after the pipelined tasks, undermining the training performance (e.g., in Megatron, data parallelism caused all GPUs idle for over 44% of the training time) over commodity cloud networks. To deserialize these communicational and computational tasks, we propose the AIAO scheduling (for 3D parallelism) which slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communicational tasks that synchronize this segment can be launched and overlapped (deserialized) with other segments’ computational tasks. We realized this idea in our Fold3D training system. Extensive evaluation shows Fold3D eliminated most of the all-GPU 44% idle time in Megatron (caused by data parallelism), leading to 25.2%–42.1% training throughput improvement compared to four notable baselines over various settings; Fold3D's high performance scaled to many GPUs. |
Persistent Identifier | http://hdl.handle.net/10722/331903 |
ISSN | 1045-9219 (2023 Impact Factor: 5.6; 2023 SCImago Journal Rankings: 2.340) |
ISI Accession Number ID | WOS:000958537300002 |
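
The abstract above describes the core scheduling idea: slice the DNN into segments so that the gradient all-reduce for one segment can be launched as soon as that segment's backward computation finishes, overlapping (deserializing) the communication with the computation of the remaining segments. The sketch below is a minimal, hypothetical illustration of that overlap pattern in PyTorch, not the authors' Fold3D/AIAO implementation; the per-layer segmentation, the single-process `gloo` setup, and all names are assumptions for illustration, and it relies on `register_post_accumulate_grad_hook`, which requires PyTorch 2.1 or newer.

```python
# Minimal sketch (not Fold3D's code): start an async all-reduce for each
# segment's gradients as soon as that segment's backward pass has produced
# them, so communication overlaps with backward compute of other segments.
import torch
import torch.distributed as dist
from torch import nn


def overlap_segment_allreduce(segments, handles):
    """Register hooks that launch an async all-reduce for a segment's
    gradients once every parameter in that segment has been accumulated."""
    for seg in segments:
        params = [p for p in seg.parameters() if p.requires_grad]
        pending = {"n": len(params)}

        def hook(_param, params=params, pending=pending):
            pending["n"] -= 1
            if pending["n"] == 0:                    # segment's grads are all ready
                for p in params:                     # communication starts here and
                    handles.append(dist.all_reduce(p.grad, async_op=True))
                pending["n"] = len(params)           # reset for the next iteration

        for p in params:
            p.register_post_accumulate_grad_hook(hook)


if __name__ == "__main__":
    # Single-process demo (world_size=1) just to make the sketch runnable;
    # real data-parallel training runs this under torchrun on many GPUs, and
    # would also divide the reduced gradients by the world size.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(4)])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    handles = []
    overlap_segment_allreduce(list(model), handles)  # one "segment" per layer

    for _ in range(3):
        loss = model(torch.randn(32, 256)).sum()
        loss.backward()              # per-segment all-reduces launch during backward
        for h in handles:            # wait for outstanding communication
            h.wait()
        handles.clear()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()
```

In real 3D-parallel training the same principle is applied across microbatches and pipeline stages; the single backward pass here merely stands in for that schedule.
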
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Li, Fanxin | - |
dc.contributor.author | Zhao, Shixiong | - |
dc.contributor.author | Qing, Yuhao | - |
dc.contributor.author | Chen, Xusheng | - |
dc.contributor.author | Guan, Xiuxian | - |
dc.contributor.author | Wang, Sen | - |
dc.contributor.author | Zhang, Gong | - |
dc.contributor.author | Cui, Heming | - |
dc.date.accessioned | 2023-09-28T04:59:29Z | - |
dc.date.available | 2023-09-28T04:59:29Z | - |
dc.date.issued | 2023-05-15 | - |
dc.identifier.citation | IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 5, p. 1432-1449 | - |
dc.identifier.issn | 1045-9219 | - |
dc.identifier.uri | http://hdl.handle.net/10722/331903 | - |
dc.description.abstract | Training a large DNN (e.g., GPT3) efficiently on commodity clouds is challenging even with the latest 3D parallel training systems (e.g., Megatron v3.0). In particular, along the pipeline parallelism dimension, computational tasks that produce a whole DNN's gradients with multiple input batches should be concurrently activated; along the data parallelism dimension, a set of heavy-weight communications (for aggregating the accumulated outputs of computational tasks) is inevitably serialized after the pipelined tasks, undermining the training performance (e.g., in Megatron, data parallelism caused all GPUs idle for over 44% of the training time) over commodity cloud networks. To deserialize these communicational and computational tasks, we propose the AIAO scheduling (for 3D parallelism) which slices a DNN into multiple segments, so that the computational tasks processing the same DNN segment can be scheduled together, and the communicational tasks that synchronize this segment can be launched and overlapped (deserialized) with other segments’ computational tasks. We realized this idea in our Fold3D training system. Extensive evaluation shows Fold3D eliminated most of the all-GPU 44% idle time in Megatron (caused by data parallelism), leading to 25.2%–42.1% training throughput improvement compared to four notable baselines over various settings; Fold3D's high performance scaled to many GPUs. | -
dc.language | eng | - |
dc.publisher | Institute of Electrical and Electronics Engineers | - |
dc.relation.ispartof | IEEE Transactions on Parallel and Distributed Systems | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject | 3D parallelism | - |
dc.subject | deep learning | - |
dc.subject | distributed training | - |
dc.subject | DNN | - |
dc.subject | GPU | - |
dc.subject | machine learning | - |
dc.subject | pipeline parallelism | - |
dc.title | Fold3D: Rethinking and Parallelizing Computational and Communicational Tasks in the Training of Large DNN Models | - |
dc.type | Article | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.doi | 10.1109/TPDS.2023.3247883 | - |
dc.identifier.scopus | eid_2-s2.0-85149364640 | - |
dc.identifier.volume | 34 | - |
dc.identifier.issue | 5 | - |
dc.identifier.spage | 1432 | - |
dc.identifier.epage | 1449 | - |
dc.identifier.eissn | 1558-2183 | - |
dc.identifier.isi | WOS:000958537300002 | - |
dc.identifier.issnl | 1045-9219 | - |