Conference Paper: Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
Field | Value |
---|---|
Title | Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism |
Authors | Zhang, S; Diao, L; Wu, C; Wang, S; Lin, W |
Keywords | Distributed system; Neural networks; Pipeline parallelism |
Issue Date | 2022 |
Publisher | Association for Computing Machinery |
Citation | The 13th ACM Symposium on Cloud Computing (SOCC’22), San Francisco, CA, United States, November 8-10, 2022. In SoCC '22: Proceedings of the 13th Symposium on Cloud Computing, p. 403-418 |
Abstract | Deep neural networks (DNNs) with trillions of parameters have emerged, e.g., Mixture-of-Experts (MoE) models. Training models of this scale requires sophisticated parallelization strategies such as the newly proposed SPMD parallelism, which shards each tensor along different dimensions. A common problem when using SPMD is that computation stalls during communication due to data dependencies, resulting in low GPU utilization and long training time. We present a general technique to accelerate SPMD-based DNN training by maximizing computation-communication overlap and automating SPMD strategy search. The key idea is to duplicate the DNN model into two copies that have no dependency, and to interleave their execution such that the computation of one copy overlaps with the communication of the other. We propose a dynamic programming algorithm that automatically identifies optimized sharding strategies to minimize model training time by maximally enabling computation-communication overlap. Experiments show that our designs achieve up to 61% training speed-up compared to existing frameworks. |
Persistent Identifier | http://hdl.handle.net/10722/320624 |
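
The abstract describes interleaving two dependency-free copies of the model so that one copy's computation hides the other copy's communication. The sketch below is a minimal, hypothetical illustration of that scheduling idea, not the paper's implementation: the layer count, the placeholder `compute`/`communicate_async` functions, and the thread-based asynchronous communication are all assumptions made for clarity.

```python
# Minimal sketch (assumptions, not the paper's implementation):
# interleave two dependency-free model copies so that the computation
# of one copy overlaps with the communication of the other.
import threading
import time

NUM_LAYERS = 4  # hypothetical model depth


def compute(copy_id, layer):
    # Stand-in for the sharded computation of one layer.
    time.sleep(0.01)
    print(f"copy {copy_id}: computed layer {layer}")


def communicate_async(copy_id, layer):
    # Stand-in for SPMD collective communication (e.g., all-reduce or
    # all-to-all); it runs in the background so the other copy can
    # compute in the meantime.
    def _comm():
        time.sleep(0.01)
        print(f"copy {copy_id}: finished communication for layer {layer}")

    t = threading.Thread(target=_comm)
    t.start()
    return t


def interleaved_step():
    pending = {0: None, 1: None}  # outstanding communication per copy
    for layer in range(NUM_LAYERS):
        for copy_id in (0, 1):
            other = 1 - copy_id
            # While this copy computes, the other copy's previously
            # launched communication proceeds in the background.
            compute(copy_id, layer)
            pending[copy_id] = communicate_async(copy_id, layer)
            if pending[other] is not None:
                pending[other].join()  # overlapped, so little idle time
    for t in pending.values():
        if t is not None:
            t.join()


if __name__ == "__main__":
    interleaved_step()
```

Running the script prints the alternating compute/communication events of the two copies; the point of the schedule is that neither copy ever waits on its own collective while the GPU sits idle.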
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhang, S | - |
dc.contributor.author | Diao, L | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Wang, S | - |
dc.contributor.author | Lin, W | - |
dc.date.accessioned | 2022-10-21T07:56:50Z | - |
dc.date.available | 2022-10-21T07:56:50Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | The 13th ACM Symposium on Cloud Computing (SOCC’22), San Francisco, CA, United States, November 8-10, 2022. In SoCC '22: Proceedings of the 13th Symposium on Cloud Computing, p. 403-418 | - |
dc.identifier.uri | http://hdl.handle.net/10722/320624 | - |
dc.description.abstract | Deep neural networks (DNNs) with trillions of parameters have emerged, e.g., Mixture-of-Experts (MoE) models. Training models of this scale requires sophisticated parallelization strategies such as the newly proposed SPMD parallelism, which shards each tensor along different dimensions. A common problem when using SPMD is that computation stalls during communication due to data dependencies, resulting in low GPU utilization and long training time. We present a general technique to accelerate SPMD-based DNN training by maximizing computation-communication overlap and automating SPMD strategy search. The key idea is to duplicate the DNN model into two copies that have no dependency, and to interleave their execution such that the computation of one copy overlaps with the communication of the other. We propose a dynamic programming algorithm that automatically identifies optimized sharding strategies to minimize model training time by maximally enabling computation-communication overlap. Experiments show that our designs achieve up to 61% training speed-up compared to existing frameworks. | -
dc.language | eng | - |
dc.publisher | Association for Computing Machinery. | - |
dc.relation.ispartof | SoCC '22: Proceedings of the 13th Symposium on Cloud Computing | - |
dc.rights | SoCC '22: Proceedings of the 13th Symposium on Cloud Computing. Copyright © Association for Computing Machinery. | - |
dc.subject | Distributed system | - |
dc.subject | Neural networks | - |
dc.subject | Pipeline parallelism | - |
dc.title | Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.identifier.doi | 10.1145/3542929.3563487 | - |
dc.identifier.hkuros | 340525 | - |
dc.identifier.spage | 403 | - |
dc.identifier.epage | 418 | - |
dc.publisher.place | United States | - |
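
The abstract also mentions a dynamic programming algorithm that picks per-layer sharding strategies so as to maximize computation-communication overlap. The sketch below shows one simplified way such a DP could be structured; the strategy names (`row`, `col`), the per-layer costs, and the constant `RESHARD` penalty are invented for illustration, and the cost model is only an assumption about how overlap might be credited, not the paper's actual formulation.

```python
# Simplified DP sketch over per-layer sharding strategies (illustrative
# assumptions only). dp[i][s] is the estimated time to run layers 0..i
# with layer i sharded by strategy s, assuming communication can be
# hidden behind the other copy's computation, so each layer contributes
# max(compute, communication) rather than their sum.

# Per-layer candidate strategies: name -> (compute_time, comm_time)
layers = [
    {"row": (1.0, 0.6), "col": (1.2, 0.3)},
    {"row": (0.8, 0.9), "col": (1.1, 0.4)},
    {"row": (1.5, 0.2), "col": (1.3, 0.7)},
]

RESHARD = 0.5  # hypothetical cost of switching strategy between layers


def best_schedule(layers):
    dp = [{s: max(c, m) for s, (c, m) in layers[0].items()}]
    choice = [{s: None for s in layers[0]}]
    for i in range(1, len(layers)):
        dp.append({})
        choice.append({})
        for s, (c, m) in layers[i].items():
            best_prev, best_cost = None, float("inf")
            for p, prev_cost in dp[i - 1].items():
                extra = 0.0 if p == s else RESHARD
                # Communication (m + extra) is assumed to overlap with
                # the other copy's computation, so only the larger of
                # compute and communication is paid for this layer.
                cost = prev_cost + max(c, m + extra)
                if cost < best_cost:
                    best_prev, best_cost = p, cost
            dp[i][s] = best_cost
            choice[i][s] = best_prev
    # Backtrack the cheapest strategy sequence.
    last = min(dp[-1], key=dp[-1].get)
    seq = [last]
    for i in range(len(layers) - 1, 0, -1):
        seq.append(choice[i][seq[-1]])
    return list(reversed(seq)), dp[-1][last]


if __name__ == "__main__":
    strategies, est_time = best_schedule(layers)
    print("chosen strategies:", strategies, "estimated time:", est_time)
```

The max-of-compute-and-communication term is what encodes overlap in this toy cost model: a strategy with expensive collectives is only penalized to the extent that its communication cannot be hidden behind computation.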