Article: PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models
| Title | PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models |
|---|---|
| Authors | Li, Fanxin; Zhao, Shixiong; Qing, Yuhao; Jiang, Jianyu; Chen, Xusheng; Cui, Heming |
| Keywords | 3D parallelism; Deep Learning; distributed training; DNN; GPU; Machine Learning; pipeline parallelism |
| Issue Date | 1-Jan-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Parallel and Distributed Systems, 2025, v. 36, n. 9, p. 1872-1889 |
| Abstract | Efficiently training large language models (LLMs) on commodity cloud resources remains challenging due to limitations in network bandwidth and accelerator memory capacity. Existing training systems can be categorized based on their pipeline schedules. Depth-first scheduling, employed by systems like Megatron, prioritizes memory efficiency but restricts the overlap between communication and computation, causing accelerators to remain idle for over 20% of the training time. Conversely, breadth-first scheduling maximizes communication overlap but generates excessive intermediate activations, exceeding memory capacity and slowing computation by more than 34%. To address these limitations, we propose a novel elastic pipeline schedule that enables fine-grained control over the trade-off between communication overlap and memory consumption. Our approach determines the number of micro-batches scheduled together according to the communication time and the memory available. Furthermore, we introduce a mixed sharding strategy and a pipeline-aware selective recomputation technique to reduce memory usage. Experimental results demonstrate that our system eliminates most of the 28% all-accelerator idle time caused by communication, with recomputation accounting for less than 1.9% of the training time. Compared to existing baselines, PIPEMESH improves training throughput on commodity clouds by 20.1% to 33.8%. |
| Persistent Identifier | http://hdl.handle.net/10722/361925 |
| ISSN | 1045-9219 (print); 1558-2183 (electronic) |
| 2023 Impact Factor | 5.6 |
| 2023 SCImago Journal Rankings | 2.340 |
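The abstract above describes the system's central scheduling decision: the number of micro-batches grouped together is chosen from the communication time and the memory available. The short Python sketch below illustrates that trade-off under stated assumptions; the function name, parameters, and the simple overlap and memory formulas are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical sketch of the elastic schedule's core decision: how many
# micro-batches to group so that communication can overlap with computation
# without exceeding the activation-memory budget. All names and formulas are
# illustrative assumptions, not PIPEMESH's actual implementation.

def micro_batches_to_group(comm_time_ms: float,
                           compute_time_ms: float,
                           activation_bytes_per_microbatch: int,
                           free_memory_bytes: int,
                           max_group: int) -> int:
    """Pick the smallest group size that hides communication behind compute,
    capped by the memory available for extra in-flight activations."""
    # Grouping k micro-batches provides roughly k * compute_time_ms of work
    # behind which one micro-batch's communication can be hidden (assumption).
    needed_for_overlap = max(1, math.ceil(comm_time_ms / compute_time_ms))
    # Each additional in-flight micro-batch keeps its activations resident.
    memory_cap = max(1, free_memory_bytes // activation_bytes_per_microbatch)
    return min(needed_for_overlap, memory_cap, max_group)


if __name__ == "__main__":
    # Example: 30 ms of communication per micro-batch, 12 ms of compute,
    # 2 GiB of activations per micro-batch, 6 GiB of free accelerator memory.
    k = micro_batches_to_group(comm_time_ms=30.0,
                               compute_time_ms=12.0,
                               activation_bytes_per_microbatch=2 << 30,
                               free_memory_bytes=6 << 30,
                               max_group=8)
    print(f"schedule {k} micro-batches together")  # -> 3
```

In this toy example, hiding 30 ms of communication behind 12 ms compute steps calls for grouping three micro-batches, which also fits within the 6 GiB activation budget.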
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Li, Fanxin | - |
| dc.contributor.author | Zhao, Shixiong | - |
| dc.contributor.author | Qing, Yuhao | - |
| dc.contributor.author | Jiang, Jianyu | - |
| dc.contributor.author | Chen, Xusheng | - |
| dc.contributor.author | Cui, Heming | - |
| dc.date.accessioned | 2025-09-17T00:32:05Z | - |
| dc.date.available | 2025-09-17T00:32:05Z | - |
| dc.date.issued | 2025-01-01 | - |
| dc.identifier.citation | IEEE Transactions on Parallel and Distributed Systems, 2025, v. 36, n. 9, p. 1872-1889 | - |
| dc.identifier.issn | 1045-9219 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/361925 | - |
| dc.description.abstract | Efficiently training large language models (LLMs) on commodity cloud resources remains challenging due to limitations in network bandwidth and accelerator memory capacity. Existing training systems can be categorized based on their pipeline schedules. Depth-first scheduling, employed by systems like Megatron, prioritizes memory efficiency but restricts the overlap between communication and computation, causing accelerators to remain idle for over 20% of the training time. Conversely, breadth-first scheduling maximizes communication overlap but generates excessive intermediate activations, exceeding memory capacity and slowing computation by more than 34%. To address these limitations, we propose a novel elastic pipeline schedule that enables fine-grained control over the trade-off between communication overlap and memory consumption. Our approach determines the number of micro-batches scheduled together according to the communication time and the memory available. Furthermore, we introduce a mixed sharding strategy and a pipeline-aware selective recomputation technique to reduce memory usage. Experimental results demonstrate that our system eliminates most of the 28% all-accelerator idle time caused by communication, with recomputation accounting for less than 1.9% of the training time. Compared to existing baselines, PIPEMESH improves training throughput on commodity clouds by 20.1% to 33.8%. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Parallel and Distributed Systems | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | 3D parallelism | - |
| dc.subject | Deep Learning | - |
| dc.subject | distributed training | - |
| dc.subject | DNN | - |
| dc.subject | GPU | - |
| dc.subject | Machine Learning | - |
| dc.subject | pipeline parallelism | - |
| dc.title | PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models | - |
| dc.type | Article | - |
| dc.description.nature | published_or_final_version | - |
| dc.identifier.doi | 10.1109/TPDS.2025.3583983 | - |
| dc.identifier.scopus | eid_2-s2.0-105009426329 | - |
| dc.identifier.volume | 36 | - |
| dc.identifier.issue | 9 | - |
| dc.identifier.spage | 1872 | - |
| dc.identifier.epage | 1889 | - |
| dc.identifier.eissn | 1558-2183 | - |
| dc.identifier.issnl | 1045-9219 | - |
