
Article: PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models

Title: PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models
Authors: Li, Fanxin; Zhao, Shixiong; Qing, Yuhao; Jiang, Jianyu; Chen, Xusheng; Cui, Heming
Keywords: 3D parallelism; Deep Learning; distributed training; DNN; GPU; Machine Learning; pipeline parallelism
Issue Date: 1-Jan-2025
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Parallel and Distributed Systems, 2025, v. 36, n. 9, p. 1872-1889
Abstract

Efficiently training large language models (LLMs) on commodity cloud resources remains challenging due to limitations in network bandwidth and accelerator memory capacity. Existing training systems can be categorized based on their pipeline schedules. Depth-first scheduling, employed by systems like Megatron, prioritizes memory efficiency but restricts the overlap between communication and computation, causing accelerators to remain idle for over 20% of the training time. Conversely, breadth-first scheduling maximizes communication overlap but generates excessive intermediate activations, exceeding memory capacity and slowing computation by more than 34%. To address these limitations, we propose a novel elastic pipeline schedule that enables fine-grained control over the trade-off between communication overlap and memory consumption. Our approach determines the number of micro-batches scheduled together according to the communication time and the memory available. Furthermore, we introduce a mixed sharding strategy and a pipeline-aware selective recomputation technique to reduce memory usage. Experimental results demonstrate that our system eliminates most of the 28% all-accelerator idle time caused by communication, with recomputation accounting for less than 1.9% of the training time. Compared to existing baselines, PIPEMESH improves training throughput on commodity clouds by 20.1% to 33.8%.
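As a rough illustration of the trade-off the abstract describes (not PIPEMESH's actual algorithm), the sketch below picks a micro-batch group size that is just large enough for the grouped computation to hide each micro-batch's communication, capped by the activation memory budget. The function name, the cost model, and the numbers are all illustrative assumptions.

import math

def choose_group_size(comm_time_per_mb: float,
                      compute_time_per_mb: float,
                      activation_mem_per_mb: float,
                      free_memory: float,
                      num_micro_batches: int) -> int:
    """Hypothetical heuristic, not the paper's algorithm.

    Grow the group until the compute of the other (group - 1) micro-batches
    can hide one micro-batch's communication, but never hold more in-flight
    activations than the memory budget allows.
    """
    # Overlap target: (group - 1) * compute_time_per_mb >= comm_time_per_mb.
    overlap_need = 1 + math.ceil(comm_time_per_mb / compute_time_per_mb)

    # Memory cap: activations of all grouped micro-batches must fit at once.
    mem_cap = max(1, int(free_memory // activation_mem_per_mb))

    return max(1, min(overlap_need, mem_cap, num_micro_batches))

if __name__ == "__main__":
    # Communication-heavy commodity cluster: the overlap target asks for a
    # group of 4, but the memory budget caps the group at 3 (illustrative).
    print(choose_group_size(comm_time_per_mb=60.0,      # ms per micro-batch
                            compute_time_per_mb=25.0,   # ms per micro-batch
                            activation_mem_per_mb=2.0,  # GiB per micro-batch
                            free_memory=6.0,            # GiB left for activations
                            num_micro_batches=16))      # prints 3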


Persistent Identifier: http://hdl.handle.net/10722/361925
ISSN: 1045-9219
2023 Impact Factor: 5.6
2023 SCImago Journal Rankings: 2.340

 

Dublin Core (DC) metadata
dc.contributor.author: Li, Fanxin
dc.contributor.author: Zhao, Shixiong
dc.contributor.author: Qing, Yuhao
dc.contributor.author: Jiang, Jianyu
dc.contributor.author: Chen, Xusheng
dc.contributor.author: Cui, Heming
dc.date.accessioned: 2025-09-17T00:32:05Z
dc.date.available: 2025-09-17T00:32:05Z
dc.date.issued: 2025-01-01
dc.identifier.citation: IEEE Transactions on Parallel and Distributed Systems, 2025, v. 36, n. 9, p. 1872-1889
dc.identifier.issn: 1045-9219
dc.identifier.uri: http://hdl.handle.net/10722/361925
dc.description.abstract: Efficiently training large language models (LLMs) on commodity cloud resources remains challenging due to limitations in network bandwidth and accelerator memory capacity. Existing training systems can be categorized based on their pipeline schedules. Depth-first scheduling, employed by systems like Megatron, prioritizes memory efficiency but restricts the overlap between communication and computation, causing accelerators to remain idle for over 20% of the training time. Conversely, breadth-first scheduling maximizes communication overlap but generates excessive intermediate activations, exceeding memory capacity and slowing computation by more than 34%. To address these limitations, we propose a novel elastic pipeline schedule that enables fine-grained control over the trade-off between communication overlap and memory consumption. Our approach determines the number of micro-batches scheduled together according to the communication time and the memory available. Furthermore, we introduce a mixed sharding strategy and a pipeline-aware selective recomputation technique to reduce memory usage. Experimental results demonstrate that our system eliminates most of the 28% all-accelerator idle time caused by communication, with recomputation accounting for less than 1.9% of the training time. Compared to existing baselines, PIPEMESH improves training throughput on commodity clouds by 20.1% to 33.8%.
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE Transactions on Parallel and Distributed Systems
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: 3D parallelism
dc.subject: Deep Learning
dc.subject: distributed training
dc.subject: DNN
dc.subject: GPU
dc.subject: Machine Learning
dc.subject: pipeline parallelism
dc.title: PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models
dc.type: Article
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.1109/TPDS.2025.3583983
dc.identifier.scopus: eid_2-s2.0-105009426329
dc.identifier.volume: 36
dc.identifier.issue: 9
dc.identifier.spage: 1872
dc.identifier.epage: 1889
dc.identifier.eissn: 1558-2183
dc.identifier.issnl: 1045-9219
