Article: PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models
| Title | PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models |
|---|---|
| Authors | Li, Fanxin; Zhao, Shixiong; Qing, Yuhao; Jiang, Jianyu; Chen, Xusheng; Cui, Heming |
| Keywords | 3D parallelism; Deep Learning; distributed training; DNN; GPU; Machine Learning; pipeline parallelism |
| Issue Date | 1-Jan-2025 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Citation | IEEE Transactions on Parallel and Distributed Systems, 2025, v. 36, n. 9, p. 1872-1889 |
| Abstract | Efficiently training large language models (LLMs) on commodity cloud resources remains challenging due to limitations in network bandwidth and accelerator memory capacity. Existing training systems can be categorized based on their pipeline schedules. Depth-first scheduling, employed by systems like Megatron, prioritizes memory efficiency but restricts the overlap between communication and computation, causing accelerators to remain idle for over 20% of the training time. Conversely, breadth-first scheduling maximizes communication overlap but generates excessive intermediate activations, exceeding memory capacity and slowing computation by more than 34%. To address these limitations, we propose a novel elastic pipeline schedule that enables fine-grained control over the trade-off between communication overlap and memory consumption. Our approach determines the number of micro-batches scheduled together according to the communication time and the memory available. Furthermore, we introduce a mixed sharding strategy and a pipeline-aware selective recomputation technique to reduce memory usage. Experimental results demonstrate that our system eliminates most of the 28% all-accelerator idle time caused by communication, with recomputation accounting for less than 1.9% of the training time. Compared to existing baselines, PIPEMESH improves training throughput on commodity clouds by 20.1% to 33.8%. |
| Persistent Identifier | http://hdl.handle.net/10722/361925 |
| ISSN | 1045-9219 (print); 1558-2183 (electronic) |
| 2023 Impact Factor | 5.6 |
| 2023 SCImago Journal Rankings | 2.340 |
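The abstract above describes the system's central scheduling decision: the number of micro-batches grouped together is chosen from the communication time and the memory available. The short Python sketch below illustrates that trade-off under stated assumptions; the function name, parameters, and the simple overlap and memory formulas are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical sketch of the elastic schedule's core decision: how many
# micro-batches to group so that communication can overlap with computation
# without exceeding the activation-memory budget. All names and formulas are
# illustrative assumptions, not PIPEMESH's actual implementation.

def micro_batches_to_group(comm_time_ms: float,
                           compute_time_ms: float,
                           activation_bytes_per_microbatch: int,
                           free_memory_bytes: int,
                           max_group: int) -> int:
    """Pick the smallest group size that hides communication behind compute,
    capped by the memory available for extra in-flight activations."""
    # Grouping k micro-batches provides roughly k * compute_time_ms of work
    # behind which one micro-batch's communication can be hidden (assumption).
    needed_for_overlap = max(1, math.ceil(comm_time_ms / compute_time_ms))
    # Each additional in-flight micro-batch keeps its activations resident.
    memory_cap = max(1, free_memory_bytes // activation_bytes_per_microbatch)
    return min(needed_for_overlap, memory_cap, max_group)


if __name__ == "__main__":
    # Example: 30 ms of communication per micro-batch, 12 ms of compute,
    # 2 GiB of activations per micro-batch, 6 GiB of free accelerator memory.
    k = micro_batches_to_group(comm_time_ms=30.0,
                               compute_time_ms=12.0,
                               activation_bytes_per_microbatch=2 << 30,
                               free_memory_bytes=6 << 30,
                               max_group=8)
    print(f"schedule {k} micro-batches together")  # -> 3
```

In this toy example, hiding 30 ms of communication behind 12 ms compute steps calls for grouping three micro-batches, which also fits within the 6 GiB activation budget.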
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Li, Fanxin | - |
| dc.contributor.author | Zhao, Shixiong | - |
| dc.contributor.author | Qing, Yuhao | - |
| dc.contributor.author | Jiang, Jianyu | - |
| dc.contributor.author | Chen, Xusheng | - |
| dc.contributor.author | Cui, Heming | - |
| dc.date.accessioned | 2025-09-17T00:32:05Z | - |
| dc.date.available | 2025-09-17T00:32:05Z | - |
| dc.date.issued | 2025-01-01 | - |
| dc.identifier.citation | IEEE Transactions on Parallel and Distributed Systems, 2025, v. 36, n. 9, p. 1872-1889 | - |
| dc.identifier.issn | 1045-9219 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/361925 | - |
| dc.description.abstract | Efficiently training large language models (LLMs) on commodity cloud resources remains challenging due to limitations in network bandwidth and accelerator memory capacity. Existing training systems can be categorized based on their pipeline schedules. Depth-first scheduling, employed by systems like Megatron, prioritizes memory efficiency but restricts the overlap between communication and computation, causing accelerators to remain idle for over 20% of the training time. Conversely, breadth-first scheduling maximizes communication overlap but generates excessive intermediate activations, exceeding memory capacity and slowing computation by more than 34%. To address these limitations, we propose a novel elastic pipeline schedule that enables fine-grained control over the trade-off between communication overlap and memory consumption. Our approach determines the number of micro-batches scheduled together according to the communication time and the memory available. Furthermore, we introduce a mixed sharding strategy and a pipeline-aware selective recomputation technique to reduce memory usage. Experimental results demonstrate that our system eliminates most of the 28% all-accelerator idle time caused by communication, with recomputation accounting for less than 1.9% of the training time. Compared to existing baselines, PIPEMESH improves training throughput on commodity clouds by 20.1% to 33.8%. | - |
| dc.language | eng | - |
| dc.publisher | Institute of Electrical and Electronics Engineers | - |
| dc.relation.ispartof | IEEE Transactions on Parallel and Distributed Systems | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | 3D parallelism | - |
| dc.subject | Deep Learning | - |
| dc.subject | distributed training | - |
| dc.subject | DNN | - |
| dc.subject | GPU | - |
| dc.subject | Machine Learning | - |
| dc.subject | pipeline parallelism | - |
| dc.title | PIPEMESH: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models | - |
| dc.type | Article | - |
| dc.description.nature | published_or_final_version | - |
| dc.identifier.doi | 10.1109/TPDS.2025.3583983 | - |
| dc.identifier.scopus | eid_2-s2.0-105009426329 | - |
| dc.identifier.volume | 36 | - |
| dc.identifier.issue | 9 | - |
| dc.identifier.spage | 1872 | - |
| dc.identifier.epage | 1889 | - |
| dc.identifier.eissn | 1558-2183 | - |
| dc.identifier.issnl | 1045-9219 | - |
