
Conference Paper: SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training

Title: SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training
Authors: Chen, Y; Xie, C; Ma, M; Gu, J; Peng, Y; Lin, H; Wu, C; Zhu, Y
Issue Date: 2022
Publisher: Curran Associates, Inc.
Citation: Thirty-Sixth Conference on Neural Information Processing Systems, New Orleans, Louisiana, United States, November 28, 2022
Abstract: Data parallelism across multiple machines is widely adopted for accelerating distributed deep learning, but it is hard to achieve linear speedup due to heavy communication. In this paper, we propose SAPipe, a performant system that pushes the training speed of data parallelism to its fullest extent. By introducing partial staleness, SAPipe overlaps communication with computation while keeping staleness minimal. To mitigate the additional problems incurred by staleness, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the stale training pipeline and staleness compensation. We have implemented SAPipe in the BytePS framework, compatible with both TensorFlow and PyTorch. Our experiments show that SAPipe achieves up to 157% speedup over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%.
Persistent Identifier: http://hdl.handle.net/10722/320327
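
The abstract names two staleness compensation techniques, weight prediction and delay compensation, without spelling them out. Below is a minimal, hypothetical Python sketch of both ideas on a toy quadratic loss with a one-step delay. It is not SAPipe's implementation: the delay-compensation term follows the common DC-ASGD-style element-wise approximation, and all function names and the lam coefficient are assumptions made only for illustration.

import torch

def predict_weights(w, velocity, lr):
    # Weight prediction (hypothetical form): before computing gradients on
    # weights that will soon receive a delayed update, extrapolate where the
    # weights are heading using a simple momentum-style step.
    return w - lr * velocity

def compensate_gradient(grad, w_now, w_stale, lam=0.5):
    # Delay compensation (DC-ASGD-style approximation): correct a one-step-old
    # gradient for the weight drift accumulated during the delay, using the
    # element-wise product g * g as a cheap stand-in for the Hessian.
    return grad + lam * grad * grad * (w_now - w_stale)

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
velocity = torch.zeros(4)
lr = 0.1
stale = None  # (weights snapshot, gradient) produced in the previous step

for step in range(20):
    # Compute this step's gradient at the predicted weights; the update it
    # produces will only be applied one step later, which is the staleness.
    w_pred = predict_weights(w.detach(), velocity, lr).requires_grad_(True)
    loss = (w_pred ** 2).sum()
    grad = torch.autograd.grad(loss, w_pred)[0]

    if stale is not None:
        # Apply the previous step's gradient after compensating it for the
        # drift between the weights it saw and the current weights.
        w_stale, g_stale = stale
        g = compensate_gradient(g_stale, w.detach(), w_stale)
        velocity = 0.9 * velocity + g
        with torch.no_grad():
            w -= lr * velocity

    stale = (w_pred.detach(), grad)

print("final loss:", (w.detach() ** 2).sum().item())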

 

DC Field: Value
dc.contributor.author: Chen, Y
dc.contributor.author: Xie, C
dc.contributor.author: Ma, M
dc.contributor.author: Gu, J
dc.contributor.author: Peng, Y
dc.contributor.author: Lin, H
dc.contributor.author: Wu, C
dc.contributor.author: Zhu, Y
dc.date.accessioned: 2022-10-21T07:51:14Z
dc.date.available: 2022-10-21T07:51:14Z
dc.date.issued: 2022
dc.identifier.citation: Thirty-Sixth Conference on Neural Information Processing Systems, New Orleans, Louisiana, United States, November 28, 2022
dc.identifier.uri: http://hdl.handle.net/10722/320327
dc.description.abstract: Data parallelism across multiple machines is widely adopted for accelerating distributed deep learning, but it is hard to achieve linear speedup due to heavy communication. In this paper, we propose SAPipe, a performant system that pushes the training speed of data parallelism to its fullest extent. By introducing partial staleness, SAPipe overlaps communication with computation while keeping staleness minimal. To mitigate the additional problems incurred by staleness, SAPipe adopts staleness compensation techniques, including weight prediction and delay compensation, with provably lower error bounds. Additionally, SAPipe presents an algorithm-system co-design with runtime optimization to minimize the system overhead of the stale training pipeline and staleness compensation. We have implemented SAPipe in the BytePS framework, compatible with both TensorFlow and PyTorch. Our experiments show that SAPipe achieves up to 157% speedup over BytePS (non-stale) and outperforms PipeSGD in accuracy by up to 13.7%.
dc.language: eng
dc.publisher: Curran Associates, Inc.
dc.title: SAPipe: Staleness-Aware Pipeline for Data Parallel DNN Training
dc.type: Conference_Paper
dc.identifier.email: Wu, C: cwu@cs.hku.hk
dc.identifier.authority: Wu, C=rp01397
dc.identifier.hkuros: 340524
dc.publisher.place: United States
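
The abstract's central system idea, overlapping gradient communication with the next step's computation by tolerating one step of staleness, can also be sketched without any real framework. The toy Python sketch below uses a background thread as a stand-in for asynchronous all-reduce; every name in it is hypothetical, and it is not BytePS or SAPipe code.

import time
from concurrent.futures import ThreadPoolExecutor

def fake_allreduce(grad, delay=0.05):
    # Stand-in for gradient synchronization across workers; a real system
    # would average the gradient over all workers here.
    time.sleep(delay)
    return grad

def compute_gradient(w):
    # Stand-in for forward/backward on one mini-batch (loss = w^2 / 2).
    time.sleep(0.05)
    return w

w, lr = 10.0, 0.1
pending = None  # future holding the gradient still being "communicated"

with ThreadPoolExecutor(max_workers=1) as pool:
    for step in range(10):
        grad = compute_gradient(w)        # computation for step t ...
        if pending is not None:
            w -= lr * pending.result()    # ... overlaps step t-1's communication
        pending = pool.submit(fake_allreduce, grad)
    w -= lr * pending.result()            # drain the last pending update

print(f"final w = {w:.4f}")

Because each gradient is applied one iteration after it is computed, its communication runs concurrently with the next iteration's computation instead of blocking it; the one-step lag is exactly the staleness that the compensation techniques above address.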
