Links for fulltext (may require subscription):
- Publisher website (DOI): https://doi.org/10.1145/3437801.3441593
- Scopus: eid_2-s2.0-85101713868
Conference Paper: DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
Title | DAPPLE: A Pipelined Data Parallel Approach for Training Large Models |
---|---|
Authors | Fan, S; Rong, Y; Meng, C; Cao, Z; Wang, S; Zheng, Z; Wu, C; Long, G; Yang, J; Xia, L; Diao, L; Liu, X; Lin, W |
Keywords | deep learning; data parallelism; pipeline parallelism; hybrid parallelism |
Issue Date | 2021 |
Publisher | Association for Computing Machinery (ACM) |
Citation | Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '21), Virtual Conference, Republic of Korea, 27 February 2021, p. 431-445 |
Abstract | It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to re-computation approach and does not come at the expense of training throughput. Experiments show that DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23× speedup under synchronous training scenarios, and DAPPLE runtime outperforms GPipe by 1.6× speedup of training throughput and saves 12% of memory consumption at the same time. |
Persistent Identifier | http://hdl.handle.net/10722/301415 |
ISBN | 9781450382946 |
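The abstract contrasts DAPPLE's runtime schedule against GPipe's memory behavior. As a minimal illustrative sketch only (this is not the paper's implementation, and the `warmup` parameter is a hypothetical stand-in for the number of micro-batches injected before steady state), the following toy model shows why an early-backward schedule bounds the number of live micro-batch activations on a pipeline stage, whereas a GPipe-style schedule, which runs all forwards before any backward, keeps every micro-batch's activations alive at once:

```python
# Toy comparison of peak in-flight activations on one pipeline stage.
# GPipe-style: all forward passes, then all backward passes.
# Early-backward ("one-forward-one-backward") style: after a short
# warmup, each forward is immediately followed by a backward that
# frees one micro-batch's activations.

def peak_activations_gpipe(num_micro_batches: int) -> int:
    # All forwards run first, so every micro-batch is live at the peak.
    live = peak = 0
    for _ in range(num_micro_batches):   # forward phase
        live += 1
        peak = max(peak, live)
    for _ in range(num_micro_batches):   # backward phase frees memory
        live -= 1
    return peak

def peak_activations_early_backward(num_micro_batches: int,
                                    warmup: int) -> int:
    # Inject only `warmup` micro-batches up front, then alternate
    # one forward with one backward, bounding live memory.
    live = peak = 0
    injected = min(warmup, num_micro_batches)
    for _ in range(injected):            # warmup forwards
        live += 1
        peak = max(peak, live)
    for _ in range(num_micro_batches - injected):
        live += 1                        # one forward
        peak = max(peak, live)
        live -= 1                        # one backward frees a batch
    return peak

print(peak_activations_gpipe(8))              # 8
print(peak_activations_early_backward(8, 2))  # 3
```

With 8 micro-batches, the GPipe-style schedule peaks at 8 live activation sets, while the early-backward schedule peaks at warmup + 1; this bounded-memory property is the intuition behind scheduling backward passes early, independent of activation re-computation.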
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Fan, S | - |
dc.contributor.author | Rong, Y | - |
dc.contributor.author | Meng, C | - |
dc.contributor.author | Cao, Z | - |
dc.contributor.author | Wang, S | - |
dc.contributor.author | Zheng, Z | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Long, G | - |
dc.contributor.author | Yang, J | - |
dc.contributor.author | Xia, L | - |
dc.contributor.author | Diao, L | - |
dc.contributor.author | Liu, X | - |
dc.contributor.author | Lin, W | - |
dc.date.accessioned | 2021-07-27T08:10:44Z | - |
dc.date.available | 2021-07-27T08:10:44Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '21), Virtual Conference, Republic of Korea, 27 February 2021, p. 431-445 | - |
dc.identifier.isbn | 9781450382946 | - |
dc.identifier.uri | http://hdl.handle.net/10722/301415 | - |
dc.description.abstract | It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to re-computation approach and does not come at the expense of training throughput. Experiments show that DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23× speedup under synchronous training scenarios, and DAPPLE runtime outperforms GPipe by 1.6× speedup of training throughput and saves 12% of memory consumption at the same time. | - |
dc.language | eng | - |
dc.publisher | Association for Computing Machinery (ACM) | - |
dc.relation.ispartof | Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | - |
dc.subject | deep learning | - |
dc.subject | data parallelism | - |
dc.subject | pipeline parallelism | - |
dc.subject | hybrid parallelism | - |
dc.title | DAPPLE: A Pipelined Data Parallel Approach for Training Large Models | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1145/3437801.3441593 | - |
dc.identifier.scopus | eid_2-s2.0-85101713868 | - |
dc.identifier.hkuros | 323510 | - |
dc.identifier.spage | 431 | - |
dc.identifier.epage | 445 | - |
dc.publisher.place | New York, NY | - |