Links for fulltext (may require subscription):
- Publisher Website (DOI): 10.1109/INFOCOM41043.2020.9155446
- Scopus: eid_2-s2.0-85090287372
Appears in Collections: Conference Paper: Preemptive All-reduce Scheduling for Expediting Distributed DNN Training
Title | Preemptive All-reduce Scheduling for Expediting Distributed DNN Training |
---|---|
Authors | Bao, Y; Peng, Y; Chen, Y; Wu, C |
Issue Date | 2020 |
Publisher | IEEE Computer Society (http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359) |
Citation | Proceedings of IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, Toronto, ON, Canada, 6-9 July 2020, p. 626-635 |
Abstract | Data-parallel training is widely used to scale DNN training over large datasets, using the parameter-server or all-reduce architecture. Communication scheduling, which aims to overlap communication with computation by ordering communication operations, has shown promise for accelerating distributed DNN training. We identify two limitations of previous communication scheduling work. First, a layer-wise computation graph has been a common assumption, while modern machine learning frameworks (e.g., TensorFlow) use a sophisticated directed acyclic graph (DAG) representation as the execution model. Second, the default tensor sizes are often suboptimal for transmission scheduling and bandwidth utilization. We propose PACE, a communication scheduler that preemptively schedules (potentially fused) all-reduce tensors based on the DAG of DNN training, guaranteeing maximal overlap of communication with computation and high bandwidth utilization. The scheduler contains two integrated modules: given a DAG, we identify the best tensor-preemptive communication schedule that minimizes training time; exploiting the optimal communication schedule as an oracle, a dynamic programming approach generates a good DAG that merges small communication tensors for efficient bandwidth utilization. Experiments on a GPU testbed show that PACE accelerates training under representative system configurations, achieving up to 36% speed-up over state-of-the-art solutions. |
Persistent Identifier | http://hdl.handle.net/10722/301418 |
ISSN | 0743-166X (2020 SCImago Journal Rankings: 1.183) |
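The tensor-fusion trade-off the abstract describes (merging small tensors amortizes per-message overhead but delays transmission until the last tensor in a group is ready) can be sketched as a small dynamic program. This is a hypothetical toy cost model for illustration only, not the paper's actual formulation: `ALPHA`, `BETA`, `ready`, and `best_fusion` are invented names, and the latency-plus-bandwidth cost is an assumed model.

```python
# Toy sketch of all-reduce tensor fusion (hypothetical model, not PACE's
# exact algorithm): gradient tensors become ready at times ready[i] in
# backprop order and are sent over a link with per-message startup latency
# ALPHA and bandwidth BETA. Fusing consecutive tensors into one message
# amortizes ALPHA, but the fused message cannot start before its last
# tensor is ready. The DP finds the partition into consecutive fused
# groups that minimizes the overall finish time.

ALPHA = 0.5   # per-message startup latency (assumed units)
BETA = 100.0  # link bandwidth (size units per time unit, assumed)

def best_fusion(ready, sizes):
    n = len(ready)
    INF = float("inf")
    # finish[i] = earliest completion time for tensors 0..i-1
    finish = [0.0] + [INF] * n
    for i in range(1, n + 1):
        for j in range(i):  # last fused group covers tensors j..i-1
            start = max(finish[j], ready[i - 1])
            cost = ALPHA + sum(sizes[j:i]) / BETA
            finish[i] = min(finish[i], start + cost)
    return finish[n]
```

With tensors ready at staggered times, fusing everything into one message delays early tensors, while sending each tensor alone pays the startup latency repeatedly; the DP balances the two, which mirrors the tension the abstract attributes to default tensor sizes being suboptimal for bandwidth utilization.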
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Bao, Y | - |
dc.contributor.author | Peng, Y | - |
dc.contributor.author | Chen, Y | - |
dc.contributor.author | Wu, C | - |
dc.date.accessioned | 2021-07-27T08:10:46Z | - |
dc.date.available | 2021-07-27T08:10:46Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Proceedings of IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, Toronto, ON, Canada, 6-9 July 2020, p. 626-635 | - |
dc.identifier.issn | 0743-166X | - |
dc.identifier.uri | http://hdl.handle.net/10722/301418 | - |
dc.language | eng | - |
dc.publisher | IEEE Computer Society. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000359 | - |
dc.relation.ispartof | IEEE INFOCOM - IEEE Conference on Computer Communications | - |
dc.rights | IEEE INFOCOM - IEEE Conference on Computer Communications. Copyright © IEEE Computer Society. | - |
dc.rights | ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | - |
dc.title | Preemptive All-reduce Scheduling for Expediting Distributed DNN Training | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/INFOCOM41043.2020.9155446 | - |
dc.identifier.scopus | eid_2-s2.0-85090287372 | - |
dc.identifier.hkuros | 323516 | - |
dc.identifier.spage | 626 | - |
dc.identifier.epage | 635 | - |
dc.publisher.place | United States | - |