Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment

Zhang, Shiwei; Yi, Xiaodong; Diao, Lansong; Wu, Chuan; Wang, Siyu; Lin, Wei

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1109/TPDS.2023.3243261
Scopus: eid_2-s2.0-85148426116
WOS: WOS:000942294200003

Supplementary

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Computer Science: Journal/Magazine Articles

Article: Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment

Title	Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment
Authors	Zhang, Shiwei; Yi, Xiaodong; Diao, Lansong; Wu, Chuan; Wang, Siyu; Lin, Wei
Keywords	Distributed systems; machine learning
Issue Date	15-Apr-2023
Publisher	Institute of Electrical and Electronics Engineers
Citation	IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 4, p. 1281-1293. DOI: http://dx.doi.org/10.1109/TPDS.2023.3243261
Abstract	This paper presents TAG, an automatic system to derive optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. We novelly combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically apply the technique for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it can achieve up to 4.56x training speed-up as compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.
Persistent Identifier	http://hdl.handle.net/10722/331804
ISSN	1045-9219 (2023 Impact Factor: 5.6; 2023 SCImago Journal Rankings: 2.340)
ISI Accession Number ID	WOS:000942294200003

DC Field	Value	Language
dc.contributor.author	Zhang, Shiwei	-
dc.contributor.author	Yi, Xiaodong	-
dc.contributor.author	Diao, Lansong	-
dc.contributor.author	Wu, Chuan	-
dc.contributor.author	Wang, Siyu	-
dc.contributor.author	Lin, Wei	-
dc.date.accessioned	2023-09-21T06:59:04Z	-
dc.date.available	2023-09-21T06:59:04Z	-
dc.date.issued	2023-04-15	-
dc.identifier.citation	IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 4, p. 1281-1293	-
dc.identifier.issn	1045-9219	-
dc.identifier.uri	http://hdl.handle.net/10722/331804	-
dc.description.abstract	This paper presents TAG, an automatic system to derive optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. We novelly combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically apply the technique for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it can achieve up to 4.56x training speed-up as compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.	-
dc.language	eng	-
dc.publisher	Institute of Electrical and Electronics Engineers	-
dc.relation.ispartof	IEEE Transactions on Parallel and Distributed Systems	-
dc.subject	Distributed systems	-
dc.subject	machine learning	-
dc.title	Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment	-
dc.type	Article	-
dc.identifier.doi	10.1109/TPDS.2023.3243261	-
dc.identifier.scopus	eid_2-s2.0-85148426116	-
dc.identifier.volume	34	-
dc.identifier.issue	4	-
dc.identifier.spage	1281	-
dc.identifier.epage	1293	-
dc.identifier.eissn	1558-2183	-
dc.identifier.isi	WOS:000942294200003	-
dc.identifier.issnl	1045-9219	-
