Article: Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment

Title: Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment
Authors: Zhang, Shiwei; Yi, Xiaodong; Diao, Lansong; Wu, Chuan; Wang, Siyu; Lin, Wei
Keywords: Distributed systems; machine learning
Issue Date: 15-Apr-2023
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 4, p. 1281-1293
Abstract

This paper presents TAG, an automatic system that derives an optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. In a novel design, we combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically apply the technique for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it achieves up to 4.56x training speed-up compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.
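The abstract describes TAG only at a high level. As a rough intuition for how a combined graph encoding can drive a placement search, the following is a minimal, self-contained Python sketch. It is not the authors' implementation: the GNN is replaced by naive mean-aggregation message passing, the search is a one-pass greedy assignment, and all names (Op, Device, embed_graphs, greedy_placement) and the cost model are assumptions made purely for illustration.

# Hypothetical sketch (not the authors' code): encode a DNN computation graph
# together with a device topology graph via simple message passing, then use the
# node embeddings to guide a greedy op-to-device placement search.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Op:
    name: str
    flops: float                                      # relative compute cost of the operator
    preds: List[str] = field(default_factory=list)    # data-flow predecessors


@dataclass
class Device:
    name: str
    speed: float                                      # relative compute speed
    links: Dict[str, float] = field(default_factory=dict)  # bandwidth to peer devices


def embed_graphs(ops: Dict[str, Op], devices: Dict[str, Device],
                 rounds: int = 3) -> Dict[str, List[float]]:
    """Tiny message-passing encoder over the union of both graphs.

    Each node starts from a 2-dim feature (cost or speed, degree) and repeatedly
    averages its neighbors' embeddings -- a crude stand-in for the GNN in the paper.
    """
    # Undirected adjacency over op data-flow edges and device interconnect links.
    adj: Dict[str, List[str]] = {n: [] for n in list(ops) + list(devices)}
    for o in ops.values():
        for p in o.preds:
            adj[o.name].append(p)
            adj[p].append(o.name)
    for d in devices.values():
        for peer in d.links:
            adj[d.name].append(peer)
    h = {o.name: [o.flops, float(len(adj[o.name]))] for o in ops.values()}
    h.update({d.name: [d.speed, float(len(adj[d.name]))] for d in devices.values()})
    for _ in range(rounds):
        # Blend each node's embedding with the mean of its neighbors' embeddings.
        h = {n: [(v + sum(h[m][i] for m in adj[n]) / max(len(adj[n]), 1)) / 2.0
                 for i, v in enumerate(vec)]
             for n, vec in h.items()}
    return h


def greedy_placement(ops: Dict[str, Op], devices: Dict[str, Device],
                     h: Dict[str, List[float]]) -> Dict[str, str]:
    """Greedy search: assign each op to the device with the lowest estimated cost.

    The score mixes a crude compute estimate with an embedding-distance term,
    standing in for the learned cost model that guides the search in the paper.
    """
    placement: Dict[str, str] = {}
    for op in ops.values():
        def score(dev: Device) -> float:
            compute = op.flops / dev.speed
            affinity = sum((a - b) ** 2 for a, b in zip(h[op.name], h[dev.name]))
            return compute + affinity
        placement[op.name] = min(devices.values(), key=score).name
    return placement


if __name__ == "__main__":
    ops = {"conv1": Op("conv1", 4.0),
           "conv2": Op("conv2", 2.0, preds=["conv1"]),
           "fc":    Op("fc", 1.0, preds=["conv2"])}
    devices = {"gpu0": Device("gpu0", 2.0, {"gpu1": 10.0}),
               "gpu1": Device("gpu1", 1.0, {"gpu0": 10.0})}
    embeddings = embed_graphs(ops, devices)
    print(greedy_placement(ops, devices, embeddings))

In the paper, the learned GNN and the search are co-designed and the system additionally decides where to apply lossless gradient compression by solving a combinatorial optimization problem; this sketch does not model those parts.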


Persistent Identifier: http://hdl.handle.net/10722/331804
ISSN: 1045-9219
2023 Impact Factor: 5.6
2023 SCImago Journal Rankings: 2.340
ISI Accession Number ID: WOS:000942294200003

 

DC Field / Value
dc.contributor.author: Zhang, Shiwei
dc.contributor.author: Yi, Xiaodong
dc.contributor.author: Diao, Lansong
dc.contributor.author: Wu, Chuan
dc.contributor.author: Wang, Siyu
dc.contributor.author: Lin, Wei
dc.date.accessioned: 2023-09-21T06:59:04Z
dc.date.available: 2023-09-21T06:59:04Z
dc.date.issued: 2023-04-15
dc.identifier.citation: IEEE Transactions on Parallel and Distributed Systems, 2023, v. 34, n. 4, p. 1281-1293
dc.identifier.issn: 1045-9219
dc.identifier.uri: http://hdl.handle.net/10722/331804
dc.description.abstract: This paper presents TAG, an automatic system that derives an optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. In a novel design, we combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically apply the technique for training time minimization. We evaluate TAG with various representative DNN models and device topologies, showing that it achieves up to 4.56x training speed-up compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE Transactions on Parallel and Distributed Systems
dc.subject: Distributed systems
dc.subject: machine learning
dc.title: Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment
dc.type: Article
dc.identifier.doi: 10.1109/TPDS.2023.3243261
dc.identifier.scopus: eid_2-s2.0-85148426116
dc.identifier.volume: 34
dc.identifier.issue: 4
dc.identifier.spage: 1281
dc.identifier.epage: 1293
dc.identifier.eissn: 1558-2183
dc.identifier.isi: WOS:000942294200003
dc.identifier.issnl: 1045-9219
