Conference Paper: A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

Title: A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
Authors: Liu, Mingrui; Zhuang, Zhenxun; Lei, Yunwen; Liao, Chunyang
Issue Date: 2022
Citation: Advances in Neural Information Processing Systems, 2022, v. 35
Abstract: In distributed training of deep neural networks, one usually runs Stochastic Gradient Descent (SGD) or its variants on each machine and communicates with the other machines periodically. However, SGD might converge slowly when training some deep neural networks (e.g., RNNs, LSTMs) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with a nonconvex loss function, a non-Lipschitz-continuous gradient, and skipped communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape, which LSTMs were shown to satisfy in previous work, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with the other machines after multiple steps of gradient-based updates. Our algorithm is proved to have O(1/(Nε^4)) iteration complexity and O(1/ε^3) communication complexity for finding an ε-stationary point in the homogeneous data setting, where N is the number of machines. This indicates that our algorithm enjoys linear speedup and reduced communication rounds. Our proof relies on novel analysis techniques for estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and in various scenarios demonstrate that our algorithm indeed exhibits fast convergence in practice, which validates our theory.
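The abstract describes a pattern in which each machine clips its stochastic gradient locally and the machines synchronize only after several local updates. The following Python sketch illustrates that general pattern on a toy problem; it is not the authors' exact algorithm, and the quadratic loss, step size eta, clipping threshold tau, and communication interval H are illustrative assumptions.

# Minimal sketch (assumed setup, not the paper's exact method): N simulated workers
# each take clipped stochastic-gradient steps locally and average their models
# every H iterations, i.e. "clip locally, communicate infrequently".
import numpy as np

rng = np.random.default_rng(0)
N, H, T = 4, 5, 100        # workers, local steps per communication round, total iterations
eta, tau = 0.1, 1.0        # step size and clipping threshold (assumed values)

def stoch_grad(x):
    # Noisy gradient of a toy quadratic loss f(x) = 0.5 * ||x||^2 (illustration only).
    return x + 0.1 * rng.standard_normal(x.shape)

x = [np.full(10, 5.0) for _ in range(N)]        # identical initialization on all workers
for t in range(T):
    for i in range(N):                           # local clipped update on each worker
        g = stoch_grad(x[i])
        g *= min(1.0, tau / (np.linalg.norm(g) + 1e-12))   # clip gradient norm to <= tau
        x[i] = x[i] - eta * g
    if (t + 1) % H == 0:                         # communicate only every H steps
        avg = sum(x) / N                         # model averaging across workers
        x = [avg.copy() for _ in range(N)]

print("final gradient norm:", np.linalg.norm(np.mean(x, axis=0)))

Averaging only every H of the T iterations gives roughly T/H communication rounds, which is the sense in which such a scheme trades extra local computation for fewer communications, consistent with the abstract's O(1/ε^3) communication complexity alongside O(1/(Nε^4)) iteration complexity.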
Persistent Identifier: http://hdl.handle.net/10722/329974
ISSN: 1049-5258
2020 SCImago Journal Rankings: 1.399


DC Field: Value
dc.contributor.author: Liu, Mingrui
dc.contributor.author: Zhuang, Zhenxun
dc.contributor.author: Lei, Yunwen
dc.contributor.author: Liao, Chunyang
dc.date.accessioned: 2023-08-09T03:36:54Z
dc.date.available: 2023-08-09T03:36:54Z
dc.date.issued: 2022
dc.identifier.citation: Advances in Neural Information Processing Systems, 2022, v. 35
dc.identifier.issn: 1049-5258
dc.identifier.uri: http://hdl.handle.net/10722/329974
dc.description.abstract: In distributed training of deep neural networks, one usually runs Stochastic Gradient Descent (SGD) or its variants on each machine and communicates with the other machines periodically. However, SGD might converge slowly when training some deep neural networks (e.g., RNNs, LSTMs) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single-machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with a nonconvex loss function, a non-Lipschitz-continuous gradient, and skipped communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape, which LSTMs were shown to satisfy in previous work, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with the other machines after multiple steps of gradient-based updates. Our algorithm is proved to have O(1/(Nε^4)) iteration complexity and O(1/ε^3) communication complexity for finding an ε-stationary point in the homogeneous data setting, where N is the number of machines. This indicates that our algorithm enjoys linear speedup and reduced communication rounds. Our proof relies on novel analysis techniques for estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and in various scenarios demonstrate that our algorithm indeed exhibits fast convergence in practice, which validates our theory.
dc.language: eng
dc.relation.ispartof: Advances in Neural Information Processing Systems
dc.title: A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
dc.type: Conference_Paper
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.scopus: eid_2-s2.0-85160843066
dc.identifier.volume: 35