Conference Paper: Elastic parameter server load distribution in deep learning clusters

Title: Elastic parameter server load distribution in deep learning clusters
Authors: Chen, Y; Peng, Y; Bao, Y; Wu, C; Zhu, Y; Guo, C
Issue Date: 2020
Publisher: Association for Computing Machinery (ACM)
Citation: Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC '20), Virtual Conference, USA, 19-21 October 2020, p. 507-521
Abstract: In distributed DNN training, parameter servers (PSs) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and considerable potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and to adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings.
Persistent Identifier: http://hdl.handle.net/10722/301294
ISBN: 9781450381376
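
The abstract above describes shifting parameter (load) assignments away from straggling PSs. As a purely illustrative sketch, and not the PSLD algorithm from the paper nor part of this record, the following Python snippet shows one simple, hypothetical way to bias parameter-block placement by measured PS speed; all names (assign_params, param_sizes, ps_speeds) are invented for illustration.

```python
# Illustrative sketch only: greedy, speed-weighted assignment of parameter
# blocks to parameter servers (PSs). This is NOT the paper's PSLD scheme;
# it only conveys the general idea of giving slower PSs less load.

from typing import Dict, List


def assign_params(param_sizes: Dict[str, int],
                  ps_speeds: Dict[str, float]) -> Dict[str, List[str]]:
    """Assign each parameter block to the PS with the lowest normalized load,
    where load is accumulated bytes divided by the PS's measured speed."""
    assignment: Dict[str, List[str]] = {ps: [] for ps in ps_speeds}
    load = {ps: 0.0 for ps in ps_speeds}  # normalized load per PS

    # Place large blocks first so they dominate the balancing decision.
    for name, size in sorted(param_sizes.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)          # least-loaded PS so far
        assignment[target].append(name)
        load[target] += size / ps_speeds[target]  # slower PS => higher cost

    return assignment


if __name__ == "__main__":
    sizes = {"fc1.weight": 4_000_000, "fc2.weight": 1_000_000, "conv1.weight": 500_000}
    speeds = {"ps0": 1.0, "ps1": 0.5}  # ps1 is a straggler at half speed
    print(assign_params(sizes, speeds))
```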

 

DC Field | Value | Language
dc.contributor.author | Chen, Y | -
dc.contributor.author | Peng, Y | -
dc.contributor.author | Bao, Y | -
dc.contributor.author | Wu, C | -
dc.contributor.author | Zhu, Y | -
dc.contributor.author | Guo, C | -
dc.date.accessioned | 2021-07-27T08:08:59Z | -
dc.date.available | 2021-07-27T08:08:59Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC '20), Virtual Conference, USA, 19-21 October 2020, p. 507-521 | -
dc.identifier.isbn | 9781450381376 | -
dc.identifier.uri | http://hdl.handle.net/10722/301294 | -
dc.description.abstract | In distributed DNN training, parameter servers (PSs) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and considerable potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale parameter servers in and out and to adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings. | -
dc.language | eng | -
dc.publisher | Association for Computing Machinery (ACM) | -
dc.relation.ispartof | Proceedings of the 11th ACM Symposium on Cloud Computing | -
dc.title | Elastic parameter server load distribution in deep learning clusters | -
dc.type | Conference_Paper | -
dc.identifier.email | Wu, C: cwu@cs.hku.hk | -
dc.identifier.authority | Wu, C=rp01397 | -
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1145/3419111.3421307 | -
dc.identifier.scopus | eid_2-s2.0-85095426145 | -
dc.identifier.hkuros | 323513 | -
dc.identifier.spage | 507 | -
dc.identifier.epage | 521 | -
dc.publisher.place | New York, NY | -
