Conference Paper: Elastic parameter server load distribution in deep learning clusters
Title | Elastic parameter server load distribution in deep learning clusters |
---|---|
Authors | Chen, Y; Peng, Y; Bao, Y; Wu, C; Zhu, Y; Guo, C |
Issue Date | 2020 |
Publisher | Association for Computing Machinery (ACM). |
Citation | Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC '20), Virtual Conference, USA, 19-21 October 2020, p. 507-521 |
Abstract | In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and big potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale in and out parameter servers and adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings. |
Persistent Identifier | http://hdl.handle.net/10722/301294 |
ISBN | 9781450381376 |
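
The abstract above describes an exploitation-exploration method that scales parameter servers (PSs) in and out and re-assigns parameters among them during training. The following is a minimal, hypothetical Python sketch of that idea only, not the authors' PSLD implementation: the function names (`per_ps_times`, `rebalance`, `scale`, `psld_like_loop`), the 5% rebalancing step, and the simulated cost model are all assumptions made here for illustration.

```python
# Hypothetical sketch (not the authors' PSLD code): an exploitation-exploration
# loop that periodically measures per-PS load, shifts parameters away from the
# straggler PS (exploitation), and occasionally tries scaling the number of PSs
# in or out (exploration), keeping whichever configuration gives the shortest
# simulated iteration time.
import random


def per_ps_times(shares, straggler_id=0, slowdown=2.0):
    """Simulated per-PS communication time: proportional to the share of
    parameters hosted, with one PS slowed down to act as a straggler."""
    return [s * (slowdown if i == straggler_id else 1.0) + abs(random.gauss(0, 0.005))
            for i, s in enumerate(shares)]


def iteration_time(shares):
    # Synchronous training waits for the slowest parameter server.
    return max(per_ps_times(shares))


def rebalance(shares, times, step=0.05):
    """Exploitation: move a small slice of parameters from the slowest PS
    to the fastest PS."""
    slow = max(range(len(shares)), key=times.__getitem__)
    fast = min(range(len(shares)), key=times.__getitem__)
    shares = list(shares)
    delta = min(step, shares[slow])
    shares[slow] -= delta
    shares[fast] += delta
    return shares


def scale(shares):
    """Exploration: add or remove one PS and redistribute parameters evenly."""
    n = max(1, len(shares) + random.choice([-1, 1]))
    return [1.0 / n] * n


def psld_like_loop(num_servers=4, rounds=50, explore_prob=0.2):
    shares = [1.0 / num_servers] * num_servers
    best_shares, best_time = shares, iteration_time(shares)
    for _ in range(rounds):
        if random.random() < explore_prob:
            candidate = scale(best_shares)            # try a different PS count
        else:
            candidate = rebalance(best_shares, per_ps_times(best_shares))
        t = iteration_time(candidate)
        if t < best_time:                             # keep the faster configuration
            best_shares, best_time = candidate, t
    return best_shares, best_time


if __name__ == "__main__":
    random.seed(0)
    shares, t = psld_like_loop()
    print("final parameter shares per PS:", [round(s, 3) for s in shares])
    print("simulated iteration time:", round(t, 4))
```

In a real deployment the per-PS timings would come from measured push/pull latencies rather than a simulated cost model, and scaling in or out would presumably go through the elastic PS scaling module the paper describes on top of MXNet and BytePS, so that reconfiguration causes little interruption to training.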
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Chen, Y | - |
dc.contributor.author | Peng, Y | - |
dc.contributor.author | Bao, Y | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Zhu, Y | - |
dc.contributor.author | Guo, C | - |
dc.date.accessioned | 2021-07-27T08:08:59Z | - |
dc.date.available | 2021-07-27T08:08:59Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC '20), Virtual Conference, USA, 19-21 October 2020, p. 507-521 | -
dc.identifier.isbn | 9781450381376 | - |
dc.identifier.uri | http://hdl.handle.net/10722/301294 | - |
dc.description.abstract | In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and big potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale in and out parameter servers and adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings. | - |
dc.language | eng | - |
dc.publisher | Association for Computing Machinery (ACM). | - |
dc.relation.ispartof | Proceedings of the 11th ACM Symposium on Cloud Computing | - |
dc.title | Elastic parameter server load distribution in deep learning clusters | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1145/3419111.3421307 | - |
dc.identifier.scopus | eid_2-s2.0-85095426145 | - |
dc.identifier.hkuros | 323513 | - |
dc.identifier.spage | 507 | - |
dc.identifier.epage | 521 | - |
dc.publisher.place | New York, NY | - |