Conference Paper: Elastic parameter server load distribution in deep learning clusters
Title | Elastic parameter server load distribution in deep learning clusters |
---|---|
Authors | Chen, Y; Peng, Y; Bao, Y; Wu, C; Zhu, Y; Guo, C |
Issue Date | 2020 |
Publisher | Association for Computing Machinery (ACM). |
Citation | Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC '20), Virtual Conference, USA, 19-21 October 2020, p. 507-521 |
Abstract | In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and big potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale in and out parameter servers and adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings. |
Persistent Identifier | http://hdl.handle.net/10722/301294 |
ISBN | 9781450381376 |
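
The abstract above describes an exploitation-exploration method that scales parameter servers (PSs) in and out and re-assigns parameters among them during training. The following is a minimal, hypothetical Python sketch of that idea only, not the authors' PSLD implementation: the function names (`per_ps_times`, `rebalance`, `scale`, `psld_like_loop`), the 5% rebalancing step, and the simulated cost model are all assumptions made here for illustration.

```python
# Hypothetical sketch (not the authors' PSLD code): an exploitation-exploration
# loop that periodically measures per-PS load, shifts parameters away from the
# straggler PS (exploitation), and occasionally tries scaling the number of PSs
# in or out (exploration), keeping whichever configuration gives the shortest
# simulated iteration time.
import random


def per_ps_times(shares, straggler_id=0, slowdown=2.0):
    """Simulated per-PS communication time: proportional to the share of
    parameters hosted, with one PS slowed down to act as a straggler."""
    return [s * (slowdown if i == straggler_id else 1.0) + abs(random.gauss(0, 0.005))
            for i, s in enumerate(shares)]


def iteration_time(shares):
    # Synchronous training waits for the slowest parameter server.
    return max(per_ps_times(shares))


def rebalance(shares, times, step=0.05):
    """Exploitation: move a small slice of parameters from the slowest PS
    to the fastest PS."""
    slow = max(range(len(shares)), key=times.__getitem__)
    fast = min(range(len(shares)), key=times.__getitem__)
    shares = list(shares)
    delta = min(step, shares[slow])
    shares[slow] -= delta
    shares[fast] += delta
    return shares


def scale(shares):
    """Exploration: add or remove one PS and redistribute parameters evenly."""
    n = max(1, len(shares) + random.choice([-1, 1]))
    return [1.0 / n] * n


def psld_like_loop(num_servers=4, rounds=50, explore_prob=0.2):
    shares = [1.0 / num_servers] * num_servers
    best_shares, best_time = shares, iteration_time(shares)
    for _ in range(rounds):
        if random.random() < explore_prob:
            candidate = scale(best_shares)            # try a different PS count
        else:
            candidate = rebalance(best_shares, per_ps_times(best_shares))
        t = iteration_time(candidate)
        if t < best_time:                             # keep the faster configuration
            best_shares, best_time = candidate, t
    return best_shares, best_time


if __name__ == "__main__":
    random.seed(0)
    shares, t = psld_like_loop()
    print("final parameter shares per PS:", [round(s, 3) for s in shares])
    print("simulated iteration time:", round(t, 4))
```

In a real deployment the per-PS timings would come from measured push/pull latencies rather than a simulated cost model, and scaling in or out would presumably go through the elastic PS scaling module the paper describes on top of MXNet and BytePS, so that reconfiguration causes little interruption to training.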
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Chen, Y | - |
dc.contributor.author | Peng, Y | - |
dc.contributor.author | Bao, Y | - |
dc.contributor.author | Wu, C | - |
dc.contributor.author | Zhu, Y | - |
dc.contributor.author | Guo, C | - |
dc.date.accessioned | 2021-07-27T08:08:59Z | - |
dc.date.available | 2021-07-27T08:08:59Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC '20), Virtual Conference, USA, 19-21 October 2020, p. 507-521 | -
dc.identifier.isbn | 9781450381376 | - |
dc.identifier.uri | http://hdl.handle.net/10722/301294 | - |
dc.description.abstract | In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and big potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. An exploitation-exploration method is carefully designed to scale in and out parameter servers and adjust parameter distribution among PSs on the go. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x speed-up in model training with PSLD, for different ML models under various straggler settings. | - |
dc.language | eng | - |
dc.publisher | Association for Computing Machinery (ACM). | - |
dc.relation.ispartof | Proceedings of the 11th ACM Symposium on Cloud Computing | - |
dc.title | Elastic parameter server load distribution in deep learning clusters | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Wu, C: cwu@cs.hku.hk | - |
dc.identifier.authority | Wu, C=rp01397 | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1145/3419111.3421307 | - |
dc.identifier.scopus | eid_2-s2.0-85095426145 | - |
dc.identifier.hkuros | 323513 | - |
dc.identifier.spage | 507 | - |
dc.identifier.epage | 521 | - |
dc.publisher.place | New York, NY | - |