Conference Paper: Characterizing Deep Learning Training Workloads on Alibaba-PAI

Title: Characterizing Deep Learning Training Workloads on Alibaba-PAI
Authors: Wang, M; Meng, C; Long, G; Wu, C; Yang, J; Lin, W; Jia, Y
Issue Date: 2019
Publisher: IEEE. The Journal's web site is located at https://ieeexplore.ieee.org/xpl/conhome/1000819/all-proceedings
Citation: Proceedings of 2019 IEEE International Symposium on Workload Characterization (IISWC), Orlando, FL, USA, 3-5 November 2019, p. 189-202
Abstract: Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate detailed execution time breakdown of various workloads using different training architectures, to identify performance bottlenecks. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, is not the biggest bottleneck based on collective behavior of the workloads. We further evaluate attainable performance of the workloads on various potential software/hardware mappings, and explore implications on software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can be potentially sped up when ported to the AllReduce architecture exploiting the high-speed NVLink for GPU interconnect, and on average 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.
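The PS/Worker-versus-AllReduce comparison in the abstract can be illustrated with a toy per-iteration communication-cost model. This sketch is not from the paper: the model size, worker count, and the cost formulas (standard textbook approximations for parameter-server and ring-AllReduce traffic) are all illustrative assumptions.

```python
# Back-of-the-envelope model of per-iteration gradient-communication time,
# illustrating why AllReduce and faster Ethernet help communication-bound
# training. Numbers and formulas are illustrative assumptions, not measurements
# from the paper.

def ps_comm_time(model_bytes, n_workers, bw_bytes_per_s):
    # PS/Worker: every worker pushes gradients and pulls updated weights,
    # so the parameter-server link carries n_workers full model copies each way.
    return 2 * n_workers * model_bytes / bw_bytes_per_s

def ring_allreduce_comm_time(model_bytes, n_workers, bw_bytes_per_s):
    # Ring AllReduce: each worker sends and receives 2*(N-1)/N of the model
    # in total, independent of the number of workers as N grows.
    return 2 * (n_workers - 1) / n_workers * model_bytes / bw_bytes_per_s

gbps = lambda g: g * 1e9 / 8   # Gbps -> bytes/s

model = 100e6                  # assumed 100 MB of gradients per iteration
t25 = ring_allreduce_comm_time(model, 8, gbps(25))
t100 = ring_allreduce_comm_time(model, 8, gbps(100))
print(f"AllReduce at 25 Gbps: {t25*1e3:.1f} ms, at 100 Gbps: {t100*1e3:.1f} ms")
```

In this pure-communication model the 25-to-100 Gbps upgrade gives a 4X reduction in transfer time; the abstract's 1.7X end-to-end figure is smaller because computation and memory access take the remaining share of each iteration.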
Persistent Identifier: http://hdl.handle.net/10722/301419
ISBN: 9781728140469

 

DC Field: Value
dc.contributor.author: Wang, M
dc.contributor.author: Meng, C
dc.contributor.author: Long, G
dc.contributor.author: Wu, C
dc.contributor.author: Yang, J
dc.contributor.author: Lin, W
dc.contributor.author: Jia, Y
dc.date.accessioned: 2021-07-27T08:10:47Z
dc.date.available: 2021-07-27T08:10:47Z
dc.date.issued: 2019
dc.identifier.citation: Proceedings of 2019 IEEE International Symposium on Workload Characterization (IISWC), Orlando, FL, USA, 3-5 November 2019, p. 189-202
dc.identifier.isbn: 9781728140469
dc.identifier.uri: http://hdl.handle.net/10722/301419
dc.description.abstract: Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate detailed execution time breakdown of various workloads using different training architectures, to identify performance bottlenecks. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, is not the biggest bottleneck based on collective behavior of the workloads. We further evaluate attainable performance of the workloads on various potential software/hardware mappings, and explore implications on software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can be potentially sped up when ported to the AllReduce architecture exploiting the high-speed NVLink for GPU interconnect, and on average 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.
dc.language: eng
dc.publisher: IEEE. The Journal's web site is located at https://ieeexplore.ieee.org/xpl/conhome/1000819/all-proceedings
dc.relation.ispartof: IEEE International Symposium on Workload Characterization (IISWC)
dc.rights: IEEE International Symposium on Workload Characterization (IISWC). Copyright © IEEE.
dc.rights: ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
dc.title: Characterizing Deep Learning Training Workloads on Alibaba-PAI
dc.type: Conference_Paper
dc.identifier.email: Wu, C: cwu@cs.hku.hk
dc.identifier.authority: Wu, C=rp01397
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1109/IISWC47752.2019.9042047
dc.identifier.scopus: eid_2-s2.0-85083107261
dc.identifier.hkuros: 323518
dc.identifier.spage: 189
dc.identifier.epage: 202
dc.publisher.place: United States
