Conference Paper: HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
Title | HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees |
---|---|
Authors | Zhao, H; Han, Z; Yang, Z; Zhang, Q; Yang, F; Zhou, L; Yang, M; Lau, FCM; Wang, Y; Xiong, Y; Wang, B |
Issue Date | 2020 |
Publisher | The USENIX Association. |
Citation | Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20), Banff, Alberta, Canada, 4-6 November 2020, p. 515-532 |
Abstract | Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe severe sharing anomaly in production multi-tenant clusters where jobs in some tenants experience worse queuing delay than they would have in a private cluster with their allocated shares of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomaly would never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing schedulers within each VC to achieve their respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in a physical cluster. A straightforward extension of buddy cell allocation can further support low-priority jobs to scavenge the unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety. |
Persistent Identifier | http://hdl.handle.net/10722/293458 |
ISBN | 9781939133199 |
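
The abstract above centers on a buddy cell allocation algorithm, which manages cells at multiple affinity levels by splitting larger cells on demand and coalescing freed buddies back together. The following is a minimal sketch of the general buddy-allocation idea only, under simplifying assumptions (a fixed split factor of 2, three illustrative levels, and invented class and method names); it is not HiveD's actual implementation, which additionally manages the dynamic binding of VC cells to physical cells.

```python
# A minimal sketch of buddy cell allocation. The hierarchy, split factor,
# and API here are illustrative assumptions, not HiveD's real code:
# e.g., level 3 = 8-GPU node, level 2 = 4-GPU socket, level 1 = 1 GPU.
from collections import defaultdict

SPLIT_FACTOR = 2  # each cell splits into 2 lower-level buddies (assumption)

class Cell:
    def __init__(self, level, parent=None):
        self.level = level
        self.parent = parent
        self.children = []

class BuddyAllocator:
    def __init__(self, top_level, num_top_cells):
        # free_list[k] holds the currently free level-k cells
        self.free_list = defaultdict(list)
        self.free_list[top_level] = [Cell(top_level) for _ in range(num_top_cells)]
        self.top_level = top_level

    def allocate(self, level):
        """Return a free level-`level` cell, splitting a higher cell if needed."""
        if self.free_list[level]:
            return self.free_list[level].pop()
        if level >= self.top_level:
            raise RuntimeError("no free cell at or above the requested level")
        parent = self.allocate(level + 1)  # recursively claim a parent cell
        parent.children = [Cell(level, parent) for _ in range(SPLIT_FACTOR)]
        self.free_list[level].extend(parent.children[1:])  # buddies stay free
        return parent.children[0]

    def release(self, cell):
        """Free a cell; merge buddies into the parent once all are free."""
        parent = cell.parent
        if parent is not None and all(
            c is cell or c in self.free_list[cell.level] for c in parent.children
        ):
            for c in parent.children:  # all buddies free: coalesce upward
                if c is not cell:
                    self.free_list[cell.level].remove(c)
            parent.children = []
            self.release(parent)
        else:
            self.free_list[cell.level].append(cell)

# Example: one 8-GPU cell (level 3) serves a 2-GPU and a 1-GPU request,
# then both releases merge everything back into a single level-3 cell.
alloc = BuddyAllocator(top_level=3, num_top_cells=1)
pair = alloc.allocate(2)    # splits the level-3 cell once
single = alloc.allocate(1)  # splits the remaining level-2 buddy
alloc.release(single)
alloc.release(pair)         # buddies coalesce back to one level-3 cell
```

The key property this illustrates is why such an allocator helps preserve affinity guarantees: a large cell is split only when no suitably sized free cell exists, and freed buddies are eagerly coalesced, so high-affinity (large) cells remain available for the tenants whose VCs reserve them.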
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhao, H | - |
dc.contributor.author | Han, Z | -
dc.contributor.author | Yang, Z | - |
dc.contributor.author | Zhang, Q | - |
dc.contributor.author | Yang, F | - |
dc.contributor.author | Zhou, L | - |
dc.contributor.author | Yang, M | - |
dc.contributor.author | Lau, FCM | - |
dc.contributor.author | Wang, Y | - |
dc.contributor.author | Xiong, Y | - |
dc.contributor.author | Wang, B | - |
dc.date.accessioned | 2020-11-23T08:17:03Z | - |
dc.date.available | 2020-11-23T08:17:03Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20), Banff, Alberta, Canada, 4-6 November 2020, p. 515-532 | - |
dc.identifier.isbn | 9781939133199 | - |
dc.identifier.uri | http://hdl.handle.net/10722/293458 | - |
dc.description.abstract | Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe severe sharing anomaly in production multi-tenant clusters where jobs in some tenants experience worse queuing delay than they would have in a private cluster with their allocated shares of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomaly would never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing schedulers within each VC to achieve their respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in a physical cluster. A straightforward extension of buddy cell allocation can further support low-priority jobs to scavenge the unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety. | - |
dc.language | eng | - |
dc.publisher | The USENIX Association. | - |
dc.relation.ispartof | The 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20) | - |
dc.title | HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Lau, FCM: fcmlau@cs.hku.hk | - |
dc.identifier.authority | Lau, FCM=rp00221 | - |
dc.identifier.hkuros | 319180 | - |
dc.identifier.spage | 515 | - |
dc.identifier.epage | 532 | - |