Conference Paper: HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
Title | HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees |
---|---|
Authors | Zhao, H; Han, Z; Yang, Z; Zhang, Q; Yang, F; Zhou, L; Yang, M; Lau, FCM; Wang, Y; Xiong, Y; Wang, B |
Issue Date | 2020 |
Publisher | The USENIX Association. |
Citation | Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20), Banff, Alberta, Canada, 4-6 November 2020, p. 515-532 |
Abstract | Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe severe sharing anomaly in production multi-tenant clusters where jobs in some tenants experience worse queuing delay than they would have in a private cluster with their allocated shares of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomaly would never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing schedulers within each VC to achieve their respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in a physical cluster. A straightforward extension of buddy cell allocation can further support low-priority jobs to scavenge the unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety. |
Persistent Identifier | http://hdl.handle.net/10722/293458 |
ISBN | 9781939133199 |
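
The abstract above centers on a buddy cell allocation algorithm, which manages cells at multiple affinity levels by splitting larger cells on demand and coalescing freed buddies back together. The following is a minimal sketch of the general buddy-allocation idea only, under simplifying assumptions (a fixed split factor of 2, three illustrative levels, and invented class and method names); it is not HiveD's actual implementation, which additionally manages the dynamic binding of VC cells to physical cells.

```python
# A minimal sketch of buddy cell allocation. The hierarchy, split factor,
# and API here are illustrative assumptions, not HiveD's real code:
# e.g., level 3 = 8-GPU node, level 2 = 4-GPU socket, level 1 = 1 GPU.
from collections import defaultdict

SPLIT_FACTOR = 2  # each cell splits into 2 lower-level buddies (assumption)

class Cell:
    def __init__(self, level, parent=None):
        self.level = level
        self.parent = parent
        self.children = []

class BuddyAllocator:
    def __init__(self, top_level, num_top_cells):
        # free_list[k] holds the currently free level-k cells
        self.free_list = defaultdict(list)
        self.free_list[top_level] = [Cell(top_level) for _ in range(num_top_cells)]
        self.top_level = top_level

    def allocate(self, level):
        """Return a free level-`level` cell, splitting a higher cell if needed."""
        if self.free_list[level]:
            return self.free_list[level].pop()
        if level >= self.top_level:
            raise RuntimeError("no free cell at or above the requested level")
        parent = self.allocate(level + 1)  # recursively claim a parent cell
        parent.children = [Cell(level, parent) for _ in range(SPLIT_FACTOR)]
        self.free_list[level].extend(parent.children[1:])  # buddies stay free
        return parent.children[0]

    def release(self, cell):
        """Free a cell; merge buddies into the parent once all are free."""
        parent = cell.parent
        if parent is not None and all(
            c is cell or c in self.free_list[cell.level] for c in parent.children
        ):
            for c in parent.children:  # all buddies free: coalesce upward
                if c is not cell:
                    self.free_list[cell.level].remove(c)
            parent.children = []
            self.release(parent)
        else:
            self.free_list[cell.level].append(cell)

# Example: one 8-GPU cell (level 3) serves a 2-GPU and a 1-GPU request,
# then both releases merge everything back into a single level-3 cell.
alloc = BuddyAllocator(top_level=3, num_top_cells=1)
pair = alloc.allocate(2)    # splits the level-3 cell once
single = alloc.allocate(1)  # splits the remaining level-2 buddy
alloc.release(single)
alloc.release(pair)         # buddies coalesce back to one level-3 cell
```

The key property this illustrates is why such an allocator helps preserve affinity guarantees: a large cell is split only when no suitably sized free cell exists, and freed buddies are eagerly coalesced, so high-affinity (large) cells remain available for the tenants whose VCs reserve them.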
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhao, H | - |
dc.contributor.author | Han, Z | -
dc.contributor.author | Yang, Z | - |
dc.contributor.author | Zhang, Q | - |
dc.contributor.author | Yang, F | - |
dc.contributor.author | Zhou, L | - |
dc.contributor.author | Yang, M | - |
dc.contributor.author | Lau, FCM | - |
dc.contributor.author | Wang, Y | - |
dc.contributor.author | Xiong, Y | - |
dc.contributor.author | Wang, B | - |
dc.date.accessioned | 2020-11-23T08:17:03Z | - |
dc.date.available | 2020-11-23T08:17:03Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20), Banff, Alberta, Canada, 4-6 November 2020, p. 515-532 | - |
dc.identifier.isbn | 9781939133199 | - |
dc.identifier.uri | http://hdl.handle.net/10722/293458 | - |
dc.description.abstract | Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe severe sharing anomaly in production multi-tenant clusters where jobs in some tenants experience worse queuing delay than they would have in a private cluster with their allocated shares of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomaly would never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing schedulers within each VC to achieve their respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in a physical cluster. A straightforward extension of buddy cell allocation can further support low-priority jobs to scavenge the unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety. | - |
dc.language | eng | - |
dc.publisher | The USENIX Association. | - |
dc.relation.ispartof | The 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20) | - |
dc.title | HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Lau, FCM: fcmlau@cs.hku.hk | - |
dc.identifier.authority | Lau, FCM=rp00221 | - |
dc.identifier.hkuros | 319180 | - |
dc.identifier.spage | 515 | - |
dc.identifier.epage | 532 | - |