Appears in Collections: postgraduate thesis: Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles
Title | Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles |
---|---|
Authors | Ji, Zhuoran (紀卓然) |
Advisors | Wang, CL |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Ji, Z. [紀卓然]. (2022). Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Machine learning has shown great success in many applications, such as molecular modeling, social network analysis, and recommendation systems. Training a machine learning model often takes massive computation, especially with the ever-increasing volume of data. GPU platforms have been widely adopted in supercomputers and data centers because of their better absolute performance and performance per watt, yielding many GPU-accelerated machine learning frameworks (e.g., cuML). However, for many machine learning algorithms, porting them to GPUs is non-trivial. Optimizing these algorithms is an increasingly important topic.
Harvesting idle GPU computing capacity in data centers is another way to reduce training costs. Nowadays, data centers also host user-facing GPU jobs, such as deep neural network inference. GPUs are usually over-provisioned to satisfy these jobs' strict latency requirements, leaving the GPUs under-utilized most of the time. Long-running training jobs can be launched when GPUs are idle; when user-facing jobs arrive, the training jobs should release the GPUs instantly to avoid impacting the user experience. This requires efficient, low-latency GPU preemption, which is challenging due to the large on-chip resources of GPUs.
This thesis advances machine learning computation at both the algorithm level and the system level. At the algorithm level, we address two challenges of implementing efficient machine learning algorithms on GPUs, exemplified by two concrete algorithms. First, we optimize k-nearest neighbor graph construction with on-the-fly top-k selection. The distance matrix calculation algorithm and the top-k selection algorithm are co-designed to minimize communication and avoid shared memory contention, demonstrating the necessity of co-designing algorithms when fusing multiple steps of a machine learning model into a single kernel. Second, many machine learning models have unpredictable control flow and memory references (i.e., irregular kernels), such as the aggregation kernel of graph neural networks. Irregular kernels are inefficient on GPUs, as few optimization strategies apply to them. We propose on-GPU interpreter-style programming, which allows irregular kernels to enjoy the optimizations designed for regular kernels and thus significantly improves their throughput.
At the system level, we enable efficient low-latency GPU preemption with three novel techniques. The first is context flashback, which executes the context switch at an instruction preceding the preemption point that has a smaller context. Feasible flashback points are identified with use-define chain analysis, and three complementary methods are proposed to create more feasible flashback points. Second, our solution includes checkpoint-based GPU preemption to enable zero-latency preemption; to reduce its runtime overhead, we design a compiler-directed GPU incremental checkpointing technique tailored for preemption. Finally, we combine the different GPU preemption techniques to leverage their distinct trade-offs. Our method is implemented with GPU spatial multitasking rather than dynamic selection, which is more compatible with sophisticated GPU kernels. A preemption hierarchy and SM-prefetching are proposed to achieve robustly low latency and high throughput. Evaluations show that our machine learning algorithms achieve much higher throughput than state-of-the-art implementations, and our low-latency GPU preemption solution allows machine learning training to efficiently harvest idle GPU computing capacity. |
Degree | Doctor of Philosophy |
Subject | Graphics processing units; Computer algorithms; Machine learning |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/323673 |
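The abstract's on-the-fly top-k selection can be illustrated with a minimal CPU-side sketch. This is a hypothetical Python illustration of the general idea only, not the thesis's CUDA implementation: each candidate distance is fed into a bounded heap as it is computed, so the full distance matrix is never materialized.

```python
import heapq

def knn_graph(points, k):
    """Build a k-nearest-neighbor list per point, selecting the
    top-k on the fly instead of materializing the full distance matrix."""
    n = len(points)
    neighbors = []
    for i in range(n):
        # Max-heap (via negated distances) of the k closest points seen so far.
        heap = []
        for j in range(n):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if len(heap) < k:
                heapq.heappush(heap, (-d, j))
            elif -d > heap[0][0]:  # closer than the current k-th nearest
                heapq.heapreplace(heap, (-d, j))
        neighbors.append(sorted(j for _, j in heap))
    return neighbors
```

On a GPU, the thesis co-designs this selection step with the distance computation itself; the sketch above only shows why fusing the two avoids storing an n-by-n distance matrix.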
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wang, CL | - |
dc.contributor.author | Ji, Zhuoran | - |
dc.contributor.author | 紀卓然 | - |
dc.date.accessioned | 2023-01-09T01:48:20Z | - |
dc.date.available | 2023-01-09T01:48:20Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Ji, Z. [紀卓然]. (2022). Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/323673 | - |
dc.description.abstract | Machine learning has shown great success in many applications, such as molecular modeling, social network analysis, and recommendation systems. Training a machine learning model often takes massive computation, especially with the ever-increasing volume of data. GPU platforms have been widely adopted in supercomputers and data centers because of their better absolute performance and performance per watt, yielding many GPU-accelerated machine learning frameworks (e.g., cuML). However, for many machine learning algorithms, porting them to GPUs is non-trivial. Optimizing these algorithms is an increasingly important topic. Harvesting idle GPU computing capacity in data centers is another way to reduce training costs. Nowadays, data centers also host user-facing GPU jobs, such as deep neural network inference. GPUs are usually over-provisioned to satisfy these jobs' strict latency requirements, leaving the GPUs under-utilized most of the time. Long-running training jobs can be launched when GPUs are idle; when user-facing jobs arrive, the training jobs should release the GPUs instantly to avoid impacting the user experience. This requires efficient, low-latency GPU preemption, which is challenging due to the large on-chip resources of GPUs. This thesis advances machine learning computation at both the algorithm level and the system level. At the algorithm level, we address two challenges of implementing efficient machine learning algorithms on GPUs, exemplified by two concrete algorithms. First, we optimize k-nearest neighbor graph construction with on-the-fly top-k selection. The distance matrix calculation algorithm and the top-k selection algorithm are co-designed to minimize communication and avoid shared memory contention, demonstrating the necessity of co-designing algorithms when fusing multiple steps of a machine learning model into a single kernel. Second, many machine learning models have unpredictable control flow and memory references (i.e., irregular kernels), such as the aggregation kernel of graph neural networks. Irregular kernels are inefficient on GPUs, as few optimization strategies apply to them. We propose on-GPU interpreter-style programming, which allows irregular kernels to enjoy the optimizations designed for regular kernels and thus significantly improves their throughput. At the system level, we enable efficient low-latency GPU preemption with three novel techniques. The first is context flashback, which executes the context switch at an instruction preceding the preemption point that has a smaller context. Feasible flashback points are identified with use-define chain analysis, and three complementary methods are proposed to create more feasible flashback points. Second, our solution includes checkpoint-based GPU preemption to enable zero-latency preemption; to reduce its runtime overhead, we design a compiler-directed GPU incremental checkpointing technique tailored for preemption. Finally, we combine the different GPU preemption techniques to leverage their distinct trade-offs. Our method is implemented with GPU spatial multitasking rather than dynamic selection, which is more compatible with sophisticated GPU kernels. A preemption hierarchy and SM-prefetching are proposed to achieve robustly low latency and high throughput. Evaluations show that our machine learning algorithms achieve much higher throughput than state-of-the-art implementations, and our low-latency GPU preemption solution allows machine learning training to efficiently harvest idle GPU computing capacity. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Graphics processing units | - |
dc.subject.lcsh | Computer algorithms | - |
dc.subject.lcsh | Machine learning | - |
dc.title | Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044625590503414 | - |
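The abstract mentions that feasible flashback points are identified with use-define chain analysis. A minimal sketch of the underlying idea, under the assumption (not stated in the record) that a flashback point is simply a program point whose live-variable set, i.e., the context that must be saved, is small. The function names and the `max_ctx` threshold are hypothetical illustrations, not the thesis's actual interface:

```python
def live_sets(instrs):
    """Backward liveness over a straight-line instruction list.
    Each instruction is a (defs, uses) pair of variable-name sets;
    returns the set of variables live just before each instruction."""
    live = set()
    result = []
    for defs, uses in reversed(instrs):
        # Standard dataflow step: kill definitions, then add uses.
        live = (live - set(defs)) | set(uses)
        result.append(live.copy())
    result.reverse()
    return result

def flashback_points(instrs, max_ctx):
    """Indices where the live set (the context that would have to be
    switched out) fits within max_ctx values -- candidate points to
    flash back to before performing the context switch."""
    return [i for i, live in enumerate(live_sets(instrs)) if len(live) <= max_ctx]
```

In the sketch, rolling execution back to a candidate index with a small live set means less on-chip state has to be saved, which is the motivation the abstract gives for context flashback.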