Appears in Collections: postgraduate thesis: Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles
Title | Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles |
---|---|
Authors | Ji, Zhuoran (紀卓然) |
Advisors | Wang, CL |
Issue Date | 2022 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Ji, Z. [紀卓然]. (2022). Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Machine learning has shown great success in many applications, such as molecular modeling, social network analysis, and recommendation systems. Training a machine learning model often takes massive computation, especially with the ever-increasing volume of data. GPU platforms have been widely adopted in supercomputers and data centers because of their better absolute performance and performance per watt, yielding many GPU-accelerated machine learning frameworks (e.g., cuML). However, for many machine learning algorithms, porting them to GPUs is non-trivial. Optimizing these algorithms is an increasingly important topic.
Harvesting idle GPU computing capacity in data centers is another way to reduce training costs. Nowadays, data centers also host user-facing GPU jobs, such as deep neural network inference. GPUs are usually over-provisioned to satisfy these jobs' strict latency requirements, leaving the GPUs under-utilized most of the time. Long-running training jobs can be launched when GPUs are idle; when user-facing jobs arrive, the training jobs should release the GPUs instantly to avoid impacting the user experience. This requires efficient, low-latency GPU preemption, which is challenging due to the large on-chip resources of GPUs.
This thesis advances machine learning computation at both the algorithm level and the system level. At the algorithm level, we address two challenges of implementing efficient machine learning algorithms on GPUs, exemplified by two concrete algorithms. First, we optimize k-nearest neighbor graph construction with on-the-fly top-k selection. The distance matrix calculation algorithm and the top-k selection algorithm are co-designed to minimize communication and avoid shared memory contention, demonstrating the necessity of co-designing algorithms when fusing multiple steps of a machine learning model into a single kernel. Second, many machine learning models have unpredictable control flow and memory references (i.e., irregular kernels), such as the aggregation kernel of graph neural networks. Irregular kernels are inefficient on GPUs, as few optimization strategies apply to them. We propose on-GPU interpreter-style programming, which allows irregular kernels to enjoy the optimizations designed for regular kernels and thus significantly improves their throughput.
At the system level, we enable efficient low-latency GPU preemption with three novel techniques. The first is context flashback, which executes the context switch at an instruction preceding the preemption point that has a smaller context. Feasible flashback points are identified with use-define chain analysis, and three complementary methods are proposed to create more feasible flashback points. Second, our solution includes checkpoint-based GPU preemption to enable zero-latency preemption; to reduce its runtime overhead, we design a compiler-directed GPU incremental checkpointing technique tailored for preemption. Finally, we combine the different GPU preemption techniques to leverage their distinct trade-offs. Our method is implemented with GPU spatial multitasking rather than dynamic selection, which is more compatible with sophisticated GPU kernels. A preemption hierarchy and SM-prefetching are proposed to achieve robustly low latency and high throughput. Evaluations show that our machine learning algorithms achieve much higher throughput than state-of-the-art implementations, and our low-latency GPU preemption solution allows machine learning training to efficiently harvest idle GPU computing capacity. |
Degree | Doctor of Philosophy |
Subject | Graphics processing units; Computer algorithms; Machine learning |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/323673 |
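The abstract's on-the-fly top-k selection can be illustrated with a minimal CPU-side sketch. This is a hypothetical Python illustration of the general idea only, not the thesis's CUDA implementation: each candidate distance is fed into a bounded heap as it is computed, so the full distance matrix is never materialized.

```python
import heapq

def knn_graph(points, k):
    """Build a k-nearest-neighbor list per point, selecting the
    top-k on the fly instead of materializing the full distance matrix."""
    n = len(points)
    neighbors = []
    for i in range(n):
        # Max-heap (via negated distances) of the k closest points seen so far.
        heap = []
        for j in range(n):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if len(heap) < k:
                heapq.heappush(heap, (-d, j))
            elif -d > heap[0][0]:  # closer than the current k-th nearest
                heapq.heapreplace(heap, (-d, j))
        neighbors.append(sorted(j for _, j in heap))
    return neighbors
```

On a GPU, the thesis co-designs this selection step with the distance computation itself; the sketch above only shows why fusing the two avoids storing an n-by-n distance matrix.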
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Wang, CL | - |
dc.contributor.author | Ji, Zhuoran | - |
dc.contributor.author | 紀卓然 | - |
dc.date.accessioned | 2023-01-09T01:48:20Z | - |
dc.date.available | 2023-01-09T01:48:20Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Ji, Z. [紀卓然]. (2022). Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/323673 | - |
dc.description.abstract | Machine learning has shown great success in many applications, such as molecular modeling, social network analysis, and recommendation systems. Training a machine learning model often takes massive computation, especially with the ever-increasing volume of data. GPU platforms have been widely adopted in supercomputers and data centers because of their better absolute performance and performance per watt, yielding many GPU-accelerated machine learning frameworks (e.g., cuML). However, for many machine learning algorithms, porting them to GPUs is non-trivial. Optimizing these algorithms is an increasingly important topic. Harvesting idle GPU computing capacity in data centers is another way to reduce training costs. Nowadays, data centers also host user-facing GPU jobs, such as deep neural network inference. GPUs are usually over-provisioned to satisfy these jobs' strict latency requirements, leaving the GPUs under-utilized most of the time. Long-running training jobs can be launched when GPUs are idle; when user-facing jobs arrive, the training jobs should release the GPUs instantly to avoid impacting the user experience. This requires efficient, low-latency GPU preemption, which is challenging due to the large on-chip resources of GPUs. This thesis advances machine learning computation at both the algorithm level and the system level. At the algorithm level, we address two challenges of implementing efficient machine learning algorithms on GPUs, exemplified by two concrete algorithms. First, we optimize k-nearest neighbor graph construction with on-the-fly top-k selection. The distance matrix calculation algorithm and the top-k selection algorithm are co-designed to minimize communication and avoid shared memory contention, demonstrating the necessity of co-designing algorithms when fusing multiple steps of a machine learning model into a single kernel. Second, many machine learning models have unpredictable control flow and memory references (i.e., irregular kernels), such as the aggregation kernel of graph neural networks. Irregular kernels are inefficient on GPUs, as few optimization strategies apply to them. We propose on-GPU interpreter-style programming, which allows irregular kernels to enjoy the optimizations designed for regular kernels and thus significantly improves their throughput. At the system level, we enable efficient low-latency GPU preemption with three novel techniques. The first is context flashback, which executes the context switch at an instruction preceding the preemption point that has a smaller context. Feasible flashback points are identified with use-define chain analysis, and three complementary methods are proposed to create more feasible flashback points. Second, our solution includes checkpoint-based GPU preemption to enable zero-latency preemption; to reduce its runtime overhead, we design a compiler-directed GPU incremental checkpointing technique tailored for preemption. Finally, we combine the different GPU preemption techniques to leverage their distinct trade-offs. Our method is implemented with GPU spatial multitasking rather than dynamic selection, which is more compatible with sophisticated GPU kernels. A preemption hierarchy and SM-prefetching are proposed to achieve robustly low latency and high throughput. Evaluations show that our machine learning algorithms achieve much higher throughput than state-of-the-art implementations, and our low-latency GPU preemption solution allows machine learning training to efficiently harvest idle GPU computing capacity. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Graphics processing units | - |
dc.subject.lcsh | Computer algorithms | - |
dc.subject.lcsh | Machine learning | - |
dc.title | Efficient machine learning on GPUs : optimizing algorithms and harvesting idle cycles | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2023 | - |
dc.identifier.mmsid | 991044625590503414 | - |
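The abstract mentions that feasible flashback points are identified with use-define chain analysis. A minimal sketch of the underlying idea, under the assumption (not stated in the record) that a flashback point is simply a program point whose live-variable set, i.e., the context that must be saved, is small. The function names and the `max_ctx` threshold are hypothetical illustrations, not the thesis's actual interface:

```python
def live_sets(instrs):
    """Backward liveness over a straight-line instruction list.
    Each instruction is a (defs, uses) pair of variable-name sets;
    returns the set of variables live just before each instruction."""
    live = set()
    result = []
    for defs, uses in reversed(instrs):
        # Standard dataflow step: kill definitions, then add uses.
        live = (live - set(defs)) | set(uses)
        result.append(live.copy())
    result.reverse()
    return result

def flashback_points(instrs, max_ctx):
    """Indices where the live set (the context that would have to be
    switched out) fits within max_ctx values -- candidate points to
    flash back to before performing the context switch."""
    return [i for i, live in enumerate(live_sets(instrs)) if len(live) <= max_ctx]
```

In the sketch, rolling execution back to a candidate index with a small live set means less on-chip state has to be saved, which is the motivation the abstract gives for context flashback.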