
Postgraduate thesis: Efficient machine learning on GPUs: optimizing algorithms and harvesting idle cycles

Title: Efficient machine learning on GPUs: optimizing algorithms and harvesting idle cycles
Author: Ji, Zhuoran (紀卓然)
Advisor: Wang, CL
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Ji, Z. [紀卓然]. (2022). Efficient machine learning on GPUs: optimizing algorithms and harvesting idle cycles. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract:
Machine learning has shown great success in many applications, such as molecular modeling, social networks, and recommendation systems. Training a machine learning model often takes massive computation, especially with the ever-increasing volume of data. GPU platforms have been widely adopted in supercomputers and data centers for their superior absolute performance and performance per watt, giving rise to many GPU-accelerated machine learning frameworks (e.g., cuML). However, porting many machine learning algorithms to GPUs is non-trivial, and optimizing them is an increasingly important topic. Harvesting idle GPU computing capacity in data centers is another way to reduce training costs. Data centers nowadays also host user-facing GPU jobs, such as deep neural network inference. GPUs are usually over-provisioned to satisfy these jobs' strict latency requirements, leaving them under-utilized most of the time. Long-running training jobs can be launched while GPUs are idle, but when user-facing jobs arrive, the training jobs must release the GPUs instantly to avoid hurting the user experience. This requires efficient low-latency GPU preemption, which is challenging because of the large on-chip state of GPUs.

This thesis advances machine learning computation at both the algorithm level and the system level. At the algorithm level, we address two challenges of implementing efficient machine learning algorithms on GPUs, exemplified by two concrete algorithms. First, we optimize k-nearest-neighbor graph construction with on-the-fly top-k selection: the distance-matrix calculation and the top-k selection are co-designed to minimize communication and avoid shared-memory contention, demonstrating the necessity of co-design when fusing the multiple steps of a machine learning model into a single kernel. Second, many machine learning models have unpredictable control flow and memory references (i.e., irregular kernels), such as the aggregation kernel of graph neural networks. Irregular kernels are inefficient on GPUs because few optimization strategies apply to them. We propose on-GPU interpreter-style programming, which lets irregular kernels benefit from the optimizations designed for regular kernels and thus significantly improves throughput.

At the system level, we enable efficient low-latency GPU preemption with three novel techniques. The first is context flashback, which executes the context switch at a preceding instruction (with a smaller context) of the instruction where preemption occurs; feasible flashback points are identified with use-define chain analysis, and three complementary methods are proposed to create more of them. The second is checkpoint-based GPU preemption, which enables zero-latency preemption; to reduce its runtime overhead, we design a compiler-directed GPU incremental checkpointing technique tailored for preemption. Finally, we combine different GPU preemption techniques to exploit their different trade-offs. Our method is implemented with GPU spatial multitasking rather than dynamic selection, which is more compatible with sophisticated GPU kernels, and a preemption hierarchy and SM-prefetching are proposed to achieve robustly low latency and high throughput.

Evaluations show that our machine learning algorithms achieve much higher throughput than state-of-the-art implementations, and that our low-latency GPU preemption solution allows machine learning training to efficiently harvest idle GPU computing capacity.
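The abstract's algorithm-level ideas can be made concrete with short sketches. Below is a minimal CUDA sketch of fusing distance computation with on-the-fly top-k selection for k-NN graph construction. It assumes squared Euclidean distance and a block-per-query layout; the kernel name (knn_fused), the compile-time K, and the single-thread final merge are illustrative simplifications, not the thesis's actual co-designed kernel.

#include <cuda_runtime.h>
#include <cfloat>

#define K 8  // neighbors kept per query (illustrative choice)

// One thread block per query point. Each thread scans a strided slice of
// the candidates, computes the squared Euclidean distance on the fly, and
// keeps a private top-K in registers; a final shared-memory merge picks
// the block-wide top-K, so the full distance matrix is never materialized.
__global__ void knn_fused(const float* pts, int n, int dim,
                          int* out_idx, float* out_dist) {
    int q = blockIdx.x;                       // query index
    float best_d[K]; int best_i[K];           // per-thread candidates
    for (int j = 0; j < K; ++j) { best_d[j] = FLT_MAX; best_i[j] = -1; }

    for (int p = threadIdx.x; p < n; p += blockDim.x) {
        if (p == q) continue;                 // skip the query itself
        float d = 0.f;
        for (int c = 0; c < dim; ++c) {
            float t = pts[q * dim + c] - pts[p * dim + c];
            d += t * t;
        }
        int worst = 0;                        // replace current worst if closer
        for (int j = 1; j < K; ++j) if (best_d[j] > best_d[worst]) worst = j;
        if (d < best_d[worst]) { best_d[worst] = d; best_i[worst] = p; }
    }

    // Crude merge: stage all per-thread candidates in shared memory and let
    // thread 0 select the K smallest (a real kernel would merge in parallel).
    extern __shared__ float sh[];             // blockDim.x * K distances ...
    int* si = (int*)(sh + blockDim.x * K);    // ... then blockDim.x * K indices
    for (int j = 0; j < K; ++j) {
        sh[threadIdx.x * K + j] = best_d[j];
        si[threadIdx.x * K + j] = best_i[j];
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        for (int j = 0; j < K; ++j) {         // selection over blockDim.x*K items
            int m = j;
            for (int t = j + 1; t < blockDim.x * K; ++t)
                if (sh[t] < sh[m]) m = t;
            float td = sh[j]; sh[j] = sh[m]; sh[m] = td;
            int   ti = si[j]; si[j] = si[m]; si[m] = ti;
            out_dist[q * K + j] = sh[j];
            out_idx[q * K + j]  = si[j];
        }
    }
}
// Launch: knn_fused<<<n, 128, 128 * K * (sizeof(float) + sizeof(int))>>>(...);

Because each candidate distance is consumed by the top-k selection as soon as it is produced, the n-by-n distance matrix never touches global memory, which is the communication saving that motivates the co-design.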
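The on-GPU interpreter-style idea can be sketched similarly, under the assumption (ours, not the thesis's) that the host lowers the irregular work, here a graph-neural-network aggregation, into a flat stream of tiny tasks that the kernel executes in a uniform fetch-and-dispatch loop; Task and interp_aggregate are hypothetical names.

#include <cuda_runtime.h>

// Hypothetical "instruction": add the feature vector of node src into node dst.
struct Task { int dst; int src; };

// The kernel itself is regular: a grid-stride loop over a dense task stream.
// The irregular control flow and memory references of the original graph
// traversal were compiled into the task data on the host, so standard
// regular-kernel optimizations (coalescing, load balancing) now apply.
__global__ void interp_aggregate(const Task* tasks, int n_tasks,
                                 const float* feat_in, float* feat_out, int dim) {
    for (int t = blockIdx.x * blockDim.x + threadIdx.x;
         t < n_tasks; t += gridDim.x * blockDim.x) {
        Task tk = tasks[t];
        for (int c = 0; c < dim; ++c)   // atomic: many edges share a destination
            atomicAdd(&feat_out[tk.dst * dim + c], feat_in[tk.src * dim + c]);
    }
}

The point of the pattern is that divergence moves out of the instruction stream and into the data: every thread runs the same loop, and only the operands differ.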
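On the system side, context flashback rests on the observation that the live state (the "context") is smaller at some earlier instructions. The annotated toy kernel below only illustrates that observation; the thesis identifies such points with use-define chain analysis in the compiler.

#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Point A: live values are roughly {a, x, y, i, n} -- a small context.
    float xi = x[i];
    float yi = y[i];
    // Point B: {a, y, i, xi, yi, ...} -- a larger context to save.
    // If preemption arrives here and use-define chains show that xi and yi
    // can be recomputed from x[i] and y[i] (no intervening writes), the
    // runtime can "flash back" to Point A and save the smaller context.
    y[i] = a * xi + yi;
}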
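Finally, a minimal sketch of the incremental-checkpointing idea, with every detail assumed for illustration: a compiler-chosen checkpoint site every eight iterations, a host-set preemption flag, and a stand-in Newton-iteration workload. At each site only the live values (here just v and the loop position) are saved, which is what keeps the runtime overhead low, and a raised flag lets the kernel exit with zero preemption latency because its state is already durable.

#include <cuda_runtime.h>

// Host is assumed to initialize ckpt[i] = 1.0f (initial guess) and
// resume_iter[i] = 0, and to relaunch the kernel after a preemption.
__global__ void newton_sqrt(const float* input, float* out, float* ckpt,
                            int* resume_iter, const volatile int* preempt_flag,
                            int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = ckpt[i];                      // restore from the last checkpoint
    for (int it = resume_iter[i]; it < iters; ++it) {
        v = 0.5f * (v + input[i] / v);      // stand-in compute: sqrt via Newton
        if ((it & 7) == 7) {                // compiler-chosen checkpoint site
            ckpt[i] = v;                    // save only the live value ...
            resume_iter[i] = it + 1;        // ... and the loop position
            if (*preempt_flag) return;      // zero-latency exit: state is durable
        }
    }
    out[i] = v;
}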
Degree: Doctor of Philosophy
Subjects: Graphics processing units; Computer algorithms; Machine learning
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/323673

 

Dublin Core (DC) metadata:
dc.contributor.advisor: Wang, CL
dc.contributor.author: Ji, Zhuoran
dc.contributor.author: 紀卓然
dc.date.accessioned: 2023-01-09T01:48:20Z
dc.date.available: 2023-01-09T01:48:20Z
dc.date.issued: 2022
dc.identifier.citation: Ji, Z. [紀卓然]. (2022). Efficient machine learning on GPUs: optimizing algorithms and harvesting idle cycles. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/323673
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Graphics processing units
dc.subject.lcsh: Computer algorithms
dc.subject.lcsh: Machine learning
dc.title: Efficient machine learning on GPUs: optimizing algorithms and harvesting idle cycles
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2023
dc.identifier.mmsid: 991044625590503414
