Postgraduate thesis: Fine-grained concurrent kernel execution on SM/CU level for GPGPU computing

Title: Fine-grained concurrent kernel execution on SM/CU level for GPGPU computing
Authors: Wu, Hao (吴昊)
Advisor(s): Wang, CL
Issue Date: 2019
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Wu, H. [吴昊]. (2019). Fine-grained concurrent kernel execution on SM/CU level for GPGPU computing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: As a critical computing resource in multi-user systems such as supercomputers, data centres and cloud services, a GPU contains multiple compute units (CUs). GPU multitasking is an intuitive solution to underutilization in GPGPU computing. Recently proposed solutions for multitasking GPUs fall into two categories: (1) spatially-partitioned sharing (SPS), which co-executes different kernels on disjoint sets of CUs, and (2) simultaneous multikernel (SMK), which runs multiple kernels simultaneously within a CU. Compared to SPS, SMK can improve resource utilization further by interleaving instructions from kernels with low dynamic resource contention. However, SMK is hard to implement on current GPU architectures because (1) techniques for applying SMK on top of the GPU hardware scheduling policy are scarce, and (2) finding an efficient SMK scheme is difficult due to the complex interference among concurrently executed kernels. In this thesis, we propose a lightweight and effective performance model to evaluate the complex interference of SMK. Based on the probability of independent events, the model is built from a new perspective and requires only a few parameters. We then propose a metric, the symbiotic factor, which evaluates an SMK scheme so that kernels with complementary resource utilization can co-run within a CU. We also analyse the advantages and disadvantages of the kernel slicing and kernel stretching techniques and integrate them to apply SMK on real GPUs rather than on simulators. We validate our model on 18 benchmarks. Compared to optimized hardware-based concurrent kernel execution, whose kernel launch order yields fast execution times, co-running kernel pairs achieve average speedups of 11%, 18% and 12% on the AMD R9 290X, RX 480 and Vega 64, respectively. Compared to Warped-Slicer, the results show average speedups of 29%, 18% and 51% on the AMD R9 290X, RX 480 and Vega 64, respectively.
Degree: Doctor of Philosophy
Subject: Graphics processing units - Programming
Subject: Computer graphics
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/281608
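
The abstract above rests on pairing kernels whose resource demands complement each other inside a single CU, as judged by a symbiotic factor. As an illustration only, the Python sketch below scores candidate kernel pairs with a toy symbiotic factor and greedily picks co-run partners; the ResourceProfile fields, the scoring formula and the pick_corun_pairs helper are assumptions made for this sketch and do not reproduce the probability-based model or the metric defined in the thesis.

    # Hypothetical sketch (not from the thesis): score co-run candidates with a
    # toy "symbiotic factor" built from per-kernel resource-utilization estimates.
    # ResourceProfile, symbiotic_factor and pick_corun_pairs are illustrative
    # assumptions, not the author's model.
    from dataclasses import dataclass
    from itertools import combinations

    @dataclass
    class ResourceProfile:
        """Fraction of one CU's capacity a kernel tends to occupy (assumed inputs)."""
        name: str
        vgpr: float    # vector register pressure (0..1)
        lds: float     # local data share / shared memory (0..1)
        alu: float     # ALU issue-slot utilization (0..1)
        mem_bw: float  # memory-bandwidth demand (0..1)

    def symbiotic_factor(a: ResourceProfile, b: ResourceProfile) -> float:
        """Higher is better: reward filling the CU, penalize oversubscription."""
        combined = [a.vgpr + b.vgpr, a.lds + b.lds, a.alu + b.alu, a.mem_bw + b.mem_bw]
        overflow = sum(max(0.0, c - 1.0) for c in combined)      # oversubscribed resources
        utilization = sum(min(1.0, c) for c in combined) / len(combined)
        return utilization - overflow

    def pick_corun_pairs(kernels):
        """Greedy pairing: repeatedly co-schedule the most symbiotic pair."""
        pool = list(kernels)
        pairs = []
        while len(pool) > 1:
            best = max(combinations(pool, 2), key=lambda p: symbiotic_factor(*p))
            pairs.append(best)
            for k in best:
                pool.remove(k)
        return pairs

    if __name__ == "__main__":
        kernels = [
            ResourceProfile("memcpy-like",  vgpr=0.2, lds=0.1, alu=0.2, mem_bw=0.9),
            ResourceProfile("gemm-like",    vgpr=0.7, lds=0.6, alu=0.9, mem_bw=0.3),
            ResourceProfile("reduce-like",  vgpr=0.3, lds=0.5, alu=0.4, mem_bw=0.6),
            ResourceProfile("stencil-like", vgpr=0.5, lds=0.4, alu=0.6, mem_bw=0.5),
        ]
        for a, b in pick_corun_pairs(kernels):
            print(f"co-run {a.name} with {b.name}: factor {symbiotic_factor(a, b):.2f}")

Under these assumed profiles, the memory-bound kernel pairs with the compute-bound one, which is the intuition the abstract describes: complementary kernels keep a CU's resources busy without oversubscribing any of them.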

 

Dublin Core record (field: value)
dc.contributor.advisor: Wang, CL
dc.contributor.author: Wu, Hao
dc.contributor.author: 吴昊
dc.date.accessioned: 2020-03-18T11:33:04Z
dc.date.available: 2020-03-18T11:33:04Z
dc.date.issued: 2019
dc.identifier.citation: Wu, H. [吴昊]. (2019). Fine-grained concurrent kernel execution on SM/CU level for GPGPU computing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/281608
dc.description.abstract: As a critical computing resource in multi-user systems such as supercomputers, data centres and cloud services, a GPU contains multiple compute units (CUs). GPU multitasking is an intuitive solution to underutilization in GPGPU computing. Recently proposed solutions for multitasking GPUs fall into two categories: (1) spatially-partitioned sharing (SPS), which co-executes different kernels on disjoint sets of CUs, and (2) simultaneous multikernel (SMK), which runs multiple kernels simultaneously within a CU. Compared to SPS, SMK can improve resource utilization further by interleaving instructions from kernels with low dynamic resource contention. However, SMK is hard to implement on current GPU architectures because (1) techniques for applying SMK on top of the GPU hardware scheduling policy are scarce, and (2) finding an efficient SMK scheme is difficult due to the complex interference among concurrently executed kernels. In this thesis, we propose a lightweight and effective performance model to evaluate the complex interference of SMK. Based on the probability of independent events, the model is built from a new perspective and requires only a few parameters. We then propose a metric, the symbiotic factor, which evaluates an SMK scheme so that kernels with complementary resource utilization can co-run within a CU. We also analyse the advantages and disadvantages of the kernel slicing and kernel stretching techniques and integrate them to apply SMK on real GPUs rather than on simulators. We validate our model on 18 benchmarks. Compared to optimized hardware-based concurrent kernel execution, whose kernel launch order yields fast execution times, co-running kernel pairs achieve average speedups of 11%, 18% and 12% on the AMD R9 290X, RX 480 and Vega 64, respectively. Compared to Warped-Slicer, the results show average speedups of 29%, 18% and 51% on the AMD R9 290X, RX 480 and Vega 64, respectively.
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Graphics processing units - Programming
dc.subject.lcsh: Computer graphics
dc.title: Fine-grained concurrent kernel execution on SM/CU level for GPGPU computing
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.identifier.doi: 10.5353/th_991044214993603414
dc.date.hkucongregation: 2020
dc.identifier.mmsid: 991044214993603414
