Postgraduate thesis: Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example
| Title | Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example |
|---|---|
| Authors | Tang, Yifeng (唐艺峰) |
| Advisors | Cui, H; Wang, CL |
| Issue Date | 2024 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Tang, Y. [唐艺峰]. (2024). Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | The field of Artificial Intelligence (AI) has become one of the most prominent areas of research and industrial application due to its vast impact and versatility. The development of complex AI models, which often involve extensive parameters, necessitates substantial computational power. This is where specialized hardware, known as AI processors, becomes essential. This thesis focuses on the architecture and performance of Huawei Ascend processors, a representative AI processor, and introduces novel optimization strategies to enhance algorithmic efficiency. AI processors primarily rely on matrix multiplier-accumulators (MACs), which execute matrix multiplications with remarkable computational capability. Matrix multiplications serve as the foundation for various AI algorithms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. However, complete AI applications require a broader range of operations beyond matrix multiplication. Therefore, AI processors include additional hardware units that facilitate essential operations and data transfers to enable comprehensive AI functionality. Huawei Ascend processors incorporate four key types of hardware units: Matrix MACs for matrix multiplications, IO units for intra- and inter-core data transfers, vector units for vectorized calculations, and scalar units for address calculations and branch conditions. Each unit has specific strengths and constraints that influence the overall performance and optimization of AI algorithms on Ascend processors. To gain in-depth insights into the structure and functionality of Ascend processors, we developed specialized micro-benchmarks to examine their hardware characteristics, including IO contention, bandwidth sharing, and runtime behavior. The empirical data collected enabled us to construct a performance model, Verrocchio, which accurately predicts the execution time of real-world Ascend kernels. Verrocchio's predictions achieve an average error rate of 2.62% for single-core and 2.30% for double-core executions. Notably, non-MAC units often deliver limited performance compared with Matrix MACs, potentially bottlenecking overall application efficiency. In response, we introduce two primary optimization strategies alongside Verrocchio to enhance algorithmic implementations on Ascend processors: (1) replacing suboptimal scalar or vectorized operations with more efficient alternatives, and (2) mapping certain operations to matrix multiplications where feasible. For the first optimization, we take the k-nearest neighbors (k-NN) algorithm as an example and propose SelB-k-NN (Selection-Bitonic-k-NN), which mitigates the need for suboptimal operations on large-scale datasets. SelB-k-NN delivers a 2.01x speedup over bitonic k-selection, 23.93x over the heap method, and 78.52x over the CPU-based approach. For the second optimization, we propose Cube-fx, an algorithm that maps Taylor expansions of multiple functions onto Matrix MACs. Performance evaluations show that Cube-fx surpasses the standard Taylor expansion implementation by 2.73x, CORDIC by 6.06x, and Horner's method by 1.64x. While this second strategy achieves significant performance gains by fully utilizing Matrix MACs, it is limited to operations that can be reformulated as matrix multiplications. Therefore, the first strategy remains essential for optimizing a wider range of computations on AI processors. Together, these strategies offer a comprehensive approach to maximizing efficiency on this architecture. |
| Degree | Doctor of Philosophy |
| Subject | High performance processors; Artificial intelligence |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/360654 |
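
A note for readers: the abstract above names three technical contributions (the Verrocchio performance model, SelB-k-NN, and Cube-fx) but, as a catalog record, gives none of their internals. The three sketches below are hedged reading aids, not reconstructions of the thesis's methods. First, Verrocchio: a predictor of this kind typically combines per-unit costs under an overlap assumption. The sketch assumes invented throughput constants and a simple bottleneck rule; the actual model, which accounts for IO contention and bandwidth sharing, is necessarily richer.

```python
# Hypothetical per-unit throughputs for an Ascend-style core. These are
# illustrative placeholders, NOT measured Ascend or Verrocchio values.
MAC_FLOPS_PER_CYCLE = 4096   # matrix MAC throughput (assumed)
VEC_ELEMS_PER_CYCLE = 128    # vector unit throughput (assumed)
IO_BYTES_PER_CYCLE = 256     # shared IO bandwidth (assumed)

def predict_cycles(matmul_flops, vector_elems, io_bytes, overlap=True):
    """Toy latency model: each hardware unit contributes a cost; units
    that run concurrently are bounded by the slowest one."""
    t_mac = matmul_flops / MAC_FLOPS_PER_CYCLE
    t_vec = vector_elems / VEC_ELEMS_PER_CYCLE
    t_io = io_bytes / IO_BYTES_PER_CYCLE
    if overlap:
        return max(t_mac, t_vec, t_io)  # pipelined: bottleneck dominates
    return t_mac + t_vec + t_io         # fully serialized execution

# Example: a 256x256x256 matmul with an elementwise epilogue, fp16 tensors.
flops = 2 * 256 ** 3
elems = 256 * 256
bytes_moved = 3 * 256 * 256 * 2  # two inputs + one output
print(predict_cycles(flops, elems, bytes_moved))  # -> 8192.0 (MAC-bound)
```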
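Second, the first optimization strategy (replacing suboptimal scalar operations with vectorized alternatives) can be illustrated generically with k-selection, the core of k-NN. The sketch below contrasts a scalar heap against a single vectorized partition, using numpy as a stand-in for a wide vector unit; it is not the SelB-k-NN algorithm, whose bitonic selection scheme this record does not detail.

```python
import heapq
import numpy as np

def k_smallest_heap(dists, k):
    """Scalar baseline: heap-based selection (branchy, element-at-a-time)."""
    return np.array(heapq.nsmallest(k, dists.tolist()))

def k_smallest_vectorized(dists, k):
    """Vectorized selection: one partition pass over the whole array,
    the kind of bulk operation that maps onto wide vector units."""
    idx = np.argpartition(dists, k)[:k]  # unordered k smallest
    return np.sort(dists[idx])

rng = np.random.default_rng(0)
d = rng.random(1_000_000).astype(np.float32)
assert np.allclose(k_smallest_heap(d, 8), k_smallest_vectorized(d, 8))
```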
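Third, the second strategy (mapping operations onto Matrix MACs): the record states only that Cube-fx maps Taylor expansions of multiple functions onto the matrix units. One natural such mapping, shown below as an assumption rather than the thesis's actual scheme, stacks the power basis of many evaluation points against the coefficient columns of many functions, so a single matrix multiplication evaluates every function at every point where Horner's method would evaluate each polynomial serially.

```python
import math
import numpy as np

def taylor_matmul(x, coeffs):
    """Evaluate several truncated Taylor series at many points with one
    matrix multiplication: powers(x) @ coefficient_columns."""
    degree = coeffs.shape[0] - 1
    powers = np.vander(x, degree + 1, increasing=True)  # powers[i, j] = x[i]**j
    return powers @ coeffs  # (points, terms) @ (terms, functions)

# Degree-7 Maclaurin coefficients (about 0) for exp, sin, cos, stacked
# column-wise so a single matmul evaluates all three functions at once.
k = np.arange(8)
inv_fact = np.array([1.0 / math.factorial(i) for i in k])
c_exp = inv_fact
c_sin = np.where(k % 2 == 1, inv_fact * (-1.0) ** (k // 2), 0.0)
c_cos = np.where(k % 2 == 0, inv_fact * (-1.0) ** (k // 2), 0.0)
C = np.stack([c_exp, c_sin, c_cos], axis=1)  # shape (8, 3)

x = np.linspace(-0.5, 0.5, 5)
approx = taylor_matmul(x, C)                            # shape (5, 3)
ref = np.stack([np.exp(x), np.sin(x), np.cos(x)], axis=1)
print(np.max(np.abs(approx - ref)))                     # tiny truncation error
```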
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Cui, H | - |
| dc.contributor.advisor | Wang, CL | - |
| dc.contributor.author | Tang, Yifeng | - |
| dc.contributor.author | 唐艺峰 | - |
| dc.date.accessioned | 2025-09-12T02:02:27Z | - |
| dc.date.available | 2025-09-12T02:02:27Z | - |
| dc.date.issued | 2024 | - |
| dc.identifier.citation | Tang, Y. [唐艺峰]. (2024). Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/360654 | - |
| dc.description.abstract | The field of Artificial Intelligence (AI) has become one of the most prominent areas of research and industrial application due to its vast impact and versatility. The development of complex AI models, which often involve extensive parameters, necessitates substantial computational power. This is where specialized hardware, known as AI processors, becomes essential. This thesis focuses on the architecture and performance of Huawei Ascend processors, a representative AI processor, and introduces novel optimization strategies to enhance algorithmic efficiency. AI processors primarily rely on matrix multiplier-accumulators (MACs), which execute matrix multiplications with remarkable computational capability. Matrix multiplications serve as the foundation for various AI algorithms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. However, complete AI applications require a broader range of operations beyond matrix multiplication. Therefore, AI processors include additional hardware units that facilitate essential operations and data transfers to enable comprehensive AI functionality. Huawei Ascend processors incorporate four key types of hardware units: Matrix MACs for matrix multiplications, IO units for intra- and inter-core data transfers, vector units for vectorized calculations, and scalar units for address calculations and branch conditions. Each unit has specific strengths and constraints that influence the overall performance and optimization of AI algorithms on Ascend processors. To gain in-depth insights into the structure and functionality of Ascend processors, we developed specialized micro-benchmarks to examine their hardware characteristics, including IO contention, bandwidth sharing, and runtime behavior. The empirical data collected enabled us to construct a performance model, Verrocchio, which accurately predicts the execution time of real-world Ascend kernels. Verrocchio's predictions achieve an average error rate of 2.62% for single-core and 2.30% for double-core executions. Notably, non-MAC units often deliver limited performance compared with Matrix MACs, potentially bottlenecking overall application efficiency. In response, we introduce two primary optimization strategies alongside Verrocchio to enhance algorithmic implementations on Ascend processors: (1) replacing suboptimal scalar or vectorized operations with more efficient alternatives, and (2) mapping certain operations to matrix multiplications where feasible. For the first optimization, we take the k-nearest neighbors (k-NN) algorithm as an example and propose SelB-k-NN (Selection-Bitonic-k-NN), which mitigates the need for suboptimal operations on large-scale datasets. SelB-k-NN delivers a 2.01x speedup over bitonic k-selection, 23.93x over the heap method, and 78.52x over the CPU-based approach. For the second optimization, we propose Cube-fx, an algorithm that maps Taylor expansions of multiple functions onto Matrix MACs. Performance evaluations show that Cube-fx surpasses the standard Taylor expansion implementation by 2.73x, CORDIC by 6.06x, and Horner's method by 1.64x. While this second strategy achieves significant performance gains by fully utilizing Matrix MACs, it is limited to operations that can be reformulated as matrix multiplications. Therefore, the first strategy remains essential for optimizing a wider range of computations on AI processors. Together, these strategies offer a comprehensive approach to maximizing efficiency on this architecture. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | High performance processors | - |
| dc.subject.lcsh | Artificial intelligence | - |
| dc.title | Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045060524103414 | - |
