Postgraduate thesis: Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example
| Title | Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example |
|---|---|
| Authors | Tang, Yifeng (唐艺峰) |
| Advisors | Cui, H; Wang, CL |
| Issue Date | 2024 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Tang, Y. [唐艺峰]. (2024). Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | The field of Artificial Intelligence (AI) has become one of the most prominent areas of research and industrial application due to its vast impact and versatility. The development of complex AI models, which often involve extensive parameters, necessitates substantial computational power. This is where specialized hardware, known as AI processors, becomes essential. This thesis focuses on the architecture and performance of Huawei Ascend processors, a representative AI processor, and introduces novel optimization strategies to enhance algorithmic efficiency. AI processors primarily rely on matrix multiplier-accumulators (MACs), which execute matrix multiplications with remarkable computational capability. Matrix multiplications serve as the foundation for various AI algorithms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. However, complete AI applications require a broader range of operations beyond matrix multiplication. Therefore, AI processors include additional hardware units that facilitate essential operations and data transfers to enable comprehensive AI functionality. Huawei Ascend processors incorporate four key types of hardware units: Matrix MACs for matrix multiplications, IO units for intra- and inter-core data transfers, vector units for vectorized calculations, and scalar units for address calculations and branch conditions. Each unit has specific strengths and constraints that influence the overall performance and optimization of AI algorithms on Ascend processors. To gain in-depth insights into the structure and functionality of Ascend processors, we developed specialized micro-benchmarks to examine their hardware characteristics, including IO contention, bandwidth sharing, and runtime behavior. The empirical data collected enabled us to construct a performance model, Verrocchio, which accurately predicts the execution time of real-world Ascend kernels. Verrocchio's predictions achieve an average error rate of 2.62% for single-core and 2.30% for double-core executions. Notably, non-MAC units often deliver limited performance compared with Matrix MACs, potentially bottlenecking overall application efficiency. In response, we introduce two primary optimization strategies alongside Verrocchio to enhance algorithmic implementations on Ascend processors: (1) replacing suboptimal scalar or vectorized operations with more efficient alternatives, and (2) mapping certain operations to matrix multiplications where feasible. For the first optimization, we take the k-nearest neighbors (k-NN) algorithm as an example and propose SelB-k-NN (Selection-Bitonic-k-NN), which mitigates the need for suboptimal operations on large-scale datasets. SelB-k-NN delivers a 2.01x speedup over bitonic k-selection, 23.93x over the heap method, and 78.52x over the CPU-based approach. For the second optimization, we propose Cube-fx, an algorithm that maps Taylor expansions of multiple functions onto Matrix MACs. Performance evaluations show that Cube-fx surpasses the standard Taylor expansion implementation by 2.73x, CORDIC by 6.06x, and Horner's method by 1.64x. While this second strategy achieves significant performance gains by fully utilizing Matrix MACs, it is limited to operations that can be reformulated as matrix multiplications. Therefore, the first strategy remains essential for optimizing a wider range of computations on AI processors. Together, these strategies offer a comprehensive approach to maximizing efficiency on this architecture. |
| Degree | Doctor of Philosophy |
| Subject | High performance processors; Artificial intelligence |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/360654 |
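
A note for readers: the abstract above names three technical contributions (the Verrocchio performance model, SelB-k-NN, and Cube-fx) but, as a catalog record, gives none of their internals. The three sketches below are hedged reading aids, not reconstructions of the thesis's methods. First, Verrocchio: a predictor of this kind typically combines per-unit costs under an overlap assumption. The sketch assumes invented throughput constants and a simple bottleneck rule; the actual model, which accounts for IO contention and bandwidth sharing, is necessarily richer.

```python
# Hypothetical per-unit throughputs for an Ascend-style core. These are
# illustrative placeholders, NOT measured Ascend or Verrocchio values.
MAC_FLOPS_PER_CYCLE = 4096   # matrix MAC throughput (assumed)
VEC_ELEMS_PER_CYCLE = 128    # vector unit throughput (assumed)
IO_BYTES_PER_CYCLE = 256     # shared IO bandwidth (assumed)

def predict_cycles(matmul_flops, vector_elems, io_bytes, overlap=True):
    """Toy latency model: each hardware unit contributes a cost; units
    that run concurrently are bounded by the slowest one."""
    t_mac = matmul_flops / MAC_FLOPS_PER_CYCLE
    t_vec = vector_elems / VEC_ELEMS_PER_CYCLE
    t_io = io_bytes / IO_BYTES_PER_CYCLE
    if overlap:
        return max(t_mac, t_vec, t_io)  # pipelined: bottleneck dominates
    return t_mac + t_vec + t_io         # fully serialized execution

# Example: a 256x256x256 matmul with an elementwise epilogue, fp16 tensors.
flops = 2 * 256 ** 3
elems = 256 * 256
bytes_moved = 3 * 256 * 256 * 2  # two inputs + one output
print(predict_cycles(flops, elems, bytes_moved))  # -> 8192.0 (MAC-bound)
```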
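Second, the first optimization strategy (replacing suboptimal scalar operations with vectorized alternatives) can be illustrated generically with k-selection, the core of k-NN. The sketch below contrasts a scalar heap against a single vectorized partition, using numpy as a stand-in for a wide vector unit; it is not the SelB-k-NN algorithm, whose bitonic selection scheme this record does not detail.

```python
import heapq
import numpy as np

def k_smallest_heap(dists, k):
    """Scalar baseline: heap-based selection (branchy, element-at-a-time)."""
    return np.array(heapq.nsmallest(k, dists.tolist()))

def k_smallest_vectorized(dists, k):
    """Vectorized selection: one partition pass over the whole array,
    the kind of bulk operation that maps onto wide vector units."""
    idx = np.argpartition(dists, k)[:k]  # unordered k smallest
    return np.sort(dists[idx])

rng = np.random.default_rng(0)
d = rng.random(1_000_000).astype(np.float32)
assert np.allclose(k_smallest_heap(d, 8), k_smallest_vectorized(d, 8))
```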
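Third, the second strategy (mapping operations onto Matrix MACs): the record states only that Cube-fx maps Taylor expansions of multiple functions onto the matrix units. One natural such mapping, shown below as an assumption rather than the thesis's actual scheme, stacks the power basis of many evaluation points against the coefficient columns of many functions, so a single matrix multiplication evaluates every function at every point where Horner's method would evaluate each polynomial serially.

```python
import math
import numpy as np

def taylor_matmul(x, coeffs):
    """Evaluate several truncated Taylor series at many points with one
    matrix multiplication: powers(x) @ coefficient_columns."""
    degree = coeffs.shape[0] - 1
    powers = np.vander(x, degree + 1, increasing=True)  # powers[i, j] = x[i]**j
    return powers @ coeffs  # (points, terms) @ (terms, functions)

# Degree-7 Maclaurin coefficients (about 0) for exp, sin, cos, stacked
# column-wise so a single matmul evaluates all three functions at once.
k = np.arange(8)
inv_fact = np.array([1.0 / math.factorial(i) for i in k])
c_exp = inv_fact
c_sin = np.where(k % 2 == 1, inv_fact * (-1.0) ** (k // 2), 0.0)
c_cos = np.where(k % 2 == 0, inv_fact * (-1.0) ** (k // 2), 0.0)
C = np.stack([c_exp, c_sin, c_cos], axis=1)  # shape (8, 3)

x = np.linspace(-0.5, 0.5, 5)
approx = taylor_matmul(x, C)                            # shape (5, 3)
ref = np.stack([np.exp(x), np.sin(x), np.cos(x)], axis=1)
print(np.max(np.abs(approx - ref)))                     # tiny truncation error
```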
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Cui, H | - |
| dc.contributor.advisor | Wang, CL | - |
| dc.contributor.author | Tang, Yifeng | - |
| dc.contributor.author | 唐艺峰 | - |
| dc.date.accessioned | 2025-09-12T02:02:27Z | - |
| dc.date.available | 2025-09-12T02:02:27Z | - |
| dc.date.issued | 2024 | - |
| dc.identifier.citation | Tang, Y. [唐艺峰]. (2024). Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/360654 | - |
| dc.description.abstract | The field of Artificial Intelligence (AI) has become one of the most prominent areas of research and industrial application due to its vast impact and versatility. The development of complex AI models, which often involve extensive parameters, necessitates substantial computational power. This is where specialized hardware, known as AI processors, becomes essential. This thesis focuses on the architecture and performance of Huawei Ascend processors, a representative AI processor, and introduces novel optimization strategies to enhance algorithmic efficiency. AI processors primarily rely on matrix multiplier-accumulators (MACs), which execute matrix multiplications with remarkable computational capability. Matrix multiplications serve as the foundation for various AI algorithms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. However, complete AI applications require a broader range of operations beyond matrix multiplication. Therefore, AI processors include additional hardware units that facilitate essential operations and data transfers to enable comprehensive AI functionality. Huawei Ascend processors incorporate four key types of hardware units: Matrix MACs for matrix multiplications, IO units for intra- and inter-core data transfers, vector units for vectorized calculations, and scalar units for address calculations and branch conditions. Each unit has specific strengths and constraints that influence the overall performance and optimization of AI algorithms on Ascend processors. To gain in-depth insights into the structure and functionality of Ascend processors, we developed specialized micro-benchmarks to examine their hardware characteristics, including IO contention, bandwidth sharing, and runtime behavior. The empirical data collected enabled us to construct a performance model, Verrocchio, which accurately predicts the execution time of real-world Ascend kernels. Verrocchio's predictions achieve an average error rate of 2.62% for single-core and 2.30% for double-core executions. Notably, non-MAC units often deliver limited performance compared with Matrix MACs, potentially bottlenecking overall application efficiency. In response, we introduce two primary optimization strategies alongside Verrocchio to enhance algorithmic implementations on Ascend processors: (1) replacing suboptimal scalar or vectorized operations with more efficient alternatives, and (2) mapping certain operations to matrix multiplications where feasible. For the first optimization, we take the k-nearest neighbors (k-NN) algorithm as an example and propose SelB-k-NN (Selection-Bitonic-k-NN), which mitigates the need for suboptimal operations on large-scale datasets. SelB-k-NN delivers a 2.01x speedup over bitonic k-selection, 23.93x over the heap method, and 78.52x over the CPU-based approach. For the second optimization, we propose Cube-fx, an algorithm that maps Taylor expansions of multiple functions onto Matrix MACs. Performance evaluations show that Cube-fx surpasses the standard Taylor expansion implementation by 2.73x, CORDIC by 6.06x, and Horner's method by 1.64x. While this second strategy achieves significant performance gains by fully utilizing Matrix MACs, it is limited to operations that can be reformulated as matrix multiplications. Therefore, the first strategy remains essential for optimizing a wider range of computations on AI processors. Together, these strategies offer a comprehensive approach to maximizing efficiency on this architecture. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | High performance processors | - |
| dc.subject.lcsh | Artificial intelligence | - |
| dc.title | Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045060524103414 | - |
