
Postgraduate thesis: Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example

Title: Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example
Authors: Tang, Yifeng (唐艺峰)
Advisors: Cui, H; Wang, CL
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Tang, Y. [唐艺峰]. (2024). Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: The field of Artificial Intelligence (AI) has become one of the most prominent areas of research and industrial application due to its vast impact and versatility. The development of complex AI models, which often involve extensive parameters, necessitates substantial computational power. This is where specialized hardware, known as AI processors, becomes essential. This thesis focuses on the architecture and performance of Huawei Ascend processors, a representative family of AI processors, and introduces novel optimization strategies to enhance algorithmic efficiency.

AI processors primarily rely on matrix multiplier-accumulators (MACs), which execute matrix multiplications with remarkable computational capability. Matrix multiplications serve as the foundation for various AI algorithms, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. However, complete AI applications require a broader range of operations beyond matrix multiplication, so AI processors include additional hardware units that handle essential operations and data transfers. Huawei Ascend processors incorporate four key types of hardware units: Matrix MACs for matrix multiplications, IO units for intra- and inter-core data transfers, vector units for vectorized calculations, and scalar units for address calculations and branch conditions. Each unit has specific strengths and constraints that influence the overall performance and optimization of AI algorithms on Ascend processors.

To gain in-depth insights into the structure and functionality of Ascend processors, we developed specialized micro-benchmarks that examine their hardware characteristics, including IO contention, bandwidth sharing, and runtime behavior. The empirical data collected enabled us to construct a performance model, Verrocchio, which accurately predicts the execution time of real-world Ascend kernels, achieving an average error rate of 2.62% for single-core and 2.30% for double-core executions. Notably, the non-MAC units often exhibit limited performance compared to the Matrix MACs, potentially bottlenecking overall application efficiency.

In response, we introduce two primary optimization strategies alongside Verrocchio to enhance algorithmic implementation on Ascend processors: (1) replacing suboptimal scalar or vectorized operations with more efficient alternatives, and (2) mapping certain operations onto matrix multiplications where feasible. For the first strategy, we take the k-nearest neighbors (k-NN) algorithm as an example and propose SelB-k-NN (Selection-Bitonic-k-NN), which mitigates the need for suboptimal operations on large-scale datasets; SelB-k-NN delivers a 2.01x speedup over bitonic k-selection, 23.93x over the heap method, and 78.52x over the CPU-based approach. For the second strategy, we propose Cube-fx, an algorithm that maps the Taylor expansion of multiple functions onto Matrix MACs; Cube-fx surpasses a standard Taylor expansion implementation by 2.73x, CORDIC by 6.06x, and Horner's method by 1.64x. While the second strategy achieves significant performance gains by fully utilizing the Matrix MACs, it is limited to operations that can be reformulated as matrix multiplications, so the first strategy remains essential for optimizing a wider range of computations on AI processors. Together, these strategies offer a comprehensive approach to maximizing efficiency on this architecture.
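
To make the two strategies concrete, here is a minimal sketch of the idea behind the first one: replacing a scalar, element-by-element selection with a vectorized primitive. This is not the thesis's SelB-k-NN algorithm, only an illustration (in NumPy, with made-up sizes) of why such replacements pay off on hardware whose scalar units are weak relative to its vector units.

    import heapq
    import numpy as np

    rng = np.random.default_rng(0)
    dists = rng.random(1_000_000)   # hypothetical distances from one k-NN query
    k = 10

    # Scalar-style selection: a heap consumes elements one at a time.
    top_k_scalar = heapq.nsmallest(k, dists)

    # Vectorized selection: one partial-partition primitive over the whole array.
    top_k_vector = np.sort(np.partition(dists, k)[:k])

    # Both forms yield the same k smallest distances.
    assert np.allclose(sorted(top_k_scalar), top_k_vector)

And a minimal sketch of the idea behind the second strategy: evaluating truncated Taylor expansions of several functions at many points as a single matrix multiplication, the kind of reformulation that lets a Matrix MAC carry the work. Again, this illustrates the general technique rather than the Cube-fx algorithm itself; the truncation order and the choice of functions are arbitrary assumptions.

    import numpy as np
    from math import factorial

    N_TERMS = 12                      # truncation order (assumed for illustration)
    x = np.linspace(-1.0, 1.0, 8)     # evaluation points

    # Coefficient matrix C: one row of Maclaurin coefficients per function.
    coeffs = np.array([
        [1.0 / factorial(j) for j in range(N_TERMS)],                  # exp(x)
        [(-1.0) ** ((j - 1) // 2) / factorial(j) if j % 2 else 0.0
         for j in range(N_TERMS)],                                     # sin(x)
        [(-1.0) ** (j // 2) / factorial(j) if j % 2 == 0 else 0.0
         for j in range(N_TERMS)],                                     # cos(x)
    ])

    # Power matrix X with X[j, i] = x_i**j, shared by all functions.
    powers = np.vander(x, N_TERMS, increasing=True).T

    # One (3 x 12) @ (12 x 8) matmul approximates all three functions at once.
    approx = coeffs @ powers
    exact = np.stack([np.exp(x), np.sin(x), np.cos(x)])
    print(np.max(np.abs(approx - exact)))   # small truncation error
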
Degree: Doctor of Philosophy
Subjects: High performance processors; Artificial intelligence
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/360654

 

DC Field / Value
dc.contributor.advisor: Cui, H
dc.contributor.advisor: Wang, CL
dc.contributor.author: Tang, Yifeng
dc.contributor.author: 唐艺峰
dc.date.accessioned: 2025-09-12T02:02:27Z
dc.date.available: 2025-09-12T02:02:27Z
dc.date.issued: 2024
dc.identifier.citation: Tang, Y. [唐艺峰]. (2024). Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/360654
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: High performance processors
dc.subject.lcsh: Artificial intelligence
dc.title: Hardware-aware algorithm optimization of AI processors : Huawei Ascend as an example
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2025
dc.identifier.mmsid: 991045060524103414
