MindPipe: High-performance and Carbon-efficient Four-dimensional Parallel Training System for Large AI Models


Grant Data
Project Title
MindPipe: High-performance and Carbon-efficient Four-dimensional Parallel Training System for Large AI Models
Principal Investigator
Professor Cui, Heming   (Project Coordinator (PC))
Co-Investigator(s)
Zhang Xin   (Co-Investigator)
Professor Luo Ping   (Co-Principal Investigator)
Leung Matthew   (Co-Investigator)
Duration
36 months
Start Date
2023-02-13
Amount
4214286
Conference Title
MindPipe: High-performance and Carbon-efficient Four-dimensional Parallel Training System for Large AI Models
Keywords
High performance computing, Large DNN models, High GPU ALU utilization, Pipeline parallel training, Supernet training
Discipline
Software; Artificial Intelligence and Machine Learning
HKU Project Code
R7030-22
Grant Type
Research Impact Fund (RIF) 2022/23
Funding Year
2022
Status
On-going
Objectives
(1) [vPipe, the first high-performance training system for the Pipeline Parallel dimension] Compared with existing notable Pipeline Parallel training systems (e.g., PipeDream's roughly 32% per-GPU ALU utilization, and Megatron-LM's and GPipe's roughly 50%), vPipe should achieve about 90% per-GPU ALU utilization with balanced GPU memory usage across the pipeline stages of a typical large DNN model (e.g., Google's Transformer). When more GPUs are added to the pipeline of a large DNN model, vPipe's training performance and carbon efficiency (both metrics can be derived from per-GPU ALU utilization) should (a) keep increasing almost linearly and (b) gain an even larger advantage over those of notable Pipeline Parallel systems.

(2) [AFAPAB scheduling algorithms, for enabling network-efficient, high-TFLOPS three-dimensional (Pipeline, Tensor, and Data) Parallel training] Compared with Megatron-LM (published in ACM SC 2021), the best existing three-dimensional parallel training system, our AFAPAB (All-Forward-All-Parallel-All-Backward) algorithms should enforce fine-grained three-dimensional partitions of a large DNN model and make the computation and communication tasks of all three dimensions overlap heavily with one another. When training a typical large DNN model (e.g., Transformer) on commercial clouds (e.g., Amazon Cloud), AFAPAB should greatly mitigate the network bottleneck caused by network resource contention and increase per-GPU ALU utilization by about 40% relative to Megatron-LM. AFAPAB's absolute per-GPU training throughput should reach about 40 TFLOPS on high-end NVIDIA V100 GPUs, a result not yet reported by any existing relevant system (Megatron-LM reaches merely about 29 TFLOPS under the same GPU and network settings).

(3) [SPara (Supernet Parallel), a novel dimension of parallel training for large DNN models constructed in the supernet-subnet manner] SPara should be the first training system (and a ground-breaking new parallel dimension) to support the training of general large DNN models constructed in the supernet-subnet manner; existing relevant training systems, except Microsoft's Retiarii system, cannot support such supernet-subnet models. SPara should be the first work in the literature to train multiple subnets of a DNN model "simultaneously" (i.e., achieving training performance multiple times higher than Retiarii, which can only train subnets "sequentially", one by one). SPara should deterministically resolve and schedule the DNN operators shared among different active subnets. Specifically, even given different GPU cluster settings (e.g., different numbers of GPUs) for training the same DNN model, SPara's determinism guarantee should always produce the same resultant subnets across those settings. SPara will therefore greatly ease the (a) construction, (b) debugging, and (c) analysis of large supernet-subnet models, which may nurture new research in these three areas.

(4) [EPara (Efficient Parallel), the first configuration system for achieving entropyput-optimized, efficient 4D (3D parallel + Supernet Parallel) training] Building on the first three objectives, EPara should be the first 4D parallel training configuration system. In particular, EPara should be developed by creating an online optimization algorithm that near-optimally infers important training-related configurations (e.g., learning rate, batch size, and the 4D partitions across all GPUs). EPara should maximize a new performance metric we propose, "training entropyput" (i.e., the amount of training progress made per unit of time, revealing the information-embedding efficiency of a DNN model). This metric should incorporate both the traditional "training throughput" (i.e., FLOPS per GPU) and the statistical accuracy convergence rate of a DNN model, so that the performance advantages of the first three objectives carry over to this objective. EPara's training-entropyput optimization algorithms should be the first work to efficiently find near-optimal configurations within the huge 4D configuration space. Specifically, EPara should explore entropyput-improvement opportunities in the sparsely activated supernet space of the DNN model and support typical supernet training scalability categories (e.g., the scalability of Mixture-of-Experts). Illustrative sketches of the core ideas behind these four objectives (balanced pipeline partitioning, compute/communication overlap, deterministic subnet selection, and the entropyput metric) follow below.
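
The balanced per-stage load targeted in Objective (1) hinges on splitting a model's layers into pipeline stages of roughly equal cost. The following is a minimal, generic balanced-partition sketch, not vPipe's actual algorithm; the function names and the integer per-layer costs are hypothetical.

# Minimal sketch (not vPipe's actual algorithm): split a model's layers into
# pipeline stages so that the most heavily loaded stage is as light as possible,
# one simple way to keep per-stage utilization and memory balanced.

def min_stages_needed(layer_costs, cap):
    """Stages required if no stage may exceed `cap` total cost."""
    stages, current = 1, 0
    for cost in layer_costs:
        if cost > cap:
            return float("inf")           # a single layer already exceeds the cap
        if current + cost > cap:
            stages, current = stages + 1, cost
        else:
            current += cost
    return stages

def balanced_pipeline_partition(layer_costs, num_stages):
    """Assign each layer to a stage, minimizing the maximum per-stage cost."""
    lo, hi = max(layer_costs), sum(layer_costs)
    while lo < hi:                        # binary search over the stage capacity
        mid = (lo + hi) // 2
        if min_stages_needed(layer_costs, mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1
    assignment, current, stage = [], 0, 0
    for cost in layer_costs:              # greedy packing under the optimal capacity
        if current + cost > lo and current > 0:
            stage, current = stage + 1, 0
        assignment.append(stage)
        current += cost
    return assignment

# Hypothetical integer per-layer compute costs (e.g., profiled microseconds).
print(balanced_pipeline_partition([4, 3, 7, 5, 6, 2, 8, 1], num_stages=4))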
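
Objective (2)'s gain comes largely from overlapping the communication of one layer's gradients with the computation of the next. The toy sketch below illustrates that overlap using a thread pool as a stand-in for a dedicated communication stream; backward_compute and all_reduce are hypothetical sleep-based placeholders, and this is not the AFAPAB scheduler itself.

# Toy sketch of the compute/communication overlap idea behind AFAPAB:
# while layer k's gradients are being "communicated", layer k-1's backward
# "computation" proceeds in parallel on the main thread.
import time
from concurrent.futures import ThreadPoolExecutor

def backward_compute(layer):
    """Stand-in for the backward pass of one layer (hypothetical cost)."""
    time.sleep(0.05)
    return f"grad[{layer}]"

def all_reduce(grad):
    """Stand-in for a network all-reduce of one layer's gradients."""
    time.sleep(0.05)
    return f"synced {grad}"

def train_step_overlapped(num_layers=8):
    """Overlap layer k's gradient communication with layer k-1's backward compute."""
    synced, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as comm:   # a dedicated "communication stream"
        for layer in reversed(range(num_layers)):
            grad = backward_compute(layer)            # compute runs on the main thread
            if pending is not None:
                synced.append(pending.result())       # previous all-reduce has overlapped
            pending = comm.submit(all_reduce, grad)   # launch this layer's all-reduce
        synced.append(pending.result())
    return synced

if __name__ == "__main__":
    start = time.time()
    train_step_overlapped()
    # Roughly 0.45 s here, versus about 0.8 s if every all-reduce were serialized.
    print(f"overlapped step took {time.time() - start:.2f} s")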
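
Objective (3)'s determinism guarantee means the choice of active subnets never depends on the cluster layout. The sketch below illustrates that property under an assumed toy supernet; the SUPERNET dictionary and operator names are illustrative, and this is not SPara's actual scheduler.

# Minimal sketch of the determinism property claimed for SPara: the subnets
# activated at each global step are derived only from a fixed seed and the step
# index, never from the cluster layout, so any number of GPUs reproduces exactly
# the same subnet sequence.
import random

# Hypothetical supernet: every layer offers several candidate operators.
SUPERNET = {f"layer{i}": ["conv3x3", "conv5x5", "identity"] for i in range(4)}

def subnets_for_step(seed, step, num_active_subnets=2):
    """Derive the active subnets from the seed and global step only."""
    rng = random.Random(f"{seed}:{step}")              # keyed only by seed and step
    return [
        {layer: rng.choice(ops) for layer, ops in SUPERNET.items()}
        for _ in range(num_active_subnets)
    ]

# Re-deriving the subnets for the same step always yields the same choices, whether
# the job runs on 4 GPUs or 64, because the GPU count never enters the derivation.
assert subnets_for_step(seed=7, step=100) == subnets_for_step(seed=7, step=100)
print(subnets_for_step(seed=7, step=100)[0])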
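
Objective (4)'s "training entropyput" couples raw throughput with how quickly the loss actually falls. Since the exact formulation is part of the proposed research, the function below is only one plausible, hedged reading of "training progress per time unit"; its arguments and the sample numbers are illustrative.

# Hedged sketch of one way a training-entropyput-style metric could be computed:
# loss reduction achieved per second of wall-clock training time, which couples
# raw throughput with the statistical convergence rate.
def training_entropyput(loss_before, loss_after, samples_processed, seconds_elapsed):
    """Loss reduction achieved per second of wall-clock training time."""
    convergence_per_sample = (loss_before - loss_after) / samples_processed
    throughput = samples_processed / seconds_elapsed    # samples per second
    return convergence_per_sample * throughput          # progress per second

# A configuration with higher raw throughput but weaker convergence can lose to a
# slower but better-converging one under this metric (numbers are made up).
fast_but_noisy  = training_entropyput(2.30, 2.25, samples_processed=8192, seconds_elapsed=10)
slow_but_stable = training_entropyput(2.30, 2.10, samples_processed=4096, seconds_elapsed=10)
print(fast_but_noisy, slow_but_stable)                  # about 0.005 vs 0.02 loss units per second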