
Postgraduate thesis: Optimal resource scheduling algorithms in distributed systems

Title: Optimal resource scheduling algorithms in distributed systems
Authors: Luo, Ziyue (罗子越)
Advisor(s): Wu, C
Issue Date: 2022
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Luo, Z. [罗子越]. (2022). Optimal resource scheduling algorithms in distributed systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Distributed systems comprise networked machines that cooperate to execute computing tasks beyond the capabilities of a single machine, such as distributed machine learning (ML) training and computer network services. Efficient resource scheduling is key to maximizing system efficiency and minimizing operational costs, yet designing such scheduling algorithms is intrinsically challenging due to unknown future workloads, diverse resource demands, sophisticated machine interconnections, and complex task dependencies. This thesis proposes a series of resource scheduling algorithms for two typical distributed systems: the distributed network function virtualization (NFV) system and the distributed ML training system.

For NFV systems, we first investigate serving multiple virtual network function (VNF) chains in a cloud data center, accounting for VNF deployment and operational expenses subject to resource capacity constraints. To adapt VNF deployment dynamically, we present a randomized online scaling algorithm that decouples the original online problem into a series of regularized sub-problems and rounds their fractional solutions to integer solutions via dependent rounding. Our design achieves a good expected competitive ratio against the offline optimum while violating the resource capacity constraints by only a relatively small constant. Trace-driven simulation validates the analytical results and demonstrates the strong performance of our design.

Next, we study the dynamic scaling of VNF chains in geo-distributed NFV systems. We design a deep reinforcement learning (DRL) framework for scaling multiple geo-distributed VNF chains: a recurrent neural network serves as the traffic model and produces traffic predictions, and a DRL framework based on asynchronous actor-critic learning takes those predictions as input and makes VNF placement decisions online. Trace-driven simulation results demonstrate that our design quickly adapts to traffic dynamics and reduces system costs by up to 42% compared with representative online optimization algorithms.

For distributed ML training in ML clusters, we first exploit pipeline parallelism to expedite the training of large deep neural networks (DNNs) over arbitrary inter-GPU connectivity. To minimize per-iteration training time, we propose an efficient, near-optimal synchronous pipeline planning algorithm consisting of a dynamic-programming-based pipeline partition and device mapping algorithm and an execution scheduler that determines the order of microbatch execution. Evaluations on two testbeds show that our design speeds up the training of representative DNN models by up to 157% compared with representative baselines.

Finally, we study distributed graph neural network (GNN) training optimization. We design an online scheduling algorithm with a good competitive ratio that minimizes overall training time by scheduling the execution of GNN training tasks and the flow transmissions between them. We further propose an exploratory task placement scheme that adopts the Markov Chain Monte Carlo (MCMC) search framework to make training task placement decisions. Thorough testbed experiments and simulation studies show that our design achieves up to a 67% speed-up over representative baselines.
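To make the first contribution concrete: dependent rounding is a standard way to convert a fractional scaling decision into an integer one while preserving each coordinate's marginal in expectation and keeping the coordinate sum essentially fixed. Below is a minimal Python sketch of pairwise dependent rounding in the style of Gandhi et al., assuming the fractional solution lies in [0, 1]^n; it illustrates the general technique only, not the thesis's exact procedure.

```python
import random

def dependent_round(x, eps=1e-9):
    # Pairwise dependent rounding: repeatedly pick two fractional
    # coordinates and shift mass between them so at least one becomes
    # integral, preserving E[x_i] and the pair sum x_i + x_j.
    x = list(x)
    frac = [i for i, v in enumerate(x) if eps < v < 1 - eps]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        alpha = min(1 - x[i], x[j])  # max mass movable toward x[i]
        beta = min(x[i], 1 - x[j])   # max mass movable toward x[j]
        if random.random() < beta / (alpha + beta):
            x[i] += alpha; x[j] -= alpha
        else:
            x[i] -= beta; x[j] += beta
        frac = [k for k in frac if eps < x[k] < 1 - eps]
    # At most one fractional coordinate remains; round it independently.
    return [1 if random.random() < v else 0 for v in x]
```

For example, rounding [0.5, 0.5, 0.5, 0.5] always yields a vector with exactly two ones, since the coordinate sum is preserved at every pairwise step.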
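In the geo-distributed scaling work, the recurrent traffic model supplies the DRL agent with forecasts of upcoming traffic. The NumPy sketch below shows a vanilla-RNN one-step-ahead forecaster in that role; the weight matrices Wx, Wh, Wo and biases bh, bo are hypothetical placeholders, and neither the thesis's actual traffic model nor the asynchronous actor-critic agent that consumes its output is reproduced here.

```python
import numpy as np

def rnn_traffic_forecast(history, Wx, Wh, Wo, bh, bo):
    # history: (T, m) array of per-chain traffic over the last T slots.
    # Hypothetical shapes: Wx (d, m), Wh (d, d), bh (d,), Wo (m, d), bo (m,).
    h = np.zeros(Wh.shape[0])
    for x_t in history:                       # unroll over past slots
        h = np.tanh(Wx @ x_t + Wh @ h + bh)   # recurrent state update
    return Wo @ h + bo                        # predicted next-slot traffic
```

The prediction vector would then form part of the DRL agent's observed state when it chooses VNF placements for the next time slot.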
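The pipeline planner's core is a dynamic program over contiguous layer partitions. The sketch below, under the simplifying assumptions of homogeneous devices and negligible inter-stage communication, computes the best achievable bottleneck stage time, which bounds the steady-state per-microbatch period; the thesis's full algorithm additionally handles device mapping over arbitrary inter-GPU connectivity and microbatch ordering.

```python
from functools import lru_cache

def best_bottleneck(layer_times, num_stages):
    # Partition a layer chain into `num_stages` contiguous stages so the
    # slowest stage is as fast as possible. DP over split points.
    n = len(layer_times)
    assert 1 <= num_stages <= n
    prefix = [0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)
    span = lambda i, j: prefix[j] - prefix[i]   # total time of layers [i, j)

    @lru_cache(maxsize=None)
    def f(i, s):
        # Minimum bottleneck when layers [i, n) are split into s stages.
        if s == 1:
            return span(i, n)
        return min(max(span(i, j), f(j, s - 1))
                   for j in range(i + 1, n - s + 2))

    return f(0, num_stages)
```

For instance, best_bottleneck([4, 2, 3, 3], 2) returns 6, splitting after the second layer so both stages take 6 time units.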
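The exploratory task placement scheme for GNN training searches the placement space stochastically. A compact Metropolis-Hastings-style sketch is given below; `cost` is a hypothetical callable estimating end-to-end training time for a {task: device} mapping, and the single-task-move proposal and fixed temperature are simplifications of the thesis's MCMC search framework.

```python
import math
import random

def mcmc_placement(tasks, devices, cost, iters=10_000, temp=1.0):
    # Random-walk search over task placements: propose moving one task,
    # always accept improvements, accept regressions with Boltzmann prob.
    placement = {t: random.choice(devices) for t in tasks}
    cur = cost(placement)
    best, best_cost = dict(placement), cur
    for _ in range(iters):
        t = random.choice(tasks)
        old = placement[t]
        placement[t] = random.choice(devices)   # propose a move
        new = cost(placement)
        if new <= cur or random.random() < math.exp((cur - new) / temp):
            cur = new                           # accept the proposal
            if new < best_cost:
                best, best_cost = dict(placement), new
        else:
            placement[t] = old                  # reject: revert the move
    return best, best_cost
```

In practice such searches often anneal `temp` toward zero over the iterations so that early exploration gives way to late exploitation.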
Degree: Doctor of Philosophy
Subject: Computer networks; Machine learning; Distributed artificial intelligence
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/322838

 

DC Field / Value
dc.contributor.advisor: Wu, C
dc.contributor.author: Luo, Ziyue
dc.contributor.author: 罗子越
dc.date.accessioned: 2022-11-18T10:41:00Z
dc.date.available: 2022-11-18T10:41:00Z
dc.date.issued: 2022
dc.identifier.citation: Luo, Z. [罗子越]. (2022). Optimal resource scheduling algorithms in distributed systems. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/322838
dc.description.abstract: (abstract as above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer networks
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Distributed artificial intelligence
dc.title: Optimal resource scheduling algorithms in distributed systems
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2022
dc.identifier.mmsid: 991044609104803414
