Conference Paper: ROG: A High Performance and Robust Distributed Training System for Robotic IoT

Title: ROG: A High Performance and Robust Distributed Training System for Robotic IoT
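Authors: Guan, Xiuxian; Sun, Zekai; Deng, Shengliang; Chen, Xusheng; Zhao, Shixiong; Zhang, Zongyuan; Duan, Tianyang; Wang, Yuexuan; Wu, Chenshu; Cui, Yong; Zhang, Libo; Wu, Yanjun; Wang, Rui; Cui, Heming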
Keywords: distributed training; energy efficient; robust; training throughput; wireless networks
Issue Date: 5-Oct-2022
Abstract

Critical robotic tasks such as rescue and disaster response increasingly rely on ML (Machine Learning) models deployed on a team of wireless robots. Data parallel (DP) training over the Internet of Things formed by these robots (robotic IoT) can harness their distributed hardware resources to adapt the models to changing environments as soon as possible. Unfortunately, because DP requires synchronization across all robots, the instability of wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often causes robots to stall severely, which reduces the training accuracy achievable within a tight time budget and wastes energy while stalling. Existing methods for coping with the instability of datacenter networks cannot handle this straggler effect. This is because they schedule transmissions at model granularity, which is much coarser than the granularity of transient network instability in real-world robotic IoT networks, so a previously computed schedule no longer matches the bandwidth as it varies during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer’s parameters and schedules the transmission of each row adaptively according to the fluctuating bandwidth. In this way, the ML training process can apply a partial update consisting of the most important gradients of a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved a training accuracy gain of about 4.9% to 6.5% compared with the baselines and saved 20.4% to 50.7% of the energy needed to reach the same training accuracy.
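The row-granulated idea in the abstract can be illustrated with a minimal sketch. This is not ROG's implementation: the abstract does not say how row importance is measured or how bandwidth is estimated, so the L2-norm importance score, the byte budget, and the helper names `row_norm`, `select_rows`, and `apply_partial_update` are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def row_norm(row: Sequence[float]) -> float:
    """L2 norm of one gradient row (an assumed importance score)."""
    return math.sqrt(sum(g * g for g in row))


def select_rows(grad: List[List[float]], budget_bytes: int,
                bytes_per_value: int = 4) -> List[int]:
    """Return indices of the rows to send this round.

    Rows are visited from highest to lowest assumed importance and added
    until the next row would exceed the byte budget. The budget stands in
    for "what the current bandwidth allows" (hypothetical input).
    """
    order = sorted(range(len(grad)), key=lambda i: row_norm(grad[i]),
                   reverse=True)
    chosen: List[int] = []
    used = 0
    for i in order:
        cost = len(grad[i]) * bytes_per_value
        if used + cost > budget_bytes:
            break  # rows of one layer have equal size, so stop here
        chosen.append(i)
        used += cost
    return chosen


def apply_partial_update(params: List[List[float]], grad: List[List[float]],
                         rows: List[int], lr: float) -> List[List[float]]:
    """Apply a gradient step only to the rows that were transmitted."""
    for i in rows:
        params[i] = [p - lr * g for p, g in zip(params[i], grad[i])]
    return params


# Usage: a 3-row layer, with room for only two rows in this round.
grad = [[0.1, 0.1], [3.0, 4.0], [1.0, 0.0]]
sent = select_rows(grad, budget_bytes=16)
```

Rows that do not fit in the budget are simply left for a later round in this sketch; how the actual system bounds staleness and guarantees convergence is described in the paper, not here.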


Persistent Identifier: http://hdl.handle.net/10722/333867
ISI Accession Number ID: WOS:000886530600020

 

DC Field | Value | Language
dc.contributor.author | Guan, Xiuxian | -
dc.contributor.author | Sun, Zekai | -
dc.contributor.author | Deng, Shengliang | -
dc.contributor.author | Chen, Xusheng | -
dc.contributor.author | Zhao, Shixiong | -
dc.contributor.author | Zhang, Zongyuan | -
dc.contributor.author | Duan, Tianyang | -
dc.contributor.author | Wang, Yuexuan | -
dc.contributor.author | Wu, Chenshu | -
dc.contributor.author | Cui, Yong | -
dc.contributor.author | Zhang, Libo | -
dc.contributor.author | Wu, Yanjun | -
dc.contributor.author | Wang, Rui | -
dc.contributor.author | Cui, Heming | -
dc.date.accessioned | 2023-10-06T08:39:44Z | -
dc.date.available | 2023-10-06T08:39:44Z | -
dc.date.issued | 2022-10-05 | -
dc.identifier.uri | http://hdl.handle.net/10722/333867 | -
dc.language | eng | -
dc.relation.ispartof | 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO) (01/10/2022-05/10/2022, Chicago) | -
dc.subject | distributed training | -
dc.subject | energy efficient | -
dc.subject | robust | -
dc.subject | training throughput | -
dc.subject | wireless networks | -
dc.title | ROG: A High Performance and Robust Distributed Training System for Robotic IoT | -
dc.type | Conference_Paper | -
dc.identifier.doi | 10.1109/MICRO56248.2022.00032 | -
dc.identifier.scopus | eid_2-s2.0-85141723019 | -
dc.identifier.volume | 2022-October | -
dc.identifier.spage | 336 | -
dc.identifier.epage | 353 | -
dc.identifier.isi | WOS:000886530600020 | -
