Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1109/MICRO56248.2022.00032
- Scopus: eid_2-s2.0-85141723019
- WOS: WOS:000886530600020
Conference Paper: ROG: A High Performance and Robust Distributed Training System for Robotic IoT
Title | ROG: A High Performance and Robust Distributed Training System for Robotic IoT |
---|---|
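Authors | Guan, Xiuxian; Sun, Zekai; Deng, Shengliang; Chen, Xusheng; Zhao, Shixiong; Zhang, Zongyuan; Duan, Tianyang; Wang, Yuexuan; Wu, Chenshu; Cui, Yong; Zhang, Libo; Wu, Yanjun; Wang, Rui; Cui, Heming |
Keywords | distributed training; energy efficient; robust; training throughput; wireless networks |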
Issue Date | 5-Oct-2022 |
Abstract | Critical robotic tasks such as rescue and disaster response increasingly leverage ML (Machine Learning) models deployed on a team of wireless robots. Data parallel (DP) training over the Internet of Things formed by these robots (robotic IoT) can harness their distributed hardware resources to adapt the models to changing environments as soon as possible. Unfortunately, because DP requires synchronization across all robots, the instability of wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often causes robots to stall severely, which reduces the training accuracy achievable within a tight time budget and wastes energy during the stalls. Existing methods for coping with instability in datacenter networks cannot handle this straggler effect, because they schedule transmission at model granularity. That is much coarser than the granularity of transient network instability in real-world robotic IoT networks, so a previously computed schedule no longer matches the varying bandwidth during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer’s parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way, the ML training process can apply a partial set of the most important gradients from a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved a training accuracy gain of about 4.9% to 6.5% over the baselines and saved 20.4% to 50.7% of the energy needed to reach the same training accuracy. |
Persistent Identifier | http://hdl.handle.net/10722/333867 |
ISI Accession Number ID | WOS:000886530600020 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Guan, Xiuxian | - |
dc.contributor.author | Sun, Zekai | - |
dc.contributor.author | Deng, Shengliang | - |
dc.contributor.author | Chen, Xusheng | - |
dc.contributor.author | Zhao, Shixiong | - |
dc.contributor.author | Zhang, Zongyuan | - |
dc.contributor.author | Duan, Tianyang | - |
dc.contributor.author | Wang, Yuexuan | - |
dc.contributor.author | Wu, Chenshu | - |
dc.contributor.author | Cui, Yong | - |
dc.contributor.author | Zhang, Libo | - |
dc.contributor.author | Wu, Yanjun | - |
dc.contributor.author | Wang, Rui | - |
dc.contributor.author | Cui, Heming | - |
dc.date.accessioned | 2023-10-06T08:39:44Z | - |
dc.date.available | 2023-10-06T08:39:44Z | - |
dc.date.issued | 2022-10-05 | - |
dc.identifier.uri | http://hdl.handle.net/10722/333867 | - |
dc.description.abstract | Critical robotic tasks such as rescue and disaster response increasingly leverage ML (Machine Learning) models deployed on a team of wireless robots. Data parallel (DP) training over the Internet of Things formed by these robots (robotic IoT) can harness their distributed hardware resources to adapt the models to changing environments as soon as possible. Unfortunately, because DP requires synchronization across all robots, the instability of wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often causes robots to stall severely, which reduces the training accuracy achievable within a tight time budget and wastes energy during the stalls. Existing methods for coping with instability in datacenter networks cannot handle this straggler effect, because they schedule transmission at model granularity. That is much coarser than the granularity of transient network instability in real-world robotic IoT networks, so a previously computed schedule no longer matches the varying bandwidth during transmission. We present ROG, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. ROG confines the granularity of transmission and synchronization to each row of a layer’s parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way, the ML training process can apply a partial set of the most important gradients from a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, ROG achieved a training accuracy gain of about 4.9% to 6.5% over the baselines and saved 20.4% to 50.7% of the energy needed to reach the same training accuracy. | - |
dc.language | eng | - |
dc.relation.ispartof | 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO) (01/10/2022-05/10/2022, Chicago) | - |
dc.subject | distributed training | - |
dc.subject | energy efficient | - |
dc.subject | robust | - |
dc.subject | training throughput | - |
dc.subject | wireless networks | - |
dc.title | ROG: A High Performance and Robust Distributed Training System for Robotic IoT | - |
dc.type | Conference_Paper | - |
dc.identifier.doi | 10.1109/MICRO56248.2022.00032 | - |
dc.identifier.scopus | eid_2-s2.0-85141723019 | - |
dc.identifier.volume | 2022-October | - |
dc.identifier.spage | 336 | - |
dc.identifier.epage | 353 | - |
dc.identifier.isi | WOS:000886530600020 | - |