Improved Tree Gradient Coding with Non-uniform Computation Load


Date

2021


Abstract

The scaling-up of distributed machine learning systems faces two major bottlenecks: delays due to stragglers and limited communication bandwidth. Gradient Coding (GC) was proposed to mitigate stragglers in distributed learning. A major drawback of the master-worker architecture in GC is the bandwidth contention at the master, which grows with the cluster size. Tree Gradient Coding (TGC), in which the workers are arranged in a regular tree topology, reduces this bandwidth contention at the master. In TGC, each node is allocated the same amount of data for computation, and each node communicates only with its immediate parent. In this paper, an improvement in the completion time of TGC is achieved by allocating different computation loads to the nodes at different levels, taking into account the communication delay from the nodes in the lower layers to the upper layers. The computation load at each node is derived with this communication delay accounted for and is proven to be optimal. Furthermore, the improvement in completion time is demonstrated by implementing the proposed scheme on Amazon EC2 servers. © 2021 IEEE.
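
The paper's optimal per-level load allocation is not reproduced here; the Python sketch below only illustrates the general idea of non-uniform loads in a worker tree under a hypothetical linear communication-delay model. The function name, parameters (compute_rate, hop_delay), and the equal-finish-time heuristic are all assumptions for illustration, not the scheme derived in the paper.

def allocate_loads(depth, branching, total_data, compute_rate, hop_delay):
    """Return a per-level data load (samples per worker) for a regular tree.

    Toy model (assumption): a worker at level l (l = 1 for the master's
    children) finishes relaying at roughly load[l] / compute_rate + l * hop_delay,
    so deeper workers receive less data to offset their extra relay hops.
    Loads decrease linearly with depth and are scaled so the whole cluster
    processes total_data samples.
    """
    workers_per_level = [branching ** l for l in range(1, depth + 1)]
    # Unnormalized loads, floored at a small positive value so every
    # worker still computes a nonzero share.
    raw = [max(1.0 - compute_rate * hop_delay * (l - 1), 0.1)
           for l in range(1, depth + 1)]
    total_raw = sum(w * r for w, r in zip(workers_per_level, raw))
    scale = total_data / total_raw
    return [r * scale for r in raw]


if __name__ == "__main__":
    # Example: a binary worker tree of depth 3 and 140,000 samples in total.
    loads = allocate_loads(depth=3, branching=2, total_data=140_000,
                           compute_rate=1.0, hop_delay=0.2)
    for level, load in enumerate(loads, start=1):
        print(f"level {level}: {load:.0f} samples per worker")

Running the example assigns 14,000 samples to each level-1 worker, 11,200 to each level-2 worker, and 8,400 to each leaf, summing to the full dataset; the actual optimal allocation in the paper depends on its specific delay model.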
