-
Hierarchical Coded Gradient Aggregation Based on Layered MDS Codes
Authors:
M. Nikhil Krishnan,
Anoop Thomas,
Birenjith Sasidharan
Abstract:
The growing privacy concerns and the communication costs associated with transmitting raw data have resulted in techniques like federated learning, where the machine learning models are trained at the edge nodes, and the parameter updates are shared with a central server. Because communications from the edge nodes are often unreliable, a hierarchical setup involving intermediate helper nodes is considered. The communication links between the edges and the helper nodes are error-prone and are modeled as straggling/failing links. To overcome the issue of link failures, coding techniques are proposed. The edge nodes communicate encoded versions of the model updates to the helper nodes, which pass them on to the master after suitable aggregation. The primary work in this area uses repetition codes and Maximum Distance Separable (MDS) codes at the edge nodes to arrive at the Aligned Repetition Coding (ARC) and Aligned MDS Coding (AMC) schemes, respectively. We propose using vector codes, specifically a family of layered MDS codes parameterized by a variable $ν$, at the edge nodes. For the proposed family of codes, suitable aggregation strategies at the helper nodes are also developed. At the extreme values of $ν$, our scheme matches the communication costs incurred by the ARC and AMC schemes, resulting in a graceful transition between these schemes.
Submitted 23 November, 2023;
originally announced November 2023.
-
Explicit Information-Debt-Optimal Streaming Codes With Small Memory
Authors:
M. Nikhil Krishnan,
Myna Vajha,
Vinayak Ramkumar,
P. Vijay Kumar
Abstract:
For a convolutional code in the presence of a symbol erasure channel, the information debt $I(t)$ at time $t$ provides a measure of the number of additional code symbols required to recover all message symbols up to time $t$. Information-debt-optimal streaming ($i$DOS) codes are convolutional codes which allow for the recovery of all message symbols up to $t$ whenever $I(t)$ turns zero under the following conditions: (i) information debt can be non-zero for at most $τ$ consecutive time slots and (ii) information debt never increases beyond a particular threshold. The existence of periodically-time-varying $i$DOS codes is known for all parameters. In this paper, we address the problem of constructing explicit, time-invariant $i$DOS codes. We present an explicit time-invariant construction of $i$DOS codes for the unit memory ($m=1$) case. It is also shown that a construction method for convolutional codes due to Almeida et al. leads to explicit time-invariant $i$DOS codes for all parameters. However, this general construction requires a larger field size than the first construction for the $m=1$ case.
Submitted 10 May, 2023;
originally announced May 2023.
-
Sequential Gradient Coding For Straggler Mitigation
Authors:
M. Nikhil Krishnan,
MohammadReza Ebrahimi,
Ashish Khisti
Abstract:
In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$, where processing of each gradient $g(t)$ starts in round-$t$ and finishes by round-$(t+T)$. Here $T\geq 0$ denotes a delay parameter. For the GC scheme, coding is only across computing nodes and this results in a solution where $T=0$. On the other hand, having $T>0$ allows for designing schemes which exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition for the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.
Submitted 28 June, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Adaptive relaying for streaming erasure codes in a three node relay network
Authors:
Gustavo Kasper Facenda,
M. Nikhil Krishnan,
Elad Domanovitz,
Silas L. Fong,
Ashish Khisti,
Wai-Tian Tan,
John Apostolopoulos
Abstract:
This paper investigates adaptive streaming codes over a three-node relayed network. In this setting, a source node transmits a sequence of message packets to a destination through a relay. The source-to-relay and relay-to-destination links are unreliable and introduce at most $N_1$ and $N_2$ packet erasures, respectively. The destination node must recover each message packet within a strict delay constraint $T$. The paper presents achievable streaming codes for all feasible parameters $\{N_1, N_2, T\}$ that exploit the fact that the relay naturally observes the erasure pattern occurring in the link from source to relay, so that it can adapt its relaying strategy based on these observations. In a recent work, Fong et al. provide streaming codes featuring channel-state-independent relaying strategies. The codes proposed in this paper achieve rates higher than the ones proposed by Fong et al. whenever $N_2 > N_1$, and achieve the same rate when $N_2 = N_1$. The paper also presents an upper bound on the achievable rate that takes into account erasures in both links in order to bound the rate in the second link. The upper bound is shown to be tighter than a trivial bound that considers only the erasures in the second link.
Submitted 9 March, 2022;
originally announced March 2022.
-
Explicit Rate-Optimal Streaming Codes with Smaller Field Size
Authors:
Myna Vajha,
Vinayak Ramkumar,
M. Nikhil Krishnan,
P. Vijay Kumar
Abstract:
Streaming codes are a class of packet-level erasure codes that ensure packet recovery over a sliding window channel which allows either a burst erasure of size $b$ or $a$ random erasures within any window of size $(τ+1)$ time units, under a strict decoding-delay constraint $τ$. The field size over which streaming codes are constructed is an important factor determining the complexity of implementation. The best known explicit rate-optimal streaming code requires a field size of $q^2$ where $q \ge τ+b-a$ is a prime power. In this work, we present an explicit rate-optimal streaming code, for all possible $\{a,b,τ\}$ parameters, over a field of size $q^2$ for prime power $q \ge τ$. This is the smallest-known field size of a general explicit rate-optimal construction that covers all $\{a,b,τ\}$ parameter sets. We achieve this by modifying the non-explicit code construction due to Krishnan et al. to make it explicit, without change in field size.
Submitted 10 May, 2021;
originally announced May 2021.
-
Codes for Distributed Storage
Authors:
Vinayak Ramkumar,
Myna Vajha,
S. B. Balaji,
M. Nikhil Krishnan,
Birenjith Sasidharan,
P. Vijay Kumar
Abstract:
This chapter deals with the topic of designing reliable and efficient codes for the storage and retrieval of large quantities of data over storage devices that are prone to failure. For a long time, the traditional objective has been one of ensuring reliability against data loss while minimizing storage overhead. More recently, a third concern has surfaced, namely the need to efficiently recover from the failure of a single storage unit, corresponding to recovery from the erasure of a single code symbol. We explain here how coding theory has evolved to tackle this fresh challenge.
Submitted 3 October, 2020;
originally announced October 2020.
-
Staggered Diagonal Embedding Based Linear Field Size Streaming Codes
Authors:
Vinayak Ramkumar,
Myna Vajha,
M. Nikhil Krishnan,
P. Vijay Kumar
Abstract:
An $(a,b,τ)$ streaming code is a packet-level erasure code that can recover, under a strict delay constraint of $τ$ time units, from either a burst of $b$ erasures or else $a$ random erasures, occurring within a sliding window of time duration $w$. While rate-optimal constructions of such streaming codes are available for all parameters $\{a,b,τ,w\}$ in the literature, they require, in most instances, a quadratic, $O(τ^2)$ field size. In this work, we make further progress towards field size reduction and present rate-optimal $O(τ)$ field size streaming codes for two regimes: (i) $\gcd(b,τ+1-a)\ge a$ and (ii) $τ+1 \ge a+b$ and $b \bmod a \in \{0,a-1\}$.
Submitted 14 May, 2020;
originally announced May 2020.
-
Low Field-size, Rate-Optimal Streaming Codes for Channels With Burst and Random Erasures
Authors:
M. Nikhil Krishnan,
Deeptanshu Shukla,
P. Vijay Kumar
Abstract:
In this paper, we design erasure-correcting codes for channels with burst and random erasures, when a strict decoding delay constraint is in place. We consider the sliding-window-based packet erasure model proposed by Badr et al., where any time-window of width $w$ contains either up to $a$ random erasures or an erasure burst of length at most $b$. One needs to recover any erased packet, where erasures are as per the channel model, with a strict decoding delay deadline of $τ$ time slots. Existing rate-optimal constructions in the literature require, in general, a field-size which grows exponentially in $τ$, for a constant $\frac{a}{τ}$. In this work, we present a new rate-optimal code construction covering all channel and delay parameters, which requires an $O(τ^2)$ field-size. As a special case, when $(b-a)=1$, we have a field-size linear in $τ$. We also present three other constructions having linear field-size, under certain constraints on channel and decoding delay parameters. As a corollary, we obtain low field-size, rate-optimal convolutional codes for any given column distance and column span. Simulations indicate that the newly proposed streaming code constructions offer lower packet-loss probabilities compared to existing schemes, for selected instances of Gilbert-Elliott and Fritchman channels.
Submitted 14 March, 2019;
originally announced March 2019.
-
Erasure Coding for Distributed Storage: An Overview
Authors:
S. B. Balaji,
M. Nikhil Krishnan,
Myna Vajha,
Vinayak Ramkumar,
Birenjith Sasidharan,
P. Vijay Kumar
Abstract:
In a distributed storage system, code symbols are dispersed across space in nodes or storage units as opposed to time. In settings such as that of a large data center, an important consideration is the efficient repair of a failed node. Efficient repair calls for erasure codes that, in the face of node failure, are efficient in terms of minimizing the amount of repair data transferred over the network, the amount of data accessed at a helper node, as well as the number of helper nodes contacted. Coding theory has evolved to handle these challenges by introducing two new classes of erasure codes, namely regenerating codes and locally recoverable codes, as well as by coming up with novel ways to repair the ubiquitous Reed-Solomon code. This survey provides an overview of the efforts in this direction that have taken place over the past decade.
Submitted 12 June, 2018;
originally announced June 2018.
-
Codes with Combined Locality and Regeneration Having Optimal Rate, $d_{\text{min}}$ and Linear Field Size
Authors:
M. Nikhil Krishnan,
Anantha Narayanan R.,
P. Vijay Kumar
Abstract:
In this paper, we study vector codes with all-symbol locality, where the local code is either a Minimum Bandwidth Regenerating (MBR) code or a Minimum Storage Regenerating (MSR) code. In the first part, we present vector codes with all-symbol MBR locality, for all parameters, that have both optimal minimum-distance and optimal rate. These codes combine ideas from two popular codes in the distributed storage literature, Product-Matrix codes and Tamo-Barg codes. In the second part which deals with codes having all-symbol MSR locality, we follow a Pairwise Coupling Transform-based approach to arrive at optimal minimum-distance and optimal rate, for a range of parameters. All the code constructions presented in this paper have a low field-size that grows linearly with the code-length $n$.
Submitted 2 April, 2018;
originally announced April 2018.
-
Rate-Optimal Streaming Codes for Channels with Burst and Isolated Erasures
Authors:
M. Nikhil Krishnan,
P. Vijay Kumar
Abstract:
Recovery of data packets from packet erasures in a timely manner is critical for many streaming applications. An early paper by Martinian and Sundberg introduced a framework for streaming codes and designed rate-optimal codes that permit delay-constrained recovery from an erasure burst of length up to $B$. A recent work by Badr et al. extended this result and introduced a sliding-window channel model $\mathcal{C}(N,B,W)$. Under this model, in a sliding-window of width $W$, one of the following erasure patterns is possible: (i) a burst of length at most $B$ or (ii) at most $N$ (possibly non-contiguous) arbitrary erasures. Badr et al. obtained a rate upper bound for streaming codes that can recover with a time delay $T$, from any erasure pattern permissible under the $\mathcal{C}(N,B,W)$ model. However, constructions matching the bound were absent, except for a few parameter sets. In this paper, we present an explicit family of codes that achieves the rate upper bound for all feasible parameters $N$, $B$, $W$ and $T$.
Submitted 17 January, 2018;
originally announced January 2018.
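As an illustration of the sliding-window channel model $\mathcal{C}(N,B,W)$ described in the abstract above, the following minimal Python sketch checks whether a given erasure pattern is admissible. It is not taken from the paper: the function names are hypothetical, and it assumes that "a burst of length at most $B$" means all erasures in the window lie within $B$ consecutive time slots.

```python
from typing import Sequence


def window_is_admissible(window: Sequence[int], n: int, b: int) -> bool:
    """Check one window: at most n erasures, or all erasures within b consecutive slots.

    `window` holds 1 for an erased packet and 0 for a received packet.
    The burst interpretation (span of erasures <= b) is an assumption of this sketch.
    """
    positions = [i for i, erased in enumerate(window) if erased]
    if len(positions) <= n:
        return True  # case (ii): at most N arbitrary erasures
    span = positions[-1] - positions[0] + 1
    return span <= b  # case (i): a single burst of length at most B


def pattern_is_admissible(erasures: Sequence[int], n: int, b: int, w: int) -> bool:
    """Check every sliding window of width w over the erasure pattern."""
    num_windows = max(1, len(erasures) - w + 1)
    return all(
        window_is_admissible(erasures[start:start + w], n, b)
        for start in range(num_windows)
    )


if __name__ == "__main__":
    # A burst of 3 erasures is allowed when B = 3, even though it exceeds N = 2.
    print(pattern_is_admissible([1, 1, 1, 0, 0, 0, 0, 0], n=2, b=3, w=5))
    # Three spread-out erasures in one window exceed N = 2 and do not fit in a burst of 3.
    print(pattern_is_admissible([1, 0, 1, 0, 1], n=2, b=3, w=5))
```

A streaming code for this model must recover every erased packet within delay $T$ for all patterns that such a check accepts; the check itself says nothing about how the code is constructed.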
-
A Study on the Impact of Locality in the Decoding of Binary Cyclic Codes
Authors:
M. Nikhil Krishnan,
Bhagyashree Puranik,
P. Vijay Kumar,
Itzhak Tamo,
Alexander Barg
Abstract:
In this paper, we study the impact of locality on the decoding of binary cyclic codes under two approaches, namely ordered statistics decoding (OSD) and trellis decoding. Given a binary cyclic code having locality or availability, we suitably modify the OSD to obtain gains in terms of the signal-to-noise ratio (SNR), for a given reliability and essentially the same level of decoder complexity. With regard to trellis decoding, we show that careful introduction of locality results in the creation of cyclic subcodes having lower maximum state complexity. We also present a simple upper-bounding technique on the state complexity profile, based on the zeros of the code. Finally, it is shown how the decoding speed can be significantly increased in the presence of locality, in the moderate-to-high SNR regime, by making use of a quick-look decoder that often returns the ML codeword.
Submitted 13 February, 2017;
originally announced February 2017.
-
Outer Bounds on the Storage-Repair Bandwidth Tradeoff of Exact-Repair Regenerating Codes
Authors:
Birenjith Sasidharan,
N. Prakash,
M. Nikhil Krishnan,
Myna Vajha,
Kaushik Senthoor,
P. Vijay Kumar
Abstract:
In this paper, three outer bounds on the normalized storage-repair bandwidth (S-RB) tradeoff of regenerating codes having parameter set $\{(n,k,d),(α,β)\}$ under the exact-repair (ER) setting are presented. The first outer bound is applicable for every parameter set $(n,k,d)$ and in conjunction with a code construction known as {\em improved layered codes}, it characterizes the normalized ER tradeoff for the case $(n,k=3,d=n-1)$. It establishes a non-vanishing gap between the ER and functional-repair (FR) tradeoffs for every $(n,k,d)$. The second bound is an improvement upon an existing bound due to Mohajer et al. and is tighter than the first bound, in a regime away from the Minimum Storage Regenerating (MSR) point. The third bound is for the case of $k=d$, under the linear setting. This outer bound matches the achievable region of {\em layered codes}, thereby characterizing the normalized ER tradeoff of linear ER codes when $k=d=n-1$.
Submitted 14 June, 2016;
originally announced June 2016.
-
On MBR codes with replication
Authors:
M. Nikhil Krishnan,
P. Vijay Kumar
Abstract:
An early paper by Rashmi et al. presented the construction of an $(n,k,d=n-1)$ MBR regenerating code featuring the inherent double replication of all code symbols and repair-by-transfer (RBT), both of which are important in practice. We first show that no MBR code can contain even a single code symbol that is replicated more than twice. We then go on to present two new families of MBR codes which feature double replication of all systematic message symbols. The codes also possess a set of $d$ nodes whose contents include the message symbols and which can be repaired through help-by-transfer (HBT). As a corollary, we obtain systematic RBT codes for the case $d=(n-1)$ that possess inherent double replication of all code symbols and have a field size of $O(n)$, in comparison with the general $O(n^2)$ field size requirement of the earlier construction by Rashmi et al. For the cases $(k=d=n-2)$ or $(k+1=d=n-2)$, the field size can be reduced to $q=2$, and hence the codes can be binary. We also give a necessary and sufficient condition for the existence of MBR codes having double replication of all code symbols and also suggest techniques which will enable an arbitrary MBR code to be converted to one with double replication of all code symbols.
Submitted 29 January, 2016;
originally announced January 2016.
-
The Storage-Repair-Bandwidth Trade-off of Exact Repair Linear Regenerating Codes for the Case $d = k = n-1$
Authors:
N. Prakash,
M. Nikhil Krishnan
Abstract:
In this paper, we consider the setting of exact repair linear regenerating codes. Under this setting, we derive a new outer bound on the storage-repair-bandwidth trade-off for the case when $d = k = n -1$, where $(n, k, d)$ are parameters of the regenerating code, with their usual meaning. Taken together with the achievability result of Tian et al. [1], we show that the new outer bound derived here completely characterizes the trade-off for the case of exact repair linear regenerating codes, when $d = k = n -1$. The new outer bound is derived by analyzing the dual code of the linear regenerating code.
Submitted 26 January, 2015; v1 submitted 16 January, 2015;
originally announced January 2015.
-
Evaluation of Codes with Inherent Double Replication for Hadoop
Authors:
M. Nikhil Krishnan,
N. Prakash,
V. Lalitha,
Birenjith Sasidharan,
P. Vijay Kumar,
Srinivasan Narayanamurthy,
Ranjit Kumar,
Siddhartha Nandi
Abstract:
In this paper, we evaluate the efficacy, in a Hadoop setting, of two coding schemes, both possessing an inherent double replication of data. The two coding schemes belong to the class of regenerating and locally regenerating codes respectively, and these two classes are representative of recent advances made in designing codes for the efficient storage of data in a distributed setting. In comparison with triple replication, double replication permits a significant reduction in storage overhead, while delivering good MapReduce performance under moderate workloads. The two coding solutions under evaluation here add only moderately to the storage overhead of double replication, while simultaneously offering reliability levels similar to that of triple replication.
One might expect from the property of inherent data duplication that the performance of these codes in executing a MapReduce job would be comparable to that of double replication. However, a second feature of this class of code comes into play here, namely that under both coding schemes analyzed here, multiple blocks from the same coded stripe are required to be stored on the same node. This concentration of data belonging to a single stripe negatively impacts MapReduce execution times. However, much of this effect can be undone by simply adding a larger number of processors per node. Further improvements are possible if one tailors the Map task scheduler to the codes under consideration. We present both experimental and simulation results that validate these observations.
Submitted 26 June, 2014;
originally announced June 2014.