-
Robust Gray Codes Approaching the Optimal Rate
Authors:
Roni Con,
Dorsa Fathollahi,
Ryan Gabrys,
Mary Wootters,
Eitan Yaakobi
Abstract:
Robust Gray codes were introduced by (Lolck and Pagh, SODA 2024). Informally, a robust Gray code is a (binary) Gray code $\mathcal{G}$ so that, given a noisy version of the encoding $\mathcal{G}(j)$ of an integer $j$, one can recover $\hat{j}$ that is close to $j$ (with high probability over the noise). Such codes have found applications in differential privacy.
In this work, we present near-optimal constructions of robust Gray codes. In more detail, we construct a Gray code $\mathcal{G}$ of rate $1 - H_2(p) - \varepsilon$ that is efficiently encodable, and that is robust in the following sense. Suppose that $\mathcal{G}(j)$ is passed through the binary symmetric channel $\text{BSC}_p$ with cross-over probability $p$, to obtain $x$. We present an efficient decoding algorithm that, given $x$, returns an estimate $\hat{j}$ so that $|j - \hat{j}|$ is small with high probability.
Submitted 25 June, 2024;
originally announced June 2024.
-
Optimal Almost-Balanced Sequences
Authors:
Daniella Bar-Lev,
Adir Kobovich,
Orian Leitersdorf,
Eitan Yaakobi
Abstract:
This paper presents a novel approach to address the constrained coding challenge of generating almost-balanced sequences. While strictly balanced sequences have been well studied in the past, the problem of designing efficient algorithms with small redundancy, preferably constant or even a single bit, for almost-balanced sequences has remained unsolved. A sequence is $\varepsilon(n)$-almost balanced if its Hamming weight is between $0.5n\pm \varepsilon(n)$. It is known that for any algorithm with a constant number of bits, $\varepsilon(n)$ has to be on the order of $\Theta(\sqrt{n})$, with $O(n)$ average time complexity. However, prior solutions with a single redundancy bit required $\varepsilon(n)$ to be a linear shift from $n/2$. Employing an iterative method and arithmetic coding, our emphasis lies in constructing almost-balanced codes with a single redundancy bit. Notably, our method surpasses previous approaches by achieving the optimal balanced order of $\Theta(\sqrt{n})$. Additionally, we extend our method to the non-binary case considering $q$-ary almost polarity-balanced sequences for even $q$, and almost symbol-balanced for $q=4$. Our work marks the first asymptotically optimal solutions for almost-balanced sequences, for both binary and non-binary alphabets.
Submitted 14 May, 2024;
originally announced May 2024.
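The almost-balanced condition defined in the abstract above can be checked directly. The following is a minimal sketch of that definition only, not of the paper's encoder; the function name `is_almost_balanced` is hypothetical, and treating the bounds $0.5n \pm \varepsilon(n)$ as inclusive is an assumption.

```python
import math


def is_almost_balanced(bits, eps):
    """Check whether a binary sequence is eps-almost balanced.

    `bits` is a sequence of 0/1 integers and `eps` is the allowed
    deviation of the Hamming weight from n/2. The inclusive bounds
    are an assumption; the abstract only says "between".
    """
    n = len(bits)
    weight = sum(bits)  # Hamming weight of a 0/1 sequence
    return abs(weight - n / 2) <= eps


# Usage: test a length-16 word against a sqrt(n)-sized deviation.
word = [1] * 11 + [0] * 5          # weight 11, n/2 = 8, deviation 3
print(is_almost_balanced(word, math.sqrt(len(word))))  # deviation 3 <= 4
```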
-
Representing Information on DNA using Patterns Induced by Enzymatic Labeling
Authors:
Daniella Bar-Lev,
Tuvi Etzion,
Eitan Yaakobi,
Zohar Yakhini
Abstract:
Enzymatic DNA labeling is a powerful tool with applications in biochemistry, molecular biology, biotechnology, medical science, and genomic research. This paper contributes to the evolving field of DNA-based data storage by presenting a formal framework for modeling DNA labeling in strings, specifically tailored for data storage purposes. Our approach involves a known DNA molecule as a template for labeling, employing patterns induced by a set of designed labels to represent information. One hypothetical implementation can use CRISPR-Cas9 and gRNA reagents for labeling. Various aspects of the general labeling channel, including fixed-length labels, are explored, and upper bounds on the maximal size of the corresponding codes are given. The study includes the development of an efficient encoder-decoder pair that is proven optimal in terms of maximum code size under specific conditions.
Submitted 14 May, 2024;
originally announced May 2024.
-
Noise-Tolerant Codebooks for Semi-Quantitative Group Testing: Application to Spatial Genomics
Authors:
Kok Hao Chen,
Duc Tu Dao,
Han Mao Kiah,
Van Long Phuoc Pham,
Eitan Yaakobi
Abstract:
Motivated by applications in spatial genomics, we revisit group testing (Dorfman, 1943) and propose the class of $\lambda$-$\mathsf{ADD}$-codes, studying such codes with certain distance $d$ and codelength $n$. When $d$ is constant, we provide explicit code constructions with rates close to $1/2$. When $d$ is proportional to $n$, we provide a GV-type lower bound whose rates are efficiently computable. Upper bounds for such codes are also studied.
Submitted 10 May, 2024;
originally announced May 2024.
-
Private Repair of a Single Erasure in Reed-Solomon Codes
Authors:
Stanislav Kruglik,
Han Mao Kiah,
Son Hoang Dau,
Eitan Yaakobi
Abstract:
We investigate the problem of privately recovering a single erasure for Reed-Solomon codes with low communication bandwidths. For an $[n,k]_{q^\ell}$ code with $n-k\geq q^{m}+t-1$, we construct a repair scheme that allows a client to recover an arbitrary codeword symbol without leaking its index to any set of $t$ colluding helper nodes at a repair bandwidth of $(n-1)(\ell-m)$ sub-symbols in $\mathbb{F}_q$. When $t=1$, this reduces to the bandwidth of existing repair schemes based on subspace polynomials. We prove the optimality of the proposed scheme when $n=q^\ell$ under a reasonable assumption about the schemes being used. Our private repair scheme can also be transformed into a private retrieval scheme for data encoded by Reed-Solomon codes.
Submitted 10 May, 2024;
originally announced May 2024.
-
Coding for Synthesis Defects
Authors:
Ziyang Lu,
Han Mao Kiah,
Yiwei Zhang,
Robert N. Grass,
Eitan Yaakobi
Abstract:
Motivated by DNA-based data storage systems, we investigate the errors that occur when synthesizing DNA strands in parallel, where each strand is appended one nucleotide at a time by the machine according to a template supersequence. If the machine fails at some cycle, then the strands meant to be appended at this cycle will not be appended, and we refer to this as a synthesis defect. In this paper, we present two families of codes correcting synthesis defects, which are t-known-synthesis-defect correcting codes and t-synthesis-defect correcting codes. For the first one, it is assumed that the defective cycles are known, and each codeword is a quaternary sequence. We provide constructions for this family of codes for t = 1, 2, with redundancy log 4 and log n+18 log 3, respectively. For the second one, the codeword is a set of M ordered sequences, and we give constructions for t = 1, 2 to show a strategy for constructing this family of codes. Finally, we derive a lower bound on the redundancy for single-known-synthesis-defect correcting codes, which assures that our construction is almost optimal.
Submitted 3 May, 2024;
originally announced May 2024.
-
Coding for Composite DNA to Correct Substitutions, Strand Losses, and Deletions
Authors:
Frederik Walter,
Omer Sabary,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
Composite DNA is a recent method to increase the base alphabet size in DNA-based data storage. This paper models synthesizing and sequencing of composite DNA and introduces coding techniques to correct substitutions, losses of entire strands, and symbol deletion errors. Non-asymptotic upper bounds on the size of codes with $t$ occurrences of these error types are derived. Explicit constructions are presented which can achieve the bounds.
Submitted 19 April, 2024;
originally announced April 2024.
-
One Code Fits All: Strong stuck-at codes for versatile memory encoding
Authors:
Roni Con,
Ryan Gabrys,
Eitan Yaakobi
Abstract:
In this work we consider a generalization of the well-studied problem of coding for ``stuck-at'' errors, which we refer to as ``strong stuck-at'' codes. In the traditional framework of stuck-at codes, the task involves encoding a message into a one-dimensional binary vector. However, a certain number of the bits in this vector are ``frozen'', meaning they are fixed at a predetermined value and cannot be altered by the encoder. The decoder, aware of the proportion of frozen bits but not their specific positions, is responsible for deciphering the intended message. We consider a more challenging version of this problem where the decoder also does not know the fraction of frozen bits. We construct explicit and efficient encoding and decoding algorithms that get arbitrarily close to capacity in this scenario. Furthermore, to the best of our knowledge, our construction is the first fully explicit construction of stuck-at codes that approach capacity.
Submitted 27 March, 2024;
originally announced March 2024.
-
Permutation Recovery Problem against Deletion Errors for DNA Data Storage
Authors:
Shubhransh Singhvi,
Charchit Gupta,
Avital Boruchovsky,
Yuval Goldberg,
Han Mao Kiah,
Eitan Yaakobi
Abstract:
Owing to its immense storage density and durability, DNA has emerged as a promising storage medium. However, due to technological constraints, data can only be written onto many short DNA molecules called data blocks that are stored in an unordered way. To handle the unordered nature of DNA data storage systems, a unique address is typically prepended to each data block to form a DNA strand. However, DNA storage systems are prone to errors and generate multiple noisy copies of each strand called DNA reads. Thus, we study the permutation recovery problem against deletion errors for DNA data storage.
The permutation recovery problem for DNA data storage requires one to reconstruct the addresses or in other words to uniquely identify the noisy reads. By successfully reconstructing the addresses, one can essentially determine the correct order of the data blocks, effectively solving the clustering problem.
We first show that we can almost surely identify all the noisy reads under certain mild assumptions. We then propose a permutation recovery procedure and analyze its complexity.
Submitted 23 March, 2024;
originally announced March 2024.
-
An Optimal Sequence Reconstruction Algorithm for Reed-Solomon Codes
Authors:
Shubhransh Singhvi,
Roni Con,
Han Mao Kiah,
Eitan Yaakobi
Abstract:
The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a scenario where the sender transmits a codeword from some codebook, and the receiver obtains $N$ noisy outputs of the codeword. We study the problem of efficient reconstruction using $N$ outputs that are each corrupted by at most $t$ substitutions. Specifically, for the ubiquitous Reed-Solomon codes, we adapt the Koetter-Vardy soft-decoding algorithm, presenting a reconstruction algorithm capable of correcting beyond the Johnson radius. Furthermore, the algorithm uses $\mathcal{O}(nN)$ field operations, where $n$ is the codeword length.
Submitted 12 March, 2024;
originally announced March 2024.
-
Tail-Erasure-Correcting Codes
Authors:
Boaz Moav,
Ryan Gabrys,
Eitan Yaakobi
Abstract:
The increasing demand for data storage has prompted the exploration of new techniques, with molecular data storage being a promising alternative. In this work, we develop coding schemes for a new storage paradigm that can be represented as a collection of two-dimensional arrays. Motivated by error patterns observed in recent prototype architectures, our study focuses on correcting erasures in the last few symbols of each row, and also correcting arbitrary deletions across rows. We present code constructions and explicit encoders and decoders that are shown to be nearly optimal in many scenarios. We show that the new coding schemes are capable of effectively mitigating these errors, making these emerging storage platforms potentially promising solutions.
Submitted 6 February, 2024;
originally announced February 2024.
-
Covering All Bases: The Next Inning in DNA Sequencing Efficiency
Authors:
Hadas Abraham,
Ryan Gabrys,
Eitan Yaakobi
Abstract:
DNA emerges as a promising medium for the exponential growth of digital data due to its density and durability. This study extends recent research by addressing the \emph{coverage depth problem} in practical scenarios, exploring optimal error-correcting code pairings with DNA storage systems to minimize coverage depth. Conducted within random access settings, the study provides theoretical analyses and experimental simulations to examine the expectation and probability distribution of the number of samples needed for file recovery. Structured into sections covering definitions, analyses, lower bounds, and comparative evaluations of coding schemes, the paper unveils insights into effective coding schemes for optimizing DNA storage systems.
Submitted 31 January, 2024;
originally announced January 2024.
-
Interactive Byzantine-Resilient Gradient Coding for General Data Assignments
Authors:
Shreyas Jain,
Luis Maßny,
Christoph Hofmeister,
Eitan Yaakobi,
Rawad Bitar
Abstract:
We tackle the problem of Byzantine errors in distributed gradient descent within the Byzantine-resilient gradient coding framework. Our proposed solution can recover the exact full gradient in the presence of $s$ malicious workers with a data replication factor of only $s+1$. It generalizes previous solutions to any data assignment scheme that has a regular replication over all data samples. The scheme detects malicious workers through additional interactive communication and a small number of local computations at the main node, leveraging group-wise comparisons between workers with a provably optimal grouping strategy. The scheme requires at most $s$ interactive rounds that incur a total communication cost logarithmic in the number of data samples.
Submitted 30 January, 2024;
originally announced January 2024.
-
Correcting a Single Deletion in Reads from a Nanopore Sequencer
Authors:
Anisha Banerjee,
Yonatan Yehezkeally,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
Owing to their several merits over other DNA sequencing technologies, nanopore sequencers hold an immense potential to revolutionize the efficiency of DNA storage systems. However, their higher error rates necessitate further research to devise practical and efficient coding schemes that would allow accurate retrieval of the data stored. Our work takes a step in this direction by adopting a simplified model of the nanopore sequencer inspired by Mao \emph{et al.}, which incorporates some of its physical aspects. This channel model can be viewed as a sliding window of length $\ell$ that passes over the incoming input sequence and produces the Hamming weight of the enclosed $\ell$ bits, while shifting by one position at each time step. The resulting $(\ell+1)$-ary vector, referred to as the $\ell$-\emph{read vector}, is susceptible to deletion errors due to imperfections inherent in the sequencing process. We establish that at least $\log n - \ell$ bits of redundancy are needed to correct a single deletion. An error-correcting code that is optimal up to an additive constant is also proposed. Furthermore, we find that for $\ell \geq 2$, reconstruction from two distinct noisy $\ell$-read vectors can be accomplished without any redundancy, and provide a suitable reconstruction algorithm to this effect.
Submitted 7 May, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
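The sliding-window channel described in the abstract above can be illustrated with a short sketch. This is a minimal illustration, not the authors' code: the names `read_vector` and `delete_entry` are hypothetical, and emitting only fully overlapping windows is an assumption, since the abstract does not say how the window behaves at the two ends of the input.

```python
def read_vector(bits, ell):
    """Hamming weights of all length-`ell` windows of `bits`, shifted by one.

    Assumption: only windows that lie fully inside the input are
    emitted; boundary handling is not specified in the abstract.
    Each entry lies in {0, ..., ell}, so the output is (ell+1)-ary.
    """
    if ell < 1 or ell > len(bits):
        raise ValueError("window length must satisfy 1 <= ell <= len(bits)")
    weight = sum(bits[:ell])          # weight of the first window
    out = [weight]
    for i in range(ell, len(bits)):
        # Slide by one: add the entering bit, drop the leaving bit.
        weight += bits[i] - bits[i - ell]
        out.append(weight)
    return out


def delete_entry(vector, index):
    """Model a single deletion in the read vector at position `index`."""
    return vector[:index] + vector[index + 1:]


x = [1, 0, 1, 1, 0]
r = read_vector(x, 2)
print(r)                   # [1, 1, 2, 1]
print(delete_entry(r, 2))  # [1, 1, 1]
```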
-
Achieving DNA Labeling Capacity with Minimum Labels through Extremal de Bruijn Subgraphs
Authors:
Christoph Hofmeister,
Anina Gruica,
Dganit Hanania,
Rawad Bitar,
Eitan Yaakobi
Abstract:
DNA labeling is a tool in molecular biology and biotechnology to visualize, detect, and study DNA at the molecular level. In this process, a DNA molecule is labeled by a set of specific patterns, referred to as labels, and is then imaged. The resulting image is modeled as an $(\ell+1)$-ary sequence, where $\ell$ is the number of labels, in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The labeling capacity refers to the maximum information rate that can be achieved by the labeling process for any given set of labels. The main goal of this paper is to study the minimum number of labels of the same length required to achieve the maximum labeling capacity of 2 for DNA sequences or $\log_2q$ for an arbitrary alphabet of size $q$. The solution to this problem requires the study of path unique subgraphs of the de Bruijn graph with the largest number of edges and we provide upper and lower bounds on this value.
Submitted 28 January, 2024;
originally announced January 2024.
-
Reducing Coverage Depth in DNA Storage: A Combinatorial Perspective on Random Access Efficiency
Authors:
Anina Gruica,
Daniella Bar-Lev,
Alberto Ravagnani,
Eitan Yaakobi
Abstract:
We investigate the fundamental limits of the recently proposed random access coverage depth problem for DNA data storage. Under this paradigm, it is assumed that the user information consists of $k$ information strands, which are encoded into $n$ strands via some generator matrix $G$. In the sequencing process, the strands are read uniformly at random, since each strand is available in a large number of copies. In this context, the random access coverage depth problem refers to the expected number of reads (i.e., sequenced strands) until it is possible to decode a specific information strand, which is requested by the user. The goal is to minimize the maximum expectation over all possible requested information strands, and this value is denoted by $T_{\max}(G)$. This paper introduces new techniques to investigate the random access coverage depth problem, which capture its combinatorial nature. We establish two general formulas to find $T_{\max}(G)$ for arbitrary matrices. We introduce the concept of recovery balanced codes and combine all these results and notions to compute $T_{\max}(G)$ for MDS, simplex, and Hamming codes. We also study the performance of modified systematic MDS matrices and our results show that the best results for $T_{\max}(G)$ are achieved with a specific mix of encoded strands and replication of the information strands.
Submitted 28 January, 2024;
originally announced January 2024.
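The quantity $T_{\max}(G)$ from the abstract above can be estimated by simulation. The sketch below is a Monte Carlo illustration under loud assumptions, not the paper's formulas: it works over GF(2) only (the paper allows general generator matrices), represents each column of $G$ as a $k$-bit integer, and assumes every information strand is recoverable from the full set of columns (otherwise the loop would not terminate). All function names are hypothetical.

```python
import random


def add_to_basis(basis, v):
    """Insert GF(2) vector `v` (an int bitmask) into an XOR basis.

    `basis` maps a leading-bit position to a vector with that leading bit.
    """
    while v:
        high = v.bit_length() - 1
        if high not in basis:
            basis[high] = v
            return
        v ^= basis[high]


def in_span(basis, v):
    """Return True if `v` is a GF(2) combination of the basis vectors."""
    while v:
        high = v.bit_length() - 1
        if high not in basis:
            return False
        v ^= basis[high]
    return True


def reads_until_recovered(columns, target, rng):
    """Draw columns uniformly with replacement until unit vector e_target
    lies in the span of the drawn columns; return the number of reads."""
    basis, reads = {}, 0
    while not in_span(basis, 1 << target):
        add_to_basis(basis, rng.choice(columns))
        reads += 1
    return reads


def estimate_t_max(columns, k, trials=2000, seed=0):
    """Monte Carlo estimate of the maximum, over requested strands,
    of the expected number of reads."""
    rng = random.Random(seed)
    means = []
    for target in range(k):
        total = sum(reads_until_recovered(columns, target, rng)
                    for _ in range(trials))
        means.append(total / trials)
    return max(means)


# Usage: k = 2 with no coding (identity) versus one added parity strand.
print(estimate_t_max([0b01, 0b10], k=2))
print(estimate_t_max([0b01, 0b10, 0b11], k=2))
```

With the identity matrix each request waits for one specific strand, so the estimate should be close to $k$; the second call shows how an extra parity strand changes the estimate in this toy setting.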
-
Error-Correcting Codes for Combinatorial Composite DNA
Authors:
Omer Sabary,
Inbal Preuss,
Ryan Gabrys,
Zohar Yakhini,
Leon Anavy,
Eitan Yaakobi
Abstract:
Data storage in DNA is developing as a possible solution for archival digital data. Recently, to further increase the potential capacity of DNA-based data storage systems, the combinatorial composite DNA synthesis method was suggested. This approach extends the DNA alphabet by harnessing short DNA fragment reagents, known as shortmers. The shortmers are building blocks of the alphabet symbols, consisting of a fixed number of shortmers. Thus, when information is read, it is possible that one of the shortmers that forms part of the composition of a symbol is missing and therefore the symbol cannot be determined. In this paper, we model this type of error as a type of asymmetric error and propose code constructions that can correct such errors in this setup. We also provide a lower bound on the redundancy of such error-correcting codes and give an explicit encoder and decoder pair for our construction. Our suggested error model is also supported by an analysis of data from actual experiments that produced DNA according to the combinatorial scheme. Lastly, we also provide a statistical evaluation of the probability of observing such error events, as a function of read depth.
Submitted 26 May, 2024; v1 submitted 28 January, 2024;
originally announced January 2024.
-
The Capacity of the Weighted Read Channel
Authors:
Omer Yerushalmi,
Tuvi Etzion,
Eitan Yaakobi
Abstract:
One of the primary sequencing methods gaining prominence in DNA storage is nanopore sequencing, attributed to various factors. In this work, we consider a simplified model of the sequencer, characterized as a channel. This channel takes a sequence and processes it using a sliding window of length $\ell$, shifting the window by $\delta$ characters each time. The output of this channel, which we refer to as the read vector, is a vector containing the sums of the entries in each of the windows. The capacity of the channel is defined as the maximal information rate of the channel. Previous works have already revealed capacity values for certain parameters $\ell$ and $\delta$. In this work, we show that when $\delta < \ell < 2\delta$, the capacity value is given by $\frac{1}{\delta}\log_2 \frac{1}{2}\left(\ell+1+ \sqrt{(\ell+1)^2 - 4(\ell - \delta)(\ell-\delta+1)}\right)$. Additionally, we construct an upper bound when $2\delta < \ell$. Finally, we extend the model to the two-dimensional case and present several results on its capacity.
Submitted 27 January, 2024;
originally announced January 2024.
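The closed-form capacity stated in the abstract above is easy to evaluate numerically. The sketch below simply transcribes that expression, reading the $\frac{1}{2}$ as a factor inside the logarithm (the other reading would make the value negative); the function name `weighted_read_capacity` is hypothetical, and integer parameters are an assumption.

```python
import math


def weighted_read_capacity(ell, delta):
    """Evaluate (1/delta) * log2( (ell + 1 + sqrt(D)) / 2 ), where
    D = (ell + 1)^2 - 4 (ell - delta)(ell - delta + 1).

    The expression is only claimed for delta < ell < 2 * delta,
    so other parameter choices are rejected.
    """
    if not (delta < ell < 2 * delta):
        raise ValueError("formula is stated only for delta < ell < 2*delta")
    disc = (ell + 1) ** 2 - 4 * (ell - delta) * (ell - delta + 1)
    return math.log2((ell + 1 + math.sqrt(disc)) / 2) / delta


# Usage: tabulate the formula for a few admissible (ell, delta) pairs.
for ell, delta in [(3, 2), (4, 3), (5, 3)]:
    print(ell, delta, round(weighted_read_capacity(ell, delta), 4))
```

For $\ell = 3$, $\delta = 2$ the argument of the logarithm simplifies to $2 + \sqrt{2}$, which gives a quick sanity check on the transcription.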
-
Byzantine-Resilient Gradient Coding through Local Gradient Computations
Authors:
Christoph Hofmeister,
Luis Maßny,
Eitan Yaakobi,
Rawad Bitar
Abstract:
We consider gradient coding in the presence of an adversary controlling so-called malicious workers trying to corrupt the computations. Previous works propose the use of MDS codes to treat the responses from malicious workers as errors and correct them using the error-correction properties of the code. This comes at the expense of increasing the replication, i.e., the number of workers each partial gradient is computed by. In this work, we propose a way to reduce the replication to $s+1$ instead of $2s+1$ in the presence of $s$ malicious workers. Our method detects erroneous inputs from the malicious workers, transforming them into erasures. This comes at the expense of $s$ additional local computations at the main node and additional rounds of light communication between the main node and the workers. We define a general framework and give fundamental limits for fractional repetition data allocations. Our scheme is optimal in terms of replication and local computation and incurs a communication cost that is asymptotically, in the size of the dataset, a multiplicative factor away from the derived bound. We furthermore show how additional redundancy can be exploited to reduce the number of local computations and communication cost, or, alternatively, tolerate straggling workers.
Submitted 5 January, 2024; v1 submitted 4 January, 2024;
originally announced January 2024.
-
M-DAB: An Input-Distribution Optimization Algorithm for Composite DNA Storage by the Multinomial Channel
Authors:
Adir Kobovich,
Eitan Yaakobi,
Nir Weinberger
Abstract:
Recent experiments have shown that the capacity of DNA storage systems may be significantly increased by synthesizing composite DNA letters. In this work, we model a DNA storage channel with composite inputs as a \textit{multinomial channel}, and propose an optimization algorithm for its capacity-achieving input distribution, for an arbitrary number of output reads. The algorithm is termed multidimensional dynamic assignment Blahut-Arimoto (M-DAB), and is a generalized version of the DAB algorithm, which was proposed by Wesel et al. for the binomial channel. We also empirically observe a scaling law behavior of the capacity as a function of the support size of the capacity-achieving input distribution.
Submitted 29 September, 2023;
originally announced September 2023.
-
Storage codes and recoverable systems on lines and grids
Authors:
Alexander Barg,
Ohad Elishco,
Ryan Gabrys,
Geyang Wang,
Eitan Yaakobi
Abstract:
A storage code is an assignment of symbols to the vertices of a connected graph $G(V,E)$ with the property that the value of each vertex is a function of the values of its neighbors, or more generally, of a certain neighborhood of the vertex in $G$. In this work we introduce a new construction method of storage codes, enabling one to construct new codes from known ones via an interleaving procedure driven by resolvable designs. We also study storage codes on $\mathbb Z$ and ${\mathbb Z}^2$ (lines and grids), finding closed-form expressions for the capacity of several one and two-dimensional systems depending on their recovery set, using connections between storage codes, graphs, anticodes, and difference-avoiding sets.
Submitted 28 August, 2023;
originally announced August 2023.
-
Error-Correcting Codes for Nanopore Sequencing
Authors:
Anisha Banerjee,
Yonatan Yehezkeally,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
Nanopore sequencing, superior to other sequencing technologies for DNA storage in multiple aspects, has recently attracted considerable attention. Its high error rates, however, demand thorough research on practical and efficient coding schemes to enable accurate recovery of stored data. To this end, we consider a simplified model of a nanopore sequencer inspired by Mao \emph{et al.}, incorporating intersymbol interference and measurement noise. Essentially, our channel model passes a sliding window of length \(\ell\) over a \(q\)-ary input sequence that outputs the \textit{composition} of the enclosed \(\ell\) bits and shifts by \(\delta\) positions with each time step. In this context, the composition of a \(q\)-ary vector \(\mathbf{x}\) specifies the number of occurrences in \(\mathbf{x}\) of each symbol in \(\lbrace 0,1,\ldots, q-1\rbrace\). The resulting compositions vector, termed the \emph{read vector}, may also be corrupted by \(t\) substitution errors. By employing graph-theoretic techniques, we deduce that for \(\delta=1\), at least \(\log \log n\) symbols of redundancy are required to correct a single (\(t=1\)) substitution. Finally, for \(\ell \geq 3\), we exploit some inherent characteristics of read vectors to arrive at an error-correcting code that is of optimal redundancy up to a (small) additive constant for this setting. This construction is also found to be optimal for the case of reconstruction from two noisy read vectors.
Submitted 8 December, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
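The sliding-window channel described in the abstract above can be illustrated with a short sketch. The function names `composition` and `read_vector` are hypothetical, and the sketch assumes that only windows lying fully inside the input are read; the paper's treatment of the sequence boundaries may differ. Noise (the \(t\) substitutions) is not modeled.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def composition(window: Sequence[int], q: int) -> Tuple[int, ...]:
    """Number of occurrences of each symbol 0..q-1 in the window."""
    counts = Counter(window)
    return tuple(counts[s] for s in range(q))


def read_vector(x: Sequence[int], ell: int, delta: int, q: int) -> List[Tuple[int, ...]]:
    """Compositions of length-ell windows of x, shifted by delta each step.

    Assumption: only full windows are read (no partial windows at the ends).
    """
    return [composition(x[i:i + ell], q) for i in range(0, len(x) - ell + 1, delta)]


x = [0, 1, 1, 0, 2]                        # a ternary (q = 3) input sequence
print(read_vector(x, ell=3, delta=1, q=3)) # [(1, 2, 0), (1, 2, 0), (1, 1, 1)]
print(read_vector(x, ell=3, delta=2, q=3)) # [(1, 2, 0), (1, 1, 1)]
```

Note that the first two windows, `[0, 1, 1]` and `[1, 1, 0]`, produce the same composition, which hints at why redundancy is needed to locate and correct an error in the read vector.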
-
On the Capacity of DNA Labeling
Authors:
Dganit Hanania,
Daniella Bar-Lev,
Yevgeni Nogin,
Yoav Shechtman,
Eitan Yaakobi
Abstract:
DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is labeled by k specific patterns and is then imaged. The resulting image is modeled as a (k + 1)-ary sequence in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for any single label and several results are provided for multiple labels as well. Moreover, we provide the minimum number of labels of length one or two that are needed in order to attain a labeling capacity of 2.
Submitted 22 January, 2024; v1 submitted 13 May, 2023;
originally announced May 2023.
-
Coding for IBLTs with Listing Guarantees
Authors:
Daniella Bar-Lev,
Avi Mizrahi,
Tuvi Etzion,
Ori Rottenstreich,
Eitan Yaakobi
Abstract:
The Invertible Bloom Lookup Table (IBLT) is a probabilistic data structure for set representation, with applications in network and traffic monitoring. It is known for its ability to list its elements, an operation that succeeds with high probability for a sufficiently large table. However, listing can fail even for relatively small sets. This paper extends recent work on the worst-case analysis of IBLT, which guarantees successful listing for all sets of a certain size, by introducing more general IBLT schemes. These schemes allow for greater freedom in the implementation of the insert, delete, and listing operations and demonstrate that the IBLT memory can be reduced while still maintaining successful listing guarantees. The paper also explores the time-memory trade-off of these schemes, some of which are based on linear codes and \(B_h\)-sequences over finite fields.
Submitted 10 May, 2023;
originally announced May 2023.
-
Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems
Authors:
Daniella Bar-Lev,
Omer Sabary,
Ryan Gabrys,
Eitan Yaakobi
Abstract:
Although the expenses associated with DNA sequencing have been rapidly decreasing, the current cost of sequencing information stands at roughly $120/GB, which is dramatically more expensive than reading from existing archival storage solutions today. In this work, we aim to reduce not only the cost but also the latency of DNA storage by initiating the study of the DNA coverage depth problem, which aims to reduce the required number of reads to retrieve information from the storage system. Under this framework, our main goal is to understand the effect of error-correcting codes and retrieval algorithms on the required sequencing coverage depth. We establish that the expected number of reads that are required for information retrieval is minimized when the channel follows a uniform distribution. We also derive upper and lower bounds on the probability distribution of this number of required reads and provide a comprehensive upper and lower bound on its expected value. We further prove that for a noiseless channel and uniform distribution, MDS codes are optimal in terms of minimizing the expected number of reads. Additionally, we study the DNA coverage depth problem under the random-access setup, in which the user aims to retrieve just a specific information unit from the entire DNA storage system. We prove that the expected retrieval time is at least k for [n,k] MDS codes as well as for other families of codes. Furthermore, we present explicit code constructions that achieve expected retrieval times below k and evaluate their performance through analytical methods and simulations. Lastly, we provide lower bounds on the maximum expected retrieval time. Our findings offer valuable insights for reducing the cost and latency of DNA storage.
Submitted 29 November, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Data-Driven Bee Identification for DNA Strands
Authors:
Shubhransh Singhvi,
Avital Boruchovsky,
Han Mao Kiah,
Eitan Yaakobi
Abstract:
We study a data-driven approach to the bee identification problem for DNA strands. The bee-identification problem, introduced by Tandon et al. (2019), requires one to identify $M$ bees, each tagged by a unique barcode, via a set of $M$ noisy measurements. Later, Chrisnata et al. (2022) extended the model to the case where one observes $N$ noisy measurements of each bee, and applied the model to address the unordered nature of DNA storage systems. In such systems, a unique address is typically prepended to each DNA data block to form a DNA strand, but the address may be corrupted. While clustering is usually used to identify the address of a DNA strand, this requires $\mathcal{M}^2$ data comparisons (where $\mathcal{M}$ is the number of reads). In contrast, the approach of Chrisnata et al. (2022) avoids data comparisons completely. In this work, we study an intermediate, data-driven approach to this identification task. For the binary erasure channel, we first show that we can almost surely correctly identify all DNA strands under certain mild assumptions. Then we propose a data-driven pruning procedure and demonstrate that on average the procedure uses only a fraction of $\mathcal{M}^2$ data comparisons. Specifically, for $\mathcal{M}= 2^n$ and erasure probability $p$, the expected number of data comparisons performed by the procedure is $κ\mathcal{M}^2$, where $\left(\frac{1+2p-p^2}{2}\right)^n \leq κ\leq \left(\frac{1+p}{2}\right)^n $.
Submitted 8 May, 2023;
originally announced May 2023.
-
DNA-Correcting Codes: End-to-end Correction in DNA Storage Systems
Authors:
Avital Boruchovsky,
Daniella Bar-Lev,
Eitan Yaakobi
Abstract:
This paper introduces a new solution to DNA storage that integrates all three steps of retrieval, namely clustering, reconstruction, and error correction. DNA-correcting codes are presented as a unique solution to the problem of ensuring that the output of the storage system is unique for any valid set of input strands. To this end, we introduce a novel distance metric to capture the unique behavior of the DNA storage system and provide necessary and sufficient conditions for DNA-correcting codes. The paper also includes several bounds and constructions of DNA-correcting codes.
Submitted 30 June, 2024; v1 submitted 20 April, 2023;
originally announced April 2023.
-
Universal Framework for Parametric Constrained Coding
Authors:
Daniella Bar-Lev,
Adir Kobovich,
Orian Leitersdorf,
Eitan Yaakobi
Abstract:
Constrained coding is a fundamental field in coding theory that tackles efficient communication through constrained channels. While channels with fixed constraints have a general optimal solution, there is increasing demand for parametric constraints that are dependent on the message length. Several works have tackled such parametric constraints through iterative algorithms, yet they require complex constructions specific to each constraint to guarantee convergence through monotonic progression. In this paper, we propose a universal framework for tackling any parametric constrained-channel problem through a novel simple iterative algorithm. By reducing an execution of this iterative algorithm to an acyclic graph traversal, we prove a surprising result that guarantees convergence with efficient average time complexity even without requiring any monotonic progression.
We demonstrate the effectiveness of this universal framework by applying it to a variety of both local and global channel constraints. We begin by exploring the local constraints involving illegal substrings of variable length, where the universal construction essentially iteratively replaces forbidden windows. We apply this local algorithm to the minimal periodicity, minimal Hamming weight, local almost-balanced Hamming weight and the previously-unsolved minimal palindrome constraints. We then continue by exploring global constraints, and demonstrate the effectiveness of the proposed construction on the repeat-free encoding, reverse-complement encoding, and the open problem of global almost-balanced encoding. For reverse-complement, we also tackle a previously-unsolved version of the constraint that addresses overlapping windows. Overall, the proposed framework generates state-of-the-art constructions with significant ease while also enabling the simultaneous integration of multiple constraints for the first time.
Submitted 3 April, 2023;
originally announced April 2023.
-
Trading Communication for Computation in Byzantine-Resilient Gradient Coding
Authors:
Christoph Hofmeister,
Luis Maßny,
Eitan Yaakobi,
Rawad Bitar
Abstract:
We consider gradient coding in the presence of an adversary, controlling so-called malicious workers trying to corrupt the computations. Previous works propose the use of MDS codes to treat the inputs of the malicious workers as errors and correct them using the error-correction properties of the code. This comes at the expense of increasing the replication, i.e., the number of workers each partial gradient is computed by. In this work, we reduce replication by proposing a method that detects the erroneous inputs from the malicious workers, hence transforming them into erasures. For $s$ malicious workers, our solution can reduce the replication to $s+1$ instead of $2s+1$ for each partial gradient at the expense of only $s$ additional computations at the main node and additional rounds of light communication between the main node and the workers. We give fundamental limits of the general framework for fractional repetition data allocation. Our scheme is optimal in terms of replication and local computation but incurs a communication cost that is asymptotically, in the size of the dataset, a multiplicative factor away from the derived bound.
Submitted 5 June, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
Invertible Bloom Lookup Tables with Listing Guarantees
Authors:
Avi Mizrahi,
Daniella Bar-Lev,
Eitan Yaakobi,
Ori Rottenstreich
Abstract:
The Invertible Bloom Lookup Table (IBLT) is a probabilistic concise data structure for set representation that supports a listing operation as the recovery of the elements in the represented set. Its applications can be found in network synchronization and traffic monitoring as well as in error-correction codes. IBLT can list its elements with probability affected by the size of the allocated memory and the size of the represented set, such that it can fail with small probability even for relatively small sets. While previous works only studied the failure probability of IBLT, this work initiates the worst case analysis of IBLT that guarantees successful listing for all sets of a certain size. The worst case study is important since the failure of IBLT imposes high overhead. We describe a novel approach that guarantees successful listing when the set satisfies a tunable upper bound on its size. To allow that, we develop multiple constructions that are based on various coding techniques such as stopping sets and the stopping redundancy of error-correcting codes, Steiner systems, and covering arrays as well as new methodologies we develop. We analyze the sizes of IBLTs with listing guarantees obtained by the various methods as well as their mapping memory consumption. Lastly, we study lower bounds on the achievable sizes of IBLT with listing guarantees and verify the results in the paper by simulations.
Submitted 28 December, 2022;
originally announced December 2022.
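To make the listing operation and its possible failure concrete, here is a minimal toy IBLT. It is a sketch under stated assumptions, not one of the paper's constructions: the class name `ToyIBLT`, the two arithmetic "hash" functions, and the table sizes are all hypothetical, and each cell stores only a counter and the XOR of its keys. Listing repeatedly "peels" cells that hold exactly one element, assuming keys are non-negative integers and only inserted elements are ever deleted.

```python
from typing import List, Optional, Set, Tuple


class ToyIBLT:
    """Toy IBLT with two sub-tables; cells hold (count, XOR of keys)."""

    def __init__(self, m0: int = 5, m1: int = 5) -> None:
        self.m0, self.m1 = m0, m1
        self.count: List[int] = [0] * (m0 + m1)
        self.key_xor: List[int] = [0] * (m0 + m1)

    def _cells(self, x: int) -> Tuple[int, int]:
        # Hypothetical deterministic mapping: one cell in each sub-table.
        return (x % self.m0, self.m0 + (x // self.m0) % self.m1)

    def _update(self, x: int, sign: int) -> None:
        for c in self._cells(x):
            self.count[c] += sign
            self.key_xor[c] ^= x

    def insert(self, x: int) -> None:
        self._update(x, +1)

    def delete(self, x: int) -> None:
        self._update(x, -1)

    def list_elements(self) -> Optional[Set[int]]:
        """Peel cells with count 1; return None if peeling gets stuck."""
        count, key_xor = list(self.count), list(self.key_xor)
        found: Set[int] = set()
        progress = True
        while progress:
            progress = False
            for c in range(len(count)):
                if count[c] == 1:
                    x = key_xor[c]          # the single key left in this cell
                    found.add(x)
                    for d in self._cells(x):
                        count[d] -= 1
                        key_xor[d] ^= x
                    progress = True
        return found if all(v == 0 for v in count) else None


good = ToyIBLT()
for x in (3, 10, 25):
    good.insert(x)
print(good.list_elements())   # the set {3, 10, 25}

bad = ToyIBLT()
for x in (1, 26):             # 1 and 26 share both of their cells
    bad.insert(x)
print(bad.list_elements())    # None: listing fails for a set of size 2
```

The second example shows a two-element set on which this particular mapping fails, which is the kind of event that a worst-case listing guarantee is meant to rule out for all sets up to a given size.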
-
Generalized Unique Reconstruction from Substrings
Authors:
Yonatan Yehezkeally,
Daniella Bar-Lev,
Sagi Marcovich,
Eitan Yaakobi
Abstract:
This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which all substrings of pre-defined lengths are read or substrings are read with no overlap for the single string case, this work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. First, an upper bound is provided on the attainable rates of codes that guarantee unique reconstruction. Then, efficient constructions of codes that asymptotically meet that upper bound are presented. In the second extension, we study the setup where multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the read substrings' length $\ell$ that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of $\ell$ that asymptotically behave like the lower bound.
Submitted 20 April, 2023; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Equivalence of Insertion/Deletion Correcting Codes for $d$-dimensional Arrays
Authors:
Evagoras Stylianou,
Lorenz Welter,
Rawad Bitar,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
We consider the problem of correcting insertion and deletion errors in the $d$-dimensional space. This problem is well understood for vectors (one-dimensional space) and was recently studied for arrays (two-dimensional space). For vectors and arrays, the problem is motivated by several practical applications such as DNA-based storage and racetrack memories. From a theoretical perspective, it is interesting to know whether the same properties of insertion/deletion correcting codes generalize to the $d$-dimensional space. In this work, we show that the equivalence between insertion and deletion correcting codes generalizes to the $d$-dimensional space. As a particular result, we show the following missing equivalence for arrays: a code that can correct $t_\mathrm{r}$ and $t_\mathrm{c}$ row/column deletions can correct any combination of $t_\mathrm{r}^{\mathrm{ins}}+t_\mathrm{r}^{\mathrm{del}}=t_\mathrm{r}$ and $t_\mathrm{c}^{\mathrm{ins}}+t_\mathrm{c}^{\mathrm{del}}=t_\mathrm{c}$ row/column insertions and deletions. The fundamental limit on the redundancy and a construction of insertion/deletion correcting codes in the $d$-dimensional space remain open for future work.
Submitted 10 August, 2022;
originally announced August 2022.
-
On the Size of Balls and Anticodes of Small Diameter under the Fixed-Length Levenshtein Metric
Authors:
Daniella Bar-Lev,
Tuvi Etzion,
Eitan Yaakobi
Abstract:
The rapid development of DNA storage has brought the deletion and insertion channel to the front line of research. When the number of deletions is equal to the number of insertions, the Fixed Length Levenshtein (FLL) metric is the right measure for the distance between two words of the same length. Similar to any other metric, the size of a ball is one of the most fundamental parameters. In this work, we consider the minimum, maximum, and average size of a ball with radius one, in the FLL metric. The related minimum and the maximum size of a maximal anticode with diameter one are also considered.
Submitted 16 June, 2022;
originally announced June 2022.
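As a small illustration of the ball sizes studied in the abstract above, the following brute-force sketch enumerates the radius-one ball around a word. It assumes that two words of equal length are at FLL distance at most one when one is obtained from the other by a single deletion followed by a single insertion; the function name `fll_ball_radius_one` is hypothetical.

```python
from typing import Set


def fll_ball_radius_one(x: str, alphabet: str = "01") -> Set[str]:
    """All words reachable from x by one deletion followed by one insertion.

    The word x itself is always included (delete a symbol, re-insert it).
    """
    ball = {x}
    for i in range(len(x)):
        shorter = x[:i] + x[i + 1:]            # delete the symbol at position i
        for j in range(len(shorter) + 1):
            for a in alphabet:                 # insert symbol a at position j
                ball.add(shorter[:j] + a + shorter[j:])
    return ball


for word in ("000", "010"):
    print(word, len(fll_ball_radius_one(word)))
# 000 4
# 010 7
```

Even for length three, the ball size depends on the center: the constant word has a ball of size 4, while the alternating word reaches 7 of the 8 binary words.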
-
Covering Sequences for $\ell$-Tuples
Authors:
Sagi Marcovich,
Tuvi Etzion,
Eitan Yaakobi
Abstract:
de Bruijn sequences of order $\ell$, i.e., sequences that contain each $\ell$-tuple as a window exactly once, have found many diverse applications in information theory and most recently in DNA storage. This family of binary sequences has a rate of $1/2$. To overcome this low rate, we study $\ell$-tuples covering sequences, which impose that each $\ell$-tuple appears at least once as a window in the sequence. The cardinality of this family of sequences is analyzed while assuming that $\ell$ is a function of the sequence length $n$. Lower and upper bounds on the asymptotic rate of this family are given. Moreover, we study an upper bound for $\ell$ such that the redundancy of the set of $\ell$-tuples covering sequences is at most a single symbol. Lastly, we present efficient encoding and decoding schemes for $\ell$-tuples covering sequences that meet this bound.
Submitted 8 June, 2022;
originally announced June 2022.
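The difference between the "exactly once" and "at least once" window conditions in the abstract above can be checked directly on short examples. The sketch below treats sequences as acyclic strings (an assumption; a cyclic convention would also wrap windows around the end), and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product


def window_counts(seq: str, ell: int) -> Counter:
    """How often each length-ell window occurs in seq (acyclic windows)."""
    return Counter(seq[i:i + ell] for i in range(len(seq) - ell + 1))


def is_covering(seq: str, ell: int, alphabet: str = "01") -> bool:
    """True if every ell-tuple appears at least once as a window."""
    counts = window_counts(seq, ell)
    return all(counts["".join(t)] >= 1 for t in product(alphabet, repeat=ell))


def is_de_bruijn(seq: str, ell: int, alphabet: str = "01") -> bool:
    """True if every ell-tuple appears exactly once as a window."""
    counts = window_counts(seq, ell)
    return all(counts["".join(t)] == 1 for t in product(alphabet, repeat=ell))


print(is_de_bruijn("0001011100", 3), is_covering("0001011100", 3))    # True True
print(is_de_bruijn("00010111000", 3), is_covering("00010111000", 3))  # False True
print(is_covering("0001011", 3))                                      # False
```

Appending one symbol to the de Bruijn sequence repeats the window `000`, so the longer sequence is still covering but no longer de Bruijn; relaxing the constraint in this way is what enlarges the family of admissible sequences.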
-
Reconstruction from Substrings with Partial Overlap
Authors:
Yonatan Yehezkeally,
Daniella Bar-Lev,
Sagi Marcovich,
Eitan Yaakobi
Abstract:
This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which \emph{all} substrings of some fixed length are read or substrings are read with no overlap, this work considers the setup in which consecutive substrings are read with some given minimum overlap. First, upper bounds are provided on the attainable rates of codes that guarantee unique reconstruction. Then, we present efficient constructions of asymptotically optimal codes that meet the upper bound.
Submitted 8 May, 2022;
originally announced May 2022.
-
Codes for Constrained Periodicity
Authors:
Adir Kobovich,
Orian Leitersdorf,
Daniella Bar-Lev,
Eitan Yaakobi
Abstract:
Reliability is an inherent challenge for the emerging nonvolatile technology of racetrack memories, and there exists a fundamental relationship between codes designed for racetrack memories and codes with constrained periodicity. Previous works have sought to construct codes that avoid periodicity in windows, yet have either only provided existence proofs or required high redundancy. This paper provides the first constructions for avoiding periodicity that are both efficient (average-linear time) and with low redundancy (near the lower bound). The proposed algorithms are based on iteratively repairing windows which contain periodicity until all the windows are valid. Intuitively, such algorithms should not converge as there is no monotonic progression; yet, we prove convergence with average-linear time complexity by exploiting subtle properties of the encoder. Overall, we both provide constructions that avoid periodicity in all windows, and we also study the cardinality of such constraints.
Submitted 25 August, 2022; v1 submitted 8 May, 2022;
originally announced May 2022.
-
The Input and Output Entropies of the $k$-Deletion/Insertion Channel
Authors:
Shubhransh Singhvi,
Omer Sabary,
Daniella Bar-Lev,
Eitan Yaakobi
Abstract:
The channel output entropy of a transmitted word is the entropy of the possible channel outputs and similarly, the input entropy of a received word is the entropy of all possible transmitted words. The goal of this work is to study these entropy values for the k-deletion and k-insertion channels, where exactly k symbols are deleted from or inserted into the transmitted word, respectively. If all possible words are transmitted with the same probability then studying the input and output entropies is equivalent. For both the 1-insertion and 1-deletion channels, it is proved that among all words with a fixed number of runs, the input entropy is minimized for words with a skewed distribution of their run lengths and it is maximized for words with a balanced distribution of their run lengths. Among our results, we establish a conjecture by Atashpendar et al. which claims that for the binary 1-deletion channel, the input entropy is maximized for the alternating words. This conjecture is also verified for the 2-deletion channel, where it is proved that constant words with a single run minimize the input entropy.
Submitted 15 June, 2022; v1 submitted 7 February, 2022;
originally announced February 2022.
-
Adversarial Torn-paper Codes
Authors:
Daniella Bar-Lev,
Sagi Marcovich,
Eitan Yaakobi,
Yonatan Yehezkeally
Abstract:
We study the adversarial torn-paper channel. This problem is motivated by applications in DNA data storage where the DNA strands that carry information may break into smaller pieces which are received out of order. Our model extends the previously researched probabilistic setting to the worst-case. We develop code constructions for any parameters of the channel for which non-vanishing asymptotic rate is possible and show our constructions achieve asymptotically optimal rate while allowing for efficient encoding and decoding. Finally, we extend our results to related settings including multi-strand storage, presence of substitution errors, or incomplete coverage.
Submitted 4 July, 2023; v1 submitted 26 January, 2022;
originally announced January 2022.
-
Insertion and Deletion Correction in Polymer-based Data Storage
Authors:
Anisha Banerjee,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
Synthetic polymer-based storage seems to be a particularly promising candidate that could help to cope with the ever-increasing demand for archival storage requirements. It involves designing molecules of distinct masses to represent the respective bits $\{0,1\}$, followed by the synthesis of a polymer of molecular units that reflects the order of bits in the information string. Reading out the stored data requires the use of a tandem mass spectrometer, that fragments the polymer into shorter substrings and provides their corresponding masses, from which the \emph{composition}, i.e. the number of $1$s and $0$s in the concerned substring can be inferred. Prior works have dealt with the problem of unique string reconstruction from the set of all possible compositions, called \emph{composition multiset}. This was accomplished either by determining which string lengths always allow unique reconstruction, or by formulating coding constraints to facilitate the same for all string lengths. Additionally, error-correcting schemes to deal with substitution errors caused by imprecise fragmentation during the readout process, have also been suggested. This work builds on this research by generalizing previously considered error models, mainly confined to substitution of compositions. To this end, we define new error models that consider insertions of spurious compositions and deletions of existing ones, thereby corrupting the composition multiset. We analyze if the reconstruction codebook proposed by Pattabiraman \emph{et al.} is indeed robust to such errors, and if not, propose new coding constraints to remedy this.
Submitted 24 January, 2022; v1 submitted 21 January, 2022;
originally announced January 2022.
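The composition multiset mentioned in the abstract above can be computed by brute force for short strings. The sketch below is illustrative only; the function name `composition_multiset` is hypothetical, and each composition is represented as a pair (number of 0s, number of 1s) over all contiguous substrings.

```python
from collections import Counter


def composition_multiset(s: str) -> Counter:
    """Multiset of (zeros, ones) compositions over all contiguous substrings of s."""
    multiset: Counter = Counter()
    for i in range(len(s)):
        zeros = ones = 0
        for j in range(i, len(s)):   # extend the substring s[i..j] by one symbol
            if s[j] == "1":
                ones += 1
            else:
                zeros += 1
            multiset[(zeros, ones)] += 1
    return multiset


# A string and its reversal always share the same composition multiset,
# so unique reconstruction is at best possible up to reversal.
print(composition_multiset("001") == composition_multiset("100"))  # True
print(composition_multiset("001") == composition_multiset("010"))  # False
```

An insertion or deletion error in the sense of the abstract would correspond to adding a spurious pair to, or removing an existing pair from, this multiset.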
-
On The Decoding Error Weight of One or Two Deletion Channels
Authors:
Omer Sabary,
Daniella Bar-Lev,
Yotam Gershon,
Alexander Yucovich,
Eitan Yaakobi
Abstract:
This paper tackles two problems that are relevant to coding for insertions and deletions. These problems are motivated by several applications, among them the reconstruction of strands in DNA-based storage systems. Under this paradigm, a word is transmitted over some fixed number of identical independent channels and the goal of the decoder is to output the transmitted word or some close approximation of it. The first part of this paper studies the deletion channel that deletes a symbol with some fixed probability $p$, while focusing on two instances of this channel. Since operating the maximum likelihood (ML) decoder in this case is computationally infeasible, we study a slightly degraded version of this decoder for two channels and its expected normalized distance. We identify the dominant error patterns and, based on these observations, derive that the expected normalized distance of the degraded ML decoder is roughly $\frac{3q-1}{q-1}p^2$, when the transmitted word is any $q$-ary sequence and $p$ is the channel's deletion probability. We also study the cases when the transmitted word belongs to the Varshamov-Tenengolts (VT) code or the shifted VT code. Additionally, the insertion channel is studied as well as the case of two insertion channels. These theoretical results are verified by corresponding simulations. The second part of the paper studies optimal decoding for a special case of the deletion channel, the $k$-deletion channel, which deletes exactly $k$ symbols of the transmitted word uniformly at random. In this part, the goal is to understand how an optimal decoder operates in order to minimize the expected normalized distance. A full characterization of an efficient optimal decoder for this setup, referred to as the maximum likelihood* (ML*) decoder, is given for a channel that deletes one or two symbols.
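The two channel models named in the abstract can be simulated in a few lines. The sketch below is an illustration only: the function names are hypothetical, "uniformly at random" is assumed to mean a uniform choice among all sets of $k$ positions, and the VT membership test uses the standard binary definition $\sum_i i\,x_i \equiv a \pmod{n+1}$ rather than anything specific to the paper. It does not implement the ML or ML* decoders.

```python
import random


def deletion_channel(word: str, p: float, rng: random.Random) -> str:
    """Delete each symbol independently with probability p."""
    return "".join(c for c in word if rng.random() >= p)


def k_deletion_channel(word: str, k: int, rng: random.Random) -> str:
    """Delete exactly k symbols; the set of k deleted positions is
    assumed to be chosen uniformly among all k-subsets of positions."""
    removed = set(rng.sample(range(len(word)), k))
    return "".join(c for i, c in enumerate(word) if i not in removed)


def in_vt_code(bits: str, a: int = 0) -> bool:
    """Standard binary VT_a(n) membership: sum of i * x_i (positions
    1-indexed) is congruent to a modulo n + 1."""
    n = len(bits)
    return sum(i * int(b) for i, b in enumerate(bits, start=1)) % (n + 1) == a


rng = random.Random(0)                      # fixed seed for repeatability
received = k_deletion_channel("ACGTACGT", 2, rng)
```

Passing the same word through two independent calls of either channel gives the two-channel setting that the first part of the abstract studies.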
Submitted 7 January, 2022;
originally announced January 2022.
-
Lifted Reed-Solomon Codes and Lifted Multiplicity Codes
Authors:
Lukas Holzbaur,
Rina Polyanskaya,
Nikita Polyanskii,
Ilya Vorobyev,
Eitan Yaakobi
Abstract:
Lifted Reed-Solomon and multiplicity codes are classes of codes, constructed from specific sets of $m$-variate polynomials. These codes allow for the design of high-rate codes that can recover every codeword or information symbol from many disjoint sets. Recently, the underlying approaches have been combined for the bi-variate case to construct lifted multiplicity codes, a generalization of lifted codes that can offer further rate improvements. We continue the study of these codes by first establishing new lower bounds on the rate of lifted Reed-Solomon codes for any number of variables $m$, which improve upon the known bounds for any $m\ge 4$. Next, we use these results to provide lower bounds on the rate and distance of lifted multiplicity codes obtained from polynomials in an arbitrary number of variables, which improve upon the known results for any $m\ge 3$.
Specifically, we investigate a subcode of a lifted multiplicity code formed by the linear span of $m$-variate monomials whose restriction to an arbitrary line in $\mathbb{F}_q^m$ is equivalent to a low-degree univariate polynomial. We find the tight asymptotic behavior of the fraction of such monomials when the number of variables $m$ is fixed and the alphabet size $q=2^\ell$ is large. Using these results, we give a new explicit construction of batch codes utilizing lifted Reed-Solomon codes. For some parameter regimes, these codes have a better trade-off between parameters than previously known batch codes. Further, we show that lifted multiplicity codes have a better trade-off between redundancy and the number of disjoint recovering sets for every codeword or information symbol than previously known constructions, thereby providing the best known PIR codes for some parameter regimes. Additionally, we present a new local self-correction algorithm for lifted multiplicity codes.
Submitted 11 October, 2021; v1 submitted 5 October, 2021;
originally announced October 2021.
-
Endurance-Limited Memories: Capacity and Codes
Authors:
Yeow Meng Chee,
Michal Horovitz,
Alexander Vardy,
Van Khu Vu,
Eitan Yaakobi
Abstract:
\emph{Resistive memories}, such as \emph{phase change memories} and \emph{resistive random access memories}, have attracted significant attention in recent years due to their better scalability, speed, rewritability, and yet non-volatility. However, their \emph{limited endurance} is still a major drawback that has to be improved before they can be widely adopted in large-scale systems.
In this work, in order to reduce the wear out of the cells, we propose a new coding scheme, called \emph{endurance-limited memories} (\emph{ELM}) codes, that increases the endurance of these memories by limiting the number of cell programming operations. Namely, an \emph{$\ell$-change $t$-write ELM code} is a coding scheme that allows writing $t$ messages into some $n$ binary cells while guaranteeing that each cell is programmed at most $\ell$ times. In case $\ell=1$, these codes coincide with the well-studied \emph{write-once memory} (\emph{WOM}) codes. We study some models of these codes which depend upon whether the encoder knows on each write the number of times each cell was programmed, knows only the memory state, or even does not know anything. For the decoder, we consider the same three cases. We fully characterize the capacity regions and the maximum sum-rates of three models where the encoder knows on each write the number of times each cell was programmed. In particular, it is shown that in these models the maximum sum-rate is $\log \sum_{i=0}^{\ell} {t \choose i}$. We also study and expose the capacity regions of the models where the decoder is informed of the number of times each cell was programmed. Finally, we present the most practical model, where the encoder reads the memory before encoding new data and the decoder has no information about the previous states of the memory.
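The maximum sum-rate expression $\log \sum_{i=0}^{\ell} {t \choose i}$ is easy to evaluate numerically. The sketch below assumes the logarithm is base 2 (the abstract does not state the base) and uses a hypothetical function name; it applies only to the models the abstract names, where the encoder knows how many times each cell was programmed.

```python
from math import comb, log2


def elm_max_sum_rate(t: int, ell: int) -> float:
    """Evaluate log2(sum_{i=0}^{ell} C(t, i)).

    Base-2 logarithm is an assumption; t is the number of writes and
    ell the maximum number of times a cell may be programmed.
    """
    return log2(sum(comb(t, i) for i in range(ell + 1)))


# Tabulate a few values for t = 4 writes.
for ell in range(0, 5):
    print(ell, elm_max_sum_rate(4, ell))
```

Two sanity checks follow directly from the formula: for $\ell = 1$ it reduces to $\log(t+1)$, and for $\ell \ge t$ the sum equals $2^t$, so the expression equals $t$.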
Submitted 20 September, 2021;
originally announced September 2021.
-
Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning
Authors:
Daniella Bar-Lev,
Itai Orr,
Omer Sabary,
Tuvi Etzion,
Eitan Yaakobi
Abstract:
DNA-based storage is an emerging technology that enables digital information to be archived in DNA molecules. This method enjoys major advantages over magnetic and optical storage solutions such as exceptional information density, enhanced data durability, and negligible power consumption to maintain data integrity. To access the data, an information retrieval process is employed, where some of the main bottlenecks are the scalability and accuracy, which have a natural tradeoff between the two. Here we show a modular and holistic approach that combines Deep Neural Networks (DNN) trained on simulated data, Tensor-Product (TP) based Error-Correcting Codes (ECC), and a safety margin mechanism into a single coherent pipeline. We demonstrated our solution on 3.1MB of information using two different sequencing technologies. Our work improves upon the current leading solutions by up to x3200 increase in speed, 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high noise regime. In a broader sense, our work shows a viable path to commercial DNA storage solutions hindered by current information retrieval processes.
Submitted 11 March, 2024; v1 submitted 31 August, 2021;
originally announced September 2021.
-
Multi-strand Reconstruction from Substrings
Authors:
Yonatan Yehezkeally,
Sagi Marcovich,
Eitan Yaakobi
Abstract:
The problem of string reconstruction based on its substrings spectrum has received significant attention recently due to its applicability to DNA data storage and sequencing. In contrast to previous works, we consider in this paper a setup of this problem where multiple strings are reconstructed together. Given a multiset $S$ of strings, all their substrings of some fixed length $\ell$, defined as the $\ell$-profile of $S$, are received and the goal is to reconstruct all strings in $S$. A multi-strand $\ell$-reconstruction code is a set of multisets such that every element $S$ can be reconstructed from its $\ell$-profile. Given the number of strings~$k$ and their length~$n$, we first find a lower bound on the value of $\ell$ necessary for existence of multi-strand $\ell$-reconstruction codes with non-vanishing asymptotic rate. We then present two constructions of such codes and show that their rates approach~$1$ for values of $\ell$ that asymptotically behave like the lower bound.
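As an illustration of the $\ell$-profile, the sketch below collects all length-$\ell$ substrings of every string in a multiset $S$. Treating the profile as a multiset (with multiplicities) is an assumption made here for illustration, as is the function name `l_profile`.

```python
from collections import Counter
from typing import Iterable


def l_profile(strings: Iterable[str], ell: int) -> Counter:
    """Collect every length-ell substring of every string in the input.

    The result is kept as a multiset (Counter); whether the paper counts
    multiplicities is an assumption of this sketch.
    """
    profile = Counter()
    for s in strings:
        for i in range(len(s) - ell + 1):
            profile[s[i:i + ell]] += 1
    return profile


# Two different multisets of k = 2 strings of length n = 3 that share the
# same 2-profile, so they cannot both belong to a 2-reconstruction code.
p1 = l_profile(["010", "101"], 2)
p2 = l_profile(["010", "010"], 2)
ambiguous = p1 == p2
```

The collision above shows why $\ell$ cannot be too small relative to $k$ and $n$, which is the kind of requirement the lower bound in the abstract quantifies.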
Submitted 26 August, 2021;
originally announced August 2021.
-
On Levenshtein Balls with Radius One
Authors:
Daniella Bar-Lev,
Tuvi Etzion,
Eitan Yaakobi
Abstract:
The rapid development of DNA storage has brought the deletion and insertion channel, once again, to the front line of research. When the number of deletions is equal to the number of insertions, the Fixed Length Levenshtein (FLL) metric is the right measure for the distance between two words of the same length. The size of a ball is one of the most fundamental parameters in any metric. The size of the ball with radius one in the FLL metric depends on the number of runs and the length of the alternating segments of the given word. In this work, we find the minimum, maximum, and average size of a ball with radius one, in the FLL metric. The related minimum and maximum sizes of a maximal anticode with diameter one are also calculated.
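For short words, the radius-one ball in the FLL metric can be enumerated by brute force: apply one deletion followed by one insertion in every possible way. The sketch below assumes this reading of the metric (a word at distance at most one is reachable by one deletion and one insertion) and uses a hypothetical function name; it is not the counting formula derived in the paper.

```python
def fll_ball_radius_one(x: str, alphabet: str = "01") -> set:
    """Enumerate all words of length len(x) obtained from x by one
    deletion followed by one insertion. Deleting and re-inserting the
    same symbol returns x, so x itself is always in the ball."""
    ball = set()
    for i in range(len(x)):
        shorter = x[:i] + x[i + 1:]                  # delete position i
        for j in range(len(shorter) + 1):
            for a in alphabet:
                ball.add(shorter[:j] + a + shorter[j:])  # insert a at j
    return ball


# A single-run word versus an alternating word of the same length.
size_single_run = len(fll_ball_radius_one("000"))
size_alternating = len(fll_ball_radius_one("010"))
```

Comparing the two sizes shows the dependence on the run structure that the abstract describes: the alternating word has a larger ball than the single-run word.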
Submitted 29 June, 2021; v1 submitted 2 March, 2021;
originally announced March 2021.
-
Function-Correcting Codes
Authors:
Andreas Lenz,
Rawad Bitar,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
In this paper we study function-correcting codes, a new class of codes designed to protect the function evaluation of a message against errors. We show that FCCs are equivalent to irregular-distance codes, i.e., codes that obey some given distance requirement between each pair of codewords. Using these connections, we study irregular-distance codes and derive general upper and lower bounds on their optimal redundancy. Since these bounds heavily depend on the specific function, we provide simplified, suboptimal bounds that are easier to evaluate. We further employ our general results to specific functions of interest and compare our results to standard error-correcting codes, which protect the whole message.
Submitted 22 May, 2023; v1 submitted 5 February, 2021;
originally announced February 2021.
-
Multiple Criss-Cross Insertion and Deletion Correcting Codes
Authors:
Lorenz Welter,
Rawad Bitar,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
This paper investigates the problem of correcting multiple criss-cross insertions and deletions in arrays. More precisely, we study the unique recovery of $n \times n$ arrays affected by $t$-criss-cross deletions defined as any combination of $t_r$ row and $t_c$ column deletions such that $t_r + t_c = t$ for a given $t$. We show an equivalence between correcting $t$-criss-cross deletions and $t$-criss-cross insertions and show that a code correcting $t$-criss-cross insertions/deletions has redundancy at least $tn + t \log n - \log(t!)$. Then, we present an existential construction of a $t$-criss-cross insertion/deletion correcting code with redundancy bounded from above by $tn + \mathcal{O}(t^2 \log^2 n)$. The main ingredients of the presented code construction are systematic binary $t$-deletion correcting codes and Gabidulin codes. The first ingredient helps locate the indices of the inserted/deleted rows and columns, thus transforming the insertion/deletion-correction problem into a row/column erasure-correction problem, which is then solved using the second ingredient.
Submitted 15 November, 2021; v1 submitted 4 February, 2021;
originally announced February 2021.
-
The Zero Cubes Free and Cubes Unique Multidimensional Constraints
Authors:
Sagi Marcovich,
Eitan Yaakobi
Abstract:
This paper studies two families of constraints for two-dimensional and multidimensional arrays. The first family requires that a multidimensional array will not contain a cube of zeros of some fixed size and the second constraint imposes that there will not be two identical cubes of a given size in the array. These constraints are natural extensions of their one-dimensional counterparts that have been rigorously studied recently. For both of these constraints, we present conditions on the size of the cube for which the asymptotic rate of the set of valid arrays approaches 1 as well as conditions for the redundancy to be at most a single symbol. For the first family we present an efficient encoding algorithm that uses a single symbol to encode arbitrary information into a valid array and for the second family we present a similar encoder for the two-dimensional case. The results in the paper are also extended to similar constraints where the sub-array is not necessarily a cube, but a box of arbitrary dimensions and only its volume is bounded.
Submitted 31 January, 2021;
originally announced February 2021.
-
Correctable Erasure Patterns in Product Topologies
Authors:
Lukas Holzbaur,
Sven Puchinger,
Eitan Yaakobi,
Antonia Wachter-Zeh
Abstract:
Locality enables storage systems to recover failed nodes from small subsets of surviving nodes. The setting where nodes are partitioned into subsets, each allowing for local recovery, is well understood. In this work we consider a generalization introduced by Gopalan et al., where, viewing the codewords as arrays, constraints are imposed on the columns and rows in addition to some global constraints. Specifically, we present a generic method of adding such global parity-checks and derive new results on the set of correctable erasure patterns. Finally, we relate the set of correctable erasure patterns in the considered topology to those correctable in tensor-product codes.
Submitted 10 February, 2021; v1 submitted 25 January, 2021;
originally announced January 2021.
-
Almost Optimal Construction of Functional Batch Codes Using Hadamard Codes
Authors:
Lev Yohananov,
Eitan Yaakobi
Abstract:
A \textit{functional $k$-batch} code of dimension $s$ consists of $n$ servers storing linear combinations of $s$ linearly independent information bits. Any multiset request of size $k$ of linear combinations (or requests) of the information bits can be recovered by $k$ disjoint subsets of the servers. The goal under this paradigm is to find the minimum number of servers for given values of $s$ and $k$. A recent conjecture states that for any $k=2^{s-1}$ requests the optimal solution requires $2^s-1$ servers. This conjecture is verified for $s\leq 5$ but previous work could only show that codes with $n=2^s-1$ servers can support a solution for $k=2^{s-2} + 2^{s-4} + \left\lfloor \frac{ 2^{s/2}}{\sqrt{24}} \right\rfloor$ requests. This paper reduces this gap and shows the existence of codes for $k=\lfloor \frac{5}{6}2^{s-1} \rfloor - s$ requests with the same number of servers. Another construction in the paper provides a code with $n=2^{s+1}-2$ servers and $k=2^{s}$ requests, which is an optimal result. These constructions are mainly based on Hadamard codes and equivalently provide constructions for \textit{parallel Random I/O (RIO)} codes.
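The three request counts quoted in the abstract for $n = 2^s - 1$ servers can be compared numerically. The sketch below simply evaluates the stated expressions; the function names are hypothetical, and the restriction to $s \ge 4$ is an assumption made so that $2^{s-4}$ stays an integer.

```python
import math


def conjectured_k(s: int) -> int:
    """Conjectured number of supported requests: 2^(s-1)."""
    return 2 ** (s - 1)


def previous_k(s: int) -> int:
    """Earlier result: 2^(s-2) + 2^(s-4) + floor(2^(s/2) / sqrt(24))."""
    return 2 ** (s - 2) + 2 ** (s - 4) + math.floor(2 ** (s / 2) / math.sqrt(24))


def improved_k(s: int) -> int:
    """Result stated in the abstract: floor((5/6) * 2^(s-1)) - s,
    computed with integer arithmetic to avoid rounding issues."""
    return (5 * 2 ** (s - 1)) // 6 - s


# Tabulate the three quantities for a few dimensions s >= 4.
for s in (8, 10, 12):
    print(s, previous_k(s), improved_k(s), conjectured_k(s))
```

Because of the subtracted $s$ term, the improvement over the earlier expression shows up as $s$ grows rather than at every small value of $s$.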
Submitted 16 October, 2021; v1 submitted 17 January, 2021;
originally announced January 2021.