-
Index-Based Concatenated Codes for the Multi-Draw DNA Storage Channel
Authors:
Lorenz Welter,
Issam Maarouf,
Andreas Lenz,
Antonia Wachter-Zeh,
Eirik Rosnes,
Alexandre Graell i Amat
Abstract:
We consider error-correcting coding for DNA-based storage. We model the DNA storage channel as a multi-draw IDS channel where the input data is chunked into $M$ short DNA strands, which are copied a random number of times, and the channel outputs a random selection of $N$ noisy DNA strands. The retrieved DNA strands are prone to insertion, deletion, and substitution (IDS) errors. We propose an index-based concatenated coding scheme consisting of the concatenation of an outer code, an index code, and an inner synchronization code, where the latter two tackle IDS errors. We further propose a mismatched joint index-synchronization code maximum a posteriori probability decoder with optional clustering to infer symbolwise a posteriori probabilities for the outer decoder. We compute achievable information rates for the outer code and present Monte-Carlo simulations for information-outage probabilities and frame error rates on synthetic and experimental data, respectively.
Submitted 21 June, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
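The channel model in the abstract above can be illustrated with a minimal simulation sketch. Everything beyond the abstract's description is an assumption made for illustration: the function names `ids_channel` and `multi_draw_channel`, uniform drawing with replacement, and the simple per-symbol error model with probabilities `p_ins`, `p_del`, and `p_sub` (with `p_ins < 1`). It is not the authors' channel model or decoder.

```python
import random

ALPHABET = "ACGT"

def ids_channel(strand, p_ins, p_del, p_sub, rng):
    """Pass one strand through a simple IDS channel (illustrative model only)."""
    out = []
    for symbol in strand:
        # Assumed model: a geometric number of random insertions before each symbol.
        while rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            continue  # symbol deleted
        if r < p_del + p_sub:
            # substitution by a different symbol
            out.append(rng.choice([s for s in ALPHABET if s != symbol]))
        else:
            out.append(symbol)  # symbol transmitted correctly
    return "".join(out)

def multi_draw_channel(strands, n_draws, p_ins, p_del, p_sub, seed=0):
    """Draw n_draws strands uniformly with replacement (assumption) and add IDS noise."""
    rng = random.Random(seed)
    reads = [ids_channel(rng.choice(strands), p_ins, p_del, p_sub, rng)
             for _ in range(n_draws)]
    rng.shuffle(reads)  # the output carries no ordering information
    return reads
```

With all error probabilities set to zero, every read is an exact copy of one of the $M$ input strands, and some strands may never be drawn, which is why an index and an outer code are needed in such a scheme.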
-
Sequential Decoding of Convolutional Codes for Synchronization Errors
Authors:
Anisha Banerjee,
Andreas Lenz,
Antonia Wachter-Zeh
Abstract:
Sequential decoding, commonly applied to substitution channels, is a sub-optimal alternative to Viterbi decoding with significantly reduced memory costs. In this work, a sequential decoder for convolutional codes over channels that are prone to insertion, deletion, and substitution errors is described and analyzed. Our decoder expands the code trellis by a new channel-state variable, called drift state, as proposed by Davey and MacKay. A suitable decoding metric on that trellis for sequential decoding is derived, generalizing the original Fano metric. The decoder is also extended to facilitate the simultaneous decoding of multiple received sequences that arise from a single transmitted sequence. In low-noise environments, our decoding approach reduces the decoding complexity by a couple of orders of magnitude in comparison to Viterbi's algorithm, albeit at slightly higher bit error rates. An analytical method to determine the computational cutoff rate is also suggested. This analysis is supported with numerical evaluations of bit error rates and computational complexity, which are compared with respect to optimal Viterbi decoding.
Submitted 25 September, 2023; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Concatenated Codes for Multiple Reads of a DNA Sequence
Authors:
Issam Maarouf,
Andreas Lenz,
Lorenz Welter,
Antonia Wachter-Zeh,
Eirik Rosnes,
Alexandre Graell i Amat
Abstract:
Decoding sequences that stem from multiple transmissions of a codeword over an insertion, deletion, and substitution channel is a critical component of efficient deoxyribonucleic acid (DNA) data storage systems. In this paper, we consider a concatenated coding scheme with an outer nonbinary low-density parity-check code or a polar code and either an inner convolutional code or a time-varying block code. We propose two novel decoding algorithms for inference from multiple received sequences, both combining the inner code and channel to a joint hidden Markov model to infer symbolwise a posteriori probabilities (APPs). The first decoder computes the exact APPs by jointly decoding the received sequences, whereas the second decoder approximates the APPs by combining the results of separately decoded received sequences and has a complexity that is linear with the number of sequences. Using the proposed algorithms, we evaluate the performance of decoding multiple received sequences by means of achievable information rates and Monte-Carlo simulations. We show significant performance gains compared to a single received sequence. In addition, we succeed in improving the performance of the aforementioned coding scheme by optimizing both the inner and outer codes.
Submitted 12 September, 2022; v1 submitted 29 November, 2021;
originally announced November 2021.
-
Multivariate Analytic Combinatorics for Cost Constrained Channels and Subsequence Enumeration
Authors:
Andreas Lenz,
Stephen Melczer,
Cyrus Rashtchian,
Paul H. Siegel
Abstract:
Analytic combinatorics in several variables is a powerful tool for deriving the asymptotic behavior of combinatorial quantities by analyzing multivariate generating functions. We study information-theoretic questions about sequences in a discrete noiseless channel under cost and forbidden substring constraints. Our main contributions involve the relationship between the graph structure of the channel and the singularities of the bivariate generating function whose coefficients are the number of sequences satisfying the constraints. We combine these new results with methods from multivariate analytic combinatorics to solve questions in many application areas. For example, we determine the optimal coded synthesis rate for DNA data storage when the synthesis supersequence is any periodic string. This follows from a precise characterization of the number of subsequences of an arbitrary periodic string. Along the way, we provide a new proof of the equivalence of the combinatorial and probabilistic definitions of the cost-constrained capacity, and we show that the cost-constrained channel capacity is determined by a cost-dependent singularity, generalizing Shannon's classical result for unconstrained capacity.
Submitted 14 November, 2021; v1 submitted 11 November, 2021;
originally announced November 2021.
-
Function-Correcting Codes
Authors:
Andreas Lenz,
Rawad Bitar,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
In this paper, we study function-correcting codes (FCCs), a new class of codes designed to protect the function evaluation of a message against errors. We show that FCCs are equivalent to irregular-distance codes, i.e., codes that obey some given distance requirement between each pair of codewords. Using these connections, we study irregular-distance codes and derive general upper and lower bounds on their optimal redundancy. Since these bounds heavily depend on the specific function, we provide simplified, suboptimal bounds that are easier to evaluate. We further apply our general results to specific functions of interest and compare our results to standard error-correcting codes, which protect the whole message.
Submitted 22 May, 2023; v1 submitted 5 February, 2021;
originally announced February 2021.
-
Concatenated Codes for Recovery From Multiple Reads of DNA Sequences
Authors:
Andreas Lenz,
Issam Maarouf,
Lorenz Welter,
Antonia Wachter-Zeh,
Eirik Rosnes,
Alexandre Graell i Amat
Abstract:
Decoding sequences that stem from multiple transmissions of a codeword over an insertion, deletion, and substitution channel is a critical component of efficient deoxyribonucleic acid (DNA) data storage systems. In this paper, we consider a concatenated coding scheme with an outer low-density parity-check code and either an inner convolutional code or a block code. We propose two new decoding algorithms for inference from multiple received sequences, both combining the inner code and channel to a joint hidden Markov model to infer symbolwise a posteriori probabilities (APPs). The first decoder computes the exact APPs by jointly decoding the received sequences, whereas the second decoder approximates the APPs by combining the results of separately decoded received sequences. Using the proposed algorithms, we evaluate the performance of decoding multiple received sequences by means of achievable information rates and Monte-Carlo simulations. We show significant performance gains compared to a single received sequence.
Submitted 29 October, 2020;
originally announced October 2020.
-
Achievable Rates of Concatenated Codes in DNA Storage under Substitution Errors
Authors:
Andreas Lenz,
Lorenz Welter,
Sven Puchinger
Abstract:
In this paper, we study achievable rates of concatenated coding schemes over a deoxyribonucleic acid (DNA) storage channel. Our channel model incorporates the main features of DNA-based data storage. First, information is stored on many short DNA strands. Second, the strands are stored in an unordered fashion inside the storage medium and each strand is replicated many times. Third, the data is accessed in an uncontrollable manner, i.e., random strands are drawn from the medium and received, possibly with errors. As one of our results, we show that there is a significant gap between the channel capacity and the achievable rate of a standard concatenated code in which one strand corresponds to an inner block. This is in fact surprising, as for other channels, such as $q$-ary symmetric channels, concatenated codes are known to achieve the capacity. We further propose a modified concatenated coding scheme that combines several strands into one inner block, which narrows the gap and achieves rates that are close to the capacity.
Submitted 30 April, 2020;
originally announced May 2020.
-
Optimal Codes Correcting a Burst of Deletions of Variable Length
Authors:
Andreas Lenz,
Nikita Polyanskii
Abstract:
In this paper, we present an efficiently encodable and decodable code construction that is capable of correcting a burst of deletions of length at most $k$. The redundancy of this code is $\log n + \frac{k(k+1)}{2}\log \log n + c_k$ for some constant $c_k$ that only depends on $k$ and thus is scaling-optimal. The code can be split into two main components. First, we impose a constraint that allows locating the burst of deletions up to an interval of size roughly $\log n$. Then, with the knowledge of the approximate location of the burst, we use several shifted Varshamov-Tenengolts codes to correct the burst of deletions, which only requires a small amount of redundancy since the location is already known up to an interval of small size. Finally, we show how to efficiently encode and decode the code.
Submitted 18 January, 2020;
originally announced January 2020.
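As background for the Varshamov-Tenengolts building block mentioned in the abstract above, the following sketch shows the classical binary VT code for a single deletion, where a word $x \in \{0,1\}^n$ is a codeword if $\sum_{i=1}^{n} i\,x_i \equiv a \pmod{n+1}$. This is the textbook VT code, not the shifted variant or the burst-correcting construction of the paper, and the brute-force decoder and the function names are illustrative assumptions.

```python
def vt_syndrome(x):
    """VT checksum: sum of i * x_i over 1-based positions, modulo n + 1."""
    n = len(x)
    return sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1)

def vt_decode_single_deletion(y, a=0):
    """Brute-force decoder: reinsert one bit everywhere, keep words with syndrome a.

    For a word y obtained from a VT codeword by one deletion, the returned
    set contains exactly that codeword.
    """
    y = tuple(y)
    candidates = set()
    for pos in range(len(y) + 1):
        for bit in (0, 1):
            cand = y[:pos] + (bit,) + y[pos:]
            if vt_syndrome(cand) == a:
                candidates.add(cand)
    return candidates
```

The brute-force search is only meant to make the uniqueness property visible; practical VT decoders recover the deleted bit and its run directly from the checksum deficiency in linear time.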
-
Covering Codes using Insertions or Deletions
Authors:
Andreas Lenz,
Cyrus Rashtchian,
Paul H. Siegel,
Eitan Yaakobi
Abstract:
A covering code is a set of codewords with the property that the union of balls, suitably defined, around these codewords covers an entire space. Generally, the goal is to find the covering code with the minimum size codebook. While most prior work on covering codes has focused on the Hamming metric, we consider the problem of designing covering codes defined in terms of either insertions or deletions. First, we provide new sphere-covering lower bounds on the minimum possible size of such codes. Then, we provide new existential upper bounds on the size of optimal covering codes for a single insertion or a single deletion that are tight up to a constant factor. Finally, we derive improved upper bounds for covering codes using $R\geq 2$ insertions or deletions. We prove that codes exist with density that is only a factor $O(R \log R)$ larger than the lower bounds for all fixed $R$. In particular, our upper bounds have an optimal dependence on the word length, and we achieve asymptotic density matching the best known bounds for Hamming distance covering codes.
Submitted 25 May, 2020; v1 submitted 22 November, 2019;
originally announced November 2019.
-
Clustering-Correcting Codes
Authors:
Tal Shinkar,
Eitan Yaakobi,
Andreas Lenz,
Antonia Wachter-Zeh
Abstract:
A new family of codes, called clustering-correcting codes, is presented in this paper. This family of codes is motivated by the special structure of data that is stored in DNA-based storage systems. The data stored in these systems has the form of unordered sequences, also called strands, and every strand is synthesized thousands to millions of times, where some of these copies are read back during sequencing. Due to the unordered structure of the strands, an important task in the decoding process is to place them in their correct order. This is usually accomplished by allocating a part of the strand for an index. However, in the presence of errors in the index field, important information on the order of the strands may be lost.
Clustering-correcting codes ensure that if the distance between the index fields of two strands is small, then there will be a large distance between their data fields. It is shown how this property makes it possible to place the strands together in their correct clusters even in the presence of errors. We present lower and upper bounds on the size of clustering-correcting codes and an explicit construction of these codes which uses only a single bit of redundancy.
Submitted 11 March, 2019;
originally announced March 2019.
-
Anchor-Based Correction of Substitutions in Indexed Sets
Authors:
Andreas Lenz,
Paul H. Siegel,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
Motivated by DNA-based data storage, we investigate a system where digital information is stored in an unordered set of several vectors over a finite alphabet. Each vector begins with a unique index that represents its position in the whole data set and does not contain data. This paper deals with the design of error-correcting codes for such indexed sets in the presence of substitution errors. We propose a construction that efficiently deals with the challenges that arise when designing codes for unordered sets. Using a novel mechanism, called anchoring, we show that it is possible to combat the ordering loss of sequences with only a small amount of redundancy, which allows the use of standard coding techniques, such as tensor-product codes, to correct errors within the sequences. We finally derive upper and lower bounds on the achievable redundancy of codes within the considered channel model and verify that our construction yields a redundancy that is close to the best achievable one. Our results surprisingly indicate that it requires less redundancy to correct errors in the indices than in the data part of vectors.
Submitted 21 January, 2019;
originally announced January 2019.
-
Coding over Sets for DNA Storage
Authors:
Andreas Lenz,
Paul H. Siegel,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
In this paper, we study error-correcting codes for the storage of data in synthetic deoxyribonucleic acid (DNA). We investigate a storage model where a data set is represented by an unordered set of $M$ sequences, each of length $L$. Errors within that model are a loss of whole sequences and point errors inside the sequences, such as insertions, deletions and substitutions. We derive Gilbert-Varshamov lower bounds and sphere packing upper bounds on achievable cardinalities of error-correcting codes within this storage model. We further propose explicit code constructions that can correct errors in such a storage system and that can be encoded and decoded efficiently. Comparing the sizes of these codes to the upper bounds, we show that many of the constructions are close to optimal.
Submitted 12 February, 2020; v1 submitted 7 December, 2018;
originally announced December 2018.
-
Bounds and Constructions for Multi-Symbol Duplication Error Correcting Codes
Authors:
Andreas Lenz,
Niklas Jünger,
Antonia Wachter-Zeh
Abstract:
In this paper, we study codes correcting $t$ duplications of $\ell$ consecutive symbols. These errors are known as tandem duplication errors, where a sequence of symbols is repeated and inserted directly after its original occurrence. Using sphere packing arguments, we derive non-asymptotic upper bounds on the cardinality of codes that correct such errors for any choice of parameters. Based on the fact that a code correcting insertions of $t$ zero-blocks can be used to correct $t$ tandem duplications, we construct codes for tandem duplication errors. We compare the cardinalities of these codes with their sphere packing upper bounds. Finally, we discuss the asymptotic behavior of the derived codes and bounds, which yields insights about the tandem duplication channel.
Submitted 19 September, 2018; v1 submitted 8 July, 2018;
originally announced July 2018.
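The link between tandem duplications and zero-block insertions stated in the abstract above can be made concrete with a small sketch. It uses the standard $\ell$-step difference transform, which keeps the first $\ell$ symbols and replaces each later symbol $x_{j+\ell}$ by $x_{j+\ell} - x_j \bmod q$. The function names, the 0-based `start` parameter, and the alphabet size `q = 4` are assumptions chosen for illustration, not notation from the paper.

```python
def tandem_duplicate(x, start, ell):
    """Repeat the block x[start:start+ell] directly after its original occurrence."""
    assert 0 <= start and start + ell <= len(x)
    return x[:start + ell] + x[start:start + ell] + x[start + ell:]

def diff_transform(x, ell, q=4):
    """Return (prefix, diffs): the first ell symbols and the ell-step differences mod q."""
    prefix = x[:ell]
    diffs = [(x[j + ell] - x[j]) % q for j in range(len(x) - ell)]
    return prefix, diffs

# Example over the alphabet {0, 1, 2, 3}: duplicate the block [1, 2] of length 2.
x = [0, 1, 2, 3, 1, 0]
y = tandem_duplicate(x, start=1, ell=2)
prefix_x, z = diff_transform(x, ell=2)
prefix_y, z_dup = diff_transform(y, ell=2)
```

In this example the prefix is unchanged and `z_dup` equals `z` with a block of two zeros inserted at position `start`, so a code that corrects insertions of zero-blocks in the transformed domain also corrects the tandem duplication.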
-
Coding over Sets for DNA Storage
Authors:
Andreas Lenz,
Paul H. Siegel,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
In this paper, we study error-correcting codes for the storage of data in synthetic deoxyribonucleic acid (DNA). We investigate a storage model where data is represented by an unordered set of $M$ sequences, each of length $L$. Errors within that model are losses of whole sequences and point errors inside the sequences, such as substitutions, insertions and deletions. We propose code constructions which can correct these errors with efficient encoders and decoders. By deriving upper bounds on the cardinalities of these codes using sphere packing arguments, we show that many of our codes are close to optimal.
Submitted 9 May, 2018; v1 submitted 15 January, 2018;
originally announced January 2018.
-
Duplication-Correcting Codes
Authors:
Andreas Lenz,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
In this work, we propose constructions that correct duplications of multiple consecutive symbols. These errors are known as tandem duplications, where a sequence of symbols is repeated; respectively as palindromic duplications, where a sequence is repeated in reversed order. We compare the redundancies of these constructions with code size upper bounds that are obtained from sphere packing arguments. Proving that an upper bound on the code cardinality for tandem deletions is also an upper bound for inserting tandem duplications, we derive the bounds based on this special tandem deletion error as this results in tighter bounds. Our upper bounds on the cardinality directly imply lower bounds on the redundancy which we compare with the redundancy of the best known construction correcting arbitrary burst insertions. Our results indicate that the correction of palindromic duplications requires more redundancy than the correction of tandem duplications and both significantly less than arbitrary burst insertions.
Submitted 23 December, 2017;
originally announced December 2017.
-
Bounds on Codes Correcting Tandem and Palindromic Duplications
Authors:
Andreas Lenz,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
In this work, we derive upper bounds on the cardinality of tandem duplication and palindromic deletion correcting codes by deriving the generalized sphere packing bound for these error types. We first prove that an upper bound for tandem deletions is also an upper bound for inserting the respective type of duplications. Therefore, we derive the bounds based on these special deletions as this results in tighter bounds. We determine the spheres for tandem and palindromic duplications/deletions and the number of words with a specific sphere size. Our upper bounds on the cardinality directly imply lower bounds on the redundancy which we compare with the redundancy of the best known construction correcting arbitrary burst errors. Our results indicate that the correction of palindromic duplications requires more redundancy than the correction of tandem duplications. Further, there is a significant gap between the minimum redundancy of duplication correcting codes and burst insertion correcting codes.
Submitted 16 January, 2018; v1 submitted 30 June, 2017;
originally announced July 2017.
-
Joint Transmit and Receive Filter Optimization for Sub-Nyquist Delay-Doppler Estimation
Authors:
Andreas Lenz,
Manuel S. Stein,
A. Lee Swindlehurst
Abstract:
In this article, a framework is presented for the joint optimization of the analog transmit and receive filter with respect to a parameter estimation problem. At the receiver, conventional signal processing systems restrict the two-sided bandwidth of the analog pre-filter $B$ to the rate of the analog-to-digital converter $f_s$ to comply with the well-known Nyquist-Shannon sampling theorem. In contrast, here we consider a transceiver that by design violates the common paradigm $B\leq f_s$. To this end, at the receiver, we allow for a higher pre-filter bandwidth $B>f_s$ and study the achievable parameter estimation accuracy under a fixed sampling rate when the transmit and receive filter are jointly optimized with respect to the Bayesian Cramér-Rao lower bound. For the case of delay-Doppler estimation, we propose to approximate the required Fisher information matrix and solve the transceiver design problem by an alternating optimization algorithm. The presented approach allows us to explore the Pareto-optimal region spanned by transmit and receive filters which are favorable under a weighted mean squared error criterion. We also discuss the computational complexity of the obtained transceiver design by visualizing the resulting ambiguity function. Finally, we verify the performance of the optimized designs by Monte-Carlo simulations of a likelihood-based estimator.
Submitted 8 February, 2018; v1 submitted 25 April, 2017;
originally announced April 2017.
-
Analog Transmit Signal Optimization for Undersampled Delay-Doppler Estimation
Authors:
Andreas Lenz,
Manuel S. Stein,
A. Lee Swindlehurst
Abstract:
In this work, the optimization of the analog transmit waveform for joint delay-Doppler estimation under sub-Nyquist conditions is considered. Based on the Bayesian Cramér-Rao lower bound (BCRLB), we derive an estimation-theoretic design rule for the Fourier coefficients of the analog transmit signal when violating the sampling theorem at the receiver through a wide analog pre-filtering bandwidth. For a wireless delay-Doppler channel, we obtain a system optimization problem which can be solved in compact form by using an eigenvalue decomposition. The presented approach enables one to explore the Pareto region spanned by the optimized analog waveforms. Furthermore, we demonstrate how the framework can be used to reduce the sampling rate at the receiver while maintaining high estimation accuracy. Finally, we verify the practical impact by Monte-Carlo simulations of a channel estimation algorithm.
Submitted 20 June, 2017; v1 submitted 20 March, 2017;
originally announced March 2017.