-
Correcting a Single Deletion in Reads from a Nanopore Sequencer
Authors:
Anisha Banerjee,
Yonatan Yehezkeally,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
Owing to their several merits over other DNA sequencing technologies, nanopore sequencers hold immense potential to revolutionize the efficiency of DNA storage systems. However, their higher error rates necessitate further research to devise practical and efficient coding schemes that would allow accurate retrieval of the data stored. Our work takes a step in this direction by adopting a simplified model of the nanopore sequencer inspired by Mao \emph{et al.}, which incorporates some of its physical aspects. This channel model can be viewed as a sliding window of length $\ell$ that passes over the incoming input sequence and produces the Hamming weight of the enclosed $\ell$ bits, while shifting by one position at each time step. The resulting $(\ell+1)$-ary vector, referred to as the $\ell$-\emph{read vector}, is susceptible to deletion errors due to imperfections inherent in the sequencing process. We establish that at least $\log n - \ell$ bits of redundancy are needed to correct a single deletion. An error-correcting code that is optimal up to an additive constant is also proposed. Furthermore, we find that for $\ell \geq 2$, reconstruction from two distinct noisy $\ell$-read vectors can be accomplished without any redundancy, and provide a suitable reconstruction algorithm to this effect.
Submitted 7 May, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
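To make the channel model in the abstract above concrete, here is a minimal Python sketch of an $\ell$-read vector and of a single deletion applied to it. The names `read_vector` and `delete_one` are hypothetical, and the sketch only uses windows that lie fully inside the input; the abstract does not say how the boundaries are handled, so that choice is an assumption rather than the authors' definition.

```python
def read_vector(bits, ell):
    """Hamming weights of all length-ell windows, shifting by one position.

    Assumption: only windows fully contained in `bits` are reported.
    """
    if ell < 1 or ell > len(bits):
        raise ValueError("ell must satisfy 1 <= ell <= len(bits)")
    weight = sum(bits[:ell])          # weight of the first window
    out = [weight]
    for i in range(ell, len(bits)):
        # Slide by one: the bit at i enters, the bit at i - ell leaves.
        weight += bits[i] - bits[i - ell]
        out.append(weight)
    return out


def delete_one(vec, pos):
    """Return a copy of `vec` with the entry at index `pos` deleted."""
    if not 0 <= pos < len(vec):
        raise IndexError("pos out of range")
    return vec[:pos] + vec[pos + 1:]


clean = read_vector([1, 0, 1, 1, 0], 2)
noisy = delete_one(clean, 2)
```

Each entry lies in $\{0, 1, \ldots, \ell\}$, which is why the read vector is $(\ell+1)$-ary, and consecutive entries differ by at most one.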
-
Error-Correcting Codes for Nanopore Sequencing
Authors:
Anisha Banerjee,
Yonatan Yehezkeally,
Antonia Wachter-Zeh,
Eitan Yaakobi
Abstract:
Nanopore sequencing, superior to other sequencing technologies for DNA storage in multiple aspects, has recently attracted considerable attention. Its high error rates, however, demand thorough research on practical and efficient coding schemes to enable accurate recovery of stored data. To this end, we consider a simplified model of a nanopore sequencer inspired by Mao \emph{et al.}, incorporating intersymbol interference and measurement noise. Essentially, our channel model passes a sliding window of length \(\ell\) over a \(q\)-ary input sequence that outputs the \textit{composition} of the enclosed \(\ell\) symbols and shifts by \(\delta\) positions with each time step. In this context, the composition of a \(q\)-ary vector \(\mathbf{x}\) specifies the number of occurrences in \(\mathbf{x}\) of each symbol in \(\{0,1,\ldots,q-1\}\). The resulting compositions vector, termed the \emph{read vector}, may also be corrupted by \(t\) substitution errors. By employing graph-theoretic techniques, we deduce that for \(\delta=1\), at least \(\log \log n\) symbols of redundancy are required to correct a single (\(t=1\)) substitution. Finally, for \(\ell \geq 3\), we exploit some inherent characteristics of read vectors to arrive at an error-correcting code that is of optimal redundancy up to a (small) additive constant for this setting. This construction is also found to be optimal for the case of reconstruction from two noisy read vectors.
Submitted 8 December, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
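The composition-based read vector described in the abstract above can be sketched in Python as follows. The function names `composition` and `composition_read_vector` are hypothetical, and restricting to windows that fit entirely inside the input is an assumption, since the abstract does not specify boundary handling.

```python
from collections import Counter


def composition(window, q):
    """Number of occurrences of each symbol 0..q-1 in `window`."""
    counts = Counter(window)
    return tuple(counts.get(symbol, 0) for symbol in range(q))


def composition_read_vector(x, ell, q, delta=1):
    """Compositions of length-ell windows of `x`, shifting by `delta`.

    Assumption: only windows fully contained in `x` are reported.
    """
    if ell < 1 or delta < 1:
        raise ValueError("ell and delta must be positive")
    return [composition(x[i:i + ell], q)
            for i in range(0, len(x) - ell + 1, delta)]


reads = composition_read_vector([0, 1, 2, 1, 0], ell=3, q=3, delta=1)
```

For $q = 2$ the composition of a window is determined by its Hamming weight, so this reduces to the binary weight-based read vector.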
-
Bounds on Mixed Codes with Finite Alphabets
Authors:
Yonatan Yehezkeally,
Haider Al Kim,
Sven Puchinger,
Antonia Wachter-Zeh
Abstract:
Mixed codes, which are error-correcting codes in the Cartesian product of different-sized spaces, model degrading storage systems well. While such codes have previously been studied for their algebraic properties (e.g., existence of perfect codes) or in the case of unbounded alphabet sizes, we focus on the case of finite alphabets, and generalize the Gilbert-Varshamov, sphere-packing, Elias-Bassalygo, and first linear programming bounds to that setting. In the latter case, our proof is also the first for the non-symmetric mono-alphabetic $q$-ary case using Navon and Samorodnitsky's Fourier-analytic approach.
Submitted 19 December, 2022;
originally announced December 2022.
-
Generalized Unique Reconstruction from Substrings
Authors:
Yonatan Yehezkeally,
Daniella Bar-Lev,
Sagi Marcovich,
Eitan Yaakobi
Abstract:
This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which all substrings of pre-defined lengths are read or substrings are read with no overlap for the single string case, this work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. First, an upper bound is provided on the attainable rates of codes that guarantee unique reconstruction. Then, efficient constructions of codes that asymptotically meet that upper bound are presented. In the second extension, we study the setup where multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the read substrings' length $\ell$ that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of $\ell$ that asymptotically behave like the lower bound.
Submitted 20 April, 2023; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Reconstruction from Substrings with Partial Overlap
Authors:
Yonatan Yehezkeally,
Daniella Bar-Lev,
Sagi Marcovich,
Eitan Yaakobi
Abstract:
This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which \emph{all} substrings of some fixed length are read or substrings are read with no overlap, this work considers the setup in which consecutive substrings are read with some given minimum overlap. First, upper bounds are provided on the attainable rates of codes that guarantee unique reconstruction. Then, we present efficient constructions of asymptotically optimal codes that meet the upper bound.
Submitted 8 May, 2022;
originally announced May 2022.
-
Adversarial Torn-paper Codes
Authors:
Daniella Bar-Lev,
Sagi Marcovich,
Eitan Yaakobi,
Yonatan Yehezkeally
Abstract:
We study the adversarial torn-paper channel. This problem is motivated by applications in DNA data storage where the DNA strands that carry information may break into smaller pieces which are received out of order. Our model extends the previously researched probabilistic setting to the worst case. We develop code constructions for any parameters of the channel for which non-vanishing asymptotic rate is possible and show our constructions achieve asymptotically optimal rate while allowing for efficient encoding and decoding. Finally, we extend our results to related settings, including multi-strand storage, presence of substitution errors, or incomplete coverage.
Submitted 4 July, 2023; v1 submitted 26 January, 2022;
originally announced January 2022.
-
Multi-strand Reconstruction from Substrings
Authors:
Yonatan Yehezkeally,
Sagi Marcovich,
Eitan Yaakobi
Abstract:
The problem of string reconstruction based on its substrings spectrum has received significant attention recently due to its applicability to DNA data storage and sequencing. In contrast to previous works, we consider in this paper a setup of this problem where multiple strings are reconstructed together. Given a multiset $S$ of strings, all their substrings of some fixed length $\ell$, defined as the $\ell$-profile of $S$, are received and the goal is to reconstruct all strings in $S$. A multi-strand $\ell$-reconstruction code is a set of multisets such that every element $S$ can be reconstructed from its $\ell$-profile. Given the number of strings~$k$ and their length~$n$, we first find a lower bound on the value of $\ell$ necessary for existence of multi-strand $\ell$-reconstruction codes with non-vanishing asymptotic rate. We then present two constructions of such codes and show that their rates approach~$1$ for values of $\ell$ that asymptotically behave like the lower bound.
Submitted 26 August, 2021;
originally announced August 2021.
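The $\ell$-profile defined in the abstract above can be illustrated with a short Python sketch. The function name `ell_profile` is hypothetical, and the profile is modeled here as a multiset of substrings via `collections.Counter`, which is an assumption about representation. The example multisets are chosen only to show why small $\ell$ can make reconstruction ambiguous.

```python
from collections import Counter


def ell_profile(strings, ell):
    """Multiset of all length-ell substrings of every string in `strings`."""
    if ell < 1:
        raise ValueError("ell must be positive")
    profile = Counter()
    for s in strings:
        for i in range(len(s) - ell + 1):
            profile[s[i:i + ell]] += 1
    return profile


# Two different multisets of strings that share the same 2-profile,
# so neither can be uniquely reconstructed from it.
ambiguous = ell_profile(["aba", "bab"], 2) == ell_profile(["aba", "aba"], 2)

# With ell = 3 the profiles differ, so the two multisets are distinguishable.
resolved = ell_profile(["aba", "bab"], 3) != ell_profile(["aba", "aba"], 3)
```

A multi-strand $\ell$-reconstruction code, in these terms, is a collection of multisets on which `ell_profile` is injective.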
-
On Codes for the Noisy Substring Channel
Authors:
Yonatan Yehezkeally,
Nikita Polyanskii
Abstract:
We consider the problem of coding for the substring channel, in which information strings are observed only through their (multisets of) substrings. Due to existing DNA sequencing techniques and applications in DNA-based storage systems, interest in this channel has renewed in recent years. In contrast to existing literature, we consider a noisy channel model where information is subject to noise before its substrings are sampled, motivated by in-vivo storage. We study two separate noise models, substitutions or deletions. In both cases, we examine families of codes which may be utilized for error-correction and present combinatorial bounds on their sizes. Through a generalization of the concept of repeat-free strings, we show that the added required redundancy due to this imperfect observation assumption is sublinear, either when the fraction of errors in the observed substring length is sufficiently small, or when that length is sufficiently long. This suggests that no asymptotic cost in rate is incurred by this channel model in these cases. Moreover, we develop an efficient encoder for such constrained strings in some cases. Finally, we show how a similar encoder can be used to avoid formation of secondary-structures in coded DNA strands, even when accounting for imperfect structures.
Submitted 26 March, 2024; v1 submitted 2 February, 2021;
originally announced February 2021.
-
Uncertainty of Reconstruction with List-Decoding from Uniform-Tandem-Duplication Noise
Authors:
Yonatan Yehezkeally,
Moshe Schwartz
Abstract:
We propose a list-decoding scheme for reconstruction codes in the context of uniform-tandem-duplication noise, which can be viewed as an application of the associative memory model to this setting. We find the uncertainty associated with $m>2$ strings (where a previous paper considered $m=2$) in asymptotic terms, where code-words are taken from an error-correcting code. Thus, we find the trade-off between the design minimum distance, the number of errors, the acceptable list size and the resulting uncertainty, which corresponds to the required number of distinct retrieved outputs for successful reconstruction. It is therefore seen that by accepting list-decoding one may decrease coding redundancy, or the required number of reads, or both.
Submitted 18 February, 2021; v1 submitted 20 January, 2020;
originally announced January 2020.
-
Single-Error Detection and Correction for Duplication and Substitution Channels
Authors:
Yuanyuan Tang,
Yonatan Yehezkeally,
Moshe Schwartz,
Farzad Farnoud
Abstract:
Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitutions is unrestricted. Both error-detecting and error-correcting codes are constructed, which can handle correctly any number of tandem duplications of a fixed length $k$, and at most a single substitution occurring at any time during the mutation process.
Submitted 28 June, 2020; v1 submitted 13 November, 2019;
originally announced November 2019.
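The two mutation types in the abstract above, tandem duplication of a fixed length $k$ and symbol substitution, can be sketched in Python as below. The function names `tandem_duplicate` and `substitute` and the index conventions are hypothetical, chosen for illustration; they are not taken from the paper.

```python
def tandem_duplicate(s, i, k):
    """Copy the length-k substring starting at index i and insert the
    copy immediately after the original occurrence."""
    if k < 1 or not 0 <= i <= len(s) - k:
        raise ValueError("substring s[i:i+k] must lie inside s")
    return s[:i + k] + s[i:i + k] + s[i + k:]


def substitute(s, pos, symbol):
    """Replace the symbol at index `pos` with `symbol`."""
    if not 0 <= pos < len(s):
        raise IndexError("pos out of range")
    return s[:pos] + symbol + s[pos + 1:]


# One duplication of length k = 2, followed by a substitution that
# happens to fall inside the duplicated copy.
mutated = substitute(tandem_duplicate("ACGT", 1, 2), 3, "T")
```

In the restricted model of the abstract the substitution must land inside a duplicated copy, as it does here; in the unrestricted model `pos` may be any index.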
-
Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors
Authors:
Yonatan Yehezkeally,
Moshe Schwartz
Abstract:
DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters.
Submitted 5 September, 2019; v1 submitted 18 January, 2018;
originally announced January 2018.
-
Limited-Magnitude Error-Correcting Gray Codes for Rank Modulation
Authors:
Yonatan Yehezkeally,
Moshe Schwartz
Abstract:
We construct Gray codes over permutations for the rank-modulation scheme, which are also capable of correcting errors under the infinity-metric. These errors model limited-magnitude or spike errors, for which only single-error-detecting Gray codes are currently known. Surprisingly, the error-correcting codes we construct achieve a better asymptotic rate than that of presently known constructions not having the Gray property, and exceed the Gilbert-Varshamov bound. Additionally, we present efficient ranking and unranking procedures, as well as a decoding procedure that runs in linear time. Finally, we also apply our methods to solve an outstanding issue with error-detecting rank-modulation Gray codes (snake-in-the-box codes) under a different metric, the Kendall $τ$-metric, in the group of permutations over an even number of elements $S_{2n}$, where we provide asymptotically optimal codes.
Submitted 19 June, 2016; v1 submitted 20 January, 2016;
originally announced January 2016.
-
Snake-in-the-Box Codes for Rank Modulation
Authors:
Yonatan Yehezkeally,
Moshe Schwartz
Abstract:
Motivated by the rank-modulation scheme with applications to flash memory, we consider Gray codes capable of detecting a single error, also known as snake-in-the-box codes. We study two error metrics: Kendall's $τ$-metric, which applies to charge-constrained errors, and the $\ell_\infty$-metric, which is useful in the case of limited magnitude errors. In both cases we construct snake-in-the-box codes with rate asymptotically tending to 1. We also provide efficient successor-calculation functions, as well as ranking and unranking functions. Finally, we also study bounds on the parameters of such codes.
Submitted 18 July, 2011;
originally announced July 2011.