Correcting a Single Deletion in Reads from a Nanopore Sequencer

Authors: Anisha Banerjee, Yonatan Yehezkeally, Antonia Wachter-Zeh, Eitan Yaakobi

Abstract: Owing to their several merits over other DNA sequencing technologies, nanopore sequencers hold immense potential to revolutionize the efficiency of DNA storage systems. However, their higher error rates necessitate further research to devise practical and efficient coding schemes that would allow accurate retrieval of the data stored. Our work takes a step in this direction by adopting a simplified model of the nanopore sequencer inspired by Mao \emph{et al.}, which incorporates some of its physical aspects. This channel model can be viewed as a sliding window of length $\ell$ that passes over the incoming input sequence and produces the Hamming weight of the enclosed $\ell$ bits, while shifting by one position at each time step. The resulting $(\ell+1)$-ary vector, referred to as the $\ell$-\emph{read vector}, is susceptible to deletion errors due to imperfections inherent in the sequencing process. We establish that at least $\log n - \ell$ bits of redundancy are needed to correct a single deletion. An error-correcting code that is optimal up to an additive constant is also proposed. Furthermore, we find that for $\ell \geq 2$, reconstruction from two distinct noisy $\ell$-read vectors can be accomplished without any redundancy, and provide a suitable reconstruction algorithm to this effect.
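
The sliding-window channel described in this abstract can be illustrated with a minimal sketch. The function names `read_vector` and `delete_one` are hypothetical, and the sketch emits only windows that lie fully inside the input; how the paper treats windows overlapping the ends of the sequence is not stated in the abstract, so that boundary handling is an assumption here.

```python
def read_vector(bits, ell):
    """Hamming weights of every full length-ell window, shifting by one.

    Assumption: only windows fully inside `bits` are emitted; the paper
    may define the read vector with different boundary handling.
    """
    if ell < 1 or ell > len(bits):
        raise ValueError("ell must satisfy 1 <= ell <= len(bits)")
    weight = sum(bits[:ell])          # weight of the first window
    out = [weight]
    for i in range(ell, len(bits)):
        # Slide by one: add the entering bit, drop the leaving bit.
        weight += bits[i] - bits[i - ell]
        out.append(weight)
    return out


def delete_one(vec, pos):
    """Model a single deletion: drop the entry at index `pos`."""
    return vec[:pos] + vec[pos + 1:]


x = [1, 0, 1, 1, 0]
r = read_vector(x, 2)      # each entry lies in {0, 1, 2}, i.e. (ell+1)-ary
noisy = delete_one(r, 2)   # one read-vector entry is lost
```

Every entry is an integer between 0 and $\ell$, which is why the abstract calls the output an $(\ell+1)$-ary vector.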

Submitted 7 May, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted at IEEE ISIT'24

arXiv:2305.10214 [pdf, ps, other]

doi 10.1109/TIT.2024.3380615

Error-Correcting Codes for Nanopore Sequencing

Authors: Anisha Banerjee, Yonatan Yehezkeally, Antonia Wachter-Zeh, Eitan Yaakobi

Abstract: Nanopore sequencing, superior to other sequencing technologies for DNA storage in multiple aspects, has recently attracted considerable attention. Its high error rates, however, demand thorough research on practical and efficient coding schemes to enable accurate recovery of stored data. To this end, we consider a simplified model of a nanopore sequencer inspired by Mao \emph{et al.}, incorporating intersymbol interference and measurement noise. Essentially, our channel model passes a sliding window of length $\ell$ over a $q$-ary input sequence that outputs the \textit{composition} of the enclosed $\ell$ symbols and shifts by $δ$ positions with each time step. In this context, the composition of a $q$-ary vector $\mathbf{x}$ specifies the number of occurrences in $\mathbf{x}$ of each symbol in $\lbrace 0,1,\ldots, q-1\rbrace$. The resulting compositions vector, termed the \emph{read vector}, may also be corrupted by $t$ substitution errors. By employing graph-theoretic techniques, we deduce that for $δ=1$, at least $\log \log n$ symbols of redundancy are required to correct a single ($t=1$) substitution. Finally, for $\ell \geq 3$, we exploit some inherent characteristics of read vectors to arrive at an error-correcting code that is of optimal redundancy up to a (small) additive constant for this setting. This construction is also found to be optimal for the case of reconstruction from two noisy read vectors.
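
As a rough illustration of the composition-based channel in this abstract, the sketch below computes the composition of each window over a $q$-ary sequence with window length $\ell$ and shift $δ$. The name `composition_read_vector` is hypothetical, and, as an assumption, only windows that fit entirely inside the sequence are produced; the abstract does not specify boundary handling.

```python
def composition_read_vector(seq, q, ell, delta=1):
    """Compositions of length-ell windows over a q-ary sequence.

    Each composition is a tuple whose a-th entry counts occurrences of
    symbol a in the window. Assumption: only full windows are emitted.
    """
    if any(s < 0 or s >= q for s in seq):
        raise ValueError("symbols must lie in {0, ..., q-1}")
    out = []
    for start in range(0, len(seq) - ell + 1, delta):
        window = seq[start:start + ell]
        out.append(tuple(window.count(a) for a in range(q)))
    return out


x = [0, 1, 2, 1]
reads = composition_read_vector(x, q=3, ell=2, delta=1)
```

For $q = 2$ the composition of a window is determined by its Hamming weight, so this reduces to a weight-based read vector.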

Submitted 8 December, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: Submitted to Transactions on Information Theory

arXiv:2212.09314 [pdf, ps, other]

Bounds on Mixed Codes with Finite Alphabets

Authors: Yonatan Yehezkeally, Haider Al Kim, Sven Puchinger, Antonia Wachter-Zeh

Abstract: Mixed codes, which are error-correcting codes in the Cartesian product of different-sized spaces, model degrading storage systems well. While such codes have previously been studied for their algebraic properties (e.g., existence of perfect codes) or in the case of unbounded alphabet sizes, we focus on the case of finite alphabets, and generalize the Gilbert-Varshamov, sphere-packing, Elias-Bassalygo, and first linear programming bounds to that setting. In the latter case, our proof is also the first for the non-symmetric mono-alphabetic $q$-ary case using Navon and Samorodnitsky's Fourier-analytic approach.

Submitted 19 December, 2022; originally announced December 2022.

arXiv:2210.04471 [pdf, ps, other]

doi 10.1109/TIT.2023.3269124

Generalized Unique Reconstruction from Substrings

Authors: Yonatan Yehezkeally, Daniella Bar-Lev, Sagi Marcovich, Eitan Yaakobi

Abstract: This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which all substrings of pre-defined lengths are read or substrings are read with no overlap for the single string case, this work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. First, an upper bound is provided on the attainable rates of codes that guarantee unique reconstruction. Then, efficient constructions of codes that asymptotically meet that upper bound are presented. In the second extension, we study the setup where multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the read substrings' length $\ell$ that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of $\ell$ that asymptotically behave like the lower bound.

Submitted 20 April, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

Comments: Author-submitted, peer-reviewed and accepted version (IEEE Trans. on Inform. Theory). arXiv admin note: text overlap with arXiv:2205.03933

arXiv:2205.03933 [pdf, ps, other]

Reconstruction from Substrings with Partial Overlap

Authors: Yonatan Yehezkeally, Daniella Bar-Lev, Sagi Marcovich, Eitan Yaakobi

Abstract: This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which \emph{all} substrings of some fixed length are read or substrings are read with no overlap, this work considers the setup in which consecutive substrings are read with some given minimum overlap. First, upper bounds are provided on the attainable rates of codes that guarantee unique reconstruction. Then, we present efficient constructions of asymptotically optimal codes that meet the upper bound.

Submitted 8 May, 2022; originally announced May 2022.

Comments: 6 pages, 2 figures; conference submission

arXiv:2201.11150 [pdf, ps, other]

doi 10.1109/TIT.2023.3292895

Adversarial Torn-paper Codes

Authors: Daniella Bar-Lev, Sagi Marcovich, Eitan Yaakobi, Yonatan Yehezkeally

Abstract: We study the adversarial torn-paper channel. This problem is motivated by applications in DNA data storage where the DNA strands that carry information may break into smaller pieces which are received out of order. Our model extends the previously researched probabilistic setting to the worst-case. We develop code constructions for any parameters of the channel for which non-vanishing asymptotic rate is possible and show our constructions achieve asymptotically optimal rate while allowing for efficient encoding and decoding. Finally, we extend our results to related settings including multi-strand storage, presence of substitution errors, or incomplete coverage.

Submitted 4 July, 2023; v1 submitted 26 January, 2022; originally announced January 2022.

Comments: Author submitted, peer-reviewed version

arXiv:2108.11725 [pdf, ps, other]

doi 10.1109/ITW48936.2021.9611486

Multi-strand Reconstruction from Substrings

Authors: Yonatan Yehezkeally, Sagi Marcovich, Eitan Yaakobi

Abstract: The problem of string reconstruction based on its substrings spectrum has received significant attention recently due to its applicability to DNA data storage and sequencing. In contrast to previous works, we consider in this paper a setup of this problem where multiple strings are reconstructed together. Given a multiset $S$ of strings, all their substrings of some fixed length $\ell$, defined as the $\ell$-profile of $S$, are received and the goal is to reconstruct all strings in $S$. A multi-strand $\ell$-reconstruction code is a set of multisets such that every element $S$ can be reconstructed from its $\ell$-profile. Given the number of strings $k$ and their length $n$, we first find a lower bound on the value of $\ell$ necessary for existence of multi-strand $\ell$-reconstruction codes with non-vanishing asymptotic rate. We then present two constructions of such codes and show that their rates approach $1$ for values of $\ell$ that asymptotically behave like the lower bound.
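
The $\ell$-profile defined in this abstract can be sketched as a multiset of substrings. The function name `l_profile` is hypothetical, and representing the multiset with `collections.Counter` is an implementation choice made for this illustration, not something stated in the source.

```python
from collections import Counter


def l_profile(strings, ell):
    """Multiset of all length-ell substrings of all strings in `strings`.

    `strings` is itself treated as a multiset (repeated strings count
    repeatedly); the result maps each substring to its multiplicity.
    """
    profile = Counter()
    for s in strings:
        for i in range(len(s) - ell + 1):
            profile[s[i:i + ell]] += 1
    return profile


S = ["ACGT", "CGTA"]
profile = l_profile(S, 2)
```

Two different multisets of strings can share the same profile, which is why a reconstruction code restricts the allowed multisets so that each is uniquely determined by its $\ell$-profile.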

Submitted 26 August, 2021; originally announced August 2021.

Comments: 5 pages + 1 reference page. Version accepted for presentation at ITW2021

arXiv:2102.01412 [pdf, ps, other]

doi 10.1109/TMBMC.2024.3382499

On Codes for the Noisy Substring Channel

Authors: Yonatan Yehezkeally, Nikita Polyanskii

Abstract: We consider the problem of coding for the substring channel, in which information strings are observed only through their (multisets of) substrings. Due to existing DNA sequencing techniques and applications in DNA-based storage systems, interest in this channel has renewed in recent years. In contrast to existing literature, we consider a noisy channel model where information is subject to noise before its substrings are sampled, motivated by in-vivo storage. We study two separate noise models, substitutions or deletions. In both cases, we examine families of codes which may be utilized for error-correction and present combinatorial bounds on their sizes. Through a generalization of the concept of repeat-free strings, we show that the added required redundancy due to this imperfect observation assumption is sublinear, either when the fraction of errors in the observed substring length is sufficiently small, or when that length is sufficiently long. This suggests that no asymptotic cost in rate is incurred by this channel model in these cases. Moreover, we develop an efficient encoder for such constrained strings in some cases. Finally, we show how a similar encoder can be used to avoid formation of secondary-structures in coded DNA strands, even when accounting for imperfect structures.

Submitted 26 March, 2024; v1 submitted 2 February, 2021; originally announced February 2021.

Comments: Author submitted, peer-reviewed, version

arXiv:2001.07047 [pdf, ps, other]

doi 10.1109/TIT.2021.3070466

Uncertainty of Reconstruction with List-Decoding from Uniform-Tandem-Duplication Noise

Authors: Yonatan Yehezkeally, Moshe Schwartz

Abstract: We propose a list-decoding scheme for reconstruction codes in the context of uniform-tandem-duplication noise, which can be viewed as an application of the associative memory model to this setting. We find the uncertainty associated with $m>2$ strings (where a previous paper considered $m=2$) in asymptotic terms, where code-words are taken from an error-correcting code. Thus, we find the trade-off between the design minimum distance, the number of errors, the acceptable list size and the resulting uncertainty, which corresponds to the required number of distinct retrieved outputs for successful reconstruction. It is therefore seen that by accepting list-decoding one may decrease coding redundancy, or the required number of reads, or both.

Submitted 18 February, 2021; v1 submitted 20 January, 2020; originally announced January 2020.

Comments: 13 pages, no figures. Accepted version

arXiv:1911.05413 [pdf, ps, other]

doi 10.1109/TIT.2020.3006228

Single-Error Detection and Correction for Duplication and Substitution Channels

Authors: Yuanyuan Tang, Yonatan Yehezkeally, Moshe Schwartz, Farzad Farnoud

Abstract: Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitutions is unrestricted. Both error-detecting and error-correcting codes are constructed, which can handle correctly any number of tandem duplications of a fixed length $k$, and at most a single substitution occurring at any time during the mutation process.
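
For readers unfamiliar with the noise model, a tandem duplication of length $k$ copies a length-$k$ substring and inserts the copy immediately after the original. The sketch below illustrates one such duplication followed by one substitution; the names `tandem_duplicate` and `substitute` are hypothetical, and the sketch only simulates the channel, it does not implement the codes constructed in the paper.

```python
def tandem_duplicate(s, i, k):
    """Insert a copy of s[i:i+k] immediately after that substring."""
    if i < 0 or k < 1 or i + k > len(s):
        raise ValueError("substring s[i:i+k] must lie inside s")
    return s[:i + k] + s[i:i + k] + s[i + k:]


def substitute(s, pos, symbol):
    """Replace the symbol at index `pos` with `symbol`."""
    return s[:pos] + symbol + s[pos + 1:]


stored = "ACGT"
mutated = tandem_duplicate(stored, 1, 2)   # duplicates "CG"
mutated = substitute(mutated, 0, "T")      # a single substitution
```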

Submitted 28 June, 2020; v1 submitted 13 November, 2019; originally announced November 2019.

Comments: Author-submitted, peer-reviewed, version

arXiv:1801.06022 [pdf, ps, other]

doi 10.1109/TIT.2019.2940256

Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

Authors: Yonatan Yehezkeally, Moshe Schwartz

Abstract: DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters.

Submitted 5 September, 2019; v1 submitted 18 January, 2018; originally announced January 2018.

Comments: 11 pages, 2 figures, Latex; version accepted for publication

arXiv:1601.05218 [pdf, ps, other]

doi 10.1109/TIT.2017.2719710

Limited-Magnitude Error-Correcting Gray Codes for Rank Modulation

Authors: Yonatan Yehezkeally, Moshe Schwartz

Abstract: We construct Gray codes over permutations for the rank-modulation scheme, which are also capable of correcting errors under the infinity-metric. These errors model limited-magnitude or spike errors, for which only single-error-detecting Gray codes are currently known. Surprisingly, the error-correcting codes we construct achieve a better asymptotic rate than that of presently known constructions not having the Gray property, and exceed the Gilbert-Varshamov bound. Additionally, we present efficient ranking and unranking procedures, as well as a decoding procedure that runs in linear time. Finally, we also apply our methods to solve an outstanding issue with error-detecting rank-modulation Gray codes (snake-in-the-box codes) under a different metric, the Kendall $τ$-metric, in the group of permutations over an even number of elements $S_{2n}$, where we provide asymptotically optimal codes.

Submitted 19 June, 2016; v1 submitted 20 January, 2016; originally announced January 2016.

Comments: Revised version for journal submission. Additional results include tighter auxiliary constructions, a decoding scheme, ranking/unranking procedures, and application to snake-in-the-box codes under the Kendall tau-metric

arXiv:1107.3372 [pdf, ps, other]

doi 10.1109/TIT.2012.2196755

Snake-in-the-Box Codes for Rank Modulation

Authors: Yonatan Yehezkeally, Moshe Schwartz

Abstract: Motivated by the rank-modulation scheme with applications to flash memory, we consider Gray codes capable of detecting a single error, also known as snake-in-the-box codes. We study two error metrics: Kendall's $τ$-metric, which applies to charge-constrained errors, and the $\ell_\infty$-metric, which is useful in the case of limited magnitude errors. In both cases we construct snake-in-the-box codes with rate asymptotically tending to 1. We also provide efficient successor-calculation functions, as well as ranking and unranking functions. Finally, we also study bounds on the parameters of such codes.

Submitted 18 July, 2011; originally announced July 2011.

Showing 1–13 of 13 results for author: Yehezkeally, Y