Reconstruction of Sets of Strings from Prefix/Suffix Compositions

Authors: Ryan Gabrys, Srilakshmi Pattabiraman, Olgica Milenkovic

Abstract: The problem of reconstructing strings from substring information has found many applications due to its importance in genomic data sequencing and DNA- and polymer-based data storage. One practically important and challenging paradigm requires reconstructing mixtures of strings based on the union of compositions of their prefixes and suffixes, generated by mass spectrometry devices. We describe new coding methods that allow for unique joint reconstruction of subsets of strings selected from a code and provide upper and lower bounds on the asymptotic rate of the underlying codebooks. Our code constructions combine properties of binary $B_h$ and Dyck strings and can be extended to accommodate missing substrings in the pool. As auxiliary results, we obtain the first known bounds on binary $B_h$ sequences for arbitrary even parameters $h$, and also describe various error models inherent to mass spectrometry analysis. This paper contains a correction of the prior work by the authors, published in [24]. In particular, the bounds on the prefix codes are now corrected.

Submitted 5 October, 2021; originally announced October 2021.

arXiv:2010.11116

Reconstructing Mixtures of Coded Strings from Prefix and Suffix Compositions

Authors: Ryan Gabrys, Srilakshmi Pattabiraman, Olgica Milenkovic

Abstract: The problem of string reconstruction from substring information has found many applications due to its relevance in DNA- and polymer-based data storage. One practically important and challenging paradigm requires reconstructing mixtures of strings based on the union of compositions of their prefixes and suffixes, generated by mass spectrometry readouts. We describe new coding methods that allow for unique joint reconstruction of subsets of strings selected from a code and provide matching upper and lower bounds on the asymptotic rate of the underlying codebooks. Under certain mild constraints on the problem parameters, one can show that the largest possible rate of a codebook that allows for all subcollections of $\leq h$ codestrings to be uniquely reconstructable from the prefix-suffix information equals $1/h$.

Submitted 21 October, 2020; originally announced October 2020.
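
For readers unfamiliar with the setting of the abstract above, the following is a minimal Python sketch of what "the union of compositions of prefixes and suffixes" of a mixture could look like. It assumes binary strings, takes the composition of a string to be its pair (number of 0s, number of 1s), and pools compositions as a multiset; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter

def composition(s):
    """Return (number of 0s, number of 1s) of a binary string."""
    ones = s.count("1")
    return (len(s) - ones, ones)

def prefix_suffix_compositions(strings):
    """Pool (as a multiset) the compositions of all prefixes and suffixes
    of every string in the mixture. Multiset pooling is an assumption."""
    pool = Counter()
    for s in strings:
        for k in range(1, len(s) + 1):
            pool[composition(s[:k])] += 1   # prefix of length k
            pool[composition(s[-k:])] += 1  # suffix of length k
    return pool

# A string and its reversal produce the same pool, so uncoded strings
# cannot in general be told apart from this information alone.
pool_a = prefix_suffix_compositions(["001"])
pool_b = prefix_suffix_compositions(["100"])
print(pool_a == pool_b)
```

The collision between `"001"` and `"100"` illustrates why the strings must be drawn from a restricted codebook for unique reconstruction to be possible.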

arXiv:2003.02121

Coding for Polymer-Based Data Storage

Authors: Srilakshmi Pattabiraman, Ryan Gabrys, Olgica Milenkovic

Abstract: Motivated by polymer-based data-storage platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for both unique string reconstruction and correction of multiple mass errors. We consider two approaches: The first approach pertains to asymmetric errors and it is based on introducing redundancy that scales linearly with the number of errors and logarithmically with the length of the string. The construction allows for the string to be uniquely reconstructed based only on its erroneous substring composition multiset. The key idea behind our unique reconstruction approach is to interleave (shifted) Catalan-Bertrand paths with arbitrary binary strings and "reflect" them so as to force prefixes and suffixes of the same length to have different weights. The asymptotic code rate of the scheme is one, and decoding is accomplished via a simplified version of the backtracking algorithm used for the Turnpike problem. For symmetric errors, we use a polynomial characterization of the mass information and adapt polynomial evaluation code constructions for this setting. In the process, we develop new efficient decoding algorithms for a constant number of composition errors and show that the redundancy of the scheme scales quadratically with the number of errors and logarithmically with the codelength.

Submitted 28 June, 2021; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1904.09280, arXiv:2001.04967

arXiv:2001.04967

Mass Error-Correction Codes for Polymer-Based Data Storage

Authors: Ryan Gabrys, Srilakshmi Pattabiraman, Olgica Milenkovic

Abstract: We consider the problem of correcting mass readout errors in information encoded in binary polymer strings. Our work builds on results for string reconstruction problems using composition multisets [Acharya et al., 2015] and the unique string reconstruction framework proposed in [Pattabiraman et al., 2019]. Binary polymer-based data storage systems [Laure et al., 2016] operate by designing two molecules of significantly different masses to represent the symbols $\{0,1\}$ and perform readouts through noisy tandem mass spectrometry. Tandem mass spectrometers fragment the strings to be read into shorter substrings and only report their masses, often with errors due to imprecise ionization. Modeling the fragmentation process output in terms of composition multisets allows for designing asymptotically optimal codes capable of unique reconstruction and the correction of a single mass error [Pattabiraman et al., 2019] through the use of derivatives of Catalan paths. Nevertheless, no solutions for correcting multiple mass errors are currently known. Our work addresses this issue by describing the first multiple-error correction codes that use the polynomial factorization approach for the Turnpike problem [Skiena et al., 1990] and the related factorization described in [Acharya et al., 2015]. Adding Reed-Solomon type coding redundancy into the corresponding polynomials allows for correcting $t$ mass errors in polynomial time using $t^2\, \log\,k$ redundant bits, where $k$ is the information string length. The redundancy can be improved to $\log\,k + t$. However, no decoding algorithm for this scheme that runs in polynomial time in both $t$ and $n$ is currently known, where $n$ is the length of the coded string.

Submitted 14 January, 2020; originally announced January 2020.

arXiv:2001.04577

Group Testing with Runlength Constraints for Topological Molecular Storage

Authors: Abhishek Agarwal, Olgica Milenkovic, Srilakshmi Pattabiraman, João Ribeiro

Abstract: Motivated by applications in topological DNA-based data storage, we introduce and study a novel setting of Non-Adaptive Group Testing (NAGT) with runlength constraints on the columns of the test matrix, in the sense that any two 1's must be separated by a run of at least d 0's. We describe and analyze a probabilistic construction of a runlength-constrained scheme in the zero-error and vanishing error settings, and show that the number of tests required by this construction is optimal up to logarithmic factors in the runlength constraint d and the number of defectives k in both cases. Surprisingly, our results show that runlength-constrained NAGT is not more demanding than unconstrained NAGT when d=O(k), and that for almost all choices of d and k it is not more demanding than NAGT with a column Hamming weight constraint only. Towards obtaining runlength-constrained Quantitative NAGT (QNAGT) schemes with good parameters, we also provide lower bounds for this setting and a nearly optimal probabilistic construction of a QNAGT scheme with a column Hamming weight constraint.

Submitted 13 January, 2020; originally announced January 2020.

arXiv:1904.09280

Reconstruction and Error-Correction Codes for Polymer-Based Data Storage

Authors: Srilakshmi Pattabiraman, Ryan Gabrys, Olgica Milenkovic

Abstract: Motivated by polymer-based data-storage platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for unique string reconstruction and correction of one mass error. Our approach is based on introducing redundancy that scales logarithmically with the length of the string and allows for the string to be uniquely reconstructed based only on its erroneous substring composition multiset. The key idea behind our unique reconstruction approach is to interleave Catalan-type paths with arbitrary binary strings and "reflect" them so as to allow prefixes and suffixes of the same length to have different weights. For error correction, we add a constant number of bits that provide information about the weights of reflected pairs of bits and hence enable recovery from a single mass error. The asymptotic code rate of the scheme is one, and decoding is accomplished via a simplified version of the backtracking algorithm used for the Turnpike problem.

Submitted 19 April, 2019; originally announced April 2019.
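
The "substring composition multiset" mentioned in the abstract above can be sketched in Python as follows. This is a minimal illustration assuming binary strings and compositions written as (number of 0s, number of 1s); the function name is hypothetical, and the sketch covers only the error-free multiset, not the codes or decoder from the paper.

```python
from collections import Counter

def composition_multiset(s):
    """Multiset of compositions of all contiguous substrings of s."""
    counts = Counter()
    n = len(s)
    for i in range(n):
        ones = 0
        for j in range(i, n):
            ones += s[j] == "1"          # running count of 1s in s[i..j]
            length = j - i + 1
            counts[(length - ones, ones)] += 1
    return counts

# A string and its reversal always share the same composition multiset,
# which is one reason extra structure is needed for unique reconstruction.
s = "0010"
print(composition_multiset(s) == composition_multiset(s[::-1]))
```

The comparison holds because reversing a string maps each substring to a substring with the same numbers of 0s and 1s.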

Showing 1–6 of 6 results for author: Pattabiraman, S