Search | arXiv e-print repository

HoneyGAN Pots: A Deep Learning Approach for Generating Honeypots

Authors: Ryan Gabrys, Daniel Silva, Mark Bilinski

Abstract: This paper investigates the feasibility and effectiveness of employing Generative Adversarial Networks (GANs) for the generation of decoy configurations in the field of cyber defense. The utilization of honeypots has been extensively studied in the past; however, selecting appropriate decoy configurations for a given cyber scenario (and subsequently retrieving/generating them) remain open challenges. Existing approaches often rely on maintaining lists of configurations or storing collections of pre-configured images, lacking adaptability and efficiency. In this pioneering study, we present a novel approach that leverages GANs' learning capabilities to tackle these challenges. To the best of our knowledge, no prior attempts have been made to utilize GANs specifically for generating decoy configurations. Our research aims to address this gap and provide cyber defenders with a powerful tool to bolster their network defenses.

Submitted 9 July, 2024; originally announced July 2024.

Comments: Presented at the 2nd International Workshop on Adaptive Cyber Defense, 2023 (arXiv:2308.09520)

Report number: ACD/2023/112

arXiv:2406.17689 [pdf, ps, other]

Robust Gray Codes Approaching the Optimal Rate

Authors: Roni Con, Dorsa Fathollahi, Ryan Gabrys, Mary Wootters, Eitan Yaakobi

Abstract: Robust Gray codes were introduced by Lolck and Pagh (SODA 2024). Informally, a robust Gray code is a (binary) Gray code $\mathcal{G}$ so that, given a noisy version of the encoding $\mathcal{G}(j)$ of an integer $j$, one can recover $\hat{j}$ that is close to $j$ (with high probability over the noise). Such codes have found applications in differential privacy. In this work, we present near-optimal constructions of robust Gray codes. In more detail, we construct a Gray code $\mathcal{G}$ of rate $1 - H_2(p) - \varepsilon$ that is efficiently encodable, and that is robust in the following sense. Suppose that $\mathcal{G}(j)$ is passed through the binary symmetric channel $\text{BSC}_p$ with cross-over probability $p$, to obtain $x$. We present an efficient decoding algorithm that, given $x$, returns an estimate $\hat{j}$ so that $|j - \hat{j}|$ is small with high probability.

Submitted 25 June, 2024; originally announced June 2024.
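
For readers unfamiliar with the term used in the abstract above, here is a minimal sketch of the classical binary reflected Gray code. This is not the paper's construction. The function names are illustrative assumptions. It shows the defining property (consecutive integers map to codewords that differ in one bit) and why robustness is a separate requirement: one flipped bit can move the decoded integer far from the original.

```python
def gray_encode(j: int) -> int:
    """Binary reflected Gray code of a non-negative integer j."""
    return j ^ (j >> 1)


def gray_decode(g: int) -> int:
    """Invert gray_encode by XOR-ing together all right shifts of g."""
    j = 0
    while g:
        j ^= g
        g >>= 1
    return j


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Consecutive integers receive codewords at Hamming distance exactly 1.
adjacent_ok = all(
    hamming_distance(gray_encode(j), gray_encode(j + 1)) == 1 for j in range(255)
)

# The plain code is not robust: flipping one bit of the encoding of 0
# (here bit 3 of a 4-bit codeword) decodes to an integer far from 0.
noisy_estimate = gray_decode(gray_encode(0) ^ 0b1000)
```

In this sketch `noisy_estimate` lands at the opposite end of the 4-bit range. A robust Gray code in the sense of the abstract is designed to avoid that kind of large jump with high probability.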

arXiv:2405.16370 [pdf, ps, other]

Quickly-Decodable Group Testing with Fewer Tests: Price-Scarlett and Cheraghchi-Nakos's Nonadaptive Splitting with Explicit Scalars

Authors: Hsin-Po Wang, Ryan Gabrys, Venkatesan Guruswami

Abstract: We modify Cheraghchi-Nakos [CN20] and Price-Scarlett's [PS20] fast binary splitting approach to nonadaptive group testing. We show that, to identify a uniformly random subset of $k$ infected persons among a population of $n$, it takes only $\ln(2 - 4\varepsilon) ^{-2} k \ln n$ tests and decoding complexity $O(\varepsilon^{-2} k \ln n)$, for any small $\varepsilon > 0$, with vanishing error probability. In works prior to ours, only two types of group testing schemes exist. Those that use $\ln(2)^{-2} k \ln n$ or fewer tests require linear-in-$n$ complexity, sometimes even polynomial in $n$; those that enjoy sub-$n$ complexity employ $O(k \ln n)$ tests, where the big-$O$ scalar is implicit, presumably greater than $\ln(2)^{-2}$. We almost achieve the best of both worlds, namely, the almost-$\ln(2)^{-2}$ scalar and the sub-$n$ decoding complexity. How much further one can reduce the scalar $\ln(2)^{-2}$ remains an open problem.

Submitted 25 May, 2024; originally announced May 2024.

Comments: 6 pages, 3 figures, ISIT 2023

arXiv:2403.19061 [pdf, other]

One Code Fits All: Strong stuck-at codes for versatile memory encoding

Authors: Roni Con, Ryan Gabrys, Eitan Yaakobi

Abstract: In this work we consider a generalization of the well-studied problem of coding for ``stuck-at'' errors, which we refer to as ``strong stuck-at'' codes. In the traditional framework of stuck-at codes, the task involves encoding a message into a one-dimensional binary vector. However, a certain number of the bits in this vector are 'frozen', meaning they are fixed at a predetermined value and cannot be altered by the encoder. The decoder, aware of the proportion of frozen bits but not their specific positions, is responsible for deciphering the intended message. We consider a more challenging version of this problem where the decoder also does not know the fraction of frozen bits. We construct explicit and efficient encoding and decoding algorithms that get arbitrarily close to capacity in this scenario. Furthermore, to the best of our knowledge, our construction is the first fully explicit construction of stuck-at codes that approach capacity.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2402.03987 [pdf, other]

Tail-Erasure-Correcting Codes

Authors: Boaz Moav, Ryan Gabrys, Eitan Yaakobi

Abstract: The increasing demand for data storage has prompted the exploration of new techniques, with molecular data storage being a promising alternative. In this work, we develop coding schemes for a new storage paradigm that can be represented as a collection of two-dimensional arrays. Motivated by error patterns observed in recent prototype architectures, our study focuses on correcting erasures in the last few symbols of each row, and also correcting arbitrary deletions across rows. We present code constructions and explicit encoders and decoders that are shown to be nearly optimal in many scenarios. We show that the new coding schemes are capable of effectively mitigating these errors, making these emerging storage platforms potentially promising solutions.

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.17649 [pdf, other]

Covering All Bases: The Next Inning in DNA Sequencing Efficiency

Authors: Hadas Abraham, Ryan Gabrys, Eitan Yaakobi

Abstract: DNA emerges as a promising medium for the exponential growth of digital data due to its density and durability. This study extends recent research by addressing the \emph{coverage depth problem} in practical scenarios, exploring optimal error-correcting code pairings with DNA storage systems to minimize coverage depth. Conducted within random access settings, the study provides theoretical analyses and experimental simulations to examine the expectation and probability distribution of samples needed for file recovery. Structured into sections covering definitions, analyses, lower bounds, and comparative evaluations of coding schemes, the paper unveils insights into effective coding schemes for optimizing DNA storage systems.

Submitted 31 January, 2024; originally announced January 2024.

arXiv:2401.15666 [pdf, other]

Error-Correcting Codes for Combinatorial Composite DNA

Authors: Omer Sabary, Inbal Preuss, Ryan Gabrys, Zohar Yakhini, Leon Anavy, Eitan Yaakobi

Abstract: Data storage in DNA is developing as a possible solution for archival digital data. Recently, to further increase the potential capacity of DNA-based data storage systems, the combinatorial composite DNA synthesis method was suggested. This approach extends the DNA alphabet by harnessing short DNA fragment reagents, known as shortmers. The shortmers are building blocks of the alphabet symbols, each consisting of a fixed number of shortmers. Thus, when information is read, it is possible that one of the shortmers that forms part of the composition of a symbol is missing and therefore the symbol cannot be determined. In this paper, we model this type of error as a type of asymmetric error and propose code constructions that can correct such errors in this setup. We also provide a lower bound on the redundancy of such error-correcting codes and give an explicit encoder and decoder pair for our construction. Our suggested error model is also supported by an analysis of data from actual experiments that produced DNA according to the combinatorial scheme. Lastly, we also provide a statistical evaluation of the probability of observing such error events, as a function of read depth.

Submitted 26 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

arXiv:2308.14558 [pdf, other]

Storage codes and recoverable systems on lines and grids

Authors: Alexander Barg, Ohad Elishco, Ryan Gabrys, Geyang Wang, Eitan Yaakobi

Abstract: A storage code is an assignment of symbols to the vertices of a connected graph $G(V,E)$ with the property that the value of each vertex is a function of the values of its neighbors, or more generally, of a certain neighborhood of the vertex in $G$. In this work we introduce a new construction method of storage codes, enabling one to construct new codes from known ones via an interleaving procedure driven by resolvable designs. We also study storage codes on $\mathbb Z$ and ${\mathbb Z}^2$ (lines and grids), finding closed-form expressions for the capacity of several one- and two-dimensional systems depending on their recovery set, using connections between storage codes, graphs, anticodes, and difference-avoiding sets.

Submitted 28 August, 2023; originally announced August 2023.

arXiv:2305.05656 [pdf, other]

Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems

Authors: Daniella Bar-Lev, Omer Sabary, Ryan Gabrys, Eitan Yaakobi

Abstract: Although the expenses associated with DNA sequencing have been rapidly decreasing, the current cost of sequencing information stands at roughly $120/GB, which is dramatically more expensive than reading from existing archival storage solutions today. In this work, we aim to reduce not only the cost but also the latency of DNA storage by initiating the study of the DNA coverage depth problem, which aims to reduce the required number of reads to retrieve information from the storage system. Under this framework, our main goal is to understand the effect of error-correcting codes and retrieval algorithms on the required sequencing coverage depth. We establish that the expected number of reads that are required for information retrieval is minimized when the channel follows a uniform distribution. We also derive upper and lower bounds on the probability distribution of this number of required reads and provide a comprehensive upper and lower bound on its expected value. We further prove that for a noiseless channel and uniform distribution, MDS codes are optimal in terms of minimizing the expected number of reads. Additionally, we study the DNA coverage depth problem under the random-access setup, in which the user aims to retrieve just a specific information unit from the entire DNA storage system. We prove that the expected retrieval time is at least k for [n,k] MDS codes as well as for other families of codes. Furthermore, we present explicit code constructions that achieve expected retrieval times below k and evaluate their performance through analytical methods and simulations. Lastly, we provide lower bounds on the maximum expected retrieval time. Our findings offer valuable insights for reducing the cost and latency of DNA storage.

Submitted 29 November, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2304.01365 [pdf, ps, other]

Finding a Burst of Positives via Nonadaptive Semiquantitative Group Testing

Authors: Yun-Han Li, Ryan Gabrys, Jin Sima, Ilan Shomorony, Olgica Milenkovic

Abstract: Motivated by testing for pathogenic diseases, we consider a new nonadaptive group testing problem for which: (1) positives occur within a burst, capturing the fact that infected test subjects often come in clusters, and (2) the test outcomes arise from semiquantitative measurements that provide coarse information about the number of positives in any tested group. Our model generalizes prior work on detecting a single burst of positives with classical group testing [1] as well as work on semiquantitative group testing (SQGT) [2]. Specifically, we study the setting where the burst-length $\ell$ is known and the semiquantitative tests provide potentially nonuniform estimates on the number of positives in a test group. The estimates represent the index of a quantization bin containing the (exact) total number of positives, for arbitrary thresholds $\eta_1,\dots,\eta_s$. Interestingly, we show that the minimum number of tests needed for burst identification is essentially only a function of the largest threshold $\eta_s$. In this context, our main result is an order-optimal test scheme that can recover any burst of length $\ell$ using roughly $\frac{\ell}{2\eta_s}+\log_{s+1}(n)$ measurements. This suggests that a large saturation level $\eta_s$ is more important than finely quantized information when dealing with bursts. We also provide results for related modeling assumptions and specialized choices of thresholds.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2210.11818 [pdf, ps, other]

Non-binary Codes for Correcting a Burst of at Most t Deletions

Authors: Shuche Wang, Yuanyuan Tang, Jin Sima, Ryan Gabrys, Farzad Farnoud

Abstract: The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most $t$ deletions in non-binary sequences. We first propose a non-binary code correcting a burst of at most 2 deletions for $q$-ary alphabets. Afterwards, we extend this result to the case where the length of the burst can be at most $t$ where $t$ is a constant. Finally, we consider the setup where the sequences that are transmitted are permutations. The proposed codes are the largest known for their respective parameter regimes.

Submitted 21 October, 2022; originally announced October 2022.

Comments: 20 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at ISIT 2021 and Allerton 2022

arXiv:2208.02330 [pdf, other]

Low-redundancy codes for correcting multiple short-duplication and edit errors

Authors: Yuanyuan Tang, Shuche Wang, Hao Lou, Ryan Gabrys, Farzad Farnoud

Abstract: Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most $p$ edits, where a short duplication generates a copy of a substring with length $\leq 3$ and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to $p$ edits (in addition to duplications) at the additional cost of roughly $8p(\log_q n)(1+o(1))$ symbols of redundancy, thus achieving the same asymptotic rate, where $q\ge 4$ is the alphabet size and $p$ is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when $p$ is a constant with respect to the code length.

Submitted 3 August, 2022; originally announced August 2022.

Comments: 21 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at ISIT 2021 and ISIT 2022

arXiv:2207.04522 [pdf, ps, other]

doi 10.4230/LIPIcs.APPROX/RANDOM.2022.17

Accelerating Polarization via Alphabet Extension

Authors: Iwan Duursma, Ryan Gabrys, Venkatesan Guruswami, Ting-Chun Lin, Hsin-Po Wang

Abstract: Polarization is an unprecedented coding technique in that it not only achieves channel capacity, but also does so at a faster speed of convergence than any other coding technique. This speed is measured by the ``scaling exponent'' and its importance is three-fold. Firstly, estimating the scaling exponent is challenging and demands a deeper understanding of the dynamics of communication channels. Secondly, scaling exponents serve as a benchmark for different variants of polar codes that helps us select the proper variant for real-life applications. Thirdly, the need to optimize for the scaling exponent sheds light on how to reinforce the design of polar codes. In this paper, we generalize the binary erasure channel (BEC), the simplest communication channel and the protagonist of many coding theory studies, to the ``tetrahedral erasure channel'' (TEC). We then invoke Mori--Tanaka's $2 \times 2$ matrix over GF$(4)$ to construct polar codes over TEC. Our main contribution is showing that the dynamic of TECs converges to an almost--one-parameter family of channels, which then leads to an upper bound of $3.328$ on the scaling exponent. This is the first non-binary matrix whose scaling exponent is upper-bounded. It also polarizes BEC faster than all known binary matrices up to $23 \times 23$ in size. Our result indicates that expanding the alphabet is a more effective and practical alternative to enlarging the matrix in order to achieve faster polarization.

Submitted 15 July, 2023; v1 submitted 10 July, 2022; originally announced July 2022.

Comments: 22 pages, 4 figures. Accepted to RANDOM 2022. v2: 29 pages, 5 figures, 1 table; address comments from JSAIT

MSC Class: 94B65

arXiv:2204.11683 [pdf, other]

Sub-4.7 Scaling Exponent of Polar Codes

Authors: Hsin-Po Wang, Ting-Chun Lin, Alexander Vardy, Ryan Gabrys

Abstract: Polar code visibly approaches channel capacity in practice and is thereby a constituent code of the 5G standard. Compared to low-density parity-check code, however, the performance of short-length polar code has room for improvement that could hinder its adoption by a wider class of applications. As part of the program that addresses the performance issue at short length, it is crucial to understand how fast binary memoryless symmetric channels polarize. A number, called the scaling exponent, was defined to measure the speed of polarization, and several estimates of the scaling exponent were given in the literature. As of 2022, the tightest overestimate is 4.714, made by Mondelli, Hassani, and Urbanke in 2015. We lower the overestimate to 4.63.

Submitted 25 April, 2022; originally announced April 2022.

Comments: 15 pages, 13 figures, 1 table

MSC Class: 94B65

arXiv:2201.12671 [pdf, ps, other]

The Gapped $k$-Deck Problem

Authors: Rebecca Golm, Mina Nahvi, Ryan Gabrys, Olgica Milenkovic

Abstract: The $k$-deck problem is concerned with finding the smallest positive integer $S(k)$ such that there exist at least two strings of length $S(k)$ that share the same $k$-deck, i.e., the multiset of subsequences of length $k$. We introduce the new problem of gapped $k$-deck reconstruction: For a given gap parameter $s$, we seek the smallest positive integer $G_s(k)$ such that there exist at least two distinct strings of length $G_s(k)$ that cannot be distinguished based on a "gapped" set of $k$-subsequences. The gap constraint requires the elements in the subsequences to be at least $s$ positions apart within the original string. Our results are as follows. First, we show how to construct sequences sharing the same $2$-gapped $k$-deck using a nontrivial modification of the recursive Morse-Thue string construction procedure. This establishes the first known constructive upper bound on $G_2(k)$. Second, we further improve this bound using the approach by Dudik and Schulman.

Submitted 17 May, 2022; v1 submitted 29 January, 2022; originally announced January 2022.
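
The definitions in the abstract above can be made concrete with a small brute-force sketch. It is not the paper's construction. The function name `gapped_k_deck` is hypothetical. Reading "at least $s$ positions apart" as "consecutive chosen indices differ by at least $s$" is an assumption. With `gap=1` the function returns the ordinary $k$-deck.

```python
from collections import Counter
from itertools import combinations


def gapped_k_deck(string: str, k: int, gap: int = 1) -> Counter:
    """Multiset of length-k subsequences whose chosen indices are pairwise
    consecutive-distance >= gap (assumed reading of the gap constraint).
    gap=1 imposes no extra constraint and yields the ordinary k-deck."""
    deck = Counter()
    for idx in combinations(range(len(string)), k):
        # combinations yields increasing index tuples; check neighbouring gaps.
        if all(b - a >= gap for a, b in zip(idx, idx[1:])):
            deck["".join(string[i] for i in idx)] += 1
    return deck


# Two distinct length-4 strings (Morse-Thue prefixes) share the same 2-deck...
same_2_deck = gapped_k_deck("0110", 2) == gapped_k_deck("1001", 2)

# ...but they are told apart by their 3-decks and by their 2-gapped 2-decks.
same_3_deck = gapped_k_deck("0110", 3) == gapped_k_deck("1001", 3)
same_gapped_2_deck = gapped_k_deck("0110", 2, gap=2) == gapped_k_deck("1001", 2, gap=2)
```

The enumeration is exponential in $k$ and only meant for tiny examples. It illustrates why the gapped problem needs its own constructions: a pair that collides on the ordinary deck need not collide once the gap constraint removes some subsequences.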

arXiv:2201.09171 [pdf, other]

Balanced and Swap-Robust Trades for Dynamical Distributed Storage

Authors: Chao Pan, Ryan Gabrys, Xujun Liu, Charles Colbourn, Olgica Milenkovic

Abstract: Trades, introduced by Hedayat, are two sets of blocks of elements which may be exchanged (traded) without altering the counts of certain subcollections of elements within their constituent blocks. They are of importance in applications where certain combinations of elements dynamically become prohibited from being placed in the same group of elements, since in this case one can trade the offending blocks with allowed ones. This is particularly the case in distributed storage systems, where due to privacy and other constraints, data of some groups of users cannot be stored together on the same server. We introduce a new class of balanced trades, important for access balancing of servers, and perturbation resilient balanced trades, important for studying the stability of server access frequencies with respect to changes in data popularity. The constructions and bounds on our new trade schemes rely on specialized selections of defining sets in minimal trades and number-theoretic analyses.

Submitted 13 May, 2022; v1 submitted 22 January, 2022; originally announced January 2022.

Comments: 6 pages

arXiv:2201.05440 [pdf, ps, other]

Tropical Group Testing

Authors: Hsin-Po Wang, Ryan Gabrys, Alexander Vardy

Abstract: Polymerase chain reaction (PCR) testing is the gold standard for diagnosing COVID-19. PCR amplifies the virus DNA 40 times to produce measurements of viral loads that span seven orders of magnitude. Unfortunately, the outputs of these tests are imprecise and therefore quantitative group testing methods, which rely on precise measurements, are not applicable. Motivated by the ever-increasing demand to identify individuals infected with SARS-CoV-2, we propose a new model that leverages tropical arithmetic to characterize the PCR testing process. Our proposed framework, termed tropical group testing, overcomes existing limitations of quantitative group testing by allowing for imprecise test measurements. In many cases, some of which are highlighted in this work, tropical group testing is provably more powerful than traditional binary group testing in that it requires fewer tests than classical approaches, while additionally providing a mechanism to identify the viral load of each infected individual. It is also empirically stronger than related works that have attempted to combine PCR, quantitative group testing, and compressed sensing.

Submitted 17 January, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

Comments: 25 pages, 20 figures. v2 fixes typos

MSC Class: 05B20; 15A80 (Primary)

arXiv:2112.09971 [pdf, ps, other]

Beyond Single-Deletion Correcting Codes: Substitutions and Transpositions

Authors: Ryan Gabrys, Venkatesan Guruswami, João Ribeiro, Ke Wu

Abstract: We consider the problem of designing low-redundancy codes in settings where one must correct deletions in conjunction with substitutions or adjacent transpositions; a combination of errors that is usually observed in DNA-based data storage. One of the most basic versions of this problem was settled more than 50 years ago by Levenshtein, who constructed binary codes correcting one deletion or one substitution, with nearly optimal redundancy. However, this approach fails to extend to many simple and natural variations of the binary single-edit error setting. In this work, we make progress on the code design problem above in three such variations: We construct linear-time encodable and decodable length-$n$ non-binary codes correcting a single edit error with nearly optimal redundancy $\log n+O(\log\log n)$, providing an alternative simpler proof of a result by Cai, Chee, Gabrys, Kiah, and Nguyen (IEEE Trans. Inf. Theory 2021). This is achieved by employing what we call weighted VT sketches, a notion that may be of independent interest. We construct linear-time encodable and list-decodable binary codes with list-size $2$ for one deletion and one substitution with redundancy $4\log n+O(\log\log n)$. This matches the existential bound up to an $O(\log\log n)$ additive term. We show the existence of a binary code correcting one deletion or one adjacent transposition with nearly optimal redundancy $\log n+O(\log\log n)$.

Submitted 18 December, 2021; originally announced December 2021.

Comments: 33 pages, 7 figures

arXiv:2110.02352 [pdf, other]

Reconstruction of Sets of Strings from Prefix/Suffix Compositions

Authors: Ryan Gabrys, Srilakshmi Pattabiraman, Olgica Milenkovic

Abstract: The problem of reconstructing strings from substring information has found many applications due to its importance in genomic data sequencing and DNA- and polymer-based data storage. One practically important and challenging paradigm requires reconstructing mixtures of strings based on the union of compositions of their prefixes and suffixes, generated by mass spectrometry devices. We describe new coding methods that allow for unique joint reconstruction of subsets of strings selected from a code and provide upper and lower bounds on the asymptotic rate of the underlying codebooks. Our code constructions combine properties of binary $B_h$ and Dyck strings and can be extended to accommodate missing substrings in the pool. As auxiliary results, we obtain the first known bounds on binary $B_h$ sequences for arbitrary even parameters $h$, and also describe various error models inherent to mass spectrometry analysis. This paper contains a correction of the prior work by the authors, published in [24]. In particular, the bounds on the prefix codes are now corrected.

Submitted 5 October, 2021; originally announced October 2021.

arXiv:2102.04519 [pdf, ps, other]

Semiquantitative Group Testing in at Most Two Rounds

Authors: Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic

Abstract: Semiquantitative group testing (SQGT) is a pooling method in which the test outcomes represent bounded intervals for the number of defectives. Alternatively, it may be viewed as an adder channel with quantized outputs. SQGT represents a natural choice for Covid-19 group testing as it allows for a straightforward interpretation of the cycle threshold values produced by polymerase chain reactions (PCR). Prior work on SQGT did not address the need for adaptive testing with a small number of rounds as required in practice. We propose conceptually simple methods for 2-round and nonadaptive SQGT that significantly improve upon existing schemes by using ideas on nonbinary measurement matrices based on expander graphs and list-disjunct matrices.

Submitted 8 February, 2021; originally announced February 2021.

arXiv:2011.05223 [pdf, other]

AC-DC: Amplification Curve Diagnostics for Covid-19 Group Testing

Authors: Ryan Gabrys, Srilakshmi Pattabiraman, Vishal Rana, João Ribeiro, Mahdi Cheraghchi, Venkatesan Guruswami, Olgica Milenkovic

Abstract: The first part of the paper presents a review of the gold-standard testing protocol for Covid-19, real-time, reverse transcriptase PCR, and its properties and associated measurement data such as amplification curves that can guide the development of appropriate and accurate adaptive group testing protocols. The second part of the paper is concerned with examining various off-the-shelf group testing methods for Covid-19 and identifying their strengths and weaknesses for the application at hand. The third part of the paper contains a collection of new analytical results for adaptive semiquantitative group testing with probabilistic and combinatorial priors, including performance bounds, algorithmic solutions, and noisy testing protocols. The probabilistic setting is of special importance as it is designed to be simple to implement by nonexperts and handle heavy hitters. The worst-case paradigm extends and improves upon prior work on semiquantitative group testing with and without specialized PCR noise models.

Submitted 5 June, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

arXiv:2010.11116 [pdf, ps, other]

Reconstructing Mixtures of Coded Strings from Prefix and Suffix Compositions

Authors: Ryan Gabrys, Srilakshmi Pattabiraman, Olgica Milenkovic

Abstract: The problem of string reconstruction from substring information has found many applications due to its relevance in DNA- and polymer-based data storage. One practically important and challenging paradigm requires reconstructing mixtures of strings based on the union of compositions of their prefixes and suffixes, generated by mass spectrometry readouts. We describe new coding methods that allow for unique joint reconstruction of subsets of strings selected from a code and provide matching upper and lower bounds on the asymptotic rate of the underlying codebooks. Under certain mild constraints on the problem parameters, one can show that the largest possible rate of a codebook that allows for all subcollections of $\leq h$ codestrings to be uniquely reconstructable from the prefix-suffix information equals $1/h$.

Submitted 21 October, 2020; originally announced October 2020.

arXiv:2003.02121 [pdf, other]

Coding for Polymer-Based Data Storage

Authors: Srilakshmi Pattabiraman, Ryan Gabrys, Olgica Milenkovic

Abstract: Motivated by polymer-based data-storage platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for both unique string reconstruction and correction of multiple mass errors. We consider two approaches: The first approach pertains to asymmetric errors and it is based on introducing redundancy that scales linearly with the number of errors and logarithmically with the length of the string. The construction allows for the string to be uniquely reconstructed based only on its erroneous substring composition multiset. The key idea behind our unique reconstruction approach is to interleave (shifted) Catalan-Bertrand paths with arbitrary binary strings and "reflect" them so as to force prefixes and suffixes of the same length to have different weights. The asymptotic code rate of the scheme is one, and decoding is accomplished via a simplified version of the backtracking algorithm used for the Turnpike problem. For symmetric errors, we use a polynomial characterization of the mass information and adapt polynomial evaluation code constructions for this setting. In the process, we develop new efficient decoding algorithms for a constant number of composition errors and show that the redundancy of the scheme scales quadratically with the number of errors and logarithmically with the codelength.

Submitted 28 June, 2021; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1904.09280, arXiv:2001.04967

arXiv:2001.04967 [pdf, ps, other]

Mass Error-Correction Codes for Polymer-Based Data Storage

Authors: Ryan Gabrys, Srilakshmi Pattabiraman, Olgica Milenkovic

Abstract: We consider the problem of correcting mass readout errors in information encoded in binary polymer strings. Our work builds on results for string reconstruction problems using composition multisets [Acharya et al., 2015] and the unique string reconstruction framework proposed in [Pattabiraman et al., 2019]. Binary polymer-based data storage systems [Laure et al., 2016] operate by designing two molecules of significantly different masses to represent the symbols $\{0,1\}$ and perform readouts through noisy tandem mass spectrometry. Tandem mass spectrometers fragment the strings to be read into shorter substrings and only report their masses, often with errors due to imprecise ionization. Modeling the fragmentation process output in terms of composition multisets allows for designing asymptotically optimal codes capable of unique reconstruction and the correction of a single mass error [Pattabiraman et al., 2019] through the use of derivatives of Catalan paths. Nevertheless, no solutions for correcting multiple mass errors are currently known. Our work addresses this issue by describing the first multiple-error correction codes that use the polynomial factorization approach for the Turnpike problem [Skiena et al., 1990] and the related factorization described in [Acharya et al., 2015]. Adding Reed-Solomon type coding redundancy into the corresponding polynomials allows for correcting $t$ mass errors in polynomial time using $t^2\, \log\,k$ redundant bits, where $k$ is the information string length. The redundancy can be improved to $\log\,k + t$. However, no decoding algorithm for this scheme that runs in time polynomial in both $t$ and $n$ is currently known, where $n$ is the length of the coded string.

Submitted 14 January, 2020; originally announced January 2020.
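
The composition multiset used in the abstract above can be illustrated with a short sketch. It assumes the usual definition from the string reconstruction literature: the composition of a binary substring is its pair (number of 0s, number of 1s), and the composition multiset collects the compositions of all contiguous substrings. The function name composition_multiset is hypothetical and is not taken from the paper.

```python
from collections import Counter


def composition_multiset(s: str) -> Counter:
    """Multiset of (zeros, ones) compositions over all contiguous substrings of s.

    Illustrative helper (hypothetical name); s is assumed to be a binary string.
    """
    out = Counter()
    for i in range(len(s)):
        ones = 0
        for j in range(i, len(s)):
            ones += s[j] == "1"      # running count of 1s in s[i..j]
            length = j - i + 1
            out[(length - ones, ones)] += 1
    return out


# A string and its reversal always share the same composition multiset,
# which is one reason unique reconstruction needs coding.
same = composition_multiset("0010") == composition_multiset("0100")
```

Since a mass spectrometer reports only masses, the order of symbols inside each fragment is lost, and this multiset models exactly that information.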

arXiv:1910.06501 [pdf, ps, other]

Optimal Codes Correcting a Single Indel / Edit for DNA-Based Data Storage

Authors: Kui Cai, Yeow Meng Chee, Ryan Gabrys, Han Mao Kiah, Tuan Thanh Nguyen

Abstract: An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this paper, we investigate codes that combat either a single indel or a single edit and provide linear-time algorithms that encode binary messages into these codes of length n. Over the quaternary alphabet, we provide two linear-time encoders. One corrects a single edit with log n + O(log log n) redundancy bits, while the other corrects a single indel with log n + 2 redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (that corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits. Over the DNA alphabet, we impose an additional constraint, the GC-balanced constraint, and require that exactly half of the symbols of any DNA codeword be either C or G. In particular, via a modification of Knuth's balancing technique, we provide a linear-time map that translates binary messages into GC-balanced codewords and the resulting codebook is able to correct a single indel or a single edit. These are the first known constructions of GC-balanced codes that correct a single indel or a single edit.

Submitted 14 October, 2019; originally announced October 2019.

Comments: 15 pages

arXiv:1910.00796 [pdf, other]

doi 10.1109/TIT.2023.3247860

Transition Waste Optimization for Coded Elastic Computing

Authors: Hoang Dau, Ryan Gabrys, Yu-Chih Huang, Chen Feng, Quang-Hung Luu, Eidah Alzahrani, Zahir Tari

Abstract: Distributed computing, in which a resource-intensive task is divided into subtasks and distributed among different machines, plays a key role in solving large-scale problems. Coded computing is a recently emerging paradigm where redundancy for distributed computing is introduced to alleviate the impact of slow machines (stragglers) on the completion time. We investigate coded computing solutions over elastic resources, where the set of available machines may change in the middle of the computation. This is motivated by recently available services in the cloud computing industry (e.g., EC2 Spot, Azure Batch) where low-priority virtual machines are offered at a fraction of the price of the on-demand instances but can be preempted on short notice. Our contributions are three-fold. We first introduce a new concept called transition waste that quantifies the number of tasks existing machines must abandon or take over when a machine joins/leaves. We then develop an efficient method to minimize the transition waste for the cyclic task allocation scheme recently proposed in the literature (Yang et al. ISIT'19). Finally, we establish a novel solution based on finite geometry achieving zero transition waste given that the number of active machines varies within a fixed range.

Submitted 14 March, 2023; v1 submitted 2 October, 2019; originally announced October 2019.

Comments: 24 pages, accepted by IEEE Transactions on Information Theory

arXiv:1909.05694 [pdf, ps, other]

Repeat-Free Codes

Authors: Ohad Elishco, Ryan Gabrys, Eitan Yaakobi, Muriel Médard

Abstract: In this paper we consider the problem of encoding data into \textit{repeat-free} sequences, in which sequences are required to contain any $k$-tuple at most once (for predefined $k$). First, the capacity of the repeat-free constraint is calculated. Then, an efficient algorithm, which uses two bits of redundancy, is presented to encode length-$n$ sequences for $k=2+2\log (n)$. This algorithm is then improved to support any value of $k$ of the form $k=a\log (n)$, for $1<a$, while its redundancy is $o(n)$. We also calculate the capacity of repeat-free sequences when combined with local constraints which are given by a constrained system, and the capacity of multi-dimensional repeat-free codes.

Submitted 21 June, 2021; v1 submitted 12 September, 2019; originally announced September 2019.

Comments: 21 pages
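
The repeat-free constraint defined in the abstract above (every $k$-tuple occurs at most once) can be checked with a short sketch. This is only a membership test, not the paper's encoder, and the function name is_repeat_free is hypothetical.

```python
def is_repeat_free(seq: str, k: int) -> bool:
    """Return True if every length-k window of seq occurs at most once.

    Illustrative checker (hypothetical name), not the encoding algorithm.
    """
    seen = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            return False  # this k-tuple already appeared earlier in seq
        seen.add(window)
    return True


# "0011" has 2-windows 00, 01, 11 (all distinct);
# "0101" repeats the window 01.
ok = is_repeat_free("0011", 2)
bad = is_repeat_free("0101", 2)
```

The check runs in time proportional to the sequence length times $k$, because each window is hashed once.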

arXiv:1906.12073 [pdf, ps, other]

Access Balancing in Storage Systems by Labeling Partial Steiner Systems

Authors: Yeow Meng Chee, Charles J. Colbourn, Hoang Dau, Ryan Gabrys, Alan C. H. Ling, Dylan Lusi, Olgica Milenkovic

Abstract: Storage architectures ranging from minimum bandwidth regenerating encoded distributed storage systems to declustered-parity RAIDs can be designed using dense partial Steiner systems in order to support fast reads, writes, and recovery of failed storage units. In order to ensure good performance, popularities of the data items should be taken into account and the frequencies of accesses to the storage units made as uniform as possible. A proposed combinatorial model ranks items by popularity and assigns data items to elements in a dense partial Steiner system so that the sums of ranks of the elements in each block are as equal as possible. By developing necessary conditions in terms of independent sets, we demonstrate that certain Steiner systems must have a much larger difference between the largest and smallest block sums than is dictated by an elementary lower bound. In contrast, we also show that certain dense partial $S(t, t+1, v)$ designs can be labeled to realize the elementary lower bound. Furthermore, we prove that for every admissible order $v$, there is a Steiner triple system $(S(2, 3, v))$ whose largest difference in block sums is within an additive constant of the lower bound.

Submitted 28 June, 2019; originally announced June 2019.

Comments: 16 pages

arXiv:1904.09280 [pdf, other]

Reconstruction and Error-Correction Codes for Polymer-Based Data Storage

Authors: Srilakshmi Pattabiraman, Ryan Gabrys, Olgica Milenkovic

Abstract: Motivated by polymer-based data-storage platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for unique string reconstruction and correction of one mass error. Our approach is based on introducing redundancy that scales logarithmically with the length of the string and allows for the string to be uniquely reconstructed based only on its erroneous substring composition multiset. The key idea behind our unique reconstruction approach is to interleave Catalan-type paths with arbitrary binary strings and "reflect" them so as to allow prefixes and suffixes of the same length to have different weights. For error correction, we add a constant number of bits that provide information about the weights of reflected pairs of bits and hence enable recovery from a single mass error. The asymptotic code rate of the scheme is one, and decoding is accomplished via a simplified version of the backtracking algorithm used for the Turnpike problem.

Submitted 19 April, 2019; originally announced April 2019.

arXiv:1903.09992 [pdf, ps, other]

Coded trace reconstruction

Authors: Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, João Ribeiro

Abstract: Motivated by average-case trace reconstruction and coding for portable DNA-based storage systems, we initiate the study of \emph{coded trace reconstruction}, the design and analysis of high-rate efficiently encodable codes that can be efficiently decoded with high probability from few reads (also called \emph{traces}) corrupted by edit errors. Codes used in current portable DNA-based storage systems with nanopore sequencers are largely based on heuristics, and have no provable robustness or performance guarantees even for an error model with i.i.d.\ deletions and constant deletion probability. Our work is a first step towards the design of efficient codes with provable guarantees for such systems. We consider a constant rate of i.i.d.\ deletions, and perform an analysis of marker-based code-constructions. This gives rise to codes with redundancy $O(n/\log n)$ (resp.\ $O(n/\log\log n)$) that can be efficiently reconstructed from $\exp(O(\log^{2/3}n))$ (resp.\ $\exp(O(\log\log n)^{2/3})$) traces, where $n$ is the message length. Then, we give a construction of a code with $O(\log n)$ bits of redundancy that can be efficiently reconstructed from $\textrm{poly}(n)$ traces if the deletion probability is small enough. Finally, we show how to combine both approaches, giving rise to an efficient code with $O(n/\log n)$ bits of redundancy which can be reconstructed from $\textrm{poly}(\log n)$ traces for a small constant deletion probability.

Submitted 9 September, 2019; v1 submitted 24 March, 2019; originally announced March 2019.

Comments: v2 and v3: added missing references; v4: added funding acknowledgment; v5: added references to concurrent, independent work; v6: added funding acknowledgment. 26 pages, no figures. A short version of this paper was presented at ITW 2019

arXiv:1901.05559 [pdf, ps, other]

Set-Codes with Small Intersections and Small Discrepancies

Authors: R. Gabrys, H. S. Dau, C. J. Colbourn, O. Milenkovic

Abstract: We are concerned with the problem of designing large families of subsets over a common labeled ground set that have small pairwise intersections and the property that the maximum discrepancy of the label values within each of the sets is less than or equal to one. Our results, based on transversal designs, factorizations of packings and Latin rectangles, show that by jointly constructing the sets and labeling scheme, one can achieve optimal family sizes for many parameter choices. Probabilistic arguments akin to those used for pseudorandom generators lead to significantly suboptimal results when compared to the proposed combinatorial methods. The design problem considered is motivated by applications in molecular data storage and theoretical computer science.

Submitted 16 January, 2019; originally announced January 2019.

arXiv:1809.04702 [pdf, other]

Reconciling Similar Sets of Data

Authors: Ryan Gabrys, Farzad Farnoud

Abstract: In this work, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related through the Hamming distance metric. Upper and lower bounds are derived on the minimum amount of information exchange. Furthermore, explicit encoding and decoding algorithms are provided for many cases.

Submitted 12 September, 2018; originally announced September 2018.

arXiv:1804.04548 [pdf, ps, other]

Unique Reconstruction of Coded Strings from Multiset Substring Spectra

Authors: Ryan Gabrys, Olgica Milenkovic

Abstract: The problem of reconstructing strings from their substring spectra has a long history and in its most simple incarnation asks for determining under which conditions the spectrum uniquely determines the string. We study the problem of coded string reconstruction from multiset substring spectra, where the strings are restricted to lie in some codebook. In particular, we consider binary codebooks that allow for unique string reconstruction and propose a new method, termed repeat replacement, to create the codebook. Our contributions include algorithmic solutions for repeat replacement and constructive redundancy bounds for the underlying coding schemes. We also consider extensions of the problem to noisy settings in which substrings are compromised by burst and random errors. The study is motivated by applications in DNA-based data storage systems that use high throughput readout sequencers.

Submitted 22 April, 2019; v1 submitted 12 April, 2018; originally announced April 2018.

arXiv:1712.07222 [pdf, other]

Codes Correcting Two Deletions

Authors: Ryan Gabrys, Frederic Sala

Abstract: In this work, we investigate the problem of constructing codes capable of correcting two deletions. In particular, we construct a code that requires approximately 8 log n + O(log log n) bits of redundancy, where n is the length of the code. To the best of the authors' knowledge, this represents the best known construction in that it requires the lowest number of redundant bits for a code correcting two deletions.

Submitted 30 April, 2018; v1 submitted 19 December, 2017; originally announced December 2017.

arXiv:1709.05214 [pdf, other]

Mutually Uncorrelated Primers for DNA-Based Data Storage

Authors: S. M. Hossein Tabatabaei Yazdi, Han Mao Kiah, Ryan Gabrys, Olgica Milenkovic

Abstract: We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based data storage systems and for synchronization of communication devices. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. WMU sequences used for primer design in DNA-based data storage systems are also required to be at large mutual Hamming distance from each other, have balanced compositions of symbols, and avoid primer-dimer byproducts. We derive bounds on the size of WMU and various constrained WMU codes and present a number of constructions for balanced, error-correcting, primer-dimer free WMU codes using Dyck paths, prefix-synchronized and cyclic codes.

Submitted 13 September, 2017; originally announced September 2017.

Comments: 14 pages, 3 figures, 1 Table. arXiv admin note: text overlap with arXiv:1601.08176

arXiv:1701.08111 [pdf, ps, other]

The Hybrid k-Deck Problem: Reconstructing Sequences from Short and Long Traces

Authors: Ryan Gabrys, Olgica Milenkovic

Abstract: We introduce a new variant of the $k$-deck problem, which in its traditional formulation asks for determining the smallest $k$ that allows one to reconstruct any binary sequence of length $n$ from the multiset of its $k$-length subsequences. In our version of the problem, termed the hybrid $k$-deck problem, one is given a certain number of special subsequences of the sequence of length $n - t$, $t > 0$, and the question of interest is to determine the smallest value of $k$ such that the $k$-deck, along with the subsequences, allows for reconstructing the original sequence in an error-free manner. We first consider the case that one is given a single subsequence of the sequence of length $n - t$, obtained by deleting zeros only, and seek the value of $k$ that allows for hybrid reconstruction. We prove that in this case, $k \in [\log t+2, \min\{ t+1, O(\sqrt{n \cdot (1+\log t)}) \} ]$. We then proceed to extend the single-subsequence setup to the case where one is given $M$ subsequences of length $n - t$ obtained by deleting zeros only. In this case, we first aggregate the asymmetric traces and then invoke the single-trace results. The analysis and problem at hand are motivated by nanopore sequencing problems for DNA-based data storage.

Submitted 27 January, 2017; originally announced January 2017.
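
The $k$-deck defined in the abstract above, the multiset of all length-$k$ (not necessarily contiguous) subsequences, can be computed by brute force for small inputs. The sketch below is illustrative only; the function name k_deck is hypothetical, and the enumeration is exponential in general.

```python
from collections import Counter
from itertools import combinations


def k_deck(seq: str, k: int) -> Counter:
    """Multiset of all length-k subsequences of seq.

    Brute-force illustration (hypothetical name): one entry per choice of
    k positions, with symbols kept in their original order.
    """
    return Counter("".join(sub) for sub in combinations(seq, k))


# "0110" and "1001" share the same 2-deck, so k = 2 cannot tell them apart,
# while their 3-decks differ.
same_2_deck = k_deck("0110", 2) == k_deck("1001", 2)
same_3_deck = k_deck("0110", 3) == k_deck("1001", 3)
```

The pair above shows why the smallest sufficient $k$ must grow with the sequence length.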

arXiv:1604.03000 [pdf, other]

Exact Reconstruction from Insertions in Synchronization Codes

Authors: Frederic Sala, Ryan Gabrys, Clayton Schoeny, Lara Dolecek

Abstract: This work studies problems in data reconstruction, an important area with numerous applications. In particular, we examine the reconstruction of binary and non-binary sequences from synchronization (insertion/deletion-correcting) codes. These sequences have been corrupted by a fixed number of symbol insertions (larger than the minimum edit distance of the code), yielding a number of distinct traces to be used for reconstruction. We wish to know the minimum number of traces needed for exact reconstruction. This is a general version of a problem tackled by Levenshtein for uncoded sequences. We introduce an exact formula for the maximum number of common supersequences shared by sequences at a certain edit distance, yielding an upper bound on the number of distinct traces necessary to guarantee exact reconstruction. Without specific knowledge of the codewords, this upper bound is tight. We apply our results to the famous single deletion/insertion-correcting Varshamov-Tenengolts (VT) codes and show that a significant number of VT codeword pairs achieve the worst-case number of outputs needed for exact reconstruction. We also consider extensions to other channels, such as adversarial deletion and insertion/deletion channels and probabilistic channels.

Submitted 7 March, 2017; v1 submitted 11 April, 2016; originally announced April 2016.

Comments: 18 pages, 3 figures. Accepted to IEEE Transactions on Information Theory
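
The Varshamov-Tenengolts codes mentioned in the abstract above are usually defined by a weighted checksum: a binary word $x$ of length $n$ lies in $VT_a(n)$ when $\sum_i i\,x_i \equiv a \pmod{n+1}$. The sketch below computes that checksum and recovers a codeword after one deletion by trying every single-bit insertion. It is a brute-force illustration under that standard definition, not the linear-time decoder and not the reconstruction method of the paper; both function names are hypothetical.

```python
def vt_syndrome(x) -> int:
    """Weighted checksum sum(i * x_i) mod (n + 1), with 1-based positions."""
    return sum(i * bit for i, bit in enumerate(x, start=1)) % (len(x) + 1)


def vt_decode_single_deletion(y, a: int = 0) -> set:
    """Return every word in VT_a(len(y) + 1) obtainable by inserting one bit into y.

    Brute-force sketch (hypothetical name); for a valid received word the
    single-deletion-correcting property means the set holds one codeword.
    """
    y = tuple(y)
    candidates = set()
    for pos in range(len(y) + 1):
        for bit in (0, 1):
            x = y[:pos] + (bit,) + y[pos:]
            if vt_syndrome(x) == a:
                candidates.add(x)
    return candidates


codeword = (1, 0, 0, 1)                       # 1*1 + 4*1 = 5, divisible by 5
received = codeword[:1] + codeword[2:]        # delete the second bit
decoded = vt_decode_single_deletion(received)
```

Trying all insertions costs quadratic time in the word length, which is acceptable for illustration.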

arXiv:1602.06820 [pdf, ps, other]

Codes Correcting a Burst of Deletions or Insertions

Authors: Clayton Schoeny, Antonia Wachter-Zeh, Ryan Gabrys, Eitan Yaakobi

Abstract: This paper studies codes that correct bursts of deletions. Namely, a code will be called a $b$-burst-deletion-correcting code if it can correct a deletion of any $b$ consecutive bits. While the lower bound on the redundancy of such codes was shown by Levenshtein to be asymptotically $\log(n)+b-1$, the redundancy of the best code construction by Cheng et al. is $b(\log (n/b+1))$. In this paper we close on this gap and provide codes with redundancy at most $\log(n) + (b-1)\log(\log(n)) +b -\log(b)$. We also derive a non-asymptotic upper bound on the size of $b$-burst-deletion-correcting codes and extend the burst deletion model to two more cases: 1) A deletion burst of at most $b$ consecutive bits and 2) A deletion burst of size at most $b$ (not necessarily consecutive). We extend our code construction for the first case and study the second case for $b=3,4$. The equivalent models for insertions are also studied and are shown to be equivalent to correcting the corresponding burst of deletions.

Submitted 12 May, 2016; v1 submitted 22 February, 2016; originally announced February 2016.
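
To give a feel for the gap described in the abstract above, the three redundancy expressions can be evaluated numerically. The sketch below assumes all logarithms are base 2 (the abstract does not state the base) and uses hypothetical function names; the values n = 1024 and b = 4 are arbitrary illustrative choices, not figures from the paper.

```python
import math


def levenshtein_lower_bound(n, b):
    """Asymptotic lower bound on redundancy: log(n) + b - 1 (base 2 assumed)."""
    return math.log2(n) + b - 1


def cheng_redundancy(n, b):
    """Redundancy attributed to Cheng et al.: b * log(n/b + 1) (base 2 assumed)."""
    return b * math.log2(n / b + 1)


def new_redundancy(n, b):
    """Upper bound stated in the abstract:
    log(n) + (b - 1) * log(log(n)) + b - log(b) (base 2 assumed)."""
    return math.log2(n) + (b - 1) * math.log2(math.log2(n)) + b - math.log2(b)


if __name__ == "__main__":
    n, b = 1024, 4
    print(f"lower bound : {levenshtein_lower_bound(n, b):.2f} bits")
    print(f"new codes   : {new_redundancy(n, b):.2f} bits")
    print(f"Cheng et al.: {cheng_redundancy(n, b):.2f} bits")
```

For these parameters the new expression falls between the lower bound and the earlier construction, which is the sense in which the gap narrows.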

arXiv:1601.06887 [pdf, other]

Balanced Permutation Codes

Authors: Ryan Gabrys, Olgica Milenkovic

Abstract: Motivated by charge balancing constraints for rank modulation schemes, we introduce the notion of balanced permutations and derive the capacity of balanced permutation codes. We also describe simple interleaving methods for permutation code constructions and show that they approach capacity.

Submitted 25 January, 2016; originally announced January 2016.

arXiv:1601.06885 [pdf, ps, other]

Codes in the Damerau Distance for DNA Storage

Authors: Ryan Gabrys, Eitan Yaakobi, Olgica Milenkovic

Abstract: Motivated by applications in DNA-based storage, we introduce the new problem of code design in the Damerau metric. The Damerau metric is a generalization of the Levenshtein distance which, in addition to deletions, insertions and substitution errors also accounts for adjacent transposition edits. We first provide constructions for codes that may correct either a single deletion or a single adjacent transposition and then proceed to extend these results to codes that can simultaneously correct a single deletion and multiple adjacent transpositions. We conclude with constructions for joint block deletion and adjacent block transposition error-correcting codes.

Submitted 30 April, 2018; v1 submitted 25 January, 2016; originally announced January 2016.
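
The edit model in the abstract above (deletions, insertions, substitutions, and adjacent transpositions) can be sketched with a short dynamic program. Note that the code below computes the restricted variant usually called optimal string alignment, in which no substring is edited more than once; it can exceed the unrestricted Damerau-Levenshtein distance and is swapped in here only because it is short. The function name is hypothetical, and this is not a construction from the paper.

```python
def osa_distance(s, t):
    """Optimal string alignment distance between sequences s and t.

    Allowed unit-cost edits: insertion, deletion, substitution, and
    transposition of two adjacent symbols (restricted variant).
    """
    m, n = len(s), len(t)
    # d[i][j] = distance between the prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i symbols
    for j in range(n + 1):
        d[0][j] = j  # insert all j symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent transposition: last two symbols are swapped
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


if __name__ == "__main__":
    print(osa_distance("ACGT", "AGCT"))  # one adjacent transposition
    print(osa_distance("ACGT", "ACT"))   # one deletion
```

A single swap of neighboring symbols costs one edit here, whereas the plain Levenshtein distance would charge two substitutions for it.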

arXiv:1506.00740 [pdf, other]

Asymmetric Lee Distance Codes for DNA-Based Storage

Authors: Ryan Gabrys, Han Mao Kiah, Olgica Milenkovic

Abstract: We consider a new family of codes, termed asymmetric Lee distance codes, that arise in the design and implementation of DNA-based storage systems and systems with parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes; furthermore, symbol confusability is dictated by their underlying binary representation. Our contributions are two-fold. First, we demonstrate that the new distance represents a linear combination of the Lee and Hamming distance and derive upper bounds on the size of the codes under this metric based on linear programming techniques. Second, we propose a number of code constructions which imply lower bounds.

Submitted 14 December, 2016; v1 submitted 1 June, 2015; originally announced June 2015.

arXiv:1307.7087 [pdf, ps, other]

Correcting Grain-Errors in Magnetic Media

Authors: Ryan Gabrys, Eitan Yaakobi, Lara Dolecek

Abstract: This paper studies new bounds and constructions that are applicable to the combinatorial granular channel model previously introduced by Sharov and Roth. We derive new bounds on the maximum cardinality of a grain-error-correcting code and propose constructions of codes that correct grain-errors. We demonstrate that a permutation of the classical group codes (e.g., Constantin-Rao codes) can correct a single grain-error. In many cases of interest, our results improve upon the currently best known bounds and constructions. Some of the approaches adopted in the context of grain-errors may have application to other channel models.

Submitted 30 April, 2018; v1 submitted 26 July, 2013; originally announced July 2013.

Showing 1–42 of 42 results for author: Gabrys, R