-
Deletion Correcting Codes for Efficient DNA Synthesis
Authors:
Johan Chrisnata,
Han Mao Kiah,
Van Long Phuoc Pham
Abstract:
The synthesis of DNA strands remains the most costly part of a DNA storage system. Thus, to make DNA storage systems more practical, the time and materials used in the synthesis process have to be optimized. We consider the most common type of synthesis process, where multiple DNA strands are synthesized in parallel from a common alternating supersequence, one nucleotide at a time. The synthesis time, or the number of synthesis cycles, is then determined by the length of this common supersequence. In this model, we design quaternary codes that minimize synthesis time and can correct deletions or insertions, which are the most prevalent types of error in array-based synthesis. We also propose polynomial-time algorithms that encode binary strings into these codes and show that the rate is close to capacity.
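As a rough illustration of the synthesis model described in this abstract, the sketch below counts synthesis cycles for a strand. It assumes the alternating supersequence repeats the fixed order A, C, G, T; that order and the function names `synthesis_cycles` and `batch_synthesis_cycles` are illustrative assumptions, not taken from the paper.

```python
from __future__ import annotations


def synthesis_cycles(strand: str, order: str = "ACGT") -> int:
    """Cycles needed to synthesize `strand` when the machine offers
    nucleotides in the repeating order given by `order` (an assumption)."""
    cycles = 0  # cycles consumed so far
    pos = 0     # index in `order` of the nucleotide offered in the next cycle
    for base in strand:
        target = order.index(base)
        # Wait until the machine offers `base`, then consume that cycle.
        cycles += (target - pos) % len(order) + 1
        pos = (target + 1) % len(order)
    return cycles


def batch_synthesis_cycles(strands: list[str], order: str = "ACGT") -> int:
    """Strands are synthesized in parallel, so the slowest strand
    determines the synthesis time of the whole batch."""
    return max(synthesis_cycles(s, order) for s in strands)
```

Under this assumption, a strand that follows the machine's order (such as `ACGT`) finishes in as many cycles as it has nucleotides, while a strand that runs against the order (such as `TGCA`) needs many more; restricting a code to strands with few cycles is what shortens the common supersequence.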
Submitted 12 May, 2023;
originally announced May 2023.
-
Optimal Reconstruction Codes for Deletion Channels
Authors:
Johan Chrisnata,
Han Mao Kiah,
Eitan Yaakobi
Abstract:
The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. Motivated by modern storage devices, we introduced a variant of the problem where the number of noisy reads $N$ is fixed (Kiah et al. 2020). Of significance, for the single-deletion channel, using $\log_2\log_2 n +O(1)$ redundant bits, we designed a reconstruction code of length $n$ that reconstructs codewords from two distinct noisy reads.
In this work, we show that $\log_2\log_2 n -O(1)$ redundant bits are necessary for such reconstruction codes, thereby demonstrating the optimality of our previous construction. Furthermore, we show that these reconstruction codes can be used in $t$-deletion channels (with $t\ge 2$) to uniquely reconstruct codewords from $n^{t-1}+O\left(n^{t-2}\right)$ distinct noisy reads.
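To make the single-deletion setting concrete, the following sketch enumerates the distinct reads a word can produce and shows that two reads alone may not identify the transmitted word unless the codebook is restricted. The helper names `single_deletion_reads` and `consistent_codewords` are hypothetical, and the brute-force search is only an illustration, not the construction from the paper.

```python
from __future__ import annotations


def single_deletion_reads(word: str) -> set[str]:
    """All distinct words obtained by deleting exactly one symbol."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def consistent_codewords(reads: set[str], codebook: list[str]) -> list[str]:
    """Codewords that could have produced every read in `reads`
    through a single deletion (brute force over the codebook)."""
    return [c for c in codebook if reads <= single_deletion_reads(c)]


# The words 0101 and 1010 share the two reads 101 and 010, so a codebook
# containing both cannot be decoded from those two reads.
ambiguous = consistent_codewords({"101", "010"}, ["0101", "1010"])
```

A reconstruction code for two reads has to exclude such pairs, which is where the redundancy discussed in the abstract comes from.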
Submitted 13 April, 2020;
originally announced April 2020.
-
Efficient Algorithm for the Linear Complexity of Sequences and Some Related Consequences
Authors:
Yeow Meng Chee,
Johan Chrisnata,
Tuvi Etzion,
Han Mao Kiah
Abstract:
The linear complexity of a sequence $s$ is one of the measures of its predictability. It represents the smallest degree of a linear recursion which the sequence satisfies. There are several algorithms to find the linear complexity of a periodic sequence $s$ of length $N$ (where $N$ is of some given form) over a finite field $F_q$ in $O(N)$ symbol field operations. The first such algorithm is the Games-Chan algorithm, which considers binary sequences of period $2^n$ and is known for its extreme simplicity. We generalize this algorithm and apply it efficiently to several families of binary sequences. Our algorithm is very simple: it requires $\beta N$ bit operations for a small constant $\beta$, where $N$ is the period of the sequence. We analyze the number of bit operations required by the algorithm and compare it with previous algorithms. In the process, the algorithm also finds the recursion for the shortest linear feedback shift-register which generates the sequence. Some other interesting properties related to shift-register sequences, which might not be too surprising but are generally unnoted, are also consequences of our exposition.
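For reference, the classical Games-Chan algorithm mentioned in this abstract can be sketched as follows. This is the textbook version for binary sequences of period $2^n$, not the generalization proposed in the paper; the function name `games_chan` is an illustrative choice.

```python
from __future__ import annotations


def games_chan(period: list[int]) -> int:
    """Linear complexity of the binary sequence obtained by repeating
    `period`, whose length must be a power of two."""
    n = len(period)
    if n == 0 or n & (n - 1):
        raise ValueError("period length must be a power of two")
    s = list(period)
    complexity = 0
    while n > 1:
        half = n // 2
        left, right = s[:half], s[half:]
        diff = [a ^ b for a, b in zip(left, right)]
        if any(diff):
            # Halves differ: gain `half` and continue with their XOR.
            complexity += half
            s = diff
        else:
            # Halves agree: the sequence already has period `half`.
            s = left
        n = half
    # A single remaining symbol contributes 1 if it is nonzero.
    return complexity + (1 if s[0] else 0)
```

Each round halves the working length and touches every remaining symbol once, so the total work is linear in the period.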
Submitted 25 December, 2019;
originally announced December 2019.
-
Network-Coding Solutions for Minimal Combination Networks and Their Sub-networks
Authors:
Han Cai,
Johan Chrisnata,
Tuvi Etzion,
Moshe Schwartz,
Antonia Wachter-Zeh
Abstract:
Minimal multicast networks are fascinating and efficient combinatorial objects, where the removal of a single link makes it impossible for all receivers to obtain all messages. We study the structure of such networks, and prove some constraints on their possible solutions.
We then focus on the combination network, which is one of the simplest and most insightful networks in network-coding theory. Of particular interest are minimal combination networks. We study the gap in alphabet size between vector-linear and scalar-linear network-coding solutions for such minimal combination networks and some of their sub-networks.
For minimal multicast networks with two source messages we find the maximum possible gap. We define and study sub-networks of the combination network, which we call Kneser networks, and prove that they attain the upper bound on the gap with equality. We also prove that the study of this gap may be limited to the study of sub-networks of minimal combination networks, by using graph homomorphisms connected with the $q$-analog of Kneser graphs. Additionally, we prove a gap for minimal multicast networks with three or more source messages by studying Kneser networks.
Finally, an upper bound on the gap for full minimal combination networks shows nearly no gap, or none in some cases. This is obtained using an MDS-like bound for subspaces over a finite field.
Submitted 13 September, 2019; v1 submitted 4 January, 2019;
originally announced January 2019.
-
Efficient Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications
Authors:
Yeow Meng Chee,
Johan Chrisnata,
Han Mao Kiah,
Tuan Thanh Nguyen
Abstract:
Tandem duplication is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain et al. (2017) proposed the study of codes that correct tandem duplications. Known code constructions are based on {\em irreducible words}.
We study efficient encoding/decoding methods for irreducible words. First, we describe an $(\ell,m)$-finite state encoder and show that when $m=\Theta(1/\epsilon)$ and $\ell=\Theta(1/\epsilon)$, the encoder achieves a rate that is $\epsilon$ away from optimal. Next, we provide ranking/unranking algorithms for irreducible words and modify the algorithms to reduce the space requirements for the finite state encoder.
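A small check for irreducibility may help fix ideas. The sketch assumes the usual definition from the tandem-duplication literature, under which a word is irreducible with respect to duplications of length at most $k$ when it contains no factor of the form $\alpha\alpha$ with $1\le|\alpha|\le k$; the abstract does not spell the definition out, and the function name `is_irreducible` is hypothetical.

```python
from __future__ import annotations


def is_irreducible(word: str, k: int) -> bool:
    """True if `word` contains no adjacent repeat of a block of
    length between 1 and k (assumed definition of irreducibility)."""
    for length in range(1, k + 1):
        # Slide a window over every position where a repeat could start.
        for start in range(len(word) - 2 * length + 1):
            first = word[start:start + length]
            second = word[start + length:start + 2 * length]
            if first == second:
                return False
    return True
```

This direct scan takes time proportional to the word length times $k^2$ in the worst case; it is meant only to illustrate the objects being encoded, not the finite state encoder or the ranking/unranking algorithms.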
Submitted 8 January, 2018;
originally announced January 2018.
-
Deciding the Confusability of Words under Tandem Repeats
Authors:
Yeow Meng Chee,
Johan Chrisnata,
Han Mao Kiah,
Tuan Thanh Nguyen
Abstract:
Tandem duplication in DNA is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain {\em et al.} (2016) proposed the study of codes that correct tandem duplications to improve the reliability of data storage. We investigate algorithms associated with the study of these codes.
Two words are said to be ${\le}k$-confusable if there exist two sequences of tandem duplications of lengths at most $k$ such that the resulting words are equal. We demonstrate that the problem of deciding whether two words are ${\le}k$-confusable is linear-time solvable through a characterisation that can be checked efficiently for $k=3$. Combining with previous results, the decision problem is linear-time solvable for $k\le 3$. We conjecture that this problem is undecidable for $k>3$.
Using insights gained from the algorithm, we study the size of tandem-duplication codes. We improve the previously known upper bound and then construct codes with larger sizes as compared to the previous constructions. We determine the sizes of optimal tandem-duplication codes for lengths up to twenty, develop recursive methods to construct tandem-duplication codes for all word lengths, and compute explicit lower bounds for the size of optimal tandem-duplication codes for lengths from 21 to 30.
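The definition of ${\le}k$-confusability can be illustrated with a bounded brute-force search. This is not the linear-time characterisation from the paper: it only looks for a common descendant up to an assumed length cap `max_len`, so a negative answer means "none found within the cap". The names `tandem_duplications` and `confusable_within` are hypothetical.

```python
from __future__ import annotations


def tandem_duplications(word: str, k: int) -> set[str]:
    """Words reachable from `word` by one tandem duplication of
    a segment of length at most k."""
    out = set()
    for length in range(1, k + 1):
        for start in range(len(word) - length + 1):
            end = start + length
            # Insert a copy of word[start:end] right after the original.
            out.add(word[:end] + word[start:end] + word[end:])
    return out


def descendants(word: str, k: int, max_len: int) -> set[str]:
    """All words of length <= max_len reachable by repeated duplications."""
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for nxt in tandem_duplications(current, k):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def confusable_within(x: str, y: str, k: int, max_len: int) -> bool:
    """True if x and y share a descendant of length <= max_len."""
    return not descendants(x, k, max_len).isdisjoint(descendants(y, k, max_len))
```

The search space grows quickly with `max_len`, which is one reason an efficiently checkable characterisation is valuable.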
Submitted 17 November, 2017; v1 submitted 12 July, 2017;
originally announced July 2017.
-
Rates of DNA Sequence Profiles for Practical Values of Read Lengths
Authors:
Zuling Chang,
Johan Chrisnata,
Martianus Frederic Ezerman,
Han Mao Kiah
Abstract:
A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$. Consequently, we demonstrate that for $q\ge 2$ and $n\le q^{\ell/2-1}$, the number of profile vectors is at least $q^{\kappa n}$ with $\kappa$ very close to one. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for each of two particular families of profile vectors.
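The sketch below computes a profile vector under the common definition in which the entry for each $\ell$-gram counts its occurrences as a contiguous substring of the word. The abstract does not restate the definition, so this reading, the default alphabet, and the name `profile_vector` are assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter
from itertools import product


def profile_vector(word: str, ell: int, alphabet: str = "ACGT") -> dict[str, int]:
    """Map every ell-gram over `alphabet` to its number of occurrences
    as a contiguous substring of `word`."""
    counts = Counter(word[i:i + ell] for i in range(len(word) - ell + 1))
    grams = ("".join(g) for g in product(alphabet, repeat=ell))
    return {gram: counts[gram] for gram in grams}


# Distinct words can share a profile vector, which is why the number of
# profile vectors can be smaller than the number of words.
same_profile = profile_vector("010", 2, "01") == profile_vector("101", 2, "01")
```

The vector has $q^{\ell}$ entries that sum to $n-\ell+1$, matching the parameters $q$, $\ell$, and $n$ used in the abstract.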
Submitted 8 July, 2016;
originally announced July 2016.