Genomic Compression with Read Alignment at the Decoder
Authors:
Yotam Gershon,
Yuval Cassuto
Abstract:
We propose a new compression scheme for genomic data given as sequence fragments called reads. The scheme uses a reference genome at the decoder side only, freeing the encoder from the burdens of storing references and performing computationally costly alignment operations. The main ingredient of the scheme is a multi-layer code construction, delivering to the decoder sufficient information to align the reads, correct their differences from the reference, validate their reconstruction, and correct reconstruction errors. The core of the method is the well-known concept of distributed source coding with decoder side information, fortified by a generalized-concatenation code construction enabling efficient embedding of all the information needed for reliable reconstruction. We first present the scheme for the case of substitution errors only between the reads and the reference, and then extend it to support reads with a single deletion and multiple substitutions. A central tool in this extension is a new distance metric that is shown analytically to improve alignment performance over existing distance metrics.
Submitted 9 February, 2023; v1 submitted 16 May, 2022;
originally announced May 2022.
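The abstract above relies on distributed source coding with decoder side information. As a rough illustration of that general idea only (not the paper's multi-layer construction, and with no alignment step), the sketch below assumes a binary read of length 7 and a reference segment at the decoder that differs from it in at most one position. The encoder sends only the 3-bit syndrome of the read under a Hamming (7,4) parity-check matrix. The names `syndrome` and `decode_with_side_info` are hypothetical.

```python
# Parity-check matrix of the Hamming (7,4) code.
# Column j (1-indexed) is the binary expansion of j, least significant bit in row 0.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(bits):
    """Encoder side: compress 7 bits to a 3-bit syndrome, no reference needed."""
    return tuple(sum(h * b for h, b in zip(row, bits)) % 2 for row in H)

def decode_with_side_info(read_syndrome, side_info):
    """Decoder side: recover the read from its syndrome and a reference segment.

    Assumes side_info differs from the read in at most one position.
    """
    # XOR of the two syndromes equals the syndrome of the difference pattern.
    diff = tuple(a ^ b for a, b in zip(read_syndrome, syndrome(side_info)))
    # For a single difference, diff is the binary expansion of its 1-indexed position.
    pos = diff[0] + 2 * diff[1] + 4 * diff[2]
    recovered = list(side_info)
    if pos:
        recovered[pos - 1] ^= 1
    return recovered

read = [1, 0, 1, 1, 0, 0, 1]
reference = [1, 0, 1, 0, 0, 0, 1]  # one substitution relative to the read
print(decode_with_side_info(syndrome(read), reference) == read)
```

The encoder never sees the reference; it transmits 3 bits instead of 7, and the decoder resolves the remaining uncertainty using its side information.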
On The Decoding Error Weight of One or Two Deletion Channels
Authors:
Omer Sabary,
Daniella Bar-Lev,
Yotam Gershon,
Alexander Yucovich,
Eitan Yaakobi
Abstract:
This paper tackles two problems that are relevant to coding for insertions and deletions. These problems are motivated by several applications, among them the reconstruction of strands in DNA-based storage systems. Under this paradigm, a word is transmitted over some fixed number of identical independent channels, and the goal of the decoder is to output the transmitted word or some close approximation of it. The first part of this paper studies the deletion channel, which deletes each symbol with some fixed probability $p$, while focusing on two instances of this channel. Since operating the maximum likelihood (ML) decoder in this case is computationally infeasible, we study a slightly degraded version of this decoder for two channels and its expected normalized distance. We identify the dominant error patterns and, based on these observations, derive that the expected normalized distance of the degraded ML decoder is roughly $\frac{3q-1}{q-1}p^2$ when the transmitted word is any $q$-ary sequence and $p$ is the channel's deletion probability. We also study the cases in which the transmitted word belongs to the Varshamov-Tenengolts (VT) code or the shifted VT code. Additionally, the insertion channel is studied, as well as the case of two insertion channels. These theoretical results are verified by corresponding simulations. The second part of the paper studies optimal decoding for a special case of the deletion channel, the $k$-deletion channel, which deletes exactly $k$ symbols of the transmitted word uniformly at random. In this part, the goal is to understand how an optimal decoder operates in order to minimize the expected normalized distance. A full characterization of an efficient optimal decoder for this setup, referred to as the maximum likelihood* (ML*) decoder, is given for a channel that deletes one or two symbols.
Submitted 7 January, 2022;
originally announced January 2022.
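Two objects named in the abstract above, the $k$-deletion channel and the binary VT code, can be sketched in a few lines. The code below is an illustrative assumption, not the paper's ML* decoder or its degraded ML decoder: it simulates a channel that deletes exactly $k$ symbols at uniformly random positions, and recovers a VT codeword after a single deletion by brute-force search over all one-symbol insertions. The function names are hypothetical.

```python
import random

def k_deletion_channel(word, k, rng):
    """Delete exactly k symbols; the set of deleted positions is uniform at random."""
    drop = set(rng.sample(range(len(word)), k))
    return [s for i, s in enumerate(word) if i not in drop]

def vt_syndrome(word):
    """Binary VT syndrome: sum of i * x_i over 1-indexed positions, mod (n + 1)."""
    n = len(word)
    return sum(i * x for i, x in enumerate(word, start=1)) % (n + 1)

def vt_decode_single_deletion(received, a=0):
    """Return all length-(n) words with VT syndrome a that contain `received`
    as a subsequence obtained by one deletion (brute force, for illustration)."""
    candidates = set()
    for pos in range(len(received) + 1):
        for bit in (0, 1):
            candidates.add(tuple(received[:pos]) + (bit,) + tuple(received[pos:]))
    return [c for c in candidates if vt_syndrome(c) == a]

codeword = [1, 0, 0, 0, 0, 1]           # 1*1 + 6*1 = 7, which is 0 mod 7
received = k_deletion_channel(codeword, 1, random.Random(0))
print(vt_decode_single_deletion(received))
```

Because the VT code corrects a single deletion, the search returns exactly one candidate whenever the transmitted word is a codeword with syndrome `a`. With `k = 2` the same channel function produces the two-deletion case, for which this brute-force decoder is no longer guaranteed to be unique.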