Recovering a Message from an Incomplete Set of Noisy Fragments

Authors: Aditya Narayan Ravi, Alireza Vahid, Ilan Shomorony

Abstract: We consider the problem of communicating over a channel that breaks the message block into fragments of random lengths, shuffles them out of order, and deletes a random fraction of the fragments. Such a channel is motivated by applications in molecular data storage and forensics, and we refer to it as the torn-paper channel. We characterize the capacity of this channel under arbitrary fragment length distributions and deletion probabilities. Precisely, we show that the capacity is given by a closed-form expression that can be interpreted as F - A, where F is the coverage fraction, i.e., the fraction of the input codeword that is covered by output fragments, and A is an alignment cost incurred due to the lack of ordering in the output fragments. We then consider a noisy version of the problem, where the fragments are corrupted by binary symmetric noise. We derive upper and lower bounds to the capacity, both of which can be seen as F - A expressions. These bounds match for specific choices of fragment length distributions, and they are approximately tight in cases where there are not too many short fragments.
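To make the channel description above concrete, here is a minimal simulation sketch. It is not taken from the paper: the geometric fragment-length distribution, the function names, and the empirical `coverage_fraction` helper are all illustrative assumptions (the abstract allows arbitrary length distributions, and its F is a quantity in the capacity expression, not a per-sample statistic).

```python
import random


def torn_paper_channel(codeword, mean_len, delete_prob, rng):
    """Tear `codeword` into random-length fragments, drop some, shuffle the rest.

    Assumption: fragment lengths are geometric with mean `mean_len`;
    this is only one possible choice of length distribution.
    """
    fragments = []
    start = 0
    while start < len(codeword):
        # Draw a geometric length (support 1, 2, ...) with success prob 1/mean_len.
        length = 1
        while rng.random() > 1.0 / mean_len:
            length += 1
        fragments.append(codeword[start:start + length])
        start += length
    # Each fragment is deleted independently with probability `delete_prob`.
    kept = [f for f in fragments if rng.random() >= delete_prob]
    # The receiver sees the surviving fragments in arbitrary order.
    rng.shuffle(kept)
    return kept


def coverage_fraction(fragments, n):
    """Empirical fraction of the length-n input covered by the output fragments."""
    return sum(len(f) for f in fragments) / n


if __name__ == "__main__":
    rng = random.Random(0)
    x = "".join(rng.choice("01") for _ in range(60))
    y = torn_paper_channel(x, mean_len=6, delete_prob=0.25, rng=rng)
    print(y, coverage_fraction(y, len(x)))
```

Because the fragments are disjoint pieces of the input, the empirical coverage is simply the total surviving length divided by the codeword length.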

Submitted 7 July, 2024; originally announced July 2024.

Comments: 43 pages, 3 figures

arXiv:2405.07785 [pdf, ps, other]

Capacity of Frequency-based Channels: Encoding Information in Molecular Concentrations

Authors: Yuval Gerzon, Ilan Shomorony, Nir Weinberger

Abstract: We consider a molecular channel, in which messages are encoded to the frequency of objects (or concentration of molecules) in a pool, and whose output during reading time is a noisy version of the input frequencies, as obtained by sampling with replacement from the pool. We tightly characterize the capacity of this channel using upper and lower bounds, when the number of objects in the pool of objects is constrained. We apply this result to the DNA storage channel in the short-molecule regime, and show that even though the capacity of this channel is technically zero, it can still achieve a large information density.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2401.14277 [pdf, ps, other]

An Instance-Based Approach to the Trace Reconstruction Problem

Authors: Kayvon Mazooji, Ilan Shomorony

Abstract: In the trace reconstruction problem, one observes the output of passing a binary string $s \in \{0,1\}^n$ through a deletion channel $T$ times and wishes to recover $s$ from the resulting $T$ "traces." Most of the literature has focused on characterizing the hardness of this problem in terms of the number of traces $T$ needed for perfect reconstruction either in the worst case or in the average case (over input sequences $s$). In this paper, we propose an alternative, instance-based approach to the problem. We define the "Levenshtein difficulty" of a problem instance $(s,T)$ as the probability that the resulting traces do not provide enough information for correct recovery with full certainty. One can then try to characterize, for a specific $s$, how $T$ needs to scale in order for the Levenshtein difficulty to go to zero, and seek reconstruction algorithms that match this scaling for each $s$. For a class of binary strings with alternating long runs, we precisely characterize the scaling of $T$ for which the Levenshtein difficulty goes to zero. For this class, we also prove that a simple "Las Vegas algorithm" has an error probability that decays to zero with the same rate as that with which the Levenshtein difficulty tends to zero.
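The trace model described above can be sketched in a few lines. This is only an illustration of how traces are generated by an i.i.d. deletion channel; the function names and the fixed seed are assumptions, and no reconstruction algorithm from the paper is implemented here.

```python
import random


def deletion_channel(s, delete_prob, rng):
    """Return one trace: each symbol of `s` is deleted independently."""
    return "".join(ch for ch in s if rng.random() >= delete_prob)


def draw_traces(s, num_traces, delete_prob, seed=0):
    """Pass `s` through the deletion channel `num_traces` times (T in the abstract)."""
    rng = random.Random(seed)
    return [deletion_channel(s, delete_prob, rng) for _ in range(num_traces)]


def is_subsequence(trace, s):
    """Every trace must be a subsequence of the source string."""
    it = iter(s)
    # `ch in it` advances the iterator, so symbols are matched in order.
    return all(ch in it for ch in trace)


if __name__ == "__main__":
    source = "0000111100001111"  # alternating long runs, as in the abstract
    for t in draw_traces(source, num_traces=4, delete_prob=0.2):
        print(t, is_subsequence(t, source))
```

Any candidate reconstruction must have every observed trace as a subsequence, which is why the subsequence check is a natural sanity test.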

Submitted 21 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: 7 pages, part of this paper was presented at the 58th Annual Conference on Information Sciences and Systems (CISS 2024), funding information added in updated document

arXiv:2310.18844 [pdf, other]

BanditPAM++: Faster $k$-medoids Clustering

Authors: Mo Tiwari, Ryan Kang, Donghyun Lee, Sebastian Thrun, Chris Piech, Ilan Shomorony, Martin Jinye Zhang

Abstract: Clustering is a fundamental task in data science with wide-ranging applications. In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects in $k$-medoids clustering, respectively. $k$-medoids clustering has recently grown in popularity due to the discovery of more efficient $k$-medoids algorithms. In particular, recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy. In this paper, we present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements, and is $O(k)$ faster than BanditPAM in complexity and substantially faster than BanditPAM in wall-clock runtime. First, we demonstrate that BanditPAM has a special structure that allows the reuse of clustering information $\textit{within}$ each iteration. Second, we demonstrate that BanditPAM has additional structure that permits the reuse of information $\textit{across}$ different iterations. These observations inspire our proposed algorithm, BanditPAM++, which returns the same clustering solutions as BanditPAM but often several times faster. For example, on the CIFAR10 dataset, BanditPAM++ returns the same results as BanditPAM but runs over 10$\times$ faster. Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, that may be of interest to practitioners at https://github.com/motiwari/BanditPAM. Auxiliary code to reproduce all of our experiments via a one-line script is available at https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.
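For readers unfamiliar with the $k$-medoids objective mentioned above, the following sketch spells it out with an exhaustive search over candidate medoid sets. This is deliberately not BanditPAM or BanditPAM++ and does not use the linked library; the function names are hypothetical, and brute force is only viable for tiny inputs. It illustrates the two properties named in the abstract: medoids are actual datapoints, and the distance function is arbitrary.

```python
from itertools import combinations


def kmedoids_loss(points, medoids, dist):
    """Sum over points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def brute_force_kmedoids(points, k, dist):
    """Try every size-k subset of the data as medoids (exponential; toy sizes only)."""
    return min(
        combinations(points, k),
        key=lambda medoids: kmedoids_loss(points, medoids, dist),
    )


if __name__ == "__main__":
    data = [0, 1, 2, 10, 11, 12]
    best = brute_force_kmedoids(data, k=2, dist=lambda a, b: abs(a - b))
    print(best, kmedoids_loss(data, best, lambda a, b: abs(a - b)))
```

Any callable can be passed as `dist`, for example an edit distance on strings, which is what makes the medoid formulation usable on non-vector data.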

Submitted 28 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023

MSC Class: 68

ACM Class: I.m; I.2.0; I.2.6; K.3.2; I.2.m

arXiv:2310.04515 [pdf, other]

Utilizing Free Clients in Federated Learning for Focused Model Enhancement

Authors: Aditya Narayan Ravi, Ilan Shomorony

Abstract: Federated Learning (FL) is a distributed machine learning approach to learn models on decentralized heterogeneous data, without the need for clients to share their data. Many existing FL approaches assume that all clients have equal importance and construct a global objective based on all clients. We consider a version of FL we call Prioritized FL, where the goal is to learn a weighted mean objective of a subset of clients, designated as priority clients. An important question arises: How do we choose and incentivize well-aligned non-priority clients to participate in the federation, while discarding misaligned clients? We present FedALIGN (Federated Adaptive Learning with Inclusion of Global Needs) to address this challenge. The algorithm employs a matching strategy that chooses non-priority clients based on how similar the model's loss is on their data compared to the global data, thereby ensuring the use of non-priority client gradients only when it is beneficial for priority clients. This approach ensures mutual benefits as non-priority clients are motivated to join when the model performs satisfactorily on their data, and priority clients can utilize their updates and computational resources when their goals align. We present a convergence analysis that quantifies the trade-off between client selection and speed of convergence. Our algorithm shows faster convergence and higher test accuracy than baselines for various synthetic and benchmark datasets.

Submitted 6 October, 2023; originally announced October 2023.

Comments: 26 pages, 6 figures

arXiv:2307.10080 [pdf, ps, other]

Fundamental Limits of Reference-Based Sequence Reordering

Authors: Nir Weinberger, Ilan Shomorony

Abstract: The problem of reconstructing a sequence of independent and identically distributed symbols from a set of equal-size, consecutive fragments, as well as a dependent reference sequence, is considered. First, in the regime in which the fragments are relatively long, and typically no fragment appears more than once, the scaling of the failure probability of the maximum likelihood reconstruction algorithm is exactly determined for perfect reconstruction and bounded for partial reconstruction. Second, the regime in which the fragments are relatively short and repeating fragments abound is characterized. A trade-off is stated between the fraction of fragments that cannot be adequately reconstructed vs. the distortion level allowed for the reconstruction of each fragment, while still allowing vanishing failure probability.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2305.05820 [pdf, other]

Fundamental Limits of Multiple Sequence Reconstruction from Substrings

Authors: Kel Levick, Ilan Shomorony

Abstract: The problem of reconstructing a sequence from the set of its length-$k$ substrings has received considerable attention due to its various applications in genomics. We study an uncoded version of this problem where multiple random sources are to be simultaneously reconstructed from the union of their $k$-mer sets. We consider an asymptotic regime where $m = n^α$ i.i.d. source sequences of length $n$ are to be reconstructed from the set of their substrings of length $k=β\log n$, and seek to characterize the $(α,β)$ pairs for which reconstruction is information-theoretically feasible. We show that, as $n \to \infty$, the source sequences can be reconstructed if $β> \max(2α+1,α+2)$ and cannot be reconstructed if $β< \max( 2α+1, α+ \tfrac32)$, characterizing the feasibility region almost completely. Interestingly, our result shows that there are feasible $(α,β)$ pairs where repeats across the source strings abound, and non-trivial reconstruction algorithms are needed to achieve the fundamental limit.
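As a small illustration of the observation model above, the sketch below computes the union of the $k$-mer sets of several source strings. The helper names are assumptions for illustration; the reconstruction step, which is the subject of the paper, is not attempted here.

```python
def kmer_set(seq, k):
    """Set of all length-k substrings of `seq` (multiplicities are lost)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def union_kmer_set(seqs, k):
    """What the decoder observes: the union of the k-mer sets of all sources."""
    observed = set()
    for s in seqs:
        observed |= kmer_set(s, k)
    return observed


if __name__ == "__main__":
    sources = ["ACGTAC", "GTACGG"]
    # k-mers shared across sources (e.g. "GTA", "TAC") appear only once in the union.
    print(sorted(union_kmer_set(sources, 3)))
```

Because the output is a set, both the order of the $k$-mers and the identity of the source each one came from are lost, which is what makes repeats across sources the hard case.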

Submitted 9 May, 2023; originally announced May 2023.

Comments: 7 pages, 2 figures

arXiv:2304.01365 [pdf, ps, other]

Finding a Burst of Positives via Nonadaptive Semiquantitative Group Testing

Authors: Yun-Han Li, Ryan Gabrys, Jin Sima, Ilan Shomorony, Olgica Milenkovic

Abstract: Motivated by testing for pathogenic diseases we consider a new nonadaptive group testing problem for which: (1) positives occur within a burst, capturing the fact that infected test subjects often come in clusters, and (2) that the test outcomes arise from semiquantitative measurements that provide coarse information about the number of positives in any tested group. Our model generalizes prior work on detecting a single burst of positives with classical group testing[1] as well as work on semiquantitative group testing (SQGT)[2]. Specifically, we study the setting where the burst-length $\ell$ is known and the semiquantitative tests provide potentially nonuniform estimates on the number of positives in a test group. The estimates represent the index of a quantization bin containing the (exact) total number of positives, for arbitrary thresholds $η_1,\dots,η_s$. Interestingly, we show that the minimum number of tests needed for burst identification is essentially only a function of the largest threshold $η_s$. In this context, our main result is an order-optimal test scheme that can recover any burst of length $\ell$ using roughly $\frac{\ell}{2η_s}+\log_{s+1}(n)$ measurements. This suggests that a large saturation level $η_s$ is more important than finely quantized information when dealing with bursts. We also provide results for related modeling assumptions and specialized choices of thresholds.
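The semiquantitative measurement described above can be sketched as a quantizer applied to the number of positives in a pool. The boundary convention used here (outcome $i$ when $η_i \le$ count $< η_{i+1}$, so outcomes range over $0,\dots,s$) and the function names are assumptions for illustration; the paper's exact convention may differ, and the test design itself is not reproduced.

```python
from bisect import bisect_right


def sqgt_outcome(num_positives, thresholds):
    """Index of the quantization bin containing `num_positives`.

    `thresholds` is the increasing list eta_1 < ... < eta_s. The outcome
    is the number of thresholds that are <= num_positives (assumed convention).
    """
    return bisect_right(thresholds, num_positives)


def pooled_test_outcome(pool, burst_start, burst_len, thresholds):
    """Outcome of testing `pool` when positives form one consecutive burst."""
    burst_end = burst_start + burst_len
    positives = sum(1 for i in pool if burst_start <= i < burst_end)
    return sqgt_outcome(positives, thresholds)


if __name__ == "__main__":
    etas = [1, 3, 6]
    # Subjects 4, 5, 6 are positive; the pool contains subjects 0..9.
    print(pooled_test_outcome(range(10), burst_start=4, burst_len=3, thresholds=etas))
```

Once the count reaches the largest threshold the outcome saturates, which matches the abstract's description of $η_s$ as a saturation level.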

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.12990 [pdf, ps, other]

On Constant-Weight Binary $B_2$-Sequences

Authors: Jin Sima, Yun-Han Li, Ilan Shomorony, Olgica Milenkovic

Abstract: Motivated by applications in polymer-based data storage we introduce the new problem of characterizing the code rate and designing constant-weight binary $B_2$-sequences. Binary $B_2$-sequences are collections of binary strings of length $n$ with the property that the real-valued sums of all distinct pairs of strings are distinct. In addition to this defining property, constant-weight binary $B_2$-sequences also satisfy the constraint that each string has a fixed, relatively small weight $ω$ that scales linearly with $n$. The constant-weight constraint ensures low-cost synthesis and uniform processing of the data readout via tandem mass spectrometers. Our main results include upper bounds on the size of the codes formulated as entropy-optimization problems and constructive lower bounds based on Sidon sequences.
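The defining property stated above can be checked directly on small examples. The sketch below follows the abstract's wording (sums over distinct pairs, taken coordinate-wise over the integers); the function names are hypothetical, and the paper's formal definition may handle edge cases differently.

```python
from itertools import combinations


def is_b2_collection(strings):
    """True if the integer-valued sums of all distinct pairs are distinct."""
    seen = set()
    for a, b in combinations(strings, 2):
        # Coordinate-wise sum over the integers, so entries lie in {0, 1, 2}.
        pair_sum = tuple(int(x) + int(y) for x, y in zip(a, b))
        if pair_sum in seen:
            return False
        seen.add(pair_sum)
    return True


def has_constant_weight(strings, weight):
    """True if every string has exactly `weight` ones."""
    return all(s.count("1") == weight for s in strings)


if __name__ == "__main__":
    good = ["1100", "1010", "0110", "0011"]
    bad = ["1100", "0011", "1010", "0101"]  # 1100+0011 == 1010+0101 == 1111
    print(is_b2_collection(good), is_b2_collection(bad))
```

The second collection fails because two different pairs produce the same sum, so a readout of the sum alone could not identify which pair was stored.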

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2212.07551 [pdf, ps, other]

Faster Maximum Inner Product Search in High Dimensions

Authors: Mo Tiwari, Ryan Kang, Je-Yong Lee, Donghyun Lee, Chris Piech, Sebastian Thrun, Ilan Shomorony, Martin Jinye Zhang

Abstract: Maximum Inner Product Search (MIPS) is a ubiquitous task in machine learning applications such as recommendation systems. Given a query vector and $n$ atom vectors in $d$-dimensional space, the goal of MIPS is to find the atom that has the highest inner product with the query vector. Existing MIPS algorithms scale at least as $O(\sqrt{d})$, which becomes computationally prohibitive in high-dimensional settings. In this work, we present BanditMIPS, a novel randomized MIPS algorithm whose complexity is independent of $d$. BanditMIPS estimates the inner product for each atom by subsampling coordinates and adaptively evaluates more coordinates for more promising atoms. The specific adaptive sampling strategy is motivated by multi-armed bandits. We provide theoretical guarantees that BanditMIPS returns the correct answer with high probability, while improving the complexity in $d$ from $O(\sqrt{d})$ to $O(1)$. We also perform experiments on four synthetic and real-world datasets and demonstrate that BanditMIPS outperforms prior state-of-the-art algorithms. For example, in the Movie Lens dataset ($n$=4,000, $d$=6,000), BanditMIPS is 20$\times$ faster than the next best algorithm while returning the same answer. BanditMIPS requires no preprocessing of the data and includes a hyperparameter that practitioners may use to trade off accuracy and runtime. We also propose a variant of our algorithm, named BanditMIPS-$α$, which achieves further speedups by employing non-uniform sampling across coordinates. Finally, we demonstrate how known preprocessing techniques can be used to further accelerate BanditMIPS, and discuss applications to Matching Pursuit and Fourier analysis.

Submitted 26 June, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: 24 pages

arXiv:2212.07473 [pdf, ps, other]

MABSplit: Faster Forest Training Using Multi-Armed Bandits

Authors: Mo Tiwari, Ryan Kang, Je-Yong Lee, Sebastian Thrun, Chris Piech, Ilan Shomorony, Martin Jinye Zhang

Abstract: Random forests are some of the most widely used machine learning models today, especially in domains that necessitate interpretability. We present an algorithm that accelerates the training of random forests and other popular tree-based learning methods. At the core of our algorithm is a novel node-splitting subroutine, dubbed MABSplit, used to efficiently find split points when constructing decision trees. Our algorithm borrows techniques from the multi-armed bandit literature to judiciously determine how to allocate samples and computational power across candidate split points. We provide theoretical guarantees that MABSplit improves the sample complexity of each node split from linear to logarithmic in the number of data points. In some settings, MABSplit leads to 100x faster training (a 99% reduction in training time) without any decrease in generalization performance. We demonstrate similar speedups when MABSplit is used across a variety of forest-based variants, such as Extremely Random Forests and Random Patches. We also show our algorithm can be used in both classification and regression tasks. Finally, we show that MABSplit outperforms existing methods in generalization performance and feature importance calculations under a fixed computational budget. All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest.

Submitted 14 December, 2022; originally announced December 2022.

Comments: Published at NeurIPS 2022, 30 pages

ACM Class: I.2.8

arXiv:2211.05552 [pdf, other]

doi 10.1561/0100000117

Information-Theoretic Foundations of DNA Data Storage

Authors: Ilan Shomorony, Reinhard Heckel

Abstract: Due to its longevity and enormous information density, DNA is an attractive medium for archival data storage. Thanks to rapid technological advances, DNA storage is becoming practically feasible, as demonstrated by a number of experimental storage systems, making it a promising solution for our society's increasing need of data storage. While in living things, DNA molecules can consist of millions of nucleotides, due to technological constraints, in practice, data is stored on many short DNA molecules, which are preserved in a DNA pool and cannot be spatially ordered. Moreover, imperfections in sequencing, synthesis, and handling, as well as DNA decay during storage, introduce random noise into the system, making the task of reliably storing and retrieving information in DNA challenging. This unique setup raises a natural information-theoretic question: how much information can be reliably stored on and reconstructed from millions of short noisy sequences? The goal of this monograph is to address this question by discussing the fundamental limits of storing information on DNA. Motivated by current technological constraints on DNA synthesis and sequencing, we propose a probabilistic channel model that captures three key distinctive aspects of the DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered fashion; (2) the molecules are corrupted by noise and (3) the data is read by randomly sampling from the DNA pool. Our goal is to investigate the impact of each of these key aspects on the capacity of the DNA storage system. Rather than focusing on coding-theoretic considerations and computationally efficient encoding and decoding, we aim to build an information-theoretic foundation for the analysis of these channels, developing tools for achievability and converse arguments.

Submitted 10 November, 2022; originally announced November 2022.

Comments: Preprint of a monograph published in Foundations and Trends in Communications and Information Theory

Journal ref: Foundations and Trends in Communications and Information Theory, Vol. 19, No. 1, pp 1-106, 2022

arXiv:2210.10917 [pdf, other]

Substring Density Estimation from Traces

Authors: Kayvon Mazooji, Ilan Shomorony

Abstract: In the trace reconstruction problem, one seeks to reconstruct a binary string $s$ from a collection of traces, each of which is obtained by passing $s$ through a deletion channel. It is known that $\exp(\tilde O(n^{1/5}))$ traces suffice to reconstruct any length-$n$ string with high probability. We consider a variant of the trace reconstruction problem where the goal is to recover a "density map" that indicates the locations of each length-$k$ substring throughout $s$. We show that $ε^{-2}\cdot \text{poly}(n)$ traces suffice to recover the density map with error at most $ε$. As a result, when restricted to a set of source strings whose minimum "density map distance" is at least $1/\text{poly}(n)$, the trace reconstruction problem can be solved with polynomially many traces.

Submitted 19 October, 2022; originally announced October 2022.

Comments: 22 pages, 3 figures

arXiv:2201.03590 [pdf, other]

Reassembly Codes for the Chop-and-Shuffle Channel

Authors: Sajjad Nassirpour, Ilan Shomorony, Alireza Vahid

Abstract: We study the problem of retrieving data from a channel that breaks the input sequence into a set of unordered fragments of random lengths, which we refer to as the chop-and-shuffle channel. The length of each fragment follows a geometric distribution. We propose nested Varshamov-Tenengolts (VT) codes to recover the data. We evaluate the error rate and the complexity of our scheme numerically. Our results show that the decoding error decreases as the input length increases, and our method has a significantly lower complexity than the baseline brute-force approach. We also propose a new construction for VT codes, quantify the maximum number of the required parity bits, and show that our approach requires fewer parity bits compared to known results.
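As background for the VT codes mentioned above, the sketch below implements the classical Varshamov-Tenengolts checksum, where a length-$n$ binary word $x$ belongs to the code with parameter $a$ when $\sum_i i\,x_i \equiv a \pmod{n+1}$. This is the textbook construction only; the nested codes and the new construction proposed in the paper are not reproduced, and the helper names are assumptions.

```python
from itertools import product


def vt_syndrome(x):
    """Classical VT checksum: sum of i * x_i (1-indexed) modulo n + 1."""
    n = len(x)
    return sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1)


def vt_code(n, a=0):
    """All length-n binary words with VT syndrome `a` (exhaustive; small n only)."""
    return [x for x in product((0, 1), repeat=n) if vt_syndrome(x) == a]


def deletion_ball(x):
    """All words obtained from `x` by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


if __name__ == "__main__":
    code = vt_code(4, a=0)
    print(code)
    # Single-deletion balls of distinct codewords do not overlap,
    # which is why one deleted bit can be corrected.
    balls = [deletion_ball(c) for c in code]
    print(all(b1.isdisjoint(b2) for i, b1 in enumerate(balls) for b2 in balls[i + 1:]))
```

The disjointness check is a brute-force way to see the single-deletion-correcting property on a small example; practical decoders use the checksum directly instead of enumerating balls.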

Submitted 10 January, 2022; originally announced January 2022.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2112.01630 [pdf, other]

Achieving the Capacity of a DNA Storage Channel with Linear Coding Schemes

Authors: Kel Levick, Reinhard Heckel, Ilan Shomorony

Abstract: Due to the redundant nature of DNA synthesis and sequencing technologies, a basic model for a DNA storage system is a multi-draw "shuffling-sampling" channel. In this model, a random number of noisy copies of each sequence is observed at the channel output. Recent works have characterized the capacity of such a DNA storage channel under different noise and sequencing models, relying on sophisticated typicality-based approaches for the achievability. Here, we consider a multi-draw DNA storage channel in the setting of noise corruption by a binary erasure channel. We show that, in this setting, the capacity is achieved by linear coding schemes. This leads to a considerably simpler derivation of the capacity expression of a multi-draw DNA storage channel than existing results in the literature.

Submitted 2 December, 2021; originally announced December 2021.

Comments: 6 pages, 5 figures, 2 appendices, submitted to CISS 2022

arXiv:2110.02868 [pdf, other]

Coded Shotgun Sequencing

Authors: Aditya Narayan Ravi, Alireza Vahid, Ilan Shomorony

Abstract: Most DNA sequencing technologies are based on the shotgun paradigm: many short reads are obtained from random unknown locations in the DNA sequence. A fundamental question, studied in arXiv:1203.6233, is what read length and coverage depth (i.e., the total number of reads) are needed to guarantee reliable sequence reconstruction. Motivated by DNA-based storage, we study the coded version of this problem, i.e., the scenario where the DNA molecule being sequenced is a codeword from a predefined codebook. Our main result is an exact characterization of the capacity of the resulting shotgun sequencing channel as a function of the read length and coverage depth. In particular, our results imply that, while in the uncoded case, $O(n)$ reads of length greater than $2\log{n}$ are needed for reliable reconstruction of a length-$n$ binary sequence, in the coded case, only $O(n/\log{n})$ reads of length greater than $\log{n}$ are needed for the capacity to be arbitrarily close to $1$.

Submitted 7 February, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: 35 pages, 4 figures, 8 appendices

arXiv:2107.04202 [pdf, other]

Sketching and Sequence Alignment: A Rate-Distortion Perspective

Authors: Ilan Shomorony, Govinda M. Kamath

Abstract: Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length $n$ and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require $B = (1/D) \cdot O\left( \log n \right)$ bits to achieve a distortion $D$, our proposed approach only requires $B = \log^2(1/D) \cdot O(1)$ bits. This can lead to significant computational savings in pairwise alignment estimation.

Submitted 9 July, 2021; originally announced July 2021.

arXiv:2101.12124 [pdf, other]

Private DNA Sequencing: Hiding Information in Discrete Noise

Authors: Kayvon Mazooji, Roy Dong, Ilan Shomorony

Abstract: When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume these samples are known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own DNA samples afterward. Motivated by this idea, we study the problem of hiding a binary random variable X (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to the problem of finding a worst-case noise distribution for recovering X from the noisy observation among a set of feasible discrete distributions. We characterize upper and lower bounds to the solution of this problem, which are empirically shown to be very close. The lower bound is obtained through a convex relaxation of the original discrete optimization problem, and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.

Submitted 28 January, 2021; originally announced January 2021.

Comments: 10 pages, 5 figures, shorter version to appear in proceedings of ITW 2020

arXiv:2011.04832 [pdf, other]

Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment

Authors: Govinda M. Kamath, Tavor Z. Baharav, Ilan Shomorony

Abstract: Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. State-of-the-art approaches to speed up this task use hashing to identify short segments (k-mers) that are shared by pairs of reads, which can then be used to estimate alignment scores. However, when the number of reads is large, accurately estimating alignment scores for all pairs is still very costly. Moreover, in practice, one is only interested in identifying pairs of reads with large alignment scores. In this work, we propose a new approach to pairwise alignment estimation based on two key new ingredients. The first ingredient is to cast the problem of pairwise alignment estimation under a general framework of rank-one crowdsourcing models, where the workers' responses correspond to k-mer hash collisions. These models can be accurately solved via a spectral decomposition of the response matrix. The second ingredient is to utilise a multi-armed bandit algorithm to adaptively refine this spectral estimator only for read pairs that are likely to have large alignments. The resulting algorithm iteratively performs a spectral decomposition of the response matrix for adaptively chosen subsets of the read pairs.

Submitted 12 February, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

Comments: NeurIPS 2020

arXiv:2006.06856 [pdf, other]

BanditPAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed Bandits

Authors: Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony

Abstract: Clustering is a ubiquitous task in data science. Compared to the commonly used $k$-means clustering, $k$-medoids clustering requires the cluster centers to be actual data points and supports arbitrary distance metrics, which permits greater interpretability and the clustering of structured objects. Current state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose BanditPAM, a randomized algorithm inspired by techniques from multi-armed bandits, that reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. As such, BanditPAM matches state-of-the-art clustering loss while reaching solutions much faster. We empirically validate our results on several large real-world datasets, including a coding exercise submissions dataset, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. In these experiments, we observe that BanditPAM returns the same results as state-of-the-art PAM-like algorithms up to 4x faster while performing up to 200x fewer distance computations. The improvements demonstrated by BanditPAM enable $k$-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release highly optimized Python and C++ implementations of our algorithm.

Submitted 6 December, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: 21 pages, NeurIPS 2020
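
To make the quadratic cost mentioned in the abstract above concrete, here is a minimal sketch of the exact single-medoid computation, which evaluates every pairwise distance. This is the brute-force baseline, not BanditPAM itself and not the authors' released implementation; the function name `exact_medoid_index` and the example distance are assumptions for illustration.

```python
def exact_medoid_index(points, dist):
    """Return the index of the point with the smallest total distance to all others.

    Uses n * n distance evaluations, which is the quadratic cost that
    sampling-based methods aim to avoid.
    """
    if not points:
        raise ValueError("points must be non-empty")
    best_index, best_cost = 0, float("inf")
    for i, candidate in enumerate(points):
        cost = sum(dist(candidate, other) for other in points)
        if cost < best_cost:  # strict comparison keeps the first of any ties
            best_index, best_cost = i, cost
    return best_index

if __name__ == "__main__":
    values = [0, 1, 2, 3, 100]
    print(exact_medoid_index(values, lambda a, b: abs(a - b)))
```

Because `dist` is an arbitrary callable, the same routine works for structured objects (strings, trees) as long as a distance between two of them can be computed.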

arXiv:2005.12895 [pdf, other]

Communicating over the Torn-Paper Channel

Authors: Ilan Shomorony, Alireza Vahid

Abstract: We consider the problem of communicating over a channel that randomly "tears" the message block into small pieces of different sizes and shuffles them. For the binary torn-paper channel with block length $n$ and pieces of length ${\rm Geometric}(p_n)$, we characterize the capacity as $C = e^{-\alpha}$, where $\alpha = \lim_{n\to\infty} p_n \log n$. Our results show that the case of ${\rm Geometric}(p_n)$-length fragments and the case of deterministic length-$(1/p_n)$ fragments are qualitatively different and, surprisingly, the capacity of the former is larger. Intuitively, this is due to the fact that, in the random fragments case, large fragments are sometimes observed, which boosts the capacity.

Submitted 26 May, 2020; originally announced May 2020.
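
As a rough numerical illustration of the capacity expression in the abstract above, the sketch below evaluates $e^{-p_n \log n}$ at a finite block length in place of the limit. The function name `torn_paper_capacity_estimate` is hypothetical, and the default base-2 logarithm is an assumption (the abstract does not state the base), so the base is exposed as a parameter.

```python
import math

def torn_paper_capacity_estimate(p_n, n, log_base=2.0):
    """Finite-n stand-in for C = exp(-alpha), with alpha = lim p_n * log n.

    log_base=2 is an assumption; pass math.e to use natural logarithms.
    """
    if not 0.0 < p_n <= 1.0:
        raise ValueError("p_n must lie in (0, 1]")
    if n < 2:
        raise ValueError("n must be at least 2")
    alpha = p_n * math.log(n, log_base)
    return math.exp(-alpha)

if __name__ == "__main__":
    n = 2 ** 20
    # A Geometric(p) fragment has mean length 1/p, so sweep the mean length.
    for mean_length in (10, 20, 40, 80):
        estimate = torn_paper_capacity_estimate(1.0 / mean_length, n)
        print(mean_length, round(estimate, 4))
```

Under these assumptions the estimate increases with the mean fragment length, which matches the intuition in the abstract that longer fragments are easier to place.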

arXiv:2001.06311 [pdf, other]

DNA-Based Storage: Models and Fundamental Limits

Authors: Ilan Shomorony, Reinhard Heckel

Abstract: Due to its longevity and enormous information density, DNA is an attractive medium for archival storage. In this work, we study the fundamental limits and trade-offs of DNA-based storage systems by introducing a new channel model, which we call the noisy shuffling-sampling channel. Motivated by current technological constraints on DNA synthesis and sequencing, this model captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules; (2) the molecules are corrupted by noise during synthesis and sequencing and (3) the data is read by randomly sampling from the DNA pool. We provide capacity results for this channel under specific noise and sampling assumptions and show that, in many scenarios, a simple index-based coding scheme is optimal.

Submitted 16 January, 2020; originally announced January 2020.

Comments: Submitted to IEEE Transactions on Information Theory; in parts presented at ISIT 2017 and ISIT 2019. arXiv admin note: text overlap with arXiv:1705.04732

arXiv:1912.05741 [pdf, other]

The Metagenomic Binning Problem: Clustering Markov Sequences

Authors: G. Greenberg, I. Shomorony

Abstract: The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M<<N. Utilizing the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff Information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as a measure of distance, as opposed to the Euclidean distance often used in practice.

Submitted 11 December, 2019; originally announced December 2019.

Comments: IEEE-ITW 2019 Proceedings, 9 pages, 3 figures
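
The distance suggested in the abstract above can be illustrated with a generic conditional relative entropy between the empirical Markov transitions of a contig and a species model. This is a minimal sketch of the textbook quantity, not necessarily the estimator used in the paper: the order-3 context (so that context plus next symbol is a tetranucleotide), the additive smoothing, and all function names are assumptions.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def transition_probs(seq, order=3, pseudocount=1.0):
    """Estimate P(next symbol | context) with additive smoothing."""
    counts = Counter(seq[i:i + order + 1] for i in range(len(seq) - order))
    probs = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx + x] for x in ALPHABET) + pseudocount * len(ALPHABET)
        probs[ctx] = {x: (counts[ctx + x] + pseudocount) / total for x in ALPHABET}
    return probs

def conditional_relative_entropy(contig, model, order=3, pseudocount=1.0):
    """Sum over contexts of (context frequency) * KL(P(.|ctx) || Q(.|ctx)), in nats.

    P is estimated from the contig; Q is the species model passed as `model`.
    """
    p = transition_probs(contig, order, pseudocount)
    ctx_counts = Counter(contig[i:i + order] for i in range(len(contig) - order))
    n = sum(ctx_counts.values())
    total = 0.0
    for ctx, count in ctx_counts.items():
        if ctx not in p:  # skip contexts containing symbols outside ACGT
            continue
        for x in ALPHABET:
            total += (count / n) * p[ctx][x] * math.log(p[ctx][x] / model[ctx][x])
    return total
```

A contig would then be assigned to the species model with the smallest value; unlike Euclidean distance on frequency vectors, this quantity is asymmetric and weights each context by how often it occurs in the contig.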

arXiv:1902.10832 [pdf, other]

Capacity Results for the Noisy Shuffling Channel

Authors: Ilan Shomorony, Reinhard Heckel

Abstract: Motivated by DNA-based storage, we study the noisy shuffling channel, which can be seen as the concatenation of a standard noisy channel (such as the BSC) and a shuffling channel, which breaks the data block into small pieces and shuffles them. This channel models a DNA storage system, by capturing two of its key aspects: (1) the data is written onto many short DNA molecules that are stored in an unordered way and (2) the molecules are corrupted by noise at synthesis, sequencing, and during storage. For the BSC-shuffling channel we characterize the capacity exactly (for a large set of parameters), and show that a simple index-based coding scheme is optimal.

Submitted 27 February, 2019; originally announced February 2019.

arXiv:1705.04732 [pdf, other]

Fundamental Limits of DNA Storage Systems

Authors: Reinhard Heckel, Ilan Shomorony, Kannan Ramchandran, David N. C. Tse

Abstract: Due to its longevity and enormous information density, DNA is an attractive medium for archival storage. In this work, we study the fundamental limits and tradeoffs of DNA-based storage systems under a simple model, motivated by current technological constraints on DNA synthesis and sequencing. Our model captures two key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered way and (2) the data is read by randomly sampling from this DNA pool. Under this model, we characterize the storage capacity, and show that a simple index-based coding scheme is optimal.

Submitted 12 May, 2017; originally announced May 2017.

Comments: To appear in Proc. of IEEE International Symposium on Information Theory (ISIT). Slightly extended version containing the proofs
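
The index-based coding idea named in the abstract above can be sketched in a few lines: each short molecule carries its position alongside its payload, so the reader can restore the order after shuffling. This is a toy illustration under strong assumptions that are not taken from the paper: reads are noise-free, every molecule is sampled at least once, and the index is kept as a Python integer rather than encoded into the molecule itself. All function names are hypothetical.

```python
import random

def encode_with_indices(data, payload_len):
    """Split data into payloads and tag each one with its block position."""
    if payload_len < 1:
        raise ValueError("payload_len must be positive")
    return [(start // payload_len, data[start:start + payload_len])
            for start in range(0, len(data), payload_len)]

def sample_pool(molecules, num_reads, seed=0):
    """Model unordered storage: read every molecule once, then draw extra
    reads with replacement, and return all reads in shuffled order."""
    rng = random.Random(seed)
    reads = list(molecules)
    reads += [rng.choice(molecules) for _ in range(max(0, num_reads - len(molecules)))]
    rng.shuffle(reads)
    return reads

def decode_from_reads(reads):
    """Deduplicate reads by index and concatenate payloads in index order."""
    by_index = dict(reads)
    return "".join(by_index[i] for i in sorted(by_index))

if __name__ == "__main__":
    message = "0110100110010110" * 3 + "01"
    reads = sample_pool(encode_with_indices(message, 8), num_reads=20)
    print(decode_from_reads(reads) == message)
```

The cost of this scheme is the space spent on the index in every molecule, which is why molecule length relative to the number of molecules matters for the achievable rate.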

arXiv:1605.01941 [pdf, other]

Partial DNA Assembly: A Rate-Distortion Perspective

Authors: Ilan Shomorony, Govinda M. Kamath, Fei Xia, Thomas A. Courtade, David N. Tse

Abstract: Earlier formulations of the DNA assembly problem were all in the context of perfect assembly; i.e., given a set of reads from a long genome sequence, is it possible to perfectly reconstruct the original sequence? In practice, however, it is very often the case that the read data is not sufficiently rich to permit unambiguous reconstruction of the original sequence. While a natural generalization of the perfect assembly formulation to these cases would be to consider a rate-distortion framework, partial assemblies are usually represented in terms of an assembly graph, making the definition of a distortion measure challenging. In this work, we introduce a distortion function for assembly graphs that can be understood as the logarithm of the number of Eulerian cycles in the assembly graph, each of which corresponds to a candidate assembly that could have generated the observed reads. We also introduce an algorithm for the construction of an assembly graph and analyze its performance on real genomes.

Submitted 6 May, 2016; originally announced May 2016.

Comments: To be published at ISIT-2016. 11 pages, 10 figures

arXiv:1508.04359 [pdf, other]

Informational Bottlenecks in Two-Unicast Wireless Networks with Delayed CSIT

Authors: Alireza Vahid, Ilan Shomorony, Robert Calderbank

Abstract: We study the impact of delayed channel state information at the transmitters (CSIT) in two-unicast wireless networks with a layered topology and arbitrary connectivity. We introduce a technique to obtain outer bounds to the degrees-of-freedom (DoF) region through the new graph-theoretic notion of bottleneck nodes. Such nodes act as informational bottlenecks only under the assumption of delayed CSIT, and imply asymmetric DoF bounds of the form $mD_1 + D_2 \leq m$. Combining this outer-bound technique with new achievability schemes, we characterize the sum DoF of a class of two-unicast wireless networks, which shows that, unlike in the case of instantaneous CSIT, the DoF of two-unicast networks with delayed CSIT can take an infinite set of values.

Submitted 3 October, 2015; v1 submitted 18 August, 2015; originally announced August 2015.

Comments: In proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing

arXiv:1501.06194 [pdf, other]

Do Read Errors Matter for Genome Assembly?

Authors: Ilan Shomorony, Thomas Courtade, David Tse

Abstract: While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read length and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, we make progress on this problem by establishing a critical read length, as a function of the genome and the error rate, above which perfect assembly is guaranteed. For several real genomes, including those from the GAGE dataset, we verify that this critical read length is not significantly greater than the read length required for perfect assembly from reads without errors.

Submitted 25 January, 2015; originally announced January 2015.

Comments: Submitted to ISIT 2015

arXiv:1411.3017 [pdf, other]

Sampling Large Data on Graphs

Authors: Ilan Shomorony, A. Salman Avestimehr

Abstract: We consider the problem of sampling from data defined on the nodes of a weighted graph, where the edge weights capture the data correlation structure. As shown recently, using spectral graph theory one can define a cut-off frequency for the bandlimited graph signals that can be reconstructed from a given set of samples (i.e., graph nodes). In this work, we show how this cut-off frequency can be computed exactly. Using this characterization, we provide efficient algorithms for finding the subset of nodes of a given size with the largest cut-off frequency and for finding the smallest subset of nodes with a given cut-off frequency. In addition, we study the performance of random uniform sampling when compared to the centralized optimal sampling provided by the proposed algorithms.

Submitted 11 November, 2014; originally announced November 2014.

Comments: To be presented at GlobalSIP 2014

arXiv:1404.4995 [pdf, other]

A Generalized Cut-Set Bound for Deterministic Multi-Flow Networks and its Applications

Authors: Ilan Shomorony, A. Salman Avestimehr

Abstract: We present a new outer bound for the sum capacity of general multi-unicast deterministic networks. Intuitively, this bound can be understood as applying the cut-set bound to concatenated copies of the original network with a special restriction on the allowed transmit signal distributions. We first study applications to finite-field networks, where we obtain a general outer-bound expression in terms of ranks of the transfer matrices. We then show that, even though our outer bound is for deterministic networks, a recent result relating the capacity of AWGN KxKxK networks and the capacity of a deterministic counterpart allows us to establish an outer bound to the DoF of KxKxK wireless networks with general connectivity. This bound is tight in the case of the "adjacent-cell interference" topology, and yields graph-theoretic necessary and sufficient conditions for K DoF to be achievable in general topologies.

Submitted 19 April, 2014; originally announced April 2014.

Comments: A shorter version of this paper will appear in the Proceedings of ISIT 2014

arXiv:1305.2548 [pdf, other]

On Min-Cut Algorithms for Half-Duplex Relay Networks

Authors: Raúl Etkin, Farzad Parvaresh, Ilan Shomorony, A. Salman Avestimehr

Abstract: Computing the cut-set bound in half-duplex relay networks is a challenging optimization problem, since it requires finding the cut-set optimal half-duplex schedule. This subproblem in general involves an exponential number of variables, since the number of ways to assign each node to either transmitter or receiver mode is exponential in the number of nodes. We present a general technique that takes advantage of specific structures in the topology of a given network and allows us to reduce the complexity of computing the half-duplex schedule that maximizes the cut-set bound (with i.i.d. input distribution). In certain classes of network topologies, our approach yields polynomial time algorithms. We use simulations to show running time improvements over alternative methods and compare the performance of various half-duplex scheduling approaches in different SNR regimes.

Submitted 11 May, 2013; originally announced May 2013.

Comments: Submitted to IEEE Transactions on Information Theory. Part of this work will be presented at ISIT 2013

arXiv:1304.1828 [pdf, other]

Network Compression: Worst-Case Analysis

Authors: Himanshu Asnani, Ilan Shomorony, A. Salman Avestimehr, Tsachy Weissman

Abstract: We study the problem of communicating a distributed correlated memoryless source over a memoryless network, from source nodes to destination nodes, under quadratic distortion constraints. We establish the following two complementary results: (a) for an arbitrary memoryless network, among all distributed memoryless sources of a given correlation, Gaussian sources are least compressible, that is, they admit the smallest set of achievable distortion tuples, and (b) for any memoryless source to be communicated over a memoryless additive-noise network, among all noise processes of a given correlation, Gaussian noise admits the smallest achievable set of distortion tuples. We establish these results constructively by showing how schemes for the corresponding Gaussian problems can be applied to achieve similar performance for (source or noise) distributions that are not necessarily Gaussian but have the same covariance.

Submitted 5 April, 2013; originally announced April 2013.

Comments: Submitted to IEEE Transactions on Information Theory

arXiv:1210.2143 [pdf, other]

Degrees of Freedom of Two-Hop Wireless Networks: "Everyone Gets the Entire Cake"

Authors: Ilan Shomorony, A. Salman Avestimehr

Abstract: We show that fully connected two-hop wireless networks with K sources, K relays and K destinations have K degrees of freedom both in the case of time-varying channel coefficients and in the case of constant channel coefficients (in which case the result holds for almost all values of constant channel coefficients). Our main contribution is a new achievability scheme which we call Aligned Network Diagonalization. This scheme allows the data streams transmitted by the sources to undergo a diagonal linear transformation from the sources to the destinations, thus being received free of interference by their intended destination. In addition, we extend our scheme to multi-hop networks with fully connected hops, and multi-hop networks with MIMO nodes, for which the degrees of freedom are also fully characterized.

Submitted 14 May, 2013; v1 submitted 8 October, 2012; originally announced October 2012.

Comments: Presented at the 2012 Allerton Conference. Submitted to IEEE Transactions on Information Theory

arXiv:1208.1784 [pdf, other]

doi 10.1109/ITW.2012.6404654

Worst-Case Source for Distributed Compression with Quadratic Distortion

Authors: Ilan Shomorony, A. Salman Avestimehr, Himanshu Asnani, Tsachy Weissman

Abstract: We consider the k-encoder source coding problem with a quadratic distortion measure. We show that among all source distributions with a given covariance matrix K, the jointly Gaussian source requires the highest rates in order to meet a given set of distortion constraints.

Submitted 8 August, 2012; originally announced August 2012.

Comments: To be presented at the IEEE Information Theory Workshop (ITW) 2012

arXiv:1205.6186 [pdf, other]

Diamond Networks with Bursty Traffic: Bounds on the Minimum Energy-Per-Bit

Authors: Ilan Shomorony, Raúl Etkin, Farzad Parvaresh, A. Salman Avestimehr

Abstract: When data traffic in a wireless network is bursty, small amounts of data sporadically become available for transmission, at times that are unknown at the receivers, and an extra amount of energy must be spent at the transmitters to overcome this lack of synchronization between the network nodes. In practice, pre-defined header sequences are used with the purpose of synchronizing the different network nodes. However, in networks where relays must be used for communication, the overhead required for synchronizing the entire network may be very significant. In this work, we study the fundamental limits of energy-efficient communication in an asynchronous diamond network with two relays. We formalize the notion of relay synchronization by saying that a relay is synchronized if the conditional entropy of the arrival time of the source message given the received signals at the relay is small. We show that the minimum energy-per-bit for bursty traffic in diamond networks is achieved with a coding scheme where each relay is either synchronized or not used at all. A consequence of this result is the derivation of a lower bound to the minimum energy-per-bit for bursty communication in diamond networks. This bound allows us to show that schemes that perform the tasks of synchronization and communication separately (i.e., with synchronization signals preceding the communication block) can achieve the minimum energy-per-bit to within a constant fraction that ranges from 2 in the synchronous case to 1 in the highly asynchronous regime.

Submitted 8 August, 2012; v1 submitted 28 May, 2012; originally announced May 2012.

Comments: Several proofs were updated

arXiv:1202.2687 [pdf, other]

Worst-Case Additive Noise in Wireless Networks

Authors: Ilan Shomorony, A. Salman Avestimehr

Abstract: A classical result in Information Theory states that the Gaussian noise is the worst-case additive noise in point-to-point channels, meaning that, for a fixed noise variance, the Gaussian noise minimizes the capacity of an additive noise channel. In this paper, we significantly generalize this result and show that the Gaussian noise is also the worst-case additive noise in wireless networks with additive noises that are independent from the transmit signals. More specifically, we show that, if we fix the noise variance at each node, then the capacity region with Gaussian noises is a subset of the capacity region with any other set of noise distributions. We prove this result by showing that a coding scheme that achieves a given set of rates on a network with Gaussian additive noises can be used to construct a coding scheme that achieves the same set of rates on a network that has the same topology and traffic demands, but with non-Gaussian additive noises.

Submitted 29 January, 2013; v1 submitted 13 February, 2012; originally announced February 2012.

Comments: Several proofs were improved. New examples and figures were added

arXiv:1102.2498 [pdf, other]

Two-Unicast Wireless Networks: Characterizing the Degrees-of-Freedom

Authors: Ilan Shomorony, A. Salman Avestimehr

Abstract: We consider two-source two-destination (i.e., two-unicast) multi-hop wireless networks that have a layered structure with arbitrary connectivity. We show that, if the channel gains are chosen independently according to continuous distributions, then, with probability 1, two-unicast layered Gaussian networks can only have 1, 3/2 or 2 sum degrees-of-freedom (unless both source-destination pairs are disconnected, in which case no degrees-of-freedom can be achieved). We provide sufficient and necessary conditions for each case based on network connectivity and a new notion of source-destination paths with manageable interference. Our achievability scheme is based on forwarding the received signals at all nodes, except for a small fraction of them in at most two key layers. Hence, we effectively create a "condensed network" that has at most four layers (including the sources layer and the destinations layer). We design the transmission strategies based on the structure of this condensed network. The converse results are obtained by developing information-theoretic inequalities that capture the structures of the network connectivity. Finally, we extend this result and characterize the full degrees-of-freedom region of two-unicast layered wireless networks.

Submitted 8 August, 2012; v1 submitted 12 February, 2011; originally announced February 2011.

Comments: To appear in IEEE Transactions on Information Theory

Showing 1–37 of 37 results for author: Shomorony, I