
Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Authors: Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, Quanquan Gu

Abstract: Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T\sigma_t^2} + d\big)$, where $\sigma_t^2$ is the variance of the pairwise comparison in round $t$, $d$ is the dimension of the context vectors, and $T$ is the time horizon. Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde O(d)$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.

Submitted 2 October, 2023; originally announced October 2023.

Comments: 28 pages, 1 figure
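As a worked reading of the bound quoted in the abstract above, the two extreme cases can be written out. This is a sketch under one assumption not stated in the listing: each binary comparison is treated as a Bernoulli outcome with success probability $p_t$, and $\sigma_t^2$ is read as its variance, so that $\sigma_t^2 = p_t(1-p_t) \le 1/4$. Logarithmic factors hidden by $\tilde O$ are ignored.

```latex
\sigma_t^2 = p_t(1-p_t) \le \tfrac14
\;\Longrightarrow\;
d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d \;\le\; \tfrac{d}{2}\sqrt{T} + d,
\qquad
\sigma_t^2 = 0 \ \text{ for all } t
\;\Longrightarrow\;
d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d \;=\; d .
```

Under that assumption, the first case recovers a $d\sqrt{T}$-type rate when every comparison is maximally noisy, and the second matches the $\tilde O(d)$ regret the abstract states for deterministic comparisons.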

arXiv:2303.08816 [pdf, other]

Borda Regret Minimization for Generalized Linear Dueling Bandits

Authors: Yue Wu, Tao Jin, Hao Lou, Farzad Farnoud, Quanquan Gu

Abstract: Dueling bandits are widely used to model preferential feedback prevalent in many applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a rich class of generalized linear dueling bandit models, which cover many existing models. We first prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$ for the Borda regret minimization problem, where $d$ is the dimension of contextual vectors and $T$ is the time horizon. To attain this lower bound, we propose an explore-then-commit type algorithm for the stochastic setting, which has a nearly matching regret upper bound $\tilde{O}(d^{2/3} T^{2/3})$. We also propose an EXP3-type algorithm for the adversarial linear setting, where the underlying model parameter can change at each round. Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal. Empirical evaluations on both synthetic data and a simulated real-world environment are conducted to corroborate our theoretical analysis.

Submitted 25 September, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: 33 pages, 5 figures. This version includes new results for dueling bandits in the adversarial setting
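To make the notion of a Borda score in the entry above concrete, here is a minimal sketch in Python. It assumes the usual dueling-bandit definition (the Borda score of item i is its average probability of beating each other item) and a fully known preference matrix; the names `borda_scores`, `borda_winner`, and the matrix `P` are illustrative and do not come from the paper, whose generalized linear model and algorithms are not reproduced here.

```python
from typing import List


def borda_scores(P: List[List[float]]) -> List[float]:
    """Borda score of each item: the mean of P[i][j] over all j != i.

    P[i][j] is the (assumed known) probability that item i beats item j.
    Diagonal entries are ignored.
    """
    k = len(P)
    if k < 2:
        raise ValueError("need at least two items")
    return [sum(P[i][j] for j in range(k) if j != i) / (k - 1) for i in range(k)]


def borda_winner(P: List[List[float]]) -> int:
    """Index of the item with the highest Borda score."""
    scores = borda_scores(P)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 3-item preference matrix. Item 0 beats both other items
# head-to-head, yet item 1 has the larger average win probability.
P = [
    [0.5, 0.6, 0.6],
    [0.4, 0.5, 0.9],
    [0.4, 0.1, 0.5],
]
scores = borda_scores(P)
winner = borda_winner(P)
```

In this hypothetical matrix the Borda winner (item 1) differs from the item that wins every pairwise duel (item 0), which is one reason Borda regret is studied as a separate objective.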

arXiv:2210.11818 [pdf, ps, other]

Non-binary Codes for Correcting a Burst of at Most t Deletions

Authors: Shuche Wang, Yuanyuan Tang, Jin Sima, Ryan Gabrys, Farzad Farnoud

Abstract: The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most $t$ deletions in non-binary sequences. We first propose a non-binary code correcting a burst of at most 2 deletions for $q$-ary alphabets. Afterwards, we extend this result to the case where the length of the burst can be at most $t$ where $t$ is a constant. Finally, we consider the setup where the sequences that are transmitted are permutations. The proposed codes are the largest known for their respective parameter regimes.

Submitted 21 October, 2022; originally announced October 2022.

Comments: 20 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at ISIT 2021 and Allerton 2022

arXiv:2208.02330 [pdf, other]

Low-redundancy codes for correcting multiple short-duplication and edit errors

Authors: Yuanyuan Tang, Shuche Wang, Hao Lou, Ryan Gabrys, Farzad Farnoud

Abstract: Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most $p$ edits, where a short duplication generates a copy of a substring with length $\leq 3$ and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to $p$ edits (in addition to duplications) at the additional cost of roughly $8p(\log_q n)(1+o(1))$ symbols of redundancy, thus achieving the same asymptotic rate, where $q\ge 4$ is the alphabet size and $p$ is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when $p$ is a constant with respect to the code length.

Submitted 3 August, 2022; originally announced August 2022.

Comments: 21 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at ISIT 2021 and ISIT 2022

arXiv:2110.04136 [pdf, other]

Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons

Authors: Yue Wu, Tao Jin, Hao Lou, Pan Xu, Farzad Farnoud, Quanquan Gu

Abstract: In heterogeneous rank aggregation problems, users often exhibit various accuracy levels when comparing pairs of items. Thus a uniform querying strategy over users may not be optimal. To address this issue, we propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons from users and improves the users' average accuracy by maintaining an active set of users. We prove that our algorithm can return the true ranking of items with high probability. We also provide a sample complexity bound for the proposed algorithm which is better than that of non-active strategies in the literature. Experiments are provided to show the empirical advantage of the proposed methods over the state-of-the-art baselines.

Submitted 8 October, 2021; originally announced October 2021.

arXiv:2107.00490 [pdf, ps, other]

doi 10.1109/TIT.2022.3176778

Data Deduplication with Random Substitutions

Authors: Hao Lou, Farzad Farnoud

Abstract: Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time efficient and are thus widely used in large scale storage systems. In this paper, we provide an information-theoretic analysis on the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. More precisely, each symbol in a repeated string is substituted with a given edit probability. Deduplication algorithms in both the fixed-length scheme and the variable-length scheme are studied. The fixed-length deduplication algorithm is shown to be unsuitable for the proposed source model as it does not take into account the edit probability. Two modifications are proposed and shown to have performances within a constant factor of optimal with the knowledge of source model parameters. We also study the conventional variable-length deduplication algorithm and show that as source entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string, leading to high compression ratios.

Submitted 26 May, 2022; v1 submitted 1 July, 2021; originally announced July 2021.

arXiv:2011.05896 [pdf, ps, other]

Error-correcting Codes for Short Tandem Duplication and Substitution Errors

Authors: Yuanyuan Tang, Farzad Farnoud

Abstract: Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting two types of errors, namely, short tandem duplication and substitution errors. We focus on tandem repeats of length at most 3 and design codes for correcting an arbitrary number of duplication errors and one substitution error. Because a substituted symbol can be duplicated many times (as part of substrings of various lengths), a single substitution can affect an unbounded substring of the retrieved word. However, we show that with appropriate preprocessing, the effect may be limited to a substring of finite length, thus making efficient error-correction possible. We construct a code for correcting the aforementioned errors and provide lower bounds for its rate. Compared to optimal codes correcting only duplication errors, numerical results show that the asymptotic cost of protecting against an additional substitution is only 0.003 bits/symbol when the alphabet has size 4, an important case corresponding to data storage in DNA.

Submitted 11 November, 2020; originally announced November 2020.

Comments: 9 pages

arXiv:2008.08174 [pdf, other]

Error-correcting Codes for Noisy Duplication Channels

Authors: Yuanyuan Tang, Farzad Farnoud

Abstract: Because of its high data density and longevity, DNA is emerging as a promising candidate for satisfying increasing data storage needs. Compared to conventional storage media, however, data stored in DNA is subject to a wider range of errors resulting from various processes involved in the data storage pipeline. In this paper, we consider correcting duplication errors for both exact and noisy tandem duplications of a given length k. An exact duplication inserts a copy of a substring of length k of the sequence immediately after that substring, e.g., ACGT to ACGACGT, where k = 3, while a noisy duplication inserts a copy suffering from substitution noise, e.g., ACGT to ACGATGT. Specifically, we design codes that can correct any number of exact duplication and one noisy duplication errors, where in the noisy duplication case the copy is at Hamming distance 1 from the original. Our constructions rely upon recovering the duplication root of the stored codeword. We characterize the ways in which duplication errors manifest in the root of affected sequences and design efficient codes for correcting these error patterns. We show that the proposed construction is asymptotically optimal, in the sense that it has the same asymptotic rate as optimal codes correcting exact duplications only.

Submitted 18 August, 2020; originally announced August 2020.

Comments: 14 pages, 2 figures, Allerton, TIT draft
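The exact and noisy tandem duplications described in the entry above can be illustrated with a short Python sketch that reproduces the abstract's own examples (ACGT to ACGACGT and ACGT to ACGATGT, with k = 3). The function names and argument conventions are hypothetical, chosen for illustration only; the greedy root computation is a simple illustration of removing exact length-k repeats and is not the paper's decoder.

```python
def tandem_duplicate(s: str, start: int, k: int) -> str:
    """Insert a copy of s[start:start+k] immediately after that substring."""
    if k <= 0 or start < 0 or start + k > len(s):
        raise ValueError("substring out of range")
    return s[:start + k] + s[start:start + k] + s[start + k:]


def noisy_tandem_duplicate(s: str, start: int, k: int, pos: int, symbol: str) -> str:
    """Like tandem_duplicate, but the inserted copy has one substituted symbol.

    pos is the offset inside the copy; the copy ends up at Hamming
    distance 1 from the original substring.
    """
    if k <= 0 or start < 0 or start + k > len(s) or not 0 <= pos < k:
        raise ValueError("substring or position out of range")
    copy = list(s[start:start + k])
    if copy[pos] == symbol:
        raise ValueError("substitution must change the symbol")
    copy[pos] = symbol
    return s[:start + k] + "".join(copy) + s[start + k:]


def root_fixed_length(s: str, k: int) -> str:
    """Repeatedly delete one copy of any exact length-k tandem repeat."""
    changed = True
    while changed:
        changed = False
        for i in range(len(s) - 2 * k + 1):
            if s[i:i + k] == s[i + k:i + 2 * k]:
                s = s[:i + k] + s[i + 2 * k:]
                changed = True
                break
    return s


exact = tandem_duplicate("ACGT", 0, 3)                 # example from the abstract
noisy = noisy_tandem_duplicate("ACGT", 0, 3, 1, "T")   # example from the abstract
```

Applying `root_fixed_length(exact, 3)` undoes the exact duplication, whereas the noisy copy is no longer an exact repeat and survives this greedy pass, which hints at why noisy duplications need separate treatment.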

arXiv:2005.03248 [pdf, other]

Coding for Optimized Writing Rate in DNA Storage

Authors: Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writing rate. Additionally, the encoding scheme studied here takes into account the existence of multiple copies of the DNA sequence, which are independently distorted. Finally, quantizers for various run-length distributions are designed.

Submitted 13 May, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

Comments: To appear in ISIT 2020

arXiv:1912.01211 [pdf, other]

Rank Aggregation via Heterogeneous Thurstone Preference Models

Authors: Tao Jin, Pan Xu, Quanquan Gu, Farzad Farnoud

Abstract: We propose the Heterogeneous Thurstone Model (HTM) for aggregating ranked data, which can take the accuracy levels of different users into account. By allowing different noise distributions, the proposed HTM model maintains the generality of Thurstone's original framework, and as such, also extends the Bradley-Terry-Luce (BTL) model for pairwise comparisons to heterogeneous populations of users. Under this framework, we also propose a rank aggregation algorithm based on alternating gradient descent to estimate the underlying item scores and accuracy levels of different users simultaneously from noisy pairwise comparisons. We theoretically prove that the proposed algorithm converges linearly up to a statistical error which matches that of the state-of-the-art method for the single-user BTL model. We evaluate the proposed HTM model and algorithm on both synthetic and real data, demonstrating that it outperforms existing methods.

Submitted 3 December, 2019; originally announced December 2019.

Comments: 36 pages, 2 figures, 8 tables. In AAAI 2020

arXiv:1911.05413 [pdf, ps, other]

doi 10.1109/TIT.2020.3006228

Single-Error Detection and Correction for Duplication and Substitution Channels

Authors: Yuanyuan Tang, Yonatan Yehezkeally, Moshe Schwartz, Farzad Farnoud

Abstract: Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitutions is unrestricted. Both error-detecting and error-correcting codes are constructed, which can handle correctly any number of tandem duplications of a fixed length $k$, and at most a single substitution occurring at any time during the mutation process.

Submitted 28 June, 2020; v1 submitted 13 November, 2019; originally announced November 2019.

Comments: Author-submitted, peer-reviewed, version

arXiv:1812.02250 [pdf, other]

Evolution of $k$-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems

Authors: Hao Lou, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that $k$-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems.

Submitted 5 December, 2018; originally announced December 2018.

arXiv:1809.04702 [pdf, other]

Reconciling Similar Sets of Data

Authors: Ryan Gabrys, Farzad Farnoud

Abstract: In this work, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related through the Hamming distance metric. Upper and lower bounds are derived on the minimum amount of information exchange. Furthermore, explicit encoding and decoding algorithms are provided for many cases.

Submitted 12 September, 2018; originally announced September 2018.

arXiv:1808.06062 [pdf, ps, other]

The Capacity of Some Pólya String Models

Authors: Ohad Elishco, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics.

Submitted 18 August, 2018; originally announced August 2018.

arXiv:1611.05537 [pdf, other]

Duplication Distance to the Root for Binary Sequences

Authors: Noga Alon, Jehoshua Bruck, Farzad Farnoud, Siddharth Jain

Abstract: We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\mathbf{x} = \mathbf{a} \mathbf{b} \mathbf{c} \to \mathbf{y} = \mathbf{a} \mathbf{b} \mathbf{b} \mathbf{c}$, where $\mathbf{x}$ and $\mathbf{y}$ are sequences and $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length $n$ by $f(n)$, we prove that $f(n)=\Theta(n)$. For the case of approximate duplication, where a $\beta$-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in $n$ to logarithmic at $\beta=1/2$. We also study the duplication distance to the root for sequences with a given root and for special classes of sequences, namely, the de Bruijn sequences, the Thue-Morse sequence, and the Fibonacci words. The problem is motivated by genomic tandem duplication mutations and the smallest number of tandem duplication events required to generate a given biological sequence.

Submitted 16 November, 2016; originally announced November 2016.

Comments: submitted to IEEE Transactions on Information Theory
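For short binary strings, the exact duplication distance to the root described in the entry above can be found by brute force. The sketch below is an illustration only, not the paper's method: it runs a breadth-first search over deduplications (removing one copy of a tandem repeat), so the first square-free string it reaches is a root and the search depth is the minimum number of tandem duplications needed to generate the input from it. The function names are hypothetical, and the running time grows quickly with the string length.

```python
from collections import deque
from typing import Set, Tuple


def deduplications(s: str) -> Set[str]:
    """All strings obtained from s by deleting one copy of a tandem repeat bb."""
    out = set()
    n = len(s)
    for k in range(1, n // 2 + 1):
        for i in range(n - 2 * k + 1):
            if s[i:i + k] == s[i + k:i + 2 * k]:
                out.add(s[:i + k] + s[i + 2 * k:])
    return out


def distance_to_root(s: str) -> Tuple[str, int]:
    """Return (root, minimum number of tandem duplications from root to s)."""
    seen = {s}
    queue = deque([(s, 0)])
    while queue:
        current, dist = queue.popleft()
        shorter = deduplications(current)
        if not shorter:
            # current is square-free, so it is a root; BFS order makes dist minimal
            return current, dist
        for t in shorter:
            if t not in seen:
                seen.add(t)
                queue.append((t, dist + 1))
    raise AssertionError("unreachable: every string reduces to a square-free one")
```

For instance, `distance_to_root("0000")` reports root `"0"` at distance 2 (via 0, 00, 0000), and `distance_to_root("010101")` reports root `"01"` at distance 2.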

arXiv:1606.00397 [pdf, ps, other]

doi 10.1109/ISIT.2016.7541455

Duplication-Correcting Codes for Data Storage in the DNA of Living Organisms

Authors: Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we provide error-correcting codes for errors caused by tandem duplications, which create a copy of a block of the sequence and insert it in a tandem manner, i.e., next to the original. In particular, we present two families of codes for correcting errors due to tandem duplications of a fixed length: the first family can correct any number of errors, while the second corrects a bounded number of errors. We also study codes for correcting tandem duplications of length up to a given constant $k$, where we are primarily focused on the cases of $k=2,3$. Finally, we provide a full classification of the sets of lengths allowed in tandem duplication that result in a unique root for all sequences.

Submitted 1 June, 2016; originally announced June 2016.

Comments: Submitted to IEEE Transactions on Information Theory

arXiv:1509.06029 [pdf, other]

doi 10.1109/ISIT.2015.7282795

Capacity and Expressiveness of Genomic Tandem Duplication

Authors: Siddharth Jain, Farzad Farnoud, Jehoshua Bruck

Abstract: The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat that may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibility of generating a large number of sequences from a seed, i.e., a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system's generating power. Our results include exact capacity values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the expressiveness of a tandem duplication system as the capability of expressing arbitrary substrings. We then completely characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. In particular, based on a celebrated result by Axel Thue from 1906, presenting a construction for ternary square-free sequences, we show that for alphabets of size 4 or larger, bounded tandem duplication systems, regardless of the seed and the bound on duplication length, are not fully expressive, i.e., they cannot generate all strings even as substrings of other strings. Note that the alphabet of size 4 is of particular interest as it pertains to the genomic alphabet. Building on this result, we also show that these systems do not have full capacity. In general, our results illustrate that duplication lengths play a more significant role than the seed in generating a large number of sequences for these systems.

Submitted 20 September, 2015; originally announced September 2015.

Comments: 19 pages, 3 figures, submitted to IEEE Transactions on Information Theory

arXiv:1401.4634 [pdf, ps, other]

The Capacity of String-Replication Systems

Authors: Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple replication rules, including those resembling genomic replication processes. In other words, our goal is to find out the capacity, or the expressive power, of these string-replication systems. Our results include exact capacities, and bounds on the capacities, of four fundamental string-replication systems.

Submitted 18 January, 2014; originally announced January 2014.

arXiv:1401.3093 [pdf, ps, other]

Rate-Distortion for Ranking with Incomplete Information

Authors: Farzad Farnoud, Moshe Schwartz, Jehoshua Bruck

Abstract: We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kendall Tau metric we provide bounds for small, medium, and large distortion regimes, while for the Chebyshev metric we present bounds that are valid for all distortions and are especially accurate for small distortions. In addition, for the Chebyshev metric, we provide a construction for covering codes.

Submitted 14 January, 2014; originally announced January 2014.

arXiv:1312.2163 [pdf, ps, other]

Multipermutation Codes in the Ulam Metric for Nonvolatile Memories

Authors: Farzad Farnoud, Olgica Milenkovic

Abstract: We address the problem of multipermutation code design in the Ulam metric for novel storage applications. Multipermutation codes are suitable for flash memory where cell charges may share the same rank. Changes in the charges of cells manifest themselves as errors whose effects on the retrieved signal may be measured via the Ulam distance. As part of our analysis, we study multipermutation codes in the Hamming metric, known as constant composition codes. We then present bounds on the size of multipermutation codes and their capacity, for both the Ulam and the Hamming metrics. Finally, we present constructions and accompanying decoders for multipermutation codes in the Ulam metric.

Submitted 7 December, 2013; originally announced December 2013.

arXiv:1307.4339 [pdf, other]

Computing Similarity Distances Between Rankings

Authors: Farzad Farnoud, Lili Su, Gregory J. Puleo, Olgica Milenkovic

Abstract: We address the problem of computing distances between rankings that take into account similarities between candidates. The need for evaluating such distances is governed by applications as diverse as rank aggregation, bioinformatics, social sciences and data storage. The problem may be summarized as follows: Given two rankings and a positive cost function on transpositions that depends on the similarity of the candidates involved, find a minimum-cost sequence of transpositions that converts one ranking into another. Our focus is on costs that may be described via special metric-tree structures and on complete rankings modeled as permutations. The presented results include a quadratic-time algorithm for finding a minimum-cost decomposition for simple cycles, and a quadratic-time, $4/3$-approximation algorithm for permutations that contain multiple cycles. The proposed methods rely on investigating a newly introduced balancing property of cycles embedded in trees, cycle-merging methods, and shortest path optimization techniques.

Submitted 19 November, 2014; v1 submitted 16 July, 2013; originally announced July 2013.

Comments: 32 pages, 14 figures. Corrected proof of unbalanced case

arXiv:1212.2607 [pdf, ps, other]

A General Framework for Distributed Vote Aggregation

Authors: Behrouz Touri, Farzad Farnoud, Angelia Neidic, Olgica Milenkovic

Abstract: We present a general model for opinion dynamics in a social network together with several possibilities for object selections at times when the agents are communicating. We study the limiting behavior of such dynamics and show that they converge almost surely. We consider some special implications of the convergence result for gossip and top-$k$ selective gossip models. In particular, we provide an answer to the open problem of the convergence property of the top-$k$ selective gossip model, and show that the convergence holds in a much more general setting. Moreover, we propose an extension of the gossip and top-$k$ selective gossip models and provide some results for their limiting behavior.

Submitted 11 December, 2012; originally announced December 2012.

arXiv:1212.1471 [pdf, ps, other]

A Novel Distance-Based Approach to Constrained Rank Aggregation

Authors: Farzad Farnoud, Olgica Milenkovic, Behrouz Touri

Abstract: We consider a classical problem in choice theory -- vote aggregation -- using novel distance measures between permutations that arise in several practical applications. The distance measures are derived through an axiomatic approach, taking into account various issues arising in voting with side constraints. The side constraints of interest include non-uniform relevance of the top and the bottom of rankings (or equivalently, eliminating negative outliers in votes) and similarities between candidates (or equivalently, introducing diversity in the voting process). The proposed distance functions may be seen as weighted versions of the Kendall $τ$ distance and weighted versions of the Cayley distance. In addition to proposing the distance measures and providing the theoretical underpinnings for their applications, we also consider algorithmic aspects associated with distance-based aggregation processes. We focus on two methods. One method is based on approximating weighted distance measures by a generalized version of Spearman's footrule distance, and it has provable constant approximation guarantees. The second class of algorithms is based on a non-uniform Markov chain method inspired by PageRank, for which currently only heuristic guarantees are known. We illustrate the performance of the proposed algorithms for a number of distance measures for which the optimal solution may be easily computed.

Submitted 6 December, 2012; originally announced December 2012.

arXiv:1206.5343 [pdf, ps, other]

Nonuniform Vote Aggregation Algorithms

Authors: Farzad Farnoud, Behrouz Touri, Olgica Milenkovic

Abstract: We consider the problem of non-uniform vote aggregation, and in particular, the algorithmic aspects associated with the aggregation process. For a novel class of weighted distance measures on votes, we present two different aggregation methods. The first algorithm is based on approximating the weighted distance measure by Spearman's footrule distance, with provable constant approximation guarantees. The second algorithm is based on a non-uniform Markov chain method inspired by PageRank, for which currently only heuristic guarantees are known. We illustrate the performance of the proposed algorithms on a number of distance measures for which the optimal solution may be easily computed.

Submitted 22 June, 2012; originally announced June 2012.

arXiv:1202.0932 [pdf, ps, other]

Error-Correction in Flash Memories via Codes in the Ulam Metric

Authors: Farzad Farnoud, Vitaly Skachek, Olgica Milenkovic

Abstract: We consider rank modulation codes for flash memories that allow for handling arbitrary charge-drop errors. Unlike classical rank modulation codes used for correcting errors that manifest themselves as swaps of two adjacently ranked elements, the proposed translocation rank codes account for more general forms of errors that arise in storage systems. Translocations represent a natural extension of the notion of adjacent transpositions and as such may be analyzed using related concepts in combinatorics and rank modulation coding. Our results include derivation of the asymptotic capacity of translocation rank codes, construction techniques for asymptotically good codes, as well as simple decoding methods for one class of constructed codes. As part of our exposition, we also highlight the close connections between the new code family and permutations with short common subsequences, deletion and insertion error-correcting codes for permutations, and permutation codes in the Hamming distance.

Submitted 21 April, 2013; v1 submitted 4 February, 2012; originally announced February 2012.

arXiv:1202.0925 [pdf, ps, other]

Alternating Markov Chains for Distribution Estimation in the Presence of Errors

Authors: Farzad Farnoud, Narayana P. Santhanam, Olgica Milenkovic

Abstract: We consider a class of small-sample distribution estimators over noisy channels. Our estimators are designed for repetition channels, and rely on properties of the runs of the observed sequences. These runs are modeled via a special type of Markov chain, termed an alternating Markov chain. We show that alternating chains have redundancy that scales sub-linearly with the lengths of the sequences, and describe how to use a distribution estimator for alternating chains for the purpose of distribution estimation over repetition channels.

Submitted 4 February, 2012; originally announced February 2012.

arXiv:1007.4236 [pdf, ps, other]

Sorting of Permutations by Cost-Constrained Transpositions

Authors: Farzad Farnoud, Olgica Milenkovic

Abstract: We address the problem of finding the minimum decomposition of a permutation in terms of transpositions with non-uniform cost. For arbitrary non-negative cost functions, we describe polynomial-time, constant-approximation decomposition algorithms. For metric-path costs, we describe exact polynomial-time decomposition algorithms. Our algorithms represent a combination of Viterbi-type algorithms and graph-search techniques for minimizing the cost of individual transpositions, and dynamic programming algorithms for finding minimum cost cycle decompositions. The presented algorithms have applications in information theory, bioinformatics, and algebra.

Submitted 23 July, 2010; originally announced July 2010.

Showing 1–27 of 27 results for author: Farnoud, F