Search | arXiv e-print repository

Recover the spectrum of covariance matrix: a non-asymptotic iterative method

Authors: Juntao Duan, Ionel Popescu, Heinrich Matzinger

Abstract: It is well known that the sample covariance has a consistent bias in its spectrum; for example, the spectrum of a Wishart matrix follows the Marchenko-Pastur law. In this work we introduce an iterative algorithm, 'Concent', that actively eliminates this bias and recovers the true spectrum for small and moderate dimensions.

Submitted 1 January, 2022; originally announced January 2022.
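The spectral bias mentioned in the abstract above can be seen already in dimension 2. The following is only a minimal sketch, not the 'Concent' algorithm: it assumes standard Gaussian data with true covariance equal to the identity (both true eigenvalues equal 1), a known zero mean, and the hypothetical helper names `sample_cov_eigs_2d` and `mean_extreme_eigs`. It averages the largest and smallest eigenvalues of the 2x2 sample covariance over many trials.

```python
import math
import random


def sample_cov_eigs_2d(n_samples, rng):
    """Eigenvalues (largest, smallest) of a 2x2 sample covariance.

    Assumption: data are n_samples draws from N(0, I) with known zero mean,
    so the sample covariance is the average of the outer products.
    """
    a = b = c = 0.0  # entries of the symmetric matrix [[a, b], [b, c]]
    for _ in range(n_samples):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        a += x * x
        b += x * y
        c += y * y
    a, b, c = a / n_samples, b / n_samples, c / n_samples
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mid + rad, mid - rad


def mean_extreme_eigs(n_samples=10, trials=2000, seed=0):
    """Average the largest and smallest sample eigenvalue over many trials."""
    rng = random.Random(seed)
    top = bottom = 0.0
    for _ in range(trials):
        hi, lo = sample_cov_eigs_2d(n_samples, rng)
        top += hi
        bottom += lo
    return top / trials, bottom / trials
```

Under these assumptions the averaged largest eigenvalue lies above 1 and the averaged smallest lies below 1, although both true eigenvalues are 1; this spreading is the kind of bias the abstract refers to.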

arXiv:2112.00300 [pdf, ps, other]

Invariance principle of random projection for the norm

Authors: Juntao Duan, Ionel Popescu, Heinrich Matzinger

Abstract: The Johnson-Lindenstrauss lemma guarantees that certain topological structure is preserved under random projections when high-dimensional deterministic vectors are projected to low-dimensional vectors. In this work, we try to understand how a random matrix affects the norms of random vectors. In particular, we prove that the distribution of the norm of a random vector $X \in \mathbb{R}^n$, whose entries are i.i.d. random variables, is preserved by a random projection $S:\mathbb{R}^n \to \mathbb{R}^m$. More precisely, \[ \frac{X^TS^TSX - mn}{\sqrt{σ^2 m^2n+2mn^2}} \xrightarrow[\quad m/n\to 0 \quad ]{ m,n\to \infty } \mathcal{N}(0,1) \] We also prove a concentration of the random norm transformed by either a random projection or a random embedding. Overall, our results show that random matrices have low distortion for the norm of random vectors with i.i.d. entries.

Submitted 25 July, 2022; v1 submitted 1 December, 2021; originally announced December 2021.
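The centering term $mn$ in the limit theorem above can be checked numerically. The sketch below rests on assumptions the abstract does not spell out: the entries of both $X$ and $S$ are taken to be independent standard Gaussians, and $S$ is left unnormalized so that $E[X^TS^TSX] = mn$. The function names are hypothetical, and the code only estimates the mean; it does not test normality.

```python
import random


def projected_sq_norm(m, n, rng):
    """Return |S X|^2 for a Gaussian X in R^n and an m x n Gaussian matrix S.

    Assumption: all entries of X and S are independent N(0, 1) variables.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    total = 0.0
    for _ in range(m):
        # One row of S, drawn on the fly, dotted with X.
        row_dot = sum(rng.gauss(0.0, 1.0) * xi for xi in x)
        total += row_dot * row_dot
    return total


def mean_projected_sq_norm(m=5, n=50, trials=200, seed=0):
    """Monte Carlo estimate of E[X^T S^T S X], to be compared with m * n."""
    rng = random.Random(seed)
    return sum(projected_sq_norm(m, n, rng) for _ in range(trials)) / trials
```

With the default parameters the estimate should be close to $mn = 250$, up to Monte Carlo error.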

arXiv:2006.00152 [pdf, ps, other]

An Analytical Formula for Spectrum Reconstruction

Authors: Zhibo Dai, Heinrich Matzinger, Ionel Popescu

Abstract: We study the spectrum reconstruction technique. Eigenvalues play an important role in many research fields and are the foundation of many practical techniques such as PCA (Principal Component Analysis). We believe that related algorithms should perform better with more accurate spectrum estimation. An approximation formula was proposed previously, but without proof. In our research, we show why the formula works. When both the number of features and the dimension of the space go to infinity, we find the order of the error of the approximation formula, which is related to a constant $c$, the ratio of the dimension of the space to the number of features.

Submitted 29 May, 2020; originally announced June 2020.

arXiv:1906.03768 [pdf, ps, other]

A cost-reducing partial labeling estimator in text classification problem

Authors: Jiangning Chen, Zhibo Dai, Juntao Duan, Qianli Hu, Ruilin Li, Heinrich Matzinger, Ionel Popescu, Haoyan Zhai

Abstract: We propose a new approach to address text classification problems when learning with partial labels is beneficial. Instead of offering each training sample a set of candidate labels, we assign negative-oriented labels to the ambiguous training examples if they are unlikely to fall into certain classes. We construct new maximum likelihood estimators with a self-correction property, and prove that under some conditions our estimators converge faster. We also discuss the advantages of applying one of our estimators to a fully supervised learning problem. The proposed method has potential applicability in many areas, such as crowdsourcing, natural language processing and medical image analysis.

Submitted 9 June, 2019; originally announced June 2019.

arXiv:1905.06115 [pdf, ps, other]

Naive Bayes with Correlation Factor for Text Classification Problem

Authors: Jiangning Chen, Zhibo Dai, Juntao Duan, Heinrich Matzinger, Ionel Popescu

Abstract: The Naive Bayes estimator is widely used in text classification problems. However, it does not perform well with small training datasets. We propose a new method based on the Naive Bayes estimator to solve this problem. A correlation factor is introduced to incorporate the correlation among different classes. Experimental results show that our estimator achieves better accuracy than traditional Naive Bayes on real-world data.

Submitted 8 May, 2019; originally announced May 2019.

arXiv:1808.10261 [pdf, ps, other]

Centroid estimation based on symmetric KL divergence for Multinomial text classification problem

Authors: Jiangning Chen, Heinrich Matzinger, Haoyan Zhai, Mi Zhou

Abstract: We define a new method to estimate the centroid for text classification based on the symmetric KL-divergence between the distribution of words in training documents and their class centroids. Experiments on several standard data sets indicate that the new method achieves substantial improvements over the traditional classifiers.

Submitted 24 October, 2018; v1 submitted 29 August, 2018; originally announced August 2018.

arXiv:1804.09472 [pdf, other]

Recovery of spectrum from estimated covariance matrices and statistical kernels for machine learning and big data

Authors: Saba Amsalu, Juntao Duan, Heinrich Matzinger, Ionel Popescu

Abstract: In this paper we propose two schemes for the recovery of the spectrum of a covariance matrix from the empirical covariance matrix, in the case where the dimension of the matrix is a subunitary multiple of the number of observations. We test, compare and analyze these on simulated data and also on some data coming from the stock market.

Submitted 25 April, 2018; originally announced April 2018.

arXiv:1803.04049 [pdf, other]

PCA by Determinant Optimization has no Spurious Local Optima

Authors: Raphael A. Hauser, Armin Eftekhari, Heinrich F. Matzinger

Abstract: Principal component analysis (PCA) is an indispensable tool in many learning tasks that finds the best linear representation for data. Classically, principal components of a dataset are interpreted as the directions that preserve most of its "energy", an interpretation that is theoretically underpinned by the celebrated Eckart-Young-Mirsky Theorem. There are yet other ways of interpreting PCA that are rarely exploited in practice, largely because it is not known how to reliably solve the corresponding non-convex optimisation programs. In this paper, we consider one such interpretation of principal components as the directions that preserve most of the "volume" of the dataset. Our main contribution is a theorem that shows that the corresponding non-convex program has no spurious local optima. We apply a number of solvers for empirical confirmation.

Submitted 11 March, 2018; originally announced March 2018.

arXiv:1710.10124 [pdf, ps, other]

Quantifying the Estimation Error of Principal Components

Authors: Raphael Hauser, Raul Kangro, Jüri Lember, Heinrich Matzinger

Abstract: Principal component analysis is an important pattern recognition and dimensionality reduction tool in many applications. Principal components are computed as eigenvectors of a maximum likelihood covariance $\widehat{Σ}$ that approximates a population covariance $Σ$, and these eigenvectors are often used to extract structural information about the variables (or attributes) of the studied population. Since PCA is based on the eigendecomposition of the proxy covariance $\widehat{Σ}$ rather than the ground-truth $Σ$, it is important to understand the approximation error in each individual eigenvector as a function of the number of available samples. The recent results of Koltchinskii and Lounici yield such bounds. In the present paper we sharpen these bounds and show that eigenvectors can often be reconstructed to a required accuracy from a sample of strictly smaller size order.

Submitted 27 October, 2017; originally announced October 2017.

arXiv:1703.05788 [pdf, other]

doi 10.1007/s10955-017-1835-6

Non-normal limiting distribution for optimal alignment scores of strings in binary alphabets

Authors: Jun Tao Duan, Heinrich Matzinger, Ionel Popescu

Abstract: We consider two independent binary i.i.d. random strings $X$ and $Y$ of equal length $n$ and the optimal alignments according to symmetric scoring functions only. We decompose the space of scoring functions into five components. Two of these components add a part to the optimal score which does not depend on the alignment and which is asymptotically normal. We show that when we restrict the number of gaps sufficiently and add them only into one sequence, the alignment score can be decomposed into a part which is normal and has order $O(\sqrt{n})$ and a part which is of smaller order and tends to a Tracy-Widom distribution. Adding gaps only into one sequence is equivalent to aligning a string with its descendants in the case of mutations and deletions. For testing relatedness of strings, the normal part is irrelevant: since it does not depend on the alignment, it can be safely removed from the test statistic.

Submitted 16 March, 2017; originally announced March 2017.

arXiv:1602.05560 [pdf, other]

Lower bounds for moments of global scores of pairwise Markov chains

Authors: Jüri Lember, Heinrich Matzinger, Joonas Sova, Fabio Zucca

Abstract: Let $X_1,X_2,\ldots$ and $Y_1,Y_2,\ldots$ be two random sequences so that every random variable takes values in a finite set $\mathbb{A}$. We consider a global similarity score $L_n:=L(X_1,\ldots,X_n;Y_1,\ldots,Y_n)$ that measures the homology (relatedness) of words $(X_1,\ldots,X_n)$ and $(Y_1,\ldots,Y_n)$. A typical example of such a score is the length of the longest common subsequence. We study the order of the central absolute moment $E|L_n-EL_n|^r$ in the case where the two-dimensional process $(X_1,Y_1),(X_2,Y_2),\ldots$ is a Markov chain on $\mathbb{A}\times \mathbb{A}$. This is a very general model involving independent Markov chains, hidden Markov models, Markov switching models and many more. Our main result establishes a general condition that guarantees that $E|L_n-EL_n|^r\asymp n^{r\over 2}$. We also perform simulations indicating the validity of the condition.

Submitted 18 February, 2016; v1 submitted 17 February, 2016; originally announced February 2016.

MSC Class: 60K35; 41A25; 60C05

arXiv:1511.00973 [pdf, ps, other]

doi 10.1214/16-AAP1183

Reconstruction of a multidimensional scenery with a branching random walk

Authors: Heinrich Matzinger, Serguei Popov, Angelica Pachon

Abstract: In this paper we consider a d-dimensional scenery seen along a simple symmetric branching random walk, where at each time each particle gives the color record it is seeing. We show that we can a.s. reconstruct the scenery up to equivalence from the color record of all the particles. For this we assume that the scenery has at least 2d + 1 colors which are i.i.d. with uniform probability. This is an improvement in comparison to [22] where the particles needed to see at each time a window around their current position. In [11] the reconstruction is done for d = 2 with only one particle instead of a branching random walk, but millions of colors are necessary.

Submitted 3 November, 2015; originally announced November 2015.

Journal ref: Ann. Appl. Probab. 27(2): 651-685 (2017)

arXiv:1409.7713 [pdf, other]

An Upper Bound on the Convergence Rate of a Second Functional in Optimal Sequence Alignment

Authors: Raphael Hauser, Heinrich Matzinger, Ionel Popescu

Abstract: Consider finite sequences $X_{[1,n]}=X_1\dots X_n$ and $Y_{[1,n]}=Y_1\dots Y_n$ of length $n$, consisting of i.i.d. samples of random letters from a finite alphabet, and let $S$ and $T$ be chosen i.i.d. randomly from the unit ball in the space of symmetric scoring functions over this alphabet augmented by a gap symbol. We prove a probabilistic upper bound of linear order in $n^{0.75}$ for the deviation of the score relative to $T$ of optimal alignments with gaps of $X_{[1,n]}$ and $Y_{[1,n]}$ relative to $S$. It remains an open problem to prove a lower bound. Our result contributes to the understanding of the microstructure of optimal alignments relative to one given scoring function, extending a theory begun by the first two authors.

Submitted 26 September, 2014; originally announced September 2014.

arXiv:1407.1233 [pdf, ps, other]

doi 10.3150/13-BEJ522

Optimal alignments of longest common subsequences and their path properties

Authors: Jüri Lember, Heinrich Matzinger, Anna Vollmer

Abstract: We investigate the behavior of optimal alignment paths for homologous (related) and independent random sequences. An alignment between two finite sequences is optimal if it corresponds to the longest common subsequence (LCS). We prove the existence of lowest and highest optimal alignments and study their differences. Large differences between the extremal alignments imply a high variety of optimal alignments. We present several simulations indicating that for homologous sequences (those having a common ancestor) the distance between the extremal alignments is typically much smaller than for independent sequences. In particular, the simulations suggest that for homologous sequences the growth of the distance between the extremal alignments is logarithmic. The main theoretical results of the paper prove that (under some assumptions) this is indeed the case. The paper suggests that the properties of the optimal alignment paths characterize the relatedness of the sequences.

Submitted 4 July, 2014; originally announced July 2014.

Comments: Published at http://dx.doi.org/10.3150/13-BEJ522 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Report number: IMS-BEJ-BEJ522

Journal ref: Bernoulli 2014, Vol. 20, No. 3, 1292-1343
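Several entries in this listing score alignments by the length of the longest common subsequence (LCS). As a reminder of what that quantity is, here is a minimal sketch of the standard dynamic-programming recursion for the LCS length. It is the textbook algorithm, not a method from any of the listed papers, and the function name `lcs_length` is chosen here for illustration.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of sequences x and y.

    Standard dynamic programme using O(len(y)) memory: prev[j] holds the
    LCS length of the prefix of x processed so far and y[:j].
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Matching letters extend the best alignment of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise skip a letter in x or in y (a gap).
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` returns 4. An optimal alignment in the sense of the abstracts is any alignment that realizes this maximal number of matched letters.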

arXiv:1304.6521 [pdf, ps, other]

doi 10.1007/s10955-013-0819-4

Letter Change Bias and Local Uniqueness in Optimal Sequence Alignments

Authors: Raphael Hauser, Heinrich Matzinger

Abstract: Considering two optimally aligned random sequences, we investigate the effect on the alignment score caused by changing a random letter in one of the two sequences. Using this idea in conjunction with large deviations theory, we show that in alignments with a low proportion of gaps the optimal alignment is locally unique in most places with high probability. This has implications in the design of recently pioneered alignment methods that use the local uniqueness as a homology indicator.

Submitted 24 April, 2013; originally announced April 2013.

MSC Class: 60F10; 92D20

arXiv:1211.5491 [pdf, ps, other]

Distribution of Aligned Letter Pairs in Optimal Alignments of Random Sequences

Authors: Raphael Hauser, Heinrich Matzinger

Abstract: Considering the optimal alignment of two i.i.d. random sequences of length $n$, we show that when the scoring function is chosen randomly, almost surely the empirical distribution of aligned letter pairs in all optimal alignments converges to a unique limiting distribution as $n$ tends to infinity. This result is interesting because it helps in understanding the microscopic path structure of a special type of last passage percolation problem with correlated weights, an area of long-standing open problems. Characterizing the microscopic path structure furthermore yields a robust alternative to optimal alignment scores for testing the relatedness of genetic sequences.

Submitted 23 November, 2012; originally announced November 2012.

Report number: Numerical Analysis Report NA-12-15, Mathematical Institute, University of Oxford

MSC Class: 60K35 (Primary) 52A40; 60C05; 60D05; 65K10; 60F10; 62E20; 90C27 (Secondary)

arXiv:1211.5489 [pdf, ps, other]

A Monte Carlo Approach to the Fluctuation Problem in Optimal Alignments of Random Strings

Authors: Saba Amsalu, Raphael Hauser, Heinrich Matzinger

Abstract: The problem of determining the correct order of fluctuation of the optimal alignment score of two random strings of length $n$ has been open for several decades. It is known that the biased expected effect of a random letter-change on the optimal score implies an order of fluctuation linear in $\sqrt{n}$. However, in many situations where such a biased effect is observed empirically, it has been impossible to prove analytically. The main result of this paper shows that when the rescaled limit of the optimal alignment score increases in a certain direction, then the biased effect exists. On the basis of this result one can quantify a confidence level for the existence of such a biased effect, and hence of an order $\sqrt{n}$ fluctuation, based on simulation of optimal alignment scores. This is an important step forward, as the correct order of fluctuation was previously known only for certain special distributions. To illustrate the usefulness of our new methodology, we apply it to optimal alignments of strings written in the DNA alphabet. As scoring function, we use the BLASTZ default substitution matrix together with a realistic gap penalty. BLASTZ is one of the most widely used sequence alignment methodologies in bioinformatics. For this DNA setting, we show that with a high level of confidence, the fluctuation of the optimal alignment score is of order $Θ(\sqrt{n})$. An important special case of optimal alignment score is the Longest Common Subsequence (LCS) of random strings. For binary sequences with equiprobable symbols, the question of the fluctuation of the LCS remains open. The symmetry in that case does not allow for our method. On the other hand, in real-life DNA sequences, it is not the case that all letters occur with the same frequency. Thus, for many real-life situations, our method allows one to determine the order of the fluctuation up to a high confidence level.

Submitted 23 November, 2012; originally announced November 2012.

Report number: Numerical Analysis Report NA-12-16, Mathematical Institute, University of Oxford

MSC Class: 60K35 (Primary) 60C05; 60F10; 62E20; 65C05; 90C27 (Secondary)

arXiv:1211.5072 [pdf, ps, other]

General approach to the fluctuations problem in random sequence comparison

Authors: Jüri Lember, Heinrich Matzinger, Felipe Torres

Abstract: We present a general approach to the problem of determining the asymptotic order of the variance of the optimal score between two independent random sequences defined over an arbitrary finite alphabet. Our general approach is based on identifying random variables driving the fluctuations of the optimal score and conveniently choosing functions of them which exhibit certain monotonicity properties. We show how our general approach establishes a common theoretical background for the techniques used by Matzinger et al. in a series of previous articles [6, 8, 20, 24, 26, 37] studying the same problem in special cases. Additionally, we explicitly apply our general approach to study the fluctuations of the optimal score between two random sequences over a finite alphabet (closing the study as initiated in [26]) and of the length of the longest common subsequences between two random sequences with a certain block structure (generalizing part of [37]).

Submitted 21 November, 2012; originally announced November 2012.

Comments: 39 pages

MSC Class: 60K35; 41A25; 60C05

arXiv:1210.3771 [pdf, ps, other]

Detecting the homology of DNA-sequences based on the variety of optimal alignments: a case study

Authors: Erik Hirmo, Jüri Lember, Heinrich Matzinger

Abstract: We consider a novel approach of measuring the homology of DNA sequences based on the variety of optimal alignments in the longest common subsequence sense. The proposed approach is compared with BLAST in measuring the homology of four genes.

Submitted 14 October, 2012; originally announced October 2012.

arXiv:1204.1009 [pdf, ps, other]

Sparse long blocks and the variance of the longest common subsequences in random words

Authors: S. Amsalu, C. Houdré, H. Matzinger

Abstract: Consider two independent random strings having the same length and taking values uniformly in a common finite alphabet. We study the order of the variance of the length of the longest common subsequences (LCS) of these strings when long blocks, or other types of atypical substrings, are sparsely added into one of them. Under weak conditions on the derivative of the mean LCS-curve, the order of the variance of the LCS is shown to be linear in the length of the strings. We also argue that our proofs carry over to many models used by computational biologists to simulate DNA sequences. This is the first result where the open question of the order of the fluctuation of the LCS of random strings is solved for a realistic model. Until now, this type of result had only been established for low entropy cases.

Submitted 22 September, 2016; v1 submitted 4 April, 2012; originally announced April 2012.

Comments: 36 pages

MSC Class: 60C05; 60F05; 05A16

arXiv:1204.1005 [pdf, ps, other]

Sparse Long Blocks and the Micro-Structure of the Longest Common Subsequences

Authors: S. Amsalu, C. Houdré, H. Matzinger

Abstract: Consider two random strings having the same length and generated by an i.i.d. sequence taking its values uniformly in a fixed finite alphabet. Artificially place a long constant block into one of the strings, where a constant block is a contiguous substring consisting only of one type of symbol. The long block replaces a segment of equal size and its length is smaller than the length of the strings, but larger than its square root. We show that for sufficiently long strings the optimal alignment corresponding to a Longest Common Subsequence (LCS) treats the inserted block very differently depending on the size of the alphabet. For two-letter alphabets, the long constant block gets mainly aligned with the same symbol from the other string, while for three or more letters the opposite is true and the block gets mainly aligned with gaps. We further provide simulation results on the proportion of gaps in blocks of various lengths. In our simulations, the blocks are "regular blocks" in an i.i.d. sequence, and are not artificially inserted. Nonetheless, we observe for these natural blocks a phenomenon similar to the one shown in the case of artificially inserted blocks: with two letters, the long blocks get aligned with a smaller proportion of gaps; for three or more letters, the opposite is true. It thus appears that the microscopic nature of two-letter optimal alignments is entirely different from that of three-letter optimal alignments.

Submitted 30 January, 2014; v1 submitted 4 April, 2012; originally announced April 2012.

Comments: To appear: Journal of Statistical Physics

MSC Class: 60C05; 60F05; 05A16

arXiv:1110.6853 [pdf, ps, other]

Information recovery from observations by a random walk having jump distribution with exponential tails

Authors: Andrew Hart, Fabio Machado, Heinrich Matzinger

Abstract: A {\it scenery} is a coloring $ξ$ of the integers. Let $\{S_t\}_{t\geq 0}$ be a recurrent random walk on the integers. Observing the scenery $ξ$ along the path of this random walk, one sees the color $χ_t:=ξ(S_t)$ at time $t$. The {\it scenery reconstruction problem} is concerned with recovering the scenery $ξ$, given only the sequence of observations $χ:=(χ_t)_{t\geq 0}$. The scenery reconstruction methods presented to date require the random walk to have bounded increments. Here, we present a new approach for random walks with unbounded increments, which works when the tail of the increment distribution decays exponentially at a sufficiently fast rate and the scenery has five colors.

Submitted 31 October, 2011; originally announced October 2011.

MSC Class: 60K37; 60G50

arXiv:1011.3601 [pdf, ps, other]

CLT for the proportion of infected individuals for an epidemic model on a complete graph

Authors: F. Machado, H. Mashurian, H. Matzinger

Abstract: We prove a Central Limit Theorem for the proportion of infected individuals for an epidemic model by dealing with a discrete time system of simple random walks on a complete graph with n vertices. Each random walk plays the role of a virus. Individuals are all connected as vertices in a complete graph. A virus duplicates each time it hits a susceptible individual, dying as soon as it hits an already infected individual. The process stops as soon as there are no more viruses. This model is closely related to some epidemiological models like those for virus dissemination in a computer network.

Submitted 16 November, 2010; originally announced November 2010.

MSC Class: 60F05 62P10 60J10
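
The dynamics described in the abstract above can be illustrated with a small simulation. The following is only a minimal sketch under assumptions that the abstract does not state: the process starts with a single virus sitting on a single infected vertex, each virus jumps to a uniformly chosen vertex other than its current one, viruses within a time step are processed one after another (so if two viruses hit the same susceptible vertex in the same step, the first infects it and the second dies), and "duplicates" means that two viruses continue from the newly infected vertex. The function name `final_infected_proportion` is hypothetical and is not taken from the paper.

```python
import random


def final_infected_proportion(n, seed=None):
    """Simulate one run on the complete graph with n vertices and
    return the proportion of infected vertices once every virus has died.

    Assumptions (not from the abstract): one initial virus on one
    infected vertex, jumps are uniform over the other n - 1 vertices,
    and viruses are processed sequentially within a time step.
    """
    if n < 2:
        raise ValueError("need at least two vertices")
    rng = random.Random(seed)
    infected = [False] * n
    infected[0] = True
    viruses = [0]  # current positions of the live viruses

    while viruses:
        next_viruses = []
        for pos in viruses:
            # Uniform jump to a vertex different from the current one.
            target = rng.randrange(n - 1)
            if target >= pos:
                target += 1
            if infected[target]:
                continue  # hit an already infected individual: the virus dies
            # Hit a susceptible individual: infect it and duplicate.
            infected[target] = True
            next_viruses.extend([target, target])
        viruses = next_viruses

    return sum(infected) / n


if __name__ == "__main__":
    runs = [final_infected_proportion(1000, seed=s) for s in range(200)]
    print("mean final proportion:", sum(runs) / len(runs))
```

Every step either kills a virus or infects a new vertex, and at most n - 1 new infections are possible, so the loop always terminates. Repeating the run over many seeds gives an empirical distribution of the final proportion, which is the kind of quantity the Central Limit Theorem in the abstract concerns.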

arXiv:1011.2688 [pdf, other]

The rate of the convergence of the mean score in random sequence comparison

Authors: Juri Lember, Heinrich Matzinger, Felipe Torres

Abstract: We consider a general class of super-additive scores measuring the similarity of two independent sequences of $n$ i.i.d. letters from a finite alphabet. Our object of interest is the mean score by letter $l_n$. By the subadditivity $l_n$ is nondecreasing and converges to a limit $l$. We give a simple method of bounding the difference $l-l_n$ and obtaining the rate of convergence. Our result generalizes a previous result of Alexander, where only the special case of the longest common subsequence is considered.

Submitted 17 November, 2010; v1 submitted 11 November, 2010; originally announced November 2010.

Comments: 13 pages, 1 figure

MSC Class: 60K35; 41A25; 60C05

arXiv:1011.2679 [pdf, other]

Random modification effect in the size of the fluctuation of the LCS of two sequences of i.i.d. blocks

Authors: Heinrich Matzinger, Felipe Torres

Abstract: The problem of the order of the fluctuation of the Longest Common Subsequence (LCS) of two independent sequences has been open for decades. There exist contradicting conjectures on the topic, due to Chvatal and Sankoff in 1975 and Waterman in 1994. In the present article, we consider a special model of i.i.d. sequences made out of blocks. A block is a contiguous substring consisting only of one type of symbol. Our model allows only three possible block lengths, each being picked with equal probability. In this context, we introduce a random operation (random modification) on the blocks of one of the sequences. In the present article, we develop the techniques to prove the following: if we suppose that the random modification increases the length of the LCS with high probability, then the order of the fluctuation of the LCS is as conjectured by Waterman. This result is a key technical part in the study of the size of the fluctuation of the LCS for sequences of i.i.d. blocks, developed by Matzinger and Torres.

Submitted 12 November, 2010; v1 submitted 11 November, 2010; originally announced November 2010.

Comments: 18 pages

MSC Class: 60C05; 60F10; 60G50; 60G99; 26A12

arXiv:1001.1273 [pdf, other]

Fluctuations of the Longest Common Subsequence for Sequences of Independent Blocks

Authors: Heinrich Matzinger, Felipe Torres

Abstract: The problem of the fluctuation of the Longest Common Subsequence (LCS) of two i.i.d. sequences of length $n>0$ has been open for decades. There exist contradicting conjectures on the topic. Chvatal and Sankoff conjectured in 1975 that asymptotically the order should be $n^{2/3}$, while Waterman conjectured in 1994 that asymptotically the order should be $n$. A contiguous substring consisting only of one type of symbol is called a block. In the present work, we determine the order of the fluctuation of the LCS for a special model of sequences consisting of i.i.d. blocks whose lengths are uniformly distributed on the set $\{l-1,l,l+1\}$, with $l$ a given positive integer. We show that the fluctuation in this model is asymptotically of order $n$, which confirms Waterman's conjecture. To achieve this goal, we developed a new method which allows us to reformulate the problem of the order of the variance as a (relatively) low dimensional optimization problem.

Submitted 12 November, 2010; v1 submitted 8 January, 2010; originally announced January 2010.

Comments: PDFLatex, 40 pages

MSC Class: 60C05; 60F10

arXiv:0911.2031 [pdf, other]

Closeness to the Diagonal for Longest Common Subsequences in Random Words

Authors: C. Houdré, H. Matzinger

Abstract: The nature of the alignment with gaps corresponding to a longest common subsequence (LCS) of two independent iid random sequences drawn from a finite alphabet is investigated. It is shown that such an optimal alignment typically matches pieces of similar short-length. This is of importance in understanding the structure of optimal alignments of two sequences. Moreover, it is also shown that any property, common to two subsequences, typically holds in most parts of the optimal alignment whenever this same property holds, with high probability, for strings of similar short-length. Our results should, in particular, prove useful for simulations since they imply that the re-scaled two-dimensional representation of an LCS gets uniformly close to the diagonal as the length of the sequences grows without bound.

Submitted 20 April, 2016; v1 submitted 10 November, 2009; originally announced November 2009.

Comments: Final version to appear in ECP, 2016

MSC Class: 05A05; 60C05; 60F10

arXiv:0907.5137 [pdf, ps, other]

doi 10.1214/08-AOP436

Standard deviation of the longest common subsequence

Authors: Jüri Lember, Heinrich Matzinger

Abstract: Let $L_n$ be the length of the longest common subsequence of two independent i.i.d. sequences of Bernoulli variables of length $n$. We prove that the order of the standard deviation of $L_n$ is $\sqrt{n}$, provided the parameter of the Bernoulli variables is small enough. This validates Waterman's conjecture in this situation [Philos. Trans. R. Soc. Lond. Ser. B 344 (1994) 383--390]. The order conjectured by Chvatal and Sankoff [J. Appl. Probab. 12 (1975) 306--315], however, is different.

Submitted 29 July, 2009; originally announced July 2009.

Comments: Published in at http://dx.doi.org/10.1214/08-AOP436 the Annals of Probability (http://www.imstat.org/aop/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOP-AOP436 MSC Class: 60K35; 41A25 (Primary); 60C05C (Secondary)

Journal ref: Annals of Probability 2009, Vol. 37, No. 3, 1192-1235
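
The quantity $L_n$ in the entry above can be computed with the textbook dynamic program for the longest common subsequence. The sketch below is illustrative only and is not taken from the paper: the helper names `lcs_length` and `sample_lcs_lengths`, the sample sizes, and the Bernoulli parameter are all assumptions chosen for the example, and a small Monte Carlo experiment like this does not reproduce or verify the asymptotic result.

```python
import random
import statistics


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y,
    using the standard O(len(x) * len(y)) dynamic program
    with one row of the table kept in memory at a time."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sample_lcs_lengths(n, p, trials, seed=0):
    """Draw `trials` independent pairs of i.i.d. Bernoulli(p) sequences
    of length n and return the LCS length of each pair."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(trials):
        x = [rng.random() < p for _ in range(n)]
        y = [rng.random() < p for _ in range(n)]
        lengths.append(lcs_length(x, y))
    return lengths


if __name__ == "__main__":
    # Illustrative parameters only; p = 0.1 is an arbitrary "small" choice.
    for n in (100, 200, 400):
        lengths = sample_lcs_lengths(n, p=0.1, trials=50)
        print(n, statistics.mean(lengths), statistics.stdev(lengths))
```

Comparing the printed sample standard deviations across several values of $n$ gives a rough feel for how the fluctuation of $L_n$ grows, though reliable estimates would need far more trials and much longer sequences.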

arXiv:math/0702036 [pdf, ps, other]

doi 10.1007/s10955-016-1549-1

On the Variance of the Optimal Alignments Score for Binary Random Words and an Asymmetric Scoring Function

Authors: Christian Houdré, Heinrich Matzinger

Abstract: We investigate the order of the variance of the optimal alignments score of two independent iid binary random words having the same length. The letters are equiprobable, but the scoring function is such that one letter has a larger score than the other. In this setting, we prove that the order of variance is linear in the common length. Optimal alignments constitute a generalization of longest common subsequences; they can be represented as optimal paths in a two-dimensional last passage percolation setting with dependent weights.

Submitted 15 June, 2016; v1 submitted 1 February, 2007; originally announced February 2007.

Comments: Appeared in Journal of Statistical Physics

MSC Class: 60K35; 60C05; 05A05

Journal ref: Journal of Statistical Physics, 2016

arXiv:math/0503517 [pdf, ps, other]

doi 10.1214/105051604000000972

Reconstructing a two-color scenery by observing it along a simple random walk path

Authors: Heinrich Matzinger

Abstract: Let $\{ξ(n)\}_{n\in \mathbb{Z}}$ be a two-color random scenery, that is, a random coloring of $\mathbb{Z}$ in two colors, such that the $ξ(i)$'s are i.i.d. Bernoulli variables with parameter $\tfrac12$. Let $\{S(n)\}_{n\in \mathbb{N}}$ be a symmetric random walk starting at 0. Our main result shows that a.s., $ξ\circ S$ (the composition of $ξ$ and $S$) determines $ξ$ up to translation and reflection. In other words, by observing the scenery $ξ$ along the random walk path $S$, we can a.s. reconstruct $ξ$ up to translation and reflection. This result gives a positive answer to the question of H. Kesten of whether one can a.s. detect a single defect in almost every two-color random scenery by observing it only along a random walk path.

Submitted 24 March, 2005; originally announced March 2005.

Comments: Published at http://dx.doi.org/10.1214/105051604000000972 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AAP-AAP066 MSC Class: 60L37 (Primary) 60G10. (Secondary)

Journal ref: Annals of Applied Probability 2005, Vol. 15, No. 1B, 778-819

arXiv:math/0410404 [pdf, ps, other]

Fluctuations of the Longest Common Subsequence in the Asymmetric Case of 2- and 3-Letter Alphabets

Authors: F. Bonetto, H. Matzinger

Abstract: We investigate the asymptotic standard deviation of the Longest Common Subsequence (LCS) of two independent i.i.d. sequences of length n. The first sequence is drawn from a three letter alphabet {0,1,a}, whilst the second sequence is binary. The main result of this article is that in this asymmetric case, the standard deviation of the length of the LCS is of order square root of n. This confirms Waterman's conjecture for this special case. Our result seems to indicate that in many other situations the order of the standard deviation is also square root of n.

Submitted 18 October, 2004; originally announced October 2004.

MSC Class: 60C05 (primary) 92D20 (secondary)

Showing 1–31 of 31 results for author: Matzinger, H