Showing 1–31 of 31 results for author: Matzinger, H

  1. arXiv:2201.00230  [pdf, other]

    stat.ML cs.LG stat.CO

    Recover the spectrum of covariance matrix: a non-asymptotic iterative method

    Authors: Juntao Duan, Ionel Popescu, Heinrich Matzinger

    Abstract: It is well known that the sample covariance has a consistent bias in the spectrum; for example, the spectrum of a Wishart matrix follows the Marchenko-Pastur law. In this work we introduce an iterative algorithm, 'Concent', that actively eliminates this bias and recovers the true spectrum for small and moderate dimensions.

    Submitted 1 January, 2022; originally announced January 2022.

  2. arXiv:2112.00300  [pdf, ps, other]

    math.PR math.ST stat.ML

    Invariance principle of random projection for the norm

    Authors: Juntao Duan, Ionel Popescu, Heinrich Matzinger

    Abstract: Johnson-Lindenstrauss guarantees that certain topological structure is preserved under random projections when projecting high dimensional deterministic vectors to low dimensional vectors. In this work, we try to understand how random matrices affect norms of random vectors. In particular we prove the distribution of the norm of random vector $X \in \mathbb{R}^n$, whose entries are i.i.d. random variables,…

    Submitted 25 July, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

  3. arXiv:2006.00152  [pdf, ps, other]

    math.PR stat.ML

    An Analytical Formula for Spectrum Reconstruction

    Authors: Zhibo Dai, Heinrich Matzinger, Ionel Popescu

    Abstract: We study the spectrum reconstruction technique. As is well known, eigenvalues play an important role in many research fields and are the foundation of many practical techniques such as PCA (Principal Component Analysis). We believe that related algorithms should perform better with more accurate spectrum estimation. An approximation formula was proposed previously; however, no proof was given. I…

    Submitted 29 May, 2020; originally announced June 2020.

  4. arXiv:1906.03768  [pdf, ps, other]

    stat.ML cs.IR cs.LG

    A cost-reducing partial labeling estimator in text classification problem

    Authors: Jiangning Chen, Zhibo Dai, Juntao Duan, Qianli Hu, Ruilin Li, Heinrich Matzinger, Ionel Popescu, Haoyan Zhai

    Abstract: We propose a new approach to address text classification problems when learning with partial labels is beneficial. Instead of offering each training sample a set of candidate labels, we assign negative-oriented labels to the ambiguous training examples if they are unlikely to fall into certain classes. We construct our new maximum likelihood estimators with a self-correction property, and prove tha…

    Submitted 9 June, 2019; originally announced June 2019.

  5. arXiv:1905.06115  [pdf, ps, other]

    cs.IR cs.LG stat.ML

    Naive Bayes with Correlation Factor for Text Classification Problem

    Authors: Jiangning Chen, Zhibo Dai, Juntao Duan, Heinrich Matzinger, Ionel Popescu

    Abstract: The Naive Bayes estimator is widely used in text classification problems. However, it does not perform well with small training datasets. We propose a new method based on the Naive Bayes estimator to solve this problem. A correlation factor is introduced to incorporate the correlation among different classes. Experimental results show that our estimator achieves a better accuracy compared with traditio…

    Submitted 8 May, 2019; originally announced May 2019.

  6. arXiv:1808.10261  [pdf, ps, other]

    cs.IR cs.LG stat.ML

    Centroid estimation based on symmetric KL divergence for Multinomial text classification problem

    Authors: Jiangning Chen, Heinrich Matzinger, Haoyan Zhai, Mi Zhou

    Abstract: We define a new method to estimate centroids for text classification based on the symmetric KL-divergence between the distribution of words in training documents and their class centroids. Experiments on several standard data sets indicate that the new method achieves substantial improvements over the traditional classifiers.

    Submitted 24 October, 2018; v1 submitted 29 August, 2018; originally announced August 2018.

  7. arXiv:1804.09472  [pdf, other]

    math.PR math.ST

    Recovery of spectrum from estimated covariance matrices and statistical kernels for machine learning and big data

    Authors: Saba Amsalu, Juntao Duan, Heinrich Matzinger, Ionel Popescu

    Abstract: In this paper we propose two schemes for the recovery of the spectrum of a covariance matrix from the empirical covariance matrix, in the case where the dimension of the matrix is a subunitary multiple of the number of observations. We test, compare and analyze these on simulated data and also on some data coming from the stock market.

    Submitted 25 April, 2018; originally announced April 2018.

  8. arXiv:1803.04049  [pdf, other]

    math.OC

    PCA by Determinant Optimization has no Spurious Local Optima

    Authors: Raphael A. Hauser, Armin Eftekhari, Heinrich F. Matzinger

    Abstract: Principal component analysis (PCA) is an indispensable tool in many learning tasks that finds the best linear representation for data. Classically, principal components of a dataset are interpreted as the directions that preserve most of its "energy", an interpretation that is theoretically underpinned by the celebrated Eckart-Young-Mirsky Theorem. There are yet other ways of interpreting PCA that…

    Submitted 11 March, 2018; originally announced March 2018.

  9. arXiv:1710.10124  [pdf, ps, other]

    math.ST

    Quantifying the Estimation Error of Principal Components

    Authors: Raphael Hauser, Raul Kangro, Jüri Lember, Heinrich Matzinger

    Abstract: Principal component analysis is an important pattern recognition and dimensionality reduction tool in many applications. Principal components are computed as eigenvectors of a maximum likelihood covariance $\widehat{Σ}$ that approximates a population covariance $Σ$, and these eigenvectors are often used to extract structural information about the variables (or attributes) of the studied population.…

    Submitted 27 October, 2017; originally announced October 2017.

  10. Non-normal limiting distribution for optimal alignment scores of strings in binary alphabets

    Authors: Jun Tao Duan, Heinrich Matzinger, Ionel Popescu

    Abstract: We consider two independent binary i.i.d. random strings $X$ and $Y$ of equal length $n$ and the optimal alignments according to symmetric scoring functions only. We decompose the space of scoring functions into five components. Two of these components add a part to the optimal score which does not depend on the alignment and which is asymptotically normal. We show that when we restrict the nu…

    Submitted 16 March, 2017; originally announced March 2017.

  11. arXiv:1602.05560  [pdf, other]

    math.PR

    Lower bounds for moments of global scores of pairwise Markov chains

    Authors: Jüri Lember, Heinrich Matzinger, Joonas Sova, Fabio Zucca

    Abstract: Let $X_1,X_2,\ldots$ and $Y_1,Y_2,\ldots$ be two random sequences so that every random variable takes values in a finite set $\mathbb{A}$. We consider a global similarity score $L_n:=L(X_1,\ldots,X_n;Y_1,\ldots,Y_n)$ that measures the homology (relatedness) of words $(X_1,\ldots,X_n)$ and $(Y_1,\ldots,Y_n)$. A typical example of such a score is the length of the longest common subsequence. We study…

    Submitted 18 February, 2016; v1 submitted 17 February, 2016; originally announced February 2016.

    MSC Class: 60K35; 41A25; 60C05
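
As a concrete illustration of the score named in the abstract above, the length of the longest common subsequence of two words can be computed with the standard dynamic-programming recursion. This is a minimal sketch and is not taken from the paper; the function name `lcs_length` is a hypothetical choice made here.

```python
def lcs_length(x, y):
    """Return the length of a longest common subsequence of x and y."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Matching letters extend the LCS of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop the last letter of one of the two prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, `lcs_length("0110", "1011")` evaluates to 3, with "011" being one common subsequence of that length.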

  12. Reconstruction of a multidimensional scenery with a branching random walk

    Authors: Heinrich Matzinger, Serguei Popov, Angelica Pachon

    Abstract: In this paper we consider a d-dimensional scenery seen along a simple symmetric branching random walk, where at each time each particle gives the color record it is seeing. We show that we can a.s. reconstruct the scenery up to equivalence from the color record of all the particles. For this we assume that the scenery has at least 2d + 1 colors which are i.i.d. with uniform probability. This is an…

    Submitted 3 November, 2015; originally announced November 2015.

    Journal ref: Ann. Appl. Probab. 27(2): 651-685 (2017)

  13. arXiv:1409.7713  [pdf, other]

    math.PR

    An Upper Bound on the Convergence Rate of a Second Functional in Optimal Sequence Alignment

    Authors: Raphael Hauser, Heinrich Matzinger, Ionel Popescu

    Abstract: Consider finite sequences $X_{[1,n]}=X_1\dots X_n$ and $Y_{[1,n]}=Y_1\dots Y_n$ of length $n$, consisting of i.i.d.\ samples of random letters from a finite alphabet, and let $S$ and $T$ be chosen i.i.d.\ randomly from the unit ball in the space of symmetric scoring functions over this alphabet augmented by a gap symbol. We prove a probabilistic upper bound of linear order in $n^{0.75}$ for the de…

    Submitted 26 September, 2014; originally announced September 2014.

  14. Optimal alignments of longest common subsequences and their path properties

    Authors: Jüri Lember, Heinrich Matzinger, Anna Vollmer

    Abstract: We investigate the behavior of optimal alignment paths for homologous (related) and independent random sequences. An alignment between two finite sequences is optimal if it corresponds to the longest common subsequence (LCS). We prove the existence of lowest and highest optimal alignments and study their differences. High differences between the extremal alignments imply the high variety of all op…

    Submitted 4 July, 2014; originally announced July 2014.

    Comments: Published at http://dx.doi.org/10.3150/13-BEJ522 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

    Report number: IMS-BEJ-BEJ522

    Journal ref: Bernoulli 2014, Vol. 20, No. 3, 1292-1343

  15. Letter Change Bias and Local Uniqueness in Optimal Sequence Alignments

    Authors: Raphael Hauser, Heinrich Matzinger

    Abstract: Considering two optimally aligned random sequences, we investigate the effect on the alignment score caused by changing a random letter in one of the two sequences. Using this idea in conjunction with large deviations theory, we show that in alignments with a low proportion of gaps the optimal alignment is locally unique in most places with high probability. This has implications in the design of…

    Submitted 24 April, 2013; originally announced April 2013.

    MSC Class: 60F10; 92D20

  16. arXiv:1211.5491  [pdf, ps, other]

    math.PR

    Distribution of Aligned Letter Pairs in Optimal Alignments of Random Sequences

    Authors: Raphael Hauser, Heinrich Matzinger

    Abstract: Considering the optimal alignment of two i.i.d. random sequences of length $n$, we show that when the scoring function is chosen randomly, almost surely the empirical distribution of aligned letter pairs in all optimal alignments converges to a unique limiting distribution as $n$ tends to infinity. This result is interesting because it helps to understand the microscopic path structure of a specia…

    Submitted 23 November, 2012; originally announced November 2012.

    Report number: Numerical Analysis Report NA-12-15, Mathematical Institute, University of Oxford

    MSC Class: 60K35 (Primary) 52A40; 60C05; 60D05; 65K10; 60F10; 62E20; 90C27 (Secondary)

  17. arXiv:1211.5489  [pdf, ps, other]

    math.PR

    A Monte Carlo Approach to the Fluctuation Problem in Optimal Alignments of Random Strings

    Authors: Saba Amsalu, Raphael Hauser, Heinrich Matzinger

    Abstract: The problem of determining the correct order of fluctuation of the optimal alignment score of two random strings of length $n$ has been open for several decades. It is known that the biased expected effect of a random letter-change on the optimal score implies an order of fluctuation linear in $\sqrt{n}$. However, in many situations where such a biased effect is observed empirically, it has been i…

    Submitted 23 November, 2012; originally announced November 2012.

    Report number: Numerical Analysis Report NA-12-16, Mathematical Institute, University of Oxford

    MSC Class: 60K35 (Primary) 60C05; 60F10; 62E20; 65C05; 90C27 (Secondary)

  18. arXiv:1211.5072  [pdf, ps, other]

    math.PR

    General approach to the fluctuations problem in random sequence comparison

    Authors: Jüri Lember, Heinrich Matzinger, Felipe Torres

    Abstract: We present a general approach to the problem of determining the asymptotic order of the variance of the optimal score between two independent random sequences defined over an arbitrary finite alphabet. Our general approach is based on identifying random variables driving the fluctuations of the optimal score and conveniently choosing functions of them which exhibit certain monotonicity properties.…

    Submitted 21 November, 2012; originally announced November 2012.

    Comments: 39 pages

    MSC Class: 60K35; 41A25; 60C05

  19. arXiv:1210.3771  [pdf, ps, other]

    stat.AP q-bio.QM

    Detecting the homology of DNA-sequences based on the variety of optimal alignments: a case study

    Authors: Erik Hirmo, Jüri Lember, Heinrich Matzinger

    Abstract: We consider a novel approach to measuring the homology of DNA sequences based on the variety of optimal alignments in the longest common subsequence sense. The proposed approach is compared with BLAST in measuring the homology of four genes.

    Submitted 14 October, 2012; originally announced October 2012.

  20. arXiv:1204.1009  [pdf, ps, other]

    math.PR math-ph math.CO

    Sparse long blocks and the variance of the longest common subsequences in random words

    Authors: S. Amsalu, C. Houdré, H. Matzinger

    Abstract: Consider two independent random strings having the same length and taking values uniformly in a common finite alphabet. We study the order of the variance of the length of the longest common subsequences (LCS) of these strings when long blocks, or other types of atypical substrings, are sparsely added into one of them. Under weak conditions on the derivative of the mean LCS-curve, the order of the var…

    Submitted 22 September, 2016; v1 submitted 4 April, 2012; originally announced April 2012.

    Comments: 36 pages

    MSC Class: 60C05; 60F05; 05A16

  21. arXiv:1204.1005  [pdf, ps, other]

    math.PR math-ph math.CO

    Sparse Long Blocks and the Micro-Structure of the Longest Common Subsequences

    Authors: S. Amsalu, C. Houdré, H. Matzinger

    Abstract: Consider two random strings having the same length and generated by an iid sequence taking its values uniformly in a fixed finite alphabet. Artificially place a long constant block into one of the strings, where a constant block is a contiguous substring consisting only of one type of symbol. The long block replaces a segment of equal size and its length is smaller than the length of the strings,…

    Submitted 30 January, 2014; v1 submitted 4 April, 2012; originally announced April 2012.

    Comments: To appear: Journal of Statistical Physics

    MSC Class: 60C05; 60F05; 05A16

  22. arXiv:1110.6853  [pdf, ps, other]

    math.PR

    Information recovery from observations by a random walk having jump distribution with exponential tails

    Authors: Andrew Hart, Fabio Machado, Heinrich Matzinger

    Abstract: A {\it scenery} is a coloring $ξ$ of the integers. Let $\{S_t\}_{t\geq 0}$ be a recurrent random walk on the integers. Observing the scenery $ξ$ along the path of this random walk, one sees the color $χ_t:=ξ(S_t)$ at time $t$. The {\it scenery reconstruction problem} is concerned with recovering the scenery $ξ$, given only the sequence of observations $χ:=(χ_t)_{t\geq 0}$. The scenery reconstructi…

    Submitted 31 October, 2011; originally announced October 2011.

    MSC Class: 60K37; 60G50
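
To make the observation sequence $χ_t:=ξ(S_t)$ from the abstract above concrete, the following minimal sketch generates the color record of a scenery along a random walk. It is an illustration only and not the paper's method: it assumes a simple symmetric walk with steps of ±1 started at 0 (the paper concerns more general jump distributions), and the names `color_record` and `periodic_scenery` are hypothetical.

```python
import random


def color_record(scenery, steps, rng):
    """Return [xi(S_0), ..., xi(S_steps)] for a simple symmetric walk from 0."""
    position = 0
    record = [scenery(position)]
    for _ in range(steps):
        # Assumed step law: +1 or -1 with equal probability.
        position += rng.choice((-1, 1))
        record.append(scenery(position))
    return record


def periodic_scenery(z):
    """Hypothetical two-color scenery: color 0 on even sites, 1 on odd sites."""
    return z % 2


observations = color_record(periodic_scenery, 5, random.Random(0))
```

With this particular periodic scenery, every ±1 step changes the parity of the position, so the record alternates between 0 and 1 whatever path the walk takes. The observations then carry no information about the path, which is one reason the reconstruction literature typically works with random sceneries.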

  23. arXiv:1011.3601  [pdf, ps, other]

    math.PR

    CLT for the proportion of infected individuals for an epidemic model on a complete graph

    Authors: F. Machado, H. Mashurian, H. Matzinger

    Abstract: We prove a Central Limit Theorem for the proportion of infected individuals for an epidemic model by dealing with a discrete time system of simple random walks on a complete graph with n vertices. Each random walk plays the role of a virus. Individuals are all connected as vertices in a complete graph. A virus duplicates each time it hits a susceptible individual, dying as soon as it hits an already…

    Submitted 16 November, 2010; originally announced November 2010.

    MSC Class: 60F05 62P10 60J10

  24. arXiv:1011.2688  [pdf, other]

    math.PR math.CO

    The rate of the convergence of the mean score in random sequence comparison

    Authors: Juri Lember, Heinrich Matzinger, Felipe Torres

    Abstract: We consider a general class of super-additive scores measuring the similarity of two independent sequences of $n$ i.i.d. letters from a finite alphabet. Our object of interest is the mean score by letter $l_n$. By the subadditivity $l_n$ is nondecreasing and converges to a limit $l$. We give a simple method of bounding the difference $l-l_n$ and obtaining the rate of convergence. Our result genera…

    Submitted 17 November, 2010; v1 submitted 11 November, 2010; originally announced November 2010.

    Comments: 13 pages, 1 figure

    MSC Class: 60K35; 41A25; 60C05

  25. arXiv:1011.2679  [pdf, other]

    math.PR math.CO

    Random modification effect in the size of the fluctuation of the LCS of two sequences of i.i.d. blocks

    Authors: Heinrich Matzinger, Felipe Torres

    Abstract: The problem of the order of the fluctuation of the Longest Common Subsequence (LCS) of two independent sequences has been open for decades. There exist contradicting conjectures on the topic, due to Chvatal and Sankoff in 1975 and Waterman in 1994. In the present article, we consider a special model of i.i.d. sequences made out of blocks. A block is a contiguous substring consisting only of one type…

    Submitted 12 November, 2010; v1 submitted 11 November, 2010; originally announced November 2010.

    Comments: 18 pages

    MSC Class: 60C05; 60F10; 60G50; 60G99; 26A12

  26. arXiv:1001.1273  [pdf, other]

    math.PR math.CO

    Fluctuations of the Longest Common Subsequence for Sequences of Independent Blocks

    Authors: Heinrich Matzinger, Felipe Torres

    Abstract: The problem of the fluctuation of the Longest Common Subsequence (LCS) of two i.i.d. sequences of length $n>0$ has been open for decades. There exist contradicting conjectures on the topic. Chvatal and Sankoff conjectured in 1975 that asymptotically the order should be $n^{2/3}$, while Waterman conjectured in 1994 that asymptotically the order should be $n$. A contiguous substring consisting only…

    Submitted 12 November, 2010; v1 submitted 8 January, 2010; originally announced January 2010.

    Comments: PDFLatex, 40 pages

    MSC Class: 60C05; 60F10

  27. arXiv:0911.2031  [pdf, other]

    math.PR math.CO

    Closeness to the Diagonal for Longest Common Subsequences in Random Words

    Authors: C. Houdré, H. Matzinger

    Abstract: The nature of the alignment with gaps corresponding to a longest common subsequence (LCS) of two independent iid random sequences drawn from a finite alphabet is investigated. It is shown that such an optimal alignment typically matches pieces of similar short-length. This is of importance in understanding the structure of optimal alignments of two sequences. Moreover, it is also shown that any pr…

    Submitted 20 April, 2016; v1 submitted 10 November, 2009; originally announced November 2009.

    Comments: Final version to appear in ECP, 2016

    MSC Class: 05A05; 60C05; 60F10

  28. Standard deviation of the longest common subsequence

    Authors: Jüri Lember, Heinrich Matzinger

    Abstract: Let $L_n$ be the length of the longest common subsequence of two independent i.i.d. sequences of Bernoulli variables of length $n$. We prove that the order of the standard deviation of $L_n$ is $\sqrt{n}$, provided the parameter of the Bernoulli variables is small enough. This validates Waterman's conjecture in this situation [Philos. Trans. R. Soc. Lond. Ser. B 344 (1994) 383--390]. The order c…

    Submitted 29 July, 2009; originally announced July 2009.

    Comments: Published at http://dx.doi.org/10.1214/08-AOP436 in the Annals of Probability (http://www.imstat.org/aop/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOP-AOP436

    MSC Class: 60K35; 41A25 (Primary); 60C05C (Secondary)

    Journal ref: Annals of Probability 2009, Vol. 37, No. 3, 1192-1235

  29. On the Variance of the Optimal Alignments Score for Binary Random Words and an Asymmetric Scoring Function

    Authors: Christian Houdré, Heinrich Matzinger

    Abstract: We investigate the order of the variance of the optimal alignments score of two independent iid binary random words having the same length. The letters are equiprobable, but the scoring function is such that one letter has a larger score than the other. In this setting, we prove that the order of variance is linear in the common length. Optimal alignments constitute a generalization of longest com…

    Submitted 15 June, 2016; v1 submitted 1 February, 2007; originally announced February 2007.

    Comments: Appeared in Journal of Statistical Physics

    MSC Class: 60K35; 60C05; 05A05

    Journal ref: Journal of Statistical Physics, 2016

  30. Reconstructing a two-color scenery by observing it along a simple random walk path

    Authors: Heinrich Matzinger

    Abstract: Let {ξ(n)}_{n\in Z} be a two-color random scenery, that is, a random coloring of Z in two colors, such that the ξ(i)'s are i.i.d. Bernoulli variables with parameter \tfrac12. Let {S(n)}_{n\in N} be a symmetric random walk starting at 0. Our main result shows that a.s., ξ\circ S (the composition of ξ and S) determines ξ up to translation and reflection. In other words, by observing the scenery ξ alo…

    Submitted 24 March, 2005; originally announced March 2005.

    Comments: Published at http://dx.doi.org/10.1214/105051604000000972 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AAP-AAP066

    MSC Class: 60L37 (Primary) 60G10. (Secondary)

    Journal ref: Annals of Applied Probability 2005, Vol. 15, No. 1B, 778-819

  31. arXiv:math/0410404  [pdf, ps, other]

    math.CO math.PR

    Fluctuations of the Longest Common Subsequence in the Asymmetric Case of 2- and 3-Letter Alphabets

    Authors: F. Bonetto, H. Matzinger

    Abstract: We investigate the asymptotic standard deviation of the Longest Common Subsequence (LCS) of two independent i.i.d. sequences of length n. The first sequence is drawn from a three letter alphabet {0,1,a}, whilst the second sequence is binary. The main result of this article is that in this asymmetric case, the standard deviation of the length of the LCS is of order square root of n. This confirms…

    Submitted 18 October, 2004; originally announced October 2004.

    MSC Class: 60C05 (primary) 92D20 (secondary)