Search | arXiv e-print repository

Dendrogram of mixing measures: Hierarchical clustering and model selection for finite mixture models

Authors: Dat Do, Linh Do, Scott A. McKinley, Jonathan Terhorst, XuanLong Nguyen

Abstract: We present a new way to summarize and select mixture models via the hierarchical clustering tree (dendrogram) constructed from an overfitted latent mixing measure. Our proposed method bridges agglomerative hierarchical clustering and mixture modeling. The dendrogram's construction is derived from the theory of convergence of the mixing measures, and as a result, we can both consistently select the true number of mixing components and obtain the pointwise optimal convergence rate for parameter estimation from the tree, even when the model parameters are only weakly identifiable. In theory, it explicates the choice of the optimal number of clusters in hierarchical clustering. In practice, the dendrogram reveals more information on the hierarchy of subpopulations compared to traditional ways of summarizing mixture models. Several simulation studies are carried out to support our theory. We also illustrate the methodology with an application to single-cell RNA sequence analysis.

Submitted 8 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: 53 pages, 11 figures

arXiv:2111.10841

A linear adjustment based approach to posterior drift in transfer learning

Authors: Subha Maity, Diptavo Dutta, Jonathan Terhorst, Yuekai Sun, Moulinath Banerjee

Abstract: We present a new model and methods for the posterior drift problem where the regression function in the target domain is modeled as a linear adjustment (on an appropriate scale) of that in the source domain, an idea that inherits the simplicity and the usefulness of generalized linear models and accelerated failure time models from the classical statistics literature, and study the theoretical properties of our proposed estimator in the binary classification problem. Our approach is shown to be flexible and applicable in a variety of statistical settings, and can be adopted to transfer learning problems in various domains including epidemiology, genetics and biomedicine. As a concrete application, we illustrate the power of our approach through mortality prediction for British Asians by borrowing strength from similar data from the larger pool of British Caucasians, using the UK Biobank data.

Submitted 12 December, 2021; v1 submitted 21 November, 2021; originally announced November 2021.

arXiv:2008.06664

Exact and arbitrarily accurate non-parametric two-sample tests based on rank spacings

Authors: Dan D. Erdmann-Pham, Jonathan Terhorst, Yun S. Song

Abstract: A common method for deriving non-parametric tests is to reformulate a parametric test in terms of sample ranks. Despite being distribution free (even in finite samples), the resulting tests often display remarkable asymptotic power properties, typically matching the efficiency of their parametric counterpart. Empirically, these favorable power properties have been shown to persist in non-asymptotic regimes as well, prompting the need for finite-sample characterizations of the corresponding rank-based statistics. Here, we provide such characterization for the family of weighted $p$-norms of rank spacings, which includes the classical tests of Mann-Whitney, Dixon, and various generalizations thereof. For $p=1$, we provide exact expressions for the involved distributions, while for $p>1$ we describe the associated moment sequences and derive an algorithm to recover the distributions of interest from these sequences in a fast and stable manner. We use this framework to develop a new family of non-parametric tests mirroring properties of generalized likelihood-ratios, prove new tail bounds for Dixon's and Greenwood's statistics, and prove a previously formulated conjecture regarding the global efficiency of rank-based tests against the $F$-test in the context of scale-families.

Submitted 8 August, 2022; v1 submitted 15 August, 2020; originally announced August 2020.

Comments: 33 pages, 6 figures

arXiv:2003.01640

Explaining Groups of Points in Low-Dimensional Representations

Authors: Gregory Plumb, Jonathan Terhorst, Sriram Sankararaman, Ameet Talwalkar

Abstract: A common workflow in data exploration is to learn a low-dimensional representation of the data, identify groups of points in that representation, and examine the differences between the groups to determine what they represent. We treat this workflow as an interpretable machine learning problem by leveraging the model that learned the low-dimensional representation to help identify the key differences between the groups. To solve this problem, we introduce a new type of explanation, a Global Counterfactual Explanation (GCE), and our algorithm, Transitive Global Translations (TGT), for computing GCEs. TGT identifies the differences between each pair of groups using compressed sensing but constrains those pairwise differences to be consistent among all of the groups. Empirically, we demonstrate that TGT is able to identify explanations that accurately explain the model while being relatively sparse, and that these explanations match real patterns in the data.

Submitted 14 August, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

arXiv:1807.02763

Inference of Population History using Coalescent HMMs: Review and Outlook

Authors: Jeffrey P. Spence, Matthias Steinrücken, Jonathan Terhorst, Yun S. Song

Abstract: Studying how diverse human populations are related is of historical and anthropological interest, in addition to providing a realistic null model for testing for signatures of natural selection or disease associations. Furthermore, understanding the demographic histories of other species is playing an increasingly important role in conservation genetics. A number of statistical methods have been developed to infer population demographic histories using whole-genome sequence data, with recent advances focusing on allowing for more flexible modeling choices, scaling to larger data sets, and increasing statistical power. Here we review coalescent hidden Markov models, a powerful class of population genetic inference methods that can effectively utilize linkage disequilibrium information. We highlight recent advances, give advice for practitioners, point out potential pitfalls, and present possible future research directions.

Submitted 8 July, 2018; originally announced July 2018.

Comments: 12 pages, 2 figures

arXiv:1505.04228

doi 10.1073/pnas.1503717112

Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum

Authors: Jonathan Terhorst, Yun S. Song

Abstract: The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic which is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, currently little is known about the information-theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least $O(1/\log s)$, where $s$ is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number $s$ of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.

Submitted 15 May, 2015; originally announced May 2015.

Comments: 17 pages, 1 figure

Journal ref: Proc. Natl. Acad. Sci. U.S.A., Vol. 112, No. 25 (2015) 7677-7682

arXiv:1503.01133

doi 10.1080/10618600.2016.1159212

Efficient computation of the joint sample frequency spectra for multiple populations

Authors: John A. Kamm, Jonathan Terhorst, Yun S. Song

Abstract: A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences. In particular, recently there has been growing interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. Although much methodological progress has been made, existing SFS-based inference methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable efficient computation of the expected joint SFS for multiple populations related by a complex demographic model with arbitrary population size histories (including piecewise exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study involving tens of populations, we demonstrate our improvements to numerical stability and computational complexity.

Submitted 3 March, 2015; originally announced March 2015.

Comments: 24 pages, 5 figures

arXiv:1409.1458

Communication-Efficient Distributed Dual Coordinate Ascent

Authors: Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, Michael I. Jordan

Abstract: Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that as compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same .001-accurate solution quality on average 25x as quickly.

Submitted 29 September, 2014; v1 submitted 4 September, 2014; originally announced September 2014.

Comments: NIPS 2014 version, including proofs. Published in Advances in Neural Information Processing Systems 27 (NIPS 2014)

MSC Class: 90C25; 68W15

ACM Class: G.1.6; C.1.4

arXiv:1310.8420

SMaSH: A Benchmarking Toolkit for Human Genome Variant Calling

Authors: Ameet Talwalkar, Jesse Liptrap, Julie Newcomb, Christopher Hartl, Jonathan Terhorst, Kristal Curtis, Ma'ayan Bresler, Yun S. Song, Michael I. Jordan, David Patterson

Abstract: Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.

Submitted 5 January, 2014; v1 submitted 31 October, 2013; originally announced October 2013.

arXiv:1102.3177

The Kalmanson Complex

Authors: Jonathan Terhorst

Abstract: Let X be a finite set of cardinality n. The Kalmanson complex K_n is the simplicial complex whose vertices are non-trivial X-splits, and whose facets are maximal circular split systems over X. In this paper we examine K_n from three perspectives. In addition to the T-theoretic description, we show that K_n has a geometric realization as the Kalmanson conditions on a finite metric. A third description arises in terms of binary matrices which possess the circular ones property. We prove the equivalence of these three definitions. This leads to a simplified proof of the well-known equivalence between Kalmanson and circular decomposable metrics, as well as a partial description of the f-vector of K_n.

Submitted 6 March, 2011; v1 submitted 15 February, 2011; originally announced February 2011.

Comments: Improved exposition. 24 pages, 2 figures, 1 table

MSC Class: 05E45

Showing 1–10 of 10 results for author: Terhorst, J