-
Computing Affine Combinations, Distances, and Correlations for Recursive Partition Functions
Authors:
Sean Skwerer,
Heping Zhang
Abstract:
Recursive partitioning is the core of several statistical methods, including CART, random forest, and boosted trees. Despite the popularity of tree-based methods, until now there have been no methods for combining multiple trees into a single tree, or for systematically quantifying the discrepancy between two trees. Taking advantage of the recursive structure of trees, we formulated fast algorithms for computing affine combinations, distances, and correlations in a vector subspace of recursive partition functions.
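To make the three operations named in this abstract concrete, the sketch below works with piecewise-constant functions on [0, 1), a one-dimensional stand-in for fitted trees. The abstract does not describe the paper's algorithms, so everything here is an assumption made for illustration: the class `StepFunction`, the helper names, and the choice of the L2 inner product under the uniform measure. The only idea taken from the abstract is that two such functions can be handled on a common refinement of their partitions.

```python
from bisect import bisect_right
from math import sqrt


class StepFunction:
    """Piecewise-constant function on [0, 1); a hypothetical 1-D stand-in for a tree."""

    def __init__(self, cuts, values):
        # cuts: sorted interior split points; values: one value per cell.
        if len(values) != len(cuts) + 1:
            raise ValueError("need exactly one value per cell")
        self.cuts = list(cuts)
        self.values = list(values)

    def __call__(self, x):
        # Cells are half-open [left, right), so a cut belongs to the cell on its right.
        return self.values[bisect_right(self.cuts, x)]


def common_cells(f, g):
    """Cells (left, right) of the common refinement of the two partitions."""
    edges = sorted({0.0, 1.0, *f.cuts, *g.cuts})
    return list(zip(edges[:-1], edges[1:]))


def combine(a, f, b, g):
    """Return a*f + b*g as one step function (an affine combination when a + b == 1)."""
    cells = common_cells(f, g)
    values = [a * f(lo) + b * g(lo) for lo, _ in cells]
    cuts = [lo for lo, _ in cells[1:]]
    return StepFunction(cuts, values)


def inner(f, g):
    """L2 inner product under the uniform measure on [0, 1) (an assumed choice)."""
    return sum((hi - lo) * f(lo) * g(lo) for lo, hi in common_cells(f, g))


def distance(f, g):
    diff = combine(1.0, f, -1.0, g)
    return sqrt(inner(diff, diff))


def correlation(f, g):
    one = StepFunction([], [1.0])
    mean_f, mean_g = inner(f, one), inner(g, one)
    cov = inner(f, g) - mean_f * mean_g
    var_f = inner(f, f) - mean_f ** 2
    var_g = inner(g, g) - mean_g ** 2
    return cov / sqrt(var_f * var_g)
```

For instance, with `f = StepFunction([0.5], [0.0, 1.0])` and `g = StepFunction([0.25], [0.0, 1.0])`, `combine(0.5, f, 0.5, g)` is a single step function with cuts at 0.25 and 0.5. The correlation is undefined when either function is constant, since its variance is zero.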
Submitted 17 March, 2016; v1 submitted 11 December, 2015;
originally announced December 2015.
-
Dynamic Geodesics in Treespace via Parametric Maximum Flow
Authors:
Sean Skwerer,
Scott Provan
Abstract:
Shortest paths in treespace, which represent minimal deformations between trees, are unique and can be computed in polynomial time. The ability to quickly compute shortest paths has enabled new approaches for statistical analysis of populations of trees and phylogenetic inference. This paper gives a new algorithm for updating geodesic paths when the endpoints are dynamic. Such algorithms will be especially useful when optimizing objectives that are functions of distances from a search point to other points, e.g., for finding a tree that has the minimum average distance to a collection of trees. Our method for updating treespace shortest paths is based on parametric sensitivity analysis of the maximum flow subproblems that are solved when computing a treespace geodesic.
Submitted 9 December, 2015;
originally announced December 2015.
-
Persistent homology analysis of brain artery trees
Authors:
Paul Bendich,
J. S. Marron,
Ezra Miller,
Alex Pieloch,
Sean Skwerer
Abstract:
New representations of tree-structured data objects, using ideas from topological data analysis, enable improved statistical analyses of a population of brain artery trees. A number of representations of each data tree arise from persistence diagrams that quantify branching and looping of vessels at multiple scales. Novel approaches to the statistical analysis, through various summaries of the persistence diagrams, lead to heightened correlations with covariates such as age and sex, relative to earlier analyses of this data set. The correlation with age continues to be significant even after controlling for correlations from earlier significant summaries.
Submitted 24 November, 2014;
originally announced November 2014.
-
Relative Optimality Conditions and Algorithms for Treespace Fréchet Means
Authors:
Sean Skwerer,
Scott Provan,
J. S. Marron
Abstract:
Recent interest in treespaces as well-founded mathematical domains for phylogenetic inference and statistical analysis for populations of anatomical trees has motivated research into efficient and rigorous methods for optimization problems on treespaces. A central problem in this area is computing an average of phylogenetic trees, which is equivalently characterized as the minimizer of the Fréchet function. The Fréchet mean can be used for statistical inference and exploratory data analysis: for example, it can be leveraged as a test statistic to compare groups via permutation tests, or to find trends in data over time via kernel smoothing. By analyzing the differential properties of the Fréchet function along geodesics in treespace, we obtained a theorem describing a decomposition of the derivative along a geodesic. This decomposition theorem is used to formulate optimality conditions, which serve as the logical basis for an algorithm to verify relative optimality at points where the Fréchet function gradient does not exist.
Submitted 16 August, 2017; v1 submitted 16 October, 2014;
originally announced November 2014.
-
Tree Oriented Data Analysis
Authors:
Sean Skwerer
Abstract:
Complex data objects arise in many areas of modern science including evolutionary biology, neuroscience, dynamics of gene expression and medical imaging. Object oriented data analysis (OODA) is the statistical analysis of datasets of complex objects. Data analysis of tree data objects is an exciting research area with interesting questions and challenging problems. This thesis focuses on tree oriented statistical methodologies, and algorithms for solving related mathematical optimization problems.
This research is motivated by the goal of analyzing a data set of images of human brain arteries. The approach we take here is to use a novel representation of brain artery systems as points in phylogenetic treespace. The treespace property of unique global geodesics leads to a notion of geometric center called a Fréchet mean. For a sample of data points, the Fréchet function is the sum of squared distances from a point to the data points, and the Fréchet mean is the minimizer of the Fréchet function.
In this thesis we use properties of the Fréchet function to develop an algorithmic system for computing Fréchet means. Properties of the Fréchet function are also used to show a sticky law of large numbers which describes a surprising stability of the topological tree structure of sample Fréchet means at that of the population Fréchet mean. We also introduce non-parametric regression of brain artery tree structure as a response variable to age based on weighted Fréchet means.
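The verbal definitions in this abstract can be written out as follows. The symbols are assumptions introduced here, not notation from the thesis: $\mathcal{T}$ for treespace, $d$ for its geodesic distance, $x_1, \dots, x_n$ for the data trees, and $w_i(t)$ for nonnegative weights. The abstract says only that the regression is based on weighted Fréchet means, so the dependence of the weights on age $t$ is shown generically, and one common choice would be kernel weights.

```latex
% Fréchet function and Fréchet mean of a sample x_1, ..., x_n in treespace
F(x) = \sum_{i=1}^{n} d(x, x_i)^2 ,
\qquad
\bar{x} = \operatorname*{arg\,min}_{x \in \mathcal{T}} F(x) .

% Weighted version, evaluated at age t (the form of the weights w_i(t) is an assumption)
F_t(x) = \sum_{i=1}^{n} w_i(t)\, d(x, x_i)^2 ,
\qquad
\hat{m}(t) = \operatorname*{arg\,min}_{x \in \mathcal{T}} F_t(x) .
```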
Submitted 22 September, 2014; v1 submitted 18 September, 2014;
originally announced September 2014.
-
Sticky central limit theorems on open books
Authors:
Thomas Hotz,
Sean Skwerer,
Stephan Huckemann,
Huiling Le,
J. S. Marron,
Jonathan C. Mattingly,
Ezra Miller,
James Nolen,
Megan Owen,
Vic Patrangenaru
Abstract:
Given a probability distribution on an open book (a metric space obtained by gluing a disjoint union of copies of a half-space along their boundary hyperplanes), we define a precise concept of when the Fréchet mean (barycenter) is sticky. This nonclassical phenomenon is quantified by a law of large numbers (LLN) stating that the empirical mean eventually almost surely lies on the (codimension $1$ and hence measure $0$) spine that is the glued hyperplane, and a central limit theorem (CLT) stating that the limiting distribution is Gaussian and supported on the spine. We also state versions of the LLN and CLT for the cases where the mean is nonsticky (i.e., not lying on the spine) and partly sticky (i.e., is on the spine but not sticky).
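The sticky, partly sticky, and nonsticky cases can be illustrated on the simplest open book: several half-lines glued at a single point, so that the spine is just that point. The sketch below is an illustration under that simplifying assumption and is not code from the paper. The function name `spider_frechet_mean` and the `(leg, r)` encoding of points are hypothetical. It relies only on the fact that the distance between points on different legs is the sum of their distances to the origin, which makes the sample Fréchet function a quadratic along each leg.

```python
def spider_frechet_mean(points, n_legs):
    """Sample Fréchet mean on half-lines glued at the origin.

    points: list of (leg, r) pairs, with leg in range(n_legs) and r >= 0
            the distance from the origin along that leg.
    Returns (status, leg, r); leg is None when the mean is at the origin.
    """
    n = len(points)
    total = sum(r for _, r in points)
    on_leg = [0.0] * n_legs
    for leg, r in points:
        on_leg[leg] += r

    # Folded first moment for each leg: radii on that leg count positively,
    # radii on every other leg count negatively.
    moments = [2.0 * s - total for s in on_leg]

    # At most one folded moment can be positive; if one is, the Fréchet
    # function is minimized on that leg at distance moment / n.
    for leg, m in enumerate(moments):
        if m > 0:
            return "nonsticky", leg, m / n

    # Otherwise the mean is at the origin. It is labeled sticky when every
    # folded moment is strictly negative and partly sticky when one is zero.
    if any(m == 0 for m in moments):
        return "partly sticky", None, 0.0
    return "sticky", None, 0.0
```

With one point at distance 1 on each of three legs, every folded moment is negative and the mean stays at the origin, even if the points are moved slightly. The exact comparison with zero is fragile in floating point, so a tolerance would be needed for real data.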
Submitted 3 December, 2013; v1 submitted 20 February, 2012;
originally announced February 2012.