Showing 1–50 of 67 results for author: Vogelstein, J

  1. arXiv:2307.13868

    stat.ME cs.LG stat.ML

    Learning sources of variability from high-dimensional observational studies

    Authors: Eric W. Bridgeford, Jaewon Chung, Brian Gilbert, Sambit Panda, Adam Li, Cencheng Shen, Alexandra Badea, Brian Caffo, Joshua T. Vogelstein

    Abstract: Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estiman…

    Submitted 28 November, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

  2. arXiv:2303.04871

    stat.AP

    Discovering a change point and piecewise linear structure in a time series of organoid networks via the iso-mirror

    Authors: Tianyi Chen, Youngser Park, Ali Saad-Eldin, Zachary Lubberts, Avanti Athreya, Benjamin D. Pedigo, Joshua T. Vogelstein, Francesca Puppo, Gabriel A. Silva, Alysson R. Muotri, Weiwei Yang, Christopher M. White, Carey E. Priebe

    Abstract: Recent advancements have been made in the development of cell-based in-vitro neuronal networks, or organoids. In order to better understand the network structure of these organoids, a super-selective algorithm has been proposed for inferring the effective connectivity networks from multi-electrode array data. In this paper, we apply a novel statistical method called spectral mirror estimation to t…

    Submitted 12 April, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

  3. arXiv:2302.14186

    eess.SP cs.LG stat.AP stat.ME stat.ML

    Approximately optimal domain adaptation with Fisher's Linear Discriminant

    Authors: Hayden S. Helm, Ashwin De Silva, Joshua T. Vogelstein, Carey E. Priebe, Weiwei Yang

    Abstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propos…

    Submitted 1 March, 2024; v1 submitted 27 February, 2023; originally announced February 2023.

  4. arXiv:2208.10967

    cs.LG cs.AI cs.CV stat.ML

    The Value of Out-of-Distribution Data

    Authors: Ashwin De Silva, Rahul Ramesh, Carey E. Priebe, Pratik Chaudhari, Joshua T. Vogelstein

    Abstract: We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task imp…

    Submitted 13 July, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

    Comments: Previous versions of this work have been presented at the Out-of-Distribution Generalization in Computer Vision (OOD-CV) Workshop (ECCV 2022) and the Workshop on Distribution Shifts (NeurIPS 2022)

    Journal ref: Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7366-7389, 2023

  5. arXiv:2201.13001

    cs.LG cs.AI cs.DS q-bio.NC stat.ML

    Deep Discriminative to Kernel Density Graph for In- and Out-of-distribution Calibrated Inference

    Authors: Jayanta Dey, Haoyin Xu, Will LeVine, Ashwin De Silva, Tyler M. Tomita, Ali Geisa, Tiffany Chu, Jacob Desman, Joshua T. Vogelstein

    Abstract: Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distr…

    Submitted 7 June, 2024; v1 submitted 31 January, 2022; originally announced January 2022.

  6. arXiv:2111.05366

    stat.ML cs.LG math.CO

    Graph Matching via Optimal Transport

    Authors: Ali Saad-Eldin, Benjamin D. Pedigo, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The graph matching problem seeks to find an alignment between the nodes of two graphs that minimizes the number of adjacency disagreements. Solving the graph matching problem is increasingly important due to its applications in operations research, computer vision, neuroscience, and more. However, current state-of-the-art algorithms are inefficient in matching very large graphs, though they produce good…

    Submitted 9 November, 2021; originally announced November 2021.

  7. arXiv:2109.14501

    stat.ML cs.AI cs.LG

    Towards a theory of out-of-distribution learning

    Authors: Jayanta Dey, Ali Geisa, Ronak Mehta, Tyler M. Tomita, Hayden S. Helm, Haoyin Xu, Eric Eaton, Jeffery Dick, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Learning is a process wherein a learning agent enhances its performance through exposure to experience or data. Throughout this journey, the agent may encounter diverse learning environments. For example, data may be presented to the learner all at once, in multiple batches, or sequentially. Furthermore, the distribution of each data sample could be either identical and independent (iid) or non-iid…

    Submitted 7 June, 2024; v1 submitted 29 September, 2021; originally announced September 2021.

  8. arXiv:2108.13637

    cs.LG cs.AI q-bio.NC stat.ML

    When are Deep Networks really better than Decision Forests at small sample sizes, and how?

    Authors: Haoyin Xu, Kaleab A. Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M. White, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies…

    Submitted 2 November, 2021; v1 submitted 31 August, 2021; originally announced August 2021.

  9. arXiv:2107.11732

    cs.LG econ.EM q-bio.QM stat.ME

    Federated Causal Inference in Heterogeneous Observational Data

    Authors: Ruoxuan Xiong, Allison Koenecke, Michael Powell, Zhu Shen, Joshua T. Vogelstein, Susan Athey

    Abstract: We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inference on the…

    Submitted 2 April, 2023; v1 submitted 25 July, 2021; originally announced July 2021.

  10. arXiv:2104.00641

    stat.ML cs.LG

    Dynamic Silos: Increased Modularity in Intra-organizational Communication Networks during the Covid-19 Pandemic

    Authors: Tiona Zuzul, Emily Cox Pahnke, Jonathan Larson, Patrick Bourke, Nicholas Caurvina, Neha Parikh Shah, Fereshteh Amini, Jeffrey Weston, Youngser Park, Joshua Vogelstein, Christopher White, Carey E. Priebe

    Abstract: Workplace communications around the world were drastically altered by Covid-19, related work-from-home orders, and the rise of remote work. To understand these shifts, we analyzed aggregated, anonymized metadata from over 360 billion emails within 4,361 organizations worldwide. By comparing month-to-month and year-over-year metrics, we examined changes in network community structures over 24 month…

    Submitted 28 July, 2023; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: 48 pages, 15 figures

  11. arXiv:2011.14990

    q-bio.NC stat.ME

    Discovery of Multi-Level Network Differences Across Populations of Heterogeneous Connectomes

    Authors: Vivek Gopalakrishnan, Jaewon Chung, Eric Bridgeford, Benjamin D. Pedigo, Jesús Arroyo, Lucy Upchurch, G. Allan Johnson, Nian Wang, Youngser Park, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: A connectome is a map of the structural and/or functional connections in the brain. This information-rich representation has the potential to transform our understanding of the relationship between patterns in brain connectivity and neurological processes, disorders, and diseases. However, existing computational techniques used to analyze connectomes are oftentimes insufficient for interrogating m…

    Submitted 13 April, 2022; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: 29 pages, 12 figures

  12. arXiv:2011.06557

    stat.ML cs.LG stat.ME

    A partition-based similarity for classification distributions

    Authors: Hayden S. Helm, Ronak D. Mehta, Brandon Duderstadt, Weiwei Yang, Christoper M. White, Ali Geisa, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Herein we define a measure of similarity between classification distributions that is both principled from the perspective of statistical pattern recognition and useful from the perspective of machine learning practitioners. In particular, we propose a novel similarity on classification distributions, dubbed task similarity, that quantifies how an optimally-transformed optimal representation for a…

    Submitted 12 November, 2020; originally announced November 2020.

  13. arXiv:2008.10055

    stat.ME

    Multiple Network Embedding for Anomaly Detection in Time Series of Graphs

    Authors: Guodong Chen, Jesús Arroyo, Avanti Athreya, Joshua Cape, Joshua T. Vogelstein, Youngser Park, Chris White, Jonathan Larson, Weiwei Yang, Carey E. Priebe

    Abstract: This paper considers the graph signal processing problem of anomaly detection in time series of graphs. We examine two related, complementary inference tasks: the detection of anomalous graphs within a time series, and the detection of temporally anomalous vertices. We approach these tasks via the adaptation of statistically principled methods for joint graph inference, specifically multiple…

    Submitted 10 March, 2024; v1 submitted 23 August, 2020; originally announced August 2020.

    Comments: 51 pages, 17 figures

  14. arXiv:2007.13843

    stat.ML cs.IR cs.LG cs.SI

    Robust Similarity and Distance Learning via Decision Forests

    Authors: Tyler M. Tomita, Joshua T. Vogelstein

    Abstract: Canonical distances such as Euclidean distance often fail to capture the appropriate relationships between items, subsequently leading to subpar inference and prediction. Many algorithms have been proposed for automated learning of suitable distances, most of which employ linear methods to learn a global metric over the feature space. While such methods offer nice theoretical properties, interpret…

    Submitted 21 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Submitted to NeurIPS 2020

  15. arXiv:2007.03611

    physics.soc-ph stat.OT

    P-Values in a Post-Truth World

    Authors: Joshua T. Vogelstein

    Abstract: The role of statisticians in society is to provide tools, techniques, and guidance with regards to how much to trust data. This role is increasingly more important with more data and more misinformation than ever before. The American Statistical Association recently released two statements on p-values, and provided four guiding principles. We evaluate their claims using these principles and find t…

    Submitted 5 July, 2020; originally announced July 2020.

    Comments: 10 pages

  16. arXiv:2005.11911

    stat.AP math.ST

    Statistical Analysis of Data Repeatability Measures

    Authors: Zeyi Wang, Eric Bridgeford, Shangsi Wang, Joshua T. Vogelstein, Brian Caffo

    Abstract: The advent of modern data collection and processing techniques has seen the size, scale, and complexity of data grow exponentially. A seminal step in leveraging these rich datasets for downstream inference is understanding the characteristics of the data which are repeatable -- the aspects of the data that are able to be identified under a duplicated analysis. Conflictingly, the utility of traditi…

    Submitted 20 August, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

  17. arXiv:2005.11890

    stat.ML cs.LG stat.CO

    mvlearn: Multiview Machine Learning in Python

    Authors: Ronan Perry, Gavin Mischler, Richard Guo, Theodore Lee, Alexander Chang, Arman Koul, Cameron Franz, Hugo Richard, Iain Carmichael, Pierre Ablin, Alexandre Gramfort, Joshua T. Vogelstein

    Abstract: As data are generated more and more from multiple disparate sources, multiview data sets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that…

    Submitted 25 May, 2021; v1 submitted 24 May, 2020; originally announced May 2020.

    Comments: 6 pages, 2 figures, 1 table

  18. arXiv:2005.10700

    cs.LG cs.IR stat.ML

    Distance-based Positive and Unlabeled Learning for Ranking

    Authors: Hayden S. Helm, Amitabh Basu, Avanti Athreya, Youngser Park, Joshua T. Vogelstein, Carey E. Priebe, Michael Winding, Marta Zlatic, Albert Cardona, Patrick Bourke, Jonathan Larson, Marah Abdin, Piali Choudhury, Weiwei Yang, Christopher W. White

    Abstract: Learning to rank -- producing a ranked list of items specific to a query and with respect to a set of supervisory items -- is a problem of general interest. The setting we consider is one in which no analytic description of what constitutes a good ranking is available. Instead, we have a collection of representations and supervisory information consisting of a (target item, interesting items set)…

    Submitted 28 September, 2022; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 21 pages, 5 figures

  19. arXiv:2004.12908

    cs.AI cs.LG stat.ML

    A Simple Lifelong Learning Approach

    Authors: Joshua T. Vogelstein, Jayanta Dey, Hayden S. Helm, Will LeVine, Ronak D. Mehta, Tyler M. Tomita, Haoyin Xu, Ali Geisa, Qingyang Wang, Gido M. van de Ven, Chenyu Gao, Weiwei Yang, Bryan Tower, Jonathan Larson, Christopher M. White, Carey E. Priebe

    Abstract: In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain perf…

    Submitted 11 June, 2024; v1 submitted 27 April, 2020; originally announced April 2020.

  20. arXiv:1912.12150

    stat.ML cs.LG math.ST stat.ME

    The Chi-Square Test of Distance Correlation

    Authors: Cencheng Shen, Sambit Panda, Joshua T. Vogelstein

    Abstract: Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if the variables are independent, making it an ideal choice to discover any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depe…

    Submitted 14 May, 2021; v1 submitted 27 December, 2019; originally announced December 2019.

    Comments: 21 pages, 4 figures, 1 table

    Journal ref: Journal of Computational and Graphical Statistics 31(1), 254-262, 2022

  21. Valid Two-Sample Graph Testing via Optimal Transport Procrustes and Multiscale Graph Correlation with Applications in Connectomics

    Authors: Jaewon Chung, Bijan Varjavand, Jesus Arroyo, Anton Alyakin, Joshua Agterberg, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Testing whether two graphs come from the same distribution is of interest in many real world scenarios, including brain network analysis. Under the random dot product graph model, the nonparametric hypothesis testing framework consists of embedding the graphs using the adjacency spectral embedding (ASE), followed by aligning the embeddings using the median flip heuristic, and finally applying the…

    Submitted 13 September, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

    Comments: 12 pages, 3 figures

  22. arXiv:1910.08883

    stat.ML cs.LG

    High-dimensional and universally consistent k-sample tests

    Authors: Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The k-sample testing problem involves determining whether $k$ groups of data points are each drawn from the same distribution. The standard method for k-sample testing in biomedicine is multivariate analysis of variance (MANOVA), even though it depends on strong, and often unsuitable, parametric assumptions. Moreover, independence testing and k-sample testing are closely related, and several univ…

    Submitted 11 October, 2023; v1 submitted 19 October, 2019; originally announced October 2019.

  23. arXiv:1909.11799

    cs.LG stat.ML

    Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks

    Authors: Adam Li, Ronan Perry, Chester Huynh, Tyler M. Tomita, Ronak Mehta, Jesus Arroyo, Jesse Patsolic, Benjamin Falk, Joshua T. Vogelstein

    Abstract: Decision forests (Forests), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, Forests dominate other methods in tabular data, that is, when the feature space is unstructured, so that the signal is invariant to a permutation of the feature indices. However, in structur…

    Submitted 5 September, 2022; v1 submitted 25 September, 2019; originally announced September 2019.

    Comments: Updated manuscript based on review at SIMODS

    MSC Class: 68T05

  24. arXiv:1909.02688

    cs.LG stat.ML

    AutoGMM: Automatic and Hierarchical Gaussian Mixture Modeling in Python

    Authors: Thomas L. Athey, Tingshan Liu, Benjamin D. Pedigo, Joshua T. Vogelstein

    Abstract: Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these p…

    Submitted 12 August, 2021; v1 submitted 5 September, 2019; originally announced September 2019.

  25. arXiv:1908.06486

    stat.ML cs.LG stat.ME

    Independence Testing for Temporal Data

    Authors: Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, Joshua T. Vogelstein

    Abstract: Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been prop…

    Submitted 27 May, 2024; v1 submitted 18 August, 2019; originally announced August 2019.

    Comments: 19 pages main + 6 pages appendix

    Journal ref: Transactions on Machine Learning Research, 2024

  26. arXiv:1907.02844

    stat.ML cs.IR cs.LG stat.ME

    Geodesic Learning via Unsupervised Decision Forests

    Authors: Meghana Madhyastha, Percy Li, James Browne, Veronika Strnadova-Neeley, Carey E. Priebe, Randal Burns, Joshua T. Vogelstein

    Abstract: Geodesic distance is the shortest path between two points in a Riemannian manifold. Manifold learning algorithms, such as Isomap, seek to learn a manifold that preserves geodesic distances. However, such methods operate on the ambient dimensionality, and are therefore fragile to noise dimensions. We developed an unsupervised random forest method (URerF) to approximately learn geodesic distances in…

    Submitted 5 July, 2019; originally announced July 2019.

  27. arXiv:1907.02088

    stat.CO cs.MS stat.ME stat.ML

    hyppo: A Multivariate Hypothesis Testing Python Package

    Authors: Sambit Panda, Satish Palaniappan, Junhao Xiong, Eric W. Bridgeford, Ronak Mehta, Cencheng Shen, Joshua T. Vogelstein

    Abstract: We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state of the art multivariate testing procedures. The package is easy-to-use and is flexible eno…

    Submitted 1 April, 2021; v1 submitted 3 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure

  28. arXiv:1907.00325

    cs.LG stat.ML

    Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities

    Authors: Ronan Perry, Ronak Mehta, Richard Guo, Eva Yezerets, Jesús Arroyo, Mike Powell, Hayden Helm, Cencheng Shen, Joshua T. Vogelstein

    Abstract: Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Current widely used approaches for computing such quantities rely on nearest neighbor methods and exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when…

    Submitted 5 October, 2021; v1 submitted 30 June, 2019; originally announced July 2019.

  29. arXiv:1906.10026

    stat.ME cs.SI math.ST

    Inference for multiple heterogeneous networks with a common invariant subspace

    Authors: Jesús Arroyo, Avanti Athreya, Joshua Cape, Guodong Chen, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to…

    Submitted 22 August, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

  30. arXiv:1906.03661

    stat.ME stat.AP

    Community Correlations and Testing Independence Between Binary Graphs

    Authors: Cencheng Shen, Jesús Arroyo, Junhao Xiong, Joshua T. Vogelstein

    Abstract: Graph data has a unique structure that deviates from standard data assumptions, often necessitating modifications to existing methods or the development of new ones to ensure valid statistical analysis. In this paper, we explore the notion of correlation and dependence between two binary graphs. Given vertex communities, we propose community correlations to measure the edge association, which equa…

    Submitted 8 July, 2024; v1 submitted 9 June, 2019; originally announced June 2019.

  31. arXiv:1906.02881

    stat.ML cs.LG cs.SI stat.ME

    Vertex Classification on Weighted Networks

    Authors: Hayden Helm, Joshua Vogelstein, Carey Priebe

    Abstract: This paper proposes a discrimination technique for vertices in a weighted network. We assume that the edge weights and adjacencies in the network are conditionally independent and that both sources of information encode class membership information. In particular, we introduce an edge weight distribution matrix to the standard K-Block Stochastic Block Model to model weighted networks. This allows u…

    Submitted 6 June, 2019; originally announced June 2019.

    Comments: 11 pages

  32. arXiv:1904.05329

    cs.SI stat.ML stat.OT

    GraSPy: Graph Statistics in Python

    Authors: Jaewon Chung, Benjamin D. Pedigo, Eric W. Bridgeford, Bijan K. Varjavand, Hayden S. Helm, Joshua T. Vogelstein

    Abstract: We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from the Python Package Index (PyPI), and is released under the Apache 2.0 open-source license. The…

    Submitted 14 August, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

    Journal ref: Journal of Machine Learning Research 20.158 (2019): 1-7

  33. arXiv:1812.00029

    stat.ML cs.LG

    Learning Interpretable Characteristic Kernels via Decision Forests

    Authors: Sambit Panda, Cencheng Shen, Joshua T. Vogelstein

    Abstract: Decision forests are widely used for classification and regression tasks. A lesser known property of tree-based methods is that one can construct a proximity matrix from the tree(s), and these proximity matrices are induced kernels. While there has been extensive research on the applications and properties of kernels, there is relatively little research on kernels induced by decision forests. We c…

    Submitted 28 September, 2023; v1 submitted 30 November, 2018; originally announced December 2018.

  34. On a 'Two Truths' Phenomenon in Spectral Graph Clustering

    Authors: Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Minh Tang, Avanti Athreya, Joshua Cape, Eric Bridgeford

    Abstract: Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical resu…

    Submitted 11 February, 2019; v1 submitted 23 August, 2018; originally announced August 2018.

    Journal ref: PNAS 116 (2019) 5995-6000

  35. The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing

    Authors: Cencheng Shen, Joshua T. Vogelstein

    Abstract: Distance-based tests, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community. Kernel-based tests, developed from "kernel mean embeddings", are leading methods for two-sample and independence tests from the machine learning community. A fixed-point transformation was previously proposed to connect the distance methods and kernel meth…

    Submitted 14 September, 2020; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: 24 pages main + 7 pages appendix, 3 figures

    Journal ref: AStA Advances in Statistical Analysis 105(3), 385-403, 2021

  36. Discovering the Signal Subgraph: An Iterative Screening Approach on Graphs

    Authors: Cencheng Shen, Shangsi Wang, Alexandra Badea, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Supervised learning on graphs is a challenging task due to the high dimensionality and inherent structural dependencies in the data, where each edge depends on a pair of vertices. Existing conventional methods are designed for standard Euclidean data and do not account for the structural information inherent in graphs. In this paper, we propose an iterative vertex screening method to achieve dimen…

    Submitted 21 June, 2024; v1 submitted 23 January, 2018; originally announced January 2018.

    Comments: 8 pages main + 3 pages appendix

    Journal ref: Pattern Recognition Letters 184, 97-102, 2024

  37. arXiv:1710.09859

    stat.ML cs.CV cs.DS cs.LG math.ST

    Kernel k-Groups via Hartigan's Method

    Authors: Guilherme França, Maria L. Rizzo, Joshua T. Vogelstein

    Abstract: Energy statistics was proposed by Székely in the 1980s, inspired by Newton's gravitational potential in classical mechanics, and it provides a model-free hypothesis test for equality of distributions. In its original form, energy statistics was formulated in Euclidean spaces. More recently, it was generalized to metric spaces of negative type. In this paper, we consider a formulation for the clust…

    Submitted 11 June, 2020; v1 submitted 26 October, 2017; originally announced October 2017.

    Comments: several improvements; connections with community detection and stochastic block model. Matches published version

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

  38. From Distance Correlation to Multiscale Graph Correlation

    Authors: Cencheng Shen, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation --- a correlation measure that was recently proposed and shown to be universally consistent for depen…

    Submitted 30 September, 2018; v1 submitted 26 October, 2017; originally announced October 2017.

    Comments: 39 pages + Appendix 22 pages, 6 figures

    Journal ref: Journal of the American Statistical Association 115(529), 280-291, 2020

  39. arXiv:1709.05454

    stat.ME math.ST stat.ML

    Statistical inference on random dot product graphs: a survey

    Authors: Avanti Athreya, Donniell E. Fishkind, Keith Levin, Vince Lyzinski, Youngser Park, Yichen Qin, Daniel L. Sussman, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graph…

    Submitted 16 September, 2017; originally announced September 2017.

    Comments: An expository survey paper on a comprehensive paradigm for inference for random dot product graphs, centered on graph adjacency and Laplacian spectral embeddings. Paper outlines requisite background; summarizes theory, methodology, and applications from previous and ongoing work; and closes with a discussion of several open problems

    MSC Class: 62FXX; 62GXX; 62HXX; 05CXX

    Journal ref: Journal of Machine Learning Research, 2018

  40. arXiv:1709.01233  [pdf, other]

    stat.ML

    Supervised Dimensionality Reduction for Big Data

    Authors: Joshua T. Vogelstein, Eric Bridgeford, Minh Tang, Da Zheng, Christopher Douville, Randal Burns, Mauro Maggioni

    Abstract: To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation tha…

    Submitted 23 January, 2021; v1 submitted 5 September, 2017; originally announced September 2017.

    Comments: 6 figures

  41. arXiv:1707.03487  [pdf, other]

    stat.ME

    Robust Estimation from Multiple Graphs under Gross Error Contamination

    Authors: Runze Tang, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Estimation of graph parameters based on a collection of graphs is essential for a wide range of graph inference tasks. In practice, weighted graphs are generally observed with edge contamination. We consider a weighted latent position graph model contaminated via an edge weight gross error model and propose an estimation methodology based on robust Lq estimation followed by low-rank adjacency spec…

    Submitted 11 July, 2017; originally announced July 2017.

  42. arXiv:1705.03297  [pdf, other]

    stat.ML

    Semiparametric spectral modeling of the Drosophila connectome

    Authors: Carey E. Priebe, Youngser Park, Minh Tang, Avanti Athreya, Vince Lyzinski, Joshua T. Vogelstein, Yichen Qin, Ben Cocanougher, Katharina Eichler, Marta Zlatic, Albert Cardona

    Abstract: We present semiparametric spectral modeling of the complete larval Drosophila mushroom body connectome. Motivated by a thorough exploratory data analysis of the network via Gaussian mixture modeling (GMM) in the adjacency spectral embedding (ASE) representation space, we introduce the latent structure model (LSM) for network modeling and inference. LSM is a generalization of the stochastic block m…

    Submitted 9 May, 2017; originally announced May 2017.

  43. Network Dependence Testing via Diffusion Maps and Distance-Based Correlations

    Authors: Youjin Lee, Cencheng Shen, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Deciphering the associations between network connectivity and nodal attributes is one of the core problems in network science. The dependency structure and high-dimensionality of networks pose unique challenges to traditional dependency tests in terms of theoretical guarantees and empirical performance. We propose an approach to test network dependence via diffusion maps and distance-based correla…

    Submitted 14 February, 2019; v1 submitted 29 March, 2017; originally announced March 2017.

    Journal ref: Biometrika 106(4), 857-873, 2019

  44. arXiv:1703.03862  [pdf, other]

    stat.AP cs.LG stat.ML

    Joint Embedding of Graphs

    Authors: Shangsi Wang, Jesús Arroyo, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric…

    Submitted 17 October, 2019; v1 submitted 10 March, 2017; originally announced March 2017.

  45. Discovering and Deciphering Relationships Across Disparate Data Modalities

    Authors: Joshua T. Vogelstein, Eric Bridgeford, Qing Wang, Carey E. Priebe, Mauro Maggioni, Cencheng Shen

    Abstract: Understanding the relationships between different properties of data, such as whether a connectome or genome has information about disease status, is becoming increasingly important in modern biological datasets. While existing approaches can test whether two properties are related, they often require unfeasibly large sample sizes in real data scenarios, and do not provide any insight into how or…

    Submitted 6 December, 2018; v1 submitted 16 September, 2016; originally announced September 2016.

    Journal ref: eLife 8, e41690, 2019

  46. arXiv:1609.01672  [pdf, other]

    stat.ME stat.ML

    Connectome Smoothing via Low-rank Approximations

    Authors: Runze Tang, Michael Ketcha, Alexandra Badea, Evan D. Calabrese, Daniel S. Margulies, Joshua T. Vogelstein, Carey E. Priebe, Daniel L. Sussman

    Abstract: In statistical connectomics, the quantitative study of brain networks, estimating the mean of a population of graphs based on a sample is a core problem. Often, this problem is especially difficult because the sample or cohort size is relatively small, sometimes even a single subject. While using the element-wise sample mean of the adjacency matrices is a common approach, this method does not expl…

    Submitted 6 December, 2018; v1 submitted 6 September, 2016; originally announced September 2016.

    Comments: 43 pages, 12 figures

  47. arXiv:1509.03927  [pdf, other]

    stat.ME

    An M-Estimator for Reduced-Rank High-Dimensional Linear Dynamical System Identification

    Authors: Shaojie Chen, Kai Liu, Yuguang Yang, Yuting Xu, Seonjoo Lee, Martin Lindquist, Brian S. Caffo, Joshua T. Vogelstein

    Abstract: High-dimensional time-series data are becoming increasingly abundant across a wide variety of domains, spanning economics, neuroscience, particle physics, and cosmology. Fitting statistical models to such data, to enable parameter estimation and time-series prediction, is an important computational primitive. Existing methods, however, are unable to cope with the high-dimensional nature of these p…

    Submitted 13 September, 2015; originally announced September 2015.

  48. arXiv:1508.05414  [pdf, other]

    stat.AP q-bio.NC

    Stability and Localization of inter-individual differences in functional connectivity

    Authors: Raag D. Airan, Joshua T. Vogelstein, Jay J. Pillai, Brian Caffo, James J. Pekar, Haris I. Sair

    Abstract: Much recent attention has been paid to quantifying anatomic and functional neuroimaging on the individual subject level. For optimal individual subject characterization, specific acquisition and analysis features need to be identified that maximize inter-individual variability while concomitantly minimizing intra-subject variability. Here we develop a non-parametric statistical metric that quantif…

    Submitted 11 May, 2016; v1 submitted 21 August, 2015; originally announced August 2015.

    Comments: 14 pages, 5 figures

  49. arXiv:1507.08376  [pdf, other]

    stat.AP q-bio.NC

    A Joint Graph Inference Case Study: the C.elegans Chemical and Electrical Connectomes

    Authors: Li Chen, Joshua T. Vogelstein, Vince Lyzinski, Carey E. Priebe

    Abstract: We investigate joint graph inference for the chemical and electrical connectomes of the Caenorhabditis elegans roundworm. The C.elegans connectomes consist of 253 non-isolated neurons with known functional attributes, and there are two types of synaptic connectomes, resulting in a pair of graphs. We formulate our joint graph inference from the perspectives of seeded graph match…

    Submitted 5 August, 2015; v1 submitted 30 July, 2015; originally announced July 2015.

  50. arXiv:1506.03410  [pdf, other]

    stat.ML cs.LG

    Sparse Projection Oblique Randomer Forests

    Authors: Tyler M. Tomita, James Browne, Cencheng Shen, Jaewon Chung, Jesse L. Patsolic, Benjamin Falk, Jason Yim, Carey E. Priebe, Randal Burns, Mauro Maggioni, Joshua T. Vogelstein

    Abstract: Decision forests, including Random Forests and Gradient Boosting Trees, have recently demonstrated state-of-the-art performance in a variety of machine learning settings. Decision forests are typically ensembles of axis-aligned decision trees; that is, trees that split only along feature dimensions. In contrast, many recent extensions to decision forests are based on axis-oblique splits. Unfortuna…

    Submitted 3 October, 2019; v1 submitted 10 June, 2015; originally announced June 2015.

    Comments: 31 pages; submitted to Journal of Machine Learning Research for review

    MSC Class: 68T10

    ACM Class: I.5.2

    Journal ref: Journal of Machine Learning Research 21(104), 1-39, 2020