Showing 1–50 of 65 results for author: Vogelstein, J

Searching in archive cs.
  1. arXiv:2307.13868  [pdf, other]

    stat.ME cs.LG stat.ML

    Learning sources of variability from high-dimensional observational studies

    Authors: Eric W. Bridgeford, Jaewon Chung, Brian Gilbert, Sambit Panda, Adam Li, Cencheng Shen, Alexandra Badea, Brian Caffo, Joshua T. Vogelstein

    Abstract: Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estiman…

    Submitted 28 November, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

  2. arXiv:2303.17589  [pdf, other]

    cs.LG cs.CV cs.NE q-bio.NC

    Polarity is all you need to learn and transfer faster

    Authors: Qingyang Wang, Michael A. Powell, Ali Geisa, Eric W. Bridgeford, Joshua T. Vogelstein

    Abstract: Natural intelligences (NIs) thrive in a dynamic world - they learn quickly, sometimes with only a few samples. In contrast, artificial intelligences (AIs) typically learn with a prohibitive number of training samples and computational power. What design principle difference between NI and AI could contribute to such a discrepancy? Here, we investigate the role of weight polarity: development proce…

    Submitted 30 May, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: ICML camera-ready

  3. arXiv:2302.14186  [pdf, other]

    eess.SP cs.LG stat.AP stat.ME stat.ML

    Approximately optimal domain adaptation with Fisher's Linear Discriminant

    Authors: Hayden S. Helm, Ashwin De Silva, Joshua T. Vogelstein, Carey E. Priebe, Weiwei Yang

    Abstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propos…

    Submitted 1 March, 2024; v1 submitted 27 February, 2023; originally announced February 2023.

  4. arXiv:2208.10967  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    The Value of Out-of-Distribution Data

    Authors: Ashwin De Silva, Rahul Ramesh, Carey E. Priebe, Pratik Chaudhari, Joshua T. Vogelstein

    Abstract: We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task imp…

    Submitted 13 July, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

    Comments: Previous versions of this work have been presented at the Out-of-Distribution Generalization in Computer Vision (OOD-CV) Workshop (ECCV 2022) and the Workshop on Distribution Shifts (NeurIPS 2022)

    Journal ref: Proceedings of the 40th International Conference on Machine Learning, PMLR 202:7366-7389, 2023

  5. arXiv:2208.03211  [pdf, other]

    cs.LG cs.AI cs.NE

    Why do networks have inhibitory/negative connections?

    Authors: Qingyang Wang, Michael A. Powell, Ali Geisa, Eric Bridgeford, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Why do brains have inhibitory connections? Why do deep networks have negative weights? We propose an answer from the perspective of representation capacity. We believe representing functions is the primary role of both (i) the brain in natural intelligence, and (ii) deep networks in artificial intelligence. Our answer to why there are inhibitory/negative weights is: to learn more functions. We pro…

    Submitted 17 August, 2023; v1 submitted 5 August, 2022; originally announced August 2022.

    Comments: ICCV2023 camera-ready

  6. arXiv:2201.13001  [pdf, other]

    cs.LG cs.AI cs.DS q-bio.NC stat.ML

    Deep Discriminative to Kernel Density Graph for In- and Out-of-distribution Calibrated Inference

    Authors: Jayanta Dey, Haoyin Xu, Will LeVine, Ashwin De Silva, Tyler M. Tomita, Ali Geisa, Tiffany Chu, Jacob Desman, Joshua T. Vogelstein

    Abstract: Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distr…

    Submitted 7 June, 2024; v1 submitted 31 January, 2022; originally announced January 2022.

  7. arXiv:2201.07372  [pdf, other]

    cs.LG cs.AI

    Prospective Learning: Principled Extrapolation to the Future

    Authors: Ashwin De Silva, Rahul Ramesh, Lyle Ungar, Marshall Hussain Shuler, Noah J. Cowan, Michael Platt, Chen Li, Leyla Isik, Seung-Eon Roh, Adam Charles, Archana Venkataraman, Brian Caffo, Javier J. How, Justus M Kebschull, John W. Krakauer, Maxim Bichuch, Kaleab Alemayehu Kinfu, Eva Yezerets, Dinesh Jayaraman, Jong M. Shin, Soledad Villar, Ian Phillips, Carey E. Priebe, Thomas Hartung, Michael I. Miller, et al. (18 additional authors not shown)

    Abstract: Learning is a process which can update decision rules, based on past experience, such that future performance improves. Traditionally, machine learning is often evaluated under the assumption that the future will be identical to the past in distribution or change adversarially. But these assumptions can be either too optimistic or pessimistic for many problems in the real world. Real world scenari…

    Submitted 13 July, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

    Comments: Accepted at the 2nd Conference on Lifelong Learning Agents (CoLLAs), 2023

  8. arXiv:2111.05366  [pdf, other]

    stat.ML cs.LG math.CO

    Graph Matching via Optimal Transport

    Authors: Ali Saad-Eldin, Benjamin D. Pedigo, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The graph matching problem seeks to find an alignment between the nodes of two graphs that minimizes the number of adjacency disagreements. Solving the graph matching problem is increasingly important due to its applications in operations research, computer vision, neuroscience, and more. However, current state-of-the-art algorithms are inefficient in matching very large graphs, though they produce good…

    Submitted 9 November, 2021; originally announced November 2021.

  9. arXiv:2110.08483  [pdf, other]

    cs.LG cs.AI cs.DS

    Simplest Streaming Trees

    Authors: Haoyin Xu, Jayanta Dey, Sambit Panda, Joshua T. Vogelstein

    Abstract: Decision forests, including random forests and gradient boosting trees, remain the leading machine learning methods for many real-world data problems, especially on tabular data. However, most of the current implementations only operate in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this lim…

    Submitted 24 October, 2023; v1 submitted 16 October, 2021; originally announced October 2021.

  10. arXiv:2109.14501  [pdf, other]

    stat.ML cs.AI cs.LG

    Towards a theory of out-of-distribution learning

    Authors: Jayanta Dey, Ali Geisa, Ronak Mehta, Tyler M. Tomita, Hayden S. Helm, Haoyin Xu, Eric Eaton, Jeffery Dick, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: Learning is a process wherein a learning agent enhances its performance through exposure to experience or data. Throughout this journey, the agent may encounter diverse learning environments. For example, data may be presented to the learner all at once, in multiple batches, or sequentially. Furthermore, the distribution of each data sample could be either independent and identically distributed (iid) or non-iid…

    Submitted 7 June, 2024; v1 submitted 29 September, 2021; originally announced September 2021.

  11. arXiv:2108.13637  [pdf, other]

    cs.LG cs.AI q-bio.NC stat.ML

    When are Deep Networks really better than Decision Forests at small sample sizes, and how?

    Authors: Haoyin Xu, Kaleab A. Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M. White, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies…

    Submitted 2 November, 2021; v1 submitted 31 August, 2021; originally announced August 2021.

  12. arXiv:2107.11732  [pdf, other]

    cs.LG econ.EM q-bio.QM stat.ME

    Federated Causal Inference in Heterogeneous Observational Data

    Authors: Ruoxuan Xiong, Allison Koenecke, Michael Powell, Zhu Shen, Joshua T. Vogelstein, Susan Athey

    Abstract: We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inference on the…

    Submitted 2 April, 2023; v1 submitted 25 July, 2021; originally announced July 2021.

  13. arXiv:2106.02701  [pdf, other]

    cs.CV

    Hidden Markov Modeling for Maximum Likelihood Neuron Reconstruction

    Authors: Thomas L. Athey, Daniel J. Tward, Ulrich Mueller, Joshua T. Vogelstein, Michael I. Miller

    Abstract: Recent advances in brain clearing and imaging have made it possible to image entire mammalian brains at sub-micron resolution. These images offer the potential to assemble brain-wide atlases of neuron morphology, but manual neuron reconstruction remains a bottleneck. Several automatic reconstruction algorithms exist, but most focus on single neuron images. In this paper, we present a probabilistic…

    Submitted 27 January, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

  14. arXiv:2104.01532  [pdf, other]

    q-bio.NC cs.MS math.DG

    Fitting Splines to Axonal Arbors Quantifies Relationship between Branch Order and Geometry

    Authors: Thomas L. Athey, Jacopo Teneggi, Joshua T. Vogelstein, Daniel Tward, Ulrich Mueller, Michael I. Miller

    Abstract: Neuromorphology is crucial to identifying neuronal subtypes and understanding learning. It is also implicated in neurological disease. However, standard morphological analysis focuses on macroscopic features such as branching frequency and connectivity between regions, and often neglects the internal geometry of neurons. In this work, we treat neuron trace points as a sampling of differentiable cu…

    Submitted 5 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

    Journal ref: Front. Neuroinform. 15 (2021)

  15. arXiv:2104.00641  [pdf]

    stat.ML cs.LG

    Dynamic Silos: Increased Modularity in Intra-organizational Communication Networks during the Covid-19 Pandemic

    Authors: Tiona Zuzul, Emily Cox Pahnke, Jonathan Larson, Patrick Bourke, Nicholas Caurvina, Neha Parikh Shah, Fereshteh Amini, Jeffrey Weston, Youngser Park, Joshua Vogelstein, Christopher White, Carey E. Priebe

    Abstract: Workplace communications around the world were drastically altered by Covid-19, related work-from-home orders, and the rise of remote work. To understand these shifts, we analyzed aggregated, anonymized metadata from over 360 billion emails within 4,361 organizations worldwide. By comparing month-to-month and year-over-year metrics, we examined changes in network community structures over 24 month…

    Submitted 28 July, 2023; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: 48 pages, 15 figures

  16. arXiv:2011.06557  [pdf, other]

    stat.ML cs.LG stat.ME

    A partition-based similarity for classification distributions

    Authors: Hayden S. Helm, Ronak D. Mehta, Brandon Duderstadt, Weiwei Yang, Christoper M. White, Ali Geisa, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Herein we define a measure of similarity between classification distributions that is both principled from the perspective of statistical pattern recognition and useful from the perspective of machine learning practitioners. In particular, we propose a novel similarity on classification distributions, dubbed task similarity, that quantifies how an optimally-transformed optimal representation for a…

    Submitted 12 November, 2020; originally announced November 2020.

  17. arXiv:2011.05383  [pdf, other]

    cs.DC cs.LG

    PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment

    Authors: Meghana Madhyastha, Kunal Lillaney, James Browne, Joshua Vogelstein, Randal Burns

    Abstract: We present methods to serialize and deserialize tree ensembles that optimize inference latency when models are not already loaded into memory. This arises whenever models are larger than memory, but also systematically when models are deployed on low-resource devices, such as in the Internet of Things, or run as Web micro-services where resources are allocated on demand. Our packed serialized tree…

    Submitted 10 November, 2020; originally announced November 2020.

    ACM Class: I.5.5

  18. arXiv:2007.13843  [pdf, other]

    stat.ML cs.IR cs.LG cs.SI

    Robust Similarity and Distance Learning via Decision Forests

    Authors: Tyler M. Tomita, Joshua T. Vogelstein

    Abstract: Canonical distances such as Euclidean distance often fail to capture the appropriate relationships between items, subsequently leading to subpar inference and prediction. Many algorithms have been proposed for automated learning of suitable distances, most of which employ linear methods to learn a global metric over the feature space. While such methods offer nice theoretical properties, interpret…

    Submitted 21 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Submitted to NeurIPS 2020

  19. arXiv:2005.11890  [pdf, other]

    stat.ML cs.LG stat.CO

    mvlearn: Multiview Machine Learning in Python

    Authors: Ronan Perry, Gavin Mischler, Richard Guo, Theodore Lee, Alexander Chang, Arman Koul, Cameron Franz, Hugo Richard, Iain Carmichael, Pierre Ablin, Alexandre Gramfort, Joshua T. Vogelstein

    Abstract: As data are generated more and more from multiple disparate sources, multiview data sets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that…

    Submitted 25 May, 2021; v1 submitted 24 May, 2020; originally announced May 2020.

    Comments: 6 pages, 2 figures, 1 table

  20. arXiv:2005.10700  [pdf, other]

    cs.LG cs.IR stat.ML

    Distance-based Positive and Unlabeled Learning for Ranking

    Authors: Hayden S. Helm, Amitabh Basu, Avanti Athreya, Youngser Park, Joshua T. Vogelstein, Carey E. Priebe, Michael Winding, Marta Zlatic, Albert Cardona, Patrick Bourke, Jonathan Larson, Marah Abdin, Piali Choudhury, Weiwei Yang, Christopher W. White

    Abstract: Learning to rank -- producing a ranked list of items specific to a query and with respect to a set of supervisory items -- is a problem of general interest. The setting we consider is one in which no analytic description of what constitutes a good ranking is available. Instead, we have a collection of representations and supervisory information consisting of a (target item, interesting items set)…

    Submitted 28 September, 2022; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 21 pages, 5 figures

  21. arXiv:2004.12926  [pdf]

    cs.CY cs.AI q-bio.NC

    A New Age of Computing and the Brain

    Authors: Polina Golland, Jack Gallant, Greg Hager, Hanspeter Pfister, Christos Papadimitriou, Stefan Schaal, Joshua T. Vogelstein

    Abstract: The histories of computer science and brain sciences are intertwined. In his unfinished manuscript "The Computer and the Brain," von Neumann debates whether or not the brain can be thought of as a computing machine and identifies some of the similarities and differences between natural and artificial computation. Turing, in his 1950 article in Mind, argues that computing devices could ultimately emu…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: A Computing Community Consortium (CCC) workshop report, 24 pages

    Report number: ccc2014report_5

  22. arXiv:2004.12908  [pdf, other]

    cs.AI cs.LG stat.ML

    A Simple Lifelong Learning Approach

    Authors: Joshua T. Vogelstein, Jayanta Dey, Hayden S. Helm, Will LeVine, Ronak D. Mehta, Tyler M. Tomita, Haoyin Xu, Ali Geisa, Qingyang Wang, Gido M. van de Ven, Chenyu Gao, Weiwei Yang, Bryan Tower, Jonathan Larson, Christopher M. White, Carey E. Priebe

    Abstract: In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain perf…

    Submitted 11 June, 2024; v1 submitted 27 April, 2020; originally announced April 2020.

  23. arXiv:1912.12150  [pdf, other]

    stat.ML cs.LG math.ST stat.ME

    The Chi-Square Test of Distance Correlation

    Authors: Cencheng Shen, Sambit Panda, Joshua T. Vogelstein

    Abstract: Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if independence holds, making it an ideal choice to discover any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depe…

    Submitted 14 May, 2021; v1 submitted 27 December, 2019; originally announced December 2019.

    Comments: 21 pages, 4 figures, 1 table

    Journal ref: Journal of Computational and Graphical Statistics 31(1), 254-262, 2022

  24. arXiv:1910.08883  [pdf, other]

    stat.ML cs.LG

    High-dimensional and universally consistent k-sample tests

    Authors: Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The k-sample testing problem involves determining whether $k$ groups of data points are each drawn from the same distribution. The standard method for k-sample testing in biomedicine is multivariate analysis of variance (MANOVA), even though it depends on strong, and often unsuitable, parametric assumptions. Moreover, independence testing and k-sample testing are closely related, and several univ…

    Submitted 11 October, 2023; v1 submitted 19 October, 2019; originally announced October 2019.

  25. arXiv:1909.11799  [pdf, other]

    cs.LG stat.ML

    Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks

    Authors: Adam Li, Ronan Perry, Chester Huynh, Tyler M. Tomita, Ronak Mehta, Jesus Arroyo, Jesse Patsolic, Benjamin Falk, Joshua T. Vogelstein

    Abstract: Decision forests (Forests), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, Forests dominate other methods in tabular data, that is, when the feature space is unstructured, so that the signal is invariant to a permutation of the feature indices. However, in structur…

    Submitted 5 September, 2022; v1 submitted 25 September, 2019; originally announced September 2019.

    Comments: Updated manuscript based on review at SIMODS

    MSC Class: 68T05

  26. arXiv:1909.02688  [pdf, other]

    cs.LG stat.ML

    AutoGMM: Automatic and Hierarchical Gaussian Mixture Modeling in Python

    Authors: Thomas L. Athey, Tingshan Liu, Benjamin D. Pedigo, Joshua T. Vogelstein

    Abstract: Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these p…

    Submitted 12 August, 2021; v1 submitted 5 September, 2019; originally announced September 2019.

  27. arXiv:1908.06486  [pdf, other]

    stat.ML cs.LG stat.ME

    Independence Testing for Temporal Data

    Authors: Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, Joshua T. Vogelstein

    Abstract: Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been prop…

    Submitted 27 May, 2024; v1 submitted 18 August, 2019; originally announced August 2019.

    Comments: 19 pages main + 6 pages appendix

    Journal ref: Transactions on Machine Learning Research, 2024

  28. arXiv:1907.03335  [pdf, other]

    cs.DC cs.DB

    Graphyti: A Semi-External Memory Graph Library for FlashGraph

    Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

    Abstract: Graph datasets exceed the in-memory capacity of most standalone machines. Traditionally, graph frameworks have overcome memory limitations through scale-out, distributed computing. Emerging frameworks avoid the network bottleneck of distributed data with Semi-External Memory (SEM) that uses a single multicore node and operates on graphs larger than memory. In SEM, $\mathcal{O}(m)$ data resides on…

    Submitted 7 July, 2019; originally announced July 2019.

  29. arXiv:1907.02844  [pdf, other]

    stat.ML cs.IR cs.LG stat.ME

    Geodesic Learning via Unsupervised Decision Forests

    Authors: Meghana Madhyastha, Percy Li, James Browne, Veronika Strnadova-Neeley, Carey E. Priebe, Randal Burns, Joshua T. Vogelstein

    Abstract: Geodesic distance is the shortest path between two points in a Riemannian manifold. Manifold learning algorithms, such as Isomap, seek to learn a manifold that preserves geodesic distances. However, such methods operate on the ambient dimensionality, and are therefore fragile to noise dimensions. We developed an unsupervised random forest method (URerF) to approximately learn geodesic distances in…

    Submitted 5 July, 2019; originally announced July 2019.

  30. arXiv:1907.02088  [pdf, other]

    stat.CO cs.MS stat.ME stat.ML

    hyppo: A Multivariate Hypothesis Testing Python Package

    Authors: Sambit Panda, Satish Palaniappan, Junhao Xiong, Eric W. Bridgeford, Ronak Mehta, Cencheng Shen, Joshua T. Vogelstein

    Abstract: We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state-of-the-art multivariate testing procedures. The package is easy to use and is flexible eno…

    Submitted 1 April, 2021; v1 submitted 3 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure

  31. arXiv:1907.00325  [pdf, other]

    cs.LG stat.ML

    Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities

    Authors: Ronan Perry, Ronak Mehta, Richard Guo, Eva Yezerets, Jesús Arroyo, Mike Powell, Hayden Helm, Cencheng Shen, Joshua T. Vogelstein

    Abstract: Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Current widely used approaches for computing such quantities rely on nearest neighbor methods and exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when…

    Submitted 5 October, 2021; v1 submitted 30 June, 2019; originally announced July 2019.

  32. arXiv:1906.10026  [pdf, other]

    stat.ME cs.SI math.ST

    Inference for multiple heterogeneous networks with a common invariant subspace

    Authors: Jesús Arroyo, Avanti Athreya, Joshua Cape, Guodong Chen, Carey E. Priebe, Joshua T. Vogelstein

    Abstract: The development of models for multiple heterogeneous network data is of critical importance both in statistical network theory and across multiple application domains. Although single-graph inference is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to…

    Submitted 22 August, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

  33. arXiv:1906.02881  [pdf, other]

    stat.ML cs.LG cs.SI stat.ME

    Vertex Classification on Weighted Networks

    Authors: Hayden Helm, Joshua Vogelstein, Carey Priebe

    Abstract: This paper proposes a discrimination technique for vertices in a weighted network. We assume that the edge weights and adjacencies in the network are conditionally independent and that both sources of information encode class membership information. In particular, we introduce an edge weight distribution matrix to the standard K-Block Stochastic Block Model to model weighted networks. This allows u…

    Submitted 6 June, 2019; originally announced June 2019.

    Comments: 11 pages

  34. arXiv:1904.05329  [pdf, other]

    cs.SI stat.ML stat.OT

    GraSPy: Graph Statistics in Python

    Authors: Jaewon Chung, Benjamin D. Pedigo, Eric W. Bridgeford, Bijan K. Varjavand, Hayden S. Helm, Joshua T. Vogelstein

    Abstract: We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from the Python Package Index (PyPI), and is released under the Apache 2.0 open-source license. The…

    Submitted 14 August, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

    Journal ref: Journal of Machine Learning Research 20.158 (2019): 1-7

  35. arXiv:1902.09527  [pdf, other]

    cs.DC

    clusterNOR: A NUMA-Optimized Clustering Framework

    Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

    Abstract: Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffers from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximize independent, asynchronous computation. We elimina…

    Submitted 17 January, 2021; v1 submitted 24 February, 2019; originally announced February 2019.

    Comments: arXiv admin note: Journal version of arXiv:1606.08905

  36. arXiv:1812.00029  [pdf, other]

    stat.ML cs.LG

    Learning Interpretable Characteristic Kernels via Decision Forests

    Authors: Sambit Panda, Cencheng Shen, Joshua T. Vogelstein

    Abstract: Decision forests are widely used for classification and regression tasks. A lesser known property of tree-based methods is that one can construct a proximity matrix from the tree(s), and these proximity matrices are induced kernels. While there has been extensive research on the applications and properties of kernels, there is relatively little research on kernels induced by decision forests. We c…

    Submitted 28 September, 2023; v1 submitted 30 November, 2018; originally announced December 2018.

  37. On a 'Two Truths' Phenomenon in Spectral Graph Clustering

    Authors: Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Minh Tang, Avanti Athreya, Joshua Cape, Eric Bridgeford

    Abstract: Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical resu…

    Submitted 11 February, 2019; v1 submitted 23 August, 2018; originally announced August 2018.

    Journal ref: PNAS 116 (2019) 5995-6000

  38. arXiv:1806.07300  [pdf, other]

    cs.PF cs.DC

    Forest Packing: Fast, Parallel Decision Forests

    Authors: James Browne, Tyler M. Tomita, Disa Mhembere, Randal Burns, Joshua T. Vogelstein

    Abstract: Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the travers…

    Submitted 19 June, 2018; originally announced June 2018.

  39. The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing

    Authors: Cencheng Shen, Joshua T. Vogelstein

    Abstract: Distance-based tests, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community. Kernel-based tests, developed from "kernel mean embeddings", are leading methods for two-sample and independence tests from the machine learning community. A fixed-point transformation was previously proposed to connect the distance methods and kernel meth…

    Submitted 14 September, 2020; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: 24 pages main + 7 pages appendix, 3 figures

    Journal ref: AStA Advances in Statistical Analysis 105(3), 385-403, 2021

  40. arXiv:1710.09859  [pdf, other]

    stat.ML cs.CV cs.DS cs.LG math.ST

    Kernel k-Groups via Hartigan's Method

    Authors: Guilherme França, Maria L. Rizzo, Joshua T. Vogelstein

    Abstract: Energy statistics was proposed by Székely in the 80's, inspired by Newton's gravitational potential in classical mechanics, and it provides a model-free hypothesis test for equality of distributions. In its original form, energy statistics was formulated in Euclidean spaces. More recently, it was generalized to metric spaces of negative type. In this paper, we consider a formulation for the clust…

    Submitted 11 June, 2020; v1 submitted 26 October, 2017; originally announced October 2017.

    Comments: several improvements; connections with community detection and stochastic block model. Matches published version

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

  41. arXiv:1703.03862  [pdf, other]

    stat.AP cs.LG stat.ML

    Joint Embedding of Graphs

    Authors: Shangsi Wang, Jesús Arroyo, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric…

    Submitted 17 October, 2019; v1 submitted 10 March, 2017; originally announced March 2017.

  42. arXiv:1612.00356  [pdf, other]

    cs.CV

    A Large Deformation Diffeomorphic Approach to Registration of CLARITY Images via Mutual Information

    Authors: Kwame S. Kutten, Nicolas Charon, Michael I. Miller, J. T. Ratnanather, Jordan Matelsky, Alexander D. Baden, Kunal Lillaney, Karl Deisseroth, Li Ye, Joshua T. Vogelstein

    Abstract: CLARITY is a method for converting biological tissues into translucent and porous hydrogel-tissue hybrids. This facilitates interrogation with light sheet microscopy and penetration of molecular probes while avoiding physical slicing. In this work, we develop a pipeline for registering CLARIfied mouse brains to an annotated brain atlas. Due to the novelty of this microscopy technique it is impract…

    Submitted 11 August, 2017; v1 submitted 1 December, 2016; originally announced December 2016.

  43. Probabilistic Fluorescence-Based Synapse Detection

    Authors: Anish K. Simhal, Cecilia Aguerrebere, Forrest Collman, Joshua T. Vogelstein, Kristina D. Micheva, Richard J. Weinberg, Stephen J. Smith, Guillermo Sapiro

    Abstract: Brain function results from communication between neurons connected by complex synaptic networks. Synapses are themselves highly complex and diverse signaling machines, containing protein products of hundreds of different genes, some in hundreds of copies, arranged in precise lattice at each individual synapse. Synapses are fundamental not only to synaptic network function but also to network deve…

    Submitted 16 November, 2016; originally announced November 2016.

    Comments: Currently awaiting peer review

  44. arXiv:1606.08905  [pdf, other]

    cs.DC

    knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library

    Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

    Abstract: k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The k-means NUMA Optimized Routine…

    Submitted 24 June, 2017; v1 submitted 28 June, 2016; originally announced June 2016.

  45. arXiv:1605.02060  [pdf, other]

    q-bio.QM cs.CV

    Deformably Registering and Annotating Whole CLARITY Brains to an Atlas via Masked LDDMM

    Authors: Kwame S. Kutten, Joshua T. Vogelstein, Nicolas Charon, Li Ye, Karl Deisseroth, Michael I. Miller

    Abstract: The CLARITY method renders brains optically transparent to enable high-resolution imaging in the structurally intact brain. Anatomically annotating CLARITY brains is necessary for discovering which regions contain signals of interest. Manually annotating whole-brain, terabyte CLARITY images is difficult, time-consuming, subjective, and error-prone. Automatically registering CLARITY images to a pre…

    Submitted 6 May, 2016; originally announced May 2016.

    Journal ref: Proc. SPIE 9896 Optics, Photonics and Digital Technologies for Imaging Applications IV (2016)

  46. arXiv:1604.06414  [pdf, other]

    cs.DC

    FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs

    Authors: Da Zheng, Disa Mhembere, Joshua T. Vogelstein, Carey E. Priebe, Randal Burns

    Abstract: R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory c…

    Submitted 18 May, 2017; v1 submitted 21 April, 2016; originally announced April 2016.

  47. arXiv:1604.03629  [pdf, other]

    q-bio.QM cs.CV

    Quantifying mesoscale neuroanatomy using X-ray microtomography

    Authors: Eva L. Dyer, William Gray Roncal, Hugo L. Fernandes, Doga Gürsoy, Vincent De Andrade, Rafael Vescovi, Kamel Fezzaa, Xianghui Xiao, Joshua T. Vogelstein, Chris Jacobsen, Konrad P. Körding, Narayanan Kasthuri

    Abstract: Methods for resolving the 3D microstructure of the brain typically start by thinly slicing and staining the brain, and then imaging each individual section with visible light photons or electrons. In contrast, X-rays can be used to image thick samples, providing a rapid approach for producing large 3D brain maps without sectioning. Here we demonstrate the use of synchrotron X-ray microtomography (…

    Submitted 26 July, 2016; v1 submitted 12 April, 2016; originally announced April 2016.

    Comments: 28 pages, 9 figures

  48. Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs

    Authors: Da Zheng, Disa Mhembere, Vince Lyzinski, Joshua Vogelstein, Carey E. Priebe, Randal Burns

    Abstract: Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memor…

    Submitted 14 October, 2016; v1 submitted 9 February, 2016; originally announced February 2016.

    Comments: published in IEEE Transactions on Parallel and Distributed Systems

  49. arXiv:1602.01421  [pdf, other]

    cs.DC cs.MS

    An SSD-based eigensolver for spectral analysis on billion-node graphs

    Authors: Da Zheng, Randal Burns, Joshua Vogelstein, Carey E. Priebe, Alexander S. Szalay

    Abstract: Many eigensolvers such as ARPACK and Anasazi have been developed to compute eigenvalues of a large sparse matrix. These eigensolvers are limited by the capacity of RAM. They run in memory of a single machine for smaller eigenvalue problems and require the distributed memory for larger problems. In contrast, we develop an SSD-based eigensolver framework called FlashEigen, which extends Anasazi ei…

    Submitted 26 February, 2016; v1 submitted 3 February, 2016; originally announced February 2016.

  50. Fast Neuromimetic Object Recognition using FPGA Outperforms GPU Implementations

    Authors: Garrick Orchard, Jacob G. Martin, R. Jacob Vogelstein, Ralph Etienne-Cummings

    Abstract: Recognition of objects in still images has traditionally been regarded as a difficult computational problem. Although modern automated methods for visual object recognition have achieved steadily increasing recognition accuracy, even the most advanced computational vision approaches are unable to obtain performance equal to that of humans. This has led to the creation of many biologically-inspired…

    Submitted 31 October, 2015; originally announced November 2015.

    Comments: 14 pages, 8 figures, 5 tables

    Journal ref: Neural Networks and Learning Systems, IEEE Transactions on, vol.24, no.8, pp.1239-1252, 2013