Search | arXiv e-print repository

doi 10.1016/j.patrec.2024.06.011

Discovering the Signal Subgraph: An Iterative Screening Approach on Graphs

Authors: Cencheng Shen, Shangsi Wang, Alexandra Badea, Carey E. Priebe, Joshua T. Vogelstein

Abstract: Supervised learning on graphs is a challenging task due to the high dimensionality and inherent structural dependencies in the data, where each edge depends on a pair of vertices. Existing conventional methods are designed for standard Euclidean data and do not account for the structural information inherent in graphs. In this paper, we propose an iterative vertex screening method to achieve dimension reduction across multiple graph datasets with matched vertex sets and associated graph attributes. Our method aims to identify a signal subgraph to provide a more concise representation of the full graphs, potentially benefiting subsequent vertex classification tasks. The method screens the rows and columns of the adjacency matrix concurrently and stops when the resulting distance correlation is maximized. We establish the theoretical foundation of our method by proving that it estimates the true signal subgraph with high probability. Additionally, we establish the convergence rate of classification error under the Erdos-Renyi random graph model and prove that the subsequent classification can be asymptotically optimal, outperforming the entire graph under high-dimensional conditions. Our method is evaluated on various simulated datasets and real-world human and murine graphs derived from functional and structural magnetic resonance images. The results demonstrate its excellent performance in estimating the ground-truth signal subgraph and achieving superior classification accuracy.

Submitted 21 June, 2024; v1 submitted 23 January, 2018; originally announced January 2018.

Comments: 8 pages main + 3 pages appendix

Journal ref: Pattern Recognition Letters 184, 97-102, 2024

arXiv:1710.09859 [pdf, other]

doi 10.1109/TPAMI.2020.2998120

Kernel k-Groups via Hartigan's Method

Authors: Guilherme França, Maria L. Rizzo, Joshua T. Vogelstein

Abstract: Energy statistics was proposed by Székely in the 80's inspired by Newton's gravitational potential in classical mechanics and it provides a model-free hypothesis test for equality of distributions. In its original form, energy statistics was formulated in Euclidean spaces. More recently, it was generalized to metric spaces of negative type. In this paper, we consider a formulation for the clustering problem using a weighted version of energy statistics in spaces of negative type. We show that this approach leads to a quadratically constrained quadratic program in the associated kernel space, establishing connections with graph partitioning problems and kernel methods in machine learning. To find local solutions of such an optimization problem, we propose kernel k-groups, which is an extension of Hartigan's method to kernel spaces. Kernel k-groups is cheaper than spectral clustering and has the same computational cost as kernel k-means (which is based on Lloyd's heuristic) but our numerical results show an improved performance, especially in higher dimensions. Moreover, we verify the efficiency of kernel k-groups in community detection in sparse stochastic block models which has fascinating applications in several areas of science.

Submitted 11 June, 2020; v1 submitted 26 October, 2017; originally announced October 2017.

Comments: several improvements; connections with community detection and stochastic block model. Matches published version

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

arXiv:1710.09768 [pdf, other]

doi 10.1080/01621459.2018.1543125

From Distance Correlation to Multiscale Graph Correlation

Authors: Cencheng Shen, Carey E. Priebe, Joshua T. Vogelstein

Abstract: Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation (a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments) to the Multiscale Graph Correlation (MGC). By utilizing the characteristic functions and incorporating the nearest neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound Sample MGC and allows a number of desirable properties to be proved, including the universal consistency, convergence and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better performance in general dependencies, compared to distance correlation and other popular methods.

Submitted 30 September, 2018; v1 submitted 26 October, 2017; originally announced October 2017.

Comments: 39 pages + Appendix 22 pages, 6 figures

Journal ref: Journal of the American Statistical Association 115(529), 280-291, 2020
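
Illustrative note: the entry above builds on distance correlation. As a minimal sketch of that baseline statistic only (not MGC, and not the authors' implementation), the following computes the standard biased sample distance correlation from double-centered pairwise Euclidean distance matrices. The function names are hypothetical and chosen for this example.

```python
import numpy as np


def _centered_distances(x):
    """Pairwise Euclidean distance matrix of the samples, double-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of dimension 1
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Subtract row and column means, then add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Biased sample distance correlation between paired samples x and y."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                    # squared sample distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()   # product of squared distance variances
    if dvar2 == 0:
        return 0.0
    # Guard against tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 101)
    print(distance_correlation(x, 2 * x + 1))  # exact linear relationship
    print(distance_correlation(x, x ** 2))     # nonlinear, uncorrelated in the Pearson sense
```

In this toy usage the quadratic relationship has essentially zero Pearson correlation, yet its distance correlation is clearly positive, which is the kind of general dependence the abstract refers to.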

arXiv:1709.05454 [pdf, other]

Statistical inference on random dot product graphs: a survey

Authors: Avanti Athreya, Donniell E. Fishkind, Keith Levin, Vince Lyzinski, Youngser Park, Yichen Qin, Daniel L. Sussman, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

Abstract: The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graphs, a paradigm centered on spectral embeddings of adjacency and Laplacian matrices. We examine the analogues, in graph inference, of several canonical tenets of classical Euclidean inference: in particular, we summarize a body of existing results on the consistency and asymptotic normality of the adjacency and Laplacian spectral embeddings, and the role these spectral embeddings can play in the construction of single- and multi-sample hypothesis tests for graph data. We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. We outline requisite background and current open problems in spectral graph inference.

Submitted 16 September, 2017; originally announced September 2017.

Comments: An expository survey paper on a comprehensive paradigm for inference for random dot product graphs, centered on graph adjacency and Laplacian spectral embeddings. Paper outlines requisite background; summarizes theory, methodology, and applications from previous and ongoing work; and closes with a discussion of several open problems

MSC Class: 62FXX; 62GXX; 62HXX; 05CXX

Journal ref: Journal of Machine Learning Research, 2018
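
Illustrative note: the survey above centers on spectral embeddings of adjacency matrices. The following is a rough sketch, under stated assumptions, of an adjacency spectral embedding for a symmetric matrix: keep the d eigenvalues of largest magnitude and scale the matching eigenvectors by the square roots of their absolute values. The function name and the requirement that the caller supply d are assumptions for this example, not the authors' code.

```python
import numpy as np


def adjacency_spectral_embedding(adj, d):
    """Embed the vertices of a symmetric adjacency matrix into d dimensions."""
    adj = np.asarray(adj, dtype=float)
    vals, vecs = np.linalg.eigh(adj)            # eigendecomposition of a symmetric matrix
    top = np.argsort(np.abs(vals))[::-1][:d]    # indices of the d largest-magnitude eigenvalues
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


if __name__ == "__main__":
    # Hypothetical rank-one edge-probability matrix P = x x^T from latent positions x.
    x = np.array([0.5, 0.5, 0.8, 0.8])
    p = np.outer(x, x)
    emb = adjacency_spectral_embedding(p, 1)
    # The embedding recovers P up to sign: emb @ emb.T is close to P.
    print(np.allclose(emb @ emb.T, p))
```

Applied to an observed 0/1 adjacency matrix rather than to P itself, the same function gives noisy estimates of the latent positions, identifiable only up to an orthogonal transformation.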

arXiv:1709.01233 [pdf, other]

Supervised Dimensionality Reduction for Big Data

Authors: Joshua T. Vogelstein, Eric Bridgeford, Minh Tang, Da Zheng, Christopher Douville, Randal Burns, Mauro Maggioni

Abstract: To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach, XOX, to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, "Linear Optimal Low-rank" projection (LOL), incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that LOL and its generalizations in the XOX framework lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of >150 million features, and several genomics datasets with >500,000 features, LOL outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.

Submitted 23 January, 2021; v1 submitted 5 September, 2017; originally announced September 2017.

Comments: 6 figures

arXiv:1707.03487 [pdf, other]

Robust Estimation from Multiple Graphs under Gross Error Contamination

Authors: Runze Tang, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

Abstract: Estimation of graph parameters based on a collection of graphs is essential for a wide range of graph inference tasks. In practice, weighted graphs are generally observed with edge contamination. We consider a weighted latent position graph model contaminated via an edge weight gross error model and propose an estimation methodology based on robust Lq estimation followed by low-rank adjacency spectral decomposition. We demonstrate that, under appropriate conditions, our estimator both maintains Lq robustness and wins the bias-variance tradeoff by exploiting low-rank graph structure. We illustrate the improvement offered by our estimator via both simulations and a human connectome data experiment.

Submitted 11 July, 2017; originally announced July 2017.

arXiv:1705.03297 [pdf, other]

Semiparametric spectral modeling of the Drosophila connectome

Authors: Carey E. Priebe, Youngser Park, Minh Tang, Avanti Athreya, Vince Lyzinski, Joshua T. Vogelstein, Yichen Qin, Ben Cocanougher, Katharina Eichler, Marta Zlatic, Albert Cardona

Abstract: We present semiparametric spectral modeling of the complete larval Drosophila mushroom body connectome. Motivated by a thorough exploratory data analysis of the network via Gaussian mixture modeling (GMM) in the adjacency spectral embedding (ASE) representation space, we introduce the latent structure model (LSM) for network modeling and inference. LSM is a generalization of the stochastic block model (SBM) and a special case of the random dot product graph (RDPG) latent position model, and is amenable to semiparametric GMM in the ASE representation space. The resulting connectome code derived via semiparametric GMM composed with ASE captures latent connectome structure and elucidates biologically relevant neuronal properties.

Submitted 9 May, 2017; originally announced May 2017.

arXiv:1703.10136 [pdf, other]

doi 10.1093/biomet/asz045

Network Dependence Testing via Diffusion Maps and Distance-Based Correlations

Authors: Youjin Lee, Cencheng Shen, Carey E. Priebe, Joshua T. Vogelstein

Abstract: Deciphering the associations between network connectivity and nodal attributes is one of the core problems in network science. The dependency structure and high-dimensionality of networks pose unique challenges to traditional dependency tests in terms of theoretical guarantees and empirical performance. We propose an approach to test network dependence via diffusion maps and distance-based correlations. We prove that the new method yields a consistent test statistic under mild distributional assumptions on the graph structure, and demonstrate that it is able to efficiently identify the most informative graph embedding with respect to the diffusion time. The methodology is illustrated on both simulated and real data.

Submitted 14 February, 2019; v1 submitted 29 March, 2017; originally announced March 2017.

Journal ref: Biometrika 106(4), 857-873, 2019

arXiv:1703.03862 [pdf, other]

doi 10.1109/TPAMI.2019.2948619

Joint Embedding of Graphs

Authors: Shangsi Wang, Jesús Arroyo, Joshua T. Vogelstein, Carey E. Priebe

Abstract: Feature extraction and dimension reduction for networks is critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric matrices and projects adjacency matrices of graphs into this subspace. The projection coefficients can be treated as features of the graphs, while the embedding components can represent vertex features. We also propose a random graph model for multiple graphs that generalizes other classical models for graphs. We show through theory and numerical experiments that under the model, the joint embedding method produces estimates of parameters with small errors. Via simulation experiments, we demonstrate that the joint embedding method produces features which lead to state of the art performance in classifying graphs. Applying the joint embedding method to human brain graphs, we find it extracts interpretable features with good prediction accuracy in different tasks.

Submitted 17 October, 2019; v1 submitted 10 March, 2017; originally announced March 2017.

arXiv:1612.00356 [pdf, other]

A Large Deformation Diffeomorphic Approach to Registration of CLARITY Images via Mutual Information

Authors: Kwame S. Kutten, Nicolas Charon, Michael I. Miller, J. T. Ratnanather, Jordan Matelsky, Alexander D. Baden, Kunal Lillaney, Karl Deisseroth, Li Ye, Joshua T. Vogelstein

Abstract: CLARITY is a method for converting biological tissues into translucent and porous hydrogel-tissue hybrids. This facilitates interrogation with light sheet microscopy and penetration of molecular probes while avoiding physical slicing. In this work, we develop a pipeline for registering CLARIfied mouse brains to an annotated brain atlas. Due to the novelty of this microscopy technique it is impractical to use absolute intensity values to align these images to existing standard atlases. Thus we adopt a large deformation diffeomorphic approach for registering images via mutual information matching. Furthermore we show how a cascaded multi-resolution approach can improve registration quality while reducing algorithm run time. As acquired image volumes were over a terabyte in size, they were far too large for work on personal computers. Therefore the NeuroData computational infrastructure was deployed for multi-resolution storage and visualization of these images and aligned annotations on the web.

Submitted 11 August, 2017; v1 submitted 1 December, 2016; originally announced December 2016.

arXiv:1611.05479 [pdf, other]

doi 10.1371/journal.pcbi.1005493

Probabilistic Fluorescence-Based Synapse Detection

Authors: Anish K. Simhal, Cecilia Aguerrebere, Forrest Collman, Joshua T. Vogelstein, Kristina D. Micheva, Richard J. Weinberg, Stephen J. Smith, Guillermo Sapiro

Abstract: Brain function results from communication between neurons connected by complex synaptic networks. Synapses are themselves highly complex and diverse signaling machines, containing protein products of hundreds of different genes, some in hundreds of copies, arranged in precise lattice at each individual synapse. Synapses are fundamental not only to synaptic network function but also to network development, adaptation, and memory. In addition, abnormalities of synapse numbers or molecular components are implicated in most mental and neurological disorders. Despite their obvious importance, mammalian synapse populations have so far resisted detailed quantitative study. In human brains and most animal nervous systems, synapses are very small and very densely packed: there are approximately 1 billion synapses per cubic millimeter of human cortex. This volumetric density poses very substantial challenges to proteometric analysis at the critical level of the individual synapse. The present work describes new probabilistic image analysis methods for single-synapse analysis of synapse populations in both animal and human brains.

Submitted 16 November, 2016; originally announced November 2016.

Comments: Currently awaiting peer review

arXiv:1610.08484 [pdf, other]

Science In the Cloud (SIC): A use case in MRI Connectomics

Authors: Gregory Kiar, Krzysztof J. Gorgolewski, Dean Kleissas, William Gray Roncal, Brian Litt, Brian Wandell, Russel A. Poldrack, Martin Wiener, R. Jacob Vogelstein, Randal Burns, Joshua T. Vogelstein

Abstract: Modern technologies are enabling scientists to collect extraordinary amounts of complex and sophisticated data across a huge range of scales like never before. With this onslaught of data, we can allow the focal point to shift towards answering the question of how we can analyze and understand the massive amounts of data in front of us. Unfortunately, lack of standardized sharing mechanisms and practices often makes reproducing or extending scientific results very difficult. With the creation of data organization structures and tools which drastically improve code portability, we now have the opportunity to design such a framework for communicating extensible scientific discoveries. Our proposed solution leverages these existing technologies and standards, and provides an accessible and extensible model for reproducible research, called "science in the cloud" (sic). Exploiting scientific containers, cloud computing and cloud data services, we show the capability to launch a computer in the cloud and run a web service which enables intimate interaction with the tools and data presented. We hope this model will inspire the community to produce reproducible and, importantly, extensible results which will enable us to collectively accelerate the rate at which scientific breakthroughs are discovered, replicated, and extended.

Submitted 14 February, 2017; v1 submitted 26 October, 2016; originally announced October 2016.

Comments: 13 pages, 5 figures, 4 tables, 2 appendices

arXiv:1609.05148 [pdf, other]

doi 10.7554/eLife.41690

Discovering and Deciphering Relationships Across Disparate Data Modalities

Authors: Joshua T. Vogelstein, Eric Bridgeford, Qing Wang, Carey E. Priebe, Mauro Maggioni, Cencheng Shen

Abstract: Understanding the relationships between different properties of data, such as whether a connectome or genome has information about disease status, is becoming increasingly important in modern biological datasets. While existing approaches can test whether two properties are related, they often require unfeasibly large sample sizes in real data scenarios, and do not provide any insight into how or why the procedure reached its decision. Our approach, "Multiscale Graph Correlation" (MGC), is a dependence test that juxtaposes previously disparate data science techniques, including k-nearest neighbors, kernel methods (such as support vector machines), and multiscale analysis (such as wavelets). Other methods typically require double or triple the number of samples to achieve the same statistical power as MGC in a benchmark suite including high-dimensional and nonlinear relationships, spanning polynomial (linear, quadratic, cubic), trigonometric (sinusoidal, circular, ellipsoidal, spiral), geometric (square, diamond, W-shape), and other functions, with dimensionality ranging from 1 to 1000. Moreover, MGC uniquely provides a simple and elegant characterization of the potentially complex latent geometry underlying the relationship, providing insight while maintaining computational efficiency. In several real data applications, including brain imaging and cancer genetics, MGC is the only method that can both detect the presence of a dependency and provide specific guidance for the next experiment and/or analysis to conduct.

Submitted 6 December, 2018; v1 submitted 16 September, 2016; originally announced September 2016.

Journal ref: eLife 8, e41690, 2019

arXiv:1609.01672 [pdf, other]

Connectome Smoothing via Low-rank Approximations

Authors: Runze Tang, Michael Ketcha, Alexandra Badea, Evan D. Calabrese, Daniel S. Margulies, Joshua T. Vogelstein, Carey E. Priebe, Daniel L. Sussman

Abstract: In statistical connectomics, the quantitative study of brain networks, estimating the mean of a population of graphs based on a sample is a core problem. Often, this problem is especially difficult because the sample or cohort size is relatively small, sometimes even a single subject. While using the element-wise sample mean of the adjacency matrices is a common approach, this method does not exploit any underlying structural properties of the graphs. We propose using a low-rank method which incorporates tools for dimension selection and diagonal augmentation to smooth the estimates and improve performance over the naive methodology for small sample sizes. Theoretical results for the stochastic blockmodel show that this method offers major improvements when there are many vertices. Similarly, we demonstrate that the low-rank methods outperform the standard sample mean for a variety of independent edge distributions as well as human connectome data derived from magnetic resonance imaging, especially when sample sizes are small. Moreover, the low-rank methods yield "eigen-connectomes", which correlate with the lobe-structure of the human brain and superstructures of the mouse brain. These results indicate that low-rank methods are an important part of the tool box for researchers studying populations of graphs in general, and statistical connectomics in particular.

Submitted 6 December, 2018; v1 submitted 6 September, 2016; originally announced September 2016.

Comments: 43 pages, 12 figures

arXiv:1608.06548 [pdf]

Grand Challenges for Global Brain Sciences

Authors: Joshua T. Vogelstein, Katrin Amunts, Andreas Andreou, Dora Angelaki, Giorgio Ascoli, Cori Bargmann, Randal Burns, Corrado Cali, Frances Chance, Miyoung Chun, George Church, Hollis Cline, Todd Coleman, Stephanie de La Rochefoucauld, Winfried Denk, Ana Belen Elgoyhen, Ralph Etienne Cummings, Alan Evans, Kenneth Harris, Michael Hausser, Sean Hill, Samuel Inverso, Chad Jackson, Viren Jain, Rob Kass , et al. (37 additional authors not shown)

Abstract: The next grand challenges for society and science are in the brain sciences. A collection of 60+ scientists from around the world, together with 10+ observers from national, private, and foundations, spent two days together discussing the top challenges that we could solve as a global community in the next decade. We eventually settled on three challenges, spanning anatomy, physiology, and medicine. Addressing all three challenges requires novel computational infrastructure. The group proposed the advent of The International Brain Station (TIBS), to address these challenges, and launch brain sciences to the next level of understanding.

Submitted 27 October, 2016; v1 submitted 23 August, 2016; originally announced August 2016.

Comments: 6 pages

arXiv:1606.08905 [pdf, other]

knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library

Authors: Disa Mhembere, Da Zheng, Carey E. Priebe, Joshua T. Vogelstein, Randal Burns

Abstract: k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The \textit{k-means NUMA Optimized Routine} (\textsf{knor}) library has (i) in-memory (\textsf{knori}), (ii) distributed memory (\textsf{knord}), and (iii) semi-external memory (\textsf{knors}) modules that radically improve the performance of k-means for varying memory and hardware budgets. \textsf{knori} boosts performance for single machine datasets by an order of magnitude or more. \textsf{knors} improves the scalability of k-means on a memory budget using SSDs. \textsf{knors} scales to billions of points on a single machine, using a fraction of the resources that distributed in-memory systems require. \textsf{knord} retains \textsf{knori}'s performance characteristics, while scaling in-memory through distributed computation in the cloud. \textsf{knor} modifies Elkan's triangle inequality pruning algorithm such that we utilize it on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate \textsf{knor} outperforms distributed commercial products like H$_2$O, Turi (formerly Dato, GraphLab) and Spark's MLlib by more than an order of magnitude for datasets of $10^7$ to $10^9$ points.

Submitted 24 June, 2017; v1 submitted 28 June, 2016; originally announced June 2016.

arXiv:1605.02060 [pdf, other]

doi 10.1117/12.2227444

Deformably Registering and Annotating Whole CLARITY Brains to an Atlas via Masked LDDMM

Authors: Kwame S. Kutten, Joshua T. Vogelstein, Nicolas Charon, Li Ye, Karl Deisseroth, Michael I. Miller

Abstract: The CLARITY method renders brains optically transparent to enable high-resolution imaging in the structurally intact brain. Anatomically annotating CLARITY brains is necessary for discovering which regions contain signals of interest. Manually annotating whole-brain, terabyte CLARITY images is difficult, time-consuming, subjective, and error-prone. Automatically registering CLARITY images to a pre-annotated brain atlas offers a solution, but is difficult for several reasons. Removal of the brain from the skull and subsequent storage and processing cause variable non-rigid deformations, thus compounding inter-subject anatomical variability. Additionally, the signal in CLARITY images arises from various biochemical contrast agents which only sparsely label brain structures. This sparse labeling challenges the most commonly used registration algorithms that need to match image histogram statistics to the more densely labeled histological brain atlases. The standard method is a multiscale Mutual Information B-spline algorithm that dynamically generates an average template as an intermediate registration target. We determined that this method performs poorly when registering CLARITY brains to the Allen Institute's Mouse Reference Atlas (ARA), because the image histogram statistics are poorly matched. Therefore, we developed a method (Mask-LDDMM) for registering CLARITY images that automatically finds the brain boundary and learns the optimal deformation between the brain and atlas masks.
Using Mask-LDDMM without an average template provided better results than the standard approach when registering CLARITY brains to the ARA. The LDDMM pipelines developed here provide a fast automated way to anatomically annotate CLARITY images. Our code is available as open source software at http://NeuroData.io.

Submitted 6 May, 2016; originally announced May 2016.

Journal ref: Proc. SPIE 9896 Optics, Photonics and Digital Technologies for Imaging Applications IV (2016)

arXiv:1604.06414 [pdf, other]

FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs

Authors: Da Zheng, Disa Mhembere, Joshua T. Vogelstein, Carey E. Priebe, Randal Burns

Abstract: R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory capacity by utilizing solid-state drives (SSDs) automatically. It provides a small number of generalized operations (GenOps) upon which we reimplement a large number of matrix functions in the R base package. As such, FlashR parallelizes and scales existing R code with little/no modification. To reduce data movement between CPU and SSDs, FlashR evaluates matrix operations lazily, fuses operations at runtime, and uses cache-aware, two-level matrix partitioning. We evaluate FlashR on a variety of machine learning and statistics algorithms on inputs of up to four billion data points. FlashR out-of-core tracks closely the performance of FlashR in-memory. The R code for machine learning algorithms executed in FlashR outperforms the in-memory execution of H2O and Spark MLlib by a factor of 2-10 and outperforms Revolution R Open by more than an order of magnitude.

Submitted 18 May, 2017; v1 submitted 21 April, 2016; originally announced April 2016.

arXiv:1604.03629 [pdf, other]

Quantifying mesoscale neuroanatomy using X-ray microtomography

Authors: Eva L. Dyer, William Gray Roncal, Hugo L. Fernandes, Doga Gürsoy, Vincent De Andrade, Rafael Vescovi, Kamel Fezzaa, Xianghui Xiao, Joshua T. Vogelstein, Chris Jacobsen, Konrad P. Körding, Narayanan Kasthuri

Abstract: Methods for resolving the 3D microstructure of the brain typically start by thinly slicing and staining the brain, and then imaging each individual section with visible light photons or electrons. In contrast, X-rays can be used to image thick samples, providing a rapid approach for producing large 3D brain maps without sectioning. Here we demonstrate the use of synchrotron X-ray microtomography ($μ$CT) for producing mesoscale $(1~μm^3)$ resolution brain maps from millimeter-scale volumes of mouse brain. We introduce a pipeline for $μ$CT-based brain mapping that combines methods for sample preparation, imaging, automated segmentation of image volumes into cells and blood vessels, and statistical analysis of the resulting brain structures. Our results demonstrate that X-ray tomography promises rapid quantification of large brain volumes, complementing other brain mapping and connectomics efforts.

Submitted 26 July, 2016; v1 submitted 12 April, 2016; originally announced April 2016.

Comments: 28 pages, 9 figures

arXiv:1602.02864 [pdf, other]

doi 10.1109/TPDS.2016.2618791

Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs

Authors: Da Zheng, Disa Mhembere, Vince Lyzinski, Joshua Vogelstein, Carey E. Priebe, Randal Burns

Abstract: Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory. Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100% performance of IM-SpMM on graphs when the dense matrix has more than four columns; it achieves at least 65% performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks--PageRank, eigensolving, and non-negative matrix factorization--and show that our SEM implementations significantly advance the state of the art.

Submitted 14 October, 2016; v1 submitted 9 February, 2016; originally announced February 2016.

Comments: published in IEEE Transactions on Parallel and Distributed Systems

arXiv:1602.01421 [pdf, other]

An SSD-based eigensolver for spectral analysis on billion-node graphs

Authors: Da Zheng, Randal Burns, Joshua Vogelstein, Carey E. Priebe, Alexander S. Szalay

Abstract: Many eigensolvers such as ARPACK and Anasazi have been developed to compute eigenvalues of a large sparse matrix. These eigensolvers are limited by the capacity of RAM. They run in memory of a single machine for smaller eigenvalue problems and require the distributed memory for larger problems. In contrast, we develop an SSD-based eigensolver framework called FlashEigen, which extends Anasazi eigensolvers to SSDs, to compute eigenvalues of a graph with hundreds of millions or even billions of vertices in a single machine. FlashEigen performs sparse matrix multiplication in a semi-external memory fashion, i.e., we keep the sparse matrix on SSDs and the dense matrix in memory. We store the entire vector subspace on SSDs and reduce I/O to improve performance through caching the most recent dense matrix. Our result shows that FlashEigen is able to achieve 40%-60% performance of its in-memory implementation and has performance comparable to the Anasazi eigensolvers on a machine with 48 CPU cores. Furthermore, it is capable of scaling to a graph with 3.4 billion vertices and 129 billion edges. It takes about four hours to compute eight eigenvalues of the billion-node graph using 120 GB memory.

Submitted 26 February, 2016; v1 submitted 3 February, 2016; originally announced February 2016.

arXiv:1511.00100 [pdf, other]

doi 10.1109/TNNLS.2013.2253563

Fast Neuromimetic Object Recognition using FPGA Outperforms GPU Implementations

Authors: Garrick Orchard, Jacob G. Martin, R. Jacob Vogelstein, Ralph Etienne-Cummings

Abstract: Recognition of objects in still images has traditionally been regarded as a difficult computational problem. Although modern automated methods for visual object recognition have achieved steadily increasing recognition accuracy, even the most advanced computational vision approaches are unable to obtain performance equal to that of humans. This has led to the creation of many biologically-inspired models of visual object recognition, among them the HMAX model. HMAX is traditionally known to achieve high accuracy in visual object recognition tasks at the expense of significant computational complexity. Increasing complexity, in turn, increases computation time, reducing the number of images that can be processed per unit time. In this paper we describe how the computationally intensive, biologically inspired HMAX model for visual object recognition can be modified for implementation on a commercial Field Programmable Gate Array, specifically the Xilinx Virtex 6 ML605 evaluation board with XC6VLX240T FPGA. We show that with minor modifications to the traditional HMAX model we can perform recognition on images of size 128x128 pixels at a rate of 190 images per second with a less than 1% loss in recognition accuracy in both binary and multi-class visual object recognition tasks.

Submitted 31 October, 2015; originally announced November 2015.

Comments: 14 pages, 8 figures, 5 tables

Journal ref: Neural Networks and Learning Systems, IEEE Transactions on, vol.24, no.8, pp.1239-1252, 2013

arXiv:1509.03927 [pdf, other]

An M-Estimator for Reduced-Rank High-Dimensional Linear Dynamical System Identification

Authors: Shaojie Chen, Kai Liu, Yuguang Yang, Yuting Xu, Seonjoo Lee, Martin Lindquist, Brian S. Caffo, Joshua T. Vogelstein

Abstract: High-dimensional time-series data are becoming increasingly abundant across a wide variety of domains, spanning economics, neuroscience, particle physics, and cosmology. Fitting statistical models to such data, to enable parameter estimation and time-series prediction, is an important computational primitive. Existing methods, however, are unable to cope with the high-dimensional nature of these problems, due to both computational and statistical reasons. We mitigate both kinds of issues via proposing an M-estimator for Reduced-rank System IDentification (MR. SID). A combination of low-rank approximations, L-1 and L-2 penalties, and some numerical linear algebra tricks, yields an estimator that is computationally efficient and numerically stable. Simulations and real data examples demonstrate the utility of this approach in a variety of problems. In particular, we demonstrate that MR. SID can estimate spatial filters, connectivity graphs, and time-courses from native resolution functional magnetic resonance imaging data. Other applications and extensions are immediately available, as our approach is a generalization of the classical Kalman Filter-Smoother Expectation-Maximization algorithm.

Submitted 13 September, 2015; originally announced September 2015.

arXiv:1508.05414 [pdf, other]

Stability and Localization of inter-individual differences in functional connectivity

Authors: Raag D. Airan, Joshua T. Vogelstein, Jay J. Pillai, Brian Caffo, James J. Pekar, Haris I. Sair

Abstract: Much recent attention has been paid to quantifying anatomic and functional neuroimaging on the individual subject level. For optimal individual subject characterization, specific acquisition and analysis features need to be identified that maximize inter-individual variability while concomitantly minimizing intra-subject variability. Here we develop a non-parametric statistical metric that quantifies the degree to which a parameter set allows this individual subject differentiation. We apply this metric to analyzing publicly available test-retest resting-state fMRI (rs-fMRI) data sets. We find that for the question of maximizing individual differentiation, there is a relative tradeoff between increasing sampling through increased sampling frequency or increased acquisition time; that for the sizes of the interrogated data sets, only 4-5 min of acquisition time is necessary to perfectly differentiate each subject; and that brain regions that most contribute to individuals' unique characterization lie in association cortices thought to contribute to higher cognitive function. These findings may guide optimal rs-fMRI experiment design and may aid elucidation of the neural bases for subject-to-subject differences.

Submitted 11 May, 2016; v1 submitted 21 August, 2015; originally announced August 2015.

Comments: 14 pages, 5 figures

arXiv:1507.08376 [pdf, other]

A Joint Graph Inference Case Study: the C.elegans Chemical and Electrical Connectomes

Authors: Li Chen, Joshua T. Vogelstein, Vince Lyzinski, Carey E. Priebe

Abstract: We investigate joint graph inference for the chemical and electrical connectomes of the Caenorhabditis elegans roundworm. The C. elegans connectomes consist of 253 non-isolated neurons with known functional attributes, and there are two types of synaptic connectomes, resulting in a pair of graphs. We formulate our joint graph inference from the perspectives of seeded graph matching and joint vertex classification. Our results suggest that connectomic inference should proceed in the joint space of the two connectomes, which has significant neuroscientific implications.

Submitted 5 August, 2015; v1 submitted 30 July, 2015; originally announced July 2015.

arXiv:1506.03410 [pdf, other]

Sparse Projection Oblique Randomer Forests

Authors: Tyler M. Tomita, James Browne, Cencheng Shen, Jaewon Chung, Jesse L. Patsolic, Benjamin Falk, Jason Yim, Carey E. Priebe, Randal Burns, Mauro Maggioni, Joshua T. Vogelstein

Abstract: Decision forests, including Random Forests and Gradient Boosting Trees, have recently demonstrated state-of-the-art performance in a variety of machine learning settings. Decision forests are typically ensembles of axis-aligned decision trees; that is, trees that split only along feature dimensions. In contrast, many recent extensions to decision forests are based on axis-oblique splits. Unfortunately, these extensions forfeit one or more of the favorable properties of decision forests based on axis-aligned splits, such as robustness to many noise dimensions, interpretability, or computational efficiency. We introduce yet another decision forest, called "Sparse Projection Oblique Randomer Forests" (SPORF). SPORF uses very sparse random projections, i.e., linear combinations of a small subset of features. SPORF significantly improves accuracy over existing state-of-the-art algorithms on a standard benchmark suite for classification with >100 problems of varying dimension, sample size, and number of classes. To illustrate how SPORF addresses the limitations of both axis-aligned and existing oblique decision forest methods, we conduct extensive simulated experiments. SPORF typically yields improved performance over existing decision forests, while mitigating computational efficiency and scalability and maintaining interpretability. SPORF can easily be incorporated into other ensemble methods such as boosting to obtain potentially similar gains.

Submitted 3 October, 2019; v1 submitted 10 June, 2015; originally announced June 2015.

Comments: 31 pages; submitted to Journal of Machine Learning Research for review

MSC Class: 68T10 ACM Class: I.5.2

Journal ref: Journal of Machine Learning Research 21(104), 1-39, 2020

arXiv:1506.02079 [pdf, other]

Gradient-Domain Fusion for Color Correction in Large EM Image Stacks

Authors: Michael Kazhdan, Kunal Lillaney, William Roncal, Davi Bock, Joshua Vogelstein, Randal Burns

Abstract: We propose a new gradient-domain technique for processing registered EM image stacks to remove inter-image discontinuities while preserving intra-image detail. To this end, we process the image stack by first performing anisotropic smoothing along the slice axis and then solving a Poisson equation within each slice to re-introduce the detail. The final image stack is continuous across the slice axis and maintains sharp details within each slice. Adapting existing out-of-core techniques for solving the linear system, we describe a parallel algorithm with time complexity that is linear in the size of the data and space complexity that is sub-linear, allowing us to process datasets as large as five teravoxels with a 600 MB memory footprint.

Submitted 5 June, 2015; originally announced June 2015.

arXiv:1412.4098 [pdf, other]

doi 10.1016/j.patrec.2017.04.005

Manifold Matching using Shortest-Path Distance and Joint Neighborhood Selection

Authors: Cencheng Shen, Joshua T. Vogelstein, Carey E. Priebe

Abstract: Matching datasets of multiple modalities has become an important task in data analysis. Existing methods often rely on the embedding and transformation of each single modality without utilizing any correspondence information, which often results in sub-optimal matching performance. In this paper, we propose a nonlinear manifold matching algorithm using shortest-path distance and joint neighborhood selection. Specifically, a joint nearest-neighbor graph is built for all modalities. Then the shortest-path distance within each modality is calculated from the joint neighborhood graph, followed by embedding into and matching in a common low-dimensional Euclidean space. Compared to existing algorithms, our approach exhibits superior performance for matching disparate datasets of multiple modalities.

Submitted 10 April, 2017; v1 submitted 12 December, 2014; originally announced December 2014.

Comments: 13 pages, 8 figures, 2 tables

Journal ref: Pattern Recognition Letters 92, 41-48, 2017

arXiv:1411.6880 [pdf, other]

An Automated Images-to-Graphs Framework for High Resolution Connectomics

Authors: William Gray Roncal, Dean M. Kleissas, Joshua T. Vogelstein, Priya Manavalan, Kunal Lillaney, Michael Pekala, Randal Burns, R. Jacob Vogelstein, Carey E. Priebe, Mark A. Chevillet, Gregory D. Hager

Abstract: Reconstructing a map of neuronal connectivity is a critical challenge in contemporary neuroscience. Recent advances in high-throughput serial section electron microscopy (EM) have produced massive 3D image volumes of nanoscale brain tissue for the first time. The resolution of EM allows for individual neurons and their synaptic connections to be directly observed. Recovering neuronal networks by manually tracing each neuronal process at this scale is unmanageable, and therefore researchers are developing automated image processing modules. Thus far, state-of-the-art algorithms focus only on the solution to a particular task (e.g., neuron segmentation or synapse identification). In this manuscript we present the first fully automated images-to-graphs pipeline (i.e., a pipeline that begins with an imaged volume of neural tissue and produces a brain graph without any human interaction). To evaluate overall performance and select the best parameters and methods, we also develop a metric to assess the quality of the output graphs. We evaluate a set of algorithms and parameters, searching possible operating points to identify the best available brain graph for our assessment metric. Finally, we deploy a reference end-to-end version of the pipeline on a large, publicly available data set. This provides a baseline result and framework for community analysis and future algorithm development and testing.
All code and data derivatives have been made publicly available toward eventually unlocking new biofidelic computational primitives and understanding of neuropathologies.

Submitted 30 April, 2015; v1 submitted 25 November, 2014; originally announced November 2014.

Comments: 13 pages, first two authors contributed equally V2: Added additional experiments and clarifications; added information on infrastructure and pipeline environment

arXiv:1411.2158 [pdf, ps, other]

doi 10.1093/biomet/asx008

Covariate-assisted spectral clustering

Authors: Norbert Binkiewicz, Joshua T. Vogelstein, Karl Rohe

Abstract: Biological and social systems consist of myriad interacting units. The interactions can be represented in the form of a graph or network. Measurements of these graphs can reveal the underlying structure of these interactions, which provides insight into the systems that generated the graphs. Moreover, in applications such as connectomics, social networks, and genomics, graph data are accompanied by contextualizing measures on each node. We utilize these node covariates to help uncover latent communities in a graph, using a modification of spectral clustering. Statistical guarantees are provided under a joint mixture model that we call the node-contextualized stochastic blockmodel, including a bound on the mis-clustering rate. The bound is used to derive conditions for achieving perfect clustering. For most simulated cases, covariate-assisted spectral clustering yields results superior to regularized spectral clustering without node covariates and to an adaptation of canonical correlation analysis. We apply our clustering method to large brain graphs derived from diffusion MRI data, using the node locations or neurological region membership as covariates. In both cases, covariate-assisted spectral clustering yields clusters that are easier to interpret neurologically.

Submitted 30 October, 2016; v1 submitted 8 November, 2014; originally announced November 2014.

Comments: 28 pages, 4 figures, includes substantial changes to theoretical results

Journal ref: Biometrika, Volume 104, Issue 2, 1 June 2017, Pages 361-377

arXiv:1408.0500 [pdf, other]

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs

Authors: Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, Alexander S. Szalay

Abstract: Graph analysis performs many random reads and writes; thus, these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance loss. We do so by implementing a graph-processing engine on top of a user-space SSD file system designed for high IOPS and extreme parallelism. Our semi-external memory graph engine called FlashGraph stores vertex state in memory and edge lists on SSDs. It hides latency by overlapping computation with I/O. To save I/O bandwidth, FlashGraph only accesses edge lists requested by applications from SSDs; to increase I/O throughput and reduce CPU overhead for I/O, it conservatively merges I/O requests. These designs maximize performance for applications with different I/O characteristics. FlashGraph exposes a general and flexible vertex-centric programming interface that can express a wide variety of graph algorithms and their optimizations. We demonstrate that FlashGraph in semi-external memory performs many algorithms with performance up to 80% of its in-memory implementation and significantly outperforms PowerGraph, a popular distributed in-memory graph engine.

Submitted 25 January, 2015; v1 submitted 3 August, 2014; originally announced August 2014.

Comments: published in FAST'15

arXiv:1406.7851 [pdf, other]

doi 10.1080/01621459.2016.1219260

Nonparametric Bayes Modeling of Populations of Networks

Authors: Daniele Durante, David B. Dunson, Joshua T. Vogelstein

Abstract: Replicated network data are increasingly available in many research fields. In connectomic applications, inter-connections among brain regions are collected for each patient under study, motivating statistical models which can flexibly characterize the probabilistic generative mechanism underlying these network-valued data. Available models for a single network are not designed specifically for inference on the entire probability mass function of a network-valued random variable and therefore lack flexibility in characterizing the distribution of relevant topological structures. We propose a flexible Bayesian nonparametric approach for modeling the population distribution of network-valued data. The joint distribution of the edges is defined via a mixture model which reduces dimensionality and efficiently incorporates network information within each mixture component by leveraging latent space representations. The formulation leads to an efficient Gibbs sampler and provides simple and coherent strategies for inference and goodness-of-fit assessments. We provide theoretical results on the flexibility of our model and illustrate improved performance --- compared to state-of-the-art models --- in simulations and application to human brain networks.

Submitted 5 June, 2016; v1 submitted 30 June, 2014; originally announced June 2014.

Journal ref: Journal of the American Statistical Association (2017). 112, 1516-1530

arXiv:1405.3133 [pdf, other]

Graph Matching: Relax at Your Own Risk

Authors: Vince Lyzinski, Donniell Fishkind, Marcelo Fiori, Joshua T. Vogelstein, Carey E. Priebe, Guillermo Sapiro

Abstract: Graph matching---aligning a pair of graphs to minimize their edge disagreements---has received widespread attention from both theoretical and applied communities over the past several decades, including combinatorics, computer vision, and connectomics. Its attention can be partially attributed to its computational difficulty. Although many heuristics have previously been proposed in the literature to approximately solve graph matching, very few have any theoretical support for their performance. A common technique is to relax the discrete problem to a continuous problem, therefore enabling practitioners to bring gradient-descent-type algorithms to bear. We prove that an indefinite relaxation (when solved exactly) almost always discovers the optimal permutation, while a common convex relaxation almost always fails to discover the optimal permutation. These theoretical results suggest that initializing the indefinite algorithm with the convex optimum might yield improved practical performance. Indeed, experimental results illuminate and corroborate these theoretical findings, demonstrating that excellent results are achieved in both benchmark and real data problems by amalgamating the two approaches.

Submitted 9 January, 2015; v1 submitted 13 May, 2014; originally announced May 2014.

Comments: 14 pages, 11 figures, 3 tables

arXiv:1404.4800 [pdf, other]

Automatic Annotation of Axoplasmic Reticula in Pursuit of Connectomes

Authors: Ayushi Sinha, William Gray Roncal, Narayanan Kasthuri, Ming Chuang, Priya Manavalan, Dean M. Kleissas, Joshua T. Vogelstein, R. Jacob Vogelstein, Randal Burns, Jeff W. Lichtman, Michael Kazhdan

Abstract: In this paper, we present a new pipeline which automatically identifies and annotates axoplasmic reticula, which are small subcellular structures present only in axons. We run our algorithm on the Kasthuri11 dataset, which was color corrected using gradient-domain techniques to adjust contrast. We use a bilateral filter to smooth out the noise in this data while preserving edges, which highlights axoplasmic reticula. These axoplasmic reticula are then annotated using a morphological region growing algorithm. Additionally, we perform Laplacian sharpening on the bilaterally filtered data to enhance edges, and repeat the morphological region growing algorithm to annotate more axoplasmic reticula. We track our annotations through the slices to improve precision, and to create long objects to aid in segment merging. This method annotates axoplasmic reticula with high precision. Our algorithm can easily be adapted to annotate axoplasmic reticula in different sets of brain data by changing a few thresholds. The contribution of this work is the introduction of a straightforward and robust pipeline which annotates axoplasmic reticula with high precision, contributing towards advancements in automatic feature annotations in neural EM data.

Submitted 16 April, 2014; originally announced April 2014.

Comments: 2 pages, 1 figure

arXiv:1403.3724 [pdf, other]

VESICLE: Volumetric Evaluation of Synaptic Interfaces using Computer vision at Large Scale

Authors: William Gray Roncal, Michael Pekala, Verena Kaynig-Fittkau, Dean M. Kleissas, Joshua T. Vogelstein, Hanspeter Pfister, Randal Burns, R. Jacob Vogelstein, Mark A. Chevillet, Gregory D. Hager

Abstract: An open challenge problem at the forefront of modern neuroscience is to obtain a comprehensive mapping of the neural pathways that underlie human brain function; an enhanced understanding of the wiring diagram of the brain promises to lead to new breakthroughs in diagnosing and treating neurological disorders. Inferring brain structure from image data, such as that obtained via electron microscopy (EM), entails solving the problem of identifying biological structures in large data volumes. Synapses, which are a key communication structure in the brain, are particularly difficult to detect due to their small size and limited contrast. Prior work in automated synapse detection has relied upon time-intensive biological preparations (post-staining, isotropic slice thicknesses) in order to simplify the problem. This paper presents VESICLE, the first known approach designed for mammalian synapse detection in anisotropic, non-post-stained data. Our methods explicitly leverage biological context, and the results exceed existing synapse detection methods in terms of accuracy and scalability. We provide two different approaches - one a deep learning classifier (VESICLE-CNN) and one a lightweight Random Forest approach (VESICLE-RF) to offer alternatives in the performance-scalability space. Addressing this synapse detection challenge enables the analysis of high-throughput imaging data soon expected to reach petabytes of data, and provides tools for more rapid estimation of brain-graphs.
Finally, to facilitate community efforts, we developed tools for large-scale object detection, and demonstrated this framework to find $\approx$ 50,000 synapses in 60,000 $\mu m^3$ (220 GB on disk) of electron microscopy data.

Submitted 7 September, 2015; v1 submitted 14 March, 2014; originally announced March 2014.

Comments: v4: added clarifying figures and updates for readability. v3: fixed metadata. 11 pp v2: Added CNN classifier, significant changes to improve performance and generalization

Journal ref: Proceedings of the British Machine Vision Conference (BMVC), pages 81.1-81.13. BMVA Press, September 2015

arXiv:1401.3813 [pdf, other]

Seeded Graph Matching Via Joint Optimization of Fidelity and Commensurability

Authors: Heather Patsolic, Sancar Adali, Joshua T. Vogelstein, Youngser Park, Carey E. Priebe, Gongkai Li, Vince Lyzinski

Abstract: We present a novel approximate graph matching algorithm that incorporates seeded data into the graph matching paradigm. Our Joint Optimization of Fidelity and Commensurability (JOFC) algorithm embeds two graphs into a common Euclidean space where the matching inference task can be performed. Through real and simulated data examples, we demonstrate the versatility of our algorithm in matching graphs with various characteristics--weightedness, directedness, loopiness, many-to-one and many-to-many matchings, and soft seedings.

Submitted 8 December, 2019; v1 submitted 15 January, 2014; originally announced January 2014.

Comments: 26 pages, 7 figures. Updated content and added application of simultaneous matching for several time-steps for zebrafish connectomes

arXiv:1312.4875 [pdf, other]

doi 10.1109/GlobalSIP.2013.6736878

MIGRAINE: MRI Graph Reliability Analysis and Inference for Connectomics

Authors: William Gray Roncal, Zachary H. Koterba, Disa Mhembere, Dean M. Kleissas, Joshua T. Vogelstein, Randal Burns, Anita R. Bowles, Dimitrios K. Donavos, Sephira Ryman, Rex E. Jung, Lei Wu, Vince Calhoun, R. Jacob Vogelstein

Abstract: Currently, connectomes (e.g., functional or structural brain graphs) can be estimated in humans at $\approx 1~mm^3$ scale using a combination of diffusion weighted magnetic resonance imaging, functional magnetic resonance imaging and structural magnetic resonance imaging scans. This manuscript summarizes a novel, scalable implementation of open-source algorithms to rapidly estimate magnetic resonance connectomes, using both anatomical regions of interest (ROIs) and voxel-size vertices. To assess the reliability of our pipeline, we develop a novel nonparametric non-Euclidean reliability metric. Here we provide an overview of the methods used, demonstrate our implementation, and discuss available user extensions. We conclude with results showing the efficacy and reliability of the pipeline over previous state-of-the-art.

Submitted 17 December, 2013; originally announced December 2013.

Comments: Published as part of 2013 IEEE GlobalSIP conference

arXiv:1312.4318 [pdf, other]

doi 10.1109/GlobalSIP.2013.6736874

Computing Scalable Multivariate Glocal Invariants of Large (Brain-) Graphs

Authors: Disa Mhembere, William Gray Roncal, Daniel Sussman, Carey E. Priebe, Rex Jung, Sephira Ryman, R. Jacob Vogelstein, Joshua T. Vogelstein, Randal Burns

Abstract: Graphs are quickly emerging as a leading abstraction for the representation of data. One important application domain originates from an emerging discipline called "connectomics". Connectomics studies the brain as a graph; vertices correspond to neurons (or collections thereof) and edges correspond to structural or functional connections between them. To explore the variability of connectomes---to address both basic science questions regarding the structure of the brain, and medical health questions about psychiatry and neurology---one can study the topological properties of these brain-graphs. We define multivariate glocal graph invariants: these are features of the graph that capture various local and global topological properties of the graphs. We show that the collection of features can collectively be computed via a combination of daisy-chaining, sparse matrix representation and computations, and efficient approximations. Our custom open-source Python package serves as a back-end to a Web-service that we have created to enable researchers to upload graphs, and download the corresponding invariants in a number of different formats. Moreover, we built this package to support distributed processing on multicore machines. This is therefore an enabling technology for network science, lowering the barrier of entry by providing tools to biologists and analysts who otherwise lack these capabilities. As a demonstration, we run our code on 120 brain-graphs, each with approximately 16M vertices and up to 90M edges.

Submitted 16 December, 2013; originally announced December 2013.

Comments: Published as part of 2013 IEEE GlobalSIP conference

arXiv:1312.1869 [pdf, other]

Parallel inversion of huge covariance matrices

Authors: Anjishnu Banerjee, Joshua Vogelstein, David Dunson

Abstract: An extremely common bottleneck encountered in statistical learning algorithms is inversion of huge covariance matrices, examples being in evaluating Gaussian likelihoods for a large number of data points. We propose general parallel algorithms for inverting positive definite matrices, which are nearly rank deficient. Such matrix inversions are needed in Gaussian process computations, among other settings, and remain a bottleneck even with the increasing literature on low rank approximations. We propose a general class of algorithms for parallelizing computations to dramatically speed up computation time by orders of magnitude exploiting multicore architectures. We implement our algorithm on a cloud computing platform, providing pseudo and actual code. The algorithm can be easily implemented on any multicore parallel computing resource. Some illustrations are provided to give a flavor for the gains and what becomes possible in freeing up this bottleneck.

Submitted 6 December, 2013; originally announced December 2013.

Comments: 17 pages, 3 tables, 3 figures

arXiv:1312.1099 [pdf, other]

Multiscale Dictionary Learning for Estimating Conditional Distributions

Authors: Francesca Petralia, Joshua Vogelstein, David B. Dunson

Abstract: Nonparametric estimation of the conditional distribution of a response given high-dimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features.

Submitted 4 December, 2013; originally announced December 2013.

Journal ref: Proceeding of Neural Information Processing Systems, Lake Tahoe, Nevada December 2013

arXiv:1311.6425 [pdf, other]

Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

Authors: Marcelo Fiori, Pablo Sprechmann, Joshua Vogelstein, Pablo Musé, Guillermo Sapiro

Abstract: Graph matching is a challenging problem with very important applications in a wide range of fields, from image and video analysis to biological and biomedical problems. We propose a robust graph matching algorithm inspired by sparsity-related techniques. We cast the problem, resembling group or collaborative sparsity formulations, as a non-smooth convex optimization problem that can be efficiently solved using augmented Lagrangian techniques. The method can deal with weighted or unweighted graphs, as well as multimodal data, where different graphs represent different types of data. The proposed approach is also naturally integrated with collaborative graph inference techniques, solving general network inference problems where the observed variables, possibly coming from different modalities, are not in correspondence. The algorithm is tested and compared with state-of-the-art graph matching techniques in both synthetic and real graphs. We also present results on multimodal graphs and applications to collaborative inference of brain connectivity from alignment-free functional magnetic resonance imaging (fMRI) data. The code is publicly available.

Submitted 25 November, 2013; originally announced November 2013.

Comments: NIPS 2013

arXiv:1311.5954 [pdf, other]

doi 10.1109/TPAMI.2015.2456913

Robust Vertex Classification

Authors: Li Chen, Cencheng Shen, Joshua Vogelstein, Carey Priebe

Abstract: For random graphs distributed according to stochastic blockmodels, a special case of latent position graphs, adjacency spectral embedding followed by appropriate vertex classification is asymptotically Bayes optimal; but this approach requires knowledge of and critically depends on the model dimension. In this paper, we propose a sparse representation vertex classifier which does not require information about the model dimension. This classifier represents a test vertex as a sparse combination of the vertices in the training set and uses the recovered coefficients to classify the test vertex. We prove consistency of our proposed classifier for stochastic blockmodels, and demonstrate that the sparse representation classifier can predict vertex labels with higher accuracy than adjacency spectral embedding approaches via both simulation studies and real data experiments. Our results demonstrate the robustness and effectiveness of our proposed vertex classifier when the model dimension is unknown.

Submitted 22 April, 2015; v1 submitted 22 November, 2013; originally announced November 2013.

Comments: 18 pages, 13 figures

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence 38(3), 578-590, 2016

arXiv:1310.1297 [pdf, other]

Spectral Clustering for Divide-and-Conquer Graph Matching

Authors: Vince Lyzinski, Daniel L. Sussman, Donniell E. Fishkind, Henry Pao, Li Chen, Joshua T. Vogelstein, Youngser Park, Carey E. Priebe

Abstract: We present a parallelized bijective graph matching algorithm that leverages seeds and is designed to match very large graphs. Our algorithm combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures. We justify our approach by proving that modestly correlated, large stochastic block model random graphs are correctly matched utilizing very few seeds through our divide-and-conquer procedure. We also demonstrate the effectiveness of our approach in matching very large graphs in simulated and real data examples, showing up to a factor of 8 improvement in runtime with minimal sacrifice in accuracy.

Submitted 12 March, 2015; v1 submitted 4 October, 2013; originally announced October 2013.

Comments: 32 pages, 8 figures

arXiv:1310.0041 [pdf, other]

Gradient-Domain Processing for Large EM Image Stacks

Authors: Michael Kazhdan, Randal Burns, Bobby Kasthuri, Jeff Lichtman, Jacob Vogelstein, Joshua Vogelstein

Abstract: We propose a new gradient-domain technique for processing registered EM image stacks to remove the inter-image discontinuities while preserving intra-image detail. To this end, we process the image stack by first performing anisotropic diffusion to smooth the data along the slice axis and then solving a screened-Poisson equation within each slice to re-introduce the detail. The final image stack is both continuous across the slice axis (facilitating the tracking of information between slices) and maintains sharp details within each slice (supporting automatic feature detection). To support this editing, we describe the implementation of the first multigrid solver designed for efficient gradient domain processing of large, out-of-core, voxel grids.

Submitted 30 September, 2013; originally announced October 2013.

arXiv:1306.3543 [pdf, other]

The Open Connectome Project Data Cluster: Scalable Analysis and Vision for High-Throughput Neuroscience

Authors: Randal Burns, William Gray Roncal, Dean Kleissas, Kunal Lillaney, Priya Manavalan, Eric Perlman, Daniel R. Berger, Davi D. Bock, Kwanghun Chung, Logan Grosenick, Narayanan Kasthuri, Nicholas C. Weiler, Karl Deisseroth, Michael Kazhdan, Jeff Lichtman, R. Clay Reid, Stephen J. Smith, Alexander S. Szalay, Joshua T. Vogelstein, R. Jacob Vogelstein

Abstract: We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes---neural connectivity maps of the brain---using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at http://openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems---reads to parallel disk arrays and writes to solid-state storage---to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effectiveness of spatial data organization.

Submitted 18 June, 2013; v1 submitted 14 June, 2013; originally announced June 2013.

Comments: 11 pages, 13 figures

arXiv:1304.5894 [pdf]

Bayesian crack detection in ultra high resolution multimodal images of paintings

Authors: Bruno Cornelis, Yun Yang, Joshua T. Vogelstein, Ann Dooms, Ingrid Daubechies, David Dunson

Abstract: The preservation of our cultural heritage is of paramount importance. Thanks to recent developments in digital acquisition techniques, powerful image analysis algorithms are being developed which can be useful non-invasive tools to assist in the restoration and preservation of art. In this paper we propose a semi-supervised crack detection method that can be used for high-dimensional acquisitions of paintings coming from different modalities. Our dataset consists of a recently acquired collection of images of the Ghent Altarpiece (1432), one of Northern Europe's most important art masterpieces. Our goal is to build a classifier that is able to discern crack pixels from the background consisting of non-crack pixels, making optimal use of the information that is provided by each modality. To accomplish this we employ a recently developed non-parametric Bayesian classifier that uses tensor factorizations to characterize any conditional probability. A prior is placed on the parameters of the factorization such that every possible interaction between predictors is allowed while still identifying a sparse subset among these predictors. The proposed Bayesian classifier, which we will refer to as conditional Bayesian tensor factorization or CBTF, is assessed by visually comparing classification results with the Random Forest (RF) algorithm.

Submitted 23 April, 2013; v1 submitted 22 April, 2013; originally announced April 2013.

Comments: 8 pages, double column

arXiv:1304.4657 [pdf, other]

DELTACON: A Principled Massive-Graph Similarity Function

Authors: Danai Koutra, Joshua T. Vogelstein, Christos Faloutsos

Abstract: How much did a network change since yesterday? How different is the wiring between Bob's brain (a left-handed male) and Alice's brain (a right-handed female)? Graph similarity with known node correspondence, i.e. the detection of changes in the connectivity of graphs, arises in numerous settings. In this work, we formally state the axioms and desired properties of the graph similarity functions, and evaluate when state-of-the-art methods fail to detect crucial connectivity changes in graphs. We propose DeltaCon, a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g. employees of a company, customers of a mobile carrier). Experiments on various synthetic and real graphs showcase the advantages of our method over existing similarity measures. Finally, we apply DeltaCon to real applications: (a) we classify people into groups of high and low creativity based on their brain connectivity graphs, and (b) do temporal anomaly detection in the who-emails-whom Enron graph.

Submitted 16 April, 2013; originally announced April 2013.

Comments: 2013 SIAM International Conference in Data Mining (SDM)

ACM Class: E.1; G.2.2

arXiv:1304.0542 [pdf, other]

Multichannel Electrophysiological Spike Sorting via Joint Dictionary Learning & Mixture Modeling

Authors: David E. Carlson, Joshua T. Vogelstein, Qisong Wu, Wenzhao Lian, Mingyuan Zhou, Colin R. Stoetzner, Daryl Kipke, Douglas Weber, David B. Dunson, Lawrence Carin

Abstract: We propose a construction for joint feature learning and clustering of multichannel extracellular electrophysiological data across multiple recording periods for action potential detection and discrimination ("spike sorting"). Our construction improves over the previous state-of-the-art principally in four ways. First, via sharing information across channels, we can better distinguish between single-unit spikes and artifacts. Second, our proposed "focused mixture model" (FMM) elegantly deals with units appearing, disappearing, or reappearing over multiple recording days, an important consideration for any chronic experiment. Third, by jointly learning features and clusters, we improve performance over previous attempts that proceeded via a two-stage ("frequentist") learning process. Fourth, by directly modeling spike rate, we improve detection of sparsely spiking neurons. Moreover, our Bayesian construction seamlessly handles missing data. We present state-of-the-art performance without requiring manual tuning of many hyper-parameters on both a public dataset with partial ground truth and a new experimental dataset.

Submitted 4 August, 2013; v1 submitted 2 April, 2013; originally announced April 2013.

Comments: 14 pages, 9 figures

arXiv:1211.3601 [pdf, other]

Statistical inference on errorfully observed graphs

Authors: Carey E. Priebe, Daniel L. Sussman, Minh Tang, Joshua T. Vogelstein

Abstract: Statistical inference on graphs is a burgeoning field in the applied and theoretical statistics communities, as well as throughout the wider world of science, engineering, business, etc. In many applications, we are faced with the reality of errorfully observed graphs. That is, the existence of an edge between two vertices is based on some imperfect assessment. In this paper, we consider a graph $G = (V,E)$. We wish to perform an inference task -- the inference task considered here is "vertex classification". However, we do not observe $G$; rather, for each potential edge $uv \in {{V}\choose{2}}$ we observe an "edge-feature" which we use to classify $uv$ as edge/not-edge. Thus we errorfully observe $G$ when we observe the graph $\widetilde{G} = (V,\widetilde{E})$ as the edges in $\widetilde{E}$ arise from the classifications of the "edge-features", and are expected to be errorful. Moreover, we face a quantity/quality trade-off regarding the edge-features we observe -- more informative edge-features are more expensive, and hence the number of potential edges that can be assessed decreases with the quality of the edge-features. We study this problem by formulating a quantity/quality trade-off for a simple class of random graph models, namely the stochastic blockmodel. We then consider a simple but optimal vertex classifier for classifying $v$ and we derive the optimal quantity/quality operating point for subsequent graph inference in the face of this trade-off.
The optimal operating points for the quantity/quality trade-off are surprising and illustrate the issue that methods for intermediate tasks should be chosen to maximize performance for the ultimate inference task. Finally, we investigate the quantity/quality trade-off for errorful observations of the {\it C.\ elegans} connectome graph.

Submitted 21 July, 2014; v1 submitted 15 November, 2012; originally announced November 2012.

Comments: 30 pages, 8 figures

arXiv:1205.0309 [pdf, other]

Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown

Authors: Donniell E. Fishkind, Daniel L. Sussman, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

Abstract: For random graphs distributed according to a stochastic block model, we consider the inferential task of partitioning vertices into blocks using spectral techniques. Spectral partitioning using the normalized Laplacian and the adjacency matrix have both been shown to be consistent as the number of vertices tends to infinity. Importantly, both procedures require that the number of blocks and the rank of the communication probability matrix are known, even as the rest of the parameters may be unknown. In this article, we prove that the (suitably modified) adjacency-spectral partitioning procedure, requiring only an upper bound on the rank of the communication probability matrix, is consistent. Indeed, this result demonstrates a robustness to model mis-specification; an overestimate of the rank may impose a moderate performance penalty, but the procedure is still consistent. Furthermore, we extend this procedure to the setting where adjacencies may have multiple modalities and we allow for either directed or undirected graphs.

Submitted 21 August, 2012; v1 submitted 1 May, 2012; originally announced May 2012.

Comments: 26 pages, 2 figures

Showing 51–100 of 108 results for author: Vogelstein, J