-
PDE-Inspired Algorithms for Semi-Supervised Learning on Point Clouds
Authors:
Oliver M. Crook,
Tim Hurst,
Carola-Bibiane Schönlieb,
Matthew Thorpe,
Konstantinos C. Zygalakis
Abstract:
Given a data set and a subset of labels the problem of semi-supervised learning on point clouds is to extend the labels to the entire data set. In this paper we extend the labels by minimising the constrained discrete $p$-Dirichlet energy. Under suitable conditions the discrete problem can be connected, in the large data limit, with the minimiser of a weighted continuum $p$-Dirichlet energy with t…
▽ More
Given a data set and a subset of labels the problem of semi-supervised learning on point clouds is to extend the labels to the entire data set. In this paper we extend the labels by minimising the constrained discrete $p$-Dirichlet energy. Under suitable conditions the discrete problem can be connected, in the large data limit, with the minimiser of a weighted continuum $p$-Dirichlet energy with the same constraints. We take advantage of this connection by designing numerical schemes that first estimate the density of the data and then apply PDE methods, such as pseudo-spectral methods, to solve the corresponding Euler-Lagrange equation. We prove that our scheme is consistent in the large data limit for two methods of density estimation: kernel density estimation and spline kernel density estimation.
△ Less
Submitted 23 September, 2019;
originally announced September 2019.
-
Semi-Supervised Non-Parametric Bayesian Modelling of Spatial Proteomics
Authors:
Oliver M. Crook,
Kathryn S. Lilley,
Laurent Gatto,
Paul D. W. Kirk
Abstract:
Understanding sub-cellular protein localisation is an essential component to analyse context specific protein function. Recent advances in quantitative mass-spectrometry (MS) have led to high resolution map** of thousands of proteins to sub-cellular locations within the cell. Novel modelling considerations to capture the complex nature of these data are thus necessary. We approach analysis of sp…
▽ More
Understanding sub-cellular protein localisation is an essential component to analyse context specific protein function. Recent advances in quantitative mass-spectrometry (MS) have led to high resolution map** of thousands of proteins to sub-cellular locations within the cell. Novel modelling considerations to capture the complex nature of these data are thus necessary. We approach analysis of spatial proteomics data in a non-parametric Bayesian framework, using mixtures of Gaussian process regression models. The Gaussian process regression model accounts for correlation structure within a sub-cellular niche, with each mixture component capturing the distinct correlation structure observed within each niche. Proteins with a priori labelled locations motivate using semi-supervised learning to inform the Gaussian process hyperparameters. We moreover provide an efficient Hamiltonian-within-Gibbs sampler for our model. As in other recent work, we reduce the computational burden associated with inversion of covariance matrices by exploiting the structure in the covariance matrix. A tensor decomposition of our covariance matrices allows extended Trench and Durbin algorithms to be applied it order to reduce the computational complexity of inversion and hence accelerate computation. A stand-alone R-package implementing these methods using high-performance C++ libraries is available at: https://github.com/ococrook/toeplitz
△ Less
Submitted 11 March, 2019; v1 submitted 7 March, 2019;
originally announced March 2019.
-
Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics
Authors:
Oliver M. Crook,
Laurent Gatto,
Paul D. W. Kirk
Abstract:
The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang and Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS)…
▽ More
The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang and Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5,157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel
△ Less
Submitted 12 October, 2018;
originally announced October 2018.