-
Regularised Canonical Correlation Analysis: graphical lasso, biplots and beyond
Authors:
Lennie Wells,
Kumar Thurimella,
Sergio Bacallado
Abstract:
Recent developments in regularized Canonical Correlation Analysis (CCA) promise powerful methods for high-dimensional, multiview data analysis. However, justifying the structural assumptions behind many popular approaches remains a challenge, and features of realistic biological datasets pose practical difficulties that are seldom discussed. We propose a novel CCA estimator rooted in an assumption of conditional independencies and based on the Graphical Lasso. Our method has desirable theoretical guarantees and good empirical performance, demonstrated through extensive simulations and real-world biological datasets. Recognizing the difficulties of model selection in high dimensions and other practical challenges of applying CCA in real-world settings, we introduce a novel framework for evaluating and interpreting regularized CCA models in the context of Exploratory Data Analysis (EDA), which we hope will empower researchers and pave the way for wider adoption.
Submitted 5 March, 2024;
originally announced March 2024.
-
Tanimoto Random Features for Scalable Molecular Machine Learning
Authors:
Austin Tripp,
Sergio Bacallado,
Sukriti Singh,
José Miguel Hernández-Lobato
Abstract:
The Tanimoto coefficient is commonly used to measure the similarity between molecules represented as discrete fingerprints, either as a distance metric or a positive definite kernel. While many kernel methods can be accelerated using random feature approximations, at present there is a lack of such approximations for the Tanimoto kernel. In this paper we propose two kinds of novel random features to allow this kernel to scale to large datasets, and in the process discover a novel extension of the kernel to real-valued vectors. We theoretically characterize these random features, and provide error bounds on the spectral norm of the Gram matrix. Experimentally, we show that these random features are effective at approximating the Tanimoto coefficient of real-world datasets and are useful for molecular property prediction and optimization tasks.
Submitted 13 November, 2023; v1 submitted 26 June, 2023;
originally announced June 2023.
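As a small illustration of the quantity this abstract is about, the sketch below computes the Tanimoto coefficient between two binary fingerprints, represented here as sets of on-bit indices. It is not the random-feature method proposed in the paper; the function name `tanimoto`, the example fingerprints, and the convention for two empty fingerprints are assumptions made for illustration only.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / union


# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits in total.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
similarity = tanimoto(fp1, fp2)
print(similarity)
```

Evaluating this exactly for every pair in a dataset costs time quadratic in the number of molecules, which is the scaling problem that random feature approximations aim to avoid.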
-
Conditional Neural Processes for Molecules
Authors:
Miguel Garcia-Ortegon,
Andreas Bender,
Sergio Bacallado
Abstract:
Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.
Submitted 23 February, 2023; v1 submitted 17 October, 2022;
originally announced October 2022.
-
DOCKSTRING: easy molecular docking yields better benchmarks for ligand design
Authors:
Miguel García-Ortegón,
Gregor N. C. Simm,
Austin J. Tripp,
José Miguel Hernández-Lobato,
Andreas Bender,
Sergio Bacallado
Abstract:
The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate's interaction with the target. By contrast, molecular docking is a widely successful method in drug discovery to estimate binding affinities. However, docking simulations require a significant amount of domain knowledge to set up correctly which hampers adoption. To this end, we present DOCKSTRING, a bundle for meaningful and robust comparison of ML models consisting of three components: (1) an open-source Python package for straightforward computation of docking scores; (2) an extensive dataset of docking scores and poses of more than 260K ligands for 58 medically-relevant targets; and (3) a set of pharmaceutically-relevant benchmark tasks including regression, virtual screening, and de novo design. The Python package implements a robust ligand and target preparation protocol that allows non-experts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix, thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more appropriate evaluation objective than simple physicochemical properties, yielding more realistic benchmark tasks and molecular candidates.
Submitted 28 October, 2021;
originally announced October 2021.
-
The *-Edge-Reinforced Random Walk
Authors:
Sergio Bacallado,
Christophe Sabot,
Pierre Tarrès
Abstract:
We define a linearly reinforced process called the *-Edge-Reinforced Random Walk (*-ERRW) which can be seen as a Yaglom reversible, hence non-reversible, extension of the Edge-Reinforced Random Walk (ERRW) introduced by Coppersmith and Diaconis in 1986. This family of processes also generalizes the r-dependent ERRW introduced by Bacallado (2009). Under some assumptions on the initial weights, the *-ERRW is partially exchangeable in the sense of Diaconis and Freedman (1980), and thus it is a random walk in a random environment. The main result of the paper gives the explicit expression of the mixing law, hence extending the "magic formula" of Coppersmith and Diaconis from the case of mixtures of reversible Markov chains to the case of mixtures of Yaglom reversible Markov chains.
Submitted 29 November, 2023; v1 submitted 17 February, 2021;
originally announced February 2021.
-
BETS: The dangers of selection bias in early analyses of the coronavirus disease (COVID-19) pandemic
Authors:
Qingyuan Zhao,
Nianqiao Ju,
Sergio Bacallado,
Rajen D. Shah
Abstract:
The coronavirus disease 2019 (COVID-19) has quickly grown from a regional outbreak in Wuhan, China to a global pandemic. Early estimates of the epidemic growth and incubation period of COVID-19 may have been biased due to sample selection. Using detailed case reports from 14 locations in and outside mainland China, we obtained 378 Wuhan-exported cases who left Wuhan before an abrupt travel quarantine. We developed a generative model we call BETS for four key epidemiological events---Beginning of exposure, End of exposure, time of Transmission, and time of Symptom onset (BETS)---and derived explicit formulas to correct for the sample selection. We gave a detailed illustration of why some early and highly influential analyses of the COVID-19 pandemic were severely biased. All our analyses, regardless of which subsample and model were being used, point to an epidemic doubling time of 2 to 2.5 days during the early outbreak in Wuhan. A Bayesian nonparametric analysis further suggests that about 5% of the symptomatic cases may not develop symptoms within 14 days of infection and that men may be much more likely than women to develop symptoms within 2 days of infection.
Submitted 24 September, 2020; v1 submitted 16 April, 2020;
originally announced April 2020.
-
Bayesian Uncertainty Directed Trial Designs
Authors:
Steffen Ventz,
Matteo Cellamare,
Sergio Bacallado,
Lorenzo Trippa
Abstract:
Most Bayesian response-adaptive designs unbalance randomization rates towards the most promising arms with the goal of increasing the number of positive treatment outcomes during the study, even though the primary aim of the trial is different. We discuss Bayesian uncertainty directed designs (BUD), a class of Bayesian designs in which the investigator specifies an information measure tailored to the experiment. All decisions during the trial are selected to optimize the available information at the end of the study. The approach can be applied to several designs, ranging from early stage multi-arm trials to biomarker-driven and multi-endpoint studies. We discuss the asymptotic limit of the patient allocation proportion to treatments, and illustrate the finite-sample operating characteristics of BUD designs through examples, including multi-arm trials, biomarker-stratified trials, and trials with multiple co-primary endpoints.
Submitted 29 June, 2018;
originally announced June 2018.
-
Bayesian Mixed Effects Models for Zero-inflated Compositions in Microbiome Data Analysis
Authors:
Boyu Ren,
Sergio Bacallado,
Stefano Favaro,
Tommi Vatanen,
Curtis Huttenhower,
Lorenzo Trippa
Abstract:
Detecting associations between microbial compositions and sample characteristics is one of the most important tasks in microbiome studies. Most of the existing methods apply univariate models to single microbial species separately, with adjustments for multiple hypothesis testing. We propose a Bayesian analysis for a generalized mixed effects linear model tailored to this application. The marginal prior on each microbial composition is a Dirichlet Process, and dependence across compositions is induced through a linear combination of individual covariates, such as disease biomarkers or the subject's age, and latent factors. The latent factors capture residual variability and their dimensionality is learned from the data in a fully Bayesian procedure. The proposed model is tested in data analyses and simulation studies with zero-inflated compositions. In these settings, within each sample, a large proportion of counts per microbial species are equal to zero. In our Bayesian model a priori the probability of compositions with absent microbial species is strictly positive. We propose an efficient algorithm to sample from the posterior and visualizations of model parameters which reveal associations between covariates and microbial compositions. We evaluate the proposed method in simulation studies, and then analyze a microbiome dataset for infants with type 1 diabetes which contains a large proportion of zeros in the sample-specific microbial compositions.
Submitted 24 August, 2019; v1 submitted 3 November, 2017;
originally announced November 2017.
-
Sequential Matrix Completion
Authors:
Annie Marsden,
Sergio Bacallado
Abstract:
We propose a novel algorithm for sequential matrix completion in a recommender system setting, where the $(i,j)$th entry of the matrix corresponds to a user $i$'s rating of product $j$. The objective of the algorithm is to provide a sequential policy for user-product pair recommendation which will yield the highest possible ratings after a finite time horizon. The algorithm uses a Gamma process factor model with two posterior-focused bandit policies, Thompson Sampling and Information-Directed Sampling. While Thompson Sampling shows competitive performance in simulations, state-of-the-art performance is obtained from Information-Directed Sampling, which makes its recommendations based on a ratio between the expected reward and a measure of information gain. To our knowledge, this is the first implementation of Information-Directed Sampling on large real datasets.
This approach contributes to a recent line of research on bandit approaches to collaborative filtering including Kawale et al. (2015), Li et al. (2010), Bresler et al. (2014), Li et al. (2016), Deshpande & Montanari (2012), and Zhao et al. (2013). The setting of this paper, as has been noted in Kawale et al. (2015) and Zhao et al. (2013), presents significant challenges to bounding regret after finite horizons. We discuss these challenges in relation to simpler models for bandits with side information, such as linear or Gaussian process bandits, and hope the experiments presented here motivate further research toward theoretical guarantees.
Submitted 22 October, 2017;
originally announced October 2017.
-
Bayesian Nonparametric Ordination for the Analysis of Microbial Communities
Authors:
Boyu Ren,
Sergio Bacallado,
Stefano Favaro,
Susan Holmes,
Lorenzo Trippa
Abstract:
Human microbiome studies use sequencing technologies to measure the abundance of bacterial species or Operational Taxonomic Units (OTUs) in samples of biological material. Typically the data are organized in contingency tables with OTU counts across heterogeneous biological samples. In the microbial ecology community, ordination methods are frequently used to investigate latent factors or clusters that capture and describe variations of OTU counts across biological samples. It remains important to evaluate how uncertainty in estimates of each biological sample's microbial distribution propagates to ordination analyses, including visualization of clusters and projections of biological samples on low dimensional spaces. We propose a Bayesian analysis for dependent distributions to endow frequently used ordinations with estimates of uncertainty. A Bayesian nonparametric prior for dependent normalized random measures is constructed, which is marginally equivalent to the normalized generalized Gamma process, a well-known prior for nonparametric analyses. In our prior the dependence and similarity between microbial distributions is represented by latent factors that concentrate in a low dimensional space. We use a shrinkage prior to tune the dimensionality of the latent factors. The resulting posterior samples of model parameters can be used to evaluate uncertainty in analyses routinely applied in microbiome studies. Specifically, by combining them with multivariate data analysis techniques we can visualize credible regions in ecological ordination plots. The characteristics of the proposed model are illustrated through a simulation study and applications in two microbiome datasets.
Submitted 20 January, 2017; v1 submitted 19 January, 2016;
originally announced January 2016.
-
Looking-backward probabilities for Gibbs-type exchangeable random partitions
Authors:
Sergio Bacallado,
Stefano Favaro,
Lorenzo Trippa
Abstract:
Gibbs-type random probability measures and the exchangeable random partitions they induce represent the subject of a rich and active literature. They provide a probabilistic framework for a wide range of theoretical and applied problems that are typically referred to as species sampling problems. In this paper, we consider the class of looking-backward species sampling problems introduced in Lijoi et al. (Ann. Appl. Probab. 18 (2008) 1519-1547) in Bayesian nonparametrics. Specifically, given some information on the random partition induced by an initial sample from a Gibbs-type random probability measure, we study the conditional distributions of statistics related to the old species, namely those species detected in the initial sample and possibly re-observed in an additional sample. The proposed results contribute to the analysis of conditional properties of Gibbs-type exchangeable random partitions, so far focused mainly on statistics related to those species generated by the additional sample and not already detected in the initial sample.
Submitted 3 April, 2015;
originally announced April 2015.
-
Bayesian nonparametric analysis of reversible Markov chains
Authors:
Sergio Bacallado,
Stefano Favaro,
Lorenzo Trippa
Abstract:
We introduce a three-parameter random walk with reinforcement, called the $(θ,α,β)$ scheme, which generalizes the linearly edge reinforced random walk to uncountable spaces. The parameter $β$ smoothly tunes the $(θ,α,β)$ scheme between this edge reinforced random walk and the classical exchangeable two-parameter Hoppe urn scheme, while the parameters $α$ and $θ$ modulate how many states are typically visited. Resorting to de Finetti's theorem for Markov chains, we use the $(θ,α,β)$ scheme to define a nonparametric prior for Bayesian analysis of reversible Markov chains. The prior is applied in Bayesian nonparametric inference for species sampling problems with data generated from a reversible Markov chain with an unknown transition kernel. As a real example, we analyze data from molecular dynamics simulations of protein folding.
Submitted 6 June, 2013;
originally announced June 2013.
-
Bayesian analysis of variable-order, reversible Markov chains
Authors:
Sergio Bacallado
Abstract:
We define a conjugate prior for the reversible Markov chain of order $r$. The prior arises from a partially exchangeable reinforced random walk, in the same way that the Beta distribution arises from the exchangeable Pólya urn. An extension to variable-order Markov chains is also derived. We show the utility of this prior in testing the order and estimating the parameters of a reversible Markov model.
Submitted 13 May, 2011;
originally announced May 2011.
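The analogy in this abstract, that the Beta distribution arises from the exchangeable Pólya urn, can be checked numerically. The sketch below simulates the classical two-colour urn only; it does not implement the reinforced random walk or the prior defined in the paper. The function name `polya_urn_fraction`, the starting counts, and the simulation sizes are assumptions chosen for illustration.

```python
import random


def polya_urn_fraction(red: int, blue: int, draws: int, rng: random.Random) -> float:
    """Run a Polya urn: draw a ball, return it plus one more of the same colour.

    Returns the fraction of red balls after `draws` draws.
    """
    for _ in range(draws):
        if rng.random() < red / (red + blue):
            red += 1
        else:
            blue += 1
    return red / (red + blue)


rng = random.Random(0)  # fixed seed so the run is reproducible
# Starting from 2 red and 3 blue balls, the limiting red fraction is Beta(2, 3).
fractions = [polya_urn_fraction(2, 3, 500, rng) for _ in range(2000)]
mean_fraction = sum(fractions) / len(fractions)
print(mean_fraction)  # should be close to the Beta(2, 3) mean, 2 / (2 + 3) = 0.4
```

Each run ends at a different fraction, and the spread of those end points across runs is what approximates the Beta(2, 3) distribution; a histogram of `fractions` makes this visible.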