
Regularised Canonical Correlation Analysis: graphical lasso, biplots and beyond

Authors: Lennie Wells, Kumar Thurimella, Sergio Bacallado

Abstract: Recent developments in regularized Canonical Correlation Analysis (CCA) promise powerful methods for high-dimensional, multiview data analysis. However, justifying the structural assumptions behind many popular approaches remains a challenge, and features of realistic biological datasets pose practical difficulties that are seldom discussed. We propose a novel CCA estimator rooted in an assumption of conditional independencies and based on the Graphical Lasso. Our method has desirable theoretical guarantees and good empirical performance, demonstrated through extensive simulations and real-world biological datasets. Recognizing the difficulties of model selection in high dimensions and other practical challenges of applying CCA in real-world settings, we introduce a novel framework for evaluating and interpreting regularized CCA models in the context of Exploratory Data Analysis (EDA), which we hope will empower researchers and pave the way for wider adoption.

Submitted 5 March, 2024; originally announced March 2024.

Comments: 83 pages, 27 figures

MSC Class: 62H20 (Primary) 62H12; 62P10 (Secondary) ACM Class: G.3

arXiv:2306.14809

Tanimoto Random Features for Scalable Molecular Machine Learning

Authors: Austin Tripp, Sergio Bacallado, Sukriti Singh, José Miguel Hernández-Lobato

Abstract: The Tanimoto coefficient is commonly used to measure the similarity between molecules represented as discrete fingerprints, either as a distance metric or a positive definite kernel. While many kernel methods can be accelerated using random feature approximations, at present there is a lack of such approximations for the Tanimoto kernel. In this paper we propose two kinds of novel random features to allow this kernel to scale to large datasets, and in the process discover a novel extension of the kernel to real-valued vectors. We theoretically characterize these random features, and provide error bounds on the spectral norm of the Gram matrix. Experimentally, we show that these random features are effective at approximating the Tanimoto coefficient of real-world datasets and are useful for molecular property prediction and optimization tasks.

Submitted 13 November, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: Camera-ready version presented at NeurIPS 2023. Updates include: notation changes, better description of features in section 4, updated experiments, link to code
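
For readers unfamiliar with the Tanimoto coefficient named in the abstract above, the following is a minimal sketch of the exact coefficient on binary fingerprints, represented here as sets of on-bit indices. The function name `tanimoto`, the example fingerprints, and the convention for two empty fingerprints are assumptions made for illustration; this is not the paper's random-feature approximation.

```python
def tanimoto(a, b):
    """Exact Tanimoto (Jaccard) similarity between two binary fingerprints.

    Each fingerprint is given as an iterable of on-bit indices.
    """
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / union


# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits overall.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
sim = tanimoto(fp1, fp2)
print(sim)
```

Computing this exactly for every pair of molecules gives a Gram matrix whose size grows quadratically with the dataset, which is the scaling problem that random-feature approximations are meant to address.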

arXiv:2210.09211

Conditional Neural Processes for Molecules

Authors: Miguel Garcia-Ortegon, Andreas Bender, Sergio Bacallado

Abstract: Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.

Submitted 23 February, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

arXiv:2110.15486

DOCKSTRING: easy molecular docking yields better benchmarks for ligand design

Authors: Miguel García-Ortegón, Gregor N. C. Simm, Austin J. Tripp, José Miguel Hernández-Lobato, Andreas Bender, Sergio Bacallado

Abstract: The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate's interaction with the target. By contrast, molecular docking is a widely successful method in drug discovery to estimate binding affinities. However, docking simulations require a significant amount of domain knowledge to set up correctly which hampers adoption. To this end, we present DOCKSTRING, a bundle for meaningful and robust comparison of ML models consisting of three components: (1) an open-source Python package for straightforward computation of docking scores; (2) an extensive dataset of docking scores and poses of more than 260K ligands for 58 medically-relevant targets; and (3) a set of pharmaceutically-relevant benchmark tasks including regression, virtual screening, and de novo design. The Python package implements a robust ligand and target preparation protocol that allows non-experts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix, thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more appropriate evaluation objective than simple physicochemical properties, yielding more realistic benchmark tasks and molecular candidates.

Submitted 28 October, 2021; originally announced October 2021.

arXiv:2102.08984

The *-Edge-Reinforced Random Walk

Authors: Sergio Bacallado, Christophe Sabot, Pierre Tarrès

Abstract: We define a linearly reinforced process called the *-Edge-Reinforced Random Walk (*-ERRW) which can be seen as a Yaglom reversible, hence non-reversible, extension of the Edge-Reinforced Random Walk (ERRW) introduced by Coppersmith and Diaconis in 1986. This family of processes also generalizes the r-dependent ERRW introduced by Bacallado (2009). Under some assumptions on the initial weights, the *-ERRW is partially exchangeable in the sense of Diaconis and Freedman (1980), and thus it is a random walk in a random environment. The main result of the paper gives the explicit expression of the mixing law, hence extending the "magic formula" of Coppersmith and Diaconis from the case of mixtures of reversible Markov chains to the case of mixtures of Yaglom reversible Markov chains.

Submitted 29 November, 2023; v1 submitted 17 February, 2021; originally announced February 2021.

Comments: 20 pages

arXiv:2004.07743

BETS: The dangers of selection bias in early analyses of the coronavirus disease (COVID-19) pandemic

Authors: Qingyuan Zhao, Nianqiao Ju, Sergio Bacallado, Rajen D. Shah

Abstract: The coronavirus disease 2019 (COVID-19) has quickly grown from a regional outbreak in Wuhan, China to a global pandemic. Early estimates of the epidemic growth and incubation period of COVID-19 may have been biased due to sample selection. Using detailed case reports from 14 locations in and outside mainland China, we obtained 378 Wuhan-exported cases who left Wuhan before an abrupt travel quarantine. We developed a generative model we call BETS for four key epidemiological events---Beginning of exposure, End of exposure, time of Transmission, and time of Symptom onset (BETS)---and derived explicit formulas to correct for the sample selection. We gave a detailed illustration of why some early and highly influential analyses of the COVID-19 pandemic were severely biased. All our analyses, regardless of which subsample and model were being used, point to an epidemic doubling time of 2 to 2.5 days during the early outbreak in Wuhan. A Bayesian nonparametric analysis further suggests that about 5% of the symptomatic cases may not develop symptoms within 14 days of infection and that men may be much more likely than women to develop symptoms within 2 days of infection.

Submitted 24 September, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: 33 pages, 8 figures, 5 tables; Accepted for publication in The Annals of Applied Statistics on 24th September, 2020

MSC Class: 62P10; 62F15
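
As a rough illustration of what the doubling time reported in the BETS abstract implies, and assuming simple exponential growth (an assumption made here for illustration, not the paper's full generative model), the daily growth rate $r$ follows from the doubling time $T_d$:

```latex
\[
  r = \frac{\ln 2}{T_d}, \qquad
  T_d = 2 \text{ days} \;\Rightarrow\; r \approx 0.347 \text{ per day}, \qquad
  T_d = 2.5 \text{ days} \;\Rightarrow\; r \approx 0.277 \text{ per day}.
\]
```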

arXiv:1806.11370

Bayesian Uncertainty Directed Trial Designs

Authors: Steffen Ventz, Matteo Cellamare, Sergio Bacallado, Lorenzo Trippa

Abstract: Most Bayesian response-adaptive designs unbalance randomization rates towards the most promising arms with the goal of increasing the number of positive treatment outcomes during the study, even though the primary aim of the trial is different. We discuss Bayesian uncertainty directed designs (BUD), a class of Bayesian designs in which the investigator specifies an information measure tailored to the experiment. All decisions during the trial are selected to optimize the available information at the end of the study. The approach can be applied to several designs, ranging from early stage multi-arm trials to biomarker-driven and multi-endpoint studies. We discuss the asymptotic limit of the patient allocation proportion to treatments, and illustrate the finite-sample operating characteristics of BUD designs through examples, including multi-arm trials, biomarker-stratified trials, and trials with multiple co-primary endpoints.

Submitted 29 June, 2018; originally announced June 2018.

arXiv:1711.01241

doi 10.1214/19-AOAS1295

Bayesian Mixed Effects Models for Zero-inflated Compositions in Microbiome Data Analysis

Authors: Boyu Ren, Sergio Bacallado, Stefano Favaro, Tommi Vatanen, Curtis Huttenhower, Lorenzo Trippa

Abstract: Detecting associations between microbial compositions and sample characteristics is one of the most important tasks in microbiome studies. Most of the existing methods apply univariate models to single microbial species separately, with adjustments for multiple hypothesis testing. We propose a Bayesian analysis for a generalized mixed effects linear model tailored to this application. The marginal prior on each microbial composition is a Dirichlet Process, and dependence across compositions is induced through a linear combination of individual covariates, such as disease biomarkers or the subject's age, and latent factors. The latent factors capture residual variability and their dimensionality is learned from the data in a fully Bayesian procedure. The proposed model is tested in data analyses and simulation studies with zero-inflated compositions. In these settings, within each sample, a large proportion of counts per microbial species are equal to zero. In our Bayesian model a priori the probability of compositions with absent microbial species is strictly positive. We propose an efficient algorithm to sample from the posterior and visualizations of model parameters which reveal associations between covariates and microbial compositions. We evaluate the proposed method in simulation studies, and then analyze a microbiome dataset for infants with type 1 diabetes which contains a large proportion of zeros in the sample-specific microbial compositions.

Submitted 24 August, 2019; v1 submitted 3 November, 2017; originally announced November 2017.

arXiv:1710.08045

Sequential Matrix Completion

Authors: Annie Marsden, Sergio Bacallado

Abstract: We propose a novel algorithm for sequential matrix completion in a recommender system setting, where the $(i,j)$th entry of the matrix corresponds to a user $i$'s rating of product $j$. The objective of the algorithm is to provide a sequential policy for user-product pair recommendation which will yield the highest possible ratings after a finite time horizon. The algorithm uses a Gamma process factor model with two posterior-focused bandit policies, Thompson Sampling and Information-Directed Sampling. While Thompson Sampling shows competitive performance in simulations, state-of-the-art performance is obtained from Information-Directed Sampling, which makes its recommendations based on a ratio between the expected reward and a measure of information gain. To our knowledge, this is the first implementation of Information-Directed Sampling on large real datasets. This approach contributes to a recent line of research on bandit approaches to collaborative filtering including Kawale et al. (2015), Li et al. (2010), Bresler et al. (2014), Li et al. (2016), Deshpande & Montanari (2012), and Zhao et al. (2013). The setting of this paper, as has been noted in Kawale et al. (2015) and Zhao et al. (2013), presents significant challenges to bounding regret after finite horizons. We discuss these challenges in relation to simpler models for bandits with side information, such as linear or Gaussian process bandits, and hope the experiments presented here motivate further research toward theoretical guarantees.

Submitted 22 October, 2017; originally announced October 2017.

Comments: 10 pages, 6 figures

arXiv:1601.05156

doi 10.1080/01621459.2017.1288631

Bayesian Nonparametric Ordination for the Analysis of Microbial Communities

Authors: Boyu Ren, Sergio Bacallado, Stefano Favaro, Susan Holmes, Lorenzo Trippa

Abstract: Human microbiome studies use sequencing technologies to measure the abundance of bacterial species or Operational Taxonomic Units (OTUs) in samples of biological material. Typically the data are organized in contingency tables with OTU counts across heterogeneous biological samples. In the microbial ecology community, ordination methods are frequently used to investigate latent factors or clusters that capture and describe variations of OTU counts across biological samples. It remains important to evaluate how uncertainty in estimates of each biological sample's microbial distribution propagates to ordination analyses, including visualization of clusters and projections of biological samples on low dimensional spaces. We propose a Bayesian analysis for dependent distributions to endow frequently used ordinations with estimates of uncertainty. A Bayesian nonparametric prior for dependent normalized random measures is constructed, which is marginally equivalent to the normalized generalized Gamma process, a well-known prior for nonparametric analyses. In our prior the dependence and similarity between microbial distributions is represented by latent factors that concentrate in a low dimensional space. We use a shrinkage prior to tune the dimensionality of the latent factors. The resulting posterior samples of model parameters can be used to evaluate uncertainty in analyses routinely applied in microbiome studies. Specifically, by combining them with multivariate data analysis techniques we can visualize credible regions in ecological ordination plots. The characteristics of the proposed model are illustrated through a simulation study and applications in two microbiome datasets.

Submitted 20 January, 2017; v1 submitted 19 January, 2016; originally announced January 2016.

arXiv:1504.00828

doi 10.3150/13-BEJ559

Looking-backward probabilities for Gibbs-type exchangeable random partitions

Authors: Sergio Bacallado, Stefano Favaro, Lorenzo Trippa

Abstract: Gibbs-type random probability measures and the exchangeable random partitions they induce represent the subject of a rich and active literature. They provide a probabilistic framework for a wide range of theoretical and applied problems that are typically referred to as species sampling problems. In this paper, we consider the class of looking-backward species sampling problems introduced in Lijoi et al. (Ann. Appl. Probab. 18 (2008) 1519-1547) in Bayesian nonparametrics. Specifically, given some information on the random partition induced by an initial sample from a Gibbs-type random probability measure, we study the conditional distributions of statistics related to the old species, namely those species detected in the initial sample and possibly re-observed in an additional sample. The proposed results contribute to the analysis of conditional properties of Gibbs-type exchangeable random partitions, so far focused mainly on statistics related to those species generated by the additional sample and not already detected in the initial sample.

Submitted 3 April, 2015; originally announced April 2015.

Comments: Published at http://dx.doi.org/10.3150/13-BEJ559 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Report number: IMS-BEJ-BEJ559

Journal ref: Bernoulli 2015, Vol. 21, No. 1, 1-37

arXiv:1306.1318

doi 10.1214/13-AOS1102

Bayesian nonparametric analysis of reversible Markov chains

Authors: Sergio Bacallado, Stefano Favaro, Lorenzo Trippa

Abstract: We introduce a three-parameter random walk with reinforcement, called the $(θ,α,β)$ scheme, which generalizes the linearly edge reinforced random walk to uncountable spaces. The parameter $β$ smoothly tunes the $(θ,α,β)$ scheme between this edge reinforced random walk and the classical exchangeable two-parameter Hoppe urn scheme, while the parameters $α$ and $θ$ modulate how many states are typically visited. Resorting to de Finetti's theorem for Markov chains, we use the $(θ,α,β)$ scheme to define a nonparametric prior for Bayesian analysis of reversible Markov chains. The prior is applied in Bayesian nonparametric inference for species sampling problems with data generated from a reversible Markov chain with an unknown transition kernel. As a real example, we analyze data from molecular dynamics simulations of protein folding.

Submitted 6 June, 2013; originally announced June 2013.

Comments: Published at http://dx.doi.org/10.1214/13-AOS1102 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1102

Journal ref: Annals of Statistics 2013, Vol. 41, No. 2, 870-896

arXiv:1105.2640

doi 10.1214/10-AOS857

Bayesian analysis of variable-order, reversible Markov chains

Authors: Sergio Bacallado

Abstract: We define a conjugate prior for the reversible Markov chain of order $r$. The prior arises from a partially exchangeable reinforced random walk, in the same way that the Beta distribution arises from the exchangeable Pólya urn. An extension to variable-order Markov chains is also derived. We show the utility of this prior in testing the order and estimating the parameters of a reversible Markov model.

Submitted 13 May, 2011; originally announced May 2011.

Comments: Published at http://dx.doi.org/10.1214/10-AOS857 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS857

Journal ref: Annals of Statistics 2011, Vol. 39, No. 2, 838-864

Showing 1–13 of 13 results for author: Bacallado, S