-
Orthogonal calibration via posterior projections with applications to the Schwarzschild model
Authors:
Antik Chakraborty,
Jonelle B. Walsh,
Louis Strigari,
Bani K. Mallick,
Anirban Bhattacharya
Abstract:
The orbital superposition method originally developed by Schwarzschild (1979) is used to study the dynamics of growth of a black hole and its host galaxy, and has uncovered new relationships between the galaxy's global characteristics. Scientists are specifically interested in finding optimal parameter choices for this model that best match physical measurements along with quantifying the uncertai…
▽ More
The orbital superposition method originally developed by Schwarzschild (1979) is used to study the dynamics of growth of a black hole and its host galaxy, and has uncovered new relationships between the galaxy's global characteristics. Scientists are specifically interested in finding optimal parameter choices for this model that best match physical measurements along with quantifying the uncertainty of such procedures. This renders a statistical calibration problem with multivariate outcomes. In this article, we develop a Bayesian method for calibration with multivariate outcomes using orthogonal bias functions thus ensuring parameter identifiability. Our approach is based on projecting the posterior to an appropriate space which allows the user to choose any nonparametric prior on the bias function(s) instead of having to model it (them) with Gaussian processes. We develop a functional projection approach using the theory of Hilbert spaces. A finite-dimensional analogue of the projection problem is also considered. We illustrate the proposed approach using a BART prior and apply it to calibrate the Schwarzschild model illustrating how a multivariate approach may resolve discrepancies resulting from a univariate calibration.
△ Less
Submitted 11 April, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Generalized Regret Analysis of Thompson Sampling using Fractional Posteriors
Authors:
Prateek Jaiswal,
Debdeep Pati,
Anirban Bhattacharya,
Bani K. Mallick
Abstract:
Thompson sampling (TS) is one of the most popular and earliest algorithms to solve stochastic multi-armed bandit problems. We consider a variant of TS, named $α$-TS, where we use a fractional or $α$-posterior ($α\in(0,1)$) instead of the standard posterior distribution. To compute an $α$-posterior, the likelihood in the definition of the standard posterior is tempered with a factor $α$. For $α$-TS…
▽ More
Thompson sampling (TS) is one of the most popular and earliest algorithms to solve stochastic multi-armed bandit problems. We consider a variant of TS, named $α$-TS, where we use a fractional or $α$-posterior ($α\in(0,1)$) instead of the standard posterior distribution. To compute an $α$-posterior, the likelihood in the definition of the standard posterior is tempered with a factor $α$. For $α$-TS we obtain both instance-dependent $\mathcal{O}\left(\sum_{k \neq i^*} Δ_k\left(\frac{\log(T)}{C(α)Δ_k^2} + \frac{1}{2} \right)\right)$ and instance-independent $\mathcal{O}(\sqrt{KT\log K})$ frequentist regret bounds under very mild conditions on the prior and reward distributions, where $Δ_k$ is the gap between the true mean rewards of the $k^{th}$ and the best arms, and $C(α)$ is a known constant. Both the sub-Gaussian and exponential family models satisfy our general conditions on the reward distribution. Our conditions on the prior distribution just require its density to be positive, continuous, and bounded. We also establish another instance-dependent regret upper bound that matches (up to constants) to that of improved UCB [Auer and Ortner, 2010]. Our regret analysis carefully combines recent theoretical developments in the non-asymptotic concentration analysis and Bernstein-von Mises type results for the $α$-posterior distribution. Moreover, our analysis does not require additional structural properties such as closed-form posteriors or conjugate priors.
△ Less
Submitted 12 September, 2023;
originally announced September 2023.
-
Covariate-Assisted Bayesian Graph Learning for Heterogeneous Data
Authors:
Yabo Niu,
Yang Ni,
Debdeep Pati,
Bani K. Mallick
Abstract:
In a traditional Gaussian graphical model, data homogeneity is routinely assumed with no extra variables affecting the conditional independence. In modern genomic datasets, there is an abundance of auxiliary information, which often gets under-utilized in determining the joint dependency structure. In this article, we consider a Bayesian approach to model undirected graphs underlying heterogeneous…
▽ More
In a traditional Gaussian graphical model, data homogeneity is routinely assumed with no extra variables affecting the conditional independence. In modern genomic datasets, there is an abundance of auxiliary information, which often gets under-utilized in determining the joint dependency structure. In this article, we consider a Bayesian approach to model undirected graphs underlying heterogeneous multivariate observations with additional assistance from covariates. Building on product partition models, we propose a novel covariate-dependent Gaussian graphical model that allows graphs to vary with covariates so that observations whose covariates are similar share a similar undirected graph. To efficiently embed Gaussian graphical models into our proposed framework, we explore both Gaussian likelihood and pseudo-likelihood functions. For Gaussian likelihood, a G-Wishart distribution is used as a natural conjugate prior, and for the pseudo-likelihood, a product of Gaussian-conditionals is used. Moreover, the proposed model has large prior support and is flexible to approximate any $ν$-Hölder conditional variance-covariance matrices with $ν\in(0,1]$. We further show that based on the theory of fractional likelihood, the rate of posterior contraction is minimax optimal assuming the true density to be a Gaussian mixture with a known number of components. The efficacy of the approach is demonstrated via simulation studies and an analysis of a protein network for a breast cancer dataset assisted by mRNA gene expression as covariates.
△ Less
Submitted 15 August, 2023;
originally announced August 2023.
-
Bayesian Flexible Modelling of Spatially Resolved Transcriptomic Data
Authors:
Arhit Chakrabarti,
Yang Ni,
Bani K. Mallick
Abstract:
Single-cell RNA-sequencing technologies may provide valuable insights to the understanding of the composition of different cell types and their functions within a tissue. Recent technologies such as spatial transcriptomics, enable the measurement of gene expressions at the single cell level along with the spatial locations of these cells in the tissue. Dimension-reduction and spatial clustering ar…
▽ More
Single-cell RNA-sequencing technologies may provide valuable insights to the understanding of the composition of different cell types and their functions within a tissue. Recent technologies such as spatial transcriptomics, enable the measurement of gene expressions at the single cell level along with the spatial locations of these cells in the tissue. Dimension-reduction and spatial clustering are two of the most common exploratory analysis strategies for spatial transcriptomic data. However, existing dimension reduction methods may lead to a loss of inherent dependency structure among genes at any spatial location in the tissue and hence do not provide insights of gene co-expression pattern. In spatial transcriptomics, the matrix-variate gene expression data, along with spatial co-ordinates of the single cells, provides information on both gene expression dependencies and cell spatial dependencies through its row and column covariances. In this work, we propose a flexible Bayesian approach to simultaneously estimate the row and column covariances for the matrix-variate spatial transcriptomic data. The posterior estimates of the row and column covariances provide data summaries for downstream exploratory analysis. We illustrate our method with simulations and two analyses of real data generated from a recent spatial transcriptomic platform. Our work elucidates gene co-expression networks as well as clear spatial clustering patterns of the cells.
△ Less
Submitted 14 May, 2023;
originally announced May 2023.
-
Blocked Gibbs sampler for hierarchical Dirichlet processes
Authors:
Snigdha Das,
Yabo Niu,
Yang Ni,
Bani K. Mallick,
Debdeep Pati
Abstract:
Posterior computation in hierarchical Dirichlet process (HDP) mixture models is an active area of research in nonparametric Bayes inference of grouped data. Existing literature almost exclusively focuses on the Chinese restaurant franchise (CRF) analogy of the marginal distribution of the parameters, which can mix poorly and is known to have a linear complexity with the sample size. A recently dev…
▽ More
Posterior computation in hierarchical Dirichlet process (HDP) mixture models is an active area of research in nonparametric Bayes inference of grouped data. Existing literature almost exclusively focuses on the Chinese restaurant franchise (CRF) analogy of the marginal distribution of the parameters, which can mix poorly and is known to have a linear complexity with the sample size. A recently developed slice sampler allows for efficient blocked updates of the parameters, but is shown to be statistically unstable in our article. We develop a blocked Gibbs sampler to sample from the posterior distribution of HDP, which produces statistically stable results, is highly scalable with respect to sample size, and is shown to have good mixing. The heart of the construction is to endow the shared concentration parameter with an appropriately chosen gamma prior that allows us to break the dependence of the shared mixing proportions and permits independent updates of certain log-concave random variables in a block. En route, we develop an efficient rejection sampler for these random variables leveraging piece-wise tangent-line approximations.
△ Less
Submitted 19 April, 2023;
originally announced April 2023.
-
An Approximate Bayesian Approach to Covariate-dependent Graphical Modeling
Authors:
Sutanoy Dasgupta,
Peng Zhao,
Jacob Helwig,
Prasenjit Ghosh,
Debdeep Pati,
Bani K. Mallick
Abstract:
Gaussian graphical models typically assume a homogeneous structure across all subjects, which is often restrictive in applications. In this article, we propose a weighted pseudo-likelihood approach for graphical modeling which allows different subjects to have different graphical structures depending on extraneous covariates. The pseudo-likelihood approach replaces the joint distribution by a prod…
▽ More
Gaussian graphical models typically assume a homogeneous structure across all subjects, which is often restrictive in applications. In this article, we propose a weighted pseudo-likelihood approach for graphical modeling which allows different subjects to have different graphical structures depending on extraneous covariates. The pseudo-likelihood approach replaces the joint distribution by a product of the conditional distributions of each variable. We cast the conditional distribution as a heteroscedastic regression problem, with covariate-dependent variance terms, to enable information borrowing directly from the data instead of a hierarchical framework. This allows independent graphical modeling for each subject, while retaining the benefits of a hierarchical Bayes model and being computationally tractable. An efficient embarrassingly parallel variational algorithm is developed to approximate the posterior and obtain estimates of the graphs. Using a fractional variational framework, we derive asymptotic risk bounds for the estimate in terms of a novel variant of the $α$-Rényi divergence. We theoretically demonstrate the advantages of information borrowing across covariates over independent modeling. We show the practical advantages of the approach through simulation studies and illustrate the dependence structure in protein expression levels on breast cancer patients using CNV information as covariates.
△ Less
Submitted 15 March, 2023;
originally announced March 2023.
-
Graphical Dirichlet Process for Clustering Non-Exchangeable Grouped Data
Authors:
Arhit Chakrabarti,
Yang Ni,
Ellen Ruth A. Morris,
Michael L. Salinas,
Robert S. Chapkin,
Bani K. Mallick
Abstract:
We consider the problem of clustering grouped data with possibly non-exchangeable groups whose dependencies can be characterized by a known directed acyclic graph. To allow the sharing of clusters among the non-exchangeable groups, we propose a Bayesian nonparametric approach, termed graphical Dirichlet process, that jointly models the dependent group-specific random measures by assuming each rand…
▽ More
We consider the problem of clustering grouped data with possibly non-exchangeable groups whose dependencies can be characterized by a known directed acyclic graph. To allow the sharing of clusters among the non-exchangeable groups, we propose a Bayesian nonparametric approach, termed graphical Dirichlet process, that jointly models the dependent group-specific random measures by assuming each random measure to be distributed as a Dirichlet process whose concentration parameter and base probability measure depend on those of its parent groups. The resulting joint stochastic process respects the Markov property of the directed acyclic graph that links the groups. We characterize the graphical Dirichlet process using a novel hypergraph representation as well as the stick-breaking representation, the restaurant-type representation, and the representation as a limit of a finite mixture model. We develop an efficient posterior inference algorithm and illustrate our model with simulations and a real grouped single-cell dataset.
△ Less
Submitted 31 July, 2023; v1 submitted 17 February, 2023;
originally announced February 2023.
-
Factorized Fusion Shrinkage for Dynamic Relational Data
Authors:
Peng Zhao,
Anirban Bhattacharya,
Debdeep Pati,
Bani K. Mallick
Abstract:
Modern data science applications often involve complex relational data with dynamic structures. An abrupt change in such dynamic relational data is typically observed in systems that undergo regime changes due to interventions. In such a case, we consider a factorized fusion shrinkage model in which all decomposed factors are dynamically shrunk towards group-wise fusion structures, where the shrin…
▽ More
Modern data science applications often involve complex relational data with dynamic structures. An abrupt change in such dynamic relational data is typically observed in systems that undergo regime changes due to interventions. In such a case, we consider a factorized fusion shrinkage model in which all decomposed factors are dynamically shrunk towards group-wise fusion structures, where the shrinkage is obtained by applying global-local shrinkage priors to the successive differences of the row vectors of the factorized matrices. The proposed priors enjoy many favorable properties in comparison and clustering of the estimated dynamic latent factors. Comparing estimated latent factors involves both adjacent and long-term comparisons, with the time range of comparison considered as a variable. Under certain conditions, we demonstrate that the posterior distribution attains the minimax optimal rate up to logarithmic factors. In terms of computation, we present a structured mean-field variational inference framework that balances optimal posterior inference with computational scalability, exploiting both the dependence among components and across time. The framework can accommodate a wide variety of models, including dynamic matrix factorization, latent space models for networks and low-rank tensors. The effectiveness of our methodology is demonstrated through extensive simulations and real-world data analysis.
△ Less
Submitted 18 April, 2023; v1 submitted 30 September, 2022;
originally announced October 2022.
-
Structured Optimal Variational Inference for Dynamic Latent Space Models
Authors:
Peng Zhao,
Anirban Bhattacharya,
Debdeep Pati,
Bani K. Mallick
Abstract:
We consider a latent space model for dynamic networks, where our objective is to estimate the pairwise inner products of the latent positions. To balance posterior inference and computational scalability, we present a structured mean-field variational inference framework, where the time-dependent properties of the dynamic networks are exploited to facilitate computation and inference. Additionally…
▽ More
We consider a latent space model for dynamic networks, where our objective is to estimate the pairwise inner products of the latent positions. To balance posterior inference and computational scalability, we present a structured mean-field variational inference framework, where the time-dependent properties of the dynamic networks are exploited to facilitate computation and inference. Additionally, an easy-to-implement block coordinate ascent algorithm is developed with message-passing type updates in each block, whereas the complexity per iteration is linear with the number of nodes and time points. To facilitate learning of the pairwise latent distances, we adopt a Gamma prior for the transition variance different from the literature. To certify the optimality, we demonstrate that the variational risk of the proposed variational inference approach attains the minimax optimal rate under certain conditions. En route, we derive the minimax lower bound, which might be of independent interest. To best of our knowledge, this is the first such exercise for dynamic latent space models. Simulations and real data analysis demonstrate the efficacy of our methodology and the efficiency of our algorithm. Finally, our proposed methodology can be readily extended to the case where the scales of the latent nodes are learned in a nodewise manner.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
A Bayesian Survival Tree Partition Model Using Latent Gaussian Processes
Authors:
Richard D. Payne,
Nilabja Guha,
Bani K. Mallick
Abstract:
Survival models are used to analyze time-to-event data in a variety of disciplines. Proportional hazard models provide interpretable parameter estimates, but proportional hazards assumptions are not always appropriate. Non-parametric models are more flexible but often lack a clear inferential framework. We propose a Bayesian tree partition model which is both flexible and inferential. Inference is…
▽ More
Survival models are used to analyze time-to-event data in a variety of disciplines. Proportional hazard models provide interpretable parameter estimates, but proportional hazards assumptions are not always appropriate. Non-parametric models are more flexible but often lack a clear inferential framework. We propose a Bayesian tree partition model which is both flexible and inferential. Inference is obtained through the posterior tree structure and flexibility is preserved by modeling the the hazard function in each partition using a latent exponentiated Gaussian process. An efficient reversible jump Markov chain Monte Carlo algorithm is accomplished by marginalizing the parameters in each partition element via a Laplace approximation. Consistency properties for the estimator are established. The method can be used to help determine subgroups as well as prognostic and/or predictive biomarkers in time-to-event data. The method is applied to a liver survival dataset and is compared with some existing methods on simulated data.
△ Less
Submitted 7 July, 2022;
originally announced July 2022.
-
Adaptive Bayesian Variable Clustering via Structural Learning of Breast Cancer Data
Authors:
Riddhi Pratim Ghosh,
Arnab Kumar Maity,
Mohsen Pourahmadi,
Bani K. Mallick
Abstract:
Clustering of proteins is of interest in cancer cell biology. This article proposes a hierarchical Bayesian model for protein (variable) clustering hinging on correlation structure. Starting from a multivariate normal likelihood, we enforce the clustering through prior modeling using angle based unconstrained reparameterization of correlations and assume a truncated Poisson distribution (to penali…
▽ More
Clustering of proteins is of interest in cancer cell biology. This article proposes a hierarchical Bayesian model for protein (variable) clustering hinging on correlation structure. Starting from a multivariate normal likelihood, we enforce the clustering through prior modeling using angle based unconstrained reparameterization of correlations and assume a truncated Poisson distribution (to penalize the large number of clusters) as prior on the number of clusters. The posterior distributions of the parameters are not in explicit form and we use a reversible jump Markov chain Monte Carlo (RJMCMC) based technique is used to simulate the parameters from the posteriors. The end products of the proposed method are estimated cluster configuration of the proteins (variables) along with the number of clusters. The Bayesian method is flexible enough to cluster the proteins as well as the estimate the number of clusters. The performance of the proposed method has been substantiated with extensive simulation studies and one protein expression data with a hereditary disposition in breast cancer where the proteins are coming from different pathways.
△ Less
Submitted 8 February, 2022;
originally announced February 2022.
-
Bayesian Structural Equation Modeling in Multiple Omics Data Integration with Application to Circadian Genes
Authors:
Arnab Kumar Maity,
Sang Chan Lee,
Bani K. Mallick,
Tapasree Roy Sarkar
Abstract:
It is well known that the integration among different data-sources is reliable because of its potential of unveiling new functionalities of the genomic expressions which might be dormant in a single source analysis. Moreover, different studies have justified the more powerful analyses of multi-platform data. Toward this, in this study, we consider the circadian genes' omics profile such as copy nu…
▽ More
It is well known that the integration among different data-sources is reliable because of its potential of unveiling new functionalities of the genomic expressions which might be dormant in a single source analysis. Moreover, different studies have justified the more powerful analyses of multi-platform data. Toward this, in this study, we consider the circadian genes' omics profile such as copy number changes and RNA sequence data along with their survival response. We develop a Bayesian structural equation modeling coupled with linear regressions and log normal accelerated failure time regression to integrate the information between these two platforms to predict the survival of the subjects. We place conjugate priors on the regression parameters and derive the Gibbs sampler using the conditional distributions of them. Our extensive simulation study shows that the integrative model provides a better fit to the data than its closest competitor. The analyses of glioblastoma cancer data and the breast cancer data from TCGA, the largest genomics and transcriptomics database, support our findings. The developed method is wrapped in R package semmcmc available at R CRAN.
△ Less
Submitted 6 December, 2021;
originally announced December 2021.
-
Bayesian Variable Selection in Multivariate Nonlinear Regression with Graph Structures
Authors:
Yabo Niu,
Nilabja Guha,
Debkumar De,
Anindya Bhadra,
Veerabhadran Baladandayuthapani,
Bani K. Mallick
Abstract:
Gaussian graphical models (GGMs) are well-established tools for probabilistic exploration of dependence structures using precision matrices. We develop a Bayesian method to incorporate covariate information in this GGMs setup in a nonlinear seemingly unrelated regression framework. We propose a joint predictor and graph selection model and develop an efficient collapsed Gibbs sampler algorithm to…
▽ More
Gaussian graphical models (GGMs) are well-established tools for probabilistic exploration of dependence structures using precision matrices. We develop a Bayesian method to incorporate covariate information in this GGMs setup in a nonlinear seemingly unrelated regression framework. We propose a joint predictor and graph selection model and develop an efficient collapsed Gibbs sampler algorithm to search the joint model space. Furthermore, we investigate its theoretical variable selection properties. We demonstrate our method on a variety of simulated data, concluding with a real data set from the TCPA project.
△ Less
Submitted 30 July, 2021; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Tail-adaptive Bayesian shrinkage
Authors:
Se Yoon Lee,
Peng Zhao,
Debdeep Pati,
Bani K. Mallick
Abstract:
Robust Bayesian methods for high-dimensional regression problems under diverse sparse regimes are studied. Traditional shrinkage priors are primarily designed to detect a handful of signals from tens of thousands of predictors in the so-called ultra-sparsity domain. However, they may not perform desirably when the degree of sparsity is moderate. In this paper, we propose a robust sparse estimation…
▽ More
Robust Bayesian methods for high-dimensional regression problems under diverse sparse regimes are studied. Traditional shrinkage priors are primarily designed to detect a handful of signals from tens of thousands of predictors in the so-called ultra-sparsity domain. However, they may not perform desirably when the degree of sparsity is moderate. In this paper, we propose a robust sparse estimation method under diverse sparsity regimes, which has a tail-adaptive shrinkage property. In this property, the tail-heaviness of the prior adjusts adaptively, becoming larger or smaller as the sparsity level increases or decreases, respectively, to accommodate more or fewer signals, a posteriori. We propose a global-local-tail (GLT) Gaussian mixture distribution that ensures this property. We examine the role of the tail-index of the prior in relation to the underlying sparsity level and demonstrate that the GLT posterior contracts at the minimax optimal rate for sparse normal mean models. We apply both the GLT prior and the Horseshoe prior to a real data problem and simulation examples. Our findings indicate that the varying tail rule based on the GLT prior offers advantages over a fixed tail rule based on the Horseshoe prior in diverse sparsity regimes.
△ Less
Submitted 19 February, 2024; v1 submitted 4 July, 2020;
originally announced July 2020.
-
Estimation of COVID-19 spread curves integrating global data and borrowing information
Authors:
Se Yoon Lee,
Bowen Lei,
Bani K. Mallick
Abstract:
Currently, novel coronavirus disease 2019 (COVID-19) is a big threat to global health. The rapid spread of the virus has created pandemic, and countries all over the world are struggling with a surge in COVID-19 infected cases. There are no drugs or other therapeutics approved by the US Food and Drug Administration to prevent or treat COVID-19: information on the disease is very limited and scatte…
▽ More
Currently, novel coronavirus disease 2019 (COVID-19) is a big threat to global health. The rapid spread of the virus has created pandemic, and countries all over the world are struggling with a surge in COVID-19 infected cases. There are no drugs or other therapeutics approved by the US Food and Drug Administration to prevent or treat COVID-19: information on the disease is very limited and scattered even if it exists. This motivates the use of data integration, combining data from diverse sources and eliciting useful information with a unified view of them. In this paper, we propose a Bayesian hierarchical model that integrates global data for real-time prediction of infection trajectory for multiple countries. Because the proposed model takes advantage of borrowing information across multiple countries, it outperforms an existing individual country-based model. As fully Bayesian way has been adopted, the model provides a powerful predictive tool endowed with uncertainty quantification. Additionally, a joint variable selection technique has been integrated into the proposed modeling scheme, which aimed to identify possible country-level risk factors for severe disease due to COVID-19.
△ Less
Submitted 10 July, 2020; v1 submitted 1 May, 2020;
originally announced May 2020.
-
Directionally Dependent Multi-View Clustering Using Copula Model
Authors:
Kahkashan Afrin,
Ashif S. Iquebal,
Mostafa Karimi,
Allyson Souris,
Se Yoon Lee,
Bani K. Mallick
Abstract:
In recent biomedical scientific problems, it is a fundamental issue to integratively cluster a set of objects from multiple sources of datasets. Such problems are mostly encountered in genomics, where data is collected from various sources, and typically represent distinct yet complementary information. Integrating these data sources for multi-source clustering is challenging due to their complex…
▽ More
In recent biomedical scientific problems, it is a fundamental issue to integratively cluster a set of objects from multiple sources of datasets. Such problems are mostly encountered in genomics, where data is collected from various sources, and typically represent distinct yet complementary information. Integrating these data sources for multi-source clustering is challenging due to their complex dependence structure including directional dependency. Particularly in genomics studies, it is known that there is certain directional dependence between DNA expression, DNA methylation, and RNA expression, widely called The Central Dogma.
Most of the existing multi-view clustering methods either assume an independent structure or pair-wise (non-directional) dependency, thereby ignoring the directional relationship. Motivated by this, we propose a copula-based multi-view clustering model where a copula enables the model to accommodate the directional dependence existing in the datasets. We conduct a simulation experiment where the simulated datasets exhibiting inherent directional dependence: it turns out that ignoring the directional dependence negatively affects the clustering performance. As a real application, we applied our model to the breast cancer tumor samples collected from The Cancer Genome Altas (TCGA).
△ Less
Submitted 22 August, 2020; v1 submitted 16 March, 2020;
originally announced March 2020.
-
Bayesian Copula Density Deconvolution for Zero-Inflated Data in Nutritional Epidemiology
Authors:
Abhra Sarkar,
Debdeep Pati,
Bani K. Mallick,
Raymond J. Carroll
Abstract:
Estimating the marginal and joint densities of the long-term average intakes of different dietary components is an important problem in nutritional epidemiology. Since these variables cannot be directly measured, data are usually collected in the form of 24-hour recalls of the intakes, which show marked patterns of conditional heteroscedasticity. Significantly compounding the challenges, the recal…
▽ More
Estimating the marginal and joint densities of the long-term average intakes of different dietary components is an important problem in nutritional epidemiology. Since these variables cannot be directly measured, data are usually collected in the form of 24-hour recalls of the intakes, which show marked patterns of conditional heteroscedasticity. Significantly compounding the challenges, the recalls for episodically consumed dietary components also include exact zeros. The problem of estimating the density of the latent long-time intakes from their observed measurement error contaminated proxies is then a problem of deconvolution of densities with zero-inflated data. We propose a Bayesian semiparametric solution to the problem, building on a novel hierarchical latent variable framework that translates the problem to one involving continuous surrogates only. Crucial to accommodating important aspects of the problem, we then design a copula-based approach to model the involved joint distributions, adopting different modeling strategies for the marginals of the different dietary components. We design efficient Markov chain Monte Carlo algorithms for posterior inference and illustrate the efficacy of the proposed method through simulation experiments. Applied to our motivating nutritional epidemiology problems, compared to other approaches, our method provides more realistic estimates of the consumption patterns of episodically consumed dietary components.
△ Less
Submitted 10 December, 2019;
originally announced December 2019.
-
A Conditional Density Estimation Partition Model Using Logistic Gaussian Processes
Authors:
Richard D. Payne,
Nilabja Guha,
Yu Ding,
Bani K. Mallick
Abstract:
Conditional density estimation (density regression) estimates the distribution of a response variable y conditional on covariates x. Utilizing a partition model framework, a conditional density estimation method is proposed using logistic Gaussian processes. The partition is created using a Voronoi tessellation and is learned from the data using a reversible jump Markov chain Monte Carlo algorithm…
▽ More
Conditional density estimation (density regression) estimates the distribution of a response variable y conditional on covariates x. Utilizing a partition model framework, a conditional density estimation method is proposed using logistic Gaussian processes. The partition is created using a Voronoi tessellation and is learned from the data using a reversible jump Markov chain Monte Carlo algorithm. The Markov chain Monte Carlo algorithm is made possible through a Laplace approximation on the latent variables of the logistic Gaussian process model. This approximation marginalizes the parameters in each partition element, allowing an efficient search of the posterior distribution of the tessellation. The method has desirable consistency properties. In simulation and applications, the model successfully estimates the partition structure and conditional distribution of y.
△ Less
Submitted 20 March, 2017;
originally announced March 2017.
-
Bayesian sparse multiple regression for simultaneous rank reduction and variable selection
Authors:
Antik Chakraborty,
Anirban Bhattacharya,
Bani K. Mallick
Abstract:
We develop a Bayesian methodology aimed at simultaneously estimating low-rank and row-sparse matrices in a high-dimensional multiple-response linear regression model. We consider a carefully devised shrinkage prior on the matrix of regression coefficients which obviates the need to specify a prior on the rank, and shrinks the regression matrix towards low-rank and row-sparse structures. We provide…
▽ More
We develop a Bayesian methodology aimed at simultaneously estimating low-rank and row-sparse matrices in a high-dimensional multiple-response linear regression model. We consider a carefully devised shrinkage prior on the matrix of regression coefficients which obviates the need to specify a prior on the rank, and shrinks the regression matrix towards low-rank and row-sparse structures. We provide theoretical support to the proposed methodology by proving minimax optimality of the posterior mean under the prediction risk in ultra-high dimensional settings where the number of predictors can grow sub-exponentially relative to the sample size. A one-step post-processing scheme induced by group lasso penalties on the rows of the estimated coefficient matrix is proposed for variable selection, with default choices of tuning parameters. We additionally provide an estimate of the rank using a novel optimization function achieving dimension reduction in the covariate space. We exhibit the performance of the proposed methodology in an extensive simulation study and a real data example.
△ Less
Submitted 8 April, 2019; v1 submitted 2 December, 2016;
originally announced December 2016.
-
Quantile Graphical Models: Bayesian Approaches
Authors:
Nilabja Guha,
Veera Baladandayuthapani,
Bani K. Mallick
Abstract:
Graphical models are ubiquitous tools to describe the interdependence between variables measured simultaneously such as large-scale gene or protein expression data. Gaussian graphical models (GGMs) are well-established tools for probabilistic exploration of dependence structures using precision matrices and they are generated under a multivariate normal joint distribution. However, they suffer fro…
▽ More
Graphical models are ubiquitous tools to describe the interdependence between variables measured simultaneously such as large-scale gene or protein expression data. Gaussian graphical models (GGMs) are well-established tools for probabilistic exploration of dependence structures using precision matrices and they are generated under a multivariate normal joint distribution. However, they suffer from several shortcomings since they are based on Gaussian distribution assumptions. In this article, we propose a Bayesian quantile based approach for sparse estimation of graphs. We demonstrate that the resulting graph estimation is robust to outliers and applicable under general distributional assumptions. Furthermore, we develop efficient variational Bayes approximations to scale the methods for large data sets. Our methods are applied to a novel cancer proteomics data dataset wherein multiple proteomic antibodies are simultaneously assessed on tumor samples using reverse-phase protein arrays (RPPA) technology.
△ Less
Submitted 8 January, 2020; v1 submitted 8 November, 2016;
originally announced November 2016.
-
Bayesian and Variational Bayesian approaches for flows in heterogenous random media
Authors:
Keren Yang,
Nilabja Guha,
Yalchin Efendiev,
Bani K. Mallick
Abstract:
In this paper, we study porous media flows in heterogeneous stochastic media. We propose an efficient forward simulation technique that is tailored for variational Bayesian inversion. As a starting point, the proposed forward simulation technique decomposes the solution into the sum of separable functions (with respect to randomness and the space), where each term is calculated based on a variatio…
▽ More
In this paper, we study porous media flows in heterogeneous stochastic media. We propose an efficient forward simulation technique that is tailored for variational Bayesian inversion. As a starting point, the proposed forward simulation technique decomposes the solution into the sum of separable functions (with respect to randomness and the space), where each term is calculated based on a variational approach. This is similar to Proper Generalized Decomposition (PGD). Next, we apply a multiscale technique to solve for each term and, further, decompose the random function into 1D fields. As a result, our proposed method provides an approximation hierarchy for the solution as we increase the number of terms in the expansion and, also, increase the spatial resolution of each term. We use the hierarchical solution distributions in a variational Bayesian approximation to perform uncertainty quantification in the inverse problem. We conduct a detailed numerical study to explore the performance of the proposed uncertainty quantification technique and show the theoretical posterior concentration.
△ Less
Submitted 8 February, 2018; v1 submitted 3 November, 2016;
originally announced November 2016.
-
Bayesian Variable Selection with Structure Learning: Applications in Integrative Genomics
Authors:
Suprateek Kundu,
Minsuk Shin,
Yichen Cheng,
Ganiraju Manyam,
Bani K. Mallick,
Veera Baladandayuthapani
Abstract:
Significant advances in biotechnology have allowed for simultaneous measurement of molecular data points across multiple genomic and transcriptomic levels from a single tumor/cancer sample. This has motivated systematic approaches to integrate multi-dimensional structured datasets since cancer development and progression is driven by numerous co-ordinated molecular alterations and the interactions…
▽ More
Significant advances in biotechnology have allowed for simultaneous measurement of molecular data points across multiple genomic and transcriptomic levels from a single tumor/cancer sample. This has motivated systematic approaches to integrate multi-dimensional structured datasets since cancer development and progression is driven by numerous co-ordinated molecular alterations and the interactions between them. We propose a novel two-step Bayesian approach that combines a variable selection framework with integrative structure learning between multiple sources of data. The structure learning in the first step is accomplished through novel joint graphical models for heterogeneous (mixed scale) data allowing for flexible incorporation of prior knowledge. This structure learning subsequently informs the variable selection in the second step to identify groups of molecular features within and across platforms associated with outcomes of cancer progression. The variable selection strategy adjusts for collinearity and multiplicity, and also has theoretical justifications. We evaluate our methods through simulations and apply them to a motivating genomic (DNA copy number and methylation) and transcriptomic (mRNA expression) data for assessing important markers associated with Glioblastoma progression.
△ Less
Submitted 11 August, 2015;
originally announced August 2015.
-
Fast sampling with Gaussian scale-mixture priors in high-dimensional regression
Authors:
Anirban Bhattacharya,
Antik Chakraborty,
Bani K. Mallick
Abstract:
We propose an efficient way to sample from a class of structured multivariate Gaussian distributions which routinely arise as conditional posteriors of model parameters that are assigned a conditionally Gaussian prior. The proposed algorithm only requires matrix operations in the form of matrix multiplications and linear system solutions. We exhibit that the computational complexity of the propose…
▽ More
We propose an efficient way to sample from a class of structured multivariate Gaussian distributions which routinely arise as conditional posteriors of model parameters that are assigned a conditionally Gaussian prior. The proposed algorithm only requires matrix operations in the form of matrix multiplications and linear system solutions. We exhibit that the computational complexity of the proposed algorithm grows linearly with the dimension unlike existing algorithms relying on Cholesky factorizations with cubic orders of complexity. The algorithm should be broadly applicable in settings where Gaussian scale mixture priors are used on high dimensional model parameters. We provide an illustration through posterior sampling in a high dimensional regression setting with a horseshoe prior on the vector of regression coefficients.
△ Less
Submitted 27 June, 2016; v1 submitted 15 June, 2015;
originally announced June 2015.
-
Two-Stage Metropolis-Hastings for Tall Data
Authors:
Richard D. Payne,
Bani K. Mallick
Abstract:
This paper discusses the challenges presented by tall data problems associated with Bayesian classification (specifically binary classification) and the existing methods to handle them. Current methods include parallelizing the likelihood, subsampling, and consensus Monte Carlo. A new method based on the two-stage Metropolis-Hastings algorithm is also proposed. The purpose of this algorithm is to…
▽ More
This paper discusses the challenges presented by tall data problems associated with Bayesian classification (specifically binary classification) and the existing methods to handle them. Current methods include parallelizing the likelihood, subsampling, and consensus Monte Carlo. A new method based on the two-stage Metropolis-Hastings algorithm is also proposed. The purpose of this algorithm is to reduce the exact likelihood computational cost in the tall data situation. In the first stage, a new proposal is tested by the approximate likelihood based model. The full likelihood based posterior computation will be conducted only if the proposal passes the first stage screening. Furthermore, this method can be adopted into the consensus Monte Carlo framework. The two-stage method is applied to logistic regression, hierarchical logistic regression, and Bayesian multivariate adaptive regression splines.
△ Less
Submitted 20 March, 2017; v1 submitted 20 November, 2014;
originally announced November 2014.
-
Bayesian Semiparametric Multivariate Density Deconvolution
Authors:
Abhra Sarkar,
Debdeep Pati,
Bani K. Mallick,
Raymond J. Carroll
Abstract:
We consider the problem of multivariate density deconvolution when the interest lies in estimating the distribution of a vector-valued random variable but precise measurements of the variable of interest are not available, observations being contaminated with additive measurement errors. The existing sparse literature on the problem assumes the density of the measurement errors to be completely kn…
▽ More
We consider the problem of multivariate density deconvolution when the interest lies in estimating the distribution of a vector-valued random variable but precise measurements of the variable of interest are not available, observations being contaminated with additive measurement errors. The existing sparse literature on the problem assumes the density of the measurement errors to be completely known. We propose robust Bayesian semiparametric multivariate deconvolution approaches when the measurement error density is not known but replicated proxies are available for each unobserved value of the random vector. Additionally, we allow the variability of the measurement errors to depend on the associated unobserved value of the vector of interest through unknown relationships which also automatically includes the case of multivariate multiplicative measurement errors. Basic properties of finite mixture models, multivariate normal kernels and exchangeable priors are exploited in many novel ways to meet the modeling and computational challenges. Theoretical results that show the flexibility of the proposed methods are provided. We illustrate the efficiency of the proposed methods in recovering the true density of interest through simulation experiments. The methodology is applied to estimate the joint consumption pattern of different dietary components from contaminated 24 hour recalls.
△ Less
Submitted 5 December, 2016; v1 submitted 25 April, 2014;
originally announced April 2014.
-
Bayesian sparse graphical models for classification with application to protein expression data
Authors:
Veerabhadran Baladandayuthapani,
Rajesh Talluri,
Yuan Ji,
Kevin R. Coombes,
Yiling Lu,
Bryan T. Hennessy,
Michael A. Davies,
Bani K. Mallick
Abstract:
Reverse-phase protein array (RPPA) analysis is a powerful, relatively new platform that allows for high-throughput, quantitative analysis of protein networks. One of the challenges that currently limit the potential of this technology is the lack of methods that allow for accurate data modeling and identification of related networks and samples. Such models may improve the accuracy of biological s…
▽ More
Reverse-phase protein array (RPPA) analysis is a powerful, relatively new platform that allows for high-throughput, quantitative analysis of protein networks. One of the challenges that currently limit the potential of this technology is the lack of methods that allow for accurate data modeling and identification of related networks and samples. Such models may improve the accuracy of biological sample classification based on patterns of protein network activation and provide insight into the distinct biological relationships underlying different types of cancer. Motivated by RPPA data, we propose a Bayesian sparse graphical modeling approach that uses selection priors on the conditional relationships in the presence of class information. The novelty of our Bayesian model lies in the ability to draw information from the network data as well as from the associated categorical outcome in a unified hierarchical model for classification. In addition, our method allows for intuitive integration of a priori network information directly in the model and allows for posterior inference on the network topologies both within and between classes. Applying our methodology to an RPPA data set generated from panels of human breast cancer and ovarian cancer cell lines, we demonstrate that the model is able to distinguish the different cancer cell types more accurately than several existing models and to identify differential regulation of components of a critical signaling network (the PI3K-AKT pathway) between these two types of cancer. This approach represents a powerful new tool that can be used to improve our understanding of protein networks in cancer.
△ Less
Submitted 21 November, 2014; v1 submitted 29 March, 2014;
originally announced March 2014.
-
Bayesian object classification of gold nanoparticles
Authors:
Bledar A. Konomi,
Soma S. Dhavala,
Jianhua Z. Huang,
Subrata Kundu,
David Huitink,
Hong Liang,
Yu Ding,
Bani K. Mallick
Abstract:
The properties of materials synthesized with nanoparticles (NPs) are highly correlated to the sizes and shapes of the nanoparticles. The transmission electron microscopy (TEM) imaging technique can be used to measure the morphological characteristics of NPs, which can be simple circles or more complex irregular polygons with varying degrees of scales and sizes. A major difficulty in analyzing the…
▽ More
The properties of materials synthesized with nanoparticles (NPs) are highly correlated to the sizes and shapes of the nanoparticles. The transmission electron microscopy (TEM) imaging technique can be used to measure the morphological characteristics of NPs, which can be simple circles or more complex irregular polygons with varying degrees of scales and sizes. A major difficulty in analyzing the TEM images is the overlap** of objects, having different morphological properties with no specific information about the number of objects present. Furthermore, the objects lying along the boundary render automated image analysis much more difficult. To overcome these challenges, we propose a Bayesian method based on the marked-point process representation of the objects. We derive models, both for the marks which parameterize the morphological aspects and the points which determine the location of the objects. The proposed model is an automatic image segmentation and classification procedure, which simultaneously detects the boundaries and classifies the NPs into one of the predetermined shape families. We execute the inference by sampling the posterior distribution using Markov chain Monte Carlo (MCMC) since the posterior is doubly intractable. We apply our novel method to several TEM imaging samples of gold NPs, producing the needed statistical characterization of their morphology.
△ Less
Submitted 5 December, 2013;
originally announced December 2013.
-
Bayesian Low Rank and Sparse Covariance Matrix Decomposition
Authors:
Lin Zhang,
Abhra Sarkar,
Bani K. Mallick
Abstract:
We consider the problem of estimating high-dimensional covariance matrices of a particular structure, which is a summation of low rank and sparse matrices. This covariance structure has a wide range of applications including factor analysis and random effects models. We propose a Bayesian method of estimating the covariance matrices by representing the covariance model in the form of a factor mode…
▽ More
We consider the problem of estimating high-dimensional covariance matrices of a particular structure, which is a summation of low rank and sparse matrices. This covariance structure has a wide range of applications including factor analysis and random effects models. We propose a Bayesian method of estimating the covariance matrices by representing the covariance model in the form of a factor model with unknown number of latent factors. We introduce binary indicators for factor selection and rank estimation for the low rank component combined with a Bayesian lasso method for the sparse component estimation. Simulation studies show that our method can recover the rank as well as the sparsity of the two components respectively. We further extend our method to a graphical factor model where the graphical model of the residuals as well as selecting the number of factors is of interest. We employ a hyper-inverse Wishart prior for modeling decomposable graphs of the residuals, and a Bayesian graphical lasso selection method for unrestricted graphs. We show through simulations that the extended models can recover both the number of latent factors and the graphical model of the residuals successfully when the sample size is sufficient relative to the dimension.
△ Less
Submitted 15 October, 2013;
originally announced October 2013.
-
Bayesian sparse graphical models and their mixtures using lasso selection priors
Authors:
Rajesh Talluri,
Veerabhadran Baladandayuthapani,
Bani K. Mallick
Abstract:
We propose Bayesian methods for Gaussian graphical models that lead to sparse and adaptively shrunk estimators of the precision (inverse covariance) matrix. Our methods are based on lasso-type regularization priors leading to parsimonious parameterization of the precision matrix, which is essential in several applications involving learning relationships among the variables. In this context, we in…
▽ More
We propose Bayesian methods for Gaussian graphical models that lead to sparse and adaptively shrunk estimators of the precision (inverse covariance) matrix. Our methods are based on lasso-type regularization priors leading to parsimonious parameterization of the precision matrix, which is essential in several applications involving learning relationships among the variables. In this context, we introduce a novel type of selection prior that develops a sparse structure on the precision matrix by making most of the elements exactly zero, in addition to ensuring positive definiteness -- thus conducting model selection and estimation simultaneously. We extend these methods to finite and infinite mixtures of Gaussian graphical models for clustered data using Dirichlet process priors. We discuss appropriate posterior simulation schemes to implement posterior inference in the proposed models, including the evaluation of normalizing constants that are functions of parameters of interest which result from the restrictions on the correlation matrix. We evaluate the operating characteristics of our method via several simulations and in application to real data sets.
△ Less
Submitted 3 October, 2013;
originally announced October 2013.
-
Bayes Regularized Graphical Model Estimation in High Dimensions
Authors:
Suprateek Kundu,
Veera Baladandayuthapani,
Bani K. Mallick
Abstract:
There has been an intense development of Bayes graphical model estimation approaches over the past decade - however, most of the existing methods are restricted to moderate dimensions. We propose a novel approach suitable for high dimensional settings, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under novel class of continuous shrinkag…
▽ More
There has been an intense development of Bayes graphical model estimation approaches over the past decade - however, most of the existing methods are restricted to moderate dimensions. We propose a novel approach suitable for high dimensional settings, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under novel class of continuous shrinkage priors on the precision matrix elements, which induces shrinkage under an equivalence with Cholesky-based regularization while enabling conjugate updates of entire precision matrices. Subsequently, we propose a post-fitting graphical model estimation step which proceeds using penalized joint credible regions to perform neighborhood selection sequentially for each node. The posterior computation proceeds using straightforward fully Gibbs sampling, and the approach is scalable to high dimensions. The proposed approach is shown to be asymptotically consistent in estimating the graph structure for fixed $p$ when the truth is a Gaussian graphical model. Simulations show that our approach compares favorably with Bayesian competitors both in terms of graphical model estimation and computational efficiency. We apply our methods to high dimensional gene expression and microRNA datasets in cancer genomics.
△ Less
Submitted 18 August, 2013;
originally announced August 2013.
-
Investigating international new product diffusion speed: A semiparametric approach
Authors:
Brian M. Hartman,
Bani K. Mallick,
Debabrata Talukdar
Abstract:
Global marketing managers are interested in understanding the speed of the new product diffusion process and how the speed has changed in our ever more technologically advanced and global marketplace. Understanding the process allows firms to forecast the expected rate of return on their new products and develop effective marketing strategies. The most recent major study on this topic [Marketing S…
▽ More
Global marketing managers are interested in understanding the speed of the new product diffusion process and how the speed has changed in our ever more technologically advanced and global marketplace. Understanding the process allows firms to forecast the expected rate of return on their new products and develop effective marketing strategies. The most recent major study on this topic [Marketing Science 21 (2002) 97--114] investigated new product diffusions in the United States. We expand upon that study in three important ways. (1) Van den Bulte notes that a similar study is needed in the international context, especially in develo** countries. Our study covers four new product diffusions across 31 developed and develo** nations from 1980--2004. Our sample accounts for about 80% of the global economic output and 60% of the global population, allowing us to examine more general phenomena. (2) His model contains the implicit assumption that the diffusion speed parameter is constant throughout the diffusion life cycle of a product. Recognizing the likely effects on the speed parameter of recent changes in the marketplace, we model the parameter as a semiparametric function, allowing it the flexibility to change over time. (3) We perform a variable selection to determine that the number of internet users and the consumer price index are strongly associated with the speed of diffusion.
△ Less
Submitted 28 June, 2012;
originally announced June 2012.
-
Nonparametric Bayesian Approaches to Non-homogeneous Hidden Markov Models
Authors:
Abhra Sarkar,
Anindya Bhadra,
Bani K. Mallick
Abstract:
In this article a flexible Bayesian non-parametric model is proposed for non-homogeneous hidden Markov models. The model is developed through the amalgamation of the ideas of hidden Markov models and predictor dependent stick-breaking processes. Computation is carried out using auxiliary variable representation of the model which enable us to perform exact MCMC sampling from the posterior. Further…
▽ More
In this article a flexible Bayesian non-parametric model is proposed for non-homogeneous hidden Markov models. The model is developed through the amalgamation of the ideas of hidden Markov models and predictor dependent stick-breaking processes. Computation is carried out using auxiliary variable representation of the model which enable us to perform exact MCMC sampling from the posterior. Furthermore, the model is extended to the situation when the predictors can simultaneously in influence the transition dynamics of the hidden states as well as the emission distribution. Estimates of few steps ahead conditional predictive distributions of the response have been used as performance diagnostics for these models. The proposed methodology is illustrated through simulation experiments as well as analysis of a real data set concerned with the prediction of rainfall induced malaria epidemics.
△ Less
Submitted 8 May, 2012;
originally announced May 2012.
-
A generalized linear mixed model for longitudinal binary data with a marginal logit link function
Authors:
Michael Parzen,
Souparno Ghosh,
Stuart Lipsitz,
Debajyoti Sinha,
Garrett M. Fitzmaurice,
Bani K. Mallick,
Joseph G. Ibrahim
Abstract:
Longitudinal studies of a binary outcome are common in the health, social, and behavioral sciences. In general, a feature of random effects logistic regression models for longitudinal binary data is that the marginal functional form, when integrated over the distribution of the random effects, is no longer of logistic form. Recently, Wang and Louis [Biometrika 90 (2003) 765--775] proposed a random…
▽ More
Longitudinal studies of a binary outcome are common in the health, social, and behavioral sciences. In general, a feature of random effects logistic regression models for longitudinal binary data is that the marginal functional form, when integrated over the distribution of the random effects, is no longer of logistic form. Recently, Wang and Louis [Biometrika 90 (2003) 765--775] proposed a random intercept model in the clustered binary data setting where the marginal model has a logistic form. An acknowledged limitation of their model is that it allows only a single random effect that varies from cluster to cluster. In this paper we propose a modification of their model to handle longitudinal data, allowing separate, but correlated, random intercepts at each measurement occasion. The proposed model allows for a flexible correlation structure among the random intercepts, where the correlations can be interpreted in terms of Kendall's $τ$. For example, the marginal correlations among the repeated binary outcomes can decline with increasing time separation, while the model retains the property of having matching conditional and marginal logit link functions. Finally, the proposed method is used to analyze data from a longitudinal study designed to monitor cardiac abnormalities in children born to HIV-infected women.
△ Less
Submitted 18 April, 2011;
originally announced April 2011.