Search | arXiv e-print repository

doi 10.51387/23-NEJSDS37

Bayesian Variable Selection in Double Generalized Linear Tweedie Spatial Process Models

Authors: Aritra Halder, Shariq Mohammed, Dipak K. Dey

Abstract: Double generalized linear models provide a flexible framework for modeling data by allowing the mean and the dispersion to vary across observations. Common members of the exponential dispersion family including the Gaussian, Poisson, compound Poisson-gamma (CP-g), Gamma and inverse-Gaussian are known to admit such models. The lack of their use can be attributed to ambiguities that exist in model specification under a large number of covariates and complications that arise when data display complex spatial dependence. In this work we consider a hierarchical specification for the CP-g model with a spatial random effect. The spatial effect is targeted at performing uncertainty quantification by modeling dependence within the data arising from location based indexing of the response. We focus on a Gaussian process specification for the spatial effect. Simultaneously, we tackle the problem of model specification for such models using Bayesian variable selection. It is effected through a continuous spike and slab prior on the model parameters, specifically the fixed effects. The novelty of our contribution lies in the Bayesian frameworks developed for such models. We perform various synthetic experiments to showcase the accuracy of our frameworks. They are then applied to analyze automobile insurance premiums in Connecticut, for the year of 2008.

Submitted 19 June, 2023; originally announced June 2023.

Comments: Appeared online in New England Journal of Statistics in Data Science

arXiv:2106.10941 [pdf, other]

Tumor Radiogenomics with Bayesian Layered Variable Selection

Authors: Shariq Mohammed, Sebastian Kurtek, Karthik Bharath, Arvind Rao, Veerabhadran Baladandayuthapani

Abstract: We propose a statistical framework to integrate radiological magnetic resonance imaging (MRI) and genomic data to identify the underlying radiogenomic associations in lower grade gliomas (LGG). We devise a novel imaging phenotype by dividing the tumor region into concentric spherical layers that mimics the tumor evolution process. MRI data within each layer is represented by voxel-intensity-based probability density functions which capture the complete information about tumor heterogeneity. Under a Riemannian-geometric framework these densities are mapped to a vector of principal component scores which act as imaging phenotypes. Subsequently, we build Bayesian variable selection models for each layer with the imaging phenotypes as the response and the genomic markers as predictors. Our novel hierarchical prior formulation incorporates the interior-to-exterior structure of the layers, and the correlation between the genomic markers. We employ a computationally-efficient Expectation-Maximization-based strategy for estimation. Simulation studies demonstrate the superior performance of our approach compared to other approaches. With a focus on the cancer driver genes in LGG, we discuss some biologically relevant findings. Genes implicated with survival and oncogenesis are identified as being associated with the spherical layers, which could potentially serve as early-stage diagnostic markers for disease monitoring, prior to routine invasive approaches.

Submitted 21 June, 2021; originally announced June 2021.

arXiv:2104.00510 [pdf, other]

RADIOHEAD: Radiogenomic Analysis Incorporating Tumor Heterogeneity in Imaging Through Densities

Authors: Shariq Mohammed, Karthik Bharath, Sebastian Kurtek, Arvind Rao, Veerabhadran Baladandayuthapani

Abstract: Recent technological advancements have enabled detailed investigation of associations between the molecular architecture and tumor heterogeneity, through multi-source integration of radiological imaging and genomic (radiogenomic) data. In this paper, we integrate and harness radiogenomic data in patients with lower grade gliomas (LGG), a type of brain cancer, in order to develop a regression framework called RADIOHEAD (RADIOgenomic analysis incorporating tumor HEterogeneity in imAging through Densities) to identify radiogenomic associations. Imaging data is represented through voxel intensity probability density functions of tumor sub-regions obtained from multimodal magnetic resonance imaging, and genomic data through molecular signatures in the form of pathway enrichment scores corresponding to their gene expression profiles. Employing a Riemannian-geometric framework for principal component analysis on the set of probability density functions, we map each probability density to a vector of principal component scores, which are then included as predictors in a Bayesian regression model with the pathway enrichment scores as the response. Variable selection compatible with the grouping structure amongst the predictors induced through the tumor sub-regions is carried out under a group spike-and-slab prior. A Bayesian false discovery rate mechanism is then used to infer significant associations based on the posterior distribution of the regression coefficients. Our analyses reveal several pathways relevant to LGG etiology (such as synaptic transmission, nerve impulse and neurotransmitter pathways), to have significant associations with the corresponding imaging-based predictors.

Submitted 7 April, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

arXiv:2010.02742 [pdf, other]

Wound and episode level readmission risk or weeks to readmit: Why do patients get readmitted? How long does it take for a patient to get readmitted?

Authors: Subba Reddy Oota, Nafisur Rahman, Shahid Saleem Mohammed, Jeffrey Galitz, Ming Liu

Abstract: The Affordable Care Act of 2010 introduced the Readmission Reduction Program in 2012 to reduce avoidable re-admissions and control rising healthcare costs. Wound care impacts 15% of Medicare beneficiaries, making it one of the major contributors to Medicare health care cost. Health plans have been exploring proactive health care services that can focus on preventing wound recurrences and re-admissions to control wound care costs. With the rising costs of the wound care industry, it has become of paramount importance to reduce wound recurrences and patient re-admissions. What factors are responsible for a wound to recur, ultimately leading to hospitalization or re-admission? Is there a way to identify the patients at risk of re-admission before the occurrence using data driven analysis? Patient re-admission risk management has become critical for patients suffering from chronic wounds such as diabetic ulcers, pressure ulcers, and vascular ulcers. Understanding the risk and the factors that cause patient readmission can help care providers and patients avoid wound recurrences. Our work focuses on identifying patients who are at high risk of re-admission and determining the time period within which a patient might get re-admitted. Frequent re-admissions add financial stress to the patient and the health plan and deteriorate the quality of life of the patient. Having this information can allow a provider to set up preventive measures that can delay, if not prevent, patients' re-admission. On a combined wound and episode-level data set of patients' wound care information, our extended autoprognosis achieves a recall of 92% and a precision of 92% for predicting a patient's re-admission risk. For the new patient class, precision and recall are as high as 91% and 98%, respectively. We are also able to predict the patient's discharge event for a re-admission event to occur through our model with a MAE of 2.3 weeks.

Submitted 5 October, 2020; originally announced October 2020.

Comments: 7 pages, 7 figures

arXiv:2004.12012 [pdf, other]

Integrative Bayesian models using Post-selective Inference: a case study in Radiogenomics

Authors: Snigdha Panigrahi, Shariq Mohammed, Arvind Rao, Veerabhadran Baladandayuthapani

Abstract: Integrative analyses based on statistically relevant associations between genomics and a wealth of intermediary phenotypes (such as imaging) provide vital insights into their clinical relevance in terms of the disease mechanisms. Estimates for uncertainty in the resulting integrative models are however unreliable unless inference accounts for the selection of these associations with accuracy. In this article, we develop selection-aware Bayesian methods which: (i) counteract the impact of model selection bias through a "selection-aware posterior" in a flexible class of integrative Bayesian models post a selection of promising variables via $\ell_1$-regularized algorithms; (ii) strike an inevitable tradeoff between the quality of model selection and inferential power when the same dataset is used for both selection and uncertainty estimation. Central to our methodological development, a carefully constructed conditional likelihood function deployed with a reparameterization mapping provides notably tractable updates when gradient-based MCMC sampling is used for estimating uncertainties from the selection-aware posterior. Applying our methods to a radiogenomic analysis, we successfully recover several important gene pathways and estimate uncertainties for their associations with patient survival times.

Submitted 12 August, 2022; v1 submitted 24 April, 2020; originally announced April 2020.

Comments: 45 pages, 7 Figures

arXiv:2003.06299 [pdf, other]

doi 10.1080/03461238.2021.1921017

Spatial Tweedie exponential dispersion models

Authors: Aritra Halder, Shariq Mohammed, Kun Chen, Dipak K. Dey

Abstract: This paper proposes a general modeling framework that allows for uncertainty quantification at the individual covariate level and spatial referencing, operating within a double generalized linear model (DGLM). DGLMs provide a general modeling framework allowing dispersion to depend in a link-linear fashion on chosen covariates. We focus on working with Tweedie exponential dispersion models while considering DGLMs, the reason being their recent wide-spread use for modeling mixed response types. Adopting a regularization based approach, we suggest a class of flexible convex penalties derived from an un-directed graph that facilitates estimation of the unobserved spatial effect. Developments are concisely showcased by proposing a co-ordinate descent algorithm that jointly explains variation from covariates in mean and dispersion through estimation of respective model coefficients while estimating the unobserved spatial effect. Simulations performed show that the proposed approach is superior to competitors like the ridge and un-penalized versions. Finally, a real data application is considered while modeling insurance losses arising from automobile collisions in the state of Connecticut, USA for the year 2008.

Submitted 12 March, 2020; originally announced March 2020.

Comments: 26 pages, 3 figures and 7 tables

Journal ref: Scand. Actuar. J. 10 (2021) 1017-1036

arXiv:1912.12356 [pdf, other]

Spatial risk estimation in Tweedie compound Poisson double generalized linear models

Authors: Aritra Halder, Shariq Mohammed, Kun Chen, Dipak Dey

Abstract: Tweedie exponential dispersion family constitutes a fairly rich sub-class of the celebrated exponential family. In particular, a member, compound Poisson gamma (CP-g) model has seen extensive use over the past decade for modeling mixed response featuring exact zeros with a continuous response from a gamma distribution. This paper proposes a framework to perform residual analysis on CP-g double generalized linear models for spatial uncertainty quantification. Approximations are introduced to proposed framework making the procedure scalable, without compromise in accuracy of estimation and model complexity; accompanied by sensitivity analysis to model mis-specification. Proposed framework is applied to modeling spatial uncertainty in insurance loss costs arising from automobile collision coverage. Scalability is demonstrated by choosing sizable spatial reference domains comprised of groups of states within the United States of America.

Submitted 15 January, 2020; v1 submitted 27 December, 2019; originally announced December 2019.

Comments: 34 pages, 10 figures and 12 tables

arXiv:1501.06314 [pdf, ps, other]

doi 10.1007/s11222-016-9670-1

Variable selection for model-based clustering using the integrated complete-data likelihood

Authors: Marbac Matthieu, Sedki Mohammed

Abstract: Variable selection in cluster analysis is important yet challenging. It can be achieved by regularization methods, which realize a trade-off between the clustering accuracy and the number of selected variables by using a lasso-type penalty. However, the calibration of the penalty term can suffer from criticisms. Model selection methods are an efficient alternative, yet they require a difficult optimization of an information criterion which involves combinatorial problems. First, most of these optimization algorithms are based on a suboptimal procedure (e.g. stepwise method). Second, the algorithms are often greedy because they need multiple calls of EM algorithms. Here we propose to use a new information criterion based on the integrated complete-data likelihood. It does not require any estimate and its maximization is simple and computationally efficient. The original contribution of our approach is to perform the model selection without requiring any parameter estimation. Then, parameter inference is needed only for the unique selected model. This approach is used for the variable selection of a Gaussian mixture model with conditional independence assumption. The numerical experiments on simulated and benchmark datasets show that the proposed method often outperforms two classical approaches for variable selection.

Submitted 18 June, 2015; v1 submitted 26 January, 2015; originally announced January 2015.

Comments: submitted to Statistics and Computing

MSC Class: 62H30; 62F15; 62-07; 62F07

arXiv:1312.6965 [pdf, ps, other]

doi 10.1109/TASE.2013.2256349

An Unsupervised Approach for Automatic Activity Recognition based on Hidden Markov Model Regression

Authors: Dorra Trabelsi, Samer Mohammed, Faicel Chamroukhi, Latifa Oukhellou, Yacine Amirat

Abstract: Using supervised machine learning approaches to recognize human activities from on-body wearable accelerometers generally requires a large amount of labelled data. When ground truth information is not available, too expensive, time consuming or difficult to collect, one has to rely on unsupervised approaches. This paper presents a new unsupervised approach for human activity recognition from raw acceleration data measured using inertial wearable sensors. The proposed method is based upon joint segmentation of multidimensional time series using a Hidden Markov Model (HMM) in a multiple regression context. The model is learned in an unsupervised framework using the Expectation-Maximization (EM) algorithm where no activity labels are needed. The proposed method takes into account the sequential appearance of the data. It is therefore adapted for the temporal acceleration data to accurately detect the activities. It allows both segmentation and classification of the human activities. Experimental results are provided to demonstrate the efficiency of the proposed approach with respect to standard supervised and unsupervised classification approaches.

Submitted 25 December, 2013; originally announced December 2013.

Journal ref: IEEE Transactions on Automation Science and Engineering, Volume: 10, Issue: 3, July 2013, Pages: 829-835

arXiv:1312.6956 [pdf, ps, other]

doi 10.1016/j.neucom.2013.04.003

Joint segmentation of multivariate time series with hidden process regression for human activity recognition

Authors: Faicel Chamroukhi, Samer Mohammed, Dorra Trabelsi, Latifa Oukhellou, Yacine Amirat

Abstract: The problem of human activity recognition is central for understanding and predicting human behavior, in particular from the perspective of assistive services to humans, such as health monitoring, well being, security, etc. There is therefore a growing need to build accurate models which can take into account the variability of the human activities over time (dynamic models) rather than static ones which can have some limitations in such a dynamic context. In this paper, the problem of activity recognition is analyzed through the segmentation of the multidimensional time series of the acceleration data measured in the 3-d space using body-worn accelerometers. The proposed model for automatic temporal segmentation is a specific statistical latent process model which assumes that the observed acceleration sequence is governed by a sequence of hidden (unobserved) activities. More specifically, the proposed approach is based on a specific multiple regression model incorporating a hidden discrete logistic process which governs the switching from one activity to another over time. The model is learned in an unsupervised context by maximizing the observed-data log-likelihood via a dedicated expectation-maximization (EM) algorithm. We applied it on a real-world automatic human activity recognition problem and its performance was assessed by performing comparisons with alternative approaches, including well-known supervised static classifiers and the standard hidden Markov model (HMM). The obtained results are very encouraging and show that the proposed approach is quite competitive even though it works in an entirely unsupervised way and does not require a feature extraction preprocessing step.

Submitted 25 December, 2013; originally announced December 2013.

Journal ref: Neurocomputing, Volume 120, Pages 633-644, November 2013

Showing 1–10 of 10 results for author: Mohammed, S