-
Informed Random Partition Models with Temporal Dependence
Authors:
Sally Paganin,
Garritt L. Page,
Fernando Andrés Quintana
Abstract:
Model-based clustering is a powerful tool that is often used to discover hidden structure in data by grouping observational units that exhibit similar response values. Recently, clustering methods have been developed that permit incorporating an ``initial'' partition informed by expert opinion. Then, using some similarity criteria, partitions different from the initial one are down weighted, i.e. they are assigned reduced probabilities. These methods represent an exciting new direction of method development in clustering techniques. We add to this literature a method that very flexibly permits assigning varying levels of uncertainty to any subset of the partition. This is particularly useful in practice as there is rarely clear prior information with regards to the entire partition. Our approach is not based on partition penalties but considers individual allocation probabilities for each unit (e.g., locally weighted prior information). We illustrate the gains in prior specification flexibility via simulation studies and an application to a dataset concerning spatio-temporal evolution of ${\rm PM}_{10}$ measurements in Germany.
Submitted 24 November, 2023;
originally announced November 2023.
-
Computational methods for fast Bayesian model assessment via calibrated posterior p-values
Authors:
Sally Paganin,
Perry de Valpine
Abstract:
Posterior predictive p-values (ppps) have become popular tools for Bayesian model assessment, being general-purpose and easy to use. However, interpretation can be difficult because their distribution is not uniform under the hypothesis that the model did generate the data. Calibrated ppps (cppps) can be obtained via a bootstrap-like procedure, yet remain unavailable in practice due to high computational cost. This paper introduces methods to enable efficient approximation of cppps and their uncertainty for fast model assessment. We first investigate the computational trade-off between the number of calibration replicates and the number of MCMC samples per replicate. Provided that the MCMC chain from the real data has converged, using short MCMC chains per calibration replicate can save significant computation time compared to naive implementations, without significant loss in accuracy. We propose different variance estimators for the cppp approximation, which can be used to confirm quickly the lack of evidence against model misspecification. As variance estimation uses effective sample sizes of many short MCMC chains, we show these can be approximated well from the real-data MCMC chain. The procedure for cppp is implemented in NIMBLE, a flexible framework for hierarchical modeling that supports many models and discrepancy measures.
Submitted 30 January, 2024; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Computational strategies and estimation performance with Bayesian semiparametric Item Response Theory models
Authors:
Sally Paganin,
Christopher J. Paciorek,
Claudia Wehrhahn,
Abel Rodriguez,
Sophia Rabe-Hesketh,
Perry de Valpine
Abstract:
Item response theory (IRT) models typically rely on a normality assumption for subject-specific latent traits, which is often unrealistic in practice. Semiparametric extensions based on Dirichlet process mixtures offer a more flexible representation of the unknown distribution of the latent trait. However, the use of such models in the IRT literature has been extremely limited, in good part because of the lack of comprehensive studies and accessible software tools. This paper provides guidance for practitioners on semiparametric IRT models and their implementation. In particular, we rely on NIMBLE, a flexible software system for hierarchical models that enables the use of Dirichlet process mixtures. We highlight efficient sampling strategies for model estimation and compare inferential results under parametric and semiparametric models.
Submitted 10 August, 2022; v1 submitted 27 January, 2021;
originally announced January 2021.
-
Centered Partition Process: Informative Priors for Clustering
Authors:
Sally Paganin,
Amy H. Herring,
Andrew F. Olshan,
David B. Dunson
Abstract:
There is a very rich literature proposing Bayesian approaches for clustering starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate-dependence and partial exchangeability, limited consideration has been given on how to include concrete prior knowledge on the partition. For example, we are motivated by an epidemiological application, in which we wish to cluster birth defects into groups and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies the EPPF to favor partitions close to an initial one. Some properties of the CP prior are described, a general algorithm for posterior computation is developed, and we illustrate the methodology through simulation examples and an application to the motivating epidemiology study of birth defects.
Submitted 29 January, 2019;
originally announced January 2019.
-
Bayesian modeling of networks in complex business intelligence problems
Authors:
Daniele Durante,
Sally Paganin,
Bruno Scarpa,
David B. Dunson
Abstract:
Complex network data problems are increasingly common in many fields of application. Our motivation is drawn from strategic marketing studies monitoring customer choices of specific products, along with co-subscription networks encoding multiple purchasing behavior. Data are available for several agencies within the same insurance company, and our goal is to efficiently exploit co-subscription networks to inform targeted advertising of cross-sell strategies to currently mono-product customers. We address this goal by developing a Bayesian hierarchical model, which clusters agencies according to common mono-product customer choices and co-subscription networks. Within each cluster, we efficiently model customer behavior via a cluster-dependent mixture of latent eigenmodels. This formulation provides key information on mono-product customer choices and multiple purchasing behavior within each cluster, informing targeted cross-sell strategies. We develop simple algorithms for tractable inference, and assess performance in simulations and an application to business intelligence.
Submitted 28 March, 2016; v1 submitted 2 October, 2015;
originally announced October 2015.