A Bayesian multivariate model with temporal dependence on random partition of areal data

Authors: Jessica Pavani, Fernando Andrés Quintana

Abstract: More than half of the world's population is exposed to the risk of mosquito-borne diseases, which leads to millions of cases and hundreds of thousands of deaths every year. Analyzing this type of data is often complex and poses several interesting challenges, mainly due to the vast geographic area, the peculiar temporal behavior, and the potential correlation between infections. Motivation stems from the analysis of tropical disease data, namely, the number of cases of two arboviruses, dengue and chikungunya, transmitted by the same mosquito, for all the 145 microregions in Southeast Brazil from 2018 to 2022. As a contribution to the literature on multivariate disease data, we develop a flexible Bayesian multivariate spatio-temporal model where temporal dependence is defined for areal clusters. The model features a prior distribution for the random partition of areal data that incorporates neighboring information, thus encouraging maps with few contiguous clusters and discouraging clusters with disconnected areas. The model also incorporates an autoregressive structure and terms related to seasonal patterns into temporal components that are disease and cluster-specific. It also considers a multivariate directed acyclic graph autoregressive structure to accommodate spatial and inter-disease dependence, facilitating the interpretation of spatial correlation. We explore properties of the model by way of simulation studies and show results that prove our proposal compares well to competing alternatives. Finally, we apply the model to the motivating dataset with a twofold goal: clustering areas where the temporal trends of certain diseases are similar, and exploring the potential existence of temporal and/or spatial correlation between two diseases transmitted by the same mosquito.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2311.14502 [pdf, other]

Informed Random Partition Models with Temporal Dependence

Authors: Sally Paganin, Garritt L. Page, Fernando Andrés Quintana

Abstract: Model-based clustering is a powerful tool that is often used to discover hidden structure in data by grouping observational units that exhibit similar response values. Recently, clustering methods have been developed that permit incorporating an "initial" partition informed by expert opinion. Then, using some similarity criteria, partitions different from the initial one are down weighted, i.e. they are assigned reduced probabilities. These methods represent an exciting new direction of method development in clustering techniques. We add to this literature a method that very flexibly permits assigning varying levels of uncertainty to any subset of the partition. This is particularly useful in practice as there is rarely clear prior information with regard to the entire partition. Our approach is not based on partition penalties but considers individual allocation probabilities for each unit (e.g., locally weighted prior information). We illustrate the gains in prior specification flexibility via simulation studies and an application to a dataset concerning spatio-temporal evolution of ${\rm PM}_{10}$ measurements in Germany.

Submitted 24 November, 2023; originally announced November 2023.

Comments: 54 pages, 25 figures

MSC Class: 62F15

arXiv:2309.14120 [pdf, other]

Regression with Variable Dimension Covariates

Authors: Peter Mueller, Fernando Andrés Quintana, Garritt L. Page

Abstract: Regression is one of the most fundamental statistical inference problems. A broad definition of regression problems is as estimation of the distribution of an outcome using a family of probability models indexed by covariates. Despite the ubiquitous nature of regression problems and the abundance of related methods and results there is a surprising gap in the literature. There are no well established methods for regression with varying dimension covariate vectors, despite the common occurrence of such problems. In this paper we review some recent related papers proposing varying dimension regression by way of random partitions.

Submitted 25 September, 2023; originally announced September 2023.

arXiv:2302.06764 [pdf, other]

doi 10.1080/10618600.2024.2357636

A Projection Approach to Local Regression with Variable-Dimension Covariates

Authors: Matthew J. Heiner, Garritt L. Page, Fernando Andrés Quintana

Abstract: Incomplete covariate vectors are known to be problematic for estimation and inferences on model parameters, but their impact on prediction performance is less understood. We develop an imputation-free method that builds on a random partition model admitting variable-dimension covariates. Cluster-specific response models further incorporate covariates via linear predictors, facilitating estimation of smooth prediction surfaces with relatively few clusters. We exploit marginalization techniques of Gaussian kernels to analytically project response distributions according to any pattern of missing covariates, yielding a local regression with internally consistent uncertainty propagation that utilizes only one set of coefficients per cluster. Aggressive shrinkage of these coefficients regulates uncertainty due to missing covariates. The method allows in- and out-of-sample prediction for any missingness pattern, even if the pattern in a new subject's incomplete covariate vector was not seen in the training data. We develop an MCMC algorithm for posterior sampling that improves a computationally expensive update for latent cluster allocation. Finally, we demonstrate the model's effectiveness for nonlinear point and density prediction under various circumstances by comparing with other recent methods for regression of variable dimensions on synthetic and real data.

Submitted 28 February, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

Journal ref: Journal of Computational and Graphical Statistics (2024)

arXiv:2203.12280 [pdf, other]

Bayesian Nonparametric Vector Autoregressive Models via a Logit Stick-breaking Prior: an Application to Child Obesity

Authors: Mario Beraha, Alessandra Guglielmi, Fernando A. Quintana, Maria de Iorio, Johan Gunnar Eriksson, Fabian Yap

Abstract: Overweight and obesity in adults are known to be associated with risks of metabolic and cardiovascular diseases. Because obesity is an epidemic, increasingly affecting children, it is important to understand if this condition persists from early life to childhood and if different patterns of obesity growth can be detected. Our motivation starts from a study of obesity over time in children from South Eastern Asia. Our main focus is on clustering obesity patterns after adjusting for the effect of baseline information. Specifically, we consider a joint model for height and weight patterns taken every 6 months from birth. We propose a novel model that facilitates clustering by combining a vector autoregressive sampling model with a dependent logit stick-breaking prior. Simulation studies show the superiority of the model in capturing patterns, compared to other alternatives. We apply the model to the motivating dataset, and discuss the main features of the detected clusters. We also compare alternative models with ours in terms of predictive performance.

Submitted 23 March, 2022; originally announced March 2022.

arXiv:2107.11456 [pdf, ps, other]

Multipartition model for multiple change point identification

Authors: Ricardo C. Pedroso, Rosangela H. Loschi, Fernando Andrés Quintana

Abstract: Among the main goals in multiple change point problems are the estimation of the number and positions of the change points, as well as the regime structure in the clusters induced by those changes. The product partition model (PPM) is a widely used approach for the detection of multiple change points. The traditional PPM assumes that change points split the set of time points in random clusters that define a partition of the time axis. It is then typically assumed that sampling model parameter values within each of these blocks are identical. Because changes in different parameters of the observational model may occur at different times, the PPM thus fails to identify the parameters that experienced those changes. A similar problem may occur when detecting changes in multivariate time series. To solve this important limitation, we introduce a multipartition model to detect multiple change points occurring in several parameters at possibly different times. The proposed model assumes that the changes experienced by each parameter generate a different random partition of the time axis, which facilitates identifying which parameters have changed and when they do so. We discuss a partially collapsed Gibbs sampler scheme to implement posterior simulation under the proposed model. We apply the proposed model to identify multiple change points in Normal means and variances and evaluate the performance of the proposed model through Monte Carlo simulations and data illustrations. Its performance is compared with some previously proposed approaches for change point problems. These studies show that the proposed model is competitive and enriches the analysis of change point problems.

Submitted 9 August, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

Comments: 59 pages, 33 figures

arXiv:2007.06129 [pdf, other]

The Dependent Dirichlet Process and Related Models

Authors: Fernando A. Quintana, Peter Mueller, Alejandro Jara, Steven N. MacEachern

Abstract: Standard regression approaches assume that some finite number of the response distribution characteristics, such as location and scale, change as a (parametric or nonparametric) function of predictors. However, it is not always appropriate to assume a location/scale representation, where the error distribution has unchanging shape over the predictor space. In fact, it often happens in applied research that the distribution of responses under study changes with predictors in ways that cannot be reasonably represented by a finite dimensional functional form. This can seriously affect the answers to the scientific questions of interest, and therefore more general approaches are indeed needed. This gives rise to the study of fully nonparametric regression models. We review some of the main Bayesian approaches that have been employed to define probability models where the complete response distribution may vary flexibly with predictors. We focus on developments based on modifications of the Dirichlet process, historically termed dependent Dirichlet processes, and some of the extensions that have been proposed to tackle this general problem using nonparametric approaches.

Submitted 12 July, 2020; originally announced July 2020.

MSC Class: 62F15; ACM Class: G.3

arXiv:2005.10287 [pdf, other]

The semi-hierarchical Dirichlet Process and its application to clustering homogeneous distributions

Authors: Mario Beraha, Alessandra Guglielmi, Fernando A. Quintana

Abstract: Assessing homogeneity of distributions is an old problem that has received considerable attention, especially in the nonparametric Bayesian literature. To this effect, we propose the semi-hierarchical Dirichlet process, a novel hierarchical prior that extends the hierarchical Dirichlet process of Teh et al. (2006) and that avoids the degeneracy issues of nested processes recently described by Camerlenghi et al. (2019a). We go beyond the simple yes/no answer to the homogeneity question and embed the proposed prior in a random partition model; this procedure allows us to give a more comprehensive response to the above question and in fact find groups of populations that are internally homogeneous when I ≥ 2 such populations are considered. We study theoretical properties of the semi-hierarchical Dirichlet process and of the Bayes factor for the homogeneity test when I = 2. Extensive simulation studies and applications to educational data are also discussed.

Submitted 16 June, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

arXiv:1912.13119 [pdf, other]

Clustering and Prediction with Variable Dimension Covariates

Authors: Garritt L. Page, Fernando A. Quintana, Peter Müller

Abstract: In many applied fields incomplete covariate vectors are commonly encountered. It is well known that this can be problematic when making inference on model parameters, but its impact on prediction performance is less understood. We develop a method based on covariate dependent partition models that seamlessly handles missing covariates while completely avoiding any type of imputation. The method we develop allows in-sample predictions as well as out-of-sample prediction, even if the missing pattern in the new subjects' incomplete covariate vector was not seen in the training data. Covariates of any data type, including categorical or continuous, are permitted. In simulation studies the proposed method compares favorably. We illustrate the method in two application examples.

Submitted 12 July, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

arXiv:1912.11542 [pdf, other]

Dependent Modeling of Temporal Sequences of Random Partitions

Authors: Garritt L. Page, Fernando A. Quintana, David B. Dahl

Abstract: We consider the task of modeling a dependent sequence of random partitions. It is well-known that a random measure in Bayesian nonparametrics induces a distribution over random partitions. The community has therefore assumed that the best approach to obtain a dependent sequence of random partitions is through modeling dependent random measures. We argue that this approach is problematic and show that the random partition model induced by dependent Bayesian nonparametric priors exhibits counter-intuitive dependence among partitions even though the dependence for the sequence of random probability measures is intuitive. Because of this, we advocate instead to model the sequence of random partitions directly when clustering is of principal interest. To this end, we develop a class of dependent random partition models that explicitly models dependence in a sequence of partitions. We derive conditional and marginal properties of the joint partition model and devise computational strategies when employing the method in Bayesian modeling. In the case of temporal dependence, we demonstrate through simulation how the methodology produces partitions that evolve gently and naturally over time. We further illustrate the utility of the method by applying it to an environmental data set that exhibits spatio-temporal dependence.

Submitted 30 July, 2021; v1 submitted 24 December, 2019; originally announced December 2019.

arXiv:1810.00121 [pdf, other]

Discovering Interactions Using Covariate Informed Random Partition Models

Authors: Garritt L. Page, Fernando A. Quintana, Gary L. Rosner

Abstract: Combination chemotherapy treatment regimens created for patients diagnosed with childhood acute lymphoblastic leukemia have had great success in improving cure rates. Unfortunately, patients prescribed these types of treatment regimens have displayed susceptibility to the onset of osteonecrosis. Some have suggested that this is due to pharmacokinetic interaction between two agents in the treatment regimen (asparaginase and dexamethasone) and other physiological variables. Determining which physiological variables to consider when searching for interactions in scenarios like these, minus a priori guidance, has proved to be a challenging problem, particularly if interactions influence the response distribution in ways beyond shifts in expectation or dispersion only. In this paper we propose an exploratory technique that is able to discover associations between covariates and responses in a very general way. The procedure connects covariates to responses very flexibly through dependent random partition prior distributions, and then employs machine learning techniques to highlight potential associations found in each cluster. We provide a simulation study to show utility and apply the method to data produced from a study dedicated to learning which physiological predictors influence severity of osteonecrosis multiplicatively.

Submitted 3 August, 2020; v1 submitted 28 September, 2018; originally announced October 2018.

Comments: 29 pages, 3 figures

arXiv:1705.05181 [pdf, ps, other]

Determinantal point process mixtures via spectral density approach

Authors: Ilaria Bianchini, Alessandra Guglielmi, Fernando A. Quintana

Abstract: We consider mixture models where location parameters are a priori encouraged to be well separated. We explore a class of determinantal point process (DPP) mixture models, which provide the desired notion of separation or repulsion. Instead of using the rather restrictive case where analytical results are available, we adopt a spectral representation from which approximations to the DPP intensity functions can be readily computed. For the sake of concreteness the presentation focuses on a power exponential spectral density, but the proposed approach is in fact quite general. We later extend our model to incorporate covariate information in the likelihood and also in the assignment to mixture components, yielding a trade-off between repulsiveness of locations in the mixtures and attraction among subjects with similar covariates. We develop full Bayesian inference, and explore model properties and posterior behavior using several simulation scenarios and data illustrations.

Submitted 15 May, 2017; originally announced May 2017.

Comments: 42 pages (including Supplementary Material)

arXiv:1701.04457 [pdf, other]

Parsimonious Hierarchical Modeling Using Repulsive Distributions

Authors: J. J. Quinlan, F. A. Quintana, G. L. Page

Abstract: Employing nonparametric methods for density estimation has become routine in Bayesian statistical practice. Models based on discrete nonparametric priors such as Dirichlet Process Mixture (DPM) models are very attractive choices due to their flexibility and tractability. However, a common problem in fitting DPMs or other discrete models to data is that they tend to produce a large number of (sometimes) redundant clusters. In this work we propose a method that produces parsimonious mixture models (i.e. mixtures that discourage the creation of redundant clusters), without sacrificing flexibility or model fit. This method is based on the idea of repulsion, that is, that any two mixture components are encouraged to be well separated. We propose a family of d-dimensional probability densities whose coordinates tend to repel each other in a smooth way. The induced probability measure has a close relation with Gibbs measures, graph theory and point processes. We investigate its global properties and explore its use in the context of mixture models for density estimation. Computational techniques are detailed and we illustrate its usefulness with some well-known data sets and a small simulation study.

Submitted 29 June, 2017; v1 submitted 16 January, 2017; originally announced January 2017.

Comments: 36 pages, 9 figures, 1 table

arXiv:1505.02589 [pdf, ps, other]

doi 10.1214/14-BA919

Predictions Based on the Clustering of Heterogeneous Functions via Shape and Subject-Specific Covariates

Authors: Garritt L. Page, Fernando A. Quintana

Abstract: We consider a study of players employed by teams who are members of the National Basketball Association where units of observation are functional curves that are realizations of production measurements taken through the course of one's career. The observed functional output displays large amounts of between-player heterogeneity in the sense that some individuals produce curves that are fairly smooth while others are (much) more erratic. We argue that this variability in curve shape is a feature that can be exploited to guide decision making, learn about processes under study and improve prediction. In this paper we develop a methodology that takes advantage of this feature when clustering functional curves. Individual curves are flexibly modeled using Bayesian penalized B-splines while a hierarchical structure allows the clustering to be guided by the smoothness of individual curves. In a sense, the hierarchical structure balances the desire to fit individual curves well while still producing meaningful clusters that are used to guide prediction. We seamlessly incorporate available covariate information to guide the clustering of curves non-parametrically through the use of a product partition model prior for a random partition of individuals. Clustering based on curve smoothness and subject-specific covariate information is particularly important in carrying out the two types of predictions that are of interest, those that complete a partially observed curve from an active player, and those that predict the entire career curve for a player yet to play in the National Basketball Association.

Submitted 11 May, 2015; originally announced May 2015.

Comments: Published at http://dx.doi.org/10.1214/14-BA919 in the Bayesian Analysis (http://projecteuclid.org/euclid.ba) by the International Society of Bayesian Analysis (http://bayesian.org/)

Report number: VTeX-BA-BA919

Journal ref: Bayesian Analysis 2015, Vol. 10, No. 2, 379-410

arXiv:1504.04489 [pdf, other]

Spatial Product Partition Models

Authors: Garritt L. Page, Fernando A. Quintana

Abstract: When modeling geostatistical or areal data, spatial structure is commonly accommodated via a covariance function for the former and a neighborhood structure for the latter. In both cases the resulting spatial structure is a consequence of implicit spatial grouping in that observations near in space are assumed to behave similarly. It would be desirable to develop spatial methods that explicitly model the partitioning of spatial locations providing more control over resulting spatial structures and being able to better balance global vs local spatial dependence. To this end, we extend product partition models to a spatial setting so that the partitioning of locations into spatially dependent clusters is explicitly modeled. We explore the spatial structures that result from employing a spatial product partition model and demonstrate its flexibility in accommodating many types of spatial dependencies. We illustrate the method's utility through simulation studies and an education application.

Submitted 17 April, 2015; originally announced April 2015.

arXiv:1306.2503 [pdf, ps, other]

doi 10.1214/12-STS407

Defining Predictive Probability Functions for Species Sampling Models

Authors: Jaeyong Lee, Fernando A. Quintana, Peter Müller, Lorenzo Trippa

Abstract: We review the class of species sampling models (SSM). In particular, we investigate the relation between the exchangeable partition probability function (EPPF) and the predictive probability function (PPF). It is straightforward to define a PPF from an EPPF, but the converse is not necessarily true. In this paper we introduce the notion of putative PPFs and show novel conditions for a putative PPF to define an EPPF. We show that all possible PPFs in a certain class have to define (unnormalized) probabilities for cluster membership that are linear in cluster size. We give a new necessary and sufficient condition for arbitrary putative PPFs to define an EPPF. Finally, we show posterior inference for a large class of SSMs with a PPF that is not linear in cluster size and discuss a numerical method to derive its PPF.

Submitted 11 June, 2013; originally announced June 2013.

Comments: Published at http://dx.doi.org/10.1214/12-STS407 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS407

Journal ref: Statistical Science 2013, Vol. 28, No. 2, 209-222

arXiv:1202.5914 [pdf, ps, other]

doi 10.1214/11-AOAS492

Multivariate Bayesian semiparametric models for authentication of food and beverages

Authors: Luis Gutiérrez, Fernando A. Quintana

Abstract: Food and beverage authentication is the process by which foods or beverages are verified as complying with their label descriptions, for example, verifying if the denomination of origin of an olive oil bottle is correct or if the variety of a certain bottle of wine matches its label description. The common way to deal with an authentication process is to measure a number of attributes on samples of food and then use these as input for a classification problem. Our motivation stems from data consisting of measurements of nine chemical compounds denominated Anthocyanins, obtained from samples of Chilean red wines of grape varieties Cabernet Sauvignon, Merlot and Carménère. We consider a model-based approach to authentication through a semiparametric multivariate hierarchical linear mixed model for the mean responses, and covariance matrices that are specific to the classification categories. Specifically, we propose a model of the ANOVA-DDP type, which takes advantage of the fact that the available covariates are discrete in nature. The results suggest that the model performs well compared to other parametric alternatives. This is also corroborated by application to simulated data.

Submitted 27 February, 2012; originally announced February 2012.

Comments: Published at http://dx.doi.org/10.1214/11-AOAS492 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS492

Journal ref: Annals of Applied Statistics 2011, Vol. 5, No. 4, 2385-2402

arXiv:0708.4350 [pdf, ps, other]

doi 10.1214/07-AOAS104

Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis

Authors: Michael A. Newton, Fernando A. Quintana, Johan A. den Boon, Srikumar Sengupta, Paul Ahlquist

Abstract: A prespecified set of genes may be enriched, to varying degrees, for genes that have altered expression levels relative to two or more states of a cell. Knowing the enrichment of gene sets defined by functional categories, such as gene ontology (GO) annotations, is valuable for analyzing the biological signals in microarray expression data. A common approach to measuring enrichment is by cross-classifying genes according to membership in a functional category and membership on a selected list of significantly altered genes. A small Fisher's exact test $p$-value, for example, in this $2\times2$ table is indicative of enrichment. Other category analysis methods retain the quantitative gene-level scores and measure significance by referring a category-level statistic to a permutation distribution associated with the original differential expression problem. We describe a class of random-set scoring methods that measure distinct components of the enrichment signal. The class includes Fisher's test based on selected genes and also tests that average gene-level evidence across the category. Averaging and selection methods are compared empirically using Affymetrix data on expression in nasopharyngeal cancer tissue, and theoretically using a location model of differential expression. We find that each method has a domain of superiority in the state space of enrichment problems, and that both methods have benefits in practice. Our analysis also addresses two problems related to multiple-category inference, namely, that equally enriched categories are not detected with equal probability if they are of different sizes, and also that there is dependence among category statistics owing to shared genes. Random-set enrichment calculations do not require Monte Carlo for implementation. They are made available in the R package allez.

Submitted 31 August, 2007; originally announced August 2007.

Comments: Published at http://dx.doi.org/10.1214/07-AOAS104 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS104

Journal ref: Annals of Applied Statistics 2007, Vol. 1, No. 1, 85-106

Showing 1–18 of 18 results for author: Quintana, F A