Search | arXiv e-print repository

arXiv:2406.19563 [pdf, other]

Bayesian Rank-Clustering

Authors: Michael Pearce, Elena A. Erosheva

Abstract: In a traditional analysis of ordinal comparison data, the goal is to infer an overall ranking of objects from best to worst with each object having a unique rank. However, the ranks of some objects may not be statistically distinguishable. This could happen due to insufficient data or to the true underlying abilities or qualities being equal for some objects. In such cases, practitioners may prefer an overall ranking where groups of objects are allowed to have equal ranks or to be $\textit{rank-clustered}$. Existing models related to rank-clustering are limited by their inability to handle a variety of ordinal data types, to quantify uncertainty, or by the need to pre-specify the number and size of potential rank-clusters. We solve these limitations through the proposed Bayesian $\textit{Rank-Clustered Bradley-Terry-Luce}$ model. We allow for rank-clustering via parameter fusion by imposing a novel spike-and-slab prior on object-specific worth parameters in the Bradley-Terry-Luce family of distributions for ordinal comparisons. We demonstrate the model on simulated and real datasets in survey analysis, elections, and sports.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 36 pages, 20 figures, 2 tables

arXiv:2301.09755 [pdf, other]

Modeling Preferences: A Bayesian Mixture of Finite Mixtures for Rankings and Ratings

Authors: Michael Pearce, Elena A. Erosheva

Abstract: Rankings and ratings are commonly used to express preferences but provide distinct and complementary information. Rankings give ordinal and scale-free comparisons but lack granularity; ratings provide cardinal and granular assessments but may be highly subjective or inconsistent. Collecting and analyzing rankings and ratings jointly has not been performed until recently due to a lack of principled methods. In this work, we propose a flexible, joint statistical model for rankings and ratings under heterogeneous preferences: the Bradley-Terry-Luce-Binomial (BTL-Binomial). We employ a Bayesian mixture of finite mixtures (MFM) approach to estimate heterogeneous preferences, understand their inherent uncertainty, and make accurate decisions based on rankings and ratings jointly. We demonstrate the efficiency and practicality of the BTL-Binomial MFM approach on real and simulated datasets of ranking and rating preferences in peer review and survey data contexts.

Submitted 23 January, 2023; originally announced January 2023.

Comments: 41 pages, 16 figures

arXiv:2208.03252 [pdf, other]

doi 10.1214/21-AOAS1439

Partial-Mastery Cognitive Diagnosis Models

Authors: Zhuoran Shang, Elena A. Erosheva, Gongjun Xu

Abstract: Cognitive diagnosis models (CDMs) are a family of discrete latent attribute models that serve as a statistical basis in educational and psychological cognitive diagnosis assessments. CDMs aim to achieve fine-grained inference on individuals' latent attributes, based on their observed responses to a set of designed diagnostic items. In the literature, CDMs usually assume that items require mastery of specific latent attributes and that each attribute is either fully mastered or not mastered by a given subject. We propose a new class of models, partial mastery CDMs (PM-CDMs), that generalizes CDMs by allowing for partial mastery levels for each attribute of interest. We demonstrate that PM-CDMs can be represented as restricted latent class models. Relying on the latent class representation, we propose a Bayesian approach for estimation. We present simulation studies to demonstrate parameter recovery, to investigate the impact of model misspecification with respect to partial mastery, and to develop diagnostic tools that could be used by practitioners to decide between CDMs and PM-CDMs. We use two examples of real test data -- the fraction subtraction and the English tests -- to demonstrate that employing PM-CDMs not only improves model fit, compared to CDMs, but also can make a substantial difference in conclusions about attribute mastery. We conclude that PM-CDMs can lead to more effective remediation programs by providing detailed individual-level information about skills learned and skills that need further study.

Submitted 5 August, 2022; originally announced August 2022.

Journal ref: This work has been published in Ann. Appl. Stat. 15(3): 1529-1555 (September 2021)

arXiv:2206.12365 [pdf, ps, other]

On the validity of bootstrap uncertainty estimates in the Mallows-Binomial model

Authors: Michael Pearce, Elena A. Erosheva

Abstract: The Mallows-Binomial distribution is the first joint statistical model for rankings and ratings (Pearce and Erosheva, 2022). Because frequentist estimation of the model parameters and their uncertainty is challenging, it is natural to consider the nonparametric bootstrap. However, it is not clear that the nonparametric bootstrap is asymptotically valid in this setting. This is because the Mallows-Binomial model is parameterized by continuous quantities whose discrete order affects the likelihood. In this note, we demonstrate that bootstrap uncertainty estimates for the maximum likelihood estimates in the Mallows-Binomial model are asymptotically valid.

Submitted 20 July, 2022; v1 submitted 24 June, 2022; originally announced June 2022.

Comments: 9 pages

arXiv:2201.02539 [pdf, other]

A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review

Authors: Michael Pearce, Elena A. Erosheva

Abstract: Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus between judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.

Submitted 24 June, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

Comments: 36 pages, 8 figures

Journal ref: JMLR 23(210): 1-33, 2022

arXiv:2109.11705 [pdf, other]

Dimension-Grouped Mixed Membership Models for Multivariate Categorical Data

Authors: Yuqi Gu, Elena A. Erosheva, Gongjun Xu, David B. Dunson

Abstract: Mixed Membership Models (MMMs) are a popular family of latent structure models for complex multivariate data. Instead of forcing each subject to belong to a single cluster, MMMs incorporate a vector of subject-specific weights characterizing partial membership across clusters. With this flexibility come challenges in uniquely identifying, estimating, and interpreting the parameters. In this article, we propose a new class of Dimension-Grouped MMMs (Gro-M$^3$s) for multivariate categorical data, which improve parsimony and interpretability. In Gro-M$^3$s, observed variables are partitioned into groups such that the latent membership is constant for variables within a group but can differ across groups. Traditional latent class models are obtained when all variables are in one group, while traditional MMMs are obtained when each variable is in its own group. The new model corresponds to a novel decomposition of probability tensors. Theoretically, we derive transparent identifiability conditions for both the unknown grouping structure and model parameters in general settings. Methodologically, we propose a Bayesian approach for Dirichlet Gro-M$^3$s to infer the variable grouping structure and estimate model parameters. Simulation results demonstrate good computational performance and empirically confirm the identifiability results. We illustrate the new methodology through applications to a functional disability survey dataset and a personality test dataset.

Submitted 14 February, 2023; v1 submitted 23 September, 2021; originally announced September 2021.

arXiv:1909.01284 [pdf, other]

Gender-based homophily in collaborations across a heterogeneous scholarly landscape

Authors: Y. Samuel Wang, Carole J. Lee, Jevin D. West, Carl T. Bergstrom, Elena A. Erosheva

Abstract: In this article, we investigate the role of gender in collaboration patterns by analyzing gender-based homophily -- the tendency for researchers to co-author with individuals of the same gender. We develop and apply novel methodology to the corpus of JSTOR articles, a broad scholarly landscape, which we analyze at various levels of granularity. Most notably, for a precise analysis of gender homophily, we develop methodology which explicitly accounts for the fact that the data comprises heterogeneous intellectual communities and that not all authorships are exchangeable. In particular, we distinguish three phenomena which may affect the distribution of observed gender homophily in collaborations: a structural component that is due to demographics and non-gendered authorship norms of a scholarly community, a compositional component which is driven by varying gender representation across sub-disciplines and time, and a behavioral component which we define as the remainder of observed gender homophily after its structural and compositional components have been taken into account. Using minimal modeling assumptions, the methodology we develop allows us to test for behavioral homophily. We find that statistically significant behavioral homophily can be detected across the JSTOR corpus and show that this finding is robust to missing gender indicators in our data. In a secondary analysis, we show that the proportion of women in a field is positively associated with the probability of finding statistically significant behavioral homophily.

Submitted 16 June, 2022; v1 submitted 3 September, 2019; originally announced September 2019.

arXiv:1711.11057 [pdf, other]

On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example

Authors: Yen-Chi Chen, Y. Samuel Wang, Elena A. Erosheva

Abstract: Variational inference is a general approach for approximating complex density functions, such as those arising in latent variable models, popular in machine learning. It has been applied to approximate the maximum likelihood estimator and to carry out Bayesian inference; however, quantification of uncertainty with variational inference remains challenging from both theoretical and practical perspectives. This paper is concerned with developing uncertainty measures for variational inference by using bootstrap procedures. We first develop two general bootstrap approaches for assessing the uncertainty of a variational estimate and then study the underlying bootstrap theory in both fixed- and increasing-dimension settings. We then use the bootstrap approach and our theoretical results in the context of mixed membership modeling with multivariate binary data on functional disability from the National Long Term Care Survey. We carry out a two-sample approach to test for changes in the repeated measures of functional disability for the subset of individuals present in the 1989 and 1994 waves.

Submitted 17 April, 2018; v1 submitted 29 November, 2017; originally announced November 2017.

Comments: Accepted to the Annals of Applied Statistics; 34 pages, 8 pages

MSC Class: 62G09 (Primary); 62G15; 62H99 (Secondary)

arXiv:1610.09026 [pdf, ps, other]

On the relationship between set-based and network-based measures of gender homophily in scholarly publications

Authors: Y. Samuel Wang, Elena A. Erosheva

Abstract: There is an increased interest in the scientific community in the problem of measuring gender homophily in co-authorship on scholarly publications (Eisen, 2016). For a given set of publications and co-authorships, we assume that author identities have not been disambiguated in that we do not know when one person is an author on more than one paper. In this case, one way to think about measuring gender homophily is to consider all observed co-authorship pairs and obtain a set-based gender homophily coefficient (e.g., Bergstrom et al., 2016). Another way is to consider papers as observed disjoint networks of co-authors and use a network-based assortativity coefficient (e.g., Newman, 2003). In this note, we review both metrics and show that the gender homophily set-based index is equivalent to the gender assortativity network-based coefficient with properly weighted edges.

Submitted 11 November, 2016; v1 submitted 27 October, 2016; originally announced October 2016.

Comments: University of Washington; Center for Statistics and Social Sciences; WP 157

arXiv:1512.08731 [pdf, other]

A Variational EM Method for Mixed Membership Models with Multivariate Rank Data: an Analysis of Public Policy Preferences

Authors: Y. Samuel Wang, Ross Matsueda, Elena A. Erosheva

Abstract: In this article, we consider modeling ranked responses from a heterogeneous population. Specifically, we analyze data from the Eurobarometer 34.1 survey regarding public policy preferences towards drugs, alcohol and AIDS. Such policy preferences are likely to exhibit substantial differences within as well as across European nations reflecting a wide variety of cultures, political affiliations, ideological perspectives and common practices. We use a mixed membership model to account for multiple subgroups with differing preferences and to allow each individual to possess partial membership in more than one subgroup. Previous methods for fitting mixed membership models to rank data in a univariate setting have utilized an MCMC approach and do not estimate the relative frequency of each subgroup. We propose a variational EM approach for fitting mixed membership models with multivariate rank data. Our method allows for fast approximate inference and explicitly estimates the subgroup sizes. Analyzing the Eurobarometer 34.1 data, we find interpretable subgroups which generally agree with the "left vs right" classification of political ideologies.

Submitted 24 February, 2017; v1 submitted 29 December, 2015; originally announced December 2015.

Comments: 24 pages; 7 figures

arXiv:1401.2728 [pdf, ps, other]

doi 10.1214/13-AOAS675

A semiparametric approach to mixed outcome latent variable models: Estimating the association between cognition and regional brain volumes

Authors: Jonathan Gruhl, Elena A. Erosheva, Paul K. Crane

Abstract: Multivariate data that combine binary, categorical, count and continuous outcomes are common in the social and health sciences. We propose a semiparametric Bayesian latent variable model for multivariate data of arbitrary type that does not require specification of conditional distributions. Drawing on the extended rank likelihood method by Hoff [Ann. Appl. Stat. 1 (2007) 265-283], we develop a semiparametric approach for latent variable modeling with mixed outcomes and propose associated Markov chain Monte Carlo estimation methods. Motivated by cognitive testing data, we focus on bifactor models, a special case of factor analysis. We employ our semiparametric Bayesian latent variable model to investigate the association between cognitive outcomes and MRI-measured regional brain volumes.

Submitted 13 January, 2014; originally announced January 2014.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/13-AOAS675

Report number: IMS-AOAS-AOAS675

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 4, 2361-2383

arXiv:0712.2124 [pdf, ps, other]

doi 10.1214/07-AOAS126

Describing disability through individual-level mixture models for multivariate binary data

Authors: Elena A. Erosheva, Stephen E. Fienberg, Cyrille Joutard

Abstract: Data on functional disability are of widespread policy interest in the United States, especially with respect to planning for Medicare and Social Security for a growing population of elderly adults. We consider an extract of functional disability data from the National Long Term Care Survey (NLTCS) and attempt to develop disability profiles using variations of the Grade of Membership (GoM) model. We first describe GoM as an individual-level mixture model that allows individuals to have partial membership in several mixture components simultaneously. We then prove the equivalence between individual-level and population-level mixture models, and use this property to develop a Markov Chain Monte Carlo algorithm for Bayesian estimation of the model. We use our approach to analyze functional disability data from the NLTCS.

Submitted 13 December, 2007; originally announced December 2007.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/07-AOAS126

Report number: IMS-AOAS-AOAS126

Journal ref: Annals of Applied Statistics 2007, Vol. 1, No. 2, 502-537

Showing 1–12 of 12 results for author: Erosheva, E A