-
Bayesian Rank-Clustering
Authors:
Michael Pearce,
Elena A. Erosheva
Abstract:
In a traditional analysis of ordinal comparison data, the goal is to infer an overall ranking of objects from best to worst with each object having a unique rank. However, the ranks of some objects may not be statistically distinguishable. This could happen due to insufficient data or to the true underlying abilities or qualities being equal for some objects. In such cases, practitioners may prefer an overall ranking where groups of objects are allowed to have equal ranks or to be $\textit{rank-clustered}$. Existing models related to rank-clustering are limited by their inability to handle a variety of ordinal data types or to quantify uncertainty, or by the need to pre-specify the number and size of potential rank-clusters. We address these limitations through the proposed Bayesian $\textit{Rank-Clustered Bradley-Terry-Luce}$ model. We allow for rank-clustering via parameter fusion by imposing a novel spike-and-slab prior on object-specific worth parameters in the Bradley-Terry-Luce family of distributions for ordinal comparisons. We demonstrate the model on simulated and real datasets in survey analysis, elections, and sports.
Submitted 27 June, 2024;
originally announced June 2024.
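As an illustration of the Bradley-Terry-Luce family mentioned in the abstract above, the following minimal sketch computes the textbook Plackett-Luce probability of a full ranking from object-specific worth parameters. It is not the authors' Rank-Clustered model or their spike-and-slab prior; the function name and the worth values are hypothetical. It only shows why fused (equal) worth parameters make two objects exchangeable in rank.

```python
from itertools import permutations


def plackett_luce_prob(ranking, worth):
    """Probability of a full ranking (best to worst) under Plackett-Luce.

    At each stage, the next object is chosen with probability proportional
    to its worth among the objects not yet ranked.
    """
    prob = 1.0
    remaining = list(ranking)
    for obj in ranking:
        denom = sum(worth[o] for o in remaining)
        prob *= worth[obj] / denom
        remaining.remove(obj)
    return prob


# Hypothetical worths: B and C share one value, as if fused into a rank-cluster.
worth = {"A": 3.0, "B": 1.0, "C": 1.0}
probs = {r: plackett_luce_prob(r, worth) for r in permutations(worth)}

for ranking, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(ranking, round(p, 4))
```

Because B and C have the same worth here, any two rankings that differ only by swapping B and C receive the same probability, which is the sense in which their ranks are indistinguishable.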
-
Modeling Preferences: A Bayesian Mixture of Finite Mixtures for Rankings and Ratings
Authors:
Michael Pearce,
Elena A. Erosheva
Abstract:
Rankings and ratings are commonly used to express preferences but provide distinct and complementary information. Rankings give ordinal and scale-free comparisons but lack granularity; ratings provide cardinal and granular assessments but may be highly subjective or inconsistent. Collecting and analyzing rankings and ratings jointly has not been performed until recently due to a lack of principled methods. In this work, we propose a flexible, joint statistical model for rankings and ratings under heterogeneous preferences: the Bradley-Terry-Luce-Binomial (BTL-Binomial). We employ a Bayesian mixture of finite mixtures (MFM) approach to estimate heterogeneous preferences, understand their inherent uncertainty, and make accurate decisions based on rankings and ratings jointly. We demonstrate the efficiency and practicality of the BTL-Binomial MFM approach on real and simulated datasets of ranking and rating preferences in peer review and survey data contexts.
Submitted 23 January, 2023;
originally announced January 2023.
-
Partial-Mastery Cognitive Diagnosis Models
Authors:
Zhuoran Shang,
Elena A. Erosheva,
Gongjun Xu
Abstract:
Cognitive diagnosis models (CDMs) are a family of discrete latent attribute models that serve as a statistical basis in educational and psychological cognitive diagnosis assessments. CDMs aim to achieve fine-grained inference on individuals' latent attributes, based on their observed responses to a set of designed diagnostic items. In the literature, CDMs usually assume that items require mastery of specific latent attributes and that each attribute is either fully mastered or not mastered by a given subject. We propose a new class of models, partial mastery CDMs (PM-CDMs), that generalizes CDMs by allowing for partial mastery levels for each attribute of interest. We demonstrate that PM-CDMs can be represented as restricted latent class models. Relying on the latent class representation, we propose a Bayesian approach for estimation. We present simulation studies to demonstrate parameter recovery, to investigate the impact of model misspecification with respect to partial mastery, and to develop diagnostic tools that could be used by practitioners to decide between CDMs and PM-CDMs. We use two examples of real test data -- the fraction subtraction and the English tests -- to demonstrate that employing PM-CDMs not only improves model fit, compared to CDMs, but also can make a substantial difference in conclusions about attribute mastery. We conclude that PM-CDMs can lead to more effective remediation programs by providing detailed individual-level information about skills learned and skills that need further study.
Submitted 5 August, 2022;
originally announced August 2022.
-
On the validity of bootstrap uncertainty estimates in the Mallows-Binomial model
Authors:
Michael Pearce,
Elena A. Erosheva
Abstract:
The Mallows-Binomial distribution is the first joint statistical model for rankings and ratings (Pearce and Erosheva, 2022). Because frequentist estimation of the model parameters and their uncertainty is challenging, it is natural to consider the nonparametric bootstrap. However, it is not clear that the nonparametric bootstrap is asymptotically valid in this setting. This is because the Mallows-Binomial model is parameterized by continuous quantities whose discrete order affects the likelihood. In this note, we demonstrate that bootstrap uncertainty estimates for the maximum likelihood estimates in the Mallows-Binomial model are asymptotically valid.
Submitted 20 July, 2022; v1 submitted 24 June, 2022;
originally announced June 2022.
-
A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review
Authors:
Michael Pearce,
Elena A. Erosheva
Abstract:
Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $φ$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus between judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.
Submitted 24 June, 2022; v1 submitted 7 January, 2022;
originally announced January 2022.
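To make the ranking component named in the abstract above concrete, here is a minimal sketch of the standard Mallows' φ model with Kendall distance: rankings lose probability geometrically in the number of pairwise disagreements with a consensus ranking. This is the textbook form only, not the authors' joint Mallows-Binomial likelihood or their tree-search estimator; the function names, the consensus ranking, and the value of `phi` are hypothetical.

```python
from itertools import combinations, permutations


def kendall_distance(r1, r2):
    """Number of object pairs that the two rankings order differently."""
    pos1 = {obj: i for i, obj in enumerate(r1)}
    pos2 = {obj: i for i, obj in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def mallows_prob(ranking, consensus, phi):
    """P(ranking) = phi ** d(ranking, consensus) / Z, with 0 < phi <= 1.

    Z is the closed-form normalizing constant for Kendall distance:
    the product over j = 1..n of (1 + phi + ... + phi ** (j - 1)).
    """
    z = 1.0
    for j in range(1, len(consensus) + 1):
        z *= sum(phi ** k for k in range(j))
    return phi ** kendall_distance(ranking, consensus) / z


# Hypothetical consensus ranking and dispersion; smaller phi means more consensus.
consensus = ("A", "B", "C", "D")
phi = 0.5
total = sum(mallows_prob(r, consensus, phi) for r in permutations(consensus))
print(total)
```

In this parameterization, `phi = 1` gives the uniform distribution over rankings, while values near 0 concentrate mass on the consensus ranking.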
-
Dimension-Grouped Mixed Membership Models for Multivariate Categorical Data
Authors:
Yuqi Gu,
Elena A. Erosheva,
Gongjun Xu,
David B. Dunson
Abstract:
Mixed Membership Models (MMMs) are a popular family of latent structure models for complex multivariate data. Instead of forcing each subject to belong to a single cluster, MMMs incorporate a vector of subject-specific weights characterizing partial membership across clusters. With this flexibility come challenges in uniquely identifying, estimating, and interpreting the parameters. In this article, we propose a new class of Dimension-Grouped MMMs (Gro-M$^3$s) for multivariate categorical data, which improve parsimony and interpretability. In Gro-M$^3$s, observed variables are partitioned into groups such that the latent membership is constant for variables within a group but can differ across groups. Traditional latent class models are obtained when all variables are in one group, while traditional MMMs are obtained when each variable is in its own group. The new model corresponds to a novel decomposition of probability tensors. Theoretically, we derive transparent identifiability conditions for both the unknown grouping structure and model parameters in general settings. Methodologically, we propose a Bayesian approach for Dirichlet Gro-M$^3$s to infer the variable grouping structure and estimate model parameters. Simulation results demonstrate good computational performance and empirically confirm the identifiability results. We illustrate the new methodology through applications to a functional disability survey dataset and a personality test dataset.
Submitted 14 February, 2023; v1 submitted 23 September, 2021;
originally announced September 2021.
-
Gender-based homophily in collaborations across a heterogeneous scholarly landscape
Authors:
Y. Samuel Wang,
Carole J. Lee,
Jevin D. West,
Carl T. Bergstrom,
Elena A. Erosheva
Abstract:
In this article, we investigate the role of gender in collaboration patterns by analyzing gender-based homophily -- the tendency for researchers to co-author with individuals of the same gender. We develop and apply novel methodology to the corpus of JSTOR articles, a broad scholarly landscape, which we analyze at various levels of granularity. Most notably, for a precise analysis of gender homophily, we develop methodology which explicitly accounts for the fact that the data comprises heterogeneous intellectual communities and that not all authorships are exchangeable. In particular, we distinguish three phenomena which may affect the distribution of observed gender homophily in collaborations: a structural component that is due to demographics and non-gendered authorship norms of a scholarly community, a compositional component which is driven by varying gender representation across sub-disciplines and time, and a behavioral component which we define as the remainder of observed gender homophily after its structural and compositional components have been taken into account. Using minimal modeling assumptions, the methodology we develop allows us to test for behavioral homophily. We find that statistically significant behavioral homophily can be detected across the JSTOR corpus and show that this finding is robust to missing gender indicators in our data. In a secondary analysis, we show that the representation of women in a field is positively associated with the probability of finding statistically significant behavioral homophily.
Submitted 16 June, 2022; v1 submitted 3 September, 2019;
originally announced September 2019.
-
On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example
Authors:
Yen-Chi Chen,
Y. Samuel Wang,
Elena A. Erosheva
Abstract:
Variational inference is a general approach for approximating complex density functions, such as those arising in latent variable models, popular in machine learning. It has been applied to approximate the maximum likelihood estimator and to carry out Bayesian inference; however, quantification of uncertainty with variational inference remains challenging from both theoretical and practical perspectives. This paper is concerned with developing uncertainty measures for variational inference by using bootstrap procedures. We first develop two general bootstrap approaches for assessing the uncertainty of a variational estimate and then study the underlying bootstrap theory in both fixed- and increasing-dimension settings. We then use the bootstrap approach and our theoretical results in the context of mixed membership modeling with multivariate binary data on functional disability from the National Long Term Care Survey. We carry out a two-sample approach to test for changes in the repeated measures of functional disability for the subset of individuals present in the 1989 and 1994 waves.
Submitted 17 April, 2018; v1 submitted 29 November, 2017;
originally announced November 2017.
-
On the relationship between set-based and network-based measures of gender homophily in scholarly publications
Authors:
Y. Samuel Wang,
Elena A. Erosheva
Abstract:
There is an increased interest in the scientific community in the problem of measuring gender homophily in co-authorship on scholarly publications (Eisen, 2016). For a given set of publications and co-authorships, we assume that author identities have not been disambiguated in that we do not know when one person is an author on more than one paper. In this case, one way to think about measuring gender homophily is to consider all observed co-authorship pairs and obtain a set-based gender homophily coefficient (e.g., Bergstrom et al., 2016). Another way is to consider papers as observed disjoint networks of co-authors and use a network-based assortativity coefficient (e.g., Newman, 2003). In this note, we review both metrics and show that the gender homophily set-based index is equivalent to the gender assortativity network-based coefficient with properly weighted edges.
Submitted 11 November, 2016; v1 submitted 27 October, 2016;
originally announced October 2016.
-
A Variational EM Method for Mixed Membership Models with Multivariate Rank Data: an Analysis of Public Policy Preferences
Authors:
Y. Samuel Wang,
Ross Matsueda,
Elena A. Erosheva
Abstract:
In this article, we consider modeling ranked responses from a heterogeneous population. Specifically, we analyze data from the Eurobarometer 34.1 survey regarding public policy preferences towards drugs, alcohol and AIDS. Such policy preferences are likely to exhibit substantial differences within as well as across European nations reflecting a wide variety of cultures, political affiliations, ideological perspectives and common practices. We use a mixed membership model to account for multiple subgroups with differing preferences and to allow each individual to possess partial membership in more than one subgroup. Previous methods for fitting mixed membership models to rank data in a univariate setting have utilized an MCMC approach and do not estimate the relative frequency of each subgroup. We propose a variational EM approach for fitting mixed membership models with multivariate rank data. Our method allows for fast approximate inference and explicitly estimates the subgroup sizes. Analyzing the Eurobarometer 34.1 data, we find interpretable subgroups which generally agree with the "left vs right" classification of political ideologies.
Submitted 24 February, 2017; v1 submitted 29 December, 2015;
originally announced December 2015.
-
A semiparametric approach to mixed outcome latent variable models: Estimating the association between cognition and regional brain volumes
Authors:
Jonathan Gruhl,
Elena A. Erosheva,
Paul K. Crane
Abstract:
Multivariate data that combine binary, categorical, count and continuous outcomes are common in the social and health sciences. We propose a semiparametric Bayesian latent variable model for multivariate data of arbitrary type that does not require specification of conditional distributions. Drawing on the extended rank likelihood method by Hoff [Ann. Appl. Stat. 1 (2007) 265-283], we develop a semiparametric approach for latent variable modeling with mixed outcomes and propose associated Markov chain Monte Carlo estimation methods. Motivated by cognitive testing data, we focus on bifactor models, a special case of factor analysis. We employ our semiparametric Bayesian latent variable model to investigate the association between cognitive outcomes and MRI-measured regional brain volumes.
Submitted 13 January, 2014;
originally announced January 2014.
-
Describing disability through individual-level mixture models for multivariate binary data
Authors:
Elena A. Erosheva,
Stephen E. Fienberg,
Cyrille Joutard
Abstract:
Data on functional disability are of widespread policy interest in the United States, especially with respect to planning for Medicare and Social Security for a growing population of elderly adults. We consider an extract of functional disability data from the National Long Term Care Survey (NLTCS) and attempt to develop disability profiles using variations of the Grade of Membership (GoM) model. We first describe GoM as an individual-level mixture model that allows individuals to have partial membership in several mixture components simultaneously. We then prove the equivalence between individual-level and population-level mixture models, and use this property to develop a Markov Chain Monte Carlo algorithm for Bayesian estimation of the model. We use our approach to analyze functional disability data from the NLTCS.
Submitted 13 December, 2007;
originally announced December 2007.
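The individual-level mixture idea described in the abstract above can be sketched with the usual Grade of Membership form for binary items: each individual's probability of a positive response on an item is a membership-weighted average of the extreme profiles' probabilities, and items are treated as conditionally independent given the membership vector. This is a minimal illustration under that standard form, not the authors' estimation algorithm; the function name, the profile probabilities, and the membership vector are hypothetical.

```python
from itertools import product


def gom_response_prob(responses, membership, profiles):
    """Probability of a binary response pattern given a membership vector.

    membership: nonnegative weights over profiles that sum to 1.
    profiles[k][j]: P(item j = 1) for an individual fully in profile k.
    """
    prob = 1.0
    for j, x in enumerate(responses):
        # Item-level probability is a convex combination across profiles.
        p_j = sum(g * profile[j] for g, profile in zip(membership, profiles))
        prob *= p_j if x == 1 else 1.0 - p_j
    return prob


# Hypothetical example: two extreme profiles, three binary disability items.
profiles = [
    [0.9, 0.8, 0.7],  # profile 0: high probability of each limitation
    [0.1, 0.2, 0.3],  # profile 1: low probability of each limitation
]
membership = (0.25, 0.75)  # partial membership in both profiles

total = sum(
    gom_response_prob(pattern, membership, profiles)
    for pattern in product([0, 1], repeat=3)
)
print(total)
```

Setting the membership vector to `(1.0, 0.0)` or `(0.0, 1.0)` recovers an ordinary latent class response probability for the corresponding profile.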