arXiv:2404.16745 [pdf, other]

Statistical Inference for Covariate-Adjusted and Interpretable Generalized Factor Model with Application to Testing Fairness

Authors: Jing Ouyang, Chengyu Cui, Kean Ming Tan, Gongjun Xu

Abstract: In the era of data explosion, statisticians have been developing interpretable and computationally efficient statistical methods to measure latent factors (e.g., skills, abilities, and personalities) using large-scale assessment data. In addition to understanding the latent information, the covariate effect on responses controlling for latent factors is also of great scientific interest and has wide applications, such as evaluating the fairness of educational testing, where the covariate effect reflects whether a test question is biased toward certain individual characteristics (e.g., gender and race) taking into account their latent abilities. However, the large sample size, substantial covariate dimension, and great test length pose challenges to developing efficient methods and drawing valid inferences. Moreover, to accommodate the commonly encountered discrete types of responses, nonlinear latent factor models are often assumed, bringing further complexity to the problem. To address these challenges, we consider a covariate-adjusted generalized factor model and develop novel and interpretable conditions to address the identifiability issue. Based on the identifiability conditions, we propose a joint maximum likelihood estimation method and establish estimation consistency and asymptotic normality results for the covariate effects under a practical yet challenging asymptotic regime. Furthermore, we derive estimation and inference results for latent factors and the factor loadings. We illustrate the finite sample performance of the proposed method through extensive numerical studies and an application to an educational assessment dataset obtained from the Programme for International Student Assessment (PISA).

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2310.12010 [pdf, other]

A Note on Improving Variational Estimation for Multidimensional Item Response Theory

Authors: Chenchen Ma, Jing Ouyang, Chun Wang, Gongjun Xu

Abstract: Survey instruments and assessments are frequently used in many domains of social science. When the constructs that these assessments try to measure become multifaceted, multidimensional item response theory (MIRT) provides a unified framework and convenient statistical tool for item analysis, calibration, and scoring. However, the computational challenge of estimating MIRT models prohibits its wide use because many of the extant methods can hardly provide results in a realistic time frame when the number of dimensions, sample size, and test length are large. Instead, variational estimation methods, such as Gaussian Variational Expectation Maximization (GVEM) algorithm, have been recently proposed to solve the estimation challenge by providing a fast and accurate solution. However, results have shown that variational estimation methods may produce some bias on discrimination parameters during confirmatory model estimation, and this note proposes an importance weighted version of GVEM (i.e., IW-GVEM) to correct for such bias under MIRT models. We also use the adaptive moment estimation method to update the learning rate for gradient descent automatically. Our simulations show that IW-GVEM can effectively correct bias with modest increase of computation time, compared with GVEM. The proposed method may also shed light on improving the variational estimation for other psychometrics models.

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2302.07216 [pdf, other]

On the Multiway Principal Component Analysis

Authors: Jialin Ouyang, Ming Yuan

Abstract: Multiway data are becoming more and more common. While there are many approaches to extending principal component analysis (PCA) from usual data matrices to multiway arrays, their conceptual differences from the usual PCA, and the methodological implications of such differences remain largely unknown. This work aims to specifically address these questions. In particular, we clarify the subtle difference between PCA and singular value decomposition (SVD) for multiway data, and show that multiway principal components (PCs) can be estimated reliably in absence of the eigengaps required by the usual PCA, and in general much more efficiently than the usual PCs. Furthermore, the sample multiway PCs are asymptotically independent and hence allow for separate and more accurate inferences about the population PCs. The practical merits of multiway PCA are further demonstrated through numerical, both simulated and real data, examples.

Submitted 14 February, 2023; originally announced February 2023.

arXiv:2209.03482 [pdf, other]

High-Dimensional Inference for Generalized Linear Models with Hidden Confounding

Authors: Jing Ouyang, Kean Ming Tan, Gongjun Xu

Abstract: Statistical inferences for high-dimensional regression models have been extensively studied for their wide applications ranging from genomics, neuroscience, to economics. However, in practice, there are often potential unmeasured confounders associated with both the response and covariates, which can lead to invalidity of standard debiasing methods. This paper focuses on a generalized linear regression framework with hidden confounding and proposes a debiasing approach to address this high-dimensional problem, by adjusting for the effects induced by the unmeasured confounders. We establish consistency and asymptotic normality for the proposed debiased estimator. The finite sample performance of the proposed method is demonstrated through extensive numerical studies and an application to a genetic data set.

Submitted 11 September, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

arXiv:2110.11707 [pdf, other]

Variational Wasserstein Barycenters with c-Cyclical Monotonicity

Authors: Jinjin Chi, Zhiyao Yang, Jihong Ouyang, Ximing Li

Abstract: Wasserstein barycenter, built on the theory of optimal transport, provides a powerful framework to aggregate probability distributions, and it has increasingly attracted great attention within the machine learning community. However, it suffers from severe computational burden, especially for high dimensional and continuous settings. To this end, we develop a novel continuous approximation method for the Wasserstein barycenters problem given sample access to the input distributions. The basic idea is to introduce a variational distribution as the approximation of the true continuous barycenter, so as to frame the barycenters computation problem as an optimization problem, where parameters of the variational distribution adjust the proxy distribution to be similar to the barycenter. Leveraging the variational distribution, we construct a tractable dual formulation for the regularized Wasserstein barycenter problem with c-cyclical monotonicity, which can be efficiently solved by stochastic optimization. We provide theoretical analysis on convergence and demonstrate the practical effectiveness of our method on real applications of subset posterior aggregation and synthetic data.

Submitted 17 December, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

arXiv:2110.11112 [pdf, other]

DIF Statistical Inference without Knowing Anchoring Items

Authors: Yunxiao Chen, Chengcheng Li, Jing Ouyang, Gongjun Xu

Abstract: Establishing the invariance property of an instrument is a key step for establishing its measurement validity. Measurement invariance is typically assessed by differential item functioning (DIF) analysis, i.e., detecting DIF items whose response distribution depends on not only the latent trait measured by the instrument but also the group membership. DIF analysis is confounded by the group difference in the latent trait distributions. Many DIF analyses require knowing several anchor items that are DIF-free to draw inferences on whether each of the rest is a DIF item, where the anchor items are used to identify the latent trait distributions. When no prior information on anchor items is available, item purification methods and regularized estimation methods can be used. The former iteratively purifies the anchor set by a stepwise model selection procedure, and the latter selects the DIF-free items by a LASSO-type regularization approach. Unfortunately, unlike the methods based on a correctly specified anchor set, these methods are not guaranteed to provide valid statistical inference (e.g., confidence intervals and $p$-values). In this paper, we propose a new method for DIF analysis under a multiple indicators and multiple causes (MIMIC) model for DIF. This method adopts a minimal $L_1$ norm condition for identifying the latent trait distributions. Without requiring prior knowledge about an anchor set, it can accurately estimate the DIF effects of individual items and further draw valid statistical inferences for quantifying the uncertainty. Specifically, the inference results allow us to control the type-I error for DIF detection, which may not be possible with item purification and regularized estimation methods. The proposed method is applied to analyzing the three personality scales of the Eysenck personality questionnaire - revised (EPQ-R).

Submitted 11 January, 2023; v1 submitted 21 October, 2021; originally announced October 2021.

arXiv:2103.14885 [pdf, ps, other]

Identifiability of Latent Class Models with Covariates

Authors: Jing Ouyang, Gongjun Xu

Abstract: Latent class models with covariates are widely used for psychological, social, and educational research. Yet the fundamental identifiability issue of these models has not been fully addressed. Among the previous research on the identifiability of latent class models with covariates, Huang and Bandeen-Roche (2004, Psychometrika, 69:5-32) studied the local identifiability conditions. However, motivated by recent advances in the identifiability of the restricted latent class models, particularly Cognitive Diagnosis Models (CDMs), we show in this work that the conditions in Huang and Bandeen-Roche (2004) are only necessary but not sufficient to determine the local identifiability of the model parameters. To address the open identifiability issue for latent class models with covariates, this work establishes conditions to ensure the global identifiability of the model parameters in both strict and generic senses. Moreover, our results extend to the polytomous-response CDMs with covariates, which generalizes the existing identifiability results for CDMs.

Submitted 7 February, 2022; v1 submitted 27 March, 2021; originally announced March 2021.

arXiv:1810.10307 [pdf, other]

Topic representation: finding more representative words in topic models

Authors: Jinjin Chi, Jihong Ouyang, Changchun Li, Xueyang Dong, Ximing Li, Xinhua Wang

Abstract: The top word list, i.e., the top-M words with highest marginal probability in a given topic, is the standard topic representation in topic models. Most of recent automatical topic labeling algorithms and popular topic quality metrics are based on it. However, we find, empirically, words in this type of top word list are not always representative. The objective of this paper is to find more representative top word lists for topics. To achieve this, we rerank the words in a given topic by further considering marginal probability on words over every other topic. The reranking list of top-M words is used to be a novel topic representation for topic models. We investigate three reranking methodologies, using (1) standard deviation weight, (2) standard deviation weight with topic size and (3) Chi Square $\chi^2$ statistic selection. Experimental results on real world collections indicate that our representations can extract more representative words for topics, agreeing with human judgements.

Submitted 23 October, 2018; originally announced October 2018.

Comments: The paper has been submitted to Pattern Recognition Letters and is being reviewed

arXiv:1409.7454 [pdf, other]

A Bayesian spatial temporal mixtures approach to kinetic parametric images in dynamic Positron Emission Tomography

Authors: Wanchuang Zhu, Jinsong Ouyang, Yothin Rakvongthai, N. J. Guehl, D. W. Wooten, G. El Fakhri, M. D. Normandin, Yanan Fan

Abstract: We present a fully Bayesian statistical approach to the problem of compartmental modelling in the context of Positron Emission Tomography. We cluster homogeneous region of interest and perform kinetic parameter estimation simultaneously. A mixture modelling approach is adopted, incorporating both spatial and temporal information based on reconstructed dynamic PET image. Our modelling approach is flexible, and provides uncertainty estimates for the estimated kinetic parameters. Crucially, the proposed method allows us to determine the unknown number of clusters, which has a great impact on resulting estimated kinetic parameters. We demonstrate our method on simulated dynamic Myocardial PET data, and show that our method is superior to standard curve-fitting approach.

Submitted 7 February, 2016; v1 submitted 25 September, 2014; originally announced September 2014.

Comments: 30 pages

Showing 1–9 of 9 results for author: Ouyang, J