-
Unraveling the Skillsets of Data Scientists: Text Mining Analysis of Dutch University Master Programs in Data Science and Artificial Intelligence
Authors:
Mathijs J. Mol,
Barbara Belfi,
Zsuzsa Bakk
Abstract:
The growing demand for data scientists in the global labor market and the Netherlands has led to a rise in data science and artificial intelligence (AI) master programs offered by universities. However, there is still a lack of clarity regarding the specific skillsets of data scientists. This study aims to address this issue by employing Correlated Topic Modeling (CTM) to analyse the content of 41 master programs offered by seven Dutch universities. We assess the differences and similarities in the core skills taught by these programs, determine the subject-specific and general nature of the skills, and provide a comparison between the different types of universities offering these programs. Our findings reveal that research, data processing, statistics and ethics are the predominant skills taught in Dutch data science and AI master programs, with general universities emphasizing research skills and technical universities focusing more on IT and electronic skills. This study contributes to a better understanding of the diverse skillsets of data scientists, which is essential for employers, universities, and prospective students.
Submitted 23 October, 2023;
originally announced October 2023.
-
Multilevel latent class analysis with covariates: Analysis of cross-national citizenship norms with a two-stage approach
Authors:
Roberto Di Mari,
Zsuzsa Bakk,
Jennifer Oser,
Jouni Kuha
Abstract:
This paper focuses on the substantive application of multilevel LCA to the evolution of citizenship norms in a diverse array of democratic countries. To do so, we present a two-stage approach to fit multilevel latent class models: in the first stage (measurement model construction), unconditional class enumeration is done separately on both low- and high-level latent variables, estimating only a part of the model at a time -- hence keeping the remaining part fixed -- and then updating the full measurement model; in the second stage (structural model construction), individual and/or group covariates are included in the model. By separating the two parts -- first stage and second stage of model building -- the measurement model is stabilized and is allowed to be determined only by its indicators. Moreover, this two-step approach makes the inclusion/exclusion of a covariate a relatively simple task to handle. Our proposal amends common practice in applied social science research, where simple (low-level) LCA is done to obtain a classification of low-level units, and this is then related to (low- and high-level) covariates by simply including group fixed effects. Our analysis identifies latent classes that score either consistently high or consistently low on all measured items, along with two theoretically important classes that place distinctive emphasis on items related to engaged citizenship and duty-based norms.
Submitted 20 July, 2023;
originally announced July 2023.
-
multilevLCA: An R Package for Single-Level and Multilevel Latent Class Analysis with Covariates
Authors:
Johan Lyrvall,
Roberto Di Mari,
Zsuzsa Bakk,
Jennifer Oser,
Jouni Kuha
Abstract:
This contribution presents a guide to the R package multilevLCA, which offers a complete and innovative set of technical tools for the latent class analysis of single-level and multilevel categorical data. We describe the available model specifications, mainly falling within the fixed-effect or random-effect approaches. Maximum likelihood estimation of the model parameters, enhanced by a refined initialization strategy, is implemented either simultaneously, i.e., in one-step, or by means of the more advantageous two-step estimator. The package features i) semi-automatic model selection when a priori information on the number of classes is lacking, ii) predictors of class membership, and iii) output visualization tools for any of the available model specifications. All functionalities are illustrated by means of a real application on citizenship norms data, which are available in the package.
Submitted 10 April, 2024; v1 submitted 12 May, 2023;
originally announced May 2023.
-
StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables
Authors:
Sacha Morin,
Robin Legault,
Félix Laliberté,
Zsuzsa Bakk,
Charles-Édouard Giguère,
Roxane de la Sablonnière,
Éric Lacourse
Abstract:
StepMix is an open-source Python package for the pseudo-likelihood estimation (one-, two- and three-step approaches) of generalized finite mixture models (latent profile and latent class analysis) with external variables (covariates and distal outcomes). In many applications in social sciences, the main objective is not only to cluster individuals into latent classes, but also to use these classes to develop more complex statistical models. These models generally divide into a measurement model that relates the latent classes to observed indicators, and a structural model that relates covariates and outcome variables to the latent classes. The measurement and structural models can be estimated jointly using the so-called one-step approach or sequentially using stepwise methods, which present significant advantages for practitioners regarding the interpretability of the estimated latent classes. In addition to the one-step approach, StepMix implements the most important stepwise estimation methods from the literature, including the bias-adjusted three-step methods with Bolck-Croon-Hagenaars and maximum likelihood corrections and the more recent two-step approach. These pseudo-likelihood estimators are presented in this paper under a unified framework as specific expectation-maximization subroutines. To facilitate and promote their adoption among the data science community, StepMix follows the object-oriented design of the scikit-learn library and provides an additional R wrapper.
Submitted 16 June, 2024; v1 submitted 7 April, 2023;
originally announced April 2023.
-
Two-step estimation of latent trait models
Authors:
Jouni Kuha,
Zsuzsa Bakk
Abstract:
We consider two-step estimation of latent variable models, in which just the measurement model is estimated in the first step and the measurement parameters are then fixed at their estimated values in the second step where the structural model is estimated. We show how this approach can be implemented for latent trait models (item response theory models) where the latent variables are continuous and their measurement indicators are categorical variables. The properties of two-step estimators are examined using simulation studies and applied examples. They perform well, and have attractive practical and conceptual properties compared to the alternative one-step and three-step approaches. These results are in line with previous findings for other families of latent variable models. This provides strong evidence that two-step estimation is a flexible and useful general method of estimation for different types of latent variable models.
Submitted 28 March, 2023;
originally announced March 2023.
-
A two-step estimator for multilevel latent class analysis with covariates
Authors:
Roberto Di Mari,
Zsuzsa Bakk,
Jennifer Oser,
Jouni Kuha
Abstract:
We propose a two-step estimator for multilevel latent class analysis (LCA) with covariates. The measurement model for observed items is estimated in the first step, and in the second step covariates are added to the model, keeping the measurement model parameters fixed. We discuss model identification, and derive an Expectation Maximization algorithm for efficient implementation of the estimator. By means of an extensive simulation study we show that (i) this approach performs similarly to existing stepwise estimators for multilevel LCA but with much reduced computing time, and (ii) it yields approximately unbiased parameter estimates with a negligible loss of efficiency compared to the one-step estimator. The proposal is illustrated with a cross-national analysis of predictors of citizenship norms.
Submitted 5 July, 2023; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Cluster-weighted latent class modeling
Authors:
Roberto Di Mari,
Antonio Punzo,
Zsuzsa Bakk
Abstract:
Usually in Latent Class Analysis (LCA), external predictors are taken to be cluster conditional probability predictors (LC models with covariates), and/or score conditional probability predictors (LC regression models). In such cases, their distribution is not of interest. Class-specific distribution is of interest in the distal outcome model, when the distribution of the external variable(s) is assumed to depend on LC membership. In this paper, we consider a more general formulation, typical in cluster-weighted models, which embeds both the latent class regression and the distal outcome models. This allows us to test simultaneously both whether the distribution of the covariate(s) differs across classes, and whether there are significant direct effects of the covariate(s) on the indicators, by including most of the information about the covariate(s) - latent variable relationship. We show the advantages of the proposed modeling approach through a set of population studies and an empirical application on assets ownership of Italian households.
Submitted 4 January, 2018;
originally announced January 2018.