-
Similarity-driven and Task-driven Models for Diversity of Opinion in Crowdsourcing Markets
Authors:
Chen Jason Zhang,
Yunrui Liu,
Pengcheng Zeng,
Ting Wu,
Lei Chen,
Pan Hui,
Fei Hao
Abstract:
The recent boom in crowdsourcing has opened up a new avenue for utilizing human intelligence in the realm of data analysis. This innovative approach provides a powerful means for connecting online workers to tasks that cannot effectively be done solely by machines or conducted by professional experts due to cost constraints. Within the field of social science, four elements are required to construct a sound crowd - Diversity of Opinion, Independence, Decentralization and Aggregation. However, while the other three components have already been investigated and implemented in existing crowdsourcing platforms, 'Diversity of Opinion' has not been functionally enabled yet. From a computational point of view, constructing a wise crowd necessitates quantitatively modeling and taking diversity into account. There are usually two paradigms in a crowdsourcing marketplace for worker selection: building a crowd to wait for tasks to come and selecting workers for a given task. We propose similarity-driven and task-driven models for both paradigms. Also, we develop efficient and effective algorithms for recruiting a limited number of workers with optimal diversity in both models. To validate our solutions, we conduct extensive experiments using both synthetic datasets and real data sets.
Submitted 28 February, 2024; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Extended ADMM for general penalized quantile regression with linear constraints in big data
Authors:
Yongxin Liu,
Peng Zeng
Abstract:
Quantile regression (QR) can be used to describe the comprehensive relationship between a response and predictors. Prior domain knowledge and assumptions in applications are usually formulated as constraints on parameters to improve the estimation efficiency. This paper develops methods based on multi-block ADMM to fit general penalized QR with linear constraints on the regression coefficients. Different formulations to handle the linear constraints and general penalty are explored and compared. The most efficient one has explicit expressions for each parameter and avoids the nested-loop iterations found in some existing algorithms. Additionally, a parallel ADMM algorithm for big data is developed for the case when data are stored in a distributed fashion. The stopping criterion and convergence of the algorithm are established. Extensive numerical experiments and a real data example demonstrate the computational efficiency of the proposed algorithms. The details of theoretical proofs and different algorithm variations are presented in the Appendix.
Submitted 12 May, 2023;
originally announced May 2023.
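For readers unfamiliar with quantile regression, the loss it minimizes can be sketched in a few lines. This is only an illustration of the standard check (pinball) loss, not the ADMM algorithm of the paper above; the function names `check_loss`, `total_check_loss`, and `best_constant` are hypothetical and chosen for this example.

```python
from typing import Sequence


def check_loss(u: float, tau: float) -> float:
    """Standard quantile (check) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_check_loss(ys: Sequence[float], c: float, tau: float) -> float:
    """Sum of check losses when every observation is predicted by the constant c."""
    return sum(check_loss(y - c, tau) for y in ys)


def best_constant(ys: Sequence[float], tau: float) -> float:
    """Constant minimizing the total check loss.

    A minimizer always exists among the observed values, so searching
    over the data points is enough for this small illustration.
    """
    return min(ys, key=lambda c: total_check_loss(ys, c, tau))


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(best_constant(data, 0.5))  # a sample median
    print(best_constant(data, 0.9))  # an upper sample quantile
```

With an intercept-only model the minimizer is a sample quantile, which is why fitting the same loss with predictors yields conditional quantiles.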
-
Phase transition and higher order analysis of $L_q$ regularization under dependence
Authors:
Hanwen Huang,
Peng Zeng,
Qinglong Yang
Abstract:
We study the problem of estimating a $k$-sparse signal $\boldsymbol{\beta}_0\in\mathbf{R}^p$ from a set of noisy observations $\mathbf{y}\in\mathbf{R}^n$ under the model $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{w}$, where $\mathbf{X}\in\mathbf{R}^{n\times p}$ is the measurement matrix whose rows are drawn from the distribution $N(0,\boldsymbol{\Sigma})$. We consider the class of $L_q$-regularized least squares (LQLS) given by the formulation $\hat{\boldsymbol{\beta}}(\lambda,q)=\text{argmin}_{\boldsymbol{\beta}\in\mathbf{R}^p}\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2_2+\lambda\|\boldsymbol{\beta}\|_q^q$, where $\|\cdot\|_q$ $(0\le q\le 2)$ denotes the $L_q$-norm. In the setting $p,n,k\rightarrow\infty$ with fixed $k/p=\epsilon$ and $n/p=\delta$, we derive the asymptotic risk of $\hat{\boldsymbol{\beta}}(\lambda,q)$ for an arbitrary covariance matrix $\boldsymbol{\Sigma}$, which generalizes the existing results for standard Gaussian design, i.e. $X_{ij}\overset{i.i.d}{\sim}N(0,1)$. We perform a higher-order analysis for LQLS in the small-error regime in which the first dominant term can be used to determine the phase transition behavior of LQLS. Our results show that the first dominant term does not depend on the covariance structure of $\boldsymbol{\Sigma}$ in the cases $0\le q< 1$ and $1< q\le 2$, which indicates that the correlations among predictors only affect the phase transition curve in the case $q=1$, a.k.a. LASSO. To study the influence of the covariance structure of $\boldsymbol{\Sigma}$ on the performance of LQLS in the cases $0\le q< 1$ and $1<q\le 2$, we derive the explicit formulas for the second dominant term in the expansion of the asymptotic risk in terms of small error. Extensive computational experiments confirm that our analytical predictions are consistent with numerical results.
Submitted 1 December, 2022; v1 submitted 18 November, 2022;
originally announced November 2022.
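The LQLS objective in the abstract above can be evaluated directly. The sketch below is illustrative only and says nothing about the paper's asymptotic analysis. The helper names `lqls_objective` and `soft_threshold` are hypothetical, and the $q=0$ case is treated as a count of nonzero coefficients, which is an assumption about the convention. For $q=1$ with a single predictor and unit design, the minimizer is the well-known soft-thresholding rule, which the code checks against a grid.

```python
from typing import Sequence


def lqls_objective(y: Sequence[float], X: Sequence[Sequence[float]],
                   beta: Sequence[float], lam: float, q: float) -> float:
    """Evaluate 0.5 * ||y - X beta||_2^2 + lam * ||beta||_q^q."""
    residuals = [yi - sum(xij * bj for xij, bj in zip(row, beta))
                 for yi, row in zip(y, X)]
    if q == 0:
        # Assumed convention: the L_0 "norm" counts nonzero coefficients.
        penalty = float(sum(1 for b in beta if b != 0))
    else:
        penalty = sum(abs(b) ** q for b in beta)
    return 0.5 * sum(r * r for r in residuals) + lam * penalty


def soft_threshold(z: float, lam: float) -> float:
    """Minimizer of 0.5 * (z - b)^2 + lam * |b| over b (the scalar q = 1 case)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


if __name__ == "__main__":
    y, X, lam = [3.0], [[1.0]], 1.0
    b_hat = soft_threshold(y[0], lam)
    best = lqls_objective(y, X, [b_hat], lam, 1)
    # No grid point on [-5, 5] should beat the soft-thresholded value.
    grid_ok = all(best <= lqls_objective(y, X, [g / 10], lam, 1)
                  for g in range(-50, 51))
    print(b_hat, best, grid_ok)
```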
-
scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data
Authors:
Pengcheng Zeng,
Zhixiang Lin
Abstract:
Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated from these technologies tend to have a high level of noise and are highly sparse, bringing challenges to data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types to account for the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.
Submitted 19 May, 2022;
originally announced May 2022.
-
A Model-free Variable Screening Method Based on Leverage Score
Authors:
Wenxuan Zhong,
Yiwen Liu,
Peng Zeng
Abstract:
With rapid advances in information technology, massive datasets are collected in all fields of science, such as biology, chemistry, and social science. Useful or meaningful information is extracted from these data often through statistical learning or model fitting. In massive datasets, both sample size and number of predictors can be large, in which case conventional methods face computational challenges. Recently, an innovative and effective sampling scheme based on leverage scores via singular value decompositions has been proposed to select rows of a design matrix as a surrogate of the full data in linear regression. Analogously, variable screening can be viewed as selecting rows of the design matrix. However, effective variable selection along this line of thinking remains elusive. In this article, we bridge this gap to propose a weighted leverage variable screening method by utilizing both the left and right singular vectors of the design matrix. We show theoretically and empirically that the predictors selected using our method can consistently include true predictors not only for linear models but also for complicated general index models. Extensive simulation studies show that the weighted leverage screening method is highly computationally efficient and effective. We also demonstrate its success in identifying carcinoma related genes using spatial transcriptome data.
Submitted 20 September, 2021;
originally announced September 2021.
-
Joint Curve Registration and Classification with Two-level Functional Models
Authors:
Lin Tang,
Pengcheng Zeng,
Jian Qing Shi,
Won-Seok Kim
Abstract:
Many classification techniques for data that are curves or functions have recently been proposed. However, misalignment among the curves can degrade the performance of most of them. In this paper, we propose a model-based approach for simultaneous curve registration and classification. The method performs curve classification based on a functional logistic regression model that relies on both scalar variables and functional variables, and aligns curves simultaneously via a data registration model. EM-based algorithms are developed to perform maximum likelihood inference of the proposed models. We establish the identifiability results for the curve registration model and investigate the asymptotic properties of the proposed estimation procedures. Simulation studies are conducted to demonstrate the finite sample performance of the proposed models. An application to hyoid bone movement data from stroke patients reveals the effectiveness of the new models.
Submitted 4 November, 2020;
originally announced November 2020.
-
Elastic Coupled Co-clustering for Single-Cell Genomic Data
Authors:
Pengcheng Zeng,
Zhixiang Lin
Abstract:
The recent advances in single-cell technologies have enabled us to profile genomic features at unprecedented resolution, and datasets from multiple domains are available, including datasets that profile different types of genomic features and datasets that profile the same type of genomic features across different species. These datasets typically have different powers in identifying the unknown cell types through clustering, and data integration can potentially lead to a better performance of clustering algorithms. In this work, we formulate the problem in an unsupervised transfer learning framework, which utilizes knowledge learned from an auxiliary dataset to improve the clustering performance of the target dataset. The degree of shared information among the target and auxiliary datasets can vary, and their distributions can also be different. To address these challenges, we propose an elastic coupled co-clustering based transfer learning algorithm, by elastically propagating clustering knowledge obtained from the auxiliary dataset to the target dataset. Implementation on single-cell genomic datasets shows that our algorithm greatly improves clustering performance over the traditional learning algorithms. The source code and data sets are available at https://github.com/cuhklinlab/elasticC3.
Submitted 5 June, 2020; v1 submitted 29 March, 2020;
originally announced March 2020.
-
Simultaneous Registration and Clustering for Multi-dimensional Functional Data
Authors:
Pengcheng Zeng,
Jian Qing Shi,
Won-Seok Kim
Abstract:
The clustering of misaligned functional data has drawn much attention in the last decade. Most methods do the clustering after the functional data have been registered, and there has been little research using both functional and scalar variables. In this paper, we propose a simultaneous registration and clustering (SRC) model via two-level models, allowing the use of both types of variables and also allowing simultaneous registration and clustering. For the data collected from subjects in different unknown groups, a Gaussian process functional regression model with time warping is used as the first level model; an allocation model depending on scalar variables is used as the second level model providing further information over the groups. The former carries out registration and modeling for the multi-dimensional functional data (2D or 3D curves) at the same time. This methodology is implemented using an EM algorithm, and is examined on both simulated data and real data.
Submitted 13 November, 2017;
originally announced November 2017.
-
Sector-Based Factor Models for Asset Returns
Authors:
Angela Gu,
Patrick Zeng
Abstract:
Factor analysis is a statistical technique employed to evaluate how observed variables correlate through common factors and unique variables. While it is often used to analyze price movement in the unstable stock market, it does not always yield easily interpretable results. In this study, we develop improved factor models by explicitly incorporating sector information on our studied stocks. We add eleven sectors of stocks as defined by the IBES, represented by respective sector-specific factors, to non-specific market factors to revise the factor model. We then develop an expectation maximization (EM) algorithm to compute our revised model with 15 years' worth of S&P 500 stocks' daily close prices. Our results in most sectors show that nearly all of these factor components have the same sign, consistent with the intuitive idea that stocks in the same sector tend to rise and fall in coordination over time. Results obtained by the classic factor model, in contrast, had a homogeneous blend of positive and negative components. We conclude that results produced by our sector-based factor model are more interpretable than those produced by the classic non-sector-based model for at least some stock sectors.
Submitted 11 August, 2014;
originally announced August 2014.