Similarity-driven and Task-driven Models for Diversity of Opinion in Crowdsourcing Markets

Authors: Chen Jason Zhang, Yunrui Liu, Pengcheng Zeng, Ting Wu, Lei Chen, Pan Hui, Fei Hao

Abstract: The recent boom in crowdsourcing has opened up a new avenue for utilizing human intelligence in the realm of data analysis. This innovative approach provides a powerful means for connecting online workers to tasks that cannot effectively be done solely by machines or conducted by professional experts due to cost constraints. Within the field of social science, four elements are required to construct a sound crowd - Diversity of Opinion, Independence, Decentralization and Aggregation. However, while the other three components have already been investigated and implemented in existing crowdsourcing platforms, 'Diversity of Opinion' has not been functionally enabled yet. From a computational point of view, constructing a wise crowd necessitates quantitatively modeling and taking diversity into account. There are usually two paradigms in a crowdsourcing marketplace for worker selection: building a crowd to wait for tasks to come and selecting workers for a given task. We propose similarity-driven and task-driven models for both paradigms. Also, we develop efficient and effective algorithms for recruiting a limited number of workers with optimal diversity in both models. To validate our solutions, we conduct extensive experiments using both synthetic datasets and real data sets.

Submitted 28 February, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 37 pages, 11 figures

arXiv:2305.07481 [pdf, other]

Extended ADMM for general penalized quantile regression with linear constraints in big data

Authors: Yongxin Liu, Peng Zeng

Abstract: Quantile regression (QR) can be used to describe the comprehensive relationship between a response and predictors. Prior domain knowledge and assumptions in application are usually formulated as constraints of parameters to improve the estimation efficiency. This paper develops methods based on multi-block ADMM to fit general penalized QR with linear constraints of regression coefficients. Different formulations to handle the linear constraints and general penalty are explored and compared. The most efficient one has explicit expressions for each parameter and avoids nested-loop iterations in some existing algorithms. Additionally, a parallel ADMM algorithm for big data is developed for when data are stored in a distributed fashion. The stopping criterion and convergence of the algorithm are established. Extensive numerical experiments and a real data example demonstrate the computational efficiency of the proposed algorithms. The details of theoretical proofs and different algorithm variations are presented in the Appendix.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2211.10541 [pdf, ps, other]

Phase transition and higher order analysis of $L_q$ regularization under dependence

Authors: Hanwen Huang, Peng Zeng, Qinglong Yang

Abstract: We study the problem of estimating a $k$-sparse signal $\boldsymbol{\beta}_0\in\mathbf{R}^p$ from a set of noisy observations $\mathbf{y}\in\mathbf{R}^n$ under the model $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{w}$, where $\mathbf{X}\in\mathbf{R}^{n\times p}$ is the measurement matrix whose rows are drawn from the distribution $N(0,\boldsymbol{\Sigma})$. We consider the class of $L_q$-regularized least squares (LQLS) given by the formulation $\hat{\boldsymbol{\beta}}(\lambda,q)=\text{argmin}_{\boldsymbol{\beta}\in\mathbf{R}^p}\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^2_2+\lambda\|\boldsymbol{\beta}\|_q^q$, where $\|\cdot\|_q$ $(0\le q\le 2)$ denotes the $L_q$-norm. In the setting $p,n,k\rightarrow\infty$ with fixed $k/p=\epsilon$ and $n/p=\delta$, we derive the asymptotic risk of $\hat{\boldsymbol{\beta}}(\lambda,q)$ for arbitrary covariance matrix $\boldsymbol{\Sigma}$, which generalizes the existing results for standard Gaussian design, i.e. $X_{ij}\overset{i.i.d}{\sim}N(0,1)$. We perform a higher-order analysis for LQLS in the small-error regime in which the first dominant term can be used to determine the phase transition behavior of LQLS. Our results show that the first dominant term does not depend on the covariance structure of $\boldsymbol{\Sigma}$ in the cases $0\le q< 1$ and $1< q\le 2$, which indicates that the correlations among predictors only affect the phase transition curve in the case $q=1$, a.k.a. LASSO. To study the influence of the covariance structure of $\boldsymbol{\Sigma}$ on the performance of LQLS in the cases $0\le q< 1$ and $1<q\le 2$, we derive the explicit formulas for the second dominant term in the expansion of the asymptotic risk in terms of small error. Extensive computational experiments confirm that our analytical predictions are consistent with numerical results.

Submitted 1 December, 2022; v1 submitted 18 November, 2022; originally announced November 2022.

Comments: 35 pages, 11 figures

arXiv:2205.09523 [pdf, other]

scICML: Information-theoretic Co-clustering-based Multi-view Learning for the Integrative Analysis of Single-cell Multi-omics data

Authors: Pengcheng Zeng, Zhixiang Lin

Abstract: Modern high-throughput sequencing technologies have enabled us to profile multiple molecular modalities from the same single cell, providing unprecedented opportunities to assay cellular heterogeneity from multiple biological layers. However, the datasets generated from these technologies tend to have a high level of noise and are highly sparse, bringing challenges to data analysis. In this paper, we develop a novel information-theoretic co-clustering-based multi-view learning (scICML) method for multi-omics single-cell data integration. scICML utilizes co-clusterings to aggregate similar features for each view of data and uncover the common clustering pattern for cells. In addition, scICML automatically matches the clusters of the linked features across different data types for considering the biological dependency structure across different types of genomic features. Our experiments on four real-world datasets demonstrate that scICML improves the overall clustering performance and provides biological insights into the data analysis of peripheral blood mononuclear cells.

Submitted 19 May, 2022; originally announced May 2022.

Comments: 11 pages; 1 figure

arXiv:2109.09936 [pdf, other]

doi 10.1080/01621459.2021.1918554

A Model-free Variable Screening Method Based on Leverage Score

Authors: Wenxuan Zhong, Yiwen Liu, Peng Zeng

Abstract: With rapid advances in information technology, massive datasets are collected in all fields of science, such as biology, chemistry, and social science. Useful or meaningful information is extracted from these data often through statistical learning or model fitting. In massive datasets, both sample size and number of predictors can be large, in which case conventional methods face computational challenges. Recently, an innovative and effective sampling scheme based on leverage scores via singular value decompositions has been proposed to select rows of a design matrix as a surrogate of the full data in linear regression. Analogously, variable screening can be viewed as selecting rows of the design matrix. However, effective variable selection along this line of thinking remains elusive. In this article, we bridge this gap to propose a weighted leverage variable screening method by utilizing both the left and right singular vectors of the design matrix. We show theoretically and empirically that the predictors selected using our method can consistently include true predictors not only for linear models but also for complicated general index models. Extensive simulation studies show that the weighted leverage screening method is highly computationally efficient and effective. We also demonstrate its success in identifying carcinoma related genes using spatial transcriptome data.

Submitted 20 September, 2021; originally announced September 2021.

Comments: Journal of the American Statistical Association, published online: 21 Jun 2021

arXiv:2011.02304 [pdf, ps, other]

Joint Curve Registration and Classification with Two-level Functional Models

Authors: Lin Tang, Pengcheng Zeng, Jian Qing Shi, Won-Seok Kim

Abstract: Many classification techniques for data that are curves or functions have recently been proposed. However, misalignment in the curves can influence the performance of most of them. In this paper, we propose a model-based approach for simultaneous curve registration and classification. The method performs curve classification based on a functional logistic regression model that relies on both scalar variables and functional variables, and aligns curves simultaneously via a data registration model. EM-based algorithms are developed to perform maximum likelihood inference of the proposed models. We establish the identifiability results for the curve registration model and investigate the asymptotic properties of the proposed estimation procedures. Simulation studies are conducted to demonstrate the finite sample performance of the proposed models. An application to hyoid bone movement data from stroke patients reveals the effectiveness of the new models.

Submitted 4 November, 2020; originally announced November 2020.

Comments: 27 pages, 8 figures

arXiv:2003.12970 [pdf, other]

Elastic Coupled Co-clustering for Single-Cell Genomic Data

Authors: Pengcheng Zeng, Zhixiang Lin

Abstract: The recent advances in single-cell technologies have enabled us to profile genomic features at unprecedented resolution and datasets from multiple domains are available, including datasets that profile different types of genomic features and datasets that profile the same type of genomic features across different species. These datasets typically have different powers in identifying the unknown cell types through clustering, and data integration can potentially lead to a better performance of clustering algorithms. In this work, we formulate the problem in an unsupervised transfer learning framework, which utilizes knowledge learned from an auxiliary dataset to improve the clustering performance of the target dataset. The degree of shared information among the target and auxiliary datasets can vary, and their distributions can also be different. To address these challenges, we propose an elastic coupled co-clustering based transfer learning algorithm, by elastically propagating clustering knowledge obtained from the auxiliary dataset to the target dataset. Implementation on single-cell genomic datasets shows that our algorithm greatly improves clustering performance over the traditional learning algorithms. The source code and data sets are available at https://github.com/cuhklinlab/elasticC3.

Submitted 5 June, 2020; v1 submitted 29 March, 2020; originally announced March 2020.

Comments: 18 pages, 3 figures, 2 tables

arXiv:1711.04761 [pdf, other]

Simultaneous Registration and Clustering for Multi-dimensional Functional Data

Authors: Pengcheng Zeng, Jian Qing Shi, Won-Seok Kim

Abstract: The clustering of misaligned functional data has drawn much attention in the last decade. Most methods perform the clustering after the functional data have been registered, and there has been little research using both functional and scalar variables. In this paper, we propose a simultaneous registration and clustering (SRC) model via two-level models, allowing the use of both types of variables and also allowing simultaneous registration and clustering. For the data collected from subjects in different unknown groups, a Gaussian process functional regression model with time warping is used as the first level model; an allocation model depending on scalar variables is used as the second level model providing further information over the groups. The former carries out registration and modeling for the multi-dimensional functional data (2D or 3D curves) at the same time. This methodology is implemented using an EM algorithm, and is examined on both simulated data and real data.

Submitted 13 November, 2017; originally announced November 2017.

Comments: 36 pages, 13 figures

arXiv:1408.2794 [pdf, other]

Sector-Based Factor Models for Asset Returns

Authors: Angela Gu, Patrick Zeng

Abstract: Factor analysis is a statistical technique employed to evaluate how observed variables correlate through common factors and unique variables. While it is often used to analyze price movement in the unstable stock market, it does not always yield easily interpretable results. In this study, we develop improved factor models by explicitly incorporating sector information on our studied stocks. We add eleven sectors of stocks as defined by the IBES, represented by respective sector-specific factors, to non-specific market factors to revise the factor model. We then develop an expectation maximization (EM) algorithm to compute our revised model with 15 years' worth of S&P 500 stocks' daily close prices. Our results in most sectors show that nearly all of these factor components have the same sign, consistent with the intuitive idea that stocks in the same sector tend to rise and fall in coordination over time. Results obtained by the classic factor model, in contrast, had a homogeneous blend of positive and negative components. We conclude that results produced by our sector-based factor model are more interpretable than those produced by the classic non-sector-based model for at least some stock sectors.

Submitted 11 August, 2014; originally announced August 2014.

Comments: 10 pages, 6 figures

Showing 1–9 of 9 results for author: Zeng, P