Search | arXiv e-print repository

Individualized Dynamic Mediation Analysis Using Latent Factor Models

Authors: Yijiao Zhang, Yubai Yuan, Yuexia Zhang, Zhongyi Zhu, Annie Qu

Abstract: Mediation analysis plays a crucial role in causal inference as it can investigate the pathways through which treatment influences outcome. Most existing mediation analysis assumes that mediation effects are static and homogeneous within populations. However, mediation effects usually change over time and exhibit significant heterogeneity in many real-world applications. Additionally, the presence… ▽ More Mediation analysis plays a crucial role in causal inference as it can investigate the pathways through which treatment influences outcome. Most existing mediation analysis assumes that mediation effects are static and homogeneous within populations. However, mediation effects usually change over time and exhibit significant heterogeneity in many real-world applications. Additionally, the presence of unobserved confounding variables imposes a significant challenge to inferring both causal effect and mediation effect. To address these issues, we propose an individualized dynamic mediation analysis method. Our approach can identify the significant mediators of the population level while capturing the time-varying and heterogeneous mediation effects via latent factor modeling on coefficients of structural equation models. Another advantage of our method is that we can infer individualized mediation effects in the presence of unmeasured time-varying confounders. We provide estimation consistency for our proposed causal estimand and selection consistency for significant mediators. Extensive simulation studies and an application to a DNA methylation study demonstrate the effectiveness and advantages of our method. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: 25 pages, 3 figures, 3 tables

arXiv:2405.15325 [pdf, other]

On the Identification of Temporally Causal Representation with Instantaneous Dependence

Authors: Zijian Li, Yifan Shen, Kaitao Zheng, Ruichu Cai, Xiangchen Song, Mingming Gong, Zhengmao Zhu, Guangyi Chen, Kun Zhang

Abstract: Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grou** of the observa… ▽ More Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grou** of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an \textbf{ID}entification framework for instantane\textbf{O}us \textbf{L}atent dynamics (\textbf{IDOL}) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings. △ Less

Submitted 7 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.03042 [pdf, other]

Functional Post-Clustering Selective Inference with Applications to EHR Data Analysis

Authors: Zihan Zhu, Xin Gai, Anru R. Zhang

Abstract: In electronic health records (EHR) analysis, clustering patients according to patterns in their data is crucial for uncovering new subtypes of diseases. Existing medical literature often relies on classical hypothesis testing methods to test for differences in means between these clusters. Due to selection bias induced by clustering algorithms, the implementation of these classical methods on post… ▽ More In electronic health records (EHR) analysis, clustering patients according to patterns in their data is crucial for uncovering new subtypes of diseases. Existing medical literature often relies on classical hypothesis testing methods to test for differences in means between these clusters. Due to selection bias induced by clustering algorithms, the implementation of these classical methods on post-clustering data often leads to an inflated type-I error. In this paper, we introduce a new statistical approach that adjusts for this bias when analyzing data collected over time. Our method extends classical selective inference methods for cross-sectional data to longitudinal data. We provide theoretical guarantees for our approach with upper bounds on the selective type-I and type-II errors. We apply the method to simulated data and real-world Acute Kidney Injury (AKI) EHR datasets, thereby illustrating the advantages of our approach. △ Less

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.02539 [pdf, ps, other]

Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models

Authors: Changxin Yang, Zhongyi Zhu, Heng Lian

Abstract: While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converge… ▽ More While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converges with a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in predictions and subset selection, with the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method also exhibits similar superiority. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.02372 [pdf, ps, other]

Triadic-OCD: Asynchronous Online Change Detection with Provable Robustness, Optimality, and Convergence

Authors: Yancheng Huang, Kai Yang, Zelin Zhu, Leian Chen

Abstract: The primary goal of online change detection (OCD) is to promptly identify changes in the data stream. OCD problem find a wide variety of applications in diverse areas, e.g., security detection in smart grids and intrusion detection in communication networks. Prior research usually assumes precise knowledge of the system parameters. Nevertheless, this presumption often proves unattainable in practi… ▽ More The primary goal of online change detection (OCD) is to promptly identify changes in the data stream. OCD problem find a wide variety of applications in diverse areas, e.g., security detection in smart grids and intrusion detection in communication networks. Prior research usually assumes precise knowledge of the system parameters. Nevertheless, this presumption often proves unattainable in practical scenarios due to factors such as estimation errors, system updates, etc. This paper aims to take the first attempt to develop a triadic-OCD framework with certifiable robustness, provable optimality, and guaranteed convergence. In addition, the proposed triadic-OCD algorithm can be realized in a fully asynchronous distributed manner, easing the necessity of transmitting the data to a single server. This asynchronous mechanism could also mitigate the straggler issue that faced by traditional synchronous algorithm. Moreover, the non-asymptotic convergence property of Triadic-OCD is theoretically analyzed, and its iteration complexity to achieve an $ε$-optimal point is derived. Extensive experiments have been conducted to elucidate the effectiveness of the proposed method. △ Less

Submitted 4 June, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

Comments: Accepted at ICML2024

arXiv:2404.11579 [pdf, other]

Spatial Heterogeneous Additive Partial Linear Model: A Joint Approach of Bivariate Spline and Forest Lasso

Authors: Xin Zhang, Shan Yu, Zhengyuan Zhu, Xin Wang

Abstract: Identifying spatial heterogeneous patterns has attracted a surge of research interest in recent years, due to its important applications in various scientific and engineering fields. In practice the spatially heterogeneous components are often mixed with components which are spatially smooth, making the task of identifying the heterogeneous regions more challenging. In this paper, we develop an ef… ▽ More Identifying spatial heterogeneous patterns has attracted a surge of research interest in recent years, due to its important applications in various scientific and engineering fields. In practice the spatially heterogeneous components are often mixed with components which are spatially smooth, making the task of identifying the heterogeneous regions more challenging. In this paper, we develop an efficient clustering approach to identify the model heterogeneity of the spatial additive partial linear model. Specifically, we aim to detect the spatially contiguous clusters based on the regression coefficients while introducing a spatially varying intercept to deal with the smooth spatial effect. On the one hand, to approximate the spatial varying intercept, we use the method of bivariate spline over triangulation, which can effectively handle the data from a complex domain. On the other hand, a novel fusion penalty termed the forest lasso is proposed to reveal the spatial clustering pattern. Our proposed fusion penalty has advantages in both the estimation and computation efficiencies when dealing with large spatial data. Theoretically properties of our estimator are established, and simulation results show that our approach can achieve more accurate estimation with a limited computation cost compared with the existing approaches. To illustrate its practical use, we apply our approach to analyze the spatial pattern of the relationship between land surface temperature measured by satellites and air temperature measured by ground stations in the United States. △ Less

Submitted 3 May, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.04317 [pdf, other]

DeepLINK-T: deep learning inference for time series data using knockoffs and LSTM

Authors: Wenxuan Zuo, Zifan Zhu, Yuxuan Du, Yi-Chun Yeh, Jed A. Fuhrman, **chi Lv, Yingying Fan, Fengzhu Sun

Abstract: High-dimensional longitudinal time series data is prevalent across various real-world applications. Many such applications can be modeled as regression problems with high-dimensional time series covariates. Deep learning has been a popular and powerful tool for fitting these regression models. Yet, the development of interpretable and reproducible deep-learning models is challenging and remains un… ▽ More High-dimensional longitudinal time series data is prevalent across various real-world applications. Many such applications can be modeled as regression problems with high-dimensional time series covariates. Deep learning has been a popular and powerful tool for fitting these regression models. Yet, the development of interpretable and reproducible deep-learning models is challenging and remains underexplored. This study introduces a novel method, Deep Learning Inference using Knockoffs for Time series data (DeepLINK-T), focusing on the selection of significant time series variables in regression while controlling the false discovery rate (FDR) at a predetermined level. DeepLINK-T combines deep learning with knockoff inference to control FDR in feature selection for time series models, accommodating a wide variety of feature distributions. It addresses dependencies across time and features by leveraging a time-varying latent factor structure in time series covariates. Three key ingredients for DeepLINK-T are 1) a Long Short-Term Memory (LSTM) autoencoder for generating time series knockoff variables, 2) an LSTM prediction network using both original and knockoff variables, and 3) the application of the knockoffs framework for variable selection with FDR control. Extensive simulation studies have been conducted to evaluate DeepLINK-T's performance, showing its capability to control FDR effectively while demonstrating superior feature selection power for high-dimensional longitudinal time series data compared to its non-time series counterpart. DeepLINK-T is further applied to three metagenomic data sets, validating its practical utility and effectiveness, and underscoring its potential in real-world applications. △ Less

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.03764 [pdf, other]

CONCERT: Covariate-Elaborated Robust Local Information Transfer with Conditional Spike-and-Slab Prior

Authors: Ruqian Zhang, Yijiao Zhang, Annie Qu, Zhongyi Zhu, Juan Shen

Abstract: The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only local information is shared. In this paper, we propose a novel Bayesian transfer learning method named "CO… ▽ More The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only local information is shared. In this paper, we propose a novel Bayesian transfer learning method named "CONCERT" to allow robust local information transfer for high-dimensional data analysis. A novel conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer. By incorporating covariate-specific priors, we can characterize the local similarities and make the sources work collaboratively to help improve the performance on the target. Distinguished from existing work, CONCERT is a one-step procedure, which achieves variable selection and information transfer simultaneously. Variable selection consistency is established for our CONCERT. To make our algorithm scalable, we adopt the variational Bayes framework to facilitate implementation. Extensive experiments and a genetic data analysis demonstrate the validity and the advantage of CONCERT over existing cutting-edge transfer learning methods. We also extend our CONCERT to the logistical models with numerical studies showing its superiority over other methods. △ Less

Submitted 30 March, 2024; originally announced April 2024.

Comments: 31 pages, 22 figures

arXiv:2401.15811 [pdf, other]

Seller-Side Experiments under Interference Induced by Feedback Loops in Two-Sided Platforms

Authors: Zhihua Zhu, Zheng Cai, Liang Zheng, Nian Si

Abstract: Two-sided platforms are central to modern commerce and content sharing and often utilize A/B testing for develo** new features. While user-side experiments are common, seller-side experiments become crucial for specific interventions and metrics. This paper investigates the effects of interference caused by feedback loops on seller-side experiments in two-sided platforms, with a particular focus… ▽ More Two-sided platforms are central to modern commerce and content sharing and often utilize A/B testing for develo** new features. While user-side experiments are common, seller-side experiments become crucial for specific interventions and metrics. This paper investigates the effects of interference caused by feedback loops on seller-side experiments in two-sided platforms, with a particular focus on the counterfactual interleaving design, proposed in \citet{ha2020counterfactual,nandy2021b}. These feedback loops, often generated by pacing algorithms, cause outcomes from earlier sessions to influence subsequent ones. This paper contributes by creating a mathematical framework to analyze this interference, theoretically estimating its impact, and conducting empirical evaluations of the counterfactual interleaving design in real-world scenarios. Our research shows that feedback loops can result in misleading conclusions about the treatment effects. △ Less

Submitted 9 February, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

arXiv:2401.02592 [pdf, other]

Guaranteed Nonconvex Factorization Approach for Tensor Train Recovery

Authors: Zhen Qin, Michael B. Wakin, Zhihui Zhu

Abstract: In this paper, we provide the first convergence guarantee for the factorization approach. Specifically, to avoid the scaling ambiguity and to facilitate theoretical analysis, we optimize over the so-called left-orthogonal TT format which enforces orthonormality among most of the factors. To ensure the orthonormal structure, we utilize the Riemannian gradient descent (RGD) for optimizing those fact… ▽ More In this paper, we provide the first convergence guarantee for the factorization approach. Specifically, to avoid the scaling ambiguity and to facilitate theoretical analysis, we optimize over the so-called left-orthogonal TT format which enforces orthonormality among most of the factors. To ensure the orthonormal structure, we utilize the Riemannian gradient descent (RGD) for optimizing those factors over the Stiefel manifold. We first delve into the TT factorization problem and establish the local linear convergence of RGD. Notably, the rate of convergence only experiences a linear decline as the tensor order increases. We then study the sensing problem that aims to recover a TT format tensor from linear measurements. Assuming the sensing operator satisfies the restricted isometry property (RIP), we show that with a proper initialization, which could be obtained through spectral initialization, RGD also converges to the ground-truth tensor at a linear rate. Furthermore, we expand our analysis to encompass scenarios involving Gaussian noise in the measurements. We prove that RGD can reliably recover the ground truth at a linear rate, with the recovery error exhibiting only polynomial growth in relation to the tensor order. We conduct various experiments to validate our theoretical findings. △ Less

Submitted 4 January, 2024; originally announced January 2024.

arXiv:2312.14534 [pdf, other]

Global Rank Sum Test: An Efficient Rank-Based Nonparametric Test for Large Scale Online Experiment

Authors: Zheng Cai, Bo Hu, Zhihua Zhu

Abstract: Online experiments are widely used for improving online services. While doing online experiments, The student t-test is the most widely used hypothesis testing technique. In practice, however, the normality assumption on which the t-test depends on may fail, which resulting in untrustworthy results. In this paper, we first discuss the question of when the t-test fails, and thus introduce the rank-… ▽ More Online experiments are widely used for improving online services. While doing online experiments, The student t-test is the most widely used hypothesis testing technique. In practice, however, the normality assumption on which the t-test depends on may fail, which resulting in untrustworthy results. In this paper, we first discuss the question of when the t-test fails, and thus introduce the rank-sum test. Next, in order to solve the difficulties while implementing rank-sum test in large online experiment platforms, we proposed a global-rank-sum test method as an improvement for the traditional one. Finally, we demonstrate that the global-rank-sum test is not only more accurate and has higher statistical power than the t-test, but also more time efficient than the traditional rank-sum test, which eventually makes it possible for large online experiment platforms to use. △ Less

Submitted 22 December, 2023; originally announced December 2023.

Comments: 9 pages, 3 figures

arXiv:2310.20579 [pdf, other]

Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks

Authors: Jiayuan Ye, Zhenyu Zhu, Fanghui Liu, Reza Shokri, Volkan Cevher

Abstract: We analytically investigate how over-parameterization of models in randomized machine learning algorithms impacts the information leakage about their training data. Specifically, we prove a privacy bound for the KL divergence between model distributions on worst-case neighboring datasets, and explore its dependence on the initialization, width, and depth of fully connected neural networks. We find… ▽ More We analytically investigate how over-parameterization of models in randomized machine learning algorithms impacts the information leakage about their training data. Specifically, we prove a privacy bound for the KL divergence between model distributions on worst-case neighboring datasets, and explore its dependence on the initialization, width, and depth of fully connected neural networks. We find that this KL privacy bound is largely determined by the expected squared gradient norm relative to model parameters during training. Notably, for the special setting of linearized network, our analysis indicates that the squared gradient norm (and therefore the escalation of privacy loss) is tied directly to the per-layer variance of the initialization distribution. By using this analysis, we demonstrate that privacy bound improves with increasing depth under certain initializations (LeCun and Xavier), while degrades with increasing depth under other initializations (He and NTK). Our work reveals a complex interplay between privacy and depth that depends on the chosen initialization distribution. We further prove excess empirical risk bounds under a fixed KL privacy budget, and show that the interplay between privacy utility trade-off and depth is similarly affected by the initialization. △ Less

Submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.18123 [pdf, ps, other]

Sample Complexity Bounds for Score-Matching: Causal Discovery and Generative Modeling

Authors: Zhenyu Zhu, Francesco Locatello, Volkan Cevher

Abstract: This paper provides statistical sample complexity bounds for score-matching and its applications in causal discovery. We demonstrate that accurate estimation of the score function is achievable by training a standard deep ReLU neural network using stochastic gradient descent. We establish bounds on the error rate of recovering causal relationships using the score-matching-based causal discovery me… ▽ More This paper provides statistical sample complexity bounds for score-matching and its applications in causal discovery. We demonstrate that accurate estimation of the score function is achievable by training a standard deep ReLU neural network using stochastic gradient descent. We establish bounds on the error rate of recovering causal relationships using the score-matching-based causal discovery method of Rolland et al. [2022], assuming a sufficiently good estimation of the score function. Finally, we analyze the upper bound of score-matching estimation within the score-based generative modeling, which has been applied for causal discovery but is also of independent interest within the domain of generative models. △ Less

Submitted 27 October, 2023; originally announced October 2023.

Comments: Accepted in NeurIPS 2023

arXiv:2310.09426 [pdf, other]

Offline Reinforcement Learning for Optimizing Production Bidding Policies

Authors: Dmytro Korenkevych, Frank Cheng, Artsiom Balakir, Alex Nikulkov, Lingnan Gao, Zhihao Cen, Zuobing Xu, Zheqing Zhu

Abstract: The online advertising market, with its thousands of auctions run per second, presents a daunting challenge for advertisers who wish to optimize their spend under a budget constraint. Thus, advertising platforms typically provide automated agents to their customers, which act on their behalf to bid for impression opportunities in real time at scale. Because these proxy agents are owned by the plat… ▽ More The online advertising market, with its thousands of auctions run per second, presents a daunting challenge for advertisers who wish to optimize their spend under a budget constraint. Thus, advertising platforms typically provide automated agents to their customers, which act on their behalf to bid for impression opportunities in real time at scale. Because these proxy agents are owned by the platform but use advertiser funds to operate, there is a strong practical need to balance reliability and explainability of the agent with optimizing power. We propose a generalizable approach to optimizing bidding policies in production environments by learning from real data using offline reinforcement learning. This approach can be used to optimize any differentiable base policy (practically, a heuristic policy based on principles which the advertiser can easily understand), and only requires data generated by the base policy itself. We use a hybrid agent architecture that combines arbitrary base policies with deep neural networks, where only the optimized base policy parameters are eventually deployed, and the neural network part is discarded after training. We demonstrate that such an architecture achieves statistically significant performance gains in both simulated and at-scale production bidding environments. Our approach does not incur additional infrastructure, safety, or explainability costs, as it directly optimizes parameters of existing production routines without replacing them with black box-style models like neural networks. △ Less

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2309.02530 [pdf, other]

Diffusion on the Probability Simplex

Authors: Griffin Floto, Thorsteinn Jonsson, Mihai Nica, Scott Sanner, Eric Zhengyu Zhu

Abstract: Diffusion models learn to reverse the progressive noising of a data distribution to create a generative model. However, the desired continuous nature of the noising process can be at odds with discrete data. To deal with this tension between continuous and discrete objects, we propose a method of performing diffusion on the probability simplex. Using the probability simplex naturally creates an in… ▽ More Diffusion models learn to reverse the progressive noising of a data distribution to create a generative model. However, the desired continuous nature of the noising process can be at odds with discrete data. To deal with this tension between continuous and discrete objects, we propose a method of performing diffusion on the probability simplex. Using the probability simplex naturally creates an interpretation where points correspond to categorical probability distributions. Our method uses the softmax function applied to an Ornstein-Unlenbeck Process, a well-known stochastic differential equation. We find that our methodology also naturally extends to include diffusion on the unit cube which has applications for bounded image generation. △ Less

Submitted 11 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

arXiv:2307.07918 [pdf, ps, other]

A Data Fusion Method for Quantile Treatment Effects

Authors: Yijiao Zhang, Zhongyi Zhu

Abstract: With the increasing availability of datasets, develo** data fusion methods to leverage the strengths of different datasets to draw causal effects is of great practical importance to many scientific fields. In this paper, we consider estimating the quantile treatment effects using small validation data with fully-observed confounders and large auxiliary data with unmeasured confounders. We propos… ▽ More With the increasing availability of datasets, develo** data fusion methods to leverage the strengths of different datasets to draw causal effects is of great practical importance to many scientific fields. In this paper, we consider estimating the quantile treatment effects using small validation data with fully-observed confounders and large auxiliary data with unmeasured confounders. We propose a Fused Quantile Treatment effects Estimator (FQTE) by integrating the information from two datasets based on doubly robust estimating functions. We allow for the misspecification of the models on the dataset with unmeasured confounders. Under mild conditions, we show that the proposed FQTE is asymptotically normal and more efficient than the initial QTE estimator using the validation data solely. By establishing the asymptotic linear forms of related estimators, convenient methods for covariance estimation are provided. Simulation studies demonstrate the empirical validity and improved efficiency of our fused estimators. We illustrate the proposed method with an application. △ Less

Submitted 15 July, 2023; originally announced July 2023.

Comments: 42 pages, 1 figure

arXiv:2302.01075 [pdf, other]

MonoFlow: Rethinking Divergence GANs via the Perspective of Wasserstein Gradient Flows

Authors: Mingxuan Yi, Zhanxing Zhu, Song Liu

Abstract: The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsi… ▽ More The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration of GANs. We introduce a unified generative modeling framework - MonoFlow: the particle evolution is rescaled via a monotonically increasing map** of the log density ratio. Under our framework, adversarial training can be viewed as a procedure first obtaining MonoFlow's vector field via training the discriminator and the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework. △ Less

Submitted 8 August, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: ICML 2023

arXiv:2301.11697 [pdf, other]

Big portfolio selection by graph-based conditional moments method

Authors: Zhoufan Zhu, Ningning Zhang, Ke Zhu

Abstract: How to do big portfolio selection is very important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which… ▽ More How to do big portfolio selection is very important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which guides the learning procedure through a factor-hypergraph built by the set of stock-to-stock relations from the domain knowledge as well as the set of factor-to-stock relations from the asset pricing knowledge. Next, the GRACE method learns the conditional variance, skewness, and kurtosis of stock returns from the learned conditional quantiles by using the quantiled conditional moment (QCM) method. The QCM method is a supervised learning procedure to learn these conditional higher-order moments, so it largely overcomes the computational difficulty from the classical high-dimensional GARCH-type methods. Moreover, the QCM method allows the mis-specification in modeling conditional quantiles to some extent, due to its regression-based nature. Finally, the GRACE method uses the learned conditional mean, variance, skewness, and kurtosis to construct several performance measures, which are criteria to sort the stocks to proceed the portfolio selection in the well-known 10-decile framework. An application to NASDAQ and NYSE stock markets shows that the GRACE method performs much better than its competitors, particularly when the performance measures are comprised of conditional variance, skewness, and kurtosis. △ Less

Submitted 27 January, 2023; originally announced January 2023.

Comments: 35 pages

MSC Class: 62M10; 68T07; 91G10

arXiv:2301.06259 [pdf, other]

Understanding Best Subset Selection: A Tale of Two C(omplex)ities

Authors: Saptarshi Roy, Ambuj Tewari, Ziwei Zhu

Abstract: For decades, best subset selection (BSS) has eluded statisticians mainly due to its computational bottleneck. However, until recently, modern computational breakthroughs have rekindled theoretical interest in BSS and have led to new findings. Recently, \cite{guo2020best} showed that the model selection performance of BSS is governed by a margin quantity that is robust to the design dependence, unl… ▽ More For decades, best subset selection (BSS) has eluded statisticians mainly due to its computational bottleneck. However, until recently, modern computational breakthroughs have rekindled theoretical interest in BSS and have led to new findings. Recently, \cite{guo2020best} showed that the model selection performance of BSS is governed by a margin quantity that is robust to the design dependence, unlike modern methods such as LASSO, SCAD, MCP, etc. Motivated by their theoretical results, in this paper, we also study the variable selection properties of best subset selection for high-dimensional sparse linear regression setup. We show that apart from the identifiability margin, the following two complexity measures play a fundamental role in characterizing the margin condition for model consistency: (a) complexity of \emph{residualized features}, (b) complexity of \emph{spurious projections}. In particular, we establish a simple margin condition that depends only on the identifiability margin and the dominating one of the two complexity measures. Furthermore, we show that a margin condition depending on similar margin quantity and complexity measures is also necessary for model consistency of BSS. For a broader understanding, we also consider some simple illustrative examples to demonstrate the variation in the complexity measures that refines our theoretical understanding of the model selection performance of BSS under different correlation structures. △ Less

Submitted 17 July, 2023; v1 submitted 15 January, 2023; originally announced January 2023.

Comments: 46 pages, 2 Figures

arXiv:2301.01091 [pdf, other]

Fitting mixed logit random regret minimization models using maximum simulated likelihood

Authors: Ziyue Zhu, Álvaro A. Gutiérrez-Vargas, Martina Vandebroek

Abstract: This article describes the mixrandregret command, which extends the randregret command introduced in Gutiérrez-Vargas et al. (2021, The Stata Journal 21: 626-658) incorporating random coefficients for Random Regret Minimization models. The newly developed command mixrandregret allows the inclusion of random coefficients in the regret function of the classical RRM model introduced in Chorus (2010,… ▽ More This article describes the mixrandregret command, which extends the randregret command introduced in Gutiérrez-Vargas et al. (2021, The Stata Journal 21: 626-658) incorporating random coefficients for Random Regret Minimization models. The newly developed command mixrandregret allows the inclusion of random coefficients in the regret function of the classical RRM model introduced in Chorus (2010, European Journal of Transport and Infrastructure Research 10: 181-196). The command allows the user to specify a combination of fixed and random coefficients. In addition, the user can specify normal and log-normal distributions for the random coefficients using the commands' options. The models are fitted using simulated maximum likelihood using numerical integration to approximate the choice probabilities. △ Less

Submitted 3 January, 2023; originally announced January 2023.

arXiv:2212.12206 [pdf, other]

Principled and Efficient Transfer Learning of Deep Models via Neural Collapse

Authors: Xiao Li, Sheng Liu, **xin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, Qing Qu

Abstract: As model size continues to grow and access to labeled training data remains limited, transfer learning has become a popular approach in many scientific and engineering fields. This study explores the phenomenon of neural collapse (NC) in transfer learning for classification problems, which is characterized by the last-layer features and classifiers of deep networks having zero within-class variabi… ▽ More As model size continues to grow and access to labeled training data remains limited, transfer learning has become a popular approach in many scientific and engineering fields. This study explores the phenomenon of neural collapse (NC) in transfer learning for classification problems, which is characterized by the last-layer features and classifiers of deep networks having zero within-class variability in features and maximally and equally separated between-class feature means. Through the lens of NC, in this work the following findings on transfer learning are discovered: (i) preventing within-class variability collapse to a certain extent during model pre-training on source data leads to better transferability, as it preserves the intrinsic structures of the input data better; (ii) obtaining features with more NC on downstream data during fine-tuning results in better test accuracy. These results provide new insight into commonly used heuristics in model pre-training, such as loss design, data augmentation, and projection heads, and lead to more efficient and principled methods for fine-tuning large pre-trained models. Compared to full model fine-tuning, our proposed fine-tuning methods achieve comparable or even better performance while reducing fine-tuning parameters by at least 70% as well as alleviating overfitting. △ Less

Submitted 26 February, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

Comments: First two authors contributed equally, 29 pages, 14 figures, and 7 tables

arXiv:2212.00428 [pdf, other]

Transfer Learning for High-dimensional Quantile Regression via Convolution Smoothing

Authors: Yijiao Zhang, Zhongyi Zhu

Abstract: This paper studies the high-dimensional quantile regression problem under the transfer learning framework, where possibly related source datasets are available to make improvements on the estimation or prediction based solely on the target data. In the oracle case with known transferable sources, a smoothed two-step transfer learning algorithm based on convolution smoothing is proposed and the L1/… ▽ More This paper studies the high-dimensional quantile regression problem under the transfer learning framework, where possibly related source datasets are available to make improvements on the estimation or prediction based solely on the target data. In the oracle case with known transferable sources, a smoothed two-step transfer learning algorithm based on convolution smoothing is proposed and the L1/L2 estimation error bounds of the corresponding estimator are also established. To avoid including non-informative sources, we propose to select the transferable sources adaptively and establish its selection consistency under regular conditions. Monte Carlo simulations as well as an empirical analysis of gene expression data demonstrate the effectiveness of the proposed procedure. △ Less

Submitted 1 May, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: 27 pages, 6 figures

arXiv:2210.02192 [pdf, other]

Are All Losses Created Equal: A Neural Collapse Perspective

Authors: **xin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, Zhihui Zhu

Abstract: While cross entropy (CE) is the most commonly used loss to train deep neural networks for classification tasks, many alternative losses have been developed to obtain better empirical performance. Among them, which one is the best to use is still a mystery, because there seem to be multiple factors affecting the answer, such as properties of the dataset, the choice of network architecture, and so o… ▽ More While cross entropy (CE) is the most commonly used loss to train deep neural networks for classification tasks, many alternative losses have been developed to obtain better empirical performance. Among them, which one is the best to use is still a mystery, because there seem to be multiple factors affecting the answer, such as properties of the dataset, the choice of network architecture, and so on. This paper studies the choice of loss function by examining the last-layer features of deep networks, drawing inspiration from a recent line work showing that the global optimal solution of CE and mean-square-error (MSE) losses exhibits a Neural Collapse phenomenon. That is, for sufficiently large networks trained until convergence, (i) all features of the same class collapse to the corresponding class mean and (ii) the means associated with different classes are in a configuration where their pairwise distances are all equal and maximized. We extend such results and show through global solution and landscape analyses that a broad family of loss functions including commonly used label smoothing (LS) and focal loss (FL) exhibits Neural Collapse. Hence, all relevant losses(i.e., CE, LS, FL, MSE) produce equivalent features on training data. Based on the unconstrained feature model assumption, we provide either the global landscape analysis for LS loss or the local landscape analysis for FL loss and show that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessian exhibit negative curvature directions either in the global scope for LS loss or in the local scope for FL loss near the optimal solution. The experiments further show that Neural Collapse features obtained from all relevant losses lead to largely identical performance on test data as well, provided that the network is sufficiently large and trained until convergence. △ Less

Submitted 8 October, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: 32 page, 10 figures, NeurIPS 2022

arXiv:2209.10675 [pdf, other]

A Validation Approach to Over-parameterized Matrix and Image Recovery

Authors: Lijun Ding, Zhen Qin, Liwei Jiang, **xin Zhou, Zhihui Zhu

Abstract: In this paper, we study the problem of recovering a low-rank matrix from a number of noisy random linear measurements. We consider the setting where the rank of the ground-truth matrix is unknown a prior and use an overspecified factored representation of the matrix variable, where the global optimal solutions overfit and do not correspond to the underlying ground-truth. We then solve the associat… ▽ More In this paper, we study the problem of recovering a low-rank matrix from a number of noisy random linear measurements. We consider the setting where the rank of the ground-truth matrix is unknown a prior and use an overspecified factored representation of the matrix variable, where the global optimal solutions overfit and do not correspond to the underlying ground-truth. We then solve the associated nonconvex problem using gradient descent with small random initialization. We show that as long as the measurement operators satisfy the restricted isometry property (RIP) with its rank parameter scaling with the rank of ground-truth matrix rather than scaling with the overspecified matrix variable, gradient descent iterations are on a particular trajectory towards the ground-truth matrix and achieve nearly information-theoretically optimal recovery when stop appropriately. We then propose an efficient early stop** strategy based on the common hold-out method and show that it detects nearly optimal estimator provably. Moreover, experiments show that the proposed validation approach can also be efficiently used for image restoration with deep image prior which over-parameterizes an image with a deep network. △ Less

Submitted 21 September, 2022; originally announced September 2022.

Comments: 29 pages and 9 figures

arXiv:2209.09211 [pdf, other]

Neural Collapse with Normalized Features: A Geometric Analysis over the Riemannian Manifold

Authors: Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, Qing Qu

Abstract: When training overparameterized deep networks for classification tasks, it has been widely observed that the learned features exhibit a so-called "neural collapse" phenomenon. More specifically, for the output features of the penultimate layer, for each class the within-class features converge to their means, and the means of different classes exhibit a certain tight frame structure, which is also… ▽ More When training overparameterized deep networks for classification tasks, it has been widely observed that the learned features exhibit a so-called "neural collapse" phenomenon. More specifically, for the output features of the penultimate layer, for each class the within-class features converge to their means, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's classifier. As feature normalization in the last layer becomes a common practice in modern representation learning, in this work we theoretically justify the neural collapse phenomenon for normalized features. Based on an unconstrained feature model, we simplify the empirical loss function in a multi-class classification task into a nonconvex optimization problem over the Riemannian manifold by constraining all features and classifiers over the sphere. In this context, we analyze the nonconvex landscape of the Riemannian optimization problem over the product of spheres, showing a benign global landscape in the sense that the only global minimizers are the neural collapse solutions while all other critical points are strict saddles with negative curvature. Experimental results on practical deep networks corroborate our theory and demonstrate that better representations can be learned faster via feature normalization. △ Less

Submitted 7 March, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: The first two authors contributed to this work equally; 38 pages, 13 figures. Accepted at NeurIPS'22

arXiv:2207.09098 [pdf, other]

ReBoot: Distributed statistical learning via refitting bootstrap samples

Authors: Yumeng Wang, Ziwei Zhu, Xuming He

Abstract: In this paper, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. ReBoot refits a new model to mini-batches of bootstrap samples that are continuously drawn from each of the locally fitted models. It requires only one round of communication of model parameters without much memory. Theoretically, we analyze the statistical error rate o… ▽ More In this paper, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. ReBoot refits a new model to mini-batches of bootstrap samples that are continuously drawn from each of the locally fitted models. It requires only one round of communication of model parameters without much memory. Theoretically, we analyze the statistical error rate of ReBoot for generalized linear models (GLM) and noisy phase retrieval, which represent convex and non-convex problems, respectively. In both cases, ReBoot provably achieves the full-sample statistical rate. In particular, we show that the systematic bias of ReBoot, the error that is independent of the number of subsamples (i.e., the number of sites), is $O(n ^ {-2})$ in GLM, where $n$ is the subsample size (the sample size of each local site). This rate is sharper than that of model parameter averaging and its variants, implying the higher tolerance of ReBoot with respect to data splits to maintain the full-sample rate. Our simulation study demonstrates the statistical advantage of ReBoot over competing methods. Finally, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks for image classification. FedReBoot exhibits substantial superiority over Federated Averaging (FedAvg) within early rounds of communication. △ Less

Submitted 7 May, 2024; v1 submitted 19 July, 2022; originally announced July 2022.

arXiv:2206.06463 [pdf, other]

A statistical reconstruction algorithm for positronium lifetime imaging using time-of-flight positron emission tomography

Authors: Hsin-Hsiung Huang, Zheyuan Zhu, Slun Booppasiri, Zhuo Chen, Shuo Pang, Chien-Min Kao

Abstract: Positron emission tomography (PET) is an important modality for diagnosing diseases such as cancer and Alzheimer's disease, capable of revealing the uptake of radiolabeled molecules that target specific pathological markers of the diseases. Recently, positronium lifetime imaging (PLI) that adds to traditional PET the ability to explore properties of the tissue microenvironment beyond tracer uptake… ▽ More Positron emission tomography (PET) is an important modality for diagnosing diseases such as cancer and Alzheimer's disease, capable of revealing the uptake of radiolabeled molecules that target specific pathological markers of the diseases. Recently, positronium lifetime imaging (PLI) that adds to traditional PET the ability to explore properties of the tissue microenvironment beyond tracer uptake has been demonstrated with time-of-flight (TOF) PET and the use of non-pure positron emitters. However, achieving accurate reconstruction of lifetime images from data acquired by systems having a finite TOF resolution still presents a challenge. This paper focuses on the two-dimensional PLI, introducing a maximum likelihood estimation (MLE) method that employs an exponentially modified Gaussian (EMG) probability distribution that describes the positronium lifetime data produced by TOF PET. We evaluate the performance of our EMG-based MLE method against \st{traditional} approaches using exponential likelihood functions and penalized surrogate methods. Results from computer-simulated data reveal that the proposed EMG-MLE method can yield quantitatively accurate lifetime images. We also demonstrate that the proposed MLE formulation can be extended to handle PLI data containing multiple positron populations. △ Less

Submitted 9 May, 2024; v1 submitted 13 June, 2022; originally announced June 2022.

Comments: Submitted to IEEE-TPRMS

arXiv:2206.01474 [pdf, other]

Offline Reinforcement Learning with Causal Structured World Models

Authors: Zheng-Mao Zhu, Xiong-Hui Chen, Hong-Long Tian, Kun Zhang, Yang Yu

Abstract: Model-based methods have recently shown promising for offline reinforcement learning (RL), aiming to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn fully connected nets as world-models that map the states and actions to the next-step states. However, it is sensible that a world-model should adhere to the underlying c… ▽ More Model-based methods have recently shown promising for offline reinforcement learning (RL), aiming to learn good policies from historical data without interacting with the environment. Previous model-based offline RL methods learn fully connected nets as world-models that map the states and actions to the next-step states. However, it is sensible that a world-model should adhere to the underlying causal effect such that it will support learning an effective policy generalizing well in unseen states. In this paper, We first provide theoretical results that causal world-models can outperform plain world-models for offline RL by incorporating the causal structure into the generalization error bound. We then propose a practical algorithm, oFfline mOdel-based reinforcement learning with CaUsal Structure (FOCUS), to illustrate the feasibility of learning and leveraging causal structure in offline RL. Experimental results on two benchmarks show that FOCUS reconstructs the underlying causal structure accurately and robustly. Consequently, it performs better than the plain model-based offline RL algorithms and other causal model-based RL algorithms. △ Less

Submitted 3 June, 2022; originally announced June 2022.

arXiv:2204.14184 [pdf]

doi 10.1080/21680566.2022.2108522

Modeling Ride-Sourcing Matching and Pickup Processes based on Additive Gaussian Process Models

Authors: Zheng Zhu, Meng Xu, Yining Di, Xiqun Chen, **gru Yu

Abstract: Matching and pickup processes are core features of ride-sourcing services. Previous studies have adopted abundant analytical models to depict the two processes and obtain operational insights; while the goodness of fit between models and data was dismissed. To simultaneously consider the fitness between models and data and analytically tractable formations, we propose a data-driven approach based… ▽ More Matching and pickup processes are core features of ride-sourcing services. Previous studies have adopted abundant analytical models to depict the two processes and obtain operational insights; while the goodness of fit between models and data was dismissed. To simultaneously consider the fitness between models and data and analytically tractable formations, we propose a data-driven approach based on the additive Gaussian Process Model (AGPM) for ride-sourcing market modeling. The framework is tested based on real-world data collected in Hangzhou, China. We fit analytical models, machine learning models, and AGPMs, in which the number of matches or pickups are used as outputs and spatial, temporal, demand, and supply covariates are utilized as inputs. The results demonstrate the advantages of AGPMs in recovering the two processes in terms of estimation accuracy. Furthermore, we illustrate the modeling power of AGPM by utilizing the trained model to design and estimate idle vehicle relocation strategies. △ Less

Submitted 29 April, 2022; originally announced April 2022.

Comments: 30 pages, 8 figures, 4 tables. Submitted and under review in Transportmetrica B: Transport Dynamics

arXiv:2203.04883 [pdf, other]

A-Optimal Split Questionnaire Designs for Multivariate Continuous Variables

Authors: Dae-Gyu Jang, Zhengyuan Zhu, Cindy Yu

Abstract: A split questionnaire design (SQD), an alternative to full questionnaires, can reduce the response burden and improve survey quality. One can design a split questionnaire to reduce the information loss from missing data induced by the split questionnaire. This study develops a methodology for finding optimal SQD (OSQD) for multivariate continuous variables, applying a probabilistic design and opti… ▽ More A split questionnaire design (SQD), an alternative to full questionnaires, can reduce the response burden and improve survey quality. One can design a split questionnaire to reduce the information loss from missing data induced by the split questionnaire. This study develops a methodology for finding optimal SQD (OSQD) for multivariate continuous variables, applying a probabilistic design and optimality criterion approach. Our method employs previous survey data to compute the Fisher information matrix and A-optimality criterion to find OSQD for the current survey study. We derive theoretical findings on the relationship between the correlation structure and OSQD and the robustness of local OSQD. We conduct simulation studies to compare local and two global OSQDs; mini-max OSQD and Bayes OSQD) to baselines. We also apply our method to the 2016 Pet Demographic Survey (PDS) data. In both simulation studies and the real data application, local and global OSQDs outperform the baselines. △ Less

Submitted 9 March, 2022; originally announced March 2022.

arXiv:2203.01238 [pdf, other]

On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features

Authors: **xin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, Zhihui Zhu

Abstract: When training deep neural networks for classification tasks, an intriguing empirical phenomenon has been widely observed in the last-layer classifiers and features, where (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero… ▽ More When training deep neural networks for classification tasks, an intriguing empirical phenomenon has been widely observed in the last-layer classifiers and features, where (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero. This phenomenon is called Neural Collapse (NC), which seems to take place regardless of the choice of loss functions. In this work, we justify NC under the mean squared error (MSE) loss, where recent empirical evidence shows that it performs comparably or even better than the de-facto cross-entropy loss. Under a simplified unconstrained feature model, we provide the first global landscape analysis for vanilla nonconvex MSE loss and show that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessian exhibit negative curvature directions. Furthermore, we justify the usage of rescaled MSE loss by probing the optimization landscape around the NC solutions, showing that the landscape can be improved by tuning the rescaling hyperparameters. Finally, our theoretical findings are experimentally verified on practical network architectures. △ Less

Submitted 12 March, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2202.14026 [pdf, other]

Robust Training under Label Noise by Over-parameterization

Authors: Sheng Liu, Zhihui Zhu, Qing Qu, Chong You

Abstract: Recently, over-parameterized deep networks, with increasingly more network parameters than training samples, have dominated the performances of modern machine learning. However, when the training data is corrupted, it has been well-known that over-parameterized networks tend to overfit and do not generalize. In this work, we propose a principled approach for robust training of over-parameterized d… ▽ More Recently, over-parameterized deep networks, with increasingly more network parameters than training samples, have dominated the performances of modern machine learning. However, when the training data is corrupted, it has been well-known that over-parameterized networks tend to overfit and do not generalize. In this work, we propose a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted. The main idea is yet very simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data. Specifically, we model the label noise via another sparse over-parameterization term, and exploit implicit algorithmic regularizations to recover and separate the underlying corruptions. Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets. Furthermore, our experimental results are corroborated by theory on simplified linear models, showing that exact separation between sparse noise and low-rank data can be achieved under incoherent conditions. The work opens many interesting directions for improving over-parameterized models by using sparse over-parameterization and implicit regularization. △ Less

Submitted 2 August, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

Comments: 25 pages, 4 figures and 6 tables. Code is available at https://github.com/shengliu66/SOP

arXiv:2201.01036 [pdf, other]

Supervised Homogeneity Fusion: a Combinatorial Approach

Authors: Wen Wang, Shihao Wu, Ziwei Zhu, Ling Zhou, Peter X. -K. Song

Abstract: Fusing regression coefficients into homogenous groups can unveil those coefficients that share a common value within each group. Such groupwise homogeneity reduces the intrinsic dimension of the parameter space and unleashes sharper statistical accuracy. We propose and investigate a new combinatorial grou** approach called $L_0$-Fusion that is amenable to mixed integer optimization (MIO). On the… ▽ More Fusing regression coefficients into homogenous groups can unveil those coefficients that share a common value within each group. Such groupwise homogeneity reduces the intrinsic dimension of the parameter space and unleashes sharper statistical accuracy. We propose and investigate a new combinatorial grou** approach called $L_0$-Fusion that is amenable to mixed integer optimization (MIO). On the statistical aspect, we identify a fundamental quantity called grou** sensitivity that underpins the difficulty of recovering the true groups. We show that $L_0$-Fusion achieves grou** consistency under the weakest possible requirement of the grou** sensitivity: if this requirement is violated, then the minimax risk of group misspecification will fail to converge to zero. Moreover, we show that in the high-dimensional regime, one can apply $L_0$-Fusion coupled with a sure screening set of features without any essential loss of statistical efficiency, while reducing the computational cost substantially. On the algorithmic aspect, we provide a MIO formulation for $L_0$-Fusion along with a warm start strategy. Simulation and real data analysis demonstrate that $L_0$-Fusion exhibits superiority over its competitors in terms of grou** accuracy. △ Less

Submitted 4 January, 2022; originally announced January 2022.

arXiv:2110.12088 [pdf, other]

Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations

Authors: Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, Yang Liu

Abstract: Existing research on learning with noisy labels mainly focuses on synthetic label noise. Synthetic noise, though has clean structures which greatly enabled statistical analyses, often fails to model real-world noise patterns. The recent literature has observed several efforts to offer real-world noisy datasets, yet the existing efforts suffer from two caveats: (1) The lack of ground-truth verifica… ▽ More Existing research on learning with noisy labels mainly focuses on synthetic label noise. Synthetic noise, though has clean structures which greatly enabled statistical analyses, often fails to model real-world noise patterns. The recent literature has observed several efforts to offer real-world noisy datasets, yet the existing efforts suffer from two caveats: (1) The lack of ground-truth verification makes it hard to theoretically study the property and treatment of real-world label noise; (2) These efforts are often of large scales, which may result in unfair comparisons of robust methods within reasonable and accessible computation power. To better understand real-world label noise, it is crucial to build controllable and moderate-sized real-world noisy datasets with both ground-truth and noisy labels. This work presents two new benchmark datasets CIFAR-10N, CIFAR-100N, equip** the training datasets of CIFAR-10, CIFAR-100 with human-annotated real-world noisy labels we collected from Amazon Mechanical Turk. We quantitatively and qualitatively show that real-world noisy labels follow an instance-dependent pattern rather than the classically assumed and adopted ones (e.g., class-dependent label noise). We then initiate an effort to benchmarking a subset of the existing solutions using CIFAR-10N and CIFAR-100N. We further proceed to study the memorization of correct and wrong predictions, which further illustrates the difference between human noise and class-dependent synthetic noise. We show indeed the real-world noise patterns impose new and outstanding challenges as compared to synthetic label noise. These observations require us to rethink the treatment of noisy labels, and we hope the availability of these two datasets would facilitate the development and evaluation of future learning with noisy label solutions. Datasets and leaderboards are available at http://noisylabels.com. △ Less

Submitted 27 March, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

Comments: Published as a conference paper at ICLR 2022

arXiv:2110.06282 [pdf, other]

The Rich Get Richer: Disparate Impact of Semi-Supervised Learning

Authors: Zhaowei Zhu, Tianyi Luo, Yang Liu

Abstract: Semi-supervised learning (SSL) has demonstrated its potential to improve the model accuracy for a variety of learning tasks when the high-quality supervised data is severely limited. Although it is often established that the average accuracy for the entire population of data is improved, it is unclear how SSL fares with different sub-populations. Understanding the above question has substantial fa… ▽ More Semi-supervised learning (SSL) has demonstrated its potential to improve the model accuracy for a variety of learning tasks when the high-quality supervised data is severely limited. Although it is often established that the average accuracy for the entire population of data is improved, it is unclear how SSL fares with different sub-populations. Understanding the above question has substantial fairness implications when different sub-populations are defined by the demographic groups that we aim to treat fairly. In this paper, we reveal the disparate impacts of deploying SSL: the sub-population who has a higher baseline accuracy without using SSL (the "rich" one) tends to benefit more from SSL; while the sub-population who suffers from a low baseline accuracy (the "poor" one) might even observe a performance drop after adding the SSL module. We theoretically and empirically establish the above observation for a broad family of SSL algorithms, which either explicitly or implicitly use an auxiliary "pseudo-label". Experiments on a set of image and text classification tasks confirm our claims. We introduce a new metric, Benefit Ratio, and promote the evaluation of the fairness of SSL (Equalized Benefit Ratio). We further discuss how the disparate impact can be mitigated. We hope our paper will alarm the potential pitfall of using SSL and encourage a multifaceted evaluation of future SSL algorithms. △ Less

Submitted 31 August, 2023; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Published as a conference paper at ICLR 2022. Revised constants Theorems 1,2, and Lemma 3 (consider the union bound). Add acknowledgments to Nautilus

arXiv:2110.02449 [pdf, ps, other]

doi 10.1016/j.csda.2022.107553

Empirical likelihood inference for longitudinal data with covariate measurement errors: An application to the LEAN study

Authors: Yuexia Zhang, Guoyou Qin, Zhongyi Zhu, Jiajia Zhang

Abstract: Measurement errors usually arise during the longitudinal data collection process. Ignoring the effects of measurement errors will lead to invalid estimates. The Lifestyle Education for Activity and Nutrition (LEAN) study was designed to assess the effectiveness of intervention for enhancing weight loss over nine months. The covariates systolic blood pressure (SBP) and diastolic blood pressure (DBP… ▽ More Measurement errors usually arise during the longitudinal data collection process. Ignoring the effects of measurement errors will lead to invalid estimates. The Lifestyle Education for Activity and Nutrition (LEAN) study was designed to assess the effectiveness of intervention for enhancing weight loss over nine months. The covariates systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured at baseline, month 4, and month 9. At each assessment time, there were two replicate measurements for SBP and DBP. The replicate measurement errors of SBP follow different distributions, as does DBP. To account for the distributional difference of replicate measurement errors, a new method for analyzing longitudinal data with replicate covariate measurement errors is developed based on the empirical likelihood method. The asymptotic properties of the proposed estimator are established under some regularity conditions. The confidence region for the parameters of interest can be constructed based on the chi-squared approximation without estimating the covariance matrix. Additionally, the proposed empirical likelihood estimator is asymptotically more efficient than the estimator of Lin et al. (2018). Extensive simulations demonstrate that the proposed method can eliminate the effects of measurement errors in the covariate and has a high estimation efficiency. The proposed method indicates the significant effect of the intervention on BMI in the LEAN study. △ Less

Submitted 2 July, 2022; v1 submitted 5 October, 2021; originally announced October 2021.

arXiv:2110.01189 [pdf, other]

Volatility prediction comparison via robust volatility proxies: An empirical deviation perspective

Authors: Weichen Wang, Ran An, Ziwei Zhu

Abstract: Volatility forecasting is crucial to risk management and portfolio construction. One particular challenge of assessing volatility forecasts is how to construct a robust proxy for the unknown true volatility. In this work, we show that the empirical loss comparison between two volatility predictors hinges on the deviation of the volatility proxy from the true volatility. We then establish non-asymp… ▽ More Volatility forecasting is crucial to risk management and portfolio construction. One particular challenge of assessing volatility forecasts is how to construct a robust proxy for the unknown true volatility. In this work, we show that the empirical loss comparison between two volatility predictors hinges on the deviation of the volatility proxy from the true volatility. We then establish non-asymptotic deviation bounds for three robust volatility proxies, two of which are based on clipped data, and the third of which is based on exponentially weighted Huber loss minimization. In particular, in order for the Huber approach to adapt to non-stationary financial returns, we propose to solve a tuning-free weighted Huber loss minimization problem to jointly estimate the volatility and the optimal robustification parameter at each time point. We then inflate this robustification parameter and use it to update the volatility proxy to achieve optimal balance between the bias and variance of the global empirical loss. We also extend this Huber method to construct volatility predictors. Finally, we exploit the proposed robust volatility proxy to compare different volatility predictors on the Bitcoin market data. It turns out that when the sample size is limited, applying the robust volatility proxy gives more consistent and stable evaluation of volatility forecasts. △ Less

Submitted 4 October, 2021; originally announced October 2021.

Comments: 48 pages

MSC Class: 62F35

arXiv:2109.11154 [pdf, other]

Rank Overspecified Robust Matrix Recovery: Subgradient Method and Exact Recovery

Authors: Lijun Ding, Liwei Jiang, Yudong Chen, Qing Qu, Zhihui Zhu

Abstract: We study the robust recovery of a low-rank matrix from sparsely and grossly corrupted Gaussian measurements, with no prior knowledge on the intrinsic rank. We consider the robust matrix factorization approach. We employ a robust $\ell_1$ loss function and deal with the challenge of the unknown rank by using an overspecified factored representation of the matrix variable. We then solve the associat… ▽ More We study the robust recovery of a low-rank matrix from sparsely and grossly corrupted Gaussian measurements, with no prior knowledge on the intrinsic rank. We consider the robust matrix factorization approach. We employ a robust $\ell_1$ loss function and deal with the challenge of the unknown rank by using an overspecified factored representation of the matrix variable. We then solve the associated nonconvex nonsmooth problem using a subgradient method with diminishing stepsizes. We show that under a regularity condition on the sensing matrices and corruption, which we call restricted direction preserving property (RDPP), even with rank overspecified, the subgradient method converges to the exact low-rank solution at a sublinear rate. Moreover, our result is more general in the sense that it automatically speeds up to a linear rate once the factor rank matches the unknown rank. On the other hand, we show that the RDPP condition holds under generic settings, such as Gaussian measurements under independent or adversarial sparse corruptions, where the result could be of independent interest. Both the exact recovery and the convergence rate of the proposed subgradient method are numerically verified in the overspecified regime. Moreover, our experiment further shows that our particular design of diminishing stepsize effectively prevents overfitting for robust recovery under overparameterized models, such as robust matrix sensing and learning robust deep image prior. This regularization effect is worth further investigation. △ Less

Submitted 26 October, 2021; v1 submitted 23 September, 2021; originally announced September 2021.

Comments: 75 pages, 3 figures

arXiv:2108.07940 [pdf, ps, other]

Weak signal identification and inference in penalized likelihood models for categorical responses

Authors: Yuexia Zhang, Peibei Shi, Zhongyi Zhu, Linbo Wang, Annie Qu

Abstract: Penalized likelihood models are widely used to simultaneously select variables and estimate model parameters. However, the existence of weak signals can lead to inaccurate variable selection, biased parameter estimation, and invalid inference. Thus, identifying weak signals accurately and making valid inferences are crucial in penalized likelihood models. We develop a unified approach to identify… ▽ More Penalized likelihood models are widely used to simultaneously select variables and estimate model parameters. However, the existence of weak signals can lead to inaccurate variable selection, biased parameter estimation, and invalid inference. Thus, identifying weak signals accurately and making valid inferences are crucial in penalized likelihood models. We develop a unified approach to identify weak signals and make inferences in penalized likelihood models, including the special case when the responses are categorical. To identify weak signals, we use the estimated selection probability of each covariate as a measure of the signal strength and formulate a signal identification criterion. To construct confidence intervals, we propose a two-step inference procedure. Extensive simulation studies show that the proposed procedure outperforms several existing methods. We illustrate the proposed method by applying it to the Practice Fusion diabetes data set. △ Less

Submitted 11 December, 2022; v1 submitted 17 August, 2021; originally announced August 2021.

MSC Class: 62F99 ACM Class: G.3

arXiv:2107.06939 [pdf, other]

On sure early selection of the best subset

Authors: Ziwei Zhu, Shihao Wu

Abstract: The early solution path, which tracks the first few variables that enter the model of a selection procedure, is of profound importance to scientific discoveries. In practice, it is often statistically hopeless to identify all the important features with no false discovery, let alone the intimidating expense of experiments to test their significance. Such realistic limitation calls for statistical… ▽ More The early solution path, which tracks the first few variables that enter the model of a selection procedure, is of profound importance to scientific discoveries. In practice, it is often statistically hopeless to identify all the important features with no false discovery, let alone the intimidating expense of experiments to test their significance. Such realistic limitation calls for statistical guarantee for the early discoveries of a model selector. In this paper, we focus on the early solution path of best subset selection (BSS), where the sparsity constraint is set to be lower than {the true sparsity}. Under a sparse high-dimensional linear model, we establish the sufficient and (near) necessary condition for BSS to achieve sure early selection, or equivalently, zero false discovery throughout its early path. Essentially, this condition boils down to a lower bound of the minimum projected signal margin that characterizes the gap of the captured signal strength between sure selection models and those with spurious discoveries. Defined through projection operators, this margin is independent of the restricted eigenvalues of the design, suggesting the robustness of BSS against collinearity. Moreover, our model selection guarantee tolerates reasonable optimization error and thus applies to near best subsets. Finally, to overcome the computational hurdle of BSS under high dimension, we propose the "screen then select" (STS) strategy to reduce dimension for BSS. Our numerical experiments show that the resulting early path exhibits much lower false discovery rate (FDR) than LASSO, MCP and SCAD, especially in the presence of highly correlated design. We also investigate the early paths of the iterative hard thresholding algorithms, which are greedy computational surrogates for BSS, and which yield comparable FDR as our STS procedure. △ Less

Submitted 17 November, 2022; v1 submitted 14 July, 2021; originally announced July 2021.

arXiv:2105.02375 [pdf, other]

A Geometric Analysis of Neural Collapse with Unconstrained Features

Authors: Zhihui Zhu, Tianyu Ding, **xin Zhou, Xiao Li, Chong You, Jeremias Sulam, Qing Qu

Abstract: We provide the first global optimization landscape analysis of $Neural\;Collapse$ -- an intriguing empirical phenomenon that arises in the last-layer classifiers and features of neural networks during the terminal phase of training. As recently reported by Papyan et al., this phenomenon implies that ($i$) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equi… ▽ More We provide the first global optimization landscape analysis of $Neural\;Collapse$ -- an intriguing empirical phenomenon that arises in the last-layer classifiers and features of neural networks during the terminal phase of training. As recently reported by Papyan et al., this phenomenon implies that ($i$) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and ($ii$) cross-example within-class variability of last-layer activations collapses to zero. We study the problem based on a simplified $unconstrained\;feature\;model$, which isolates the topmost layers from the classifier of the neural network. In this context, we show that the classical cross-entropy loss with weight decay has a benign global landscape, in the sense that the only global minimizers are the Simplex ETFs while all other critical points are strict saddles whose Hessian exhibit negative curvature directions. In contrast to existing landscape analysis for deep neural networks which is often disconnected from practice, our analysis of the simplified model not only does it explain what kind of features are learned in the last layer, but it also shows why they can be efficiently optimized in the simplified settings, matching the empirical observations in practical deep network architectures. These findings could have profound implications for optimization, generalization, and robustness of broad interests. For example, our experiments demonstrate that one may set the feature dimension equal to the number of classes and fix the last-layer classifier to be a Simplex ETF for network training, which reduces memory cost by over $20\%$ on ResNet18 without sacrificing the generalization performance. △ Less

Submitted 5 May, 2021; originally announced May 2021.

Comments: 42 pages, 8 figures, 1 table; the first two authors contributed to this work equally

arXiv:2102.05291 [pdf, other]

Clusterability as an Alternative to Anchor Points When Learning with Noisy Labels

Authors: Zhaowei Zhu, Yiwen Song, Yang Liu

Abstract: The label noise transition matrix, characterizing the probabilities of a training instance being wrongly annotated, is crucial to designing popular solutions to learning with noisy labels. Existing works heavily rely on finding "anchor points" or their approximates, defined as instances belonging to a particular class almost surely. Nonetheless, finding anchor points remains a non-trivial task, an… ▽ More The label noise transition matrix, characterizing the probabilities of a training instance being wrongly annotated, is crucial to designing popular solutions to learning with noisy labels. Existing works heavily rely on finding "anchor points" or their approximates, defined as instances belonging to a particular class almost surely. Nonetheless, finding anchor points remains a non-trivial task, and the estimation accuracy is also often throttled by the number of available anchor points. In this paper, we propose an alternative option to the above task. Our main contribution is the discovery of an efficient estimation procedure based on a clusterability condition. We prove that with clusterable representations of features, using up to third-order consensuses of noisy labels among neighbor representations is sufficient to estimate a unique transition matrix. Compared with methods using anchor points, our approach uses substantially more instances and benefits from a much better sample complexity. We demonstrate the estimation accuracy and advantages of our estimates using both synthetic noisy labels (on CIFAR-10/100) and real human-level noisy labels (on Clothing1M and our self-collected human-annotated CIFAR-10). Our code and human-level noisy CIFAR-10 labels are available at https://github.com/UCSC-REAL/HOC. △ Less

Submitted 13 July, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

Comments: ICML 2021

arXiv:2101.09418 [pdf, other]

A Geospatial Functional Model For OCO-2 Data with Application on Imputation and Land Fraction Estimation

Authors: Xinyue Chang, Zhengyuan Zhu, Xiongtao Dai, Jonathan Hobbs

Abstract: Data from NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite is essential to many carbon management strategies. A retrieval algorithm is used to estimate CO2 concentration using the radiance data measured by OCO-2. However, due to factors such as cloud cover and cosmic rays, the spatial coverage of the retrieval algorithm is limited in some areas of critical importance for carbon cycle science… ▽ More Data from NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite is essential to many carbon management strategies. A retrieval algorithm is used to estimate CO2 concentration using the radiance data measured by OCO-2. However, due to factors such as cloud cover and cosmic rays, the spatial coverage of the retrieval algorithm is limited in some areas of critical importance for carbon cycle science. Mixed land/water pixels along the coastline are also not used in the retrieval processing due to the lack of valid ancillary variables including land fraction. We propose an approach to model spatial spectral data to solve these two problems by radiance imputation and land fraction estimation. The spectral observations are modeled as spatially indexed functional data with footprint-specific parameters and are reduced to much lower dimensions by functional principal component analysis. The principal component scores are modeled as random fields to account for the spatial dependence, and the missing spectral observations are imputed by kriging the principal component scores. The proposed method is shown to impute spectral radiance with high accuracy for observations over the Pacific Ocean. An unmixing approach based on this model provides much more accurate land fraction estimates in our validation study along Greece coastlines. △ Less

Submitted 23 January, 2021; originally announced January 2021.

arXiv:2012.10030 [pdf, other]

Regularized Estimation in High-Dimensional Vector Auto-Regressive Models using Spatio-Temporal Information

Authors: Zhenzhong Wang, Abolfazl Safikhani, Zhengyuan Zhu, David S. Matteson

Abstract: A Vector Auto-Regressive (VAR) model is commonly used to model multivariate time series, and there are many penalized methods to handle high dimensionality. However in terms of spatio-temporal data, most methods do not take the spatial and temporal structure of the data into consideration, which may lead to unreliable network detection and inaccurate forecasts. This paper proposes a data-driven we… ▽ More A Vector Auto-Regressive (VAR) model is commonly used to model multivariate time series, and there are many penalized methods to handle high dimensionality. However in terms of spatio-temporal data, most methods do not take the spatial and temporal structure of the data into consideration, which may lead to unreliable network detection and inaccurate forecasts. This paper proposes a data-driven weighted l1 regularized approach for spatio-temporal VAR model. Extensive simulation studies are carried out to compare the proposed method with four existing methods of high-dimensional VAR model, demonstrating improvements of our method over others in parameter estimation, network detection and out-of-sample forecasts. We also apply our method on a traffic data set to evaluate its performance in real application. In addition, we explore the theoretical properties of l1 regularized estimation of VAR model under the weakly sparse scenario, in which the exact sparsity can be viewed as a special case. To the best of our knowledge, this direction has not been considered yet in the literature. For general stationary VAR process, we derive the non-asymptotic upper bounds on l1 regularized estimation errors under the weakly sparse scenario, provide the conditions of estimation consistency, and further simplify these conditions for a special VAR(1) case. △ Less

Submitted 17 December, 2020; originally announced December 2020.

arXiv:2012.10009 [pdf, other]

Nonparametric Estimation of Repeated Densities with Heterogeneous Sample Sizes

Authors: Jiaming Qiu, Xiongtao Dai, Zhengyuan Zhu

Abstract: We consider the estimation of densities in multiple subpopulations, where the available sample size in each subpopulation greatly varies. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanism but differ in their prevalence. Without specifying a parametric form, our proposed method pools information from the population and estimate the den… ▽ More We consider the estimation of densities in multiple subpopulations, where the available sample size in each subpopulation greatly varies. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanism but differ in their prevalence. Without specifying a parametric form, our proposed method pools information from the population and estimate the density in each subpopulation in a data-driven fashion. Drawing from functional data analysis, low-dimensional approximating density families in the form of exponential families are constructed from the principal modes of variation in the log-densities. Subpopulation densities are subsequently fitted in the approximating families based on likelihood principles and shrinkage. The approximating families increase in their flexibility as the number of components increases and can approximate arbitrary infinite-dimensional densities. We also derive convergence results of the density estimates with discrete observations. The proposed methods are shown to be interpretable and efficient in simulation as well as applications to electronic medical record and rainfall data. △ Less

Submitted 13 September, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: Add additional consistency results; rearrange some figures and tables for better presentation (results not changed); correct typos

arXiv:2010.10090 [pdf, other]

Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

Authors: Guangda Ji, Zhanxing Zhu

Abstract: Knowledge distillation is a strategy of training a student network with guide of the soft output from a teacher network. It has been a successful method of model compression and knowledge transfer. However, currently knowledge distillation lacks a convincing theoretical understanding. On the other hand, recent finding on neural tangent kernel enables us to approximate a wide neural network with a… ▽ More Knowledge distillation is a strategy of training a student network with guide of the soft output from a teacher network. It has been a successful method of model compression and knowledge transfer. However, currently knowledge distillation lacks a convincing theoretical understanding. On the other hand, recent finding on neural tangent kernel enables us to approximate a wide neural network with a linear model of the network's random features. In this paper, we theoretically analyze the knowledge distillation of a wide neural network. First we provide a transfer risk bound for the linearized model of the network. Then we propose a metric of the task's training difficulty, called data inefficiency. Based on this metric, we show that for a perfect teacher, a high ratio of teacher's soft labels can be beneficial. Finally, for the case of imperfect teacher, we find that hard labels can correct teacher's wrong prediction, which explains the practice of mixing hard and soft labels. △ Less

Submitted 20 October, 2020; originally announced October 2020.

arXiv:2010.10079 [pdf, other]

Neural Approximate Sufficient Statistics for Implicit Models

Authors: Yanzhi Chen, Dinghuai Zhang, Michael Gutmann, Aaron Courville, Zhanxing Zhu

Abstract: We consider the fundamental problem of how to automatically construct summary statistics for implicit generative models where the evaluation of the likelihood function is intractable, but sampling data from the model is possible. The idea is to frame the task of constructing sufficient statistics as learning mutual information maximizing representations of the data with the help of deep neural net… ▽ More We consider the fundamental problem of how to automatically construct summary statistics for implicit generative models where the evaluation of the likelihood function is intractable, but sampling data from the model is possible. The idea is to frame the task of constructing sufficient statistics as learning mutual information maximizing representations of the data with the help of deep neural networks. The infomax learning procedure does not need to estimate any density or density ratio. We apply our approach to both traditional approximate Bayesian computation and recent neural likelihood methods, boosting their performance on a range of tasks. △ Less

Submitted 30 March, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: ICLR2021 spotlight

arXiv:2010.02347 [pdf, other]

Learning with Instance-Dependent Label Noise: A Sample Sieve Approach

Authors: Hao Cheng, Zhaowei Zhu, Xingyu Li, Yifei Gong, Xing Sun, Yang Liu

Abstract: Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) of learning with noisy labels focuses on the case when the label noise is independent of features. Practically, annotations errors tend to be instance-dependent and often depend on the… ▽ More Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) of learning with noisy labels focuses on the case when the label noise is independent of features. Practically, annotations errors tend to be instance-dependent and often depend on the difficulty levels of recognizing a certain task. Applying existing results from instance-independent settings would require a significant amount of estimation of noise rates. Therefore, providing theoretically rigorous solutions for learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES$^{2}$ (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted examples. The implementation of CORES$^{2}$ does not require specifying noise rates and yet we are able to provide theoretical guarantees of CORES$^{2}$ in filtering out the corrupted examples. This high-quality sample sieve allows us to treat clean examples and the corrupted ones separately in training a DNN solution, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES$^{2}$ on CIFAR10 and CIFAR100 datasets with synthetic instance-dependent label noise and Clothing1M with real-world human noise. As of independent interests, our sample sieve provides a generic machinery for anatomizing noisy datasets and provides a flexible interface for various robust training techniques to further improve the performance. Code is available at https://github.com/UCSC-REAL/cores. △ Less

Submitted 22 March, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: ICLR 2021

arXiv:2010.01748 [pdf, other]

Policy Learning Using Weak Supervision

Authors: **gkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu

Abstract: Most existing policy learning solutions require the learning agents to receive high-quality supervision signals such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). These quality supervisions are usually infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the available c… ▽ More Most existing policy learning solutions require the learning agents to receive high-quality supervision signals such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). These quality supervisions are usually infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the available cheap weak supervisions to perform policy learning efficiently. To handle this problem, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreements). Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high. △ Less

Submitted 2 November, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

Comments: NeurIPS 2021

arXiv:2009.07888 [pdf, other]

Transfer Learning in Deep Reinforcement Learning: A Survey

Authors: Zhuangdi Zhu, Kaixiang Lin, Anil K. Jain, Jiayu Zhou

Abstract: Reinforcement learning is a learning paradigm for solving sequential decision-making problems. Recent years have witnessed remarkable progress in reinforcement learning upon the fast development of deep neural networks. Along with the promising prospects of reinforcement learning in numerous domains such as robotics and game-playing, transfer learning has arisen to tackle various challenges faced… ▽ More Reinforcement learning is a learning paradigm for solving sequential decision-making problems. Recent years have witnessed remarkable progress in reinforcement learning upon the fast development of deep neural networks. Along with the promising prospects of reinforcement learning in numerous domains such as robotics and game-playing, transfer learning has arisen to tackle various challenges faced by reinforcement learning, by transferring knowledge from external expertise to facilitate the efficiency and effectiveness of the learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning. Specifically, we provide a framework for categorizing the state-of-the-art transfer learning approaches, under which we analyze their goals, methodologies, compatible reinforcement learning backbones, and practical applications. We also draw connections between transfer learning and other relevant topics from the reinforcement learning perspective and explore their potential challenges that await future research progress. △ Less

Submitted 4 July, 2023; v1 submitted 16 September, 2020; originally announced September 2020.

Showing 1–50 of 145 results for author: Zhu, Z