Search | arXiv e-print repository

General Distribution Learning: A theoretical framework for Deep Learning

Abstract: There remain numerous unanswered research questions on deep learning (DL) within the classical learning theory framework. These include the remarkable generalization capabilities of overparametrized neural networks (NNs), the efficient optimization performance despite non-convexity of objectives, the mechanism of flat minima for generalization, and the exceptional performance of deep architectures… ▽ More There remain numerous unanswered research questions on deep learning (DL) within the classical learning theory framework. These include the remarkable generalization capabilities of overparametrized neural networks (NNs), the efficient optimization performance despite non-convexity of objectives, the mechanism of flat minima for generalization, and the exceptional performance of deep architectures in solving physical problems. This paper introduces General Distribution Learning (GD Learning), a novel theoretical learning framework designed to address a comprehensive range of machine learning and statistical tasks, including classification, regression and parameter estimation. Departing from traditional statistical machine learning, GD Learning focuses on the true underlying distribution. In GD Learning, learning error, corresponding to the expected error in classical statistical learning framework, is divided into fitting errors due to models and algorithms, as well as sampling errors introduced by limited sampling data. The framework significantly incorporates prior knowledge, especially in scenarios characterized by data scarcity, thereby enhancing performance. Within the GD Learning framework, we demonstrate that the global optimal solutions in non-convex optimization can be approached by minimizing the gradient norm and the non-uniformity of the eigenvalues of the model's Jacobian matrix. This insight leads to the development of the gradient structure control algorithm. GD Learning also offers fresh insights into the questions on deep learning, including overparameterization and non-convex optimization, bias-variance trade-off, and the mechanism of flat minima. △ Less

Submitted 26 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2105.04026 by other authors. arXiv admin note: text overlap with arXiv:2105.04026 by other authors

arXiv:2406.03628 [pdf, other]

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Authors: Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

Abstract: Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (\textbf{O}versam\textbf{P}ling with \textbf{A}rtificial \textbf{L}LM-generated data), a systematic oversamplin… ▽ More Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (\textbf{O}versam\textbf{P}ling with \textbf{A}rtificial \textbf{L}LM-generated data), a systematic oversampling approach that leverages the capabilities of large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation using deep generative models mostly target prediction tasks. Our proposal differs in that we focus on handling imbalanced data and spurious correlations. More importantly, we develop a novel theory that rigorously characterizes the benefits of using the synthetic data, and shows the capacity of transformers in generating high-quality synthetic data for both labels and covariates. We further conduct intensive numerical experiments to demonstrate the efficacy of our proposed approach compared to some representative alternative solutions. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: 59 pages, 7 figures

arXiv:2404.17019 [pdf, other]

Neyman Meets Causal Machine Learning: Experimental Evaluation of Individualized Treatment Rules

Authors: Michael Lingzhi Li, Kosuke Imai

Abstract: A century ago, Neyman showed how to evaluate the efficacy of treatment using a randomized experiment under a minimal set of assumptions. This classical repeated sampling framework serves as a basis of routine experimental analyses conducted by today's scientists across disciplines. In this paper, we demonstrate that Neyman's methodology can also be used to experimentally evaluate the efficacy of i… ▽ More A century ago, Neyman showed how to evaluate the efficacy of treatment using a randomized experiment under a minimal set of assumptions. This classical repeated sampling framework serves as a basis of routine experimental analyses conducted by today's scientists across disciplines. In this paper, we demonstrate that Neyman's methodology can also be used to experimentally evaluate the efficacy of individualized treatment rules (ITRs), which are derived by modern causal machine learning algorithms. In particular, we show how to account for additional uncertainty resulting from a training process based on cross-fitting. The primary advantage of Neyman's approach is that it can be applied to any ITR regardless of the properties of machine learning algorithms that are used to derive the ITR. We also show, somewhat surprisingly, that for certain metrics, it is more efficient to conduct this ex-post experimental evaluation of an ITR than to conduct an ex-ante experimental evaluation that randomly assigns some units to the ITR. Our analysis demonstrates that Neyman's repeated sampling framework is as relevant for causal inference today as it has been since its inception. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.12290 [pdf, other]

Debiased Distribution Compression

Authors: Lingxiao Li, Raaz Dwivedi, Lester Mackey

Abstract: Modern compression methods can summarize a target distribution $\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to $\mathbb{P}$. We introduce a new suite of compression methods suitable for compression with biased input sequences. Given $n$ points targeting the wrong distribution and quadratic time, Stein kerne… ▽ More Modern compression methods can summarize a target distribution $\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to $\mathbb{P}$. We introduce a new suite of compression methods suitable for compression with biased input sequences. Given $n$ points targeting the wrong distribution and quadratic time, Stein kernel thinning (SKT) returns $\sqrt{n}$ equal-weighted points with $\widetilde{O}(n^{-1/2})$ maximum mean discrepancy (MMD) to $\mathbb{P}$. For larger-scale compression tasks, low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as $\text{poly-log}(n)$ weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering. △ Less

Submitted 26 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

Comments: Accepted to ICML 2024

arXiv:2404.11713 [pdf, ps, other]

Propensity Score Analysis with Guaranteed Subgroup Balance

Authors: Yan Li, Yong-Fang Kuo, Liang Li

Abstract: Estimating the causal treatment effects by subgroups is important in observational studies when the treatment effect heterogeneity may be present. Existing propensity score methods rely on a correctly specified propensity score model. Model misspecification results in biased treatment effect estimation and covariate imbalance. We proposed a new algorithm, the propensity score analysis with guarant… ▽ More Estimating the causal treatment effects by subgroups is important in observational studies when the treatment effect heterogeneity may be present. Existing propensity score methods rely on a correctly specified propensity score model. Model misspecification results in biased treatment effect estimation and covariate imbalance. We proposed a new algorithm, the propensity score analysis with guaranteed subgroup balance (G-SBPS), to achieve covariate mean balance in all subgroups. We further incorporated nonparametric kernel regression for the propensity scores and developed a kernelized G-SBPS (kG-SBPS) to improve the subgroup mean balance of covariate transformations in a rich functional class. This extension is more robust to propensity score model misspecification. Extensive numerical studies showed that G-SBPS and kG-SBPS improve both subgroup covariate balance and subgroup treatment effect estimation, compared to existing approaches. We applied G-SBPS and kG-SBPS to a dataset on right heart catheterization to estimate the subgroup average treatment effects on the hospital length of stay and a dataset on diabetes self-management training to estimate the subgroup average treatment effects for the treated on the hospitalization rate. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.04794 [pdf, other]

A Deep Learning Approach to Nonparametric Propensity Score Estimation with Optimized Covariate Balance

Authors: Maosen Peng, Yan Li, Chong Wu, Liang Li

Abstract: This paper proposes a novel propensity score weighting analysis. We define two sufficient and necessary conditions for a function of the covariates to be the propensity score. The first is "local balance", which ensures the conditional independence of covariates and treatment assignment across a dense grid of propensity score values. The second condition, "local calibration", guarantees that a bal… ▽ More This paper proposes a novel propensity score weighting analysis. We define two sufficient and necessary conditions for a function of the covariates to be the propensity score. The first is "local balance", which ensures the conditional independence of covariates and treatment assignment across a dense grid of propensity score values. The second condition, "local calibration", guarantees that a balancing score is a propensity score. Using three-layer feed-forward neural networks, we develop a nonparametric propensity score model that satisfies these conditions, effectively circumventing the issue of model misspecification and optimizing covariate balance to minimize bias and stabilize the inverse probability weights. Our proposed method performed substantially better than existing methods in extensive numerical studies of both real and simulated benchmark datasets. △ Less

Submitted 6 April, 2024; originally announced April 2024.

Comments: Corresponding author: Chong Wu (Email: [email protected]) and Liang Li (Email: [email protected])

arXiv:2403.11348 [pdf, other]

COLEP: Certifiably Robust Learning-Reasoning Conformal Prediction via Probabilistic Circuits

Authors: Mintong Kang, Nezihe Merve Gürel, Linyi Li, Bo Li

Abstract: Conformal prediction has shown spurring performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during the inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In th… ▽ More Conformal prediction has shown spurring performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during the inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which comprise a data-driven learning component that trains statistical models to learn different semantic concepts, and a reasoning component that encodes knowledge and characterizes the relationships among the trained models for logic reasoning. To achieve exact and efficient reasoning, we employ probabilistic circuits (PCs) within the reasoning component. Theoretically, we provide end-to-end certification of prediction coverage for COLEP in the presence of bounded adversarial perturbations. We also provide certified coverage considering the finite size of the calibration set. Furthermore, we prove that COLEP achieves higher prediction coverage and accuracy over a single model as long as the utilities of knowledge models are non-trivial. Empirically, we show the validity and tightness of our certified coverage, demonstrating the robust conformal prediction of COLEP on various datasets, including GTSRB, CIFAR10, and AwA2. We show that COLEP achieves up to 12% improvement in certified coverage on GTSRB, 9% on CIFAR-10, and 14% on AwA2. △ Less

Submitted 17 March, 2024; originally announced March 2024.

Comments: Accepted to ICLR 2024

arXiv:2403.07031 [pdf, other]

The Cram Method for Efficient Simultaneous Learning and Evaluation

Authors: Zeyang Jia, Kosuke Imai, Michael Lingzhi Li

Abstract: We introduce the "cram" method, a general and efficient approach to simultaneous learning and evaluation using a generic machine learning (ML) algorithm. In a single pass of batched data, the proposed method repeatedly trains an ML algorithm and tests its empirical performance. Because it utilizes the entire sample for both learning and evaluation, cramming is significantly more data-efficient tha… ▽ More We introduce the "cram" method, a general and efficient approach to simultaneous learning and evaluation using a generic machine learning (ML) algorithm. In a single pass of batched data, the proposed method repeatedly trains an ML algorithm and tests its empirical performance. Because it utilizes the entire sample for both learning and evaluation, cramming is significantly more data-efficient than sample-splitting. The cram method also naturally accommodates online learning algorithms, making its implementation computationally efficient. To demonstrate the power of the cram method, we consider the standard policy learning setting where cramming is applied to the same data to both develop an individualized treatment rule (ITR) and estimate the average outcome that would result if the learned ITR were to be deployed. We show that under a minimal set of assumptions, the resulting crammed evaluation estimator is consistent and asymptotically normal. While our asymptotic results require a relatively weak stabilization condition of ML algorithm, we develop a simple, generic method that can be used with any policy learning algorithm to satisfy this condition. Our extensive simulation studies show that, when compared to sample-splitting, cramming reduces the evaluation standard error by more than 40% while improving the performance of learned policy. We also apply the cram method to a randomized clinical trial to demonstrate its applicability to real-world problems. Finally, we briefly discuss future extensions of the cram method to other learning and evaluation settings. △ Less

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2403.04568 [pdf, other]

Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition

Authors: Long-Fei Li, Peng Zhao, Zhi-Hua Zhou

Abstract: We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of feature mapp… ▽ More We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of feature map**s, $S$ is the size of state space, $A$ is the size of action space, $H$ is the episode length and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration tailored specifically to handle non-independent noises, originally proposed in the dynamic assortment area and firstly applied in reinforcement learning to handle correlations between different states. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Comments: AISTATS 2024

arXiv:2402.02697 [pdf, ps, other]

Deep Equilibrium Models are Almost Equivalent to Not-so-deep Explicit Models for High-dimensional Gaussian Mixtures

Authors: Zenan Ling, Longbo Li, Zhanbo Feng, Yixuan Zhang, Feng Zhou, Robert C. Qiu, Zhenyu Liao

Abstract: Deep equilibrium models (DEQs), as a typical implicit neural network, have demonstrated remarkable success on various tasks. There is, however, a lack of theoretical understanding of the connections and differences between implicit DEQs and explicit neural network models. In this paper, leveraging recent advances in random matrix theory (RMT), we perform an in-depth analysis on the eigenspectra of… ▽ More Deep equilibrium models (DEQs), as a typical implicit neural network, have demonstrated remarkable success on various tasks. There is, however, a lack of theoretical understanding of the connections and differences between implicit DEQs and explicit neural network models. In this paper, leveraging recent advances in random matrix theory (RMT), we perform an in-depth analysis on the eigenspectra of the conjugate kernel (CK) and neural tangent kernel (NTK) matrices for implicit DEQs, when the input data are drawn from a high-dimensional Gaussian mixture. We prove, in this setting, that the spectral behavior of these Implicit-CKs and NTKs depend on the DEQ activation function and initial weight variances, but only via a system of four nonlinear equations. As a direct consequence of this theoretical result, we demonstrate that a shallow explicit network can be carefully designed to produce the same CK or NTK as a given DEQ. Despite derived here for Gaussian mixture data, empirical results show the proposed theory and design principle also apply to popular real-world datasets. △ Less

Submitted 19 May, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted by ICML 2024

arXiv:2401.17393 [pdf, other]

Estimating the EVSI with Gaussian Approximations and Spline-Based Series Methods

Authors: Linke Li, Hawre Jalal, Anna Heath

Abstract: Background. The Expected Value of Sample Information (EVSI) measures the expected benefits that could be obtained by collecting additional data. Estimating EVSI using the traditional nested Monte Carlo method is computationally expensive but the recently developed Gaussian approximation (GA) approach can efficiently estimate EVSI across different sample sizes. However, the conventional GA may resu… ▽ More Background. The Expected Value of Sample Information (EVSI) measures the expected benefits that could be obtained by collecting additional data. Estimating EVSI using the traditional nested Monte Carlo method is computationally expensive but the recently developed Gaussian approximation (GA) approach can efficiently estimate EVSI across different sample sizes. However, the conventional GA may result in biased EVSI estimates if the decision models are highly nonlinear. This bias may lead to suboptimal study designs when GA is used to optimize the value of different studies. Therefore, we extend the conventional GA approach to improve its performance for nonlinear decision models. Methods. Our method provides accurate EVSI estimates by approximating the conditional benefit based on two steps. First, a Taylor series approximation is applied to estimate the conditional benefit as a function of the conditional moments of the parameters of interest using a spline, which is fitted to the samples of the parameters and the corresponding benefits. Next, the conditional moments of parameters are approximated by the conventional GA and Fisher information. The proposed approach is applied to several data collection exercises involving non-Gaussian parameters and nonlinear decision models. Its performance is compared with the nested Monte Carlo method, the conventional GA approach, and the nonparametric regression-based method for EVSI calculation. Results. The proposed approach provides accurate EVSI estimates across different sample sizes when the parameters of interest are non-Gaussian and the decision models are nonlinear. The computational cost of the proposed method is similar to other novel methods. Conclusions. The proposed approach can estimate EVSI across sample sizes accurately and efficiently, which may support researchers in determining an economically optimal study design using EVSI. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: 11 pages, 2 figures, presented at 44th Medical Decision Making Annual North American Meeting

arXiv:2401.16660 [pdf, ps, other]

A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information

Authors: Linke Li, Hawre Jalal, Anna Heath

Abstract: The effective sample size (ESS) measures the informational value of a probability distribution in terms of an equivalent number of study participants. The ESS plays a crucial role in estimating the Expected Value of Sample Information (EVSI) through the Gaussian approximation approach. Despite the significance of ESS, existing ESS estimation methods within the Gaussian approximation framework are… ▽ More The effective sample size (ESS) measures the informational value of a probability distribution in terms of an equivalent number of study participants. The ESS plays a crucial role in estimating the Expected Value of Sample Information (EVSI) through the Gaussian approximation approach. Despite the significance of ESS, existing ESS estimation methods within the Gaussian approximation framework are either computationally expensive or potentially inaccurate. To address these limitations, we propose a novel approach that estimates the ESS using the summary statistics of generated datasets and nonparametric regression methods. The simulation results suggest that the proposed method provides accurate ESS estimates at a low computational cost, making it an efficient and practical way to quantify the information contained in the probability distribution of a parameter. Overall, determining the ESS can help analysts understand the uncertainty levels in complex prior distributions in the probability analyses of decision models and perform efficient EVSI calculations. △ Less

Submitted 29 January, 2024; originally announced January 2024.

Comments: 5 pages

arXiv:2401.12369 [pdf, other]

SubgroupTE: Advancing Treatment Effect Estimation with Subgroup Identification

Authors: Seungyeon Lee, Ruoqi Liu, Wenyu Song, Lang Li, ** Zhang

Abstract: Precise estimation of treatment effects is crucial for evaluating intervention effectiveness. While deep learning models have exhibited promising performance in learning counterfactual representations for treatment effect estimation (TEE), a major limitation in most of these models is that they treat the entire population as a homogeneous group, overlooking the diversity of treatment effects acros… ▽ More Precise estimation of treatment effects is crucial for evaluating intervention effectiveness. While deep learning models have exhibited promising performance in learning counterfactual representations for treatment effect estimation (TEE), a major limitation in most of these models is that they treat the entire population as a homogeneous group, overlooking the diversity of treatment effects across potential subgroups that have varying treatment effects. This limitation restricts the ability to precisely estimate treatment effects and provide subgroup-specific treatment recommendations. In this paper, we propose a novel treatment effect estimation model, named SubgroupTE, which incorporates subgroup identification in TEE. SubgroupTE identifies heterogeneous subgroups with different treatment responses and more precisely estimates treatment effects by considering subgroup-specific causal effects. In addition, SubgroupTE iteratively optimizes subgrou** and treatment effect estimation networks to enhance both estimation and subgroup identification. Comprehensive experiments on the synthetic and semi-synthetic datasets exhibit the outstanding performance of SubgroupTE compared with the state-of-the-art models on treatment effect estimation. Additionally, a real-world study demonstrates the capabilities of SubgroupTE in enhancing personalized treatment recommendations for patients with opioid use disorder (OUD) by advancing treatment effect estimation with subgroup identification. △ Less

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2312.16769 [pdf, other]

Estimation and Inference for High-dimensional Multi-response Growth Curve Model

Authors: Xin Zhou, Yin Xia, Lexin Li

Abstract: A growth curve model (GCM) aims to characterize how an outcome variable evolves, develops and grows as a function of time, along with other predictors. It provides a particularly useful framework to model growth trend in longitudinal data. However, the estimation and inference of GCM with a large number of response variables faces numerous challenges, and remains underdeveloped. In this article, w… ▽ More A growth curve model (GCM) aims to characterize how an outcome variable evolves, develops and grows as a function of time, along with other predictors. It provides a particularly useful framework to model growth trend in longitudinal data. However, the estimation and inference of GCM with a large number of response variables faces numerous challenges, and remains underdeveloped. In this article, we study the high-dimensional multivariate-response linear GCM, and develop the corresponding estimation and inference procedures. Our proposal is far from a straightforward extension, and involves several innovative components. Specifically, we introduce a Kronecker product structure, which allows us to effectively decompose a very large covariance matrix, and to pool the correlated samples to improve the estimation accuracy. We devise a highly non-trivial multi-step estimation approach to estimate the individual covariance components separately and effectively. We also develop rigorous statistical inference procedures to test both the global effects and the individual effects, and establish the size and power properties, as well as the proper false discovery control. We demonstrate the effectiveness of the new method through both intensive simulations, and the analysis of a longitudinal neuroimaging data for Alzheimer's disease. △ Less

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2312.05756 [pdf]

A quantitative fusion strategy of stock picking and timing based on Particle Swarm Optimized-Back Propagation Neural Network and Multivariate Gaussian-Hidden Markov Model

Authors: Huajian Li, Longjian Li, Jiajian Liang, Weinan Dai

Abstract: In recent years, machine learning (ML) has brought effective approaches and novel techniques to economic decision, investment forecasting, and risk management, etc., co** the variable and intricate nature of economic and financial environments. For the investment in stock market, this research introduces a pioneering quantitative fusion model combining stock timing and picking strategy by levera… ▽ More In recent years, machine learning (ML) has brought effective approaches and novel techniques to economic decision, investment forecasting, and risk management, etc., co** the variable and intricate nature of economic and financial environments. For the investment in stock market, this research introduces a pioneering quantitative fusion model combining stock timing and picking strategy by leveraging the Multivariate Gaussian-Hidden Markov Model (MGHMM) and Back Propagation Neural Network optimized by Particle Swarm (PSO-BPNN). After the information coefficients (IC) between fifty-two factors that have been winsorized, neutralized and standardized and the return of CSI 300 index are calculated, a given amount of factors that rank ahead are choose to be candidate factors heading for the input of PSO-BPNN after dimension reduction by Principal Component Analysis (PCA), followed by a certain amount of constituent stocks outputted. Subsequently, we conduct the prediction and trading on the basis of the screening stocks and stock market state outputted by MGHMM trained using inputting CSI 300 index data after Box-Cox transformation, bespeaking eximious performance during the period of past four years. Ultimately, some conventional forecast and trading methods are compared with our strategy in Chinese stock market. Our fusion strategy incorporating stock picking and timing presented in this article provide a innovative technique for financial analysis. △ Less

Submitted 22 December, 2023; v1 submitted 9 December, 2023; originally announced December 2023.

Comments: 12 pages, 6 figures, 4 tables, 26 references

arXiv:2311.11543 [pdf, other]

A Comparison of Parameter Estimation Methods for Shared Frailty Models

Authors: Tingxuan Wu, Cindy Feng, Longhai Li

Abstract: This paper compares six different parameter estimation methods for shared frailty models via a series of simulation studies. A shared frailty model is a survival model that incorporates a random effect term, where the frailties are common or shared among individuals within specific groups. Several parameter estimation methods are available for fitting shared frailty models, such as penalized parti… ▽ More This paper compares six different parameter estimation methods for shared frailty models via a series of simulation studies. A shared frailty model is a survival model that incorporates a random effect term, where the frailties are common or shared among individuals within specific groups. Several parameter estimation methods are available for fitting shared frailty models, such as penalized partial likelihood (PPL), expectation-maximization (EM), pseudo full likelihood (PFL), hierarchical likelihood (HL), maximum marginal likelihood (MML), and maximization penalized likelihood (MPL) algorithms. These estimation methods are implemented in various R packages, providing researchers with various options for analyzing clustered survival data using shared frailty models. However, there is a limited amount of research comparing the performance of these parameter estimation methods for fitting shared frailty models. Consequently, it can be challenging for users to determine the most appropriate method for analyzing clustered survival data. To address this gap, this paper aims to conduct a series of simulation studies to compare the performance of different parameter estimation methods implemented in R packages. We will evaluate several key aspects, including parameter estimation, bias and variance of the parameter estimates, rate of convergence, and computational time required by each package. Through this systematic evaluation, our goal is to provide a comprehensive understanding of the advantages and limitations associated with each estimation method. △ Less

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2311.00878 [pdf, other]

Backward Joint Model for Dynamic Prediction using Multivariate Longitudinal and Competing Risk Data

Authors: Wenhao Li, Liang Li, Brad C. Astor, Wei Yang, Tom H. Greene

Abstract: Joint modeling is a useful approach to dynamic prediction of clinical outcomes using longitudinally measured predictors. When the outcomes are competing risk events, fitting the conventional shared random effects joint model often involves intensive computation, especially when multiple longitudinal biomarkers are be used as predictors, as is often desired in prediction problems. Motivated by a lo… ▽ More Joint modeling is a useful approach to dynamic prediction of clinical outcomes using longitudinally measured predictors. When the outcomes are competing risk events, fitting the conventional shared random effects joint model often involves intensive computation, especially when multiple longitudinal biomarkers are be used as predictors, as is often desired in prediction problems. Motivated by a longitudinal cohort study of chronic kidney disease, this paper proposes a new joint model for the dynamic prediction of end-stage renal disease with the competing risk of death. The model factorizes the likelihood into the distribution of the competing risks data and the distribution of longitudinal data given the competing risks data. The estimation with the EM algorithm is efficient, stable and fast, with a one-dimensional integral in the E-step and convex optimization for most parameters in the M-step, regardless of the number of longitudinal predictors. The model also comes with a consistent albeit less efficient estimation method that can be quickly implemented with standard software, ideal for model building and diagnotics. This model enables the prediction of future longitudinal data trajectories conditional on being at risk at a future time, a practically significant problem that has not been studied in the statistical literature. We study the properties of the proposed method using simulations and a real dataset and compare its performance with the shared random effects joint model. △ Less

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.19360 [pdf, other]

Balance, Imbalance, and Rebalance: Understanding Robust Overfitting from a Minimax Game Perspective

Authors: Yifei Wang, Liangchen Li, Jiansheng Yang, Zhouchen Lin, Yisen Wang

Abstract: Adversarial Training (AT) has become arguably the state-of-the-art algorithm for extracting robust features. However, researchers recently notice that AT suffers from severe robust overfitting problems, particularly after learning rate (LR) decay. In this paper, we explain this phenomenon by viewing adversarial training as a dynamic minimax game between the model trainer and the attacker. Specific… ▽ More Adversarial Training (AT) has become arguably the state-of-the-art algorithm for extracting robust features. However, researchers recently notice that AT suffers from severe robust overfitting problems, particularly after learning rate (LR) decay. In this paper, we explain this phenomenon by viewing adversarial training as a dynamic minimax game between the model trainer and the attacker. Specifically, we analyze how LR decay breaks the balance between the minimax game by empowering the trainer with a stronger memorization ability, and show such imbalance induces robust overfitting as a result of memorizing non-robust features. We validate this understanding with extensive experiments, and provide a holistic view of robust overfitting from the dynamics of both the two game players. This understanding further inspires us to alleviate robust overfitting by rebalancing the two players by either regularizing the trainer's capacity or improving the attack strength. Experiments show that the proposed ReBalanced Adversarial Training (ReBAT) can attain good robustness and does not suffer from robust overfitting even after very long training. Code is available at https://github.com/PKU-ML/ReBAT. △ Less

Submitted 30 October, 2023; originally announced October 2023.

Comments: Accepted by NeurIPS 2023

arXiv:2310.16203 [pdf, other]

Multivariate Dynamic Mediation Analysis under a Reinforcement Learning Framework

Authors: Lan Luo, Chengchun Shi, Jitao Wang, Zhenke Wu, Lexin Li

Abstract: Mediation analysis is an important analytic tool commonly used in a broad range of scientific applications. In this article, we study the problem of mediation analysis when there are multivariate and conditionally dependent mediators, and when the variables are observed over multiple time points. The problem is challenging, because the effect of a mediator involves not only the path from the treat… ▽ More Mediation analysis is an important analytic tool commonly used in a broad range of scientific applications. In this article, we study the problem of mediation analysis when there are multivariate and conditionally dependent mediators, and when the variables are observed over multiple time points. The problem is challenging, because the effect of a mediator involves not only the path from the treatment to this mediator itself at the current time point, but also all possible paths pointed to this mediator from its upstream mediators, as well as the carryover effects from all previous time points. We propose a novel multivariate dynamic mediation analysis approach. Drawing inspiration from the Markov decision process model that is frequently employed in reinforcement learning, we introduce a Markov mediation process paired with a system of time-varying linear structural equation models to formulate the problem. We then formally define the individual mediation effect, built upon the idea of simultaneous interventions and intervention calculus. We next derive the closed-form expression and propose an iterative estimation procedure under the Markov mediation process model. We study both the asymptotic property and the empirical performance of the proposed estimator, and further illustrate our method with a mobile health application. △ Less

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.07973 [pdf, other]

Statistical Performance Guarantee for Subgroup Identification with Generic Machine Learning

Authors: Michael Lingzhi Li, Kosuke Imai

Abstract: Across a wide array of disciplines, many researchers use machine learning (ML) algorithms to identify a subgroup of individuals who are likely to benefit from a treatment the most (``exceptional responders'') or those who are harmed by it. A common approach to this subgroup identification problem consists of two steps. First, researchers estimate the conditional average treatment effect (CATE) usi… ▽ More Across a wide array of disciplines, many researchers use machine learning (ML) algorithms to identify a subgroup of individuals who are likely to benefit from a treatment the most (``exceptional responders'') or those who are harmed by it. A common approach to this subgroup identification problem consists of two steps. First, researchers estimate the conditional average treatment effect (CATE) using an ML algorithm. Next, they use the estimated CATE to select those individuals who are predicted to be most affected by the treatment, either positively or negatively. Unfortunately, CATE estimates are often biased and noisy. In addition, utilizing the same data to both identify a subgroup and estimate its group average treatment effect results in a multiple testing problem. To address these challenges, we develop uniform confidence bands for estimation of the group average treatment effect sorted by generic ML algorithm (GATES). Using these uniform confidence bands, researchers can identify, with a statistical guarantee, a subgroup whose GATES exceeds a certain effect size, regardless of how this effect size is chosen. The validity of the proposed methodology depends solely on randomization of treatment and random sampling of units. Importantly, our method does not require modeling assumptions and avoids a computationally intensive resampling procedure. A simulation study shows that the proposed uniform confidence bands are reasonably informative and have an appropriate empirical coverage even when the sample size is as small as 100. We analyze a clinical trial of late-stage prostate cancer and find a relatively large proportion of exceptional responders. △ Less

Submitted 20 December, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2306.11906 [pdf]

Statistical thinking in simulation design: a continuing conversation on the balancing intercept problem

Authors: Boyi Guo, Linzi Li, Jacqueline E. Rudolph

Abstract: Epidemiologists have a growing interest in employing computational approaches to solve analytic problems, with simulation being arguably the most accessible among all approaches. While previous literature discussed the utility of simulation and demonstrated how to carry out them, few have focused on connecting underlying statistical concepts to these simulation approaches, creating gaps between th… ▽ More Epidemiologists have a growing interest in employing computational approaches to solve analytic problems, with simulation being arguably the most accessible among all approaches. While previous literature discussed the utility of simulation and demonstrated how to carry out them, few have focused on connecting underlying statistical concepts to these simulation approaches, creating gaps between theory and application. Based on the recent series of discussions on the balancing intercept, we explain the growing complexity when generalizing the balancing intercept to a wider class of simulations and revise the closed-form equation for the balancing intercept under assumptions. The discussion can broadly inform the future design of more complex simulations and emphasize the importance of applying statistical thinking in the new era of computational science. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: 9 pages including title and abstract pages, references and supporting material, 1 figure

arXiv:2306.05436 [pdf, other]

Remaining Useful Life Modelling with an Escalator Health Condition Analytic System

Authors: Inez M. Zwetsloot, Yu Lin, Jiaqi Qiu, Lishuai Li, William Ka Fai Lee, Edmond Yin San Yeung, Colman Yiu Wah Yeung, Chris Chun Long Wong

Abstract: The refurbishment of an escalator is usually linked with its design life as recommended by the manufacturer. However, the actual useful life of an escalator should be determined by its operating condition which is affected by the runtime, workload, maintenance quality, vibration, etc., rather than age only. The objective of this project is to develop a comprehensive health condition analytic syste… ▽ More The refurbishment of an escalator is usually linked with its design life as recommended by the manufacturer. However, the actual useful life of an escalator should be determined by its operating condition which is affected by the runtime, workload, maintenance quality, vibration, etc., rather than age only. The objective of this project is to develop a comprehensive health condition analytic system for escalators to support refurbishment decisions. The analytic system consists of four parts: 1) online data gathering and processing; 2) a dashboard for condition monitoring; 3) a health index model; and 4) remaining useful life prediction. The results can be used for a) predicting the remaining useful life of the escalators, in order to support asset replacement planning and b) monitoring the real-time condition of escalators; including alerts when vibration exceeds the threshold and signal diagnosis, giving an indication of possible root cause (components) of the alert signal. △ Less

Submitted 7 June, 2023; originally announced June 2023.

Comments: 14 pages, 12 figures, 7 tables

arXiv:2305.20028 [pdf, other]

A Study of Bayesian Neural Network Surrogates for Bayesian Optimization

Authors: Yucen Lily Li, Tim G. J. Rudner, Andrew Gordon Wilson

Abstract: Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practic… ▽ More Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practical function approximators, with many benefits over standard GPs such as the ability to naturally handle non-stationarity and learn representations for high-dimensional data. In this paper, we study BNNs as alternatives to standard GP surrogates for optimization. We consider a variety of approximate inference procedures for finite-width BNNs, including high-quality Hamiltonian Monte Carlo, low-cost stochastic MCMC, and heuristics such as deep ensembles. We also consider infinite-width BNNs, linearized Laplace approximations, and partially stochastic models such as deep kernel learning. We evaluate this collection of surrogate models on diverse problems with varying dimensionality, number of objectives, non-stationarity, and discrete and continuous inputs. We find: (i) the ranking of methods is highly problem dependent, suggesting the need for tailored inductive biases; (ii) HMC is the most successful approximate inference procedure for fully stochastic BNNs; (iii) full stochasticity may be unnecessary as deep kernel learning is relatively competitive; (iv) deep ensembles perform relatively poorly; (v) infinite-width BNNs are particularly promising, especially in high dimensions. △ Less

Submitted 8 May, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: ICLR 2024. Code available at https://github.com/yucenli/bnn-bo

arXiv:2305.19244 [pdf, other]

Testing for the Markov Property in Time Series via Deep Conditional Generative Learning

Authors: Yunzhe Zhou, Chengchun Shi, Lexin Li, Qiwei Yao

Abstract: The Markov property is widely imposed in analysis of time series data. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, are of paramount importance. In this article, we propose a nonparametric test for the Markov property in high-dimensional time series via deep conditional generative learning. We also apply the test sequentially to determine the… ▽ More The Markov property is widely imposed in analysis of time series data. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, are of paramount importance. In this article, we propose a nonparametric test for the Markov property in high-dimensional time series via deep conditional generative learning. We also apply the test sequentially to determine the order of the Markov model. We show that the test controls the type-I error asymptotically, and has the power approaching one. Our proposal makes novel contributions in several ways. We utilize and extend state-of-the-art deep generative learning to estimate the conditional density functions, and establish a sharp upper bound on the approximation error of the estimators. We derive a doubly robust test statistic, which employs a nonparametric estimation but achieves a parametric convergence rate. We further adopt sample splitting and cross-fitting to minimize the conditions required to ensure the consistency of the test. We demonstrate the efficacy of the test through both simulations and the three data applications. △ Less

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.15751 [pdf, other]

High-dimensional Response Growth Curve Modeling for Longitudinal Neuroimaging Analysis

Authors: Lu Wang, Xiang Lyu, Zhengwu Zhang, Lexin Li

Abstract: There is increasing interest in modeling high-dimensional longitudinal outcomes in applications such as developmental neuroimaging research. Growth curve model offers a useful tool to capture both the mean growth pattern across individuals, as well as the dynamic changes of outcomes over time within each individual. However, when the number of outcomes is large, it becomes challenging and often in… ▽ More There is increasing interest in modeling high-dimensional longitudinal outcomes in applications such as developmental neuroimaging research. Growth curve model offers a useful tool to capture both the mean growth pattern across individuals, as well as the dynamic changes of outcomes over time within each individual. However, when the number of outcomes is large, it becomes challenging and often infeasible to tackle the large covariance matrix of the random effects involved in the model. In this article, we propose a high-dimensional response growth curve model, with three novel components: a low-rank factor model structure that substantially reduces the number of parameters in the large covariance matrix, a re-parameterization formulation coupled with a sparsity penalty that selects important fixed and random effect terms, and a computational trick that turns the inversion of a large matrix into the inversion of a stack of small matrices and thus considerably speeds up the computation. We develop an efficient expectation-maximization type estimation algorithm, and demonstrate the competitive performance of the proposed method through both simulations and a longitudinal study of brain structural connectivity in association with human immunodeficiency virus. △ Less

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.11908 [pdf, other]

Sequential Best-Arm Identification with Application to Brain-Computer Interface

Authors: Xin Zhou, Botao Hao, Jian Kang, Tor Lattimore, Lexin Li

Abstract: A brain-computer interface (BCI) is a technology that enables direct communication between the brain and an external device or computer system. It allows individuals to interact with the device using only their thoughts, and holds immense potential for a wide range of applications in medicine, rehabilitation, and human augmentation. An electroencephalogram (EEG) and event-related potential (ERP)-b… ▽ More A brain-computer interface (BCI) is a technology that enables direct communication between the brain and an external device or computer system. It allows individuals to interact with the device using only their thoughts, and holds immense potential for a wide range of applications in medicine, rehabilitation, and human augmentation. An electroencephalogram (EEG) and event-related potential (ERP)-based speller system is a type of BCI that allows users to spell words without using a physical keyboard, but instead by recording and interpreting brain signals under different stimulus presentation paradigms. Conventional non-adaptive paradigms treat each word selection independently, leading to a lengthy learning process. To improve the sampling efficiency, we cast the problem as a sequence of best-arm identification tasks in multi-armed bandits. Leveraging pre-trained large language models (LLMs), we utilize the prior knowledge learned from previous tasks to inform and facilitate subsequent tasks. To do so in a coherent way, we propose a sequential top-two Thompson sampling (STTS) algorithm under the fixed-confidence setting and the fixed-budget setting. We study the theoretical property of the proposed algorithm, and demonstrate its substantial empirical improvement through both synthetic data analysis as well as a P300 BCI speller simulator example. △ Less

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2305.05890 [pdf, other]

CUTS+: High-dimensional Causal Discovery from Irregular Time-series

Authors: Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Qin Zhong, **li Suo, Kunlun He

Abstract: Causal discovery in time-series is a fundamental problem in the machine learning community, enabling causal reasoning and decision-making in complex scenarios. Recently, researchers successfully discover causality by combining neural networks with Granger causality, but their performances degrade largely when encountering high-dimensional data because of the highly redundant network design and hug… ▽ More Causal discovery in time-series is a fundamental problem in the machine learning community, enabling causal reasoning and decision-making in complex scenarios. Recently, researchers successfully discover causality by combining neural networks with Granger causality, but their performances degrade largely when encountering high-dimensional data because of the highly redundant network design and huge causal graphs. Moreover, the missing entries in the observations further hamper the causal structural learning. To overcome these limitations, We propose CUTS+, which is built on the Granger-causality-based causal discovery method CUTS and raises the scalability by introducing a technique called Coarse-to-fine-discovery (C2FD) and leveraging a message-passing-based graph neural network (MPGNN). Compared to previous methods on simulated, quasi-real, and real datasets, we show that CUTS+ largely improves the causal discovery performance on high-dimensional data with different types of irregular sampling. △ Less

Submitted 16 August, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

Comments: Submit to AAAI-24

arXiv:2304.08184 [pdf, other]

Adjustment with Many Regressors Under Covariate-Adaptive Randomizations

Authors: Liang Jiang, Liyao Li, Ke Miao, Yichong Zhang

Abstract: Our paper discovers a new trade-off of using regression adjustments (RAs) in causal inference under covariate-adaptive randomizations (CARs). On one hand, RAs can improve the efficiency of causal estimators by incorporating information from covariates that are not used in the randomization. On the other hand, RAs can degrade estimation efficiency due to their estimation errors, which are not asymp… ▽ More Our paper discovers a new trade-off of using regression adjustments (RAs) in causal inference under covariate-adaptive randomizations (CARs). On one hand, RAs can improve the efficiency of causal estimators by incorporating information from covariates that are not used in the randomization. On the other hand, RAs can degrade estimation efficiency due to their estimation errors, which are not asymptotically negligible when the number of regressors is of the same order as the sample size. Ignoring the estimation errors of RAs may result in serious over-rejection of causal inference under the null hypothesis. To address the issue, we construct a new ATE estimator by optimally linearly combining the adjusted and unadjusted estimators. We then develop a unified inference theory for this estimator under CARs. It has two features: (1) the Wald test based on it achieves the exact asymptotic size under the null hypothesis, regardless of whether the number of covariates is fixed or diverges no faster than the sample size; and (2) it guarantees weak efficiency improvement over both the adjusted and unadjusted estimators. △ Less

Submitted 6 February, 2024; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: 75 pages, including appendix

arXiv:2303.18163 [pdf, other]

Robust Tensor Factor Analysis

Authors: Matteo Barigozzi, Yong He, Lingxiao Li, Lorenzo Trapani

Abstract: We consider (robust) inference in the context of a factor model for tensor-valued sequences. We study the consistency of the estimated common factors and loadings space when using estimators based on minimising quadratic loss functions. Building on the observation that such loss functions are adequate only if sufficiently many moments exist, we extend our results to the case of heavy-tailed distri… ▽ More We consider (robust) inference in the context of a factor model for tensor-valued sequences. We study the consistency of the estimated common factors and loadings space when using estimators based on minimising quadratic loss functions. Building on the observation that such loss functions are adequate only if sufficiently many moments exist, we extend our results to the case of heavy-tailed distributions by considering estimators based on minimising the Huber loss function, which uses an $L_{1}$ -norm weight on outliers. We show that such class of estimators is robust to the presence of heavy tails, even when only the second moment of the data exists. We also propose a modified version of the eigenvalue-ratio principle to estimate the dimensions of the core tensor and show the consistency of the resultant estimators without any condition on the relative rates of divergence of the sample size and dimensions. Extensive numerical studies are conducted to show the advantages of the proposed methods over the state-of-the-art ones especially under the heavy-tailed cases. An import/export dataset of a variety of commodities across multiple countries is analyzed to show the practical usefulness of the proposed robust estimation procedure. An R package ``RTFA" implementing the proposed methods is available on R CRAN. △ Less

Submitted 27 August, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

arXiv:2303.09616 [pdf, other]

Cross-validatory Z-Residual for Diagnosing Shared Frailty Models

Authors: Tingxuan Wu, Cindy Feng, Longhai Li

Abstract: Residual diagnostic methods play a critical role in assessing model assumptions and detecting outliers in statistical modelling. In the context of survival models with censored observations, Li et al. (2021) introduced the Z-residual, which follows an approximately normal distribution under the true model. This property makes it possible to use Z-residuals for diagnosing survival models in a way s… ▽ More Residual diagnostic methods play a critical role in assessing model assumptions and detecting outliers in statistical modelling. In the context of survival models with censored observations, Li et al. (2021) introduced the Z-residual, which follows an approximately normal distribution under the true model. This property makes it possible to use Z-residuals for diagnosing survival models in a way similar to how Pearson residuals are used in normal regression. However, computing residuals based on the full dataset can result in a conservative bias that reduces the power of detecting model mis-specification, as the same dataset is used for both model fitting and validation. Although cross-validation is a potential solution to this problem, it has not been commonly used in residual diagnostics due to computational challenges. In this paper, we propose a cross-validation approach for computing Z-residuals in the context of shared frailty models. Specifically, we develop a general function that calculates cross-validatory Z-residuals using the output from the \texttt{coxph} function in the \texttt{survival} package in R.Our simulation studies demonstrate that, for goodness-of-fit tests and outlier detection, cross-validatory Z-residuals are significantly more powerful and more discriminative than Z-residuals without cross-validation. We also compare the performance of Z-residuals with and without cross-validation in identifying outliers in a real application that models the recurrence time of kidney infection patients. Our findings suggest that cross-validatory Z-residuals can identify outliers that are missed by Z-residuals without cross-validation. △ Less

Submitted 16 March, 2023; originally announced March 2023.

Comments: 32 pages, 14 figures

arXiv:2303.02817 [pdf, other]

Huber Principal Component Analysis for Large-dimensional Factor Models

Authors: Yong He, Lingxiao Li, Dong Liu, Wen-Xin Zhou

Abstract: Factor models have been widely used in economics and finance. However, the heavy-tailed nature of macroeconomic and financial data is often neglected in the existing literature. To address this issue and achieve robustness, we propose an approach to estimate factor loadings and scores by minimizing the Huber loss function, which is motivated by the equivalence of conventional Principal Component A… ▽ More Factor models have been widely used in economics and finance. However, the heavy-tailed nature of macroeconomic and financial data is often neglected in the existing literature. To address this issue and achieve robustness, we propose an approach to estimate factor loadings and scores by minimizing the Huber loss function, which is motivated by the equivalence of conventional Principal Component Analysis (PCA) and the constrained least squares method in the factor model. We provide two algorithms that use different penalty forms. The first algorithm, which we refer to as Huber PCA, minimizes the $\ell_2$-norm-type Huber loss and performs PCA on the weighted sample covariance matrix. The second algorithm involves an element-wise type Huber loss minimization, which can be solved by an iterative Huber regression algorithm. Our study examines the theoretical minimizer of the element-wise Huber loss function and demonstrates that it has the same convergence rate as conventional PCA when the idiosyncratic errors have bounded second moments. We also derive their asymptotic distributions under mild conditions. Moreover, we suggest a consistent model selection criterion that relies on rank minimization to estimate the number of factors robustly. We showcase the benefits of Huber PCA through extensive numerical experiments and a real financial portfolio selection example. An R package named ``HDRFA" has been developed to implement the proposed robust factor analysis. △ Less

Submitted 29 March, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

arXiv:2302.09106 [pdf, other]

Z-residual diagnostics for detecting misspecification of the functional form of covariates for shared frailty models

Authors: Tingxuan Wu, Longhai Li, Cindy Feng

Abstract: In survival analysis, the hazard function often depends on a set of covariates. Martingale and deviance residual are most widely used for examining the validity of the function form of covariates by checking whether there is a discernible trend in their scatterplot against continuous covariates. However, visual inspection of martingale and deviance residuals is often subjective. In addition, these… ▽ More In survival analysis, the hazard function often depends on a set of covariates. Martingale and deviance residual are most widely used for examining the validity of the function form of covariates by checking whether there is a discernible trend in their scatterplot against continuous covariates. However, visual inspection of martingale and deviance residuals is often subjective. In addition, these residuals lack a reference distribution due to censoring. It is therefore challenging to derive numerical statistical tests based on martingale or deviance residuals. In this paper, we extend the idea of randomized survival probability (Li et al. 2021) and develop a residual diagnostic tool that can provide both graphical and numerical tests for checking the covariate functional form in semi-parametric shared frailty models. We develop a general function that calculates Z-residuals for semi-parametric shared frailty models based on the output from the \texttt{coxph} function in the \texttt{survival} package in R. Our extensive simulation studies indicate that the derived numerical test based on Z-residuals has great power for checking the functional form of covariates. In a real data application on modelling the survival time of acute myeloid leukemia patients, the Z-residual diagnosis results show that a model with log-transformation is inappropriate for modelling the survival time, which could not be detected by other diagnostic methods. △ Less

Submitted 17 February, 2023; originally announced February 2023.

Comments: 21 pages, 7 figures

arXiv:2301.09231 [pdf, other]

GP-NAS-ensemble: a model for NAS Performance Prediction

Authors: Kunlong Chen, Liu Yang, Yitian Chen, Kun** Chen, Yidan Xu, Lujun Li

Abstract: It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We… ▽ More It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track. △ Less

Submitted 22 January, 2023; originally announced January 2023.

arXiv:2301.08353 [pdf, other]

AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction

Authors: YaChen Yan, Liubo Li

Abstract: Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that diff… ▽ More Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of the feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate the dynamic early exiting mechanism for feature interaction depth selection. The AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models. △ Less

Submitted 6 January, 2023; originally announced January 2023.

Comments: arXiv admin note: text overlap with arXiv:2301.01089, arXiv:2301.08139

arXiv:2301.06769 [pdf, ps, other]

Geometric ergodicity of SGLD via reflection coupling

Authors: Lei Li, Jian-Guo Liu, Yuliang Wang

Abstract: We consider the geometric ergodicity of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm under nonconvexity settings. Via the technique of reflection coupling, we prove the Wasserstein contraction of SGLD when the target distribution is log-concave only outside some compact set. The time discretization and the minibatch in SGLD introduce several difficulties when applying the reflection… ▽ More We consider the geometric ergodicity of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm under nonconvexity settings. Via the technique of reflection coupling, we prove the Wasserstein contraction of SGLD when the target distribution is log-concave only outside some compact set. The time discretization and the minibatch in SGLD introduce several difficulties when applying the reflection coupling, which are addressed by a series of careful estimates of conditional expectations. As a direct corollary, the SGLD with constant step size has an invariant distribution and we are able to obtain its geometric ergodicity in terms of $W_1$ distance. The generalization to non-gradient drifts is also included. △ Less

Submitted 17 January, 2023; originally announced January 2023.

arXiv:2301.01089 [pdf, other]

xDeepInt: a hybrid architecture for modeling the vector-wise and bit-wise feature interactions

Authors: YaChen Yan, Liubo Li

Abstract: Learning feature interactions is the key to success for the large-scale CTR prediction and recommendation. In practice, handcrafted feature engineering usually requires exhaustive searching. In order to reduce the high cost of human efforts in feature engineering, researchers propose several deep neural networks (DNN)-based approaches to learn the feature interactions in an end-to-end fashion. How… ▽ More Learning feature interactions is the key to success for the large-scale CTR prediction and recommendation. In practice, handcrafted feature engineering usually requires exhaustive searching. In order to reduce the high cost of human efforts in feature engineering, researchers propose several deep neural networks (DNN)-based approaches to learn the feature interactions in an end-to-end fashion. However, existing methods either do not learn both vector-wise interactions and bit-wise interactions simultaneously, or fail to combine them in a controllable manner. In this paper, we propose a new model, xDeepInt, based on a novel network architecture called polynomial interaction network (PIN) which learns higher-order vector-wise interactions recursively. By integrating subspace-crossing mechanism, we enable xDeepInt to balance the mixture of vector-wise and bit-wise feature interactions at a bounded order. Based on the network architecture, we customize a combined optimization strategy to conduct feature selection and interaction selection. We implement the proposed model and evaluate the model performance on three real-world datasets. Our experiment results demonstrate the efficacy and effectiveness of xDeepInt over state-of-the-art models. We open-source the TensorFlow implementation of xDeepInt: https://github.com/yanyachen/xDeepInt. △ Less

Submitted 3 January, 2023; originally announced January 2023.

arXiv:2211.00249 [pdf, other]

Robust Direct Learning for Causal Data Fusion

Authors: Xinyu Li, Yilin Li, Qing Cui, Longfei Li, Jun Zhou

Abstract: In the era of big data, the explosive growth of multi-source heterogeneous data offers many exciting challenges and opportunities for improving the inference of conditional average treatment effects. In this paper, we investigate homogeneous and heterogeneous causal data fusion problems under a general setting that allows for the presence of source-specific covariates. We provide a direct learning… ▽ More In the era of big data, the explosive growth of multi-source heterogeneous data offers many exciting challenges and opportunities for improving the inference of conditional average treatment effects. In this paper, we investigate homogeneous and heterogeneous causal data fusion problems under a general setting that allows for the presence of source-specific covariates. We provide a direct learning framework for integrating multi-source data that separates the treatment effect from other nuisance functions, and achieves double robustness against certain misspecification. To improve estimation precision and stability, we propose a causal information-aware weighting function motivated by theoretical insights from the semiparametric efficiency theory; it assigns larger weights to samples containing more causal information with high interpretability. We introduce a two-step algorithm, the weighted multi-source direct learner, based on constructing a pseudo-outcome and regressing it on covariates under a weighted least square criterion; it offers us a powerful tool for causal data fusion, enjoying the advantages of easy implementation, double robustness and model flexibility. In simulation studies, we demonstrate the effectiveness of our proposed methods in both homogeneous and heterogeneous causal data fusion scenarios. △ Less

Submitted 31 October, 2022; originally announced November 2022.

Comments: 16 pages, 2 figures. Accepted for presentation at the 14th Asian Conference on Machine Learning (ACML 2022), and for publication in Proceedings of Machine Learning Research, Volume 189

arXiv:2210.14420 [pdf, other]

Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach

Authors: Yunzhe Zhou, Zhengling Qi, Chengchun Shi, Lexin Li

Abstract: In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on… ▽ More In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of the methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus we do not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis model to Bayesian neural network model. We develop the computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms the existing state-of-the-art solutions through both simulations and a real data example. △ Less

Submitted 21 February, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: 18 pages, 6 figures. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023

arXiv:2210.13400 [pdf, other]

Sampling with Mollified Interaction Energy Descent

Authors: Lingxiao Li, Qiang Liu, Anna Korba, Mikhail Yurochkin, Justin Solomon

Abstract: Sampling from a target measure whose density is only known up to a normalization constant is a fundamental problem in computational statistics and machine learning. In this paper, we present a new optimization-based method for sampling called mollified interaction energy descent (MIED). MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs). The… ▽ More Sampling from a target measure whose density is only known up to a normalization constant is a fundamental problem in computational statistics and machine learning. In this paper, we present a new optimization-based method for sampling called mollified interaction energy descent (MIED). MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs). These energies rely on mollifier functions -- smooth approximations of the Dirac delta originated from PDE theory. We show that as the mollifier approaches the Dirac delta, the MIE converges to the chi-square divergence with respect to the target measure and the gradient flow of the MIE agrees with that of the chi-square divergence. Optimizing this energy with proper discretization yields a practical first-order particle-based algorithm for sampling in both unconstrained and constrained domains. We show experimentally that for unconstrained sampling problems our algorithm performs on par with existing particle-based algorithms like SVGD, while for constrained sampling problems our method readily incorporates constrained optimization techniques to handle more flexible constraints with strong performance compared to alternatives. △ Less

Submitted 1 March, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

arXiv:2210.11620 [pdf, other]

LOT: Layer-wise Orthogonal Training on Improving $\ell_2$ Certified Robustness

Authors: Xiaojun Xu, Linyi Li, Bo Li

Abstract: Recent studies show that training deep neural networks (DNNs) with Lipschitz constraints are able to enhance adversarial robustness and other model properties such as stability. In this paper, we propose a layer-wise orthogonal training method (LOT) to effectively train 1-Lipschitz convolution layers via parametrizing an orthogonal matrix with an unconstrained matrix. We then efficiently compute t… ▽ More Recent studies show that training deep neural networks (DNNs) with Lipschitz constraints are able to enhance adversarial robustness and other model properties such as stability. In this paper, we propose a layer-wise orthogonal training method (LOT) to effectively train 1-Lipschitz convolution layers via parametrizing an orthogonal matrix with an unconstrained matrix. We then efficiently compute the inverse square root of a convolution kernel by transforming the input domain to the Fourier frequency domain. On the other hand, as existing works show that semi-supervised training helps improve empirical robustness, we aim to bridge the gap and prove that semi-supervised learning also improves the certified robustness of Lipschitz-bounded models. We conduct comprehensive evaluations for LOT under different settings. We show that LOT significantly outperforms baselines regarding deterministic l2 certified robustness, and scales to deeper neural networks. Under the supervised scenario, we improve the state-of-the-art certified robustness for all architectures (e.g. from 59.04% to 63.50% on CIFAR-10 and from 32.57% to 34.59% on CIFAR-100 at radius rho = 36/255 for 40-layer networks). With semi-supervised learning over unlabelled data, we are able to improve state-of-the-art certified robustness on CIFAR-10 at rho = 108/255 from 36.04% to 42.39%. In addition, LOT consistently outperforms baselines on different model architectures with only 1/3 evaluation time. △ Less

Submitted 26 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: NeurIPS 2022

arXiv:2210.11107 [pdf, other]

Graphical model inference with external network data

Authors: Jack Jewson, Li Li, Laura Battaglia, Stephen Hansen, David Rossell, Piotr Zwiernik

Abstract: We consider two applications where we study how dependence structure between many variables is linked to external network data. We first study the interplay between social media connectedness and the co-evolution of the COVID-19 pandemic across USA counties. We next study study how the dependence between stock market returns across firms relates to similarities in economic and policy indicators fr… ▽ More We consider two applications where we study how dependence structure between many variables is linked to external network data. We first study the interplay between social media connectedness and the co-evolution of the COVID-19 pandemic across USA counties. We next study study how the dependence between stock market returns across firms relates to similarities in economic and policy indicators from text regulatory filings. Both applications are modelled via Gaussian graphical models where one has external network data. We develop spike-and-slab and graphical LASSO frameworks to integrate the network data, both facilitating the interpretation of the graphical model and improving inference. The goal is to detect when the network data relates to the graphical model and, if so, explain how. We found that counties strongly connected on Facebook are more likely to have similar COVID-19 evolution (positive partial correlations), accounting for various factors driving the mean. We also found that the association in stock market returns depends in a stronger fashion on economic than on policy indicators. The examples show that data integration can improve interpretation, statistical accuracy, and out-of-sample prediction, in some instances using significantly sparser graphical models. △ Less

Submitted 13 November, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

arXiv:2210.08326 [pdf, ps, other]

Distributionally Robust Causal Inference with Observational Data

Authors: Dimitris Bertsimas, Kosuke Imai, Michael Lingzhi Li

Abstract: We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then… ▽ More We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology. △ Less

Submitted 2 February, 2023; v1 submitted 15 October, 2022; originally announced October 2022.

arXiv:2210.08031 [pdf, other]

Neural Attentive Circuits

Authors: Nasim Rahaman, Martin Weiss, Francesco Locatello, Chris Pal, Yoshua Bengio, Bernhard Schölkopf, Li Erran Li, Nicolas Ballas

Abstract: Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data us… ▽ More Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on CIFAR and CUBs dataset by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing and text-classification from ASCII bytes, thereby confirming its general purpose nature. △ Less

Submitted 19 October, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: To appear at NeurIPS 2022

arXiv:2210.07536 [pdf, other]

A Reinforcement Learning Approach to Estimating Long-term Treatment Effects

Authors: Ziyang Tang, Yiheng Duan, Stephanie Zhang, Lihong Li

Abstract: Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decisions making in business, healthcare and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation with randomized experiments is that they do not easily extend to measure long-term effects, since running long experiments is time-consumin… ▽ More Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decisions making in business, healthcare and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation with randomized experiments is that they do not easily extend to measure long-term effects, since running long experiments is time-consuming and expensive. In this paper, we take a reinforcement learning (RL) approach that estimates the average reward in a Markov process. Motivated by real-world scenarios where the observed state transition is nonstationary, we develop a new algorithm for a class of nonstationary problems, and demonstrate promising results in two synthetic datasets and one online store dataset. △ Less

Submitted 14 October, 2022; originally announced October 2022.

arXiv:2208.12483 [pdf, other]

Dynamic Regret of Online Markov Decision Processes

Authors: Peng Zhao, Long-Fei Li, Zhi-Hua Zhou

Abstract: We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. The measure is strictly stronger than the standard static regret that benchmarks the learner's performance with a fixed… ▽ More We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. The measure is strictly stronger than the standard static regret that benchmarks the learner's performance with a fixed compared policy. We consider three foundational models of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For these three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees respectively, in which the results for episodic (loop-free) SSP are provably minimax optimal in terms of time horizon and certain non-stationarity measure. Furthermore, when the online environments encountered by the learner are predictable, we design improved algorithms and achieve better dynamic regret bounds for the episodic (loop-free) SSP; and moreover, we demonstrate impossibility results for the infinite-horizon MDPs. △ Less

Submitted 26 August, 2022; originally announced August 2022.

arXiv:2208.05740 [pdf, other]

General Cutting Planes for Bound-Propagation-Based Neural Network Verification

Authors: Huan Zhang, Shiqi Wang, Kaidi Xu, Linyi Li, Bo Li, Suman Jana, Cho-Jui Hsieh, J. Zico Kolter

Abstract: Bound propagation methods, when combined with branch and bound, are among the most effective methods to formally verify properties of deep neural networks such as correctness, robustness, and safety. However, existing works cannot handle the general form of cutting plane constraints widely accepted in traditional solvers, which are crucial for strengthening verifiers with tightened convex relaxati… ▽ More Bound propagation methods, when combined with branch and bound, are among the most effective methods to formally verify properties of deep neural networks such as correctness, robustness, and safety. However, existing works cannot handle the general form of cutting plane constraints widely accepted in traditional solvers, which are crucial for strengthening verifiers with tightened convex relaxations. In this paper, we generalize the bound propagation procedure to allow the addition of arbitrary cutting plane constraints, including those involving relaxed integer variables that do not appear in existing bound propagation formulations. Our generalized bound propagation method, GCP-CROWN, opens up the opportunity to apply general cutting plane methods for neural network verification while benefiting from the efficiency and GPU acceleration of bound propagation methods. As a case study, we investigate the use of cutting planes generated by off-the-shelf mixed integer programming (MIP) solver. We find that MIP solvers can generate high-quality cutting planes for strengthening bound-propagation-based verifiers using our new formulation. Since the branching-focused bound propagation procedure and the cutting-plane-focused MIP solver can run in parallel utilizing different types of hardware (GPUs and CPUs), their combination can quickly explore a large number of branches with strong cutting planes, leading to strong verification performance. Experiments demonstrate that our method is the first verifier that can completely solve the oval20 benchmark and verify twice as many instances on the oval21 benchmark compared to the best tool in VNN-COMP 2021, and also noticeably outperforms state-of-the-art verifiers on a wide range of benchmarks. GCP-CROWN is part of the $α,\!β$-CROWN verifier, the VNN-COMP 2022 winner. Code is available at http://PaperCode.cc/GCP-CROWN △ Less

Submitted 4 December, 2022; v1 submitted 11 August, 2022; originally announced August 2022.

Comments: Accepted by NeurIPS 2022. GCP-CROWN is part of the alpha-beta-CROWN verifier, the VNN-COMP 2022 winner

arXiv:2207.09304 [pdf, ps, other]

A sharp uniform-in-time error estimate for Stochastic Gradient Langevin Dynamics

Authors: Lei Li, Yuliang Wang

Abstract: We establish a sharp uniform-in-time error estimate for the Stochastic Gradient Langevin Dynamics (SGLD), which is a popular sampling algorithm. Under mild assumptions, we obtain a uniform-in-time $O(η^2)$ bound for the KL-divergence between the SGLD iteration and the Langevin diffusion, where $η$ is the step size (or learning rate). Our analysis is also valid for varying step sizes. Based on this… ▽ More We establish a sharp uniform-in-time error estimate for the Stochastic Gradient Langevin Dynamics (SGLD), which is a popular sampling algorithm. Under mild assumptions, we obtain a uniform-in-time $O(η^2)$ bound for the KL-divergence between the SGLD iteration and the Langevin diffusion, where $η$ is the step size (or learning rate). Our analysis is also valid for varying step sizes. Based on this, we are able to obtain an $O(η)$ bound for the distance between the SGLD iteration and the invariant distribution of the Langevin diffusion, in terms of Wasserstein or total variation distances. △ Less

Submitted 21 October, 2022; v1 submitted 19 July, 2022; originally announced July 2022.

arXiv:2207.04922 [pdf, ps, other]

On uniform-in-time diffusion approximation for stochastic gradient descent

Authors: Lei Li, Yuliang Wang

Abstract: The diffusion approximation of stochastic gradient descent (SGD) in current literature is only valid on a finite time interval. In this paper, we establish the uniform-in-time diffusion approximation of SGD, by only assuming that the expected loss is strongly convex and some other mild conditions, without assuming the convexity of each random loss function. The main technique is to establish the e… ▽ More The diffusion approximation of stochastic gradient descent (SGD) in current literature is only valid on a finite time interval. In this paper, we establish the uniform-in-time diffusion approximation of SGD, by only assuming that the expected loss is strongly convex and some other mild conditions, without assuming the convexity of each random loss function. The main technique is to establish the exponential decay rates of the derivatives of the solution to the backward Kolmogorov equation. The uniform-in-time approximation allows us to study asymptotic behaviors of SGD via the continuous stochastic differential equation (SDE) even when the random objective function $f(\cdot;ξ)$ is not strongly convex. △ Less

Submitted 11 July, 2022; originally announced July 2022.

arXiv:2207.03522 [pdf, other]

TF-GNN: Graph Neural Networks in TensorFlow

Authors: Oleksandr Ferludin, Arno Eigenwillig, Martin Blais, Dustin Zelle, Jan Pfeifer, Alvaro Sanchez-Gonzalez, Wai Lok Sibon Li, Sami Abu-El-Haija, Peter Battaglia, Neslihan Bulut, Jonathan Halcrow, Filipe Miguel Gonçalves de Almeida, Pedro Gonnet, Liangze Jiang, Parth Kothari, Silvio Lattanzi, André Linhares, Brandon Mayer, Vahab Mirrokni, John Palowitch, Mihir Paradkar, Jennifer She, Anton Tsitsulin, Kevin Villela, Lisa Wang , et al. (2 additional authors not shown)

Abstract: TensorFlow-GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occurs in today's information ecosystems. In addition to enabling machine learning researchers and advanced developers, TF-GNN offers low-code solutions to empower the broader developer community in graph learning. Many… ▽ More TensorFlow-GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occurs in today's information ecosystems. In addition to enabling machine learning researchers and advanced developers, TF-GNN offers low-code solutions to empower the broader developer community in graph learning. Many production models at Google use TF-GNN, and it has been recently released as an open source project. In this paper we describe the TF-GNN data model, its Keras message passing API, and relevant capabilities such as graph sampling and distributed training. △ Less

Submitted 23 July, 2023; v1 submitted 7 July, 2022; originally announced July 2022.

arXiv:2206.09800 [pdf, other]

Statistical Inference for Large-dimensional Tensor Factor Model by Iterative Projections

Authors: Matteo Barigozzi, Yong He, Lingxiao Li, Lorenzo Trapani

Abstract: Tensor Factor Models (TFM) are appealing dimension reduction tools for high-order large-dimensional tensor time series, and have wide applications in economics, finance and medical imaging. In this paper, we propose a projection estimator for the Tucker-decomposition based TFM, and provide its least-square interpretation which parallels to the least-square interpretation of the Principal Component… ▽ More Tensor Factor Models (TFM) are appealing dimension reduction tools for high-order large-dimensional tensor time series, and have wide applications in economics, finance and medical imaging. In this paper, we propose a projection estimator for the Tucker-decomposition based TFM, and provide its least-square interpretation which parallels to the least-square interpretation of the Principal Component Analysis (PCA) for the vector factor model. The projection technique simultaneously reduces the dimensionality of the signal component and the magnitudes of the idiosyncratic component tensor, thus leading to an increase of the signal-to-noise ratio. We derive a convergence rate of the projection estimator of the loadings and the common factor tensor which are faster than that of the naive PCA-based estimator. Our results are obtained under mild conditions which allow the idiosyncratic components to be weakly cross- and auto- correlated. We also provide a novel iterative procedure based on the eigenvalue-ratio principle to determine the factor numbers. Extensive numerical studies are conducted to investigate the empirical performance of the proposed projection estimators relative to the state-of-the-art ones. △ Less

Submitted 23 April, 2023; v1 submitted 20 June, 2022; originally announced June 2022.

Showing 1–50 of 303 results for author: Li, L