Search | arXiv e-print repository

Operator-informed score matching for Markov diffusion models

Abstract: Diffusion models are typically trained using score matching, yet score matching is agnostic to the particular forward process that defines the model. This paper argues that Markov diffusion models enjoy an advantage over other types of diffusion model, as their associated operators can be exploited to improve the training process. In particular, (i) there exists an explicit formal solution to the forward process as a sequence of time-dependent kernel mean embeddings; and (ii) the derivation of score-matching and related estimators can be streamlined. Building upon (i), we propose Riemannian diffusion kernel smoothing, which ameliorates the need for neural score approximation, at least in the low-dimensional context; building upon (ii), we propose operator-informed score matching, a variance reduction technique that is straightforward to implement in both low- and high-dimensional diffusion modeling and is demonstrated to improve score matching in an empirical proof-of-concept.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Preprint; 19 pages, 5 figures

arXiv:2405.18373 [pdf, other]

A Hessian-Aware Stochastic Differential Equation for Modelling SGD

Authors: Xiang Li, Zebang Shen, Liang Zhang, Niao He

Abstract: Continuous-time approximation of Stochastic Gradient Descent (SGD) is a crucial tool to study its escaping behaviors from stationary points. However, existing stochastic differential equation (SDE) models fail to fully capture these behaviors, even for simple quadratic objectives. Built on a novel stochastic backward error analysis framework, we derive the Hessian-Aware Stochastic Modified Equation (HA-SME), an SDE that incorporates Hessian information of the objective function into both its drift and diffusion terms. Our analysis shows that HA-SME matches the order-best approximation error guarantee among existing SDE models in the literature, while achieving a significantly reduced dependence on the smoothness parameter of the objective. Further, for quadratic objectives, under mild conditions, HA-SME is proved to be the first SDE model that recovers exactly the SGD dynamics in the distributional sense. Consequently, when the local landscape near a stationary point can be approximated by quadratics, HA-SME is expected to accurately predict the local escaping behaviors of SGD.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.14778 [pdf, ps, other]

Optimal Rates for Vector-Valued Spectral Regularization Learning Algorithms

Authors: Dimitri Meunier, Zikai Shen, Mattes Mollenhauer, Arthur Gretton, Zhu Li

Abstract: We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression, various implementations of gradient descent and many more. Our contributions are twofold. First, we rigorously confirm the so-called saturation effect for ridge regression with vector-valued output by deriving a novel lower bound on learning rates; this bound is shown to be suboptimal when the smoothness of the regression function exceeds a certain level. Second, we present the upper bound for the finite sample risk of general vector-valued spectral algorithms, applicable to both well-specified and misspecified scenarios (where the true regression function lies outside of the hypothesis space), which is minimax optimal in various regimes. All of our results explicitly allow the case of infinite-dimensional output variables, proving consistency of recent practical applications.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.07552 [pdf, other]

Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support Recovery

Authors: Caixing Wang, Ziliang Shen

Abstract: In this paper, we focus on distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression is a popular alternative tool to the least squares regression for robustness against outliers and data heterogeneity. However, the non-smoothness of the check loss function poses big challenges to both computation and theory in the distributed setting. To tackle these problems, we transform the original quantile regression into the least-squares optimization. By applying a double-smoothing approach, we extend a previous Newton-type distributed approach without the restrictive independence assumption between the error term and covariates. An efficient algorithm is developed, which enjoys high computation and communication efficiency. Theoretically, the proposed distributed estimator achieves a near-oracle convergence rate and high support recovery accuracy after a constant number of iterations. Extensive experiments on synthetic examples and a real data application further demonstrate the effectiveness of the proposed method.

Submitted 1 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: Forty-first International Conference on Machine Learning (ICML 2024), 27 pages, 4 figures, 14 tables

arXiv:2405.03329 [pdf, other]

Policy Learning for Balancing Short-Term and Long-Term Rewards

Authors: Peng Wu, Ziyu Shen, Feng Xie, Zhongyao Wang, Chunchen Liu, Yan Zeng

Abstract: Empirical researchers and decision-makers spanning various domains frequently seek profound insights into the long-term impacts of interventions. While the significance of long-term outcomes is undeniable, an overemphasis on them may inadvertently overshadow short-term gains. Motivated by this, this paper formalizes a new framework for learning the optimal policy that effectively balances both long-term and short-term rewards, where some long-term outcomes are allowed to be missing. In particular, we first present the identifiability of both rewards under mild assumptions. Next, we deduce the semiparametric efficiency bounds, along with the consistency and asymptotic normality of their estimators. We also reveal that short-term outcomes, if associated, contribute to improving the estimator of the long-term reward. Based on the proposed estimators, we develop a principled policy learning approach and further derive the convergence rates of regret and estimation errors associated with the learned policy. Extensive experiments are conducted to validate the effectiveness of the proposed method, demonstrating its practical applicability.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2309.12600 [pdf, other]

Multiply Robust Federated Estimation of Targeted Average Treatment Effects

Authors: Larry Han, Zhu Shen, Jose Zubizarreta

Abstract: Federated or multi-site studies have distinct advantages over single-site studies, including increased generalizability, the ability to study underrepresented populations, and the opportunity to study rare exposures and outcomes. However, these studies are challenging due to the need to preserve the privacy of each individual's data and the heterogeneity in their covariate distributions. We propose a novel federated approach to derive valid causal inferences for a target population using multi-site data. We adjust for covariate shift and covariate mismatch between sites by developing multiply-robust and privacy-preserving nuisance function estimation. Our methodology incorporates transfer learning to estimate ensemble weights to combine information from source sites. We show that these learned weights are efficient and optimal under different scenarios. We showcase the finite sample advantages of our approach in terms of efficiency and robustness compared to existing approaches.

Submitted 21 September, 2023; originally announced September 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2309.05505 [pdf, other]

Share Your Representation Only: Guaranteed Improvement of the Privacy-Utility Tradeoff in Federated Learning

Authors: Zebang Shen, Jiayuan Ye, Anmin Kang, Hamed Hassani, Reza Shokri

Abstract: Repeated parameter sharing in federated learning causes significant information leakage about private data, thus defeating its main purpose: data privacy. Mitigating the risk of this information leakage, using state of the art differentially private algorithms, also does not come for free. Randomized mechanisms can prevent convergence of models on learning even the useful representation functions, especially if there is more disagreement between local models on the classification functions (due to data heterogeneity). In this paper, we consider a representation federated learning objective that encourages various parties to collaboratively refine the consensus part of the model, with differential privacy guarantees, while separately allowing sufficient freedom for local personalization (without releasing it). We prove that in the linear representation setting, while the objective is non-convex, our proposed new algorithm \DPFEDREP\ converges to a ball centered around the \emph{global optimal} solution at a linear rate, and the radius of the ball is proportional to the reciprocal of the privacy budget. With this novel utility analysis, we improve the SOTA utility-privacy trade-off for this problem by a factor of $\sqrt{d}$, where $d$ is the input dimension. We empirically evaluate our method with the image classification task on CIFAR10, CIFAR100, and EMNIST, and observe a significant performance improvement over the prior work under the same small privacy budget.
The code can be found in this link: https://github.com/shenzebang/CENTAUR-Privacy-Federated-Representation-Learning.

Submitted 11 September, 2023; originally announced September 2023.

Comments: ICLR 2023 revised

arXiv:2308.08025 [pdf, other]

Potential Energy Advantage of Quantum Economy

Authors: Junyu Liu, Hansheng Jiang, Zuo-Jun Max Shen

Abstract: Energy cost is increasingly crucial in the modern computing industry with the wide deployment of large-scale machine learning models and language models. For the firms that provide computing services, low energy consumption is important both from the perspective of their own market growth and the government's regulations. In this paper, we study the energy benefits of quantum computing vis-a-vis classical computing. Deviating from the conventional notion of quantum advantage based solely on computational complexity, we redefine advantage in an energy efficiency context. Through a Cournot competition model constrained by energy usage, we demonstrate quantum computing firms can outperform classical counterparts in both profitability and energy efficiency at Nash equilibrium. Therefore quantum computing may represent a more sustainable pathway for the computing industry. Moreover, we discover that the energy benefits of quantum computing economies are contingent on large-scale computation. Based on real physical parameters, we further illustrate the scale of operation necessary for realizing this energy efficiency advantage.

Submitted 15 August, 2023; originally announced August 2023.

Comments: 23 pages, many figures

arXiv:2308.06717 [pdf, other]

Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards

Authors: Ilgin Dogan, Zuo-Jun Max Shen, Anil Aswani

Abstract: In practice, incentive providers (i.e., principals) often cannot observe the reward realizations of incentivized agents, which is in contrast to many principal-agent models that have been previously studied. This information asymmetry challenges the principal to consistently estimate the agent's unknown rewards by solely watching the agent's decisions, which becomes even more challenging when the agent has to learn its own rewards. This complex setting is observed in various real-life scenarios ranging from renewable energy storage contracts to personalized healthcare incentives. Hence, it offers not only interesting theoretical questions but also wide practical relevance. This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal. The agent tackles a multi-armed bandit (MAB) problem to maximize their expected reward plus incentive. On top of the agent's learning, the principal trains a parallel algorithm and faces a trade-off between consistently estimating the agent's unknown rewards and maximizing their own utility by offering adaptive incentives to lead the agent. For a non-parametric model, we introduce an estimator whose only input is the history of principal's incentives and agent's choices. We unite this estimator with a proposed data-driven incentive policy within a MAB framework.
Without restricting the type of the agent's algorithm, we prove finite-sample consistency of the estimator and a rigorous regret bound for the principal by considering the sequential externality imposed by the agent. Lastly, our theoretical results are reinforced by simulations justifying applicability of our framework to green energy aggregator contracts.

Submitted 13 August, 2023; originally announced August 2023.

Comments: 72 pages, 6 figures. arXiv admin note: text overlap with arXiv:2304.07407

arXiv:2305.06584 [pdf, other]

Active Learning in the Predict-then-Optimize Framework: A Margin-Based Approach

Authors: Mo Liu, Paul Grigas, Heyuan Liu, Zuo-Jun Max Shen

Abstract: We develop the first active learning method in the predict-then-optimize framework. Specifically, we develop a learning method that sequentially decides whether to request the "labels" of feature samples from an unlabeled data stream, where the labels correspond to the parameters of an optimization model for decision-making. Our active learning method is the first to be directly informed by the decision error induced by the predicted parameters, which is referred to as the Smart Predict-then-Optimize (SPO) loss. Motivated by the structure of the SPO loss, our algorithm adopts a margin-based criterion utilizing the concept of distance to degeneracy and minimizes a tractable surrogate of the SPO loss on the collected data. In particular, we develop an efficient active learning algorithm with both hard and soft rejection variants, each with theoretical excess risk (i.e., generalization) guarantees. We further derive bounds on the label complexity, which refers to the number of samples whose labels are acquired to achieve a desired small level of SPO risk. Under some natural low-noise conditions, we show that these bounds can be better than the naive supervised learning approach that labels all samples. Furthermore, when using the SPO+ loss function, a specialized surrogate of the SPO loss, we derive a significantly smaller label complexity under separability conditions. We also present numerical evidence showing the practical value of our proposed algorithms in the settings of personalized pricing and the shortest path problem.

Submitted 11 May, 2023; originally announced May 2023.

arXiv:2304.07407 [pdf, other]

Repeated Principal-Agent Games with Unobserved Agent Rewards and Perfect-Knowledge Agents

Authors: Ilgin Dogan, Zuo-Jun Max Shen, Anil Aswani

Abstract: Motivated by a number of real-world applications from domains like healthcare and sustainable transportation, in this paper we study a scenario of repeated principal-agent games within a multi-armed bandit (MAB) framework, where: the principal gives a different incentive for each bandit arm, the agent picks a bandit arm to maximize its own expected reward plus incentive, and the principal observes which arm is chosen and receives a reward (different than that of the agent) for the chosen arm. Designing policies for the principal is challenging because the principal cannot directly observe the reward that the agent receives for their chosen actions, and so the principal cannot directly learn the expected reward using existing estimation techniques. As a result, the problem of designing policies for this scenario, as well as similar ones, remains mostly unexplored. In this paper, we construct a policy that achieves a low regret (i.e., square-root regret up to a log factor) in this scenario for the case where the agent has perfect knowledge about its own expected rewards for each bandit arm. We design our policy by first constructing an estimator for the agent's expected reward for each bandit arm. Since our estimator uses as data the sequence of incentives offered and subsequently chosen arms, the principal's estimation can be regarded as an analogy of online inverse optimization in MABs. Next we construct a policy that we prove achieves a low regret by deriving finite-sample concentration bounds for our estimator.
We conclude with numerical simulations demonstrating the applicability of our policy to a real-life setting from collaborative transportation planning.

Submitted 7 May, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: 50 pages, 4 figures

arXiv:2304.01303 [pdf, ps, other]

Improved Bound for Mixing Time of Parallel Tempering

Authors: Holden Lee, Zeyu Shen

Abstract: In the field of sampling algorithms, MCMC (Markov Chain Monte Carlo) methods are widely used when direct sampling is not possible. However, multimodality of target distributions often leads to slow convergence and mixing. One common solution is parallel tempering. Though highly effective in practice, theoretical guarantees on its performance are limited. In this paper, we present a new lower bound for parallel tempering on the spectral gap that has a polynomial dependence on all parameters except $\log L$, where $(L + 1)$ is the number of levels. This improves the best existing bound which depends exponentially on the number of modes. Moreover, we complement our result with a hypothetical upper bound on spectral gap that has an exponential dependence on $\log L$, which shows that, in some sense, our bound is tight.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2212.00992 [pdf, other]

Stable Learning via Sparse Variable Independence

Authors: Han Yu, Peng Cui, Yue He, Zheyan Shen, Yong Lin, Renzhe Xu, Xingxuan Zhang

Abstract: The problem of covariate-shift generalization has attracted intensive research attention. Previous stable learning algorithms employ sample reweighting schemes to decorrelate the covariates when there is no explicit domain information about training data. However, with finite samples, it is difficult to achieve the desirable weights that ensure perfect independence to get rid of the unstable variables. Besides, decorrelating within stable variables may bring about high variance of learned models because of the over-reduced effective sample size. A tremendous sample size is required for these algorithms to work. In this paper, with theoretical justification, we propose SVI (Sparse Variable Independence) for the covariate-shift generalization problem. We introduce sparsity constraint to compensate for the imperfectness of sample reweighting under the finite-sample setting in previous methods. Furthermore, we organically combine independence-based sample reweighting and sparsity-based variable selection in an iterative way to avoid decorrelating within stable variables, increasing the effective sample size to alleviate variance inflation. Experiments on both synthetic and real-world datasets demonstrate the improvement of covariate-shift generalization performance brought by SVI.

Submitted 2 December, 2022; originally announced December 2022.

Comments: Accepted by AAAI 2023

arXiv:2205.09459 [pdf, other]

Neural Network Architecture Beyond Width and Depth

Authors: Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.

Submitted 14 January, 2023; v1 submitted 19 May, 2022; originally announced May 2022.

Journal ref: Advances in Neural Information Processing Systems, 35:5669--5681, 2022

arXiv:2111.07964 [pdf, other]

Deep Network Approximation in Terms of Intrinsic Parameters

Authors: Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: One of the arguments to explain the success of deep learning is the powerful approximation capacity of deep neural networks. Such capacity is generally accompanied by the explosive growth of the number of parameters, which, in turn, leads to high computational costs. It is of great interest to ask whether we can achieve successful deep learning with a small number of learnable parameters adapting to the target function. From an approximation perspective, this paper shows that the number of parameters that need to be learned can be significantly smaller than people typically expect. First, we theoretically design ReLU networks with a few learnable parameters to achieve an attractive approximation. We prove by construction that, for any Lipschitz continuous function $f$ on $[0,1]^d$ with a Lipschitz constant $λ>0$, a ReLU network with $n+2$ intrinsic parameters (those depending on $f$) can approximate $f$ with an exponentially small error $5λ\sqrt{d}\,2^{-n}$. Such a result is generalized to generic continuous functions. Furthermore, we show that the idea of learning a small number of parameters to achieve a good approximation can be numerically observed. We conduct several experiments to verify that training a small part of parameters can also achieve good results for classification problems if other parameters are pre-specified or pre-trained from a related problem.

Submitted 14 June, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:19909-19934, 2022

arXiv:2111.02355 [pdf, other]

A Theoretical Analysis on Independence-driven Importance Weighting for Covariate-shift Generalization

Authors: Renzhe Xu, Xingxuan Zhang, Zheyan Shen, Tong Zhang, Peng Cui

Abstract: Covariate-shift generalization, a typical case in out-of-distribution (OOD) generalization, requires a good performance on the unknown test distribution, which varies from the accessible training distribution in the form of covariate shift. Recently, independence-driven importance weighting algorithms in stable learning literature have shown empirical effectiveness to deal with covariate-shift generalization on several learning models, including regression algorithms and deep neural networks, while their theoretical analyses are missing. In this paper, we theoretically prove the effectiveness of such algorithms by explaining them as feature selection processes. We first specify a set of variables, named minimal stable variable set, that is the minimal and optimal set of variables to deal with covariate-shift generalization for common loss functions, such as the mean squared loss and binary cross-entropy loss. Afterward, we prove that under ideal conditions, independence-driven importance weighting algorithms could identify the variables in this set. Analysis of asymptotic properties is also provided. These theories are further validated in several synthetic experiments.

Submitted 17 October, 2023; v1 submitted 3 November, 2021; originally announced November 2021.

Comments: ICML 2022

arXiv:2110.12351 [pdf, other]

Integrated Conditional Estimation-Optimization

Authors: Meng Qi, Paul Grigas, Zuo-Jun Max Shen

Abstract: Many real-world optimization problems involve uncertain parameters with probability distributions that can be estimated using contextual feature information. In contrast to the standard approach of first estimating the distribution of uncertain parameters and then optimizing the objective based on the estimation, we propose an integrated conditional estimation-optimization (ICEO) framework that estimates the underlying conditional distribution of the random parameter while considering the structure of the optimization problem. We directly model the relationship between the conditional distribution of the random parameter and the contextual features, and then estimate the probabilistic model with an objective that aligns with the downstream optimization problem. We show that our ICEO approach is asymptotically consistent under moderate regularity conditions and further provide finite performance guarantees in the form of generalization bounds. Computationally, performing estimation with the ICEO approach is a non-convex and often non-differentiable optimization problem. We propose a general methodology for approximating the potentially non-differentiable mapping from estimated conditional distribution to the optimal decision by a differentiable function, which greatly improves the performance of gradient-based algorithms applied to the non-convex problem. We also provide a polynomial optimization solution approach in the semi-algebraic case.
Numerical experiments are also conducted to show the empirical success of our approach in different situations including with limited data samples and model mismatches.

Submitted 1 August, 2023; v1 submitted 24 October, 2021; originally announced October 2021.

arXiv:2110.03768 [pdf, other]

De-randomizing MCMC dynamics with the diffusion Stein operator

Authors: Zheyang Shen, Markus Heinonen, Samuel Kaski

Abstract: Approximate Bayesian inference estimates descriptors of an intractable target distribution - in essence, an optimization problem within a family of distributions. For example, Langevin dynamics (LD) extracts asymptotically exact samples from a diffusion process because the time evolution of its marginal distributions constitutes a curve that minimizes the KL-divergence via steepest descent in the Wasserstein space. Parallel to LD, Stein variational gradient descent (SVGD) similarly minimizes the KL, albeit endowed with a novel Stein-Wasserstein distance, by deterministically transporting a set of particle samples, thus de-randomizes the stochastic diffusion process. We propose de-randomized kernel-based particle samplers to all diffusion-based samplers known as MCMC dynamics. Following previous work in interpreting MCMC dynamics, we equip the Stein-Wasserstein space with a fiber-Riemannian Poisson structure, with the capacity of characterizing a fiber-gradient Hamiltonian flow that simulates MCMC dynamics. Such dynamics discretizes into generalized SVGD (GSVGD), a Stein-type deterministic particle sampler, with particle updates coinciding with applying the diffusion Stein operator to a kernel function. We demonstrate empirically that GSVGD can de-randomize complex MCMC dynamics, which combine the advantages of auxiliary momentum variables and Riemannian structure, while maintaining the high sample quality from an interacting particle system.

Submitted 7 October, 2021; originally announced October 2021.

Comments: 22 pages, 6 figures. NeurIPS 2021

arXiv:2107.11732 [pdf, other]

Federated Causal Inference in Heterogeneous Observational Data

Authors: Ruoxuan Xiong, Allison Koenecke, Michael Powell, Zhu Shen, Joshua T. Vogelstein, Susan Athey

Abstract: We are interested in estimating the effect of a treatment applied to individuals at multiple sites, where data is stored locally for each site. Due to privacy constraints, individual-level data cannot be shared across sites; the sites may also have heterogeneous populations and treatment assignment mechanisms. Motivated by these considerations, we develop federated methods to draw inference on the average treatment effects of combined data across sites. Our methods first compute summary statistics locally using propensity scores and then aggregate these statistics across sites to obtain point and variance estimators of average treatment effects. We show that these estimators are consistent and asymptotically normal. To achieve these asymptotic properties, we find that the aggregation schemes need to account for the heterogeneity in treatment assignments and in outcomes across sites. We demonstrate the validity of our federated methods through a comparative study of two large medical claims databases.

Submitted 2 April, 2023; v1 submitted 25 July, 2021; originally announced July 2021.

arXiv:2107.02397 [pdf, other]

Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons

Authors: Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple, computable, and continuous activation function $σ$ leveraging a triangular-wave function and the softsign function. We first prove that $σ$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the continuous function space $C([a,b]^d)$ and therefore dense in the Lebesgue spaces $L^p([a,b]^d)$ for $p\in [1,\infty)$. Furthermore, we show that classification functions arising from image and signal classification are in the hypothesis space generated by $σ$-activated networks with width $36d(2d+1)$ and depth $12$ when there exist pairwise disjoint bounded closed subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset. Finally, we use numerical experimentation to show that replacing the rectified linear unit (ReLU) activation function by ours would improve the experiment results.

Submitted 26 September, 2022; v1 submitted 6 July, 2021; originally announced July 2021.

Journal ref: Journal of Machine Learning Research, Volume 23, Issue 276, September 2022, Pages 1--60

arXiv:2103.00502 [pdf, other]

doi 10.1016/j.matpur.2021.07.009

Optimal Approximation Rate of ReLU Networks in terms of Width and Depth

Authors: Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: This paper concentrates on the approximation power of deep feed-forward neural networks in terms of width and depth. It is proved by construction that ReLU networks with width $\mathcal{O}\big(\max\{d\lfloor N^{1/d}\rfloor,\, N+2\}\big)$ and depth $\mathcal{O}(L)$ can approximate a Hölder continuous function on $[0,1]^d$ with an approximation rate $\mathcal{O}\big(λ\sqrt{d} (N^2L^2\ln N)^{-α/d}\big)$, where $α\in (0,1]$ and $λ>0$ are Hölder order and constant, respectively. Such a rate is optimal up to a constant in terms of width and depth separately, while existing results are only nearly optimal without the logarithmic factor in the approximation rate. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$, the approximation rate becomes $\mathcal{O}\big(\,\sqrt{d}\,ω_f\big( (N^2L^2\ln N)^{-1/d}\big)\,\big)$, where $ω_f(\cdot)$ is the modulus of continuity. We also extend our analysis to any continuous function $f$ on a bounded set. Particularly, if ReLU networks with depth $31$ and width $\mathcal{O}(N)$ are used to approximate one-dimensional Lipschitz continuous functions on $[0,1]$ with a Lipschitz constant $λ>0$, the approximation rate in terms of the total number of parameters, $W=\mathcal{O}(N^2)$, becomes $\mathcal{O}(\tfrac{λ}{W\ln W})$, which has not been discovered in the literature for fixed-depth ReLU networks.

Submitted 24 July, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

Journal ref: Journal de Mathématiques Pures et Appliquées, Volume 157, January 2022, Pages 101-135

arXiv:2012.13892 [pdf, other]

Adaptive Graph-based Generalized Regression Model for Unsupervised Feature Selection

Authors: Yanyong Huang, Zongxin Shen, Fuxu Cai, Tianrui Li, Fengmao Lv

Abstract: Unsupervised feature selection is an important method to reduce dimensions of high dimensional data without labels, which is benefit to avoid ``curse of dimensionality'' and improve the performance of subsequent machine learning tasks, like clustering and retrieval. How to select the uncorrelated and discriminative features is the key problem of unsupervised feature selection. Many proposed methods select features with strong discriminant and high redundancy, or vice versa. However, they only satisfy one of these two criteria. Other existing methods choose the discriminative features with low redundancy by constructing the graph matrix on the original feature space. Since the original feature space usually contains redundancy and noise, it will degrade the performance of feature selection. In order to address these issues, we first present a novel generalized regression model imposed by an uncorrelated constraint and the $\ell_{2,1}$-norm regularization. It can simultaneously select the uncorrelated and discriminative features as well as reduce the variance of these data points belonging to the same neighborhood, which is help for the clustering task. Furthermore, the local intrinsic structure of data is constructed on the reduced dimensional space by learning the similarity-induced graph adaptively. Then the learnings of the graph structure and the indicator matrix based on the spectral analysis are integrated into the generalized regression model. Finally, we develop an alternative iterative optimization algorithm to solve the objective function. A series of experiments are carried out on nine real-world data sets to demonstrate the effectiveness of the proposed method in comparison with other competing approaches.

Submitted 27 December, 2020; originally announced December 2020.

arXiv:2011.04162 [pdf, other]

Sinkhorn Natural Gradient for Generative Models

Authors: Zebang Shen, Zhenfu Wang, Alejandro Ribeiro, Hamed Hassani

Abstract: We consider the problem of minimizing a functional over a parametric family of probability measures, where the parameterization is characterized via a push-forward structure. An important application of this problem is in training generative adversarial networks. In this regard, we propose a novel Sinkhorn Natural Gradient (SiNG) algorithm which acts as a steepest descent method on the probability space endowed with the Sinkhorn divergence. We show that the Sinkhorn information matrix (SIM), a key component of SiNG, has an explicit expression and can be evaluated accurately in complexity that scales logarithmically with respect to the desired accuracy. This is in sharp contrast to existing natural gradient methods that can only be carried out approximately. Moreover, in practical applications when only Monte-Carlo type integration is available, we design an empirical estimator for SIM and provide the stability analysis. In our experiments, we quantitatively compare SiNG with state-of-the-art SGD-type solvers on generative tasks to demonstrate its efficiency and efficacy of our method.

Submitted 8 November, 2020; originally announced November 2020.

Comments: accepted to NeurIPS 2020

arXiv:2010.14075 [pdf, other]

doi 10.1016/j.neunet.2021.04.011

Neural Network Approximation: Three Hidden Layers Are Enough

Authors: Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: A three-hidden-layer neural network with super approximation power is introduced. This network is built with the floor function ($\lfloor x\rfloor$), the exponential function ($2^x$), the step function ($1_{x\geq 0}$), or their compositions as the activation function in each neuron and hence we call such networks as Floor-Exponential-Step (FLES) networks. For any width hyper-parameter $N\in\mathbb{N}^+$, it is shown that FLES networks with width $\max\{d,N\}$ and three hidden layers can uniformly approximate a Hölder continuous function $f$ on $[0,1]^d$ with an exponential approximation rate $3λ(2\sqrt{d})^α 2^{-αN}$, where $α\in(0,1]$ and $λ>0$ are the Hölder order and constant, respectively. More generally for an arbitrary continuous function $f$ on $[0,1]^d$ with a modulus of continuity $ω_f(\cdot)$, the constructive approximation rate is $2ω_f(2\sqrt{d}){2^{-N}}+ω_f(2\sqrt{d}\,2^{-N})$. Moreover, we extend such a result to general bounded continuous functions on a bounded set $E\subseteq\mathbb{R}^d$. As a consequence, this new class of networks overcomes the curse of dimensionality in approximation power when the variation of $ω_f(r)$ as $r\rightarrow 0$ is moderate (e.g., $ω_f(r)\lesssim r^α$ for Hölder continuous functions), since the major term to be concerned in our approximation rate is essentially $\sqrt{d}$ times a function of $N$ independent of $d$ within the modulus of continuity. Finally, we extend our analysis to derive similar approximation results in the $L^p$-norm for $p\in[1,\infty)$ via replacing Floor-Exponential-Step activation functions by continuous activation functions.

Submitted 19 April, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

Journal ref: Neural Networks, Volume 141, September 2021, Pages 160-173

arXiv:2010.01762 [pdf, other]

OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

Authors: Zejiang Shen, Jian Zhao, Melissa Dell, Yaoliang Yu, Weining Li

Abstract: Document images often have intricate layout structures, with numerous content regions (e.g. texts, figures, tables) densely arranged on each page. This makes the manual annotation of layout datasets expensive and inefficient. These characteristics also challenge existing active learning methods, as image-level scoring and selection suffer from the overexposure of common objects. Inspired by recent progresses in semi-supervised learning and self-training, we propose an Object-Level Active Learning framework for efficient document layout Annotation, OLALA. In this framework, only regions with the most ambiguous object predictions within an image are selected for annotators to label, optimizing the use of the annotation budget. For unselected predictions, the semi-automatic correction algorithm is proposed to identify certain errors based on prior knowledge of layout structures and rectifies them with minor supervision. Additionally, we carefully design a perturbation-based object scoring function for document images. It governs the object selection process via evaluating prediction ambiguities, and considers both the positions and categories of predicted layout objects. Extensive experiments show that OLALA can significantly boost model performance and improve annotation efficiency, given the same labeling budget. Code for this paper can be accessed via https://github.com/lolipopshock/detectron2_al.

Submitted 29 March, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

Comments: 12 pages, 7 figures, 5 tables

arXiv:2007.10449 [pdf, other]

Sinkhorn Barycenter via Functional Gradient Descent

Authors: Zebang Shen, Zhenfu Wang, Alejandro Ribeiro, Hamed Hassani

Abstract: In this paper, we consider the problem of computing the barycenter of a set of probability distributions under the Sinkhorn divergence. This problem has recently found applications across various domains, including graphics, learning, and vision, as it provides a meaningful mechanism to aggregate knowledge. Unlike previous approaches which directly operate in the space of probability measures, we recast the Sinkhorn barycenter problem as an instance of unconstrained functional optimization and develop a novel functional gradient descent method named Sinkhorn Descent (SD). We prove that SD converges to a stationary point at a sublinear rate, and under reasonable assumptions, we further show that it asymptotically finds a global minimizer of the Sinkhorn barycenter problem. Moreover, by providing a mean-field analysis, we show that SD preserves the weak convergence of empirical measures. Importantly, the computational complexity of SD scales linearly in the dimension $d$ and we demonstrate its scalability by solving a $100$-dimensional Sinkhorn barycenter problem.

Submitted 20 July, 2020; originally announced July 2020.

Comments: submitted to NIPS 2020

arXiv:2006.13326 [pdf, ps, other]

Safe Learning under Uncertain Objectives and Constraints

Authors: Mohammad Fereydounian, Zebang Shen, Aryan Mokhtari, Amin Karbasi, Hamed Hassani

Abstract: In this paper, we consider non-convex optimization problems under \textit{unknown} yet safety-critical constraints. Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures, where it is infeasible to know or identify all the constraints. Therefore, the parameter space should be explored in a conservative way to ensure that none of the constraints are violated during the optimization process once we start from a safe initialization point. To this end, we develop an algorithm called Reliable Frank-Wolfe (Reliable-FW). Given a general non-convex function and an unknown polytope constraint, Reliable-FW simultaneously learns the landscape of the objective function and the boundary of the safety polytope. More precisely, by assuming that Reliable-FW has access to a (stochastic) gradient oracle of the objective function and a noisy feasibility oracle of the safety polytope, it finds an $ε$-approximate first-order stationary point with the optimal ${\mathcal{O}}({1}/{ε^2})$ gradient oracle complexity (resp. $\tilde{\mathcal{O}}({1}/{ε^3})$ (also optimal) in the stochastic gradient setting), while ensuring the safety of all the iterates. Rather surprisingly, Reliable-FW only makes $\tilde{\mathcal{O}}(({d^2}/{ε^2})\log 1/δ)$ queries to the noisy feasibility oracle (resp. $\tilde{\mathcal{O}}(({d^2}/{ε^4})\log 1/δ)$ in the stochastic gradient setting) where $d$ is the dimension and $δ$ is the reliability parameter, tightening the existing bounds even for safe minimization of convex functions. We further specialize our results to the case that the objective function is convex. A crucial component of our analysis is to introduce and apply a technique called geometric shrinkage in the context of safe optimization.

Submitted 23 June, 2020; originally announced June 2020.

Comments: 42 pages, 2 figures

arXiv:2006.12231 [pdf, other]

doi 10.1162/neco_a_01364

Deep Network with Approximation Error Being Reciprocal of Width to Power of Square Root of Depth

Authors: Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: A new network with super approximation power is introduced. This network is built with Floor ($\lfloor x\rfloor$) or ReLU ($\max\{0,x\}$) activation function in each neuron and hence we call such networks Floor-ReLU networks. For any hyper-parameters $N\in\mathbb{N}^+$ and $L\in\mathbb{N}^+$, it is shown that Floor-ReLU networks with width $\max\{d,\, 5N+13\}$ and depth $64dL+3$ can uniformly approximate a Hölder function $f$ on $[0,1]^d$ with an approximation error $3λd^{α/2}N^{-α\sqrt{L}}$, where $α\in(0,1]$ and $λ$ are the Hölder order and constant, respectively. More generally for an arbitrary continuous function $f$ on $[0,1]^d$ with a modulus of continuity $ω_f(\cdot)$, the constructive approximation rate is $ω_f(\sqrt{d}\,N^{-\sqrt{L}})+2ω_f(\sqrt{d}){N^{-\sqrt{L}}}$. As a consequence, this new class of networks overcomes the curse of dimensionality in approximation power when the variation of $ω_f(r)$ as $r\to 0$ is moderate (e.g., $ω_f(r) \lesssim r^α$ for Hölder continuous functions), since the major term to be considered in our approximation rate is essentially $\sqrt{d}$ times a function of $N$ and $L$ independent of $d$ within the modulus of continuity.

Submitted 26 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Journal ref: Neural Computation, Volume 33, Issue 4, April 2021, Pages 1005-1036

arXiv:2006.10483 [pdf, other]

doi 10.1145/3394486.3403263

Algorithmic Decision Making with Conditional Fairness

Authors: Renzhe Xu, Peng Cui, Kun Kuang, Bo Li, Linjun Zhou, Zheyan Shen, Wei Cui

Abstract: Nowadays fairness issues have raised great concerns in decision-making systems. Various fairness notions have been proposed to measure the degree to which an algorithm is unfair. In practice, there frequently exist a certain set of variables we term as fair variables, which are pre-decision covariates such as users' choices. The effects of fair variables are irrelevant in assessing the fairness of the decision support algorithm. We thus define conditional fairness as a more sound fairness metric by conditioning on the fairness variables. Given different prior knowledge of fair variables, we demonstrate that traditional fairness notations, such as demographic parity and equalized odds, are special cases of our conditional fairness notations. Moreover, we propose a Derivable Conditional Fairness Regularizer (DCFR), which can be integrated into any decision-making model, to track the trade-off between precision and fairness of algorithmic decision making. Specifically, an adversarial representation based conditional independence loss is proposed in our DCFR to measure the degree of unfairness. With extensive experiments on three real-world datasets, we demonstrate the advantages of our conditional fairness notation and DCFR.

Submitted 18 July, 2021; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: KDD 2020

arXiv:2006.05371 [pdf, other]

Bayesian Probabilistic Numerical Integration with Tree-Based Models

Authors: Harrison Zhu, Xing Liu, Ruya Kang, Zhichao Shen, Seth Flaxman, François-Xavier Briol

Abstract: Bayesian quadrature (BQ) is a method for solving numerical integration problems in a Bayesian manner, which allows users to quantify their uncertainty about the solution. The standard approach to BQ is based on a Gaussian process (GP) approximation of the integrand. As a result, BQ is inherently limited to cases where GP approximations can be done in an efficient manner, thus often prohibiting very high-dimensional or non-smooth target functions. This paper proposes to tackle this issue with a new Bayesian numerical integration algorithm based on Bayesian Additive Regression Trees (BART) priors, which we call BART-Int. BART priors are easy to tune and well-suited for discontinuous functions. We demonstrate that they also lend themselves naturally to a sequential design setting and that explicit convergence rates can be obtained in a variety of settings. The advantages and disadvantages of this new methodology are highlighted on a set of benchmark tests including the Genz functions, and on a Bayesian survey design problem.

Submitted 2 December, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

arXiv:2006.04414 [pdf, other]

Stable Adversarial Learning under Distributional Shifts

Authors: Jiashuo Liu, Zheyan Shen, Peng Cui, Linjun Zhou, Kun Kuang, Bo Li, Yishi Lin

Abstract: Machine learning algorithms with empirical risk minimization are vulnerable under distributional shifts due to the greedy adoption of all the correlations found in training data. Recently, there are robust learning methods aiming at this problem by minimizing the worst-case risk over an uncertainty set. However, they equally treat all covariates to form the decision sets regardless of the stability of their correlations with the target, resulting in the overwhelmingly large set and low confidence of the learner. In this paper, we propose Stable Adversarial Learning (SAL) algorithm that leverages heterogeneous data sources to construct a more practical uncertainty set and conduct differentiated robustness optimization, where covariates are differentiated according to the stability of their correlations with the target. We theoretically show that our method is tractable for stochastic gradient-based optimization and provide the performance guarantees for our method. Empirical studies on both simulation and real datasets validate the effectiveness of our method in terms of uniformly good performance across unknown distributional shifts.

Submitted 10 May, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 11 pages

Journal ref: Association for the Advancement of Artificial Intelligence, 2021

arXiv:2005.09856 [pdf, other]

A Novel Meta Learning Framework for Feature Selection using Data Synthesis and Fuzzy Similarity

Authors: Zixiao Shen, Xin Chen, Jonathan M. Garibaldi

Abstract: This paper presents a novel meta learning framework for feature selection (FS) based on fuzzy similarity. The proposed method aims to recommend the best FS method from four candidate FS methods for any given dataset. This is achieved by firstly constructing a large training data repository using data synthesis. Six meta features that represent the characteristics of the training dataset are then extracted. The best FS method for each of the training datasets is used as the meta label. Both the meta features and the corresponding meta labels are subsequently used to train a classification model using a fuzzy similarity measure based framework. Finally the trained model is used to recommend the most suitable FS method for a given unseen dataset. This proposed method was evaluated based on eight public datasets of real-world applications. It successfully recommended the best method for five datasets and the second best method for one dataset, which outperformed any of the four individual FS methods. Besides, the proposed method is computationally efficient for algorithm selection, leading to negligible additional time for the feature selection process. Thus, the paper contributes a novel method for effectively recommending which feature selection method to use for any new given dataset.

Submitted 20 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

arXiv:2005.05003 [pdf, other]

A Novel Weighted Combination Method for Feature Selection using Fuzzy Sets

Authors: Zixiao Shen, Xin Chen, Jonathan M. Garibaldi

Abstract: In this paper, we propose a novel weighted combination feature selection method using bootstrap and fuzzy sets. The proposed method mainly consists of three processes, including fuzzy sets generation using bootstrap, weighted combination of fuzzy sets and feature ranking based on defuzzification. We implemented the proposed method by combining four state-of-the-art feature selection methods and evaluated the performance based on three publicly available biomedical datasets using five-fold cross validation. Based on the feature selection results, our proposed method produced comparable (if not better) classification accuracies to the best of the individual feature selection methods for all evaluated datasets. More importantly, we also applied standard deviation and Pearson's correlation to measure the stability of the methods. Remarkably, our combination method achieved significantly higher stability than the four individual methods when variations and size reductions were introduced to the datasets.

Submitted 21 May, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

arXiv:2005.04888 [pdf, other]

Performance Optimization of a Fuzzy Entropy based Feature Selection and Classification Framework

Authors: Zixiao Shen, Xin Chen, Jonathan M. Garibaldi

Abstract: In this paper, based on a fuzzy entropy feature selection framework, different methods have been implemented and compared to improve the key components of the framework. Those methods include the combinations of three ideal vector calculations, three maximal similarity classifiers and three fuzzy entropy functions. Different feature removal orders based on the fuzzy entropy values were also compared. The proposed method was evaluated on three publicly available biomedical datasets. From the experiments, we concluded the optimized combination of the ideal vector, similarity classifier and fuzzy entropy function for feature selection. The optimized framework was also compared with six other classical filter-based feature selection methods. The proposed method was ranked as one of the top performers together with the Correlation and ReliefF methods. More importantly, the proposed method achieved the most stable performance for all three datasets when the features were gradually removed. This indicates a better feature ranking performance than the other compared methods.

Submitted 21 May, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

arXiv:2004.08620 [pdf, other]

Optimization in Machine Learning: A Distribution Space Approach

Authors: Yongqiang Cai, Qianxiao Li, Zuowei Shen

Abstract: We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to repose such problems via a suitable relaxation as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem, e.g. a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in distribution space. Consistency of this approximation is established and the numerical efficacy of the proposed algorithm is illustrated on simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.

Submitted 18 April, 2020; originally announced April 2020.

Comments: 26 pages, 12 figures

arXiv:2003.09821 [pdf, other]

BS-NAS: Broadening-and-Shrinking One-Shot NAS with Searchable Numbers of Channels

Authors: Zan Shen, Jiang Qian, Bojin Zhuang, Shaojun Wang, Jing Xiao

Abstract: One-Shot methods have evolved into one of the most popular methods in Neural Architecture Search (NAS) due to weight sharing and single training of a supernet. However, existing methods generally suffer from two issues: predetermined number of channels in each layer which is suboptimal; and model averaging effects and poor ranking correlation caused by weight coupling and continuously expanding search space. To explicitly address these issues, in this paper, a Broadening-and-Shrinking One-Shot NAS (BS-NAS) framework is proposed, in which 'broadening' refers to broadening the search space with a spring block enabling search for numbers of channels during training of the supernet; while 'shrinking' refers to a novel shrinking strategy gradually turning off those underperforming operations. The above innovations broaden the search space for wider representation and then shrink it by gradually removing underperforming operations, followed by an evolutionary algorithm to efficiently search for the optimal architecture. Extensive experiments on ImageNet illustrate the effectiveness of the proposed BS-NAS as well as the state-of-the-art performance.

Submitted 22 March, 2020; originally announced March 2020.

Comments: 14 pages

arXiv:2003.03080 [pdf, other]

Sparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations

Authors: Simone Rossi, Markus Heinonen, Edwin V. Bonilla, Zheyang Shen, Maurizio Filippone

Abstract: Variational inference techniques based on inducing variables provide an elegant framework for scalable posterior estimation in Gaussian process (GP) models. Besides enabling scalability, one of their main advantages over sparse approximations using direct marginal likelihood maximization is that they provide a robust alternative for point estimation of the inducing inputs, i.e. the location of the inducing variables. In this work we challenge the common wisdom that optimizing the inducing inputs in the variational framework yields optimal performance. We show that, by revisiting old model approximations such as the fully-independent training conditionals endowed with powerful sampling-based inference methods, treating both inducing locations and GP hyper-parameters in a Bayesian way can improve performance significantly. Based on stochastic gradient Hamiltonian Monte Carlo, we develop a fully Bayesian approach to scalable GP and deep GP models, and demonstrate its state-of-the-art performance through an extensive experimental campaign across several regression and classification problems.

Submitted 23 February, 2021; v1 submitted 6 March, 2020; originally announced March 2020.

arXiv:2001.11359 [pdf, other]

FOCUS: Dealing with Label Quality Disparity in Federated Learning

Authors: Yiqiang Chen, Xiaodong Yang, Xin Qin, Han Yu, Biao Chen, Zhiqi Shen

Abstract: Ubiquitous systems with End-Edge-Cloud architecture are increasingly being used in healthcare applications. Federated Learning (FL) is highly useful for such applications, due to silo effect and privacy preserving. Existing FL approaches generally do not account for disparities in the quality of local data labels. However, the clients in ubiquitous systems tend to suffer from label noise due to varying skill-levels, biases or malicious tampering of the annotators. In this paper, we propose Federated Opportunistic Computing for Ubiquitous Systems (FOCUS) to address this challenge. It maintains a small set of benchmark samples on the FL server and quantifies the credibility of the client local data without directly observing them by computing the mutual cross-entropy between performance of the FL model on the local datasets and that of the client local FL model on the benchmark dataset. Then, a credit weighted orchestration is performed to adjust the weight assigned to clients in the FL model based on their credibility values. FOCUS has been experimentally evaluated on both synthetic data and real-world data. The results show that it effectively identifies clients with noisy labels and reduces their impact on the model performance, thereby significantly outperforming existing FL approaches.

Submitted 29 January, 2020; originally announced January 2020.

Comments: 7 pages

arXiv:2001.03040 [pdf, other]

doi 10.1137/20M134695X

Deep Network Approximation for Smooth Functions

Authors: Jianfeng Lu, Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: This paper establishes the (nearly) optimal approximation error characterization of deep rectified linear unit (ReLU) networks for smooth functions in terms of both width and depth simultaneously. To that end, we first prove that multivariate polynomials can be approximated by deep ReLU networks of width $\mathcal{O}(N)$ and depth $\mathcal{O}(L)$ with an approximation error $\mathcal{O}(N^{-L})$. Through local Taylor expansions and their deep ReLU network approximations, we show that deep ReLU networks of width $\mathcal{O}(N\ln N)$ and depth $\mathcal{O}(L\ln L)$ can approximate $f\in C^s([0,1]^d)$ with a nearly optimal approximation error $\mathcal{O}(\|f\|_{C^s([0,1]^d)}N^{-2s/d}L^{-2s/d})$. Our estimate is non-asymptotic in the sense that it is valid for arbitrary width and depth specified by $N\in\mathbb{N}^+$ and $L\in\mathbb{N}^+$, respectively.

Submitted 24 September, 2021; v1 submitted 9 January, 2020; originally announced January 2020.

Journal ref: SIAM Journal on Mathematical Analysis, Volume 53, Issue 5, September 2021, Pages 5465-5506

arXiv:1912.10382 [pdf, ps, other]

Deep Learning via Dynamical Systems: An Approximation Perspective

Authors: Qianxiao Li, Ting Lin, Zuowei Shen

Abstract: We build on the dynamical systems approach to deep learning, where deep residual networks are idealized as continuous-time dynamical systems, from the approximation perspective. In particular, we establish general sufficient conditions for universal approximation using continuous-time deep residual networks, which can also be understood as approximation theories in $L^p$ using flow maps of dynamical systems. In specific cases, rates of approximation in terms of the time horizon are also established. Overall, these results reveal that composition function approximation through flow maps presents a new paradigm in approximation theory and contributes to building a useful mathematical framework to investigate deep learning.

Submitted 7 June, 2020; v1 submitted 21 December, 2019; originally announced December 2019.

Comments: Revision 1

arXiv:1911.12580 [pdf, other]

Stable Learning via Sample Reweighting

Authors: Zheyan Shen, Peng Cui, Tong Zhang, Kun Kuang

Abstract: We consider the problem of learning linear prediction models with model misspecification bias. In such case, the collinearity among input variables may inflate the error of parameter estimation, resulting in instability of prediction results when training and test distributions do not match. In this paper we theoretically analyze this fundamental problem and propose a sample reweighting method that reduces collinearity among input variables. Our method can be seen as a pretreatment of data to improve the condition of design matrix, and it can then be combined with any standard learning method for parameter estimation and variable selection. Empirical studies on both simulation and real datasets demonstrate the effectiveness of our method in terms of more stable performance across different distributed data.

Submitted 28 November, 2019; originally announced November 2019.

Comments: Accepted as poster paper at AAAI2020

arXiv:1910.14380 [pdf, other]

A Decentralized Proximal Point-type Method for Saddle Point Problems

Authors: Weijie Liu, Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil, Zebang Shen, Nenggan Zheng

Abstract: In this paper, we focus on solving a class of constrained non-convex non-concave saddle point problems in a decentralized manner by a group of nodes in a network. Specifically, we assume that each node has access to a summand of a global objective function and nodes are allowed to exchange information only with their neighboring nodes. We propose a decentralized variant of the proximal point method for solving this problem. We show that when the objective function is $ρ$-weakly convex-weakly concave the iterates converge to approximate stationarity with a rate of $\mathcal{O}(1/\sqrt{T})$ where the approximation error depends linearly on $\sqrt{ρ}$. We further show that when the objective function satisfies the Minty VI condition (which generalizes the convex-concave case) we obtain convergence to stationarity with a rate of $\mathcal{O}(1/\sqrt{T})$. To the best of our knowledge, our proposed method is the first decentralized algorithm with theoretical guarantees for solving a non-convex non-concave decentralized saddle point problem. Our numerical results for training a generative adversarial network (GAN) in a decentralized manner match our theoretical guarantees.

Submitted 31 October, 2019; originally announced October 2019.

Comments: 18 pages

arXiv:1910.09396 [pdf, ps, other]

Efficient Projection-Free Online Methods with Stochastic Recursive Gradient

Authors: Jiahao Xie, Zebang Shen, Chao Zhang, Boyu Wang, Hui Qian

Abstract: This paper focuses on projection-free methods for solving smooth Online Convex Optimization (OCO) problems. Existing projection-free methods either achieve suboptimal regret bounds or have high per-iteration computational costs. To fill this gap, two efficient projection-free online methods called ORGFW and MORGFW are proposed for solving stochastic and adversarial OCO problems, respectively. By employing a recursive gradient estimator, our methods achieve optimal regret bounds (up to a logarithmic factor) while possessing low per-iteration computational costs. Experimental results demonstrate the efficiency of the proposed methods compared to state-of-the-art methods.

Submitted 23 October, 2019; v1 submitted 21 October, 2019; originally announced October 2019.

Comments: 15 pages, 3 figures

arXiv:1910.09223 [pdf, ps, other]

Aggregated Gradient Langevin Dynamics

Authors: Chao Zhang, Jiahao Xie, Zebang Shen, Peilin Zhao, Tengfei Zhou, Hui Qian

Abstract: In this paper, we explore a general Aggregated Gradient Langevin Dynamics framework (AGLD) for the Markov Chain Monte Carlo (MCMC) sampling. We investigate the nonasymptotic convergence of AGLD with a unified analysis for different data accessing (e.g. random access, cyclic access and random reshuffle) and snapshot updating strategies, under convex and nonconvex settings respectively. It is the first time that bounds for I/O friendly strategies such as cyclic access and random reshuffle have been established in the MCMC literature. The theoretic results also indicate that methods in AGLD possess the merits of both the low per-iteration computational complexity and the short mixture time. Empirical studies demonstrate that our framework allows one to derive novel schemes to generate high-quality samples for large-scale Bayesian posterior learning tasks.

Submitted 21 October, 2019; originally announced October 2019.

arXiv:1910.04322 [pdf, other]

One Sample Stochastic Frank-Wolfe

Authors: Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, Amin Karbasi

Abstract: One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its wide-spread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing the efficiency. In this paper, we propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyper parameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/ε^2)$ for reaching an $ε$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-ε$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $ε$-first-order stationary point after at most $\mathcal{O}(1/ε^3)$ iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.

Submitted 9 October, 2019; originally announced October 2019.

arXiv:1905.09917 [pdf, other]

Learning spectrograms with convolutional spectral kernels

Authors: Zheyang Shen, Markus Heinonen, Samuel Kaski

Abstract: We introduce the convolutional spectral kernel (CSK), a novel family of non-stationary, nonparametric covariance kernels for Gaussian process (GP) models, derived from the convolution between two imaginary radial basis functions. We present a principled framework to interpret CSK, as well as other deep probabilistic models, using approximated Fourier transform, yielding a concise representation of input-frequency spectrogram. Observing through the lens of the spectrogram, we provide insight on the interpretability of deep models. We then infer the functional hyperparameters using scalable variational and MCMC methods. On small- and medium-sized spatiotemporal datasets, we demonstrate improved generalization of GP models when equipped with CSK, and their capability to extract non-stationary periodic patterns.

Submitted 14 October, 2019; v1 submitted 23 May, 2019; originally announced May 2019.

Comments: 15 pages, 7 figures

arXiv:1903.01540 [pdf, other]

A Stochastic Trust Region Method for Non-convex Minimization

Authors: Zebang Shen, Pan Zhou, Cong Fang, Alejandro Ribeiro

Abstract: We target the problem of finding a local minimum in non-convex finite-sum minimization. Towards this goal, we first prove that the trust region method with inexact gradient and Hessian estimation can achieve a convergence rate of order $\mathcal{O}(1/{k^{2/3}})$ as long as those differential estimations are sufficiently accurate. Combining such result with a novel Hessian estimator, we propose the sample-efficient stochastic trust region (STR) algorithm which finds an $(ε, \sqrt{ε})$-approximate local minimum within $\mathcal{O}({\sqrt{n}}/{ε^{1.5}})$ stochastic Hessian oracle queries. This improves state-of-the-art result by $\mathcal{O}(n^{1/6})$. Experiments verify theoretical conclusions and the efficiency of STR.

Submitted 4 March, 2019; originally announced March 2019.

arXiv:1902.10170 [pdf, other]

doi 10.1016/j.neunet.2019.07.011

Nonlinear Approximation via Compositions

Authors: Zuowei Shen, Haizhao Yang, Shijun Zhang

Abstract: Given a function dictionary $\cal D$ and an approximation budget $N\in\mathbb{N}^+$, nonlinear approximation seeks the linear combination of the best $N$ terms $\{T_n\}_{1\le n\le N}\subseteq{\cal D}$ to approximate a given function $f$ with the minimum approximation error \[\varepsilon_{L,f}:=\min_{\{g_n\}\subseteq{\mathbb{R}},\{T_n\}\subseteq{\cal D}}\|f(x)-\sum_{n=1}^N g_n T_n(x)\|.\] Motivated by recent success of deep learning, we propose dictionaries with functions in a form of compositions, i.e., \[T(x)=T^{(L)}\circ T^{(L-1)}\circ\cdots\circ T^{(1)}(x)\] for all $T\in\cal D$, and implement $T$ using ReLU feed-forward neural networks (FNNs) with $L$ hidden layers. We further quantify the improvement of the best $N$-term approximation rate in terms of $N$ when $L$ is increased from $1$ to $2$ or $3$ to show the power of compositions. In the case when $L>3$, our analysis shows that increasing $L$ cannot improve the approximation rate in terms of $N$. In particular, for any function $f$ on $[0,1]$, regardless of its smoothness and even the continuity, if $f$ can be approximated using a dictionary when $L=1$ with the best $N$-term approximation rate $\varepsilon_{L,f}={\cal O}(N^{-η})$, we show that dictionaries with $L=2$ can improve the best $N$-term approximation rate to $\varepsilon_{L,f}={\cal O}(N^{-2η})$. We also show that for Hölder continuous functions of order $α$ on $[0,1]^d$, the application of a dictionary with $L=3$ in nonlinear approximation can achieve an essentially tight best $N$-term approximation rate $\varepsilon_{L,f}={\cal O}(N^{-2α/d})$. Finally, we show that dictionaries consisting of wide FNNs with a few hidden layers are more attractive in terms of computational efficiency than dictionaries with narrow and very deep FNNs for approximating Hölder continuous functions if the number of computer cores is larger than $N$ in parallel computing.

Submitted 5 November, 2020; v1 submitted 26 February, 2019; originally announced February 2019.

Journal ref: Neural Networks, Volume 119, November 2019, Pages 74-84

arXiv:1812.03643 [pdf, other]

doi 10.1038/s41467-019-11401-8

Increasing trend of scientists to switch between topics

Authors: An Zeng, Zhesi Shen, Jianlin Zhou, Ying Fan, Zengru Di, Yougui Wang, H. Eugene Stanley, Shlomo Havlin

Abstract: We analyze the publication records of individual scientists, aiming to quantify the topic switching dynamics of scientists and its influence. For each scientist, the relations among her publications are characterized via shared references. We find that the co-citing network of the papers of a scientist exhibits a clear community structure where each major community represents a research topic. Our analysis suggests that scientists tend to have a narrow distribution of the number of topics. However, researchers nowadays switch more frequently between topics than those in the early days. We also find that high switching probability in early career (<12y) is associated with low overall productivity, while it is correlated with high overall productivity in later career. Interestingly, the average citation per paper, however, is in all career stages negatively correlated with the switching probability. We propose a model with exploitation and exploration mechanisms that can explain the main observed features.

Submitted 10 December, 2018; originally announced December 2018.

Comments: 37 pages, 21 figures

arXiv:1810.13306 [pdf, other]

Automated Machine Learning: From Principles to Practices

Authors: Zhenqian Shen, Yongqi Zhang, Lanning Wei, Huan Zhao, Quanming Yao

Abstract: Machine learning (ML) methods have been developing rapidly, but configuring and selecting proper methods to achieve a desired performance is increasingly difficult and tedious. To address this challenge, automated machine learning (AutoML) has emerged, which aims to generate satisfactory ML configurations for given tasks in a data-driven way. In this paper, we provide a comprehensive survey on this topic. We begin with the formal definition of AutoML and then introduce its principles, including the bi-level learning objective, the learning strategy, and the theoretical interpretation. Then, we summarize the AutoML practices by setting up the taxonomy of existing works based on three main factors: the search space, the search algorithm, and the evaluation strategy. Each category is also explained with the representative methods. Then, we illustrate the principles and practices with exemplary applications from configuring ML pipeline, one-shot neural architecture search, and integration with foundation models. Finally, we highlight the emerging directions of AutoML and conclude the survey.

Submitted 27 February, 2024; v1 submitted 31 October, 2018; originally announced October 2018.

Comments: This is a preliminary and will be kept updated

Showing 1–50 of 59 results for author: Shen, Z