Search | arXiv e-print repository

Agnostic Learning of General ReLU Activation Using Gradient Descent

Authors: Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan

Abstract: We provide a convergence analysis of gradient descent for the problem of agnostically learning a single ReLU function under Gaussian distributions. Unlike prior work that studies the setting of zero bias, we consider the more challenging scenario when the bias of the ReLU function is non-zero. Our main result establishes that starting from random initialization, in a polynomial number of iterations gradient descent outputs, with high probability, a ReLU function that achieves a competitive error guarantee when compared to the error of the best ReLU function. We also provide finite sample guarantees, and these techniques generalize to a broader class of marginal distributions beyond Gaussians.

Submitted 4 August, 2022; originally announced August 2022.

Comments: 28 pages

arXiv:2107.10209 [pdf, ps, other]

Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations

Authors: Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan

Abstract: We present polynomial time and sample efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. In particular, we consider learning an unknown network of the form $f(x) = {a}^{\mathsf{T}}σ({W}^\mathsf{T}x+b)$, where $x$ is drawn from the Gaussian distribution, and $σ(t) := \max(t,0)$ is the ReLU activation. Prior works for learning networks with ReLU activations assume that the bias $b$ is zero. In order to deal with the presence of the bias terms, our proposed algorithm consists of robustly decomposing multiple higher order tensors arising from the Hermite expansion of the function $f(x)$. Using these ideas we also establish identifiability of the network parameters under minimal assumptions.

Submitted 1 August, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

Comments: 45 pages (including appendix). This version fixes an error in the previous version of the paper

arXiv:1811.01316 [pdf, other]

Nonlinear Collaborative Scheme for Deep Neural Networks

Authors: Hui-Ling Zhen, Xi Lin, Alan Z. Tang, Zhenhua Li, Qingfu Zhang, Sam Kwong

Abstract: Conventional research attributes improvements in the generalization ability of deep neural networks either to powerful optimizers or to new network designs. In contrast, in this paper we aim to link the generalization ability of a deep network to optimizing a new objective function. To this end, we propose a nonlinear collaborative scheme for deep network training, whose key technique is combining different loss functions in a nonlinear manner. We find that after adaptively tuning the weights of different loss functions, the proposed objective function can efficiently guide the optimization process. Moreover, we demonstrate that, from a mathematical perspective, the nonlinear collaborative scheme can lead to (i) smaller KL divergence with respect to optimal solutions; (ii) data-driven stochastic gradient descent; (iii) a tighter PAC-Bayes bound. We also prove that its advantage can be strengthened by increasing the nonlinearity. To some extent, we bridge the gap between learning (i.e., minimizing the new objective function) and generalization (i.e., minimizing a PAC-Bayes bound) in the new scheme. We also interpret our findings through experiments on Residual Networks and DenseNet, showing that our new scheme outperforms single-loss and multi-loss schemes, with or without randomization.

Submitted 3 November, 2018; originally announced November 2018.

Comments: 11 pages, 3 figures (20 subfigures), prepared to submit to IEEE Trans. on Neural Networks and Learning Systems

arXiv:1409.0391 [pdf, ps, other]

Estimating Linear Mixed-effects State Space Model Based on Disturbance Smoothing

Authors: Jie Zhou, Ai** Tang

Abstract: We extend the linear mixed-effects state model to accommodate correlated individuals and investigate its parameter and state estimation based on disturbance smoothing. For parameter estimation, EM and score based algorithms are considered. The intermediate quantity of the EM algorithm is investigated first, from which explicit recursive formulas for the maximizer of the intermediate quantity are derived for two given models. For the score based algorithms, explicit formulas for the score vector are obtained, from which it is shown that maximum likelihood estimation is equivalent to moment estimation. For state estimation, we advocate that it should be carried out without assuming the random effects are known in advance, especially when the longitudinal observations are sparse. To this end, an algorithm named mixture Kalman filter with kernel smoothing (MKF-KS) is proposed. Numerical studies are carried out to investigate the proposed algorithms, and they validate the efficacy of the proposed inference approaches.

Submitted 2 September, 2014; v1 submitted 1 September, 2014; originally announced September 2014.

Comments: 27 pages, 2 figures

arXiv:1401.3518 [pdf, other]

doi 10.1103/PhysRevE.93.052301

Percolation under Noise: Detecting Explosive Percolation Using the Second Largest Component

Authors: Wes Viles, Cedric E. Ginestet, Ariana Tang, Mark A. Kramer, Eric D. Kolaczyk

Abstract: We consider the problem of distinguishing classical (Erdős-Rényi) percolation from explosive (Achlioptas) percolation, under noise. A statistical model of percolation is constructed allowing for the birth and death of edges as well as the presence of noise in the observations. This graph-valued stochastic process is composed of a latent and an observed non-stationary process, where the observed graph process is corrupted by Type I and Type II errors. This produces a hidden Markov graph model. We show that for certain choices of parameters controlling the noise, the classical (ER) percolation is visually indistinguishable from the explosive (Achlioptas) percolation model. In this setting, we compare two different criteria for discriminating between these two percolation models, based on a quantile difference (QD) of the first component's size and on the maximal size of the second largest component. We show through data simulations that this second criterion outperforms the QD of the first component's size, in terms of discriminatory power. The maximal size of the second component therefore provides a useful statistic for distinguishing between the ER and Achlioptas models of percolation, under physically motivated conditions for the birth and death of edges, and under noise. The potential application of the proposed criteria for percolation detection in clinical neuroscience is also discussed.

Submitted 15 January, 2014; originally announced January 2014.

Comments: 9 pages and 8 figures. Submitted to Physics Review, Series E

Journal ref: Phys. Rev. E 93, 052301 (2016)

arXiv:1006.1673 [pdf, ps, other]

doi 10.1109/JSAC.2011.110406

Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret

Authors: Animashree Anandkumar, Nithin Michael, Ao Kevin Tang, Ananthram Swami

Abstract: The problem of distributed learning and channel access is considered in a cognitive network with multiple secondary users. The availability statistics of the channels are initially unknown to the secondary users and are estimated using sensing decisions. There is no explicit information exchange or prior agreement among the secondary users. We propose policies for distributed learning and access which achieve order-optimal cognitive system throughput (number of successful secondary transmissions) under self play, i.e., when implemented at all the secondary users. Equivalently, our policies minimize the regret in distributed learning and access. We first consider the scenario when the number of secondary users is known to the policy, and prove that the total regret is logarithmic in the number of transmission slots. Our distributed learning and access policy achieves order-optimal regret by comparing to an asymptotic lower bound for regret under any uniformly-good learning and access policy. We then consider the case when the number of secondary users is fixed but unknown, and is estimated through feedback. We propose a policy in this scenario whose asymptotic sum regret grows slightly faster than logarithmic in the number of transmission slots.

Submitted 8 June, 2010; originally announced June 2010.

Comments: Submitted to IEEE JSAC on Advances in Cognitive Radio Networking and Communications, Dec. 2009, Revised May 2010

Showing 1–6 of 6 results for author: Tang, A