Showing 1–14 of 14 results for author: Moroshko, E

  1. arXiv:2306.03534  [pdf, other]

    cs.LG math.NA

    Continual Learning in Linear Classification on Separable Data

    Authors: Itay Evron, Edward Moroshko, Gon Buzaglo, Maroun Khriesh, Badea Marjieh, Nathan Srebro, Daniel Soudry

    Abstract: We analyze continual learning on a sequence of separable linear classification tasks with binary labels. We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a special case of the Projection Onto Convex Sets (POCS) framework. We then develop upper bounds on the forgetting and other quantities of interest under various set…

    Submitted 6 June, 2023; originally announced June 2023.

  2. arXiv:2205.09588  [pdf, other]

    cs.LG math.NA

    How catastrophic can catastrophic forgetting be in linear regression?

    Authors: Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry

    Abstract: To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research…

    Submitted 25 May, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Journal ref: 35th Annual Conference on Learning Theory (2022)

  3. arXiv:2102.09769  [pdf, other]

    cs.LG

    On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

    Authors: Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake Woodworth, Nathan Srebro, Amir Globerson, Daniel Soudry

    Abstract: Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so-called "rich regimes". However, the initialization structure is richer than the overall scale alone and involve…

    Submitted 19 February, 2021; originally announced February 2021.

    Comments: 33 pages, 2 figures

    MSC Class: 68T07 (Primary) ACM Class: I.2.6; G.1.6

  4. arXiv:2007.06738  [pdf, other]

    cs.LG stat.ML

    Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

    Authors: Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry

    Abstract: We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accuratel…

    Submitted 13 July, 2020; originally announced July 2020.

  5. arXiv:2002.09277  [pdf, other]

    cs.LG stat.ML

    Kernel and Rich Regimes in Overparametrized Models

    Authors: Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro

    Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich…

    Submitted 27 July, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: This updates and significantly extends a previous article (arXiv:1906.05827); Sections 6 and 7 are the major additions. 31 pages. arXiv admin note: text overlap with arXiv:1906.05827

  6. arXiv:1906.05827   

    cs.LG stat.ML

    Kernel and Rich Regimes in Overparametrized Models

    Authors: Blake Woodworth, Suriya Gunasekar, Pedro Savarese, Edward Moroshko, Itay Golan, Jason Lee, Daniel Soudry, Nathan Srebro

    Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich…

    Submitted 25 February, 2020; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: This paper has been substantially modified, updated, and expanded with additional content (arXiv:2002.09277). To avoid confusion with existing citations, we are withdrawing the old version of this article

  7. arXiv:1906.05591  [pdf, other]

    cs.LG eess.SY stat.ML

    Finite Sample Analysis Of Dynamic Regression Parameter Learning

    Authors: Mark Kozdoba, Edward Moroshko, Shie Mannor, Koby Crammer

    Abstract: We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with non-constant observation operator, where the parameters that need to be learned are the variance of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of…

    Submitted 11 October, 2022; v1 submitted 13 June, 2019; originally announced June 2019.

    Journal ref: NeurIPS 2022

  8. arXiv:1902.10360  [pdf, ps, other]

    cs.CL cs.IR

    An Editorial Network for Enhanced Document Summarization

    Authors: Edward Moroshko, Guy Feigenblat, Haggai Roitman, David Konopnicki

    Abstract: We propose the Editorial Network, a mixed extractive-abstractive summarization approach that is applied as a post-processing step over a given sequence of extracted sentences. Our network tries to imitate the decision process of a human editor during summarization. Within such a process, each extracted sentence may be either kept untouched, rephrased or completely rejected. We further…

    Submitted 27 February, 2019; originally announced February 2019.

  9. arXiv:1812.07010  [pdf, other]

    cs.LG cs.CV stat.ML

    Multi Instance Learning For Unbalanced Data

    Authors: Mark Kozdoba, Edward Moroshko, Lior Shani, Takuya Takagi, Takashi Katoh, Shie Mannor, Koby Crammer

    Abstract: In the context of Multi Instance Learning, we analyze the Single Instance (SI) learning objective. We show that when the data is unbalanced and the family of classifiers is sufficiently rich, the SI method is a useful learning algorithm. In particular, we show that larger data imbalance, a quality that is typically perceived as negative, in fact implies a better resilience of the algorithm to the…

    Submitted 17 December, 2018; originally announced December 2018.

  10. arXiv:1803.03319  [pdf, other]

    cs.LG stat.ML

    Efficient Loss-Based Decoding on Graphs For Extreme Classification

    Authors: Itay Evron, Edward Moroshko, Koby Crammer

    Abstract: In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set. We build on a recent extreme classification framework with logarithmic time and space, and on a general approach for error correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds. Our f…

    Submitted 8 November, 2018; v1 submitted 8 March, 2018; originally announced March 2018.

    Journal ref: Advances in Neural Information Processing Systems 31 (2018), 7232-7243

  11. arXiv:1402.4084  [pdf, other]

    cs.LG

    Selective Sampling with Drift

    Authors: Edward Moroshko, Koby Crammer

    Abstract: Recently there has been much work on selective sampling, an online active learning setting, in which algorithms work in rounds. On each round an algorithm receives an input and makes a prediction. Then, it can decide whether to query a label, and if so to update its model, otherwise the input is discarded. Most of this work is focused on the stationary case, where it is assumed that there is a fix…

    Submitted 17 February, 2014; originally announced February 2014.

  12. arXiv:1303.3754  [pdf, other]

    cs.LG

    A Last-Step Regression Algorithm for Non-Stationary Online Learning

    Authors: Edward Moroshko, Koby Crammer

    Abstract: The goal of a learner in standard online learning is to maintain an average loss close to the loss of the best-performing single function in some class. In many real-world problems, such as rating or ranking items, there is no single best target function during the runtime of the algorithm; instead, the best (local) target function is drifting over time. We develop a novel last-step minmax optimal…

    Submitted 15 March, 2013; originally announced March 2013.

    Comments: arXiv admin note: substantial text overlap with arXiv:1303.0140

  13. arXiv:1303.0140  [pdf, other]

    cs.LG stat.ML

    Second-Order Non-Stationary Online Learning for Regression

    Authors: Nina Vaits, Edward Moroshko, Koby Crammer

    Abstract: The goal of a learner, in standard online learning, is to have the cumulative loss not much larger compared with the best-performing function from some fixed class. Numerous algorithms were shown to have this gap arbitrarily close to zero, compared with the best function that is chosen off-line. Nevertheless, many real-world applications, such as adaptive filtering, are non-stationary in nature, a…

    Submitted 1 March, 2013; originally announced March 2013.

  14. arXiv:1301.6058  [pdf, other]

    cs.LG

    Weighted Last-Step Min-Max Algorithm with Improved Sub-Logarithmic Regret

    Authors: Edward Moroshko, Koby Crammer

    Abstract: In online learning the performance of an algorithm is typically compared to the performance of a fixed function from some class, with a quantity called regret. Forster proposed a last-step min-max algorithm which was somewhat simpler than the algorithm of Vovk, yet with the same regret. In fact the algorithm he analyzed assumed that the choices of the adversary are bounded, yielding artificially o…

    Submitted 25 January, 2013; originally announced January 2013.