Showing 1–25 of 25 results for author: Malach, E

Searching in archive cs.
  1. arXiv:2406.17748

    cs.LG math.OC stat.ML

    A New Perspective on Shampoo's Preconditioner

    Authors: Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

    Abstract: Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss–Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connec…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.11741

    cs.LG cs.AI

    Transcendence: Generative Models Can Outperform The Experts That Train Them

    Authors: Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach

    Abstract: Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities…

    Submitted 28 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Code, models, and data at https://transcendence.eddie.win

  3. arXiv:2402.11004

    cs.LG

    The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

    Authors: Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis

    Abstract: Large language models have the ability to generate text that mimics patterns in their inputs. We introduce a simple Markov Chain sequence modeling task in order to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain drawn from a prior distribution over Markov chains. Transformers trained on this task form \emph{statistical induct…

    Submitted 16 February, 2024; originally announced February 2024.

  4. arXiv:2402.01032

    cs.LG cs.AI cs.CL

    Repeat After Me: Transformers are Better than State Space Models at Copying

    Authors: Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

    Abstract: Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks th…

    Submitted 3 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  5. arXiv:2309.06979

    cs.LG cs.CL

    Auto-Regressive Next-Token Predictors are Universal Learners

    Authors: Eran Malach

    Abstract: Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token…

    Submitted 13 September, 2023; originally announced September 2023.

  6. arXiv:2309.03800

    cs.LG cs.AI stat.ML

    Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck

    Authors: Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

    Abstract: In modern deep learning, algorithmic choices (such as width, depth, and learning rate) are known to modulate nuanced resource tradeoffs. This work investigates how these complexities necessarily arise for feature learning in the presence of computational-statistical gaps. We begin by considering offline sparse parity learning, a supervised classification problem which admits a statistical query lo…

    Submitted 30 October, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: v2: NeurIPS 2023 camera-ready updates

  7. arXiv:2309.01640

    cs.LG cs.AI

    Corgi^2: A Hybrid Offline-Online Approach To Storage-Aware Data Shuffling For SGD

    Authors: Etay Livne, Gal Kaplun, Eran Malach, Shai Shalev-Shwartz

    Abstract: When using Stochastic Gradient Descent (SGD) for training machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient. A recent work \cite{corgi}, proposed an online shuffling algorithm called CorgiPile, which greatly improves…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: 19 pages, 5 figures

  8. arXiv:2302.06354

    cs.LG cs.AI

    Less is More: Selective Layer Finetuning with SubTuning

    Authors: Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai Shalev-Shwartz, Eran Malach

    Abstract: Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance. In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) v…

    Submitted 2 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

  9. arXiv:2207.08799

    cs.LG cs.NE math.OC stat.ML

    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

    Authors: Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

    Abstract: There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-spars…

    Submitted 15 January, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

    Comments: v3: final camera-ready revisions for NeurIPS 2022

  10. arXiv:2203.14649

    cs.LG cs.AI stat.ML

    Knowledge Distillation: Bad Models Can Be Good Role Models

    Authors: Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

    Abstract: Large neural networks trained in the overparameterized regime are able to fit noise to zero train error. Recent work \citep{nakkiran2020distributional} has empirically observed that such networks behave as "conditional samplers" from the noisy distribution. That is, they replicate the noise in the train data to unseen examples. We give a theoretical framework for studying this conditional sampling…

    Submitted 28 March, 2022; originally announced March 2022.

  11. arXiv:2108.04190

    cs.LG stat.ML

    On the Power of Differentiable Learning versus PAC and SQ Learning

    Authors: Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro

    Abstract: We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends…

    Submitted 5 February, 2022; v1 submitted 9 August, 2021; originally announced August 2021.

  12. arXiv:2103.01210

    cs.LG stat.ML

    Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

    Authors: Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro

    Abstract: We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small ev…

    Submitted 1 March, 2021; originally announced March 2021.

  13. arXiv:2102.00434

    cs.LG cs.NE stat.ML

    The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

    Authors: Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

    Abstract: Several recent works have shown separation results between deep neural networks, and hypothesis classes with inferior approximation capacity such as shallow networks or kernel classes. On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks. In this work we study the intricat…

    Submitted 18 July, 2021; v1 submitted 31 January, 2021; originally announced February 2021.

    Comments: COLT 2021 camera ready version

  14. arXiv:2010.01369

    cs.LG stat.ML

    Computational Separation Between Convolutional and Fully-Connected Networks

    Authors: Eran Malach, Shai Shalev-Shwartz

    Abstract: Convolutional neural networks (CNN) exhibit unmatched performance in a multitude of computer vision tasks. However, the advantage of using convolutional networks over fully-connected networks is not understood from a theoretical perspective. In this work, we show how convolutional networks can leverage locality in the data, and thus achieve a computational advantage over fully-connected networks.…

    Submitted 3 October, 2020; originally announced October 2020.

  15. arXiv:2008.08059

    cs.LG stat.ML

    When Hardness of Approximation Meets Hardness of Learning

    Authors: Eran Malach, Shai Shalev-Shwartz

    Abstract: A supervised learning algorithm has access to a distribution of labeled examples, and needs to return a function (hypothesis) that correctly labels the examples. The hypothesis of the learner is taken from some fixed class of functions (e.g., linear classifiers, neural networks etc.). A failure of the learning algorithm can occur due to two possible reasons: wrong choice of hypothesis class (hardn…

    Submitted 23 August, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

  16. arXiv:2002.07400

    cs.LG stat.ML

    Learning Parities with Neural Networks

    Authors: Amit Daniely, Eran Malach

    Abstract: In recent years we see a rapidly growing line of research which shows learnability of various models via common neural network algorithms. Yet, besides a very few outliers, these results show learnability of models that can be learned using linear methods. Namely, such results show that learning neural-networks with gradient-descent is competitive with learning a linear classifier on top of a data…

    Submitted 3 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

  17. arXiv:2002.00585

    cs.LG stat.ML

    Proving the Lottery Ticket Hypothesis: Pruning is All You Need

    Authors: Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

    Abstract: The lottery ticket hypothesis (Frankle and Carbin, 2018), states that a randomly-initialized network contains a small subnetwork such that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded wei…

    Submitted 3 February, 2020; originally announced February 2020.

  18. arXiv:1910.11923

    cs.LG stat.ML

    Learning Boolean Circuits with Neural Networks

    Authors: Eran Malach, Shai Shalev-Shwartz

    Abstract: While on some natural distributions, neural-networks are trained efficiently using gradient-based algorithms, it is known that learning them is computationally hard in the worst-case. To separate hard from easy to learn distributions, we observe the property of local correlation: correlation between local patterns of the input and the target label. We focus on learning deep neural-networks using a…

    Submitted 18 January, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

  19. arXiv:1907.05444

    cs.LG stat.ML

    On the Optimality of Trees Generated by ID3

    Authors: Alon Brutzkus, Amit Daniely, Eran Malach

    Abstract: Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we introduce a novel metric of a decision tree algorithm's performance, called mean iteration statistical consistency (MIC), which measures optimality of trees generated by ID3. As opposed to…

    Submitted 23 February, 2020; v1 submitted 11 July, 2019; originally announced July 2019.

  20. arXiv:1906.08654

    cs.LG stat.ML

    ID3 Learns Juntas for Smoothed Product Distributions

    Authors: Alon Brutzkus, Amit Daniely, Eran Malach

    Abstract: In recent years, there are many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of…

    Submitted 20 June, 2019; originally announced June 2019.

  21. arXiv:1906.05032

    cs.LG stat.ML

    Decoupling Gating from Linearity

    Authors: Jonathan Fiat, Eran Malach, Shai Shalev-Shwartz

    Abstract: ReLU neural-networks have been in the focus of many recent theoretical works, trying to explain their empirical success. Nonetheless, there is still a gap between current theoretical results and empirical observations, even in the case of shallow (one hidden-layer) networks. For example, in the task of memorizing a random sample of size $m$ and dimension $d$, the best theoretical result requires t…

    Submitted 12 June, 2019; originally announced June 2019.

  22. arXiv:1903.03488

    cs.LG stat.ML

    Is Deeper Better only when Shallow is Good?

    Authors: Eran Malach, Shai Shalev-Shwartz

    Abstract: Understanding the power of depth in feed-forward neural networks is an ongoing challenge in the field of deep learning theory. While current works account for the importance of depth for the expressive power of neural-networks, it remains an open question whether these benefits are exploited during a gradient-based optimization process. In this work we explore the relation between expressivity pro…

    Submitted 8 March, 2019; originally announced March 2019.

  23. arXiv:1803.09522

    cs.LG stat.ML

    A Provably Correct Algorithm for Deep Learning that Actually Works

    Authors: Eran Malach, Shai Shalev-Shwartz

    Abstract: We describe a layer-by-layer algorithm for training deep convolutional networks, where each step involves gradient updates for a two layer network followed by a simple clustering algorithm. Our algorithm stems from a deep generative model that generates images level by level, where lower resolution images correspond to latent semantic classes. We analyze the convergence rate of our algorithm assumi…

    Submitted 24 June, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

  24. arXiv:1710.10174

    cs.LG

    SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

    Authors: Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz

    Abstract: Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations. Nonetheless, current generalization bounds for neural networks fail to explain this phenomenon. In an attempt to bridge this gap, we study the problem of learning a two-layer over-parameterized neural network, when the data is generated b…

    Submitted 27 October, 2017; originally announced October 2017.

  25. arXiv:1706.02613

    cs.LG

    Decoupling "when to update" from "how to update"

    Authors: Eran Malach, Shai Shalev-Shwartz

    Abstract: Deep learning requires data. A useful approach to obtain data is to be creative and mine data from various sources, that were created for different purposes. Unfortunately, this approach often leads to noisy labels. In this paper, we propose a meta algorithm for tackling the noisy labels problem. The key idea is to decouple "when to update" from "how to update". We demonstrate the effectiveness of…

    Submitted 26 March, 2018; v1 submitted 8 June, 2017; originally announced June 2017.