-
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Authors:
Zachary Ankner,
Cody Blakeney,
Kartik Sreenivasan,
Max Marion,
Matthew L. Leavitt,
Mansheej Paul
Abstract:
In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a $1.45\times$ reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
Submitted 30 May, 2024;
originally announced May 2024.
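A minimal sketch of what perplexity-based pruning could look like, assuming each document already comes with per-token log-probabilities from a small reference model. The function names, the `select` options, and the input layout are hypothetical; the abstract does not state which part of the perplexity distribution is kept, so the selection rule is left as a parameter.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one document: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def prune_by_perplexity(docs, keep_fraction, select="low"):
    """Keep a fraction of documents ranked by reference-model perplexity.

    docs: list of (doc_id, token_logprobs) pairs (hypothetical layout).
    select: which part of the ranking to keep: "low", "medium", or "high".
            This choice is an assumption of the sketch, not taken from the abstract.
    """
    ranked = sorted(docs, key=lambda d: perplexity(d[1]))
    k = max(1, int(round(keep_fraction * len(ranked))))
    if select == "low":
        kept = ranked[:k]
    elif select == "high":
        kept = ranked[-k:]
    elif select == "medium":
        start = (len(ranked) - k) // 2
        kept = ranked[start:start + k]
    else:
        raise ValueError(f"unknown selection rule: {select}")
    return [doc_id for doc_id, _ in kept]

# Toy usage with made-up log-probabilities.
toy_docs = [("a", [-1.0, -1.0]), ("b", [-2.0, -2.0]), ("c", [-3.0]), ("d", [-0.5])]
kept_ids = prune_by_perplexity(toy_docs, keep_fraction=0.5, select="medium")
```

The larger model would then be pretrained only on the kept documents; the reference model is used solely for scoring.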
-
Mini-Batch Optimization of Contrastive Loss
Authors:
Jaewoong Cho,
Kartik Sreenivasan,
Keon Lee,
Kyunghoo Mun,
Soheun Yi,
Jeong-Gwan Lee,
Anna Lee,
Jy-yong Sohn,
Dimitris Papailiopoulos,
Kangwook Lee
Abstract:
Contrastive learning has gained significant attention as a method for self-supervised learning. The contrastive loss function ensures that embeddings of positive sample pairs (e.g., different samples from the same class or different views of the same object) are similar, while embeddings of negative pairs are dissimilar. Practical constraints such as large memory requirements make it challenging to consider all possible positive and negative pairs, leading to the use of mini-batch optimization. In this paper, we investigate the theoretical aspects of mini-batch optimization in contrastive learning. We show that mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ mini-batches are selected, while sub-optimality may arise when examining only a subset. We then demonstrate that utilizing high-loss mini-batches can speed up SGD convergence and propose a spectral clustering-based approach for identifying these high-loss mini-batches. Our experimental results validate our theoretical findings and demonstrate that our proposed algorithm outperforms vanilla SGD in practically relevant settings, providing a better understanding of mini-batch optimization in contrastive learning.
Submitted 12 July, 2023;
originally announced July 2023.
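To make the mini-batch setting concrete, here is a small sketch that evaluates a contrastive loss on every one of the $\binom{N}{B}$ mini-batches of a toy dataset and ranks them by loss. It assumes a one-directional InfoNCE-style loss with dot-product similarity, which may differ from the exact loss analyzed in the paper, and the brute-force enumeration stands in for (and is not) the spectral clustering-based selection the authors propose. All names are hypothetical.

```python
import math
from itertools import combinations

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def batch_contrastive_loss(u, v, batch):
    """InfoNCE-style loss restricted to the indices in `batch`.

    (u[i], v[i]) is a positive pair; every v[j] with j != i in the same
    batch acts as a negative for u[i].
    """
    total = 0.0
    for i in batch:
        logits = [dot(u[i], v[j]) for j in batch]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        total += log_denominator - dot(u[i], v[i])
    return total / len(batch)

def batches_by_loss(u, v, batch_size):
    """Enumerate all C(N, B) mini-batches and sort them from highest to lowest loss."""
    all_batches = combinations(range(len(u)), batch_size)
    return sorted(all_batches, key=lambda b: batch_contrastive_loss(u, v, b), reverse=True)

# Toy embeddings: pairs 0 and 1 are well aligned, pairs 2 and 3 are swapped.
u = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
ranked = batches_by_loss(u, v, batch_size=2)
```

Enumerating every batch is only feasible for tiny $N$, which is why a cheaper way of finding high-loss batches matters in practice.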
-
Teaching Arithmetic to Small Transformers
Authors:
Nayoung Lee,
Kartik Sreenivasan,
Jason D. Lee,
Kangwook Lee,
Dimitris Papailiopoulos
Abstract:
Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.
Submitted 7 July, 2023;
originally announced July 2023.
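As an illustration of what a "simple formatting change" to arithmetic training data might look like, the sketch below renders the same addition problem in two ways. The abstract does not say which formats the authors use; writing the answer least-significant digit first is one hypothetical choice, and both function names are made up for this example.

```python
def plain_example(a, b):
    """Conventional formatting: the sum is written most-significant digit first."""
    return f"{a}+{b}={a + b}"

def reversed_example(a, b):
    """Hypothetical alternative: the sum is written least-significant digit first,
    matching the order in which carries are computed by hand."""
    return f"{a}+{b}={str(a + b)[::-1]}"

plain = plain_example(128, 367)
reverse = reversed_example(128, 367)
```

Under next-token prediction, the second format lets each output digit depend only on digits and carries that have already been produced, which is one plausible reason formatting could affect how easily the task is learned.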
-
Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning
Authors:
Yingcong Li,
Kartik Sreenivasan,
Angeliki Giannou,
Dimitris Papailiopoulos,
Samet Oymak
Abstract:
Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple-to-study yet general family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering data related to each step of the composition, and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT by simply incorporating additional layers that perform the necessary data-filtering for CoT via the attention mechanism. In addition to these test-time benefits, we show that CoT helps accelerate pretraining by learning shortcuts to represent complex functions, and that filtering plays an important role in this process. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
Submitted 7 November, 2023; v1 submitted 30 May, 2023;
originally announced May 2023.
-
Rare Gems: Finding Lottery Tickets at Initialization
Authors:
Kartik Sreenivasan,
Jy-yong Sohn,
Liu Yang,
Matthew Grinde,
Alliot Nagle,
Hongyi Wang,
Eric Xing,
Kangwook Lee,
Dimitris Papailiopoulos
Abstract:
Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming "train, prune, re-train" approach. Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work by Frankle et al. and Su et al. presents concrete evidence that current algorithms for finding trainable networks at initialization fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy than simple baselines remains an open problem. In this work, we resolve this open problem by proposing Gem-Miner, which finds lottery tickets at initialization that beat current baselines. Gem-Miner finds lottery tickets trainable to accuracy competitive with or better than Iterative Magnitude Pruning (IMP), and does so up to $19\times$ faster.
Submitted 2 June, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
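For readers unfamiliar with the "train, prune, re-train" baseline, the sketch below shows one pruning round of the kind used in magnitude-based pruning: the smallest-magnitude surviving weights are masked out. This illustrates the general idea behind the IMP baseline only; it is not Gem-Miner, and the flat weight list, the function name, and the per-round fraction are assumptions of this example.

```python
def magnitude_prune_mask(weights, mask, prune_fraction):
    """Return a new 0/1 mask that removes the smallest-magnitude surviving weights.

    weights: flat list of trained weight values.
    mask: current 0/1 mask of the same length (1 = weight is still active).
    prune_fraction: fraction of the currently active weights to remove this round.
    """
    alive = [i for i, m in enumerate(mask) if m]
    n_prune = int(prune_fraction * len(alive))
    smallest = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    to_prune = set(smallest)
    return [0 if i in to_prune else m for i, m in enumerate(mask)]

# Two pruning rounds on toy weights; in an iterative scheme the network
# would be re-trained between rounds.
weights = [0.5, -0.1, 2.0, -0.05, 1.0]
mask_round1 = magnitude_prune_mask(weights, [1, 1, 1, 1, 1], prune_fraction=0.4)
mask_round2 = magnitude_prune_mask(weights, mask_round1, prune_fraction=0.5)
```

The repeated re-training between rounds is what makes this baseline time-consuming, and it is the cost that finding a good sparse subnetwork at initialization aims to avoid.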
-
Finding Everything within Random Binary Networks
Authors:
Kartik Sreenivasan,
Shashank Rajput,
Jy-yong Sohn,
Dimitris Papailiopoulos
Abstract:
A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks. A follow-up line of theoretical work provides justification of these findings by proving that slightly overparameterized neural networks, with commonly used continuous-valued random initializations, can indeed be pruned to approximate any target network. In this work, we show that the amplitude of those random weights does not even matter. We prove that any target network can be approximated up to arbitrary accuracy by simply pruning a random network of binary $\{\pm1\}$ weights that is only a polylogarithmic factor wider and deeper than the target network.
Submitted 22 October, 2021; v1 submitted 17 October, 2021;
originally announced October 2021.
-
An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks
Authors:
Shashank Rajput,
Kartik Sreenivasan,
Dimitris Papailiopoulos,
Amin Karbasi
Abstract:
It is well known that modern deep neural networks are powerful enough to memorize datasets even when the labels have been randomized. Recently, Vershynin (2020) settled a long-standing question by Baum (1988), proving that deep threshold networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(e^{1/\delta^2}(d+\sqrt{n})+n)$ weights, where $\delta$ is the minimum distance between the points. In this work, we improve the dependence on $\delta$ from exponential to almost linear, proving that $\widetilde{\mathcal{O}}(\frac{1}{\delta}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(\frac{d}{\delta}+n)$ weights are sufficient. Our construction uses Gaussian random weights only in the first layer, while all the subsequent layers use binary or integer weights. We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
Submitted 14 June, 2021;
originally announced June 2021.
-
Attack of the Tails: Yes, You Really Can Backdoor Federated Learning
Authors:
Hongyi Wang,
Kartik Sreenivasan,
Shashank Rajput,
Harit Vishwakarma,
Saurabh Agarwal,
Jy-yong Sohn,
Kangwook Lee,
Dimitris Papailiopoulos
Abstract:
Due to its decentralized nature, Federated Learning (FL) lends itself to adversarial attacks in the form of backdoors during training. The goal of a backdoor is to corrupt the performance of the trained model on specific sub-tasks (e.g., by classifying green cars as frogs). A range of FL backdoor attacks has been introduced in the literature, along with methods to defend against them, and it is currently an open question whether FL systems can be tailored to be robust against backdoors. In this work, we provide evidence to the contrary. We first establish that, in the general case, robustness to backdoors implies model robustness to adversarial examples, a major open problem in itself. Furthermore, detecting the presence of a backdoor in an FL model is unlikely assuming first-order oracles or polynomial time. We couple our theoretical results with a new family of backdoor attacks, which we refer to as edge-case backdoors. An edge-case backdoor forces a model to misclassify seemingly easy inputs that are, however, unlikely to be part of the training or test data, i.e., they live on the tail of the input distribution. We explain how these edge-case backdoors can lead to unsavory failures and may have serious repercussions on fairness, and show that, with careful tuning on the part of the adversary, one can insert them across a range of machine learning tasks (e.g., image classification, OCR, text prediction, sentiment analysis).
Submitted 9 July, 2020;
originally announced July 2020.