-
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Authors:
Zachary Ankner,
Cody Blakeney,
Kartik Sreenivasan,
Max Marion,
Matthew L. Leavitt,
Mansheej Paul
Abstract:
In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can \emph{significantly} improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a $1.45\times$ reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
Submitted 30 May, 2024;
originally announced May 2024.
-
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
Authors:
Max Marion,
Ahmet Üstün,
Luiza Pozzobon,
Alex Wang,
Marzieh Fadaee,
Sara Hooker
Abstract:
Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web text. To date, efforts to prune these datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we take a wider view and explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data. We perform a rigorous comparison at scale of the simple data quality estimator of perplexity, as well as more sophisticated and computationally intensive estimates of the Error L2-Norm and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. Surprisingly, we find that the simple technique of perplexity outperforms our more computationally expensive scoring methods. We improve over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets the foundation for unexplored strategies in automatically curating high quality corpora and suggests the majority of pretraining data can be removed while retaining performance.
Submitted 8 September, 2023;
originally announced September 2023.
-
Existence of pulses for a reaction-diffusion system of blood coagulation
Authors:
Nicolas Ratto,
Martine Marion,
Vitaly Volpert
Abstract:
The paper is devoted to the investigation of a reaction-diffusion system of equations describing the process of blood coagulation. The existence of pulse solutions, that is, positive stationary solutions with zero limit at infinity, is studied. It is shown that such solutions exist if and only if the speed of the travelling wave described by the same system is positive. The proof is based on the Leray-Schauder method using topological degree for elliptic problems in unbounded domains and a priori estimates of solutions in some appropriate weighted spaces.
Submitted 3 April, 2019;
originally announced April 2019.
-
Hydrothermal formation of Clay-Carbonate alteration assemblages in the Nili Fossae region of Mars
Authors:
Adrian J. Brown,
Simon J. Hook,
Alice M. Baldridge,
James K. Crowley,
Nathan T. Bridges,
Bradley J. Thomson,
Giles M. Marion,
Carlos R. de Souza Filho,
Janice L. Bishop
Abstract:
The Compact Reconnaissance Imaging Spectrometer for Mars (CRISM) has returned observations of the Nili Fossae region indicating the presence of Mg-carbonate in small (<10 km²), relatively bright rock units that are commonly fractured (Ehlmann et al., 2008b). We have analyzed spectra from CRISM images and used co-located HiRISE images in order to further characterize these carbonate-bearing units. We applied absorption band mapping techniques to investigate a range of possible phyllosilicate and carbonate minerals that could be present in the Nili Fossae region. We also describe a clay-carbonate hydrothermal alteration mineral assemblage in the Archean Warrawoona Group of Western Australia that is a potential Earth analog to the Nili Fossae carbonate-bearing rock units. We discuss the geological and biological implications for hydrothermal processes on Noachian Mars.
Submitted 5 February, 2014;
originally announced February 2014.
-
Global existence for fully nonlinear reaction-diffusion systems describing multicomponent reactive flows
Authors:
Martine Marion,
Roger Temam
Abstract:
We consider combustion problems in the presence of complex chemistry and nonlinear diffusion laws leading to fully nonlinear multispecies reaction-diffusion equations. We establish results of existence of solution and maximum principle, i.e. positivity of the mass fractions, which rely on specific properties of the models. The nonlinear diffusion coefficients are obtained by resolution of the so-called Stefan-Maxwell equations.
Submitted 9 October, 2013;
originally announced October 2013.