-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Authors:
Guilherme Penedo,
Hynek Kydlíček,
Loubna Ben allal,
Anton Lozhkov,
Margaret Mitchell,
Colin Raffel,
Leandro Von Werra,
Thomas Wolf
Abstract:
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
Submitted 25 June, 2024;
originally announced June 2024.
-
The Falcon Series of Open Language Models
Authors:
Ebtesam Almazrouei,
Hamza Alobeidli,
Abdulaziz Alshamsi,
Alessandro Cappelli,
Ruxandra Cojocaru,
Mérouane Debbah,
Étienne Goffinet,
Daniel Hesslow,
Julien Launay,
Quentin Malartic,
Daniele Mazzotta,
Badreddine Noune,
Baptiste Pannier,
Guilherme Penedo
Abstract:
We introduce the Falcon series: 7B, 40B, and 180B parameter causal decoder-only models trained on diverse, high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text, the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, which allows us to efficiently pretrain these models on up to 4,096 A100s on AWS cloud infrastructure with limited interconnect. We release a 600B-token extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open science and accelerate the development of an open ecosystem of large language models.
Submitted 29 November, 2023; v1 submitted 28 November, 2023;
originally announced November 2023.
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Authors:
Guilherme Penedo,
Quentin Malartic,
Daniel Hesslow,
Ruxandra Cojocaru,
Alessandro Cappelli,
Hamza Alobeidli,
Baptiste Pannier,
Ebtesam Almazrouei,
Julien Launay
Abstract:
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B-parameter language models trained on it.
Submitted 1 June, 2023;
originally announced June 2023.
-
Automatic segmentation of the Foveal Avascular Zone in ophthalmological OCT-A images
Authors:
Macarena Díaz,
Jorge Novo,
Paula Cutrín,
Francisco Gómez-Ulla,
Manuel G. Penedo,
Marcos Ortega
Abstract:
Optical Coherence Tomography Angiography (OCT-A) is a recent non-invasive retinal imaging modality that allows the visualization of the vascular structure at predefined depths based on the detection of blood movement. OCT-A images constitute a suitable scenario to analyse the retinal vascular properties of regions of interest, measuring the characteristics of the foveal vascular and avascular zones. Extracted parameters of this region can be used as prognostic factors that determine if the patient suffers from certain pathologies, indicating the associated pathological degree. The manual extraction of these biomedical parameters is a long, tedious and subjective process that introduces significant intra- and inter-expert variability, which penalizes the utility of the measurements. In addition, the absence of tools that automatically facilitate these calculations encourages the creation of computer-aided diagnosis frameworks that ease the doctor's work, increasing their productivity and making the use of this type of vascular biomarker viable.
We propose a fully automatic system that identifies and precisely segments the foveal avascular zone (FAZ) in OCT-A, a novel ophthalmological imaging modality. The system combines different image processing techniques to first identify the region that contains the FAZ and then extract its precise contour. The system was validated using a representative set of 168 OCT-A images, providing accurate results with a best correlation of 0.93 against the manual measurements of two expert clinicians and a Jaccard index of 0.82 in the best experimental case. This tool provides an accurate FAZ measurement with the desired objectivity and reproducibility, making it very useful for the analysis of relevant vascular diseases through the study of the retinal microcirculation.
Submitted 26 November, 2018;
originally announced November 2018.