Showing 1–4 of 4 results for author: Penedo, G

  1. arXiv:2406.17557

    cs.CL

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Authors: Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf

    Abstract: The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produ…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2311.16867

    cs.CL cs.AI

    The Falcon Series of Open Language Models

    Authors: Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo

    Abstract: We introduce the Falcon series: 7B, 40B, and 180B parameter causal decoder-only models trained on diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurr…

    Submitted 29 November, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

  3. arXiv:2306.01116

    cs.CL cs.AI

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Authors: Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay

    Abstract: Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclea…

    Submitted 1 June, 2023; originally announced June 2023.

  4. Automatic segmentation of the Foveal Avascular Zone in ophthalmological OCT-A images

    Authors: Macarena Díaz, Jorge Novo, Paula Cutrín, Francisco Gómez-Ulla, Manuel G. Penedo, Marcos Ortega

    Abstract: Angiography by Optical Coherence Tomography is a non-invasive retinal imaging modality of recent appearance that allows the visualization of the vascular structure at predefined depths based on the detection of the blood movement. OCT-A images constitute a suitable scenario to analyse the retinal vascular properties of regions of interest, measuring the characteristics of the foveal vascular and a…

    Submitted 26 November, 2018; originally announced November 2018.