Showing 1–2 of 2 results for author: Kydlíček, H

  1. arXiv:2406.17557  [pdf, other]

    cs.CL

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Authors: Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf

    Abstract: The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produ…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2307.10666  [pdf, other]

    cs.CL

    A Dataset and Strong Baselines for Classification of Czech News Texts

    Authors: Hynek Kydlíček, Jindřich Libovický

    Abstract: Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of ne…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: 12 pages, Accepted to Text, Speech and Dialogue (TSD) 2023