Showing 1–1 of 1 results for author: Chaperon, G

Search v0.5.6 released 2020-02-24

arXiv:2308.02976 [pdf, ps, other]

cs.CL cs.AI cs.LG

Spanish Pre-trained BERT Model and Evaluation Data

Authors: José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, Jorge Pérez

Abstract: The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

Submitted 5 August, 2023; originally announced August 2023.

Comments: Published as workshop paper at Practical ML for Developing Countries Workshop @ ICLR 2020
