-
MERA: A Comprehensive LLM Evaluation in Russian
Authors:
Alena Fenogenova,
Artem Chervyakov,
Nikita Martynov,
Anastasia Kozlova,
Maria Tikhonova,
Albina Akhmetgareeva,
Anton Emelyanov,
Denis Shevelev,
Pavel Lebedev,
Leonid Sinev,
Ulyana Isaeva,
Katerina Kolomeytseva,
Daniil Moskovskiy,
Elizaveta Goncharova,
Nikita Savushkin,
Polina Mikhailova,
Denis Dimitrov,
Alexander Panchenko,
Sergei Markov
Abstract:
Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). As the models' size increases, LMs demonstrate enhancements in measurable aspects and the development of new qualitative features. However, despite researchers' attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language Architectures (MERA), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains and is designed as a black-box test to ensure the exclusion of data leakage. The paper introduces a methodology to evaluate FMs and LMs in zero- and few-shot fixed instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find that they are still far behind the human level. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential societal drawbacks.
Submitted 12 January, 2024; v1 submitted 9 January, 2024;
originally announced January 2024.
-
A Family of Pretrained Transformer Language Models for Russian
Authors:
Dmitry Zmitrovich,
Alexander Abramov,
Andrey Kalmykov,
Maria Tikhonova,
Ekaterina Taktasheva,
Danil Astafurov,
Mark Baushenko,
Artem Snegirev,
Vitalii Kadulin,
Sergey Markov,
Tatiana Shavrina,
Vladislav Mikhailov,
Alena Fenogenova
Abstract:
Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, developing such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we aim to broaden the scope of the NLP research directions and enable the development of industrial solutions for the Russian language.
Submitted 18 April, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
TAPE: Assessing Few-shot Russian Language Understanding
Authors:
Ekaterina Taktasheva,
Tatiana Shavrina,
Alena Fenogenova,
Denis Shevelev,
Nadezhda Katricheva,
Maria Tikhonova,
Albina Akhmetgareeva,
Oleg Zinkevich,
Anastasiia Bashmakova,
Svetlana Iordanskaia,
Alena Spiridonova,
Valentina Kurenshchikova,
Ekaterina Artemova,
Vladislav Mikhailov
Abstract:
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this line of research, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. The detailed analysis of testing the autoregressive baselines indicates that simple spelling-based perturbations affect the performance the most, while paraphrasing the input has a less pronounced effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
Submitted 23 October, 2022;
originally announced October 2022.
-
mGPT: Few-Shot Learners Go Multilingual
Authors:
Oleh Shliazhko,
Alena Fenogenova,
Maria Tikhonova,
Vladislav Mikhailov,
Anastasia Kozlova,
Tatiana Shavrina
Abstract:
Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using pre-trained language models. This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia and Colossal Clean Crawled Corpus. We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism; Deepspeed and Megatron frameworks allow us to parallelize the training and inference steps effectively. The resulting models show performance on par with the recently released XGLM models by Facebook, covering more languages and enhancing NLP possibilities for low-resource languages of CIS countries and Russian small nations. We detail the motivation for the choices of the architecture design, thoroughly describe the data preparation pipeline, and train five small versions of the model to choose the optimal multilingual tokenization strategy. We measure the model perplexity in all covered languages and evaluate it on a wide spectrum of multilingual tasks, including classification, generative, sequence labeling and knowledge probing. The models were evaluated with the zero-shot and few-shot methods. Furthermore, we compared the results on the classification tasks with the state-of-the-art multilingual model XGLM. The source code and the mGPT XL model are publicly released.
Submitted 12 October, 2023; v1 submitted 15 April, 2022;
originally announced April 2022.
-
Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models
Authors:
Alena Fenogenova,
Maria Tikhonova,
Vladislav Mikhailov,
Tatiana Shavrina,
Anton Emelyanov,
Denis Shevelev,
Alexandr Kukushkin,
Valentin Malykh,
Ekaterina Artemova
Abstract:
In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which led to performance evaluation problems across a range of language understanding tasks.
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user experience and methodological improvements, including fixes of the benchmark vulnerabilities unresolved in the previous version: novel and improved tests for understanding the meaning of a word in context (RUSSE) along with reading comprehension and common sense reasoning (DaNetQA, RuCoS, MuSeRC). Together with the release of the updated datasets, we improve the benchmark toolkit based on the jiant framework for consistent training and evaluation of NLP models of various architectures, which now supports the most recent models for Russian. Finally, we provide the integration of Russian SuperGLUE with a framework for industrial evaluation of the open-source models, MOROCCO (MOdel ResOurCe COmparison), in which the models are evaluated according to the weighted average metric over all tasks, the inference speed, and the occupied amount of RAM. Russian SuperGLUE is publicly available at https://russiansuperglue.com/.
Submitted 15 February, 2022;
originally announced February 2022.
-
MOROCCO: Model Resource Comparison Framework
Authors:
Valentin Malykh,
Alexander Kukushkin,
Ekaterina Artemova,
Vladislav Mikhailov,
Maria Tikhonova,
Tatiana Shavrina
Abstract:
The new generation of pre-trained NLP models pushes the SOTA to new limits, but at the cost of computational resources, to the point that their use in real production environments is often prohibitively expensive. We tackle this problem by evaluating not only the standard quality metrics on downstream tasks but also the memory footprint and inference time. We present MOROCCO, a framework to compare language models compatible with the jiant environment, which supports over 50 NLU tasks, including the SuperGLUE benchmark and multiple probing suites. We demonstrate its applicability for two GLUE-like suites in different languages.
Submitted 29 April, 2021;
originally announced April 2021.
-
RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
Authors:
Tatiana Shavrina,
Alena Fenogenova,
Anton Emelyanov,
Denis Shevelev,
Ekaterina Artemova,
Valentin Malykh,
Vladislav Mikhailov,
Maria Tikhonova,
Andrey Chertok,
Andrey Evlampiev
Abstract:
In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, human level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the adapted diagnostic test set and offer the first steps to further expanding or assessing state-of-the-art models independently of language.
Submitted 2 November, 2020; v1 submitted 29 October, 2020;
originally announced October 2020.