-
A Family of Pretrained Transformer Language Models for Russian
Authors:
Dmitry Zmitrovich,
Alexander Abramov,
Andrey Kalmykov,
Maria Tikhonova,
Ekaterina Taktasheva,
Danil Astafurov,
Mark Baushenko,
Artem Snegirev,
Vitalii Kadulin,
Sergey Markov,
Tatiana Shavrina,
Vladislav Mikhailov,
Alena Fenogenova
Abstract:
Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, develo** such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) archite…
▽ More
Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, develo** such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we aim to broaden the scope of the NLP research directions and enable the development of industrial solutions for the Russian language.
△ Less
Submitted 18 April, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Authors:
BigScience Workshop,
:,
Teven Le Scao,
Angela Fan,
Christopher Akiki,
Ellie Pavlick,
Suzana Ilić,
Daniel Hesslow,
Roman Castagné,
Alexandra Sasha Luccioni,
François Yvon,
Matthias Gallé,
Jonathan Tow,
Alexander M. Rush,
Stella Biderman,
Albert Webson,
Pawan Sasanka Ammanamanchi,
Thomas Wang,
Benoît Sagot,
Niklas Muennighoff,
Albert Villanova del Moral,
Olatunji Ruwase,
Rachel Bawden,
Stas Bekman,
Angelina McMillan-Major
, et al. (369 additional authors not shown)
Abstract:
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…
▽ More
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
△ Less
Submitted 27 June, 2023; v1 submitted 9 November, 2022;
originally announced November 2022.
-
Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation
Authors:
Oleg Serikov,
Vitaly Protasov,
Ekaterina Voloshina,
Viktoria Knyazkova,
Tatiana Shavrina
Abstract:
Linguistic analysis of language models is one of the ways to explain and describe their reasoning, weaknesses, and limitations. In the probing part of the model interpretability research, studies concern individual languages as well as individual linguistic structures. The question arises: are the detected regularities linguistically coherent, or on the contrary, do they dissonate at the typologic…
▽ More
Linguistic analysis of language models is one of the ways to explain and describe their reasoning, weaknesses, and limitations. In the probing part of the model interpretability research, studies concern individual languages as well as individual linguistic structures. The question arises: are the detected regularities linguistically coherent, or on the contrary, do they dissonate at the typological scale? Moreover, the majority of studies address the inherent set of languages and linguistic structures, leaving the actual typological diversity knowledge out of scope. In this paper, we present and apply the GUI-assisted framework allowing us to easily probe a massive number of languages for all the morphosyntactic features present in the Universal Dependencies data. We show that reflecting the anglo-centric trend in NLP over the past years, most of the regularities revealed in the mBERT model are typical for the western-European languages. Our framework can be integrated with the existing probing toolboxes, model cards, and leaderboards, allowing practitioners to use and share their standard probing methods to interpret multilingual models. Thus we propose a toolkit to systematize the multilingual flaws in multilingual models, providing a reproducible experimental setup for 104 languages and 80 morphosyntactic features. https://github.com/AIRI-Institute/Probing_framework
△ Less
Submitted 24 October, 2022;
originally announced October 2022.
-
TAPE: Assessing Few-shot Russian Language Understanding
Authors:
Ekaterina Taktasheva,
Tatiana Shavrina,
Alena Fenogenova,
Denis Shevelev,
Nadezhda Katricheva,
Maria Tikhonova,
Albina Akhmetgareeva,
Oleg Zinkevich,
Anastasiia Bashmakova,
Svetlana Iordanskaia,
Alena Spiridonova,
Valentina Kurenshchikova,
Ekaterina Artemova,
Vladislav Mikhailov
Abstract:
Recent advances in zero-shot and few-shot learning have shown promise for a scope of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this line of research, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six mo…
▽ More
Recent advances in zero-shot and few-shot learning have shown promise for a scope of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this line of research, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. The TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. The detailed analysis of testing the autoregressive baselines indicates that simple spelling-based perturbations affect the performance the most, while paraphrasing the input has a more negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
△ Less
Submitted 23 October, 2022;
originally announced October 2022.
-
Vote'n'Rank: Revision of Benchmarking with Social Choice Theory
Authors:
Mark Rofin,
Vladislav Mikhailov,
Mikhail Florinskiy,
Andrey Kravchenko,
Elena Tutubalina,
Tatiana Shavrina,
Daniel Karabekyan,
Ekaterina Artemova
Abstract:
The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular in…
▽ More
The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow the unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The Vote'n'Rank's procedures are more robust than the mean average while being able to handle missing performance scores and determine conditions under which the system becomes the winner.
△ Less
Submitted 12 February, 2023; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Is neural language acquisition similar to natural? A chronological probing study
Authors:
Ekaterina Voloshina,
Oleg Serikov,
Tatiana Shavrina
Abstract:
The probing methodology allows one to obtain a partial representation of linguistic phenomena stored in the inner layers of the neural network, using external classifiers and statistical analysis. Pre-trained transformer-based language models are widely used both for natural language understanding (NLU) and natural language generation (NLG) tasks making them most commonly used for downstream appli…
▽ More
The probing methodology allows one to obtain a partial representation of linguistic phenomena stored in the inner layers of the neural network, using external classifiers and statistical analysis. Pre-trained transformer-based language models are widely used both for natural language understanding (NLU) and natural language generation (NLG) tasks making them most commonly used for downstream applications. However, little analysis was carried out, whether the models were pre-trained enough or contained knowledge correlated with linguistic theory. We are presenting the chronological probing study of transformer English models such as MultiBERT and T5. We sequentially compare the information about the language learned by the models in the process of training on corpora. The results show that 1) linguistic information is acquired in the early stages of training 2) both language models demonstrate capabilities to capture various features from various levels of language, including morphology, syntax, and even discourse, while they also can inconsistently fail on tasks that are perceived as easy. We also introduce the open-source framework for chronological probing research, compatible with other transformer-based models. https://github.com/EkaterinaVoloshina/chronological_probing
△ Less
Submitted 1 July, 2022;
originally announced July 2022.
-
Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian
Authors:
Tatiana Shamardina,
Vladislav Mikhailov,
Daniil Chernianskii,
Alena Fenogenova,
Marat Saidov,
Anastasiya Valeeva,
Tatiana Shavrina,
Ivan Smurov,
Elena Tutubalina,
Ekaterina Artemova
Abstract:
We present the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022. The shared task dataset includes texts from 14 text generators, i.e., one human writer and 13 text generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, text summarization, text si…
▽ More
We present the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022. The shared task dataset includes texts from 14 text generators, i.e., one human writer and 13 text generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, text summarization, text simplification. We also consider back-translation and zero-shot generation approaches. The human-written texts are collected from publicly available resources across multiple domains. The shared task consists of two sub-tasks: (i) to determine if a given text is automatically generated or written by a human; (ii) to identify the author of a given text. The first task is framed as a binary classification problem. The second task is a multi-class classification problem. We provide count-based and BERT-based baselines, along with the human evaluation on the first sub-task. A total of 30 and 8 systems have been submitted to the binary and multi-class sub-tasks, correspondingly. Most teams outperform the baselines by a wide margin. We publicly release our codebase, human evaluation results, and other materials in our GitHub repository (https://github.com/dialogue-evaluation/RuATD).
△ Less
Submitted 3 June, 2022;
originally announced June 2022.
-
WikiOmnia: generative QA corpus on the whole Russian Wikipedia
Authors:
Dina Pisarevskaya,
Tatiana Shavrina
Abstract:
The General QA field has been develo** the methodology referencing the Stanford Question answering dataset (SQuAD) as the significant benchmark. However, compiling factual questions is accompanied by time- and labour-consuming annotation, limiting the training data's potential size. We present the WikiOmnia dataset, a new publicly available set of QA-pairs and corresponding Russian Wikipedia art…
▽ More
The General QA field has been develo** the methodology referencing the Stanford Question answering dataset (SQuAD) as the significant benchmark. However, compiling factual questions is accompanied by time- and labour-consuming annotation, limiting the training data's potential size. We present the WikiOmnia dataset, a new publicly available set of QA-pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and is also tested for creating SQuAD-formatted QA on other domains, like news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
△ Less
Submitted 9 November, 2022; v1 submitted 17 April, 2022;
originally announced April 2022.
-
mGPT: Few-Shot Learners Go Multilingual
Authors:
Oleh Shliazhko,
Alena Fenogenova,
Maria Tikhonova,
Vladislav Mikhailov,
Anastasia Kozlova,
Tatiana Shavrina
Abstract:
Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using the pre-trained language models. This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia and Colossal Clean…
▽ More
Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using the pre-trained language models. This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia and Colossal Clean Crawled Corpus. We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism; Deepspeed and Megatron frameworks allow us to parallelize the training and inference steps effectively. The resulting models show performance on par with the recently released XGLM models by Facebook, covering more languages and enhancing NLP possibilities for low resource languages of CIS countries and Russian small nations. We detail the motivation for the choices of the architecture design, thoroughly describe the data preparation pipeline, and train five small versions of the model to choose the most optimal multilingual tokenization strategy. We measure the model perplexity in all covered languages and evaluate it on the wide spectre of multilingual tasks, including classification, generative, sequence labeling and knowledge probing. The models were evaluated with the zero-shot and few-shot methods. Furthermore, we compared the classification tasks with the state-of-the-art multilingual model XGLM. source code and the mGPT XL model are publicly released.
△ Less
Submitted 12 October, 2023; v1 submitted 15 April, 2022;
originally announced April 2022.
-
RuCLIP -- new models and experiments: a technical report
Authors:
Alex Shonenkov,
Andrey Kuznetsov,
Denis Dimitrov,
Tatyana Shavrina,
Daniil Chesakov,
Anastasia Maltseva,
Alena Fenogenova,
Igor Pavlov,
Anton Emelyanov,
Sergey Markov,
Daria Bakshandaeva,
Vera Shybaeva,
Andrey Chertok
Abstract:
In the report we propose six new implementations of ruCLIP model trained on our 240M pairs. The accuracy results are compared with original CLIP model with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform CLIP + OPUS-MT solution on most of the datasets in few-show and zero-shot tasks. In the report we briefly describe the implementations and co…
▽ More
In the report we propose six new implementations of ruCLIP model trained on our 240M pairs. The accuracy results are compared with original CLIP model with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform CLIP + OPUS-MT solution on most of the datasets in few-show and zero-shot tasks. In the report we briefly describe the implementations and concentrate on the conducted experiments. Inference execution time comparison is also presented in the report.
△ Less
Submitted 22 February, 2022;
originally announced February 2022.
-
Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models
Authors:
Alena Fenogenova,
Maria Tikhonova,
Vladislav Mikhailov,
Tatiana Shavrina,
Anton Emelyanov,
Denis Shevelev,
Alexandr Kukushkin,
Valentin Malykh,
Ekaterina Artemova
Abstract:
In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which led to performance evaluation problems across a range of language understanding tasks.
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user experience and methodological impro…
▽ More
In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which led to performance evaluation problems across a range of language understanding tasks.
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user experience and methodological improvements, including fixes of the benchmark vulnerabilities unresolved in the previous version: novel and improved tests for understanding the meaning of a word in context (RUSSE) along with reading comprehension and common sense reasoning (DaNetQA, RuCoS, MuSeRC). Together with the release of the updated datasets, we improve the benchmark toolkit based on \texttt{jiant} framework for consistent training and evaluation of NLP-models of various architectures which now supports the most recent models for Russian. Finally, we provide the integration of Russian SuperGLUE with a framework for industrial evaluation of the open-source models, MOROCCO (MOdel ResOurCe COmparison), in which the models are evaluated according to the weighted average metric over all tasks, the inference speed, and the occupied amount of RAM. Russian SuperGLUE is publicly available at https://russiansuperglue.com/.
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
MOROCCO: Model Resource Comparison Framework
Authors:
Valentin Malykh,
Alexander Kukushkin,
Ekaterina Artemova,
Vladislav Mikhailov,
Maria Tikhonova,
Tatiana Shavrina
Abstract:
The new generation of pre-trained NLP models push the SOTA to the new limits, but at the cost of computational resources, to the point that their use in real production environments is often prohibitively expensive. We tackle this problem by evaluating not only the standard quality metrics on downstream tasks but also the memory footprint and inference time. We present MOROCCO, a framework to comp…
▽ More
The new generation of pre-trained NLP models push the SOTA to the new limits, but at the cost of computational resources, to the point that their use in real production environments is often prohibitively expensive. We tackle this problem by evaluating not only the standard quality metrics on downstream tasks but also the memory footprint and inference time. We present MOROCCO, a framework to compare language models compatible with \texttt{jiant} environment which supports over 50 NLU tasks, including SuperGLUE benchmark and multiple probing suites. We demonstrate its applicability for two GLUE-like suites in different languages.
△ Less
Submitted 29 April, 2021;
originally announced April 2021.
-
Domain-Transferable Method for Named Entity Recognition Task
Authors:
Vladislav Mikhailov,
Tatiana Shavrina
Abstract:
Named Entity Recognition (NER) is a fundamental task in the fields of natural language processing and information extraction. NER has been widely used as a standalone tool or an essential component in a variety of applications such as question answering, dialogue assistants and knowledge graphs development. However, training reliable NER models requires a large amount of labelled data which is exp…
▽ More
Named Entity Recognition (NER) is a fundamental task in the fields of natural language processing and information extraction. NER has been widely used as a standalone tool or an essential component in a variety of applications such as question answering, dialogue assistants and knowledge graphs development. However, training reliable NER models requires a large amount of labelled data which is expensive to obtain, particularly in specialized domains. This paper describes a method to learn a domain-specific NER model for an arbitrary set of named entities when domain-specific supervision is not available. We assume that the supervision can be obtained with no human effort, and neural models can learn from each other. The code, data and models are publicly available.
△ Less
Submitted 24 November, 2020;
originally announced November 2020.
-
RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
Authors:
Tatiana Shavrina,
Alena Fenogenova,
Anton Emelyanov,
Denis Shevelev,
Ekaterina Artemova,
Valentin Malykh,
Vladislav Mikhailov,
Maria Tikhonova,
Andrey Chertok,
Andrey Evlampiev
Abstract:
In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills - detection of natural language inference, commonsense reasoning, ability to perform simple logi…
▽ More
In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills - detection of natural language inference, commonsense reasoning, ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogically to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, human level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. Besides, we present the first results of comparing multilingual models in the adapted diagnostic test set and offer the first steps to further expanding or assessing state-of-the-art models independently of language.
△ Less
Submitted 2 November, 2020; v1 submitted 29 October, 2020;
originally announced October 2020.
-
DaNetQA: a yes/no Question Answering Dataset for the Russian Language
Authors:
Taisia Glushkova,
Alexey Machnev,
Alena Fenogenova,
Tatiana Shavrina,
Ekaterina Artemova,
Dmitry I. Ignatov
Abstract:
DaNetQA, a new question-answering corpus, follows (Clark et. al, 2019) design: it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer, derived from the paragraph. The task is to take both the question and a paragraph as input and come up with a yes/no answer, i.e. to produce a binary output. In this paper, we present a reproducible approach to…
▽ More
DaNetQA, a new question-answering corpus, follows (Clark et. al, 2019) design: it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer, derived from the paragraph. The task is to take both the question and a paragraph as input and come up with a yes/no answer, i.e. to produce a binary output. In this paper, we present a reproducible approach to DaNetQA creation and investigate transfer learning methods for task and language transferring. For task transferring we leverage three similar sentence modelling tasks: 1) a corpus of paraphrases, Paraphraser, 2) an NLI task, for which we use the Russian part of XNLI, 3) another question answering task, SberQUAD. For language transferring we use English to Russian translation together with multilingual language fine-tuning.
△ Less
Submitted 15 October, 2020; v1 submitted 6 October, 2020;
originally announced October 2020.
-
AGRR-2019: A Corpus for Gap** Resolution in Russian
Authors:
Maria Ponomareva,
Kira Droganova,
Ivan Smurov,
Tatiana Shavrina
Abstract:
This paper provides a comprehensive overview of the gap** dataset for Russian that consists of 7.5k sentences with gap** (as well as 15k relevant negative sentences) and comprises data from various genres: news, fiction, social media and technical texts. The dataset was prepared for the Automatic Gap** Resolution Shared Task for Russian (AGRR-2019) - a competition aimed at stimulating the de…
▽ More
This paper provides a comprehensive overview of the gap** dataset for Russian that consists of 7.5k sentences with gap** (as well as 15k relevant negative sentences) and comprises data from various genres: news, fiction, social media and technical texts. The dataset was prepared for the Automatic Gap** Resolution Shared Task for Russian (AGRR-2019) - a competition aimed at stimulating the development of NLP tools and methods for processing of ellipsis.
In this paper, we pay special attention to the gap** resolution methods that were introduced within the shared task as well as an alternative test set that illustrates that our corpus is a diverse and representative subset of Russian language gap** sufficient for effective utilization of machine learning techniques.
△ Less
Submitted 10 June, 2019;
originally announced June 2019.