
Showing 1–8 of 8 results for author: Yangarber, R

  1. arXiv:2406.02335  [pdf, other]

    cs.CL

    Probing the Category of Verbal Aspect in Transformer Language Models

    Authors: Anisia Katinskaia, Roman Yangarber

    Abstract: We investigate how pretrained language models (PLM) encode the grammatical category of verbal aspect in Russian. Encoding of aspect in transformer LMs has not been studied previously in any language. A particular challenge is posed by "alternative contexts": where either the perfective or the imperfective aspect is suitable grammatically and semantically. We perform probing using BERT and RoBERTa…

    Submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2405.08469  [pdf, other]

    cs.CL cs.AI

    GPT-3.5 for Grammatical Error Correction

    Authors: Anisia Katinskaia, Roman Yangarber

    Abstract: This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality wi…

    Submitted 14 May, 2024; originally announced May 2024.

  3. arXiv:2404.14270  [pdf, other]

    cs.CL cs.LG

    What do Transformers Know about Government?

    Authors: Jue Hou, Anisia Katinskaia, Lari Kotilainen, Sathianpong Trangcasanchai, Anh-Duc Vu, Roman Yangarber

    Abstract: This paper investigates what insights about linguistic features and what knowledge about the structure of natural language can be obtained from the encodings in transformer language models. In particular, we explore how BERT encodes the government relation between constituents in a sentence. We use several probing classifiers, and data from two morphologically rich languages. Our experiments show t…

    Submitted 22 April, 2024; originally announced April 2024.

  4. arXiv:2404.00482  [pdf, other]

    cs.CL cs.AI cs.LG

    Cross-lingual Named Entity Corpus for Slavic Languages

    Authors: Jakub Piskorski, Michał Marcińczuk, Roman Yangarber

    Abstract: This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5 017 documents on seven topics. The documents are annotated with five classes…

    Submitted 7 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Published in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

  5. arXiv:2305.05480  [pdf, other]

    cs.CL cs.AI cs.LG

    Effects of sub-word segmentation on performance of transformer language models

    Authors: Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber

    Abstract: Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance of language models (LMs). In this paper, we compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms…

    Submitted 26 October, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: This submission published in EMNLP 2023


  6. arXiv:2212.01711  [pdf, other]

    cs.CL

    Linguistic Constructs as the Representation of the Domain Model in an Intelligent Language Tutoring System

    Authors: Anisia Katinskaia, Jue Hou, Anh-Duc Vu, Roman Yangarber

    Abstract: This paper presents the development of an AI-based language learning platform Revita. It is a freely available intelligent online tutor, developed to support learners of multiple languages, from low-intermediate to advanced levels. It has been in pilot use by hundreds of students at several universities, whose feedback and needs are shaping the development. One of the main emerging features of Rev…

    Submitted 3 December, 2022; originally announced December 2022.

    ACM Class: K.3

  7. arXiv:2211.13794  [pdf, other]

    cs.CL

    Question Answering and Question Generation for Finnish

    Authors: Ilmari Kylliäinen, Roman Yangarber

    Abstract: Recent advances in the field of language modeling have improved the state-of-the-art in question answering (QA) and question generation (QG). However, the development of modern neural models, their benchmarks, and datasets for training them has mainly focused on English. Finnish, like many other languages, faces a shortage of large QA/QG model training resources, which has prevented experimenting…

    Submitted 24 November, 2022; originally announced November 2022.

  8. arXiv:2007.06104  [pdf, ps, other]

    cs.CL

    Neural disambiguation of lemma and part of speech in morphologically rich languages

    Authors: José María Hoya Quecedo, Maximilian W. Koppatz, Giacomo Furlan, Roman Yangarber

    Abstract: We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages. We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser -- with no manual disambiguation or data annotation. We assume that the morphological analyser produces multiple analyses for ambiguous…

    Submitted 12 July, 2020; originally announced July 2020.

    Comments: This paper contains corrigenda to a previously published paper (Hoya Quecedo et al., 2020). It corrects a mistake in the original evaluation setup and in the results reported in Section 6, in Tables 5, 6, and 7.

    Journal ref: Proceedings of LREC-2020: the 12th Conference on Language Resources and Evaluation