Showing 1–12 of 12 results for author: Ács, J

  1. arXiv:2404.03555  [pdf, other]

    cs.CL

    From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization

    Authors: Botond Barta, Dorina Lakatos, Attila Nagy, Milán Konor Nyist, Judit Ács

    Abstract: Training summarization models requires substantial amounts of training data. However, for less resourceful languages like Hungarian, openly available models and datasets are notably scarce. To address this gap, our paper introduces HunSum-2, an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl…

    Submitted 12 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  2. arXiv:2311.02355  [pdf, other]

    cs.CL

    TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping

    Authors: Attila Nagy, Dorina Lakatos, Botond Barta, Judit Ács

    Abstract: Data augmentation methods for neural machine translation are particularly useful when a limited amount of training data is available, which is often the case when dealing with low-resource languages. We introduce a novel augmentation method, which generates new sentences by swapping objects and subjects across bisentences. This is performed simultaneously based on the dependency parse trees of the s…

    Submitted 4 November, 2023; originally announced November 2023.

  3. arXiv:2307.07025  [pdf, other]

    cs.CL

    Data Augmentation for Machine Translation via Dependency Subtree Swapping

    Authors: Attila Nagy, Dorina Petra Lakatos, Botond Barta, Patrick Nanys, Judit Ács

    Abstract: We present a generic framework for data augmentation via dependency subtree swapping that is applicable to machine translation. We extract corresponding subtrees from the dependency parse trees of the source and target sentences and swap these across bisentences to create augmented samples. We perform thorough filtering based on graph-based similarities of the dependency trees and additional heuris…

    Submitted 13 July, 2023; originally announced July 2023.

  4. Morphosyntactic probing of multilingual BERT models

    Authors: Judit Acs, Endre Hamerlik, Roy Schwartz, Noah A. Smith, Andras Kornai

    Abstract: We introduce an extensive dataset for multilingual probing of morphological information in language models (247 tasks across 42 languages from 10 families), each consisting of a sentence with a target word and a morphological tag as the desired label, derived from the Universal Dependencies treebanks. We find that pre-trained Transformer models (mBERT and XLM-RoBERTa) learn features that attain st…

    Submitted 9 June, 2023; originally announced June 2023.

    Comments: to appear in the Journal of Natural Language Engineering

  5. arXiv:2302.00455  [pdf, other]

    cs.CL

    HunSum-1: an Abstractive Summarization Dataset for Hungarian

    Authors: Botond Barta, Dorina Lakatos, Attila Nagy, Milán Konor Nyist, Judit Ács

    Abstract: We introduce HunSum-1: a dataset for Hungarian abstractive summarization, consisting of 1.14M news articles. The dataset is built by collecting, cleaning and deduplicating data from 9 major Hungarian news sites through CommonCrawl. Using this dataset, we build abstractive summarizer models based on huBERT and mT5. We demonstrate the value of the created dataset by performing a quantitative and qua…

    Submitted 1 February, 2023; originally announced February 2023.

  6. arXiv:2201.06876  [pdf, other]

    cs.CL cs.LG

    Syntax-based data augmentation for Hungarian-English machine translation

    Authors: Attila Nagy, Patrick Nanys, Balázs Frey Konrád, Bence Bial, Judit Ács

    Abstract: We train Transformer-based neural machine translation models for Hungarian-English and English-Hungarian using the Hunglish2 corpus. Our best models achieve a BLEU score of 40.0 on Hungarian-English and 33.4 on English-Hungarian. Furthermore, we present results on an ongoing work about syntax-based augmentation for neural machine translation. Both our code and models are publicly available.

    Submitted 18 January, 2022; originally announced January 2022.

  7. arXiv:2109.07006  [pdf, other]

    cs.CL cs.AI

    A Three Step Training Approach with Data Augmentation for Morphological Inflection

    Authors: Gabor Szolnok, Botond Barta, Dorina Lakatos, Judit Acs

    Abstract: We present the BME submission for the SIGMORPHON 2021 Task 0 Part 1, Generalization Across Typologically Diverse Languages shared task. We use an LSTM encoder-decoder model with three-step training that is first trained on all languages, then fine-tuned on each language family and finally fine-tuned on individual languages. We use a different type of data augmentation technique in the first two s…

    Submitted 14 September, 2021; originally announced September 2021.

    MSC Class: 68T50 ACM Class: I.2; D.0

  8. arXiv:2109.06327  [pdf, other]

    cs.CL

    Evaluating Transferability of BERT Models on Uralic Languages

    Authors: Judit Ács, Dániel Lévai, András Kornai

    Abstract: Transformer-based language models such as BERT have outperformed previous models on a large number of English benchmarks, but their evaluation is often limited to English or a small number of well-resourced languages. In this work, we evaluate monolingual, multilingual, and randomly initialized language models from the BERT family on a variety of Uralic languages including Estonian, Finnish, Hunga…

    Submitted 23 November, 2021; v1 submitted 13 September, 2021; originally announced September 2021.

    Comments: Seventh International Workshop for Computational Linguistics of Uralic Languages (IWCLUL 2021)

  9. arXiv:2102.10864  [pdf, other]

    cs.CL

    Subword Pooling Makes a Difference

    Authors: Judit Ács, Ákos Kádár, András Kornai

    Abstract: Contextual word-representations became a standard in modern natural language processing systems. These models use subword tokenization to handle large vocabularies and unknown words. Word-level usage of such systems requires a way of pooling multiple subwords that correspond to a single word. In this paper we investigate how the choice of subword pooling affects the downstream performance on three…

    Submitted 29 March, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

    Journal ref: EACL2021

  10. arXiv:2102.10848  [pdf, other]

    cs.CL

    Evaluating Contextualized Language Models for Hungarian

    Authors: Judit Ács, Dániel Lévai, Dávid Márk Nemeskey, András Kornai

    Abstract: We present an extended comparison of contextualized language models for Hungarian. We compare huBERT, a Hungarian model, against 4 multilingual models including the multilingual BERT model. We evaluate these models through three tasks: morphological probing, POS tagging and NER. We find that huBERT works better than the other models, often by a large margin, particularly near the global optimum (ty…

    Submitted 22 February, 2021; originally announced February 2021.

    Journal ref: Hungarian NLP Conference (MSZNY2021)

  11. arXiv:2101.07343  [pdf, other]

    cs.CL

    Automatic punctuation restoration with BERT models

    Authors: Attila Nagy, Bence Bial, Judit Ács

    Abstract: We present an approach for automatic punctuation restoration with BERT models for English and Hungarian. For English, we conduct our experiments on Ted Talks, a commonly used benchmark for punctuation restoration, while for Hungarian we evaluate our models on the Szeged Treebank dataset. Our best models achieve a macro-averaged $F_1$-score of 79.8 in English and 82.2 in Hungarian. Our code is publ…

    Submitted 18 January, 2021; originally announced January 2021.

    Comments: 11 pages, 6 figures, source code at https://github.com/attilanagy234/neural-punctuator

  12. arXiv:2012.04575  [pdf, other]

    cs.CL

    The Role of Interpretable Patterns in Deep Learning for Morphology

    Authors: Judit Acs, Andras Kornai

    Abstract: We examine the role of character patterns in three tasks: morphological analysis, lemmatization and copy. We use a modified version of the standard sequence-to-sequence model, where the encoder is a pattern matching network. Each pattern scores all possible N character long subwords (substrings) on the source side, and the highest scoring subword's score is used to initialize the decoder as well a…

    Submitted 8 December, 2020; originally announced December 2020.

    Comments: Best paper at the Hungarian NLP conference (MSZNY2020)

    Journal ref: XVI. Magyar Számítógépes Nyelvészeti Konferencia, 2020, page 171-179 (MSZNY2020)
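
Illustrative note on entry 9 ("Subword Pooling Makes a Difference"): the abstract states that word-level use of subword-tokenized models requires pooling the subwords that belong to one word. The sketch below shows what such pooling can look like. It is an assumption-laden illustration, not the authors' code: the function name pool_subwords, the word_ids alignment list, and the three strategies ("first", "last", "mean") are hypothetical choices made here, and the truncated abstract does not say which pooling methods the paper actually compares.

```python
from typing import Dict, List, Sequence


def pool_subwords(
    subword_vectors: Sequence[Sequence[float]],
    word_ids: Sequence[int],
    strategy: str = "mean",
) -> List[List[float]]:
    """Collapse subword vectors into one vector per word.

    word_ids[i] is the index of the word that subword i belongs to
    (a hypothetical alignment format chosen for this sketch).
    """
    if len(subword_vectors) != len(word_ids):
        raise ValueError("need exactly one word id per subword vector")

    # Group subword vectors by the word they belong to, preserving order.
    groups: Dict[int, List[List[float]]] = {}
    for vec, wid in zip(subword_vectors, word_ids):
        groups.setdefault(wid, []).append(list(vec))

    pooled: List[List[float]] = []
    for wid in sorted(groups):
        vecs = groups[wid]
        if strategy == "first":
            pooled.append(vecs[0])   # keep only the first subword
        elif strategy == "last":
            pooled.append(vecs[-1])  # keep only the last subword
        elif strategy == "mean":
            dim = len(vecs[0])
            # element-wise average over all subwords of the word
            pooled.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return pooled


# Toy input: a two-subword word followed by a one-subword word.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_word_ids = [0, 0, 1]
mean_pooled = pool_subwords(toy_vectors, toy_word_ids, "mean")
```

With the toy input, the first word's two subword vectors are merged into a single vector under each strategy, while the single-subword word passes through unchanged. Real systems would apply the same grouping to model hidden states rather than hand-written lists.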