Showing 1–10 of 10 results for author: Virpioja, S

Searching in archive cs.
  1. arXiv:2304.04726

    cs.CL

    Uncertainty-Aware Natural Language Inference with Stochastic Weight Averaging

    Authors: Aarne Talman, Hande Celikkanat, Sami Virpioja, Markus Heinonen, Jörg Tiedemann

    Abstract: This paper introduces Bayesian uncertainty modeling using Stochastic Weight Averaging-Gaussian (SWAG) in Natural Language Understanding (NLU) tasks. We apply the approach to standard tasks in natural language inference (NLI) and demonstrate the effectiveness of the method in terms of prediction accuracy and correlation with human annotation disagreements. We argue that the uncertainty representati…

    Submitted 10 April, 2023; originally announced April 2023.

    Comments: NoDaLiDa 2023 camera ready

  2. arXiv:2212.01936

    cs.CL

    Democratizing Neural Machine Translation with OPUS-MT

    Authors: Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raul Vazquez, Sami Virpioja

    Abstract: This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-opt…

    Submitted 4 July, 2023; v1 submitted 4 December, 2022; originally announced December 2022.

  3. arXiv:2008.08315

    cs.CL

    FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics

    Authors: Katri Leino, Juho Leinonen, Mittul Singh, Sami Virpioja, Mikko Kurimo

    Abstract: Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development; otherwise, comparing the models would require resource-expensive human evaluation. While chatbot challenges have recently managed to provide a plethora of such resources f…

    Submitted 19 August, 2020; originally announced August 2020.

  4. arXiv:2007.11648

    cs.CL

    Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models

    Authors: Mittul Singh, Peter Smit, Sami Virpioja, Mikko Kurimo

    Abstract: Character-based Neural Network Language Models (NNLM) have the advantage of smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both the character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-character NNLM performance by allowing information…

    Submitted 22 July, 2020; originally announced July 2020.

  5. Subword RNNLM Approximations for Out-Of-Vocabulary Keyword Search

    Authors: Mittul Singh, Sami Virpioja, Peter Smit, Mikko Kurimo

    Abstract: In spoken Keyword Search, the query may contain out-of-vocabulary (OOV) words not observed when training the speech recognition system. Using subword language models (LMs) in the first-pass recognition makes it possible to recognize the OOV words, but even the subword n-gram LMs suffer from data sparsity. Recurrent Neural Network (RNN) LMs alleviate the sparsity problems but are not suitable for f…

    Submitted 10 September, 2020; v1 submitted 28 May, 2020; originally announced May 2020.

    Comments: INTERSPEECH 2019

  6. arXiv:2004.04002

    cs.CL

    Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation

    Authors: Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo

    Abstract: There are several approaches for improving neural machine translation for low-resource languages: Monolingual data can be exploited via pretraining or data augmentation; Parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; Subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary.…

    Submitted 9 December, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

    Comments: 24 pages, 12 tables, 7 figures. Accepted (Nov 2020) for publication in the Machine Translation journal Special Issue on Machine Translation for Low-Resource Languages (Springer)

  7. arXiv:2003.03131

    cs.CL

    Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning

    Authors: Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo

    Abstract: Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologically simpler languages. In this paper, we discuss…

    Submitted 6 March, 2020; originally announced March 2020.

    Comments: Accepted for publication in LREC 2020

  8. arXiv:1906.04040

    cs.CL

    The University of Helsinki submissions to the WMT19 news translation task

    Authors: Aarne Talman, Umut Sulubacak, Raúl Vázquez, Yves Scherrer, Sami Virpioja, Alessandro Raganato, Arvi Hurskainen, Jörg Tiedemann

    Abstract: In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English-German, English-Finnish and Finnish-English. This year, we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German, we trained both senten…

    Submitted 10 June, 2019; originally announced June 2019.

    Comments: To appear in WMT19

  9. arXiv:1808.10791

    cs.CL

    Cognate-aware morphological segmentation for multilingual neural translation

    Authors: Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo

    Abstract: This article describes the Aalto University entry to the WMT18 News Translation Shared Task. We participate in the multilingual subtrack with a system trained under the constrained condition to translate from English to both Finnish and Estonian. The system is based on the Transformer model. We focus on improving the consistency of morphological segmentation for words that are similar orthographic…

    Submitted 31 August, 2018; originally announced August 2018.

    Comments: To appear in WMT18

  10. Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies

    Authors: Seppo Enarvi, Peter Smit, Sami Virpioja, Mikko Kurimo

    Abstract: Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, the out-of-vocabulary words are still limiting the usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variat…

    Submitted 29 September, 2017; v1 submitted 13 July, 2017; originally announced July 2017.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2085-2097, November 2017