Showing 1–16 of 16 results for author: Silfverberg, M

  1. arXiv:2406.11085

    cs.CL

    Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing

    Authors: Changbing Yang, Garrett Nicolai, Miikka Silfverberg

    Abstract: In this paper, we address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise. We supplement models with translations at both the token and sentence level as well as leverage the extensive linguistic capability of modern LLMs. Our enhancements lead to an average absolute improvement of 5%-points in word-lev…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2403.08189

  2. arXiv:2403.08189

    cs.CL

    Embedded Translations for Low-resource Automated Glossing

    Authors: Changbing Yang, Garrett Nicolai, Miikka Silfverberg

    Abstract: We investigate automatic interlinear glossing in low-resource settings. We augment a hard-attentional neural model with embedded translation information extracted from interlinear glossed text. After encoding these translations using large language models, specifically BERT and T5, we introduce a character-level decoder for generating glossed output. Aided by these enhancements, our model demonstr…

    Submitted 12 March, 2024; originally announced March 2024.

  3. arXiv:2305.16581

    cs.CL

    An Investigation of Noise in Morphological Inflection

    Authors: Adam Wiemerslage, Changbing Yang, Garrett Nicolai, Miikka Silfverberg, Katharina Kann

    Abstract: With a growing focus on morphological inflection systems for languages where high-quality data is scarce, training data noise is a serious but so far largely ignored concern. We aim at closing this gap by investigating the types of noise encountered within a pipeline for truly unsupervised morphological paradigm completion and its impact on morphological inflection systems: First, we propose an er…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  4. arXiv:2305.13658

    cs.CL

    Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

    Authors: Farhan Samir, Miikka Silfverberg

    Abstract: Data augmentation techniques are widely used in low-resource automatic morphological inflection to overcome data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the prominent data augmentation strategy StemCorrupt (Silfverberg et al., 2017; Anastasopoulos and Neubig, 2019), a method that genera…

    Submitted 23 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 camera-ready

  5. arXiv:2209.09742

    cs.CL

    Yet Another Format of Universal Dependencies for Korean

    Authors: Yige Chen, Eunkyul Leah Jo, Yundong Yao, KyungTae Lim, Miikka Silfverberg, Francis M. Tyers, Jungyeul Park

    Abstract: In this study, we propose a morpheme-based scheme for Korean dependency parsing and adopt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automat…

    Submitted 20 September, 2022; originally announced September 2022.

    Comments: COLING2022, Poster

  6. arXiv:2205.03608

    cs.CL

    UniMorph 4.0: Universal Morphology

    Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, et al. (71 additional authors not shown)

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…

    Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

    Comments: LREC 2022; The first two authors made equal contributions

  7. arXiv:2203.09632

    cs.CL

    Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages

    Authors: Clarissa Forbes, Farhan Samir, Bruce Harold Oliver, Changbing Yang, Edith Coates, Garrett Nicolai, Miikka Silfverberg

    Abstract: Recent progress in NLP is driven by pretrained models leveraging massive datasets and has predominantly benefited the world's political and economic superpowers. Technologically underserved languages are left behind because they lack such resources. Hundreds of underserved languages, nevertheless, have available data sources in the form of interlinear glossed text (IGT) from language documentation…

    Submitted 17 March, 2022; originally announced March 2022.

  8. arXiv:2203.08909

    cs.CL

    Morphological Processing of Low-Resource Languages: Where We Are and What's Next

    Authors: Adam Wiemerslage, Miikka Silfverberg, Changbing Yang, Arya D. McCarthy, Garrett Nicolai, Eliana Colunga, Katharina Kann

    Abstract: Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages. Having long been multilingual, the field of computational morphology is increasingly moving towards approaches suitable for languages with minimal or no annotated resources. First, we survey recent…

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: Findings of ACL 2022

  9. arXiv:2104.00789

    cs.CL

    Do RNN States Encode Abstract Phonological Processes?

    Authors: Miikka Silfverberg, Francis Tyers, Garrett Nicolai, Mans Hulden

    Abstract: Sequence-to-sequence models have delivered impressive results in word formation tasks such as morphological inflection, often learning to model subtle morphophonological details with limited training data. Despite the performance, the opacity of neural models makes it difficult to determine whether complex generalizations are learned, or whether a kind of separate rote memorization of each morphop…

    Submitted 1 April, 2021; originally announced April 2021.

  10. arXiv:2103.04225

    cs.CL cs.AI cs.LG

    Translating the Unseen? Yoruba-English MT in Low-Resource, Morphologically-Unmarked Settings

    Authors: Ife Adebara, Muhammad Abdul-Mageed, Miikka Silfverberg

    Abstract: Translating between languages where certain features are marked morphologically in one but absent or marked contextually in the other is an important test case for machine translation. When translating into English which marks (in)definiteness morphologically, from Yorùbá which uses bare nouns but marks these features contextually, ambiguities arise. In this work, we perform fine-grained analysis…

    Submitted 6 April, 2021; v1 submitted 6 March, 2021; originally announced March 2021.

    Comments: Accepted at AfricanNLP @ EACL 2021

  11. arXiv:2006.13343

    cs.CL

    One Model to Pronounce Them All: Multilingual Grapheme-to-Phoneme Conversion With a Transformer Ensemble

    Authors: Kaili Vesik, Muhammad Abdul-Mageed, Miikka Silfverberg

    Abstract: The task of grapheme-to-phoneme (G2P) conversion is important for both speech recognition and synthesis. Similar to other speech and language processing tasks, in a scenario where only small-sized training data are available, learning G2P models is challenging. We describe a simple approach of exploiting model ensembles, based on multilingual Transformers and self-training, to develop a highly eff…

    Submitted 23 June, 2020; originally announced June 2020.

    Comments: 7 pages, submitted to SIGMORPHON 2020 Shared Task 1

  12. arXiv:2006.11572

    cs.CL

    SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

    Authors: Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, et al. (3 additional authors not shown)

    Abstract: A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low resource…

    Submitted 14 July, 2020; v1 submitted 20 June, 2020; originally announced June 2020.

    Comments: 39 pages, SIGMORPHON

  13. The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

    Authors: Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, Mans Hulden

    Abstract: The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low…

    Submitted 25 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Presented at SIGMORPHON 2019

    Journal ref: Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (2019) 229-244

  14. A Finnish News Corpus for Named Entity Recognition

    Authors: Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, Krister Lindén

    Abstract: We present a corpus of Finnish news articles with a manually prepared named entity annotation. The corpus consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source. The corpus is available for research purposes. We present…

    Submitted 12 August, 2019; originally announced August 2019.

  15. arXiv:1810.07125

    cs.CL

    The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

    Authors: Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The CoNLL–SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a…

    Submitted 25 February, 2020; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: CoNLL 2018. arXiv admin note: text overlap with arXiv:1706.09031

  16. Marrying Universal Dependencies and Universal Morphology

    Authors: Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky

    Abstract: The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. Wi…

    Submitted 15 October, 2018; originally announced October 2018.

    Comments: UDW18

    Journal ref: Proceedings of the Second Workshop on Universal Dependencies (2018) 91-101