Skip to main content

Showing 1–15 of 15 results for author: Nicolai, G

.
  1. arXiv:2406.11085  [pdf, other

    cs.CL

    Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing

    Authors: Changbing Yang, Garrett Nicolai, Miikka Silfverberg

    Abstract: In this paper, we address the data scarcity problem in automatic data-driven glossing for low-resource languages by coordinating multiple sources of linguistic expertise. We supplement models with translations at both the token and sentence level as well as leverage the extensive linguistic capability of modern LLMs. Our enhancements lead to an average absolute improvement of 5%-points in word-lev… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2403.08189

  2. arXiv:2403.08189  [pdf, other

    cs.CL

    Embedded Translations for Low-resource Automated Glossing

    Authors: Changbing Yang, Garrett Nicolai, Miikka Silfverberg

    Abstract: We investigate automatic interlinear glossing in low-resource settings. We augment a hard-attentional neural model with embedded translation information extracted from interlinear glossed text. After encoding these translations using large language models, specifically BERT and T5, we introduce a character-level decoder for generating glossed output. Aided by these enhancements, our model demonstr… ▽ More

    Submitted 12 March, 2024; originally announced March 2024.

  3. arXiv:2307.05779  [pdf, other

    cs.CL

    Neural Machine Translation Data Generation and Augmentation using ChatGPT

    Authors: Wayne Yang, Garrett Nicolai

    Abstract: Neural models have revolutionized the field of machine translation, but creating parallel corpora is expensive and time-consuming. We investigate an alternative to manual parallel corpora - hallucinated parallel corpora created by generative language models. Although these models are themselves trained on parallel data, they can leverage a multilingual vector space to create data, and may be able… ▽ More

    Submitted 11 July, 2023; originally announced July 2023.

    Comments: 8 Pages, 4 Figures, 4 Tables

  4. arXiv:2305.16581  [pdf, other

    cs.CL

    An Investigation of Noise in Morphological Inflection

    Authors: Adam Wiemerslage, Changbing Yang, Garrett Nicolai, Miikka Silfverberg, Katharina Kann

    Abstract: With a growing focus on morphological inflection systems for languages where high-quality data is scarce, training data noise is a serious but so far largely ignored concern. We aim at closing this gap by investigating the types of noise encountered within a pipeline for truly unsupervised morphological paradigm completion and its impact on morphological inflection systems: First, we propose an er… ▽ More

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  5. arXiv:2205.03608  [pdf, other

    cs.CL

    UniMorph 4.0: Universal Morphology

    Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay , et al. (71 additional authors not shown)

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa… ▽ More

    Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

    Comments: LREC 2022; The first two authors made equal contributions

  6. arXiv:2203.09632  [pdf, other

    cs.CL

    Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages

    Authors: Clarissa Forbes, Farhan Samir, Bruce Harold Oliver, Changbing Yang, Edith Coates, Garrett Nicolai, Miikka Silfverberg

    Abstract: Recent progress in NLP is driven by pretrained models leveraging massive datasets and has predominantly benefited the world's political and economic superpowers. Technologically underserved languages are left behind because they lack such resources. Hundreds of underserved languages, nevertheless, have available data sources in the form of interlinear glossed text (IGT) from language documentation… ▽ More

    Submitted 17 March, 2022; originally announced March 2022.

  7. arXiv:2203.08909  [pdf, other

    cs.CL

    Morphological Processing of Low-Resource Languages: Where We Are and What's Next

    Authors: Adam Wiemerslage, Miikka Silfverberg, Changbing Yang, Arya D. McCarthy, Garrett Nicolai, Eliana Colunga, Katharina Kann

    Abstract: Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages. Having long been multilingual, the field of computational morphology is increasingly moving towards approaches suitable for languages with minimal or no annotated resources. First, we survey recent… ▽ More

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: Findings of ACL 2022

  8. arXiv:2104.00789  [pdf, other

    cs.CL

    Do RNN States Encode Abstract Phonological Processes?

    Authors: Miikka Silfverberg, Francis Tyers, Garrett Nicolai, Mans Hulden

    Abstract: Sequence-to-sequence models have delivered impressive results in word formation tasks such as morphological inflection, often learning to model subtle morphophonological details with limited training data. Despite the performance, the opacity of neural models makes it difficult to determine whether complex generalizations are learned, or whether a kind of separate rote memorization of each morphop… ▽ More

    Submitted 1 April, 2021; originally announced April 2021.

  9. arXiv:2006.11572  [pdf, other

    cs.CL

    SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

    Authors: Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff , et al. (3 additional authors not shown)

    Abstract: A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low resource… ▽ More

    Submitted 14 July, 2020; v1 submitted 20 June, 2020; originally announced June 2020.

    Comments: 39 pages, SIGMORPHON

  10. arXiv:2005.13756  [pdf, other

    cs.CL

    The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion

    Authors: Katharina Kann, Arya McCarthy, Garrett Nicolai, Mans Hulden

    Abstract: In this paper, we describe the findings of the SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion (SIGMORPHON 2020 Task 2), a novel task in the field of inflectional morphology. Participants were asked to submit systems which take raw text and a list of lemmas as input, and output all inflected forms, i.e., the entire morphological paradigm, of each lemma. In order to si… ▽ More

    Submitted 27 May, 2020; originally announced May 2020.

    Comments: SIGMORPHON 2020

  11. arXiv:2005.00187  [pdf, other

    cs.CL

    Cross-Linguistic Syntactic Evaluation of Word Prediction Models

    Authors: Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, Tal Linzen

    Abstract: A range of studies have concluded that neural word prediction models can distinguish grammatical from ungrammatical sentences with high accuracy. However, these studies are based primarily on monolingual evidence from English. To investigate how these models' ability to learn syntax varies by language, we introduce CLAMS (Cross-Linguistic Assessment of Models on Syntax), a syntactic evaluation sui… ▽ More

    Submitted 21 May, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

    Comments: Accepted for presentation at ACL 2020

  12. arXiv:1910.12299  [pdf, other

    cs.CL cs.SD eess.AS

    Induced Inflection-Set Keyword Search in Speech

    Authors: Oliver Adams, Matthew Wiesner, Jan Trmal, Garrett Nicolai, David Yarowsky

    Abstract: We investigate the problem of searching for a lexeme-set in speech by searching for its inflectional variants. Experimental results indicate how lexeme-set search performance changes with the number of hypothesized inflections, while ablation experiments highlight the relative importance of different components in the lexeme-set search pipeline and the value of using curated inflectional paradigms… ▽ More

    Submitted 21 May, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

    Comments: To appear in SIGMORPHON 2020

  13. The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

    Authors: Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, Mans Hulden

    Abstract: The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low… ▽ More

    Submitted 25 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Presented at SIGMORPHON 2019

    Journal ref: Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (2019) 229-244

  14. arXiv:1810.07125  [pdf, other

    cs.CL

    The CoNLL--SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

    Authors: Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The CoNLL--SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a… ▽ More

    Submitted 25 February, 2020; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: CoNLL 2018. arXiv admin note: text overlap with arXiv:1706.09031

  15. arXiv:1809.07182  [pdf, other

    cs.CL

    String Transduction with Target Language Models and Insertion Handling

    Authors: Garrett Nicolai, Saeed Najafi, Grzegorz Kondrak

    Abstract: Many character-level tasks can be framed as sequence-to-sequence transduction, where the target is a word from a natural language. We show that leveraging target language models derived from unannotated target corpora, combined with a precise alignment of the training data, yields state-of-the art results on cognate projection, inflection generation, and phoneme-to-grapheme conversion.

    Submitted 19 September, 2018; originally announced September 2018.

    Comments: 8 pages + 1 page Appendix, 4 figures and 8 tables, plus an additional 1 figure and 2 tables in appendix; to appear at SIGMORPHON 2018, October, 2018