Showing 1–15 of 15 results for author: Kirov, C

  1. XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

    Authors: Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, et al. (2 additional authors not shown)

    Abstract: Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot;…

    Submitted 24 May, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

  2. arXiv:2303.03457

    cs.CL

    Spelling convention sensitivity in neural language models

    Authors: Elizabeth Nielsen, Christo Kirov, Brian Roark

    Abstract: We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistently one or the other within model-generated strings. In contrast to long-distance dependencies in non-surface underlying structure (e.g., syntax), spelling consis…

    Submitted 6 March, 2023; originally announced March 2023.

    Journal ref: EACL Findings 2023

  3. arXiv:2110.01140

    cs.CL

    Structured abbreviation expansion in context

    Authors: Kyle Gorman, Christo Kirov, Brian Roark, Richard Sproat

    Abstract: Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The problem is related to, but distinct from, spelling correction, in that ad hoc abbreviations are intentional and may involve substantial differences from the orig…

    Submitted 3 October, 2021; originally announced October 2021.

    Comments: Accepted to Findings of EMNLP 2021

  4. arXiv:2007.01176

    cs.CL

    Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

    Authors: Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall

    Abstract: This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and s…

    Submitted 2 July, 2020; originally announced July 2020.

    Comments: Published at LREC 2020

  5. arXiv:2006.11572

    cs.CL

    SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

    Authors: Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, et al. (3 additional authors not shown)

    Abstract: A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low resource…

    Submitted 14 July, 2020; v1 submitted 20 June, 2020; originally announced June 2020.

    Comments: 39 pages, SIGMORPHON

  6. arXiv:2005.05477

    cs.CL

    Neural Polysynthetic Language Modelling

    Authors: Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud'hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimmerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, Zhisong Zhang

    Abstract: Research in natural language processing commonly assumes that approaches that work well for English and other widely-used languages are "language agnostic". In high-resource languages, especially those that are analytic, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types. This assumes that there are limited morphological infle…

    Submitted 13 May, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

  7. The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

    Authors: Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, Mans Hulden

    Abstract: The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low…

    Submitted 25 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Presented at SIGMORPHON 2019

    Journal ref: Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology (2019) 229-244

  8. arXiv:1810.11101

    cs.CL

    UniMorph 2.0: Universal Morphology

    Authors: Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema.…

    Submitted 25 February, 2020; v1 submitted 25 October, 2018; originally announced October 2018.

    Comments: LREC 2018

  9. arXiv:1810.07125

    cs.CL

    The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

    Authors: Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The CoNLL–SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a…

    Submitted 25 February, 2020; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: CoNLL 2018. arXiv admin note: text overlap with arXiv:1706.09031

  10. arXiv:1807.04783

    cs.CL

    Recurrent Neural Networks in Linguistic Theory: Revisiting Pinker and Prince (1988) and the Past Tense Debate

    Authors: Christo Kirov, Ryan Cotterell

    Abstract: Can advances in NLP help advance cognitive modeling? We examine the role of artificial neural networks, the current state of the art in many common NLP tasks, by returning to a classic case study. In 1986, Rumelhart and McClelland famously introduced a neural architecture that learned to transduce English verb stems to their past tense forms. Shortly thereafter, Pinker & Prince (1988) presented a…

    Submitted 26 June, 2019; v1 submitted 12 July, 2018; originally announced July 2018.

    Comments: TACL 2018

  11. arXiv:1807.02747

    cs.CL

    On the Complexity and Typology of Inflectional Morphological Systems

    Authors: Ryan Cotterell, Christo Kirov, Mans Hulden, Jason Eisner

    Abstract: We quantify the linguistic complexity of different languages' morphological systems. We verify that there is an empirical trade-off between paradigm size and irregularity: a language's inflectional paradigms may be either large in size or highly irregular, but never both. Our methodology measures paradigm irregularity as the entropy of the surface realization of a paradigm -- how hard it is to joi…

    Submitted 7 July, 2018; originally announced July 2018.

    Comments: TACL 2018

  12. arXiv:1806.03740

    cs.CL

    Unsupervised Disambiguation of Syncretism in Inflected Lexicons

    Authors: Ryan Cotterell, Christo Kirov, Sabrina J. Mielke, Jason Eisner

    Abstract: Lexical ambiguity makes it difficult to compute various useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model that probabilistically disambiguates word forms. We present such an approach, which employs a neural network to smoothly model a prior distribution over feature bu…

    Submitted 25 February, 2020; v1 submitted 10 June, 2018; originally announced June 2018.

    Comments: Published at NAACL 2018

  13. arXiv:1804.08262

    cs.CL

    On the Diachronic Stability of Irregularity in Inflectional Morphology

    Authors: Ryan Cotterell, Christo Kirov, Mans Hulden, Jason Eisner

    Abstract: Many languages' inflectional morphological systems are replete with irregulars, i.e., words that do not seem to follow standard inflectional rules. In this work, we quantitatively investigate the conditions under which irregulars can survive in a language over the course of time. Using recurrent neural networks to simulate language learners, we test the diachronic relation between frequency of wor…

    Submitted 23 April, 2018; originally announced April 2018.

    Comments: accepted to NAACL 2018; withdrawn in order to add more thorough experiments (coming in next version)

  14. arXiv:1708.09151

    cs.CL

    Paradigm Completion for Derivational Morphology

    Authors: Ryan Cotterell, Ekaterina Vylomova, Huda Khayrallah, Christo Kirov, David Yarowsky

    Abstract: The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We overview the theoretical motivation for a paradigmatic treatment of derivational morphology, and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models, a…

    Submitted 30 August, 2017; originally announced August 2017.

    Comments: EMNLP 2017

  15. arXiv:1706.09031

    cs.CL

    CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages

    Authors: Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by…

    Submitted 4 July, 2017; v1 submitted 27 June, 2017; originally announced June 2017.

    Comments: CoNLL 2017