Showing 1–50 of 50 results for author: Kann, K

  1. arXiv:2310.18502  [pdf, other]

    cs.CL

    On the Automatic Generation and Simplification of Children's Stories

    Authors: Maria Valentini, Jennifer Weber, Jesus Salcido, Téa Wright, Eliana Colunga, Katharina Kann

    Abstract: With recent advances in large language models (LLMs), the concept of automatically generating children's educational materials has become increasingly realistic. Working toward the goal of age-appropriate simplicity in generated educational texts, we first examine the ability of several popular LLMs to generate stories with properly adjusted lexical and readability levels. We find that, in spite o… ▽ More

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 (main conference)

  2. arXiv:2306.06804  [pdf, other]

    cs.CL stat.ML

    Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction

    Authors: Manuel Mager, Rajat Bhatnagar, Graham Neubig, Ngoc Thang Vu, Katharina Kann

    Abstract: Neural models have drastically advanced state of the art for machine translation (MT) between high-resource languages. Traditionally, these models rely on large amounts of training data, but many language pairs lack these resources. However, an important part of the languages in the world do not have this amount of data. Most languages from the Americas are among them, having a limited amount of p… ▽ More

    Submitted 11 June, 2023; originally announced June 2023.

    Comments: Accepted to AmericasNLP 2023

  3. arXiv:2305.19474  [pdf, other]

    cs.CL

    Ethical Considerations for Machine Translation of Indigenous Languages: Giving a Voice to the Speakers

    Authors: Manuel Mager, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

    Abstract: In recent years machine translation has become very successful for high-resource language pairs. This has also sparked new interest in research on the automatic translation of low-resource languages, including Indigenous languages. However, the latter are deeply related to the ethnic and cultural groups that speak (or used to speak) them. The data collection, modeling and deploying machine transla… ▽ More

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL2023 Main Conference

  4. arXiv:2305.16581  [pdf, other]

    cs.CL

    An Investigation of Noise in Morphological Inflection

    Authors: Adam Wiemerslage, Changbing Yang, Garrett Nicolai, Miikka Silfverberg, Katharina Kann

    Abstract: With a growing focus on morphological inflection systems for languages where high-quality data is scarce, training data noise is a serious but so far largely ignored concern. We aim at closing this gap by investigating the types of noise encountered within a pipeline for truly unsupervised morphological paradigm completion and its impact on morphological inflection systems: First, we propose an er… ▽ More

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  5. arXiv:2302.07912  [pdf, other]

    cs.CL

    Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

    Authors: Abteen Ebrahimi, Arya D. McCarthy, Arturo Oncevay, Luis Chiruzzo, John E. Ortega, Gustavo A. Giménez-Lugo, Rolando Coto-Solano, Katharina Kann

    Abstract: Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribu… ▽ More

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: EACL 2023

  6. arXiv:2212.09252  [pdf, other]

    cs.CL cs.LG

    Mind the Knowledge Gap: A Survey of Knowledge-enhanced Dialogue Systems

    Authors: Sagi Shaier, Lawrence Hunter, Katharina Kann

    Abstract: Many dialogue systems (DSs) lack characteristics humans have, such as emotion perception, factuality, and informativeness. Enhancing DSs with knowledge alleviates this problem, but, as many ways of doing so exist, keeping track of all proposed methods is difficult. Here, we present the first survey of knowledge-enhanced DSs. We define three categories of systems - internal, external, and hybrid -…

    Submitted 20 December, 2022; v1 submitted 19 December, 2022; originally announced December 2022.

  7. arXiv:2211.16858  [pdf, other]

    cs.CL

    A Major Obstacle for NLP Research: Let's Talk about Time Allocation!

    Authors: Katharina Kann, Shiran Dudy, Arya D. McCarthy

    Abstract: The field of natural language processing (NLP) has grown over the last few years: conferences have become larger, we have published an incredible amount of papers, and state-of-the-art research has been implemented in a large variety of customer-facing products. However, this paper argues that we have been less successful than we should have been and reflects on where and how the field fails to ta… ▽ More

    Submitted 30 November, 2022; originally announced November 2022.

    Comments: To appear at EMNLP 2022

  8. arXiv:2210.12321  [pdf, other]

    cs.CL

    A Comprehensive Comparison of Neural Networks as Cognitive Models of Inflection

    Authors: Adam Wiemerslage, Shiran Dudy, Katharina Kann

    Abstract: Neural networks have long been at the center of a debate around the cognitive mechanism by which humans process inflectional morphology. This debate has gravitated into NLP by way of the question: Are neural networks a feasible account for human behavior in morphological inflection? We address that question by measuring the correlation between human judgments and neural network probabilities for u… ▽ More

    Submitted 21 October, 2022; originally announced October 2022.

  9. arXiv:2203.10753  [pdf, other]

    cs.CL

    Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

    Authors: Yoshinari Fujinuma, Jordan Boyd-Graber, Katharina Kann

    Abstract: Pretrained multilingual models enable zero-shot learning even for unseen languages, and that performance can be further improved via adaptation prior to finetuning. However, it is unclear how the number of pretraining languages influences a model's zero-shot learning for languages unseen during pretraining. To fill this gap, we ask the following research questions: (1) How does the number of pretr… ▽ More

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: ACL 2022 camera ready

  10. arXiv:2203.08954  [pdf, other]

    cs.CL cs.AI

    BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

    Authors: Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

    Abstract: Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically insp… ▽ More

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: Accepted to Findings of ACL 2022

  11. arXiv:2203.08909  [pdf, other]

    cs.CL

    Morphological Processing of Low-Resource Languages: Where We Are and What's Next

    Authors: Adam Wiemerslage, Miikka Silfverberg, Changbing Yang, Arya D. McCarthy, Garrett Nicolai, Eliana Colunga, Katharina Kann

    Abstract: Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages. Having long been multilingual, the field of computational morphology is increasingly moving towards approaches suitable for languages with minimal or no annotated resources. First, we survey recent… ▽ More

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: Findings of ACL 2022

  12. arXiv:2110.08182  [pdf, other]

    cs.CL cs.CV

    The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color

    Authors: Cory Paik, Stéphane Aroca-Ouellette, Alessandro Roncone, Katharina Kann

    Abstract: Recent work has raised concerns about the inherent limitations of text-only pretraining. In this paper, we first demonstrate that reporting bias, the tendency of people to not state the obvious, is one of the causes of this limitation, and then investigate to what extent multimodal training can mitigate this issue. To accomplish this, we 1) generate the Color Dataset (CoDa), a dataset of human-per… ▽ More

    Submitted 15 October, 2021; originally announced October 2021.

    Comments: Accepted to EMNLP 2021, 9 Pages

  13. arXiv:2108.06598  [pdf, other]

    cs.CL

    Findings of the LoResMT 2021 Shared Task on COVID and Sign Language for Low-resource Languages

    Authors: Atul Kr. Ojha, Chao-Hong Liu, Katharina Kann, John Ortega, Sheetal Shatam, Theodorus Fransen

    Abstract: We present the findings of the LoResMT 2021 shared task which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The organization of this task was conducted as part of the fourth workshop on technologies for machine translation of low resource languages (LoResMT). Parallel corpora is presented and publicly available which includes the following di… ▽ More

    Submitted 18 August, 2021; v1 submitted 14 August, 2021; originally announced August 2021.

    Comments: 10 pages

  14. arXiv:2107.03690  [pdf, other]

    cs.LG

    Proceedings of the First Workshop on Weakly Supervised Learning (WeaSuL)

    Authors: Michael A. Hedderich, Benjamin Roth, Katharina Kann, Barbara Plank, Alex Ratner, Dietrich Klakow

    Abstract: Welcome to WeaSuL 2021, the First Workshop on Weakly Supervised Learning, co-located with ICLR 2021. In this workshop, we want to advance theory, methods and tools for allowing experts to express prior coded knowledge for automatic data annotations that can be used to train arbitrary deep neural networks for prediction. The ICLR 2021 Workshop on Weak Supervision aims at advancing methods that help… ▽ More

    Submitted 8 July, 2021; originally announced July 2021.

  15. arXiv:2106.06875  [pdf, other]

    cs.CL

    Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data

    Authors: Rajat Bhatnagar, Ananya Ganesh, Katharina Kann

    Abstract: High-performing machine translation (MT) systems can help overcome language barriers while making it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive. Here, we present a data collection strategy for MT which, in co… ▽ More

    Submitted 12 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure, ACL-IJCNLP 2021 submission, Natural Language Processing, Data Collection, Monolingual Speakers, Machine Translation, GIFs, Images

    ACM Class: I.2.7

  16. arXiv:2106.05249  [pdf, other]

    cs.CL

    What Would a Teacher Do? Predicting Future Talk Moves

    Authors: Ananya Ganesh, Martha Palmer, Katharina Kann

    Abstract: Recent advances in natural language processing (NLP) have the ability to transform how classroom learning takes place. Combined with the increasing integration of technology in today's classrooms, NLP systems leveraging question answering and dialog processing techniques can serve as private tutors or participants in classroom discussions to increase student engagement and learning. To progress to… ▽ More

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: 13 pages, 3 figures; To appear in Findings of ACL 2021

  17. arXiv:2106.03634  [pdf, other]

    cs.CL

    PROST: Physical Reasoning of Objects through Space and Time

    Authors: Stéphane Aroca-Ouellette, Cory Paik, Alessandro Roncone, Katharina Kann

    Abstract: We present a new probing dataset named PROST: Physical Reasoning about Objects Through Space and Time. This dataset contains 18,736 multiple-choice questions made from 14 manually curated templates, covering 10 physical reasoning concepts. All questions are designed to probe both causal and masked language models in a zero-shot setting. We conduct an extensive analysis which demonstrates that stat… ▽ More

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL-Findings 2021, 9 Pages

  18. arXiv:2106.02124  [pdf, other]

    cs.CL

    How to Adapt Your Pretrained Multilingual Model to 1600 Languages

    Authors: Abteen Ebrahimi, Katharina Kann

    Abstract: Pretrained multilingual models (PMMs) enable zero-shot learning via cross-lingual transfer, performing best for languages seen during pretraining. While methods exist to improve performance for unseen languages, they have almost exclusively been evaluated using amounts of raw text only available for a small fraction of the world's languages. In this paper, we evaluate the performance of existing m… ▽ More

    Submitted 3 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL 2021

  19. arXiv:2104.08726  [pdf, other]

    cs.CL

    AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages

    Authors: Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios, Ivan Meza-Ruiz, Gustavo A. Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando Coto-Solano, Ngoc Thang Vu, Katharina Kann

    Abstract: Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen languages has largely been limited to low-level, syntactic tasks, and it remains unclear if zero-shot learning of high-level, semantic tasks is possible for unseen languages. To explore this question, we… ▽ More

    Submitted 16 March, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: Accepted to ACL 2022

  20. arXiv:2101.11131  [pdf, other]

    cs.CL

    CLiMP: A Benchmark for Chinese Language Model Evaluation

    Authors: Beilei Xiang, Changbing Yang, Yu Li, Alex Warstadt, Katharina Kann

    Abstract: Linguistically informed analyses of language models (LMs) contribute to the understanding and improvement of these models. Here, we introduce the corpus of Chinese linguistic minimal pairs (CLiMP), which can be used to investigate what knowledge Chinese LMs acquire. CLiMP consists of sets of 1,000 minimal pairs (MPs) for 16 syntactic contrasts in Mandarin, covering 9 major Mandarin linguistic phen… ▽ More

    Submitted 26 January, 2021; originally announced January 2021.

  21. arXiv:2101.10565  [pdf, other]

    cs.CL

    Coloring the Black Box: What Synesthesia Tells Us about Character Embeddings

    Authors: Katharina Kann, Mauro M. Monsalve-Mercado

    Abstract: In contrast to their word- or sentence-level counterparts, character embeddings are still poorly understood. We aim at closing this gap with an in-depth study of English character embeddings. For this, we use resources from research on grapheme-color synesthesia -- a neuropsychological phenomenon where letters are associated with colors, which give us insight into which characters are similar for… ▽ More

    Submitted 26 January, 2021; originally announced January 2021.

    Comments: EACL 2021

  22. arXiv:2010.02804  [pdf, other]

    cs.CL cs.AI stat.ML

    Tackling the Low-resource Challenge for Canonical Segmentation

    Authors: Manuel Mager, Özlem Çetinoğlu, Katharina Kann

    Abstract: Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua.… ▽ More

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: Accepted to EMNLP 2020

  23. arXiv:2010.02239  [pdf, other]

    cs.CL

    Acrostic Poem Generation

    Authors: Rajat Agarwal, Katharina Kann

    Abstract: We propose a new task in the area of computational creativity: acrostic poem generation in English. Acrostic poems are poems that contain a hidden message; typically, the first letter of each line spells out a word or short phrase. We define the task as a generation task with multiple constraints: given an input word, 1) the initial letters of each line should spell out the provided word, 2) the p… ▽ More

    Submitted 5 October, 2020; originally announced October 2020.

    Comments: EMNLP 2020

  24. arXiv:2006.11830  [pdf, ps, other]

    cs.CL cs.LG

    The NYU-CUBoulder Systems for SIGMORPHON 2020 Task 0 and Task 2

    Authors: Assaf Singer, Katharina Kann

    Abstract: We describe the NYU-CUBoulder systems for the SIGMORPHON 2020 Task 0 on typologically diverse morphological inflection and Task 2 on unsupervised morphological paradigm completion. The former consists of generating morphological inflections from a lemma and a set of morphosyntactic features describing the target form. The latter requires generating entire paradigms for a set of given lemmas from r… ▽ More

    Submitted 21 June, 2020; originally announced June 2020.

    Comments: 8 pages, 2 figures

    ACM Class: I.2.7; I.2.6

  25. arXiv:2005.13756  [pdf, other]

    cs.CL

    The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion

    Authors: Katharina Kann, Arya McCarthy, Garrett Nicolai, Mans Hulden

    Abstract: In this paper, we describe the findings of the SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion (SIGMORPHON 2020 Task 2), a novel task in the field of inflectional morphology. Participants were asked to submit systems which take raw text and a list of lemmas as input, and output all inflected forms, i.e., the entire morphological paradigm, of each lemma. In order to si… ▽ More

    Submitted 27 May, 2020; originally announced May 2020.

    Comments: SIGMORPHON 2020

  26. arXiv:2005.13455  [pdf, other]

    cs.CL

    Self-Training for Unsupervised Parsing with PRPN

    Authors: Anhad Mohananey, Katharina Kann, Samuel R. Bowman

    Abstract: Neural unsupervised parsing (UP) models learn to parse without access to syntactic annotations, while being optimized for another task like language modeling. In this work, we propose self-training for neural UP models: we leverage aggregated annotations predicted by copies of our model as supervision for future copies. To be able to use our model's predictions during training, we extend a recent… ▽ More

    Submitted 27 May, 2020; originally announced May 2020.

    Comments: Accepted for publication at the 16th International Conference on Parsing Technologies (IWPT), 2020

  27. arXiv:2005.13013  [pdf, other]

    cs.CL

    English Intermediate-Task Training Improves Zero-Shot Cross-Lingual Transfer Too

    Authors: Jason Phang, Iacer Calixto, Phu Mon Htut, Yada Pruksachatkun, Haokun Liu, Clara Vania, Katharina Kann, Samuel R. Bowman

    Abstract: Intermediate-task training---fine-tuning a pretrained model on an intermediate task before fine-tuning again on the target task---often improves model performance substantially on language understanding tasks in monolingual English settings. We investigate whether English intermediate-task training is still helpful on non-English target tasks. Using nine intermediate language-understanding tasks,… ▽ More

    Submitted 30 September, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

  28. arXiv:2005.12411  [pdf, other]

    cs.CL cs.LG

    The IMS-CUBoulder System for the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion

    Authors: Manuel Mager, Katharina Kann

    Abstract: In this paper, we present the systems of the University of Stuttgart IMS and the University of Colorado Boulder (IMS-CUBoulder) for SIGMORPHON 2020 Task 2 on unsupervised morphological paradigm completion (Kann et al., 2020). The task consists of generating the morphological paradigms of a set of lemmas, given only the lemmas themselves and unlabeled text. Our proposed system is a modified version… ▽ More

    Submitted 25 May, 2020; originally announced May 2020.

  29. arXiv:2005.00970  [pdf, other]

    cs.CL

    Unsupervised Morphological Paradigm Completion

    Authors: Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, Katharina Kann

    Abstract: We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or… ▽ More

    Submitted 20 May, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted by ACL 2020

  30. arXiv:2005.00628  [pdf, other]

    cs.CL

    Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

    Authors: Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, Samuel R. Bowman

    Abstract: While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large… ▽ More

    Submitted 9 May, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: ACL 2020

  31. arXiv:2004.13305  [pdf, ps, other]

    cs.CL

    Weakly Supervised POS Taggers Perform Poorly on Truly Low-Resource Languages

    Authors: Katharina Kann, Ophélie Lacroix, Anders Søgaard

    Abstract: Part-of-speech (POS) taggers for low-resource languages which are exclusively based on various forms of weak supervision - e.g., cross-lingual transfer, type-level supervision, or a combination thereof - have been reported to perform almost as well as supervised ones. However, weakly supervised POS taggers are commonly only evaluated on languages that are very different from truly low-resource lan… ▽ More

    Submitted 28 April, 2020; originally announced April 2020.

    Comments: AAAI 2020

  32. arXiv:2004.13304  [pdf, other]

    cs.CL

    Learning to Learn Morphological Inflection for Resource-Poor Languages

    Authors: Katharina Kann, Samuel R. Bowman, Kyunghyun Cho

    Abstract: We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem. Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters that can serve as a strong initialization point for fine-tuning on a resource-poor target language. Experiments…

    Submitted 28 April, 2020; originally announced April 2020.

    Comments: AAAI 2020

  33. arXiv:1910.09729  [pdf, other]

    cs.CL

    Grammatical Gender, Neo-Whorfianism, and Word Embeddings: A Data-Driven Approach to Linguistic Relativity

    Authors: Katharina Kann

    Abstract: The relation between language and thought has occupied linguists for at least a century. Neo-Whorfianism, a weak version of the controversial Sapir-Whorf hypothesis, holds that our thoughts are subtly influenced by the grammatical structures of our native language. One area of investigation in this vein focuses on how the grammatical gender of nouns affects the way we perceive the corresponding ob… ▽ More

    Submitted 21 October, 2019; originally announced October 2019.

  34. arXiv:1910.05456  [pdf, ps, other]

    cs.CL

    Acquisition of Inflectional Morphology in Artificial Neural Networks With Prior Knowledge

    Authors: Katharina Kann

    Abstract: How does knowledge of one language's morphology influence learning of inflection rules in a second one? In order to investigate this question in artificial neural network models, we perform experiments with a sequence-to-sequence architecture, which we train on different combinations of eight source and three target languages. A detailed analysis of the model outputs suggests the following conclus… ▽ More

    Submitted 11 October, 2019; originally announced October 2019.

    Comments: SCiL 2020

  35. arXiv:1909.01522  [pdf, other]

    cs.CL

    Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set

    Authors: Katharina Kann, Kyunghyun Cho, Samuel R. Bowman

    Abstract: Development sets are impractical to obtain for real low-resource languages, since using all available data for training is often more effective. However, development sets are widely used in research papers that purport to deal with low-resource natural language processing (NLP). Here, we aim to answer the following questions: Does using a development set for early stopping in the low-resource sett…

    Submitted 14 September, 2019; v1 submitted 3 September, 2019; originally announced September 2019.

    Comments: EMNLP 2019

  36. arXiv:1908.06136  [pdf, other]

    cs.CL

    Transductive Auxiliary Task Self-Training for Neural Multi-Task Models

    Authors: Johannes Bjerva, Katharina Kann, Isabelle Augenstein

    Abstract: Multi-task learning and self-training are two common ways to improve a machine learning model's performance in settings with limited training data. Drawing heavily on ideas from those two approaches, we suggest transductive auxiliary task self-training: training a multi-task model on (i) a combination of main and auxiliary task training data, and (ii) test instances with auxiliary task labels whic… ▽ More

    Submitted 22 September, 2019; v1 submitted 16 August, 2019; originally announced August 2019.

    Comments: Camera ready version, to appear at DeepLo 2019 (EMNLP workshop)

  37. arXiv:1906.03608  [pdf, other]

    cs.CL cs.LG

    Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings

    Authors: Yadollah Yaghoobzadeh, Katharina Kann, Timothy J. Hazen, Eneko Agirre, Hinrich Schütze

    Abstract: Word embeddings typically represent different meanings of a word in a single conflated vector. Empirical analysis of embeddings of ambiguous words is currently limited by the small size of manually annotated resources and by the fact that word senses are treated as unrelated individual concepts. We present a large dataset based on manual Wikipedia annotations and word senses, where word senses fro… ▽ More

    Submitted 9 June, 2019; originally announced June 2019.

    Comments: 14 pages, Accepted at ACL 2019

  38. arXiv:1904.01989  [pdf, other]

    cs.CL cs.LG

    Subword-Level Language Identification for Intra-Word Code-Switching

    Authors: Manuel Mager, Özlem Çetinoğlu, Katharina Kann

    Abstract: Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the… ▽ More

    Submitted 3 April, 2019; originally announced April 2019.

    Comments: NAACL-HLT 2019

  39. arXiv:1811.10773  [pdf, other]

    cs.CL

    Verb Argument Structure Alternations in Word and Sentence Embeddings

    Authors: Katharina Kann, Alex Warstadt, Adina Williams, Samuel R. Bowman

    Abstract: Verbs occur in different syntactic environments, or frames. We investigate whether artificial neural networks encode grammatical distinctions necessary for inferring the idiosyncratic frame-selectional properties of verbs. We introduce five datasets, collectively called FAVA, containing in aggregate nearly 10k sentences labeled for grammatical acceptability, illustrating different verbal argument… ▽ More

    Submitted 26 November, 2018; originally announced November 2018.

    Comments: Accepted to SCiL 2019

  40. arXiv:1810.07125  [pdf, other]

    cs.CL

    The CoNLL--SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

    Authors: Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

    Abstract: The CoNLL--SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a… ▽ More

    Submitted 25 February, 2020; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: CoNLL 2018. arXiv admin note: text overlap with arXiv:1706.09031

  41. arXiv:1809.08733  [pdf, other]

    cs.CL

    Neural Transductive Learning and Beyond: Morphological Generation in the Minimal-Resource Setting

    Authors: Katharina Kann, Hinrich Schütze

    Abstract: Neural state-of-the-art sequence-to-sequence (seq2seq) models often do not perform well for small training sets. We address paradigm completion, the morphological task of, given a partial paradigm, generating all missing forms. We propose two new methods for the minimal-resource setting: (i) Paradigm transduction: Since we assume only few paradigms available for training, neural seq2seq models are… ▽ More

    Submitted 9 May, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018

  42. arXiv:1809.08731  [pdf, ps, other]

    cs.CL

    Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!

    Authors: Katharina Kann, Sascha Rothe, Katja Filippova

    Abstract: Motivated by recent findings on the probabilistic modeling of acceptability judgments, we propose syntactic log-odds ratio (SLOR), a normalized language model score, as a metric for referenceless fluency evaluation of natural language generation output at the sentence level. We further introduce WPSLOR, a novel WordPiece-based version, which harnesses a more compact language model. Even though wor… ▽ More

    Submitted 23 September, 2018; originally announced September 2018.

    Comments: Accepted to CoNLL 2018

  43. arXiv:1807.07186  [pdf, other]

    cs.CL cs.AI

    Evaluating Word Embeddings in Multi-label Classification Using Fine-grained Name Typing

    Authors: Yadollah Yaghoobzadeh, Katharina Kann, Hinrich Schütze

    Abstract: Embedding models typically associate each word with a single real-valued vector, representing its different properties. Evaluation methods, therefore, need to analyze the accuracy and completeness of these properties in embeddings. This requires fine-grained analysis of embedding subspaces. Multi-label classification is an appropriate way to do so. We propose a new evaluation method for word embed… ▽ More

    Submitted 18 July, 2018; originally announced July 2018.

    Comments: 6 pages, The 3rd Workshop on Representation Learning for NLP (RepL4NLP @ ACL2018)

  44. arXiv:1807.00286  [pdf, ps, other]

    cs.CL

    Lost in Translation: Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages

    Authors: Manuel Mager, Elisabeth Mager, Alfonso Medina-Urrea, Ivan Meza, Katharina Kann

    Abstract: Machine translation from polysynthetic to fusional languages is a challenging task, which gets further complicated by the limited amount of parallel text available. Thus, translation performance is far from the state of the art for high-resource and more intensively studied language pairs. To shed light on the phenomena which hamper automatic translation to and from polysynthetic languages, we stu…

    Submitted 1 July, 2018; originally announced July 2018.

    Comments: To appear in "All Together Now? Computational Modeling of Polysynthetic Languages" Workshop, at COLING 2018

  45. arXiv:1804.06024  [pdf, ps, other]

    cs.CL

    Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages

    Authors: Katharina Kann, Manuel Mager, Ivan Meza-Ruiz, Hinrich Schütze

    Abstract: Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performan…

    Submitted 16 April, 2018; originally announced April 2018.

    Comments: Long Paper, 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  46. arXiv:1705.06106  [pdf, other]

    cs.CL

    Unlabeled Data for Morphological Generation With Character-Based Sequence-to-Sequence Models

    Authors: Katharina Kann, Hinrich Schütze

    Abstract: We present a semi-supervised way of training a character-based encoder-decoder recurrent neural network for morphological reinflection, the task of generating one inflected word form from another. This is achieved by using unlabeled tokens or random strings as training data for an autoencoding task, adapting a network for morphological reinflection, and performing multi-task training. We thus use…

    Submitted 21 July, 2017; v1 submitted 17 May, 2017; originally announced May 2017.

    Comments: Accepted at SCLeM 2017

  47. arXiv:1704.00052  [pdf, other]

    cs.CL

    One-Shot Neural Cross-Lingual Transfer for Paradigm Completion

    Authors: Katharina Kann, Ryan Cotterell, Hinrich Schütze

    Abstract: We present a novel cross-lingual transfer method for paradigm completion, the task of mapping a lemma to its inflected forms, using a neural encoder-decoder model, the state of the art for the monolingual task. We use labeled data from a high-resource language to increase performance on a low-resource language. In experiments on 21 language pairs from four different language families, we obtain up…

    Submitted 31 March, 2017; originally announced April 2017.

    Comments: Accepted at ACL 2017

  48. arXiv:1702.01923  [pdf, other]

    cs.CL

    Comparative Study of CNN and RNN for Natural Language Processing

    Authors: Wenpeng Yin, Katharina Kann, Mo Yu, Hinrich Schütze

    Abstract: Deep neural networks (DNN) have revolutionized the field of natural language processing (NLP). Convolutional neural network (CNN) and recurrent neural network (RNN), the two main types of DNN architectures, are widely explored to handle various NLP tasks. CNN is supposed to be good at extracting position-invariant features and RNN at modeling units in sequence. The state of the art on many NLP tas…

    Submitted 7 February, 2017; originally announced February 2017.

    Comments: 7 pages, 11 figures

  49. arXiv:1612.06027  [pdf, other]

    cs.CL

    Neural Multi-Source Morphological Reinflection

    Authors: Katharina Kann, Ryan Cotterell, Hinrich Schütze

    Abstract: We explore the task of multi-source morphological reinflection, which generalizes the standard, single-source version. The input consists of (i) a target tag and (ii) multiple pairs of source form and source tag for a lemma. The motivation is that it is beneficial to have access to more than one source form since different source forms can provide complementary information, e.g., different stems.…

    Submitted 22 January, 2017; v1 submitted 18 December, 2016; originally announced December 2016.

    Comments: Accepted at EACL 2017. Camera Ready Version

  50. arXiv:1606.00589  [pdf, other]

    cs.CL

    Single-Model Encoder-Decoder with Explicit Morphological Representation for Reinflection

    Authors: Katharina Kann, Hinrich Schütze

    Abstract: Morphological reinflection is the task of generating a target form given a source form, a source tag and a target tag. We propose a new way of modeling this task with neural encoder-decoder models. Our approach reduces the amount of required training data for this architecture and achieves state-of-the-art results, making encoder-decoder models applicable to morphological reinflection even for low…

    Submitted 2 June, 2016; originally announced June 2016.

    Comments: Accepted at ACL 2016