Showing 1–35 of 35 results for author: Mortensen, D R

Searching in archive cs.
  1. arXiv:2406.16521  [pdf, other]

    cs.CL cs.AI

    Carrot and Stick: Inducing Self-Motivation with Positive & Negative Feedback

    Authors: Jimin Sohn, Jeihee Cho, Junyong Lee, Songmu Heo, Ji-Eun Han, David R. Mortensen

    Abstract: Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace. Previous work, including sentiment transfer and positive reframing, has focused on the positive side of language. However, self-motivation that drives people to reach their goals has not yet been studied from a computational perspective. Moreover, negative f…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 10 pages, 8 figures

  2. arXiv:2406.16030  [pdf, other]

    cs.CL cs.AI

    Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages

    Authors: Jimin Sohn, Haeji Jung, Alex Cheng, Jooeon Kang, Yilin Du, David R. Mortensen

    Abstract: Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-resource languages. In this paper, we propose a novel approach to NER using phonemic representation based on the International Phonetic Alphabet (IPA) to bridge the gap between representations of different languages. Our experiments show that our method significa…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 7 pages, 5 figures, 5 tables

  3. arXiv:2406.05930  [pdf]

    cs.CL

    Semisupervised Neural Proto-Language Reconstruction

    Authors: Liang Lu, Peirong Xie, David R. Mortensen

    Abstract: Existing work implementing comparative reconstruction of ancestral languages (proto-languages) has usually required full supervision. However, historical reconstruction models are only of practical value if they can be trained with a limited amount of labeled data. We propose a semisupervised historical reconstruction task in which the model is trained on only a small amount of labeled data (cogna…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024

  4. arXiv:2404.15690  [pdf, other]

    cs.CL cs.LG

    Neural Proto-Language Reconstruction

    Authors: Chenxuan Cui, Ying Chen, Qinxin Wang, David R. Mortensen

    Abstract: Proto-form reconstruction has been a painstaking process for linguists. Recently, computational models such as RNN and Transformers have been proposed to automate this process. We take three different approaches to improve upon previous methods, including data augmentation to recover missing reflexes, adding a VAE structure to the Transformer model for proto-to-language prediction, and using a neu…

    Submitted 24 April, 2024; originally announced April 2024.

  5. arXiv:2403.18769  [pdf]

    cs.CL

    Improved Neural Protoform Reconstruction via Reflex Prediction

    Authors: Liang Lu, Jingzhi Wang, David R. Mortensen

    Abstract: Protolanguage reconstruction is central to historical linguistics. The comparative method, one of the most influential theoretical and methodological frameworks in the history of the language sciences, allows linguists to infer protoforms (reconstructed ancestral words) from their reflexes (related modern words) based on the assumption of regular sound change. Not surprisingly, numerous computatio…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted to LREC-COLING 2024

  6. arXiv:2403.17856  [pdf, other]

    cs.CL

    Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs

    Authors: David R. Mortensen, Valentina Izrailevitch, Yunze Xiao, Hinrich Schütze, Leonie Weissweiler

    Abstract: Lexical-syntactic flexibility, in the form of conversion (or zero-derivation) is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to w…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  7. arXiv:2403.17760  [pdf, other]

    cs.CL

    Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons

    Authors: Shijia Zhou, Leonie Weissweiler, Taiqi He, Hinrich Schütze, David R. Mortensen, Lori Levin

    Abstract: In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort…

    Submitted 29 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  8. arXiv:2403.13169  [pdf, other]

    cs.CL

    Wav2Gloss: Generating Interlinear Glossed Text from Speech

    Authors: Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin

    Abstract: Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free transl…

    Submitted 5 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: ACL 2024 camera ready version

  9. arXiv:2402.14279  [pdf, other]

    cs.CL cs.AI

    Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding

    Authors: Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen

    Abstract: Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource and low-resource languages. We hypothesize that the performance gaps between languages are affected by linguistic gaps between those languages and provi…

    Submitted 21 February, 2024; originally announced February 2024.

  10. arXiv:2402.12998  [pdf]

    cs.CL

    Phonotactic Complexity across Dialects

    Authors: Ryan Soh-Eun Shim, Kalvin Chang, David R. Mortensen

    Abstract: Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim on a micro-level, using a tightly-controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites),…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted to COLING-LREC 2024

  11. arXiv:2402.01582  [pdf]

    cs.CL

    Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study

    Authors: Kalvin Chang, Nathaniel R. Robinson, Anna Cai, Ting Chen, Annie Zhang, David R. Mortensen

    Abstract: We describe a set of new methods to partially automate linguistic phylogenetic inference given (1) cognate sets with their respective protoforms and sound laws, (2) a mapping from phones to their articulatory features and (3) a typological database of sound changes. We train a neural network on these sound change data to weight articulatory distances between phones and predict intermediate sound c…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to LChange 2023

  12. arXiv:2311.00835  [pdf, other]

    cs.CL

    Calibrated Seq2seq Models for Efficient and Generalizable Ultra-fine Entity Typing

    Authors: Yanlin Feng, Adithya Pratapa, David R Mortensen

    Abstract: Ultra-fine entity typing plays a crucial role in information extraction by predicting fine-grained semantic types for entity mentions in text. However, this task poses significant challenges due to the massive number of entity types in the output space. The current state-of-the-art approaches, based on standard multi-label classifiers or cross-encoder models, suffer from poor generalization perfor…

    Submitted 1 November, 2023; originally announced November 2023.

  13. arXiv:2310.15113  [pdf]

    cs.CL

    Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model

    Authors: Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David R. Mortensen

    Abstract: Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (i…

    Submitted 26 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  14. arXiv:2309.07423  [pdf, other]

    cs.CL

    ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

    Authors: Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, Graham Neubig

    Abstract: Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's dive…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: 27 pages, 9 figures, 14 tables

  15. arXiv:2305.13707  [pdf, other]

    cs.CL

    Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

    Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov

    Abstract: Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of ``tokens'' processed or generated by the underlying language models. What constitutes a token, however, is training data…

    Submitted 23 May, 2023; originally announced May 2023.

  16. arXiv:2302.02178  [pdf, other]

    cs.CL

    Construction Grammar Provides Unique Insight into Neural Language Models

    Authors: Leonie Weissweiler, Taiqi He, Naoki Otani, David R. Mortensen, Lori Levin, Hinrich Schütze

    Abstract: Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mi…

    Submitted 4 February, 2023; originally announced February 2023.

    Comments: GURT 2023

  17. arXiv:2209.06295  [pdf, other]

    cs.CL

    Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican

    Authors: Nathaniel R. Robinson, Cameron J. Hogan, Nancy Fulda, David R. Mortensen

    Abstract: Multilingual transfer techniques often improve low-resource machine translation (MT). Many of these techniques are applied without considering data characteristics. We show in the context of Haitian-to-English translation that transfer effectiveness is correlated with amount of training data and relationships between knowledge-sharing languages. Our experiments suggest that for some languages beyo…

    Submitted 13 September, 2022; originally announced September 2022.

  18. arXiv:2209.02842  [pdf, other]

    cs.CL

    ASR2K: Speech Recognition for Around 2000 Languages without Audio

    Authors: Xinjian Li, Florian Metze, David R Mortensen, Alan W Black, Shinji Watanabe

    Abstract: Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronu…

    Submitted 6 September, 2022; originally announced September 2022.

    Comments: INTERSPEECH 2022

  19. arXiv:2207.09889  [pdf, other]

    cs.CL cs.SD eess.AS

    When Is TTS Augmentation Through a Pivot Language Useful?

    Authors: Nathaniel Robinson, Perez Ogayo, Swetha Gangu, David R. Mortensen, Shinji Watanabe

    Abstract: Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data. For many such languages, audio and text are available separately, but not audio with transcriptions. Using text, speech can be synthetically produced via text-to-speech (TTS) systems. However, many low-resource languages do not have quality TTS systems either.…

    Submitted 20 July, 2022; originally announced July 2022.

  20. arXiv:2204.04080  [pdf]

    cs.CL cs.AI

    Learning the Ordering of Coordinate Compounds and Elaborate Expressions in Hmong, Lahu, and Chinese

    Authors: Chenxuan Cui, Katherine J. Zhang, David R. Mortensen

    Abstract: Coordinate compounds (CCs) and elaborate expressions (EEs) are coordinate constructions common in languages of East and Southeast Asia. Mortensen (2006) claims that (1) the linear ordering of EEs and CCs in Hmong, Lahu, and Chinese can be predicted via phonological hierarchies and (2) these phonological hierarchies lack a clear phonetic rationale. These claims are significant because morphosyntax…

    Submitted 3 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: NAACL2022 Oral

  21. arXiv:2203.13901  [pdf, other]

    cs.CL

    AUTOLEX: An Automatic Framework for Linguistic Exploration

    Authors: Aditi Chaudhary, Zaid Sheikh, David R Mortensen, Antonios Anastasopoulos, Graham Neubig

    Abstract: Each language has its own complex systems of word, phrase, and sentence construction, the guiding principles of which are often summarized in grammar descriptions for the consumption of linguists or language learners. However, manual creation of such descriptions is a fraught process, as creating descriptions which describe the language in "its own terms" without bias or error requires both a deep…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: 9 pages

  22. arXiv:2107.11628  [pdf, other]

    cs.CL cs.SD eess.AS

    Differentiable Allophone Graphs for Language-Universal Speech Recognition

    Authors: Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

    Abstract: Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supe…

    Submitted 24 July, 2021; originally announced July 2021.

    Comments: INTERSPEECH 2021. Contains additional studies on phone recognition for unseen languages

  23. arXiv:2104.00824  [pdf]

    cs.CL cs.SD eess.AS

    Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments

    Authors: David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu

    Abstract: There is growing interest in ASR systems that can recognize phones in a language-independent fashion. There is additionally interest in building language technologies for low-resource and endangered languages. However, there is a paucity of realistic data that can be used to test such systems and technologies. This paper presents a publicly available, phonetically transcribed corpus of 2255 uttera…

    Submitted 1 April, 2021; originally announced April 2021.

    Comments: 4 pages, 3 figures

  24. arXiv:2103.16590  [pdf, other]

    cs.CL

    Evaluating the Morphosyntactic Well-formedness of Generated Texts

    Authors: Adithya Pratapa, Antonios Anastasopoulos, Shruti Rijhwani, Aditi Chaudhary, David R. Mortensen, Graham Neubig, Yulia Tsvetkov

    Abstract: Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various…

    Submitted 9 September, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: EMNLP 2021 camera-ready

  25. arXiv:2010.01160  [pdf, other]

    cs.CL

    Automatic Extraction of Rules Governing Morphological Agreement

    Authors: Aditi Chaudhary, Antonios Anastasopoulos, Adithya Pratapa, David R. Mortensen, Zaid Sheikh, Yulia Tsvetkov, Graham Neubig

    Abstract: Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focu…

    Submitted 5 October, 2020; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: Accepted at EMNLP 2020

  26. arXiv:2006.09336  [pdf, other]

    cs.CL

    Cross-Cultural Similarity Features for Cross-Lingual Transfer Learning of Pragmatically Motivated Tasks

    Authors: Jimin Sun, Hwijeen Ahn, Chan Young Park, Yulia Tsvetkov, David R. Mortensen

    Abstract: Much work in cross-lingual transfer learning explored how to select better transfer languages for multilingual tasks, primarily focusing on typological and genealogical similarities between languages. We hypothesize that these measures of linguistic proximity are not enough when working with pragmatically-motivated tasks, such as sentiment analysis. As an alternative, we introduce three linguistic…

    Submitted 8 April, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: EACL 2021

  27. arXiv:2006.04334  [pdf, other]

    cs.SI cs.CL

    Characterizing Sociolinguistic Variation in the Competing Vaccination Communities

    Authors: Shahan Ali Memon, Aman Tyagi, David R. Mortensen, Kathleen M. Carley

    Abstract: Public health practitioners and policy makers grapple with the challenge of devising effective message-based interventions for debunking public health misinformation in cyber communities. "Framing" and "personalization" of the message is one of the key features for devising a persuasive messaging strategy. For an effective health communication, it is imperative to focus on "preference-based framin…

    Submitted 4 October, 2020; v1 submitted 7 June, 2020; originally announced June 2020.

    Comments: 11 pages, 4 tables, 1 figure, 1 algorithm, accepted to SBP-BRiMS 2020 -- International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation

  28. arXiv:2004.08031  [pdf, other]

    cs.CL

    AlloVera: A Multilingual Allophone Database

    Authors: David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. Black, Florian Metze, Graham Neubig

    Abstract: We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a uni…

    Submitted 16 April, 2020; originally announced April 2020.

    Comments: 8 pages, LREC 2020

  29. arXiv:2002.11800  [pdf, other]

    cs.CL cs.SD eess.AS

    Universal Phone Recognition with a Multilingual Allophone System

    Authors: Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, Alan W Black, Florian Metze

    Abstract: Multilingual models can improve language processing, particularly for low resource situations, by sharing parameters across languages. Multilingual acoustic models, however, generally ignore the difference between phonemes (sounds that can support lexical contrasts in a particular language) and their corresponding phones (the sounds that are actually spoken, which are language independent). This c…

    Submitted 26 February, 2020; originally announced February 2020.

    Comments: ICASSP 2020

  30. arXiv:2002.11781  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Zero-shot Learning for Automatic Phonemic Transcription

    Authors: Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W Black, Florian Metze

    Abstract: Automatic phonemic transcription tools are useful for low-resource language documentation. However, due to the lack of training sets, only a tiny fraction of languages have phonemic transcription tools. Fortunately, multilingual acoustic modeling provides a solution given limited audio training data. A more challenging problem is to build phonemic transcribers for languages with zero training data…

    Submitted 26 February, 2020; originally announced February 2020.

    Comments: AAAI 2020

  31. Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods

    Authors: Maria Ryskina, Ella Rabinovich, Taylor Berg-Kirkpatrick, David R. Mortensen, Yulia Tsvetkov

    Abstract: We perform statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence although we find…

    Submitted 21 January, 2020; originally announced January 2020.

    Comments: SCiL 2020

    Journal ref: Proceedings of the Society for Computation in Linguistics 3.1 (2020): 43-52

  32. arXiv:1911.02709  [pdf, other]

    cs.CL

    Using Interlinear Glosses as Pivot in Low-Resource Multilingual Machine Translation

    Authors: Zhong Zhou, Lori Levin, David R. Mortensen, Alex Waibel

    Abstract: We demonstrate a new approach to Neural Machine Translation (NMT) for low-resource languages using a ubiquitous linguistic resource, Interlinear Glossed Text (IGT). IGT represents a non-English sentence as a sequence of English lemmas and morpheme labels. As such, it can serve as a pivot or interlingua for NMT. Our contribution is four-fold. Firstly, we pool IGT for 1,497 languages in ODIN (54,545…

    Submitted 3 March, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

  33. arXiv:1907.10129  [pdf, other]

    cs.CL

    CMU-01 at the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology

    Authors: Aditi Chaudhary, Elizabeth Salesky, Gayatri Bhat, David R. Mortensen, Jaime G. Carbonell, Yulia Tsvetkov

    Abstract: This paper presents the submission by the CMU-01 team to the SIGMORPHON 2019 task 2 of Morphological Analysis and Lemmatization in Context. This task requires us to produce the lemma and morpho-syntactic description of each token in a sequence, for 107 treebanks. We approach this task with a hierarchical neural conditional random field (CRF) model which predicts each coarse-grained feature (eg. PO…

    Submitted 23 July, 2019; originally announced July 2019.

    Comments: In Proceedings of the ACL-SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology

  34. arXiv:1902.08899  [pdf, other]

    cs.CL

    The ARIEL-CMU Systems for LoReHLT18

    Authors: Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy, Alan W Black, Jaime Carbonell, Graham V. Horwood, et al. (5 additional authors not shown)

    Abstract: This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).

    Submitted 24 February, 2019; originally announced February 2019.

  35. arXiv:1808.09500  [pdf]

    cs.CL

    Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

    Authors: Aditi Chaudhary, Chunting Zhou, Lori Levin, Graham Neubig, David R. Mortensen, Jaime G. Carbonell

    Abstract: Much work in Natural Language Processing (NLP) has been for resource-rich languages, making generalization to new, less-resourced languages challenging. We present two approaches for improving generalization to low-resourced languages by adapting continuous word representations using linguistically motivated subword units: phonemes, morphemes and graphemes. Our method requires neither parallel cor…

    Submitted 28 August, 2018; originally announced August 2018.

    Comments: Accepted at EMNLP 2018