Showing 1–23 of 23 results for author: de Lhoneux, M

  1. arXiv:2402.04222  [pdf, other]

    cs.CL

    What is "Typological Diversity" in NLP?

    Authors: Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva

    Abstract: The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. Aiming to extend this, an increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is co…

    Submitted 16 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  2. arXiv:2402.03137  [pdf, other]

    cs.CL cs.LG

    Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

    Authors: Kushal Tatariya, Heather Lent, Johannes Bjerva, Miryam de Lhoneux

    Abstract: Emotion classification is a challenging task in NLP due to the inherent idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolin…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 5 pages, Accepted to SIGTYP 2024 @ EACL

  3. arXiv:2310.19567  [pdf, other]

    cs.CL cs.AI

    CreoleVal: Multilingual Multitask Benchmarks for Creoles

    Authors: Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, Johannes Bjerva

    Abstract: Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning…

    Submitted 6 May, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted to TACL

  4. arXiv:2302.10086  [pdf, other]

    cs.CL

    A Two-Sided Discussion of Preregistration of NLP Research

    Authors: Anders Søgaard, Daniel Hershcovich, Miryam de Lhoneux

    Abstract: Van Miltenburg et al. (2021) suggest NLP research should adopt preregistration to prevent fishing expeditions and to promote publication of negative results. At face value, this is a very reasonable suggestion, seemingly solving many methodological problems with NLP research. We discuss pros and cons -- some old, some new: a) Preregistration is challenged by the practice of retrieving hypotheses a…

    Submitted 20 February, 2023; originally announced February 2023.

    Comments: EACL 2023

  5. arXiv:2207.06991  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Language Modelling with Pixels

    Authors: Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott

    Abstract: Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of…

    Submitted 26 April, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: ICLR 2023

  6. arXiv:2206.00437  [pdf, other]

    cs.CL cs.CY

    What a Creole Wants, What a Creole Needs

    Authors: Heather Lent, Kelechi Ogueji, Miryam de Lhoneux, Orevaoghene Ahia, Anders Søgaard

    Abstract: In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to remedy this delta often begin with translations of existing English datasets into other languages. However, this approach ignores that different language communities have different needs. We consider a…

    Submitted 1 June, 2022; originally announced June 2022.

    Comments: LREC 2022

  7. arXiv:2203.10020  [pdf, other]

    cs.CL

    Challenges and Strategies in Cross-Cultural NLP

    Authors: Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust, Anders Søgaard

    Abstract: Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogo…

    Submitted 18 March, 2022; originally announced March 2022.

    Comments: ACL 2022 - Theme track

  8. arXiv:2203.09306  [pdf, other]

    cs.CL cs.CV

    Finding Structural Knowledge in Multimodal-BERT

    Authors: Victor Milewski, Miryam de Lhoneux, Marie-Francine Moens

    Abstract: In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capabilities of storing the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022

  9. arXiv:2203.08555  [pdf, other]

    cs.CL

    Zero-Shot Dependency Parsing with Worst-Case Aware Automated Curriculum Learning

    Authors: Miryam de Lhoneux, Sheng Zhang, Anders Søgaard

    Abstract: Large multilingual pretrained language models such as mBERT and XLM-RoBERTa have been found to be surprisingly effective for cross-lingual transfer of syntactic parsing models (Wu and Dredze 2019), but only between related languages. However, source and training languages are rarely related, when parsing truly low-resource languages. To close this gap, we adopt a method from multi-task learning, w…

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  10. arXiv:2112.03625  [pdf, other]

    cs.CL

    Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings

    Authors: Rob van der Goot, Miryam de Lhoneux

    Abstract: With an increase of dataset availability, the potential for learning from a variety of data sources has increased. One particular method to improve learning from multiple data sources is to embed the data source during training. This allows the model to learn generalizable features as well as distinguishing features between datasets. However, these dataset embeddings have mostly been used before c…

    Submitted 7 December, 2021; originally announced December 2021.

    Comments: Accepted to TLT at SyntaxFest 2021

  11. arXiv:2109.06074  [pdf, other]

    cs.CL

    On Language Models for Creoles

    Authors: Heather Lent, Emanuele Bugliarello, Miryam de Lhoneux, Chen Qiu, Anders Søgaard

    Abstract: Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature. Creoles typically result from the fusion of a foreign language with multiple local languages, and what grammatical and lexical features are transferred to the creole is a complex process. While creoles are generally stable, the prominence of some features may be much s…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: CoNLL 2021

  12. arXiv:2106.03269  [pdf, other]

    cs.CL

    Itihasa: A large-scale corpus for Sanskrit to English translation

    Authors: Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders Søgaard

    Abstract: This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of…

    Submitted 5 October, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

    Comments: Fixed typo

  13. arXiv:2011.00834  [pdf, other]

    cs.CL

    Comparison by Conversion: Reverse-Engineering UCCA from Syntax and Lexical Semantics

    Authors: Daniel Hershcovich, Nathan Schneider, Dotan Dvir, Jakob Prange, Miryam de Lhoneux, Omri Abend

    Abstract: Building robust natural language understanding systems will require a clear characterization of whether and how various linguistic meaning representations complement each other. To perform a systematic comparative analysis, we evaluate the mapping between meaning representations from different frameworks using two complementary methods: (i) a rule-based converter, and (ii) a supervised delexicaliz…

    Submitted 2 November, 2020; originally announced November 2020.

    Comments: COLING 2020 camera ready

  14. arXiv:2005.12094  [pdf, other]

    cs.CL

    Køpsala: Transition-Based Graph Parsing via Efficient Training and Effective Encoding

    Authors: Daniel Hershcovich, Miryam de Lhoneux, Artur Kulmizev, Elham Pejhan, Joakim Nivre

    Abstract: We present Køpsala, the Copenhagen-Uppsala system for the Enhanced Universal Dependencies Shared Task at IWPT 2020. Our system is a pipeline consisting of off-the-shelf models for everything but enhanced graph parsing, and for the latter, a transition-based graph parser adapted from Che et al. (2019). We train a single enhanced parser model per language, using gold sentence splitting and tokenizat…

    Submitted 2 June, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

    Comments: IWPT shared task 2020

  15. arXiv:1908.07397  [pdf, other]

    cs.CL

    Deep Contextualized Word Embeddings in Transition-Based and Graph-Based Dependency Parsing -- A Tale of Two Parsers Revisited

    Authors: Artur Kulmizev, Miryam de Lhoneux, Johannes Gontrum, Elena Fano, Joakim Nivre

    Abstract: Transition-based and graph-based dependency parsers have previously been shown to have complementary strengths and weaknesses: transition-based parsers exploit rich structural features but suffer from error propagation, while graph-based parsers benefit from global optimization but have restricted feature scope. In this paper, we show that, even though some details of the picture have changed afte…

    Submitted 27 August, 2019; v1 submitted 20 August, 2019; originally announced August 2019.

    Comments: Accepted at EMNLP 2019

  16. arXiv:1907.07950  [pdf, other]

    cs.CL

    What Should/Do/Can LSTMs Learn When Parsing Auxiliary Verb Constructions?

    Authors: Miryam de Lhoneux, Sara Stymne, Joakim Nivre

    Abstract: There is a growing interest in investigating what neural NLP models learn about language. A prominent open question is the question of whether or not it is necessary to model hierarchical structure. We present a linguistic investigation of a neural parser adding insights to this question. We look at transitivity and agreement information of auxiliary verb constructions (AVCs) in comparison to fini…

    Submitted 12 October, 2020; v1 submitted 18 July, 2019; originally announced July 2019.

    Comments: Accepted by the Computational Linguistics journal

  17. arXiv:1902.09781  [pdf, other]

    cs.CL

    Recursive Subtree Composition in LSTM-Based Dependency Parsing

    Authors: Miryam de Lhoneux, Miguel Ballesteros, Joakim Nivre

    Abstract: The need for tree structure modelling on top of sequence modelling is an open issue in neural dependency parsing. We investigate the impact of adding a tree layer on top of a sequential model by recursively composing subtree representations (composition) in a transition-based parser that uses features extracted by a BiLSTM. Composition seems superfluous with such a model, suggesting that BiLSTMs c…

    Submitted 26 February, 2019; originally announced February 2019.

    Comments: Accepted at NAACL 2019

  18. arXiv:1809.02237  [pdf, ps, other]

    cs.CL

    82 Treebanks, 34 Models: Universal Dependency Parsing with Multi-Treebank Models

    Authors: Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, Sara Stymne

    Abstract: We present the Uppsala system for the CoNLL 2018 Shared Task on universal dependency parsing. Our system is a pipeline consisting of three components: the first performs joint word and sentence segmentation; the second predicts part-of-speech tags and morphological features; the third predicts dependency trees from words and tags. Instead of training a single parsing model for each treebank, we t…

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

  19. arXiv:1809.00070  [pdf, ps, other]

    cs.CL

    Nightmare at test time: How punctuation prevents parsers from generalizing

    Authors: Anders Søgaard, Miryam de Lhoneux, Isabelle Augenstein

    Abstract: Punctuation is a strong indicator of syntactic structure, and parsers trained on text with punctuation often rely heavily on this signal. Punctuation is a diversion, however, since human language processing does not rely on punctuation to the same extent, and in informal texts, we therefore often leave out punctuation. We also use punctuation ungrammatically for emphatic or creative purposes, or s…

    Submitted 31 August, 2018; originally announced September 2018.

    Comments: Analyzing and interpreting neural networks for NLP, EMNLP 2018 workshop

  20. arXiv:1808.09060  [pdf, other]

    cs.CL

    An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing

    Authors: Aaron Smith, Miryam de Lhoneux, Sara Stymne, Joakim Nivre

    Abstract: We provide a comprehensive analysis of the interactions between pre-trained word embeddings, character models and POS tags in a transition-based dependency parser. While previous studies have shown POS information to be less important in the presence of character models, we show that in fact there are complex interactions between all three techniques. In isolation each produces large improvements…

    Submitted 27 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018

  21. arXiv:1808.09055  [pdf, ps, other]

    cs.CL

    Parameter sharing between dependency parsers for related languages

    Authors: Miryam de Lhoneux, Johannes Bjerva, Isabelle Augenstein, Anders Søgaard

    Abstract: Previous work has suggested that parameter sharing between transition-based neural dependency parsers for related languages can lead to better performance, but there is no consensus on what parameters to share. We present an evaluation of 27 different parameter sharing strategies across 10 languages, representing five pairs of related languages, each pair from a different language family. We find…

    Submitted 4 October, 2018; v1 submitted 27 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018

  22. arXiv:1805.05089  [pdf, other]

    cs.CL

    Parser Training with Heterogeneous Treebanks

    Authors: Sara Stymne, Miryam de Lhoneux, Aaron Smith, Joakim Nivre

    Abstract: How to make the most of multiple heterogeneous treebanks when training a monolingual dependency parser is an open question. We start by investigating previously suggested, but little evaluated, strategies for exploiting multiple treebanks based on concatenating training sets, with or without fine-tuning. We go on to propose a new method based on treebank embeddings. We perform experiments for seve…

    Submitted 14 May, 2018; originally announced May 2018.

    Comments: 7 pages. Accepted to ACL 2018, short papers

  23. arXiv:1505.04420  [pdf, other]

    cs.CL

    CCG Parsing and Multiword Expressions

    Authors: Miryam de Lhoneux

    Abstract: This thesis presents a study about the integration of information about Multiword Expressions (MWEs) into parsing with Combinatory Categorial Grammar (CCG). We build on previous work which has shown the benefit of adding information about MWEs to syntactic parsing by implementing a similar pipeline with CCG parsing. More specifically, we collapse MWEs to one token in training and test data in CCGb…

    Submitted 17 May, 2015; originally announced May 2015.

    Comments: MSc thesis, The University of Edinburgh, 2014, School of Informatics, MSc Artificial Intelligence