-
CreoleVal: Multilingual Multitask Benchmarks for Creoles
Authors:
Heather Lent,
Kushal Tatariya,
Raj Dabre,
Yiyi Chen,
Marcell Fekete,
Esther Ploeger,
Li Zhou,
Ruth-Ann Armstrong,
Abee Eijansantos,
Catriona Malau,
Hans Erik Heje,
Ernests Lavrinovics,
Diptesh Kanojia,
Paul Belony,
Marcel Bollmann,
Loïc Grobol,
Miryam de Lhoneux,
Daniel Hershcovich,
Michel DeGraff,
Anders Søgaard,
Johannes Bjerva
Abstract:
Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly-resourced languages imply a significant potential for transfer learning, this potential is hampered by the lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.
Submitted 6 May, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.
-
6-Layer Model for a Structured Description and Categorization of Urban Traffic and Environment
Authors:
Maike Scholtes,
Lukas Westhofen,
Lara Ruth Turner,
Katrin Lotto,
Michael Schuldes,
Hendrik Weber,
Nicolas Wagener,
Christian Neurohr,
Martin Bollmann,
Franziska Körtke,
Johannes Hiller,
Michael Hoss,
Julian Bock,
Lutz Eckstein
Abstract:
Verification and validation of automated driving functions impose large challenges. Currently, scenario-based approaches are investigated in research and industry, aiming at a reduction of testing efforts by specifying safety relevant scenarios. To define those scenarios and operate in a complex real-world design domain, a structured description of the environment is needed. Within the PEGASUS research project, the 6-Layer Model (6LM) was introduced for the description of highway scenarios. This paper refines the 6LM and extends it to urban traffic and environment. As defined in PEGASUS, the 6LM provides the possibility to categorize the environment and, therefore, functions as a structured basis for subsequent scenario description. The model enables a structured description and categorization of the general environment, without incorporating any knowledge or anticipating any functions of actors. Beyond that, there is a variety of other applications of the 6LM, which are elaborated in this paper. The 6LM includes a description of the road network and traffic guidance objects, roadside structures, temporary modifications of the former, dynamic objects, environmental conditions and digital information. The work at hand specifies each layer by categorizing its items. Guidelines are formulated and explanatory examples are given to standardize the application of the model for an objective environment description. In contrast to previous publications, the model and its design are described in far more detail. Finally, the holistic description of the 6LM presented includes remarks on possible future work when expanding the concept to machine perception aspects.
Submitted 2 February, 2021; v1 submitted 9 December, 2020;
originally announced December 2020.
-
A Large-Scale Comparison of Historical Text Normalization Systems
Authors:
Marcel Bollmann
Abstract:
There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.
Submitted 3 April, 2019;
originally announced April 2019.
-
Few-Shot and Zero-Shot Learning for Historical Text Normalization
Authors:
Marcel Bollmann,
Natalia Korchagina,
Anders Søgaard
Abstract:
Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of different multi-task learning architectures. This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. We also show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.
Submitted 13 October, 2019; v1 submitted 12 March, 2019;
originally announced March 2019.
-
Transmuting CHY formulae
Authors:
Max Bollmann,
Livia Ferro
Abstract:
The various formulations of scattering amplitudes presented in recent years have underlined a hidden unity among very different theories. The KLT and BCJ relations, together with the CHY formulation, connect the S-matrices of a wide range of theories: the transmutation operators, recently proposed by Cheung, Shen and Wen, provide an account for these similarities. In this note we use the transmutation operators to link the various CHY integrands at tree-level. Starting from gravity, we generate the integrands for Yang-Mills, biadjoint scalar, Einstein-Maxwell, Yang-Mills scalar, Born-Infeld, Dirac-Born-Infeld, non-linear sigma model and special Galileon theories, as well as for their extensions. We also commence the study of the CHY-like formulae at loop level.
Submitted 22 August, 2018;
originally announced August 2018.
-
Improving historical spelling normalization with bi-directional LSTMs and multi-task learning
Authors:
Marcel Bollmann,
Anders Søgaard
Abstract:
Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German. We show that multi-task learning with additional normalization data can improve our model's performance further.
Submitted 25 October, 2016;
originally announced October 2016.