Search | arXiv e-print repository

RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt

Abstract: Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base-model for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this pap… ▽ More Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base-model for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequent tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate if our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens.We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use. △ Less

Submitted 15 November, 2022; originally announced November 2022.

Comments: 9 pages, 1 figure, 3 tables

arXiv:2204.13511 [pdf, other]

RobBERTje: a Distilled Dutch BERT Model

Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt

Abstract: Pre-trained large-scale language models such as BERT have gained a lot of attention thanks to their outstanding performance on a wide range of natural language tasks. However, due to their large number of parameters, they are resource-intensive both to deploy and to fine-tune. Researchers have created several methods for distilling language models into smaller ones to increase efficiency, with a s… ▽ More Pre-trained large-scale language models such as BERT have gained a lot of attention thanks to their outstanding performance on a wide range of natural language tasks. However, due to their large number of parameters, they are resource-intensive both to deploy and to fine-tune. Researchers have created several methods for distilling language models into smaller ones to increase efficiency, with a small performance trade-off. In this paper, we create several different distilled versions of the state-of-the-art Dutch RobBERT model and call them RobBERTje. The distillations differ in their distillation corpus, namely whether or not they are shuffled and whether they are merged with subsequent sentences. We found that the performance of the models using the shuffled versus non-shuffled datasets is similar for most tasks and that randomly merging subsequent sentences in a corpus creates models that train faster and perform better on tasks with long sequences. Upon comparing distillation architectures, we found that the larger DistilBERT architecture worked significantly better than the Bort hyperparametrization. Interestingly, we also found that the distilled models exhibit less gender-stereotypical bias than its teacher model. Since smaller architectures decrease the time to fine-tune, these models allow for more efficient training and more lightweight deployment of many Dutch downstream language tasks. △ Less

Submitted 28 April, 2022; originally announced April 2022.

Comments: Published in CLIN journal

Journal ref: Computational Linguistics in the Netherlands Journal 2021

arXiv:2109.10217 [pdf, other]

Shape Inference and Grammar Induction for Example-based Procedural Generation

Authors: Gillis Hermans, Thomas Winters, Luc De Raedt

Abstract: Designers increasingly rely on procedural generation for automatic generation of content in various industries. These techniques require extensive knowledge of the desired content, and about how to actually implement such procedural methods. Algorithms for learning interpretable generative models from example content could alleviate both difficulties. We propose SIGI, a novel method for inferring… ▽ More Designers increasingly rely on procedural generation for automatic generation of content in various industries. These techniques require extensive knowledge of the desired content, and about how to actually implement such procedural methods. Algorithms for learning interpretable generative models from example content could alleviate both difficulties. We propose SIGI, a novel method for inferring shapes and inducing a shape grammar from grid-based 3D building examples. This interpretable grammar is well-suited for co-creative design. Applied to Minecraft buildings, we show how the shape grammar can be used to automatically generate new buildings in a similar style. △ Less

Submitted 21 September, 2021; originally announced September 2021.

ACM Class: F.4.2; I.2.6; I.2.1; J.5

Journal ref: International Conference on Computational Creativity 12 (2021) 342-349

arXiv:2106.12574 [pdf, other]

DeepStochLog: Neural Stochastic Logic Programming

Authors: Thomas Winters, Giuseppe Marra, Robin Manhaeve, Luc De Raedt

Abstract: Recent advances in neural symbolic learning, such as DeepProbLog, extend probabilistic logic programs with neural predicates. Like graphical models, these probabilistic logic programs define a probability distribution over possible worlds, for which inference is computationally hard. We propose DeepStochLog, an alternative neural symbolic framework based on stochastic definite clause grammars, a t… ▽ More Recent advances in neural symbolic learning, such as DeepProbLog, extend probabilistic logic programs with neural predicates. Like graphical models, these probabilistic logic programs define a probability distribution over possible worlds, for which inference is computationally hard. We propose DeepStochLog, an alternative neural symbolic framework based on stochastic definite clause grammars, a type of stochastic logic program, which defines a probability distribution over possible derivations. More specifically, we introduce neural grammar rules into stochastic definite clause grammars to create a framework that can be trained end-to-end. We show that inference and learning in neural stochastic logic programming scale much better than for neural probabilistic logic programs. Furthermore, the experimental evaluation shows that DeepStochLog achieves state-of-the-art results on challenging neural symbolic learning tasks. △ Less

Submitted 23 June, 2021; originally announced June 2021.

Comments: Thomas Winters and Giuseppe Marra contributed equally to this work

MSC Class: 68T27; 68T37; 68T07 ACM Class: I.2.6; I.2.5; I.2.3

arXiv:2010.13652 [pdf, other]

Dutch Humor Detection by Generating Negative Examples

Authors: Thomas Winters, Pieter Delobelle

Abstract: Detecting if a text is humorous is a hard task to do computationally, as it usually requires linguistic and common sense insights. In machine learning, humor detection is usually modeled as a binary classification task, trained to predict if the given text is a joke or another type of text. Rather than using completely different non-humorous texts, we propose using text generation algorithms for i… ▽ More Detecting if a text is humorous is a hard task to do computationally, as it usually requires linguistic and common sense insights. In machine learning, humor detection is usually modeled as a binary classification task, trained to predict if the given text is a joke or another type of text. Rather than using completely different non-humorous texts, we propose using text generation algorithms for imitating the original joke dataset to increase the difficulty for the learning algorithm. We constructed several different joke and non-joke datasets to test the humor detection abilities of different language technologies. In particular, we compare the humor detection capabilities of classic neural network approaches with the state-of-the-art Dutch language model RobBERT. In doing so, we create and compare the first Dutch humor detection systems. We found that while other language models perform well when the non-jokes came from completely different domains, RobBERT was the only one that was able to distinguish jokes from generated negative examples. This performance illustrates the usefulness of using text generation to create negative datasets for humor recognition, and also shows that transformer models are a large step forward in humor detection. △ Less

Submitted 26 October, 2020; originally announced October 2020.

Comments: Accepted at the Proceedings of the 32st Benelux Conference on Artificial Intelligence (BNAIC 2020) and the 29th Belgian Dutch Conference on Machine Learning (Benelearn 2020)

MSC Class: 68T50 ACM Class: I.2.7; I.2.6

arXiv:2009.04530 [pdf, ps, other]

Discovering Textual Structures: Generative Grammar Induction using Template Trees

Authors: Thomas Winters, Luc De Raedt

Abstract: Natural language generation provides designers with methods for automatically generating text, e.g. for creating summaries, chatbots and game content. In practise, text generators are often either learned and hard to interpret, or created by hand using techniques such as grammars and templates. In this paper, we introduce a novel grammar induction algorithm for learning interpretable grammars for… ▽ More Natural language generation provides designers with methods for automatically generating text, e.g. for creating summaries, chatbots and game content. In practise, text generators are often either learned and hard to interpret, or created by hand using techniques such as grammars and templates. In this paper, we introduce a novel grammar induction algorithm for learning interpretable grammars for generative purposes, called Gitta. We also introduce the novel notion of template trees to discover latent templates in corpora to derive these generative grammars. By using existing human-created grammars, we found that the algorithm can reasonably approximate these grammars using only a few examples. These results indicate that Gitta could be used to automatically learn interpretable and easily modifiable grammars, and thus provide a step** stone for human-machine co-creation of generative models. △ Less

Submitted 9 September, 2020; originally announced September 2020.

Comments: Published in the Proceedings of the 11th International Conference on Computational Creativity, p177-180

MSC Class: 68T50 ACM Class: I.2.7; I.2.6

arXiv:2001.06286 [pdf, other]

RobBERT: a Dutch RoBERTa-based Language Model

Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt

Abstract: Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT, which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies s… ▽ More Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT, which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies show that BERT models trained on a single language significantly outperform the multilingual version. Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks. While previous approaches have used earlier implementations of BERT to train a Dutch version of BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT. We measured its performance on various tasks as well as the importance of the fine-tuning dataset size. We also evaluated the importance of language-specific tokenizers and the model's fairness. We found that RobBERT improves state-of-the-art results for various tasks, and especially significantly outperforms other models when dealing with smaller datasets. These results indicate that it is a powerful pre-trained model for a large variety of Dutch language tasks. The pre-trained and fine-tuned models are publicly available to support further downstream Dutch NLP applications. △ Less

Submitted 16 September, 2020; v1 submitted 17 January, 2020; originally announced January 2020.

Comments: 11 pages, 4 tables, 3 figures. Accepted in EMNLP Findings

arXiv:1909.09480 [pdf, ps, other]

Generating Philosophical Statements using Interpolated Markov Models and Dynamic Templates

Authors: Thomas Winters

Abstract: Automatically imitating input text is a common task in natural language generation, often used to create humorous results. Classic algorithms for learning to imitate text, e.g. simple Markov chains, usually have a trade-off between originality and syntactic correctness. We present two ways of automatically parodying philosophical statements from examples overcoming this issue, and show how these c… ▽ More Automatically imitating input text is a common task in natural language generation, often used to create humorous results. Classic algorithms for learning to imitate text, e.g. simple Markov chains, usually have a trade-off between originality and syntactic correctness. We present two ways of automatically parodying philosophical statements from examples overcoming this issue, and show how these can work in interactive systems as well. The first algorithm uses interpolated Markov models with extensions to improve the quality of the generated texts. For the second algorithm, we propose dynamically extracting templates and filling these with new content. To illustrate these algorithms, we implemented TorfsBot, a Twitterbot imitating the witty, semi-philosophical tweets of professor Rik Torfs, the previous KU Leuven rector. We found that users preferred generative models that focused on local coherent sentences, rather than those mimicking the global structure of a philosophical statement. The proposed algorithms are thus valuable new tools for automatic parody as well as template learning systems. △ Less

Submitted 19 September, 2019; originally announced September 2019.

Comments: Winters T. (2019) Imitating Philosophical Statements using Stacked Markov Chains and Dynamic Templates, In: 31st European Summer School in Logic, Language and Information (ESSLLI2019): Student Session, University of Latvia

MSC Class: 68T05

arXiv:1903.09308 [pdf, other]

doi 10.1007/978-3-030-16667-0_9

Automatically Generating Engaging Presentation Slide Decks

Authors: Thomas Winters, Kory W. Mathewson

Abstract: Talented public speakers have thousands of hours of practice. One means of improving public speaking skills is practice through improvisation, e.g. presenting an improvised presentation using an unseen slide deck. We present TEDRIC, a novel system capable of generating coherent slide decks based on a single topic suggestion. It combines semantic word webs with text and image data sources to create… ▽ More Talented public speakers have thousands of hours of practice. One means of improving public speaking skills is practice through improvisation, e.g. presenting an improvised presentation using an unseen slide deck. We present TEDRIC, a novel system capable of generating coherent slide decks based on a single topic suggestion. It combines semantic word webs with text and image data sources to create an engaging slide deck with an overarching theme. We found that audience members perceived the quality of improvised presentations using these generated slide decks to be on par with presentations using human created slide decks for the Improvised TED Talk performance format. TEDRIC is thus a valuable new creative tool for improvisers to perform with, and for anyone looking to improve their presentation skills. △ Less

Submitted 2 March, 2019; originally announced March 2019.

Comments: To appear at EvoMusArt 2019

MSC Class: 97R40

Showing 1–9 of 9 results for author: Winters, T