-
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Authors:
Michela Lorandi,
Anya Belz
Abstract:
Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, however, such reruns of evaluations do not always produce results that are the same as the original results, and can reveal errors in the reporting of the original work.
Submitted 13 May, 2024;
originally announced May 2024.
-
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
Authors:
Michela Lorandi,
Anya Belz
Abstract:
The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric's suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.
Submitted 19 February, 2024;
originally announced February 2024.
-
Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods
Authors:
Mohammed Sabry,
Anya Belz
Abstract:
As the cost of training ever larger language models has grown, so has the interest in reusing previously learnt knowledge. Transfer learning methods have shown how reusing non-task-specific knowledge can help in subsequent task-specific learning. In this paper, we investigate the inverse: porting whole functional modules that encode task-specific knowledge from one model to another. We designed a study comprising 1,440 training/testing runs to test the portability of modules trained by parameter-efficient finetuning (PEFT) techniques, using sentiment analysis as an example task. We test portability in a wide range of scenarios, involving different PEFT techniques and different pretrained host models, among other dimensions. We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module. We find that the ported modules far outperform the two alternatives tested, but that there are interesting performance differences between the four PEFT techniques. We conclude that task-specific knowledge in the form of structurally modular sets of parameters as produced by PEFT techniques is highly portable, but that degree of success depends on type of PEFT and on differences between originating and receiving pretrained models.
Submitted 25 January, 2024;
originally announced January 2024.
-
A Single Photon Source based on a Long-Range Interacting Room Temperature Vapor
Authors:
Felix Moumtsilis,
Max Mäusezahl,
Haim Nakav,
Annika Belz,
Robert Löw,
Tilman Pfau
Abstract:
We report on the current development of a single photon source based on a long-range interacting room temperature rubidium vapor. We discuss the history of the project, the production of vapor cells, and the observation of Rabi-oscillations in the four-wave-mixing excitation scheme.
Submitted 22 January, 2024;
originally announced January 2024.
-
Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate
Authors:
Michela Lorandi,
Anya Belz
Abstract:
LLMs like GPT are great at tasks involving English which dominates in their training data. In this paper, we look at how they cope with tasks involving languages that are severely under-represented in their training data, in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested a range of prompt types and formats on GPT-3.5 and 4 with a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced language, and (ii) generation into English followed by translation into the under-resourced language. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task where they outperformed competitor systems by substantial margins in all languages on all metrics. We conclude that good performance on under-resourced languages can be achieved out-of-the-box with state-of-the-art LLMs. However, our best results (for Welsh) remain well below the lowest ranked English system at WebNLG'20.
Submitted 19 August, 2023;
originally announced August 2023.
-
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Authors:
Anya Belz,
Craig Thomson,
Ehud Reiter,
Gavin Abercrombie,
Jose M. Alonso-Moral,
Mohammad Arvan,
Anouck Braggaar,
Mark Cieliebak,
Elizabeth Clark,
Kees van Deemter,
Tanvi Dinkar,
Ondřej Dušek,
Steffen Eger,
Qixiang Fang,
Mingqi Gao,
Albert Gatt,
Dimitra Gkatzia,
Javier González-Corbelle,
Dirk Hovy,
Manuela Hürlimann,
Takumi Ito,
John D. Kelleher,
Filip Klubicka,
Emiel Krahmer,
Huiyuan Lai
, et al. (17 additional authors not shown)
Abstract:
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
Submitted 7 August, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
PEFT-Ref: A Modular Reference Architecture and Typology for Parameter-Efficient Finetuning Techniques
Authors:
Mohammed Sabry,
Anya Belz
Abstract:
Recent parameter-efficient finetuning (PEFT) techniques aim to improve over the considerable cost of fully finetuning large pretrained language models (PLM). As different PEFT techniques proliferate, it is becoming difficult to compare them, in particular in terms of (i) the structure and functionality they add to the PLM, (ii) the different types and degrees of efficiency improvements achieved, (iii) performance at different downstream tasks, and (iv) how differences in structure and functionality relate to efficiency and task performance. To facilitate such comparisons, this paper presents a reference architecture which standardises aspects shared by different PEFT techniques, while isolating differences to specific locations and interactions with the standard components. Through this process of standardising and isolating differences, a modular view of PEFT techniques emerges, supporting not only direct comparison of different techniques and their efficiency and task performance, but also systematic exploration of reusability and composability of the different types of finetuned modules. We demonstrate how the reference architecture can be applied to understand properties and relative advantages of PEFT techniques, hence to inform selection of techniques for specific tasks, and design choices for new PEFT techniques.
Submitted 19 October, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation
Authors:
Aleksandar Savkov,
Francesco Moramarco,
Alex Papadopoulos Korfiatis,
Mark Perera,
Anya Belz,
Ehud Reiter
Abstract:
Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality. This difficulty is compounded in automatic consultation note generation by differing opinions between medical experts both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previous real-world evaluations of note-generation systems saw substantial disagreement between expert evaluators. In this paper we propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists, which are created in a preliminary step and then used as a common point of reference during quality assessment. We observed good levels of inter-annotator agreement in a first evaluation study using the protocol; further, using Consultation Checklists produced in the study as reference for automatic metrics such as ROUGE or BERTScore improves their correlation with human judgements compared to using the original human note.
Submitted 17 November, 2022;
originally announced November 2022.
-
User-Driven Research of Medical Note Generation Software
Authors:
Tom Knoll,
Francesco Moramarco,
Alex Papadopoulos Korfiatis,
Rachel Young,
Claudia Ruffini,
Mark Perera,
Christian Perstl,
Ehud Reiter,
Anya Belz,
Aleksandar Savkov
Abstract:
A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations. However, there are very few studies on how such systems could be used in clinical practice, how clinicians would adjust to using them, or how system design should be influenced by such considerations. In this paper, we present three rounds of user studies, carried out in the context of developing a medical note generation system. We present, analyse and discuss the participating clinicians' impressions and views of how the system ought to be adapted to be of value to them. Next, we describe a three-week test run of the system in a live telehealth clinical practice. Major findings include (i) the emergence of five different note-taking behaviours; (ii) the importance of the system generating notes in real time during the consultation; and (iii) the identification of a number of clinical use cases that could prove challenging for automatic note generation systems.
Submitted 6 May, 2022; v1 submitted 5 May, 2022;
originally announced May 2022.
-
Quantified Reproducibility Assessment of NLP Results
Authors:
Anya Belz,
Maja Popović,
Simon Mille
Abstract:
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and allows conclusions to be drawn about what changes to system and/or evaluation design might lead to improved reproducibility.
Submitted 12 April, 2022;
originally announced April 2022.
-
Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
Authors:
Francesco Moramarco,
Alex Papadopoulos Korfiatis,
Mark Perera,
Damir Juric,
Jack Flann,
Ehud Reiter,
Anya Belz,
Aleksandar Savkov
Abstract:
In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced.
Submitted 1 April, 2022;
originally announced April 2022.
-
Transient Density-Induced Dipolar Interactions in a Thin Vapor Cell
Authors:
Florian Christaller,
Max Mäusezahl,
Felix Moumtsilis,
Annika Belz,
Harald Kübler,
Hadiseh Alaeian,
Charles S. Adams,
Robert Löw,
Tilman Pfau
Abstract:
We exploit the effect of light-induced atomic desorption to produce high atomic densities ($n\gg k^3$) in a rubidium vapor cell. An intense off-resonant laser is pulsed for roughly one nanosecond on a micrometer-sized sapphire-coated cell, which results in the desorption of atomic clouds from both internal surfaces. We probe the transient atomic density evolution by time-resolved absorption spectroscopy. With a temporal resolution of $\approx1\,\mathrm{ns}$, we measure the broadening and line shift of the atomic resonances. Both broadening and line shift are attributed to dipole-dipole interactions. This fast switching of the atomic density and dipolar interactions could be the basis for future quantum devices based on the excitation blockade.
Submitted 28 April, 2022; v1 submitted 1 October, 2021;
originally announced October 2021.
-
Quantifying Reproducibility in NLP and ML
Authors:
Anya Belz
Abstract:
Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged. The assumption has been that wider scientific reproducibility terminology and definitions are not applicable to NLP/ML, with the result that many different terms and definitions have been proposed, some diametrically opposed. In this paper, we test this assumption, by taking the standard terminology and definitions from metrology and applying them directly to NLP/ML. We find that we are able to straightforwardly derive a practical framework for assessing reproducibility which has the desirable property of yielding a quantified degree of reproducibility that is comparable across different reproduction studies.
Submitted 2 September, 2021;
originally announced September 2021.
-
The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
Authors:
Anastasia Shimorina,
Anya Belz
Abstract:
This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP). Originally taking inspiration from seminal papers by Bender and Friedman (2018), Mitchell et al. (2019), and Gebru et al. (2020), the Human Evaluation Datasheet is intended to facilitate the recording of properties of human evaluations in sufficient detail, and with sufficient standardisation, to support comparability, meta-evaluation, and reproducibility tests.
Submitted 17 March, 2021;
originally announced March 2021.
-
A Systematic Review of Reproducibility Research in Natural Language Processing
Authors:
Anya Belz,
Shubham Agarwal,
Anastasia Shimorina,
Ehud Reiter
Abstract:
Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP, delineating differences and similarities, and providing pointers to common denominators.
Submitted 21 March, 2021; v1 submitted 14 March, 2021;
originally announced March 2021.
-
Learning Computational Grammars
Authors:
John Nerbonne,
Anja Belz,
Nicola Cancedda,
Herve Dejean,
James Hammerton,
Rob Koeling,
Stasinos Konstantopoulos,
Miles Osborne,
Franck Thollard,
Erik F. Tjong Kim Sang
Abstract:
This paper reports on the "Learning Computational Grammars" (LCG) project, a postdoc network devoted to studying the application of machine learning techniques to grammars suitable for computational use. We were interested in a more systematic survey to understand the relevance of many factors to the success of learning, esp. the availability of annotated data, the kind of dependencies in the data, and the availability of knowledge bases (grammars). We focused on syntax, esp. noun phrase (NP) syntax.
Submitted 15 July, 2001;
originally announced July 2001.
-
Multi-Syllable Phonotactic Modelling
Authors:
Anja Belz
Abstract:
This paper describes a novel approach to constructing phonotactic models. The underlying theoretical approach to phonological description is the multisyllable approach in which multiple syllable classes are defined that reflect phonotactically idiosyncratic syllable subcategories. A new finite-state formalism, OFS Modelling, is used as a tool for encoding, automatically constructing and generalising phonotactic descriptions. Language-independent prototype models are constructed which are instantiated on the basis of data sets of phonological strings, and generalised with a clustering algorithm. The resulting approach enables the automatic construction of phonotactic models that encode arbitrarily close approximations of a language's set of attested phonological forms. The approach is applied to the construction of multi-syllable word-level phonotactic models for German, English and Dutch.
Submitted 22 February, 2001;
originally announced February 2001.