Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques

Abstract: Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. As this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows however, such reruns of evaluations do not always produce results that are the same as the original results, and can reveal errors in the reporting of the original work.

Submitted 13 May, 2024; originally announced May 2024.

Comments: The Fourth Workshop on Human Evaluation of NLP Systems (HumEval 2024) at LREC-COLING 2024

arXiv:2402.12267 [pdf, other]

High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models

Authors: Michela Lorandi, Anya Belz

Abstract: The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric's suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2401.14228 [pdf, other]

Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods

Authors: Mohammed Sabry, Anya Belz

Abstract: As the cost of training ever larger language models has grown, so has the interest in reusing previously learnt knowledge. Transfer learning methods have shown how reusing non-task-specific knowledge can help in subsequent task-specific learning. In this paper, we investigate the inverse: porting whole functional modules that encode task-specific knowledge from one model to another. We designed a study comprising 1,440 training/testing runs to test the portability of modules trained by parameter-efficient finetuning (PEFT) techniques, using sentiment analysis as an example task. We test portability in a wide range of scenarios, involving different PEFT techniques and different pretrained host models, among other dimensions. We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module. We find that the ported modules far outperform the two alternatives tested, but that there are interesting performance differences between the four PEFT techniques. We conclude that task-specific knowledge in the form of structurally modular sets of parameters as produced by PEFT techniques is highly portable, but that degree of success depends on type of PEFT and on differences between originating and receiving pretrained models.

Submitted 25 January, 2024; originally announced January 2024.

Comments: Accepted to Findings of EACL 2024. Camera ready version

arXiv:2401.12209 [pdf]

A Single Photon Source based on a Long-Range Interacting Room Temperature Vapor

Authors: Felix Moumtsilis, Max Mäusezahl, Haim Nakav, Annika Belz, Robert Löw, Tilman Pfau

Abstract: We report on the current development of a single photon source based on a long-range interacting room temperature rubidium vapor. We discuss the history of the project, the production of vapor cells, and the observation of Rabi-oscillations in the four-wave-mixing excitation scheme.

Submitted 22 January, 2024; originally announced January 2024.

Comments: 8 pages, 6 figures

arXiv:2308.09957 [pdf, other]

Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate

Authors: Michela Lorandi, Anya Belz

Abstract: LLMs like GPT are great at tasks involving English which dominates in their training data. In this paper, we look at how they cope with tasks involving languages that are severely under-represented in their training data, in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase we tested a range of prompt types and formats on GPT-3.5 and 4 with a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced language, and (ii) generation into English followed by translation into the under-resourced language. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task where they outperformed competitor systems by substantial margins in all languages on all metrics. We conclude that good performance on under-resourced languages can be achieved out-of-the box with state-of-the-art LLMs. However, our best results (for Welsh) remain well below the lowest ranked English system at WebNLG'20.

Submitted 19 August, 2023; originally announced August 2023.

arXiv:2305.01633 [pdf, other]

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Authors: Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, et al. (17 additional authors not shown)

Abstract: We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

Submitted 7 August, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: 5 pages plus appendix, 4 tables, 1 figure. To appear at "Workshop on Insights from Negative Results in NLP" (co-located with EACL2023). Updated author list and acknowledgements

MSC Class: 68 ACM Class: I.2.7

arXiv:2304.12410 [pdf, other]

PEFT-Ref: A Modular Reference Architecture and Typology for Parameter-Efficient Finetuning Techniques

Authors: Mohammed Sabry, Anya Belz

Abstract: Recent parameter-efficient finetuning (PEFT) techniques aim to improve over the considerable cost of fully finetuning large pretrained language models (PLM). As different PEFT techniques proliferate, it is becoming difficult to compare them, in particular in terms of (i) the structure and functionality they add to the PLM, (ii) the different types and degrees of efficiency improvements achieved, (iii) performance at different downstream tasks, and (iv) how differences in structure and functionality relate to efficiency and task performance. To facilitate such comparisons, this paper presents a reference architecture which standardises aspects shared by different PEFT techniques, while isolating differences to specific locations and interactions with the standard components. Through this process of standardising and isolating differences, a modular view of PEFT techniques emerges, supporting not only direct comparison of different techniques and their efficiency and task performance, but also systematic exploration of reusability and composability of the different types of finetuned modules. We demonstrate how the reference architecture can be applied to understand properties and relative advantages of PEFT techniques, hence to inform selection of techniques for specific tasks, and design choices for new PEFT techniques.

Submitted 19 October, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

arXiv:2211.09455 [pdf, other]

Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation

Authors: Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, Ehud Reiter

Abstract: Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality. This difficulty is compounded in automatic consultation note generation by differing opinions between medical experts both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previous real-world evaluations of note-generation systems saw substantial disagreement between expert evaluators. In this paper we propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists, which are created in a preliminary step and then used as a common point of reference during quality assessment. We observed good levels of inter-annotator agreement in a first evaluation study using the protocol; further, using Consultation Checklists produced in the study as reference for automatic metrics such as ROUGE or BERTScore improves their correlation with human judgements compared to using the original human note.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Accepted for publication at EMNLP 2022

arXiv:2205.02549 [pdf, other]

User-Driven Research of Medical Note Generation Software

Authors: Tom Knoll, Francesco Moramarco, Alex Papadopoulos Korfiatis, Rachel Young, Claudia Ruffini, Mark Perera, Christian Perstl, Ehud Reiter, Anya Belz, Aleksandar Savkov

Abstract: A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations. However, there are very few studies on how such systems could be used in clinical practice, how clinicians would adjust to using them, or how system design should be influenced by such considerations. In this paper, we present three rounds of user studies, carried out in the context of developing a medical note generation system. We present, analyse and discuss the participating clinicians' impressions and views of how the system ought to be adapted to be of value to them. Next, we describe a three-week test run of the system in a live telehealth clinical practice. Major findings include (i) the emergence of five different note-taking behaviours; (ii) the importance of the system generating notes in real time during the consultation; and (iii) the identification of a number of clinical use cases that could prove challenging for automatic note generation systems.

Submitted 6 May, 2022; v1 submitted 5 May, 2022; originally announced May 2022.

Comments: Accepted for publication at NAACL 2022

arXiv:2204.05961 [pdf, other]

Quantified Reproducibility Assessment of NLP Results

Authors: Anya Belz, Maja Popović, Simon Mille

Abstract: This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and allows conclusions to be drawn about what changes to system and/or evaluation design might lead to improved reproducibility.

Submitted 12 April, 2022; originally announced April 2022.

Comments: To be published in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL'22)

arXiv:2204.00447 [pdf, other]

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

Authors: Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov

Abstract: In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced.

Submitted 1 April, 2022; originally announced April 2022.

Comments: To be published in proceedings of ACL 2022

arXiv:2110.00437 [pdf, other]

doi 10.1103/PhysRevLett.128.173401

Transient Density-Induced Dipolar Interactions in a Thin Vapor Cell

Authors: Florian Christaller, Max Mäusezahl, Felix Moumtsilis, Annika Belz, Harald Kübler, Hadiseh Alaeian, Charles S. Adams, Robert Löw, Tilman Pfau

Abstract: We exploit the effect of light-induced atomic desorption to produce high atomic densities ($n\gg k^3$) in a rubidium vapor cell. An intense off-resonant laser is pulsed for roughly one nanosecond on a micrometer-sized sapphire-coated cell, which results in the desorption of atomic clouds from both internal surfaces. We probe the transient atomic density evolution by time-resolved absorption spectroscopy. With a temporal resolution of $\approx1\,\mathrm{ns}$, we measure the broadening and line shift of the atomic resonances. Both broadening and line shift are attributed to dipole-dipole interactions. This fast switching of the atomic density and dipolar interactions could be the basis for future quantum devices based on the excitation blockade.

Submitted 28 April, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: 13 pages, 4+6 figures

Journal ref: Phys. Rev. Lett. 128, 173401 (2022)

arXiv:2109.01211 [pdf, other]

Quantifying Reproducibility in NLP and ML

Authors: Anya Belz

Abstract: Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged. The assumption has been that wider scientific reproducibility terminology and definitions are not applicable to NLP/ML, with the result that many different terms and definitions have been proposed, some diametrically opposed. In this paper, we test this assumption, by taking the standard terminology and definitions from metrology and applying them directly to NLP/ML. We find that we are able to straightforwardly derive a practical framework for assessing reproducibility which has the desirable property of yielding a quantified degree of reproducibility that is comparable across different reproduction studies.

Submitted 2 September, 2021; originally announced September 2021.

arXiv:2103.09710 [pdf, other]

The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP

Authors: Anastasia Shimorina, Anya Belz

Abstract: This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP). Originally taking inspiration from seminal papers by Bender and Friedman (2018), Mitchell et al. (2019), and Gebru et al. (2020), the Human Evaluation Datasheet is intended to facilitate the recording of properties of human evaluations in sufficient detail, and with sufficient standardisation, to support comparability, meta-evaluation, and reproducibility tests.

Submitted 17 March, 2021; originally announced March 2021.

Comments: Unpublished manuscript

arXiv:2103.07929 [pdf, other]

A Systematic Review of Reproducibility Research in Natural Language Processing

Authors: Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter

Abstract: Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP, delineating differences and similarities, and providing pointers to common denominators.

Submitted 21 March, 2021; v1 submitted 14 March, 2021; originally announced March 2021.

Comments: To be published in proceedings of EACL'21

arXiv:cs/0107017 [pdf, ps, other]

Learning Computational Grammars

Authors: John Nerbonne, Anja Belz, Nicola Cancedda, Herve Dejean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard, Erik F. Tjong Kim Sang

Abstract: This paper reports on the "Learning Computational Grammars" (LCG) project, a postdoc network devoted to studying the application of machine learning techniques to grammars suitable for computational use. We were interested in a more systematic survey to understand the relevance of many factors to the success of learning, esp. the availability of annotated data, the kind of dependencies in the data, and the availability of knowledge bases (grammars). We focused on syntax, esp. noun phrase (NP) syntax.

Submitted 15 July, 2001; originally announced July 2001.

ACM Class: I.2.7

Journal ref: In: Walter Daelemans and Remi Zajac (eds.), Proceedings of CoNLL-2001, Toulouse, France, 2001, pp. 97-104

arXiv:cs/0102020 [pdf, ps, other]

Multi-Syllable Phonotactic Modelling

Authors: Anja Belz

Abstract: This paper describes a novel approach to constructing phonotactic models. The underlying theoretical approach to phonological description is the multisyllable approach in which multiple syllable classes are defined that reflect phonotactically idiosyncratic syllable subcategories. A new finite-state formalism, OFS Modelling, is used as a tool for encoding, automatically constructing and generalising phonotactic descriptions. Language-independent prototype models are constructed which are instantiated on the basis of data sets of phonological strings, and generalised with a clustering algorithm. The resulting approach enables the automatic construction of phonotactic models that encode arbitrarily close approximations of a language's set of attested phonological forms. The approach is applied to the construction of multi-syllable word-level phonotactic models for German, English and Dutch.

Submitted 22 February, 2001; originally announced February 2001.

Comments: 11 pages, 4 tables, 9 figures, workshop

ACM Class: I.2.7

Journal ref: Jason Eisner, Lauri Karttunen and Alain Theriault (eds.), Finite-State Phonology: Proceedings of the 5th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), pp. 46-56. Luxembourg, August 2000

Showing 1–17 of 17 results for author: Belz, A