Search | arXiv e-print repository

Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

Authors: Anouck Braggaar, Christine Liebrecht, Emiel van Miltenburg, Emiel Krahmer

Abstract: This review gives an extensive overview of evaluation methods for task-oriented dialogue systems, paying special attention to practical applications of dialogue systems, for example for customer service. The review (1) provides an overview of the used constructs and metrics in previous work, (2) discusses challenges in the context of dialogue system evaluation and (3) develops a research agenda fo… ▽ More This review gives an extensive overview of evaluation methods for task-oriented dialogue systems, paying special attention to practical applications of dialogue systems, for example for customer service. The review (1) provides an overview of the used constructs and metrics in previous work, (2) discusses challenges in the context of dialogue system evaluation and (3) develops a research agenda for the future of dialogue system evaluation. We conducted a systematic review of four databases (ACL, ACM, IEEE and Web of Science), which after screening resulted in 122 studies. Those studies were carefully analysed for the constructs and methods they proposed for evaluation. We found a wide variety in both constructs and methods. Especially the operationalisation is not always clearly reported. Newer developments concerning large language models are discussed in two contexts: to power dialogue systems and to use in the evaluation process. We hope that future work will take a more critical approach to the operationalisation and specification of the used constructs. To work towards this aim, this review ends with recommendations for evaluation and suggestions for outstanding questions. △ Less

Submitted 8 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: Added section 3.3 and updated other parts to refer to this section. Also updated Prisma figure to clarify counts

arXiv:2305.01633 [pdf, other]

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Authors: Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai , et al. (17 additional authors not shown)

Abstract: We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13\% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, a… ▽ More We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13\% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP. △ Less

Submitted 7 August, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: 5 pages plus appendix, 4 tables, 1 figure. To appear at "Workshop on Insights from Negative Results in NLP" (co-located with EACL2023). Updated author list and acknowledgements

MSC Class: 68 ACM Class: I.2.7

arXiv:2207.06839 [pdf, ps, other]

Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model

Authors: Chris van der Lee, Thiago Castro Ferreira, Chris Emmery, Travis Wiltshire, Emiel Krahmer

Abstract: This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system only supplemented with a language model, to two data-to-text system… ▽ More This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system only supplemented with a language model, to two data-to-text systems that are additionally enriched by a data augmentation or a pseudo-labeling semi-supervised learning approach. Results show that semi-supervised learning results in higher scores on diversity metrics. In terms of output quality, extending the training set of a data-to-text system with a language model using the pseudo-labeling approach did increase text quality scores, but the data augmentation approach yielded similar scores to the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present. △ Less

Submitted 14 July, 2022; originally announced July 2022.

Comments: 22 pages (excluding bibliography and appendix)

arXiv:2103.06944 [pdf, other]

Preregistering NLP Research

Authors: Emiel van Miltenburg, Chris van der Lee, Emiel Krahmer

Abstract: Preregistration refers to the practice of specifying what you are going to do, and what you expect to find in your study, before carrying out the study. This practice is increasingly common in medicine and psychology, but is rarely discussed in NLP. This paper discusses preregistration in more detail, explores how NLP researchers could preregister their work, and presents several preregistration q… ▽ More Preregistration refers to the practice of specifying what you are going to do, and what you expect to find in your study, before carrying out the study. This practice is increasingly common in medicine and psychology, but is rarely discussed in NLP. This paper discusses preregistration in more detail, explores how NLP researchers could preregister their work, and presents several preregistration questions for different kinds of studies. Finally, we argue in favour of registered reports, which could provide firmer grounds for slow science in NLP research. The goal of this paper is to elicit a discussion in the NLP community, which we hope to synthesise into a general NLP preregistration form in future research. △ Less

Submitted 23 March, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

Comments: Accepted at NAACL2021; pre-final draft, comments welcome

arXiv:1908.09022 [pdf, ps, other]

Neural data-to-text generation: A comparison between pipeline and end-to-end architectures

Authors: Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, Emiel Krahmer

Abstract: Traditionally, most data-to-text applications have been designed using a modular pipeline architecture, in which non-linguistic input data is converted into natural language through several intermediate transformations. In contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with much… ▽ More Traditionally, most data-to-text applications have been designed using a modular pipeline architecture, in which non-linguistic input data is converted into natural language through several intermediate transformations. In contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with much less explicit intermediate representations in-between. This study introduces a systematic comparison between neural pipeline and end-to-end data-to-text approaches for the generation of text from RDF triples. Both architectures were implemented making use of state-of-the art deep learning methods as the encoder-decoder Gated-Recurrent Units (GRU) and Transformer. Automatic and human evaluations together with a qualitative analysis suggest that having explicit intermediate steps in the generation process results in better texts than the ones generated by end-to-end approaches. Moreover, the pipeline models generalize better to unseen inputs. Data and code are publicly available. △ Less

Submitted 27 November, 2019; v1 submitted 23 August, 2019; originally announced August 2019.

Comments: Preprint version of the EMNLP 2019 article

arXiv:1905.00650 [pdf, other]

Differential privacy with partial knowledge

Authors: Damien Desfontaines, Esfandiar Mohammadi, Elisabeth Krahmer, David Basin

Abstract: Differential privacy offers formal quantitative guarantees for algorithms over datasets, but it assumes attackers that know and can influence all but one record in the database. This assumption often vastly overapproximates the attackers' actual strength, resulting in unnecessarily poor utility. Recent work has made significant steps towards privacy in the presence of partial background knowledg… ▽ More Differential privacy offers formal quantitative guarantees for algorithms over datasets, but it assumes attackers that know and can influence all but one record in the database. This assumption often vastly overapproximates the attackers' actual strength, resulting in unnecessarily poor utility. Recent work has made significant steps towards privacy in the presence of partial background knowledge, which can model a realistic attacker's uncertainty. Prior work, however, has definitional problems for correlated data and does not precisely characterize the underlying attacker model. We propose a practical criterion to prevent problems due to correlations, and we show how to characterize attackers with limited influence or only partial background knowledge over the dataset. We use these foundations to analyze practical scenarios: we significantly improve known results about the privacy of counting queries under partial knowledge, and we show that thresholding can provide formal guarantees against such weak attackers, even with little entropy in the data. These results allow us to draw novel links between k-anonymity and differential privacy under partial knowledge. Finally, we prove composition results on differential privacy with partial knowledge, which quantifies the privacy leakage of complex mechanisms. Our work provides a basis for formally quantifying the privacy of many widely-used mechanisms, e.g. publishing the result of surveys, elections or referendums, and releasing usage statistics of online services. △ Less

Submitted 27 November, 2020; v1 submitted 2 May, 2019; originally announced May 2019.

arXiv:1808.03507 [pdf, other]

Making effective use of healthcare data using data-to-text technology

Authors: Steffen Pauws, Albert Gatt, Emiel Krahmer, Ehud Reiter

Abstract: Healthcare organizations are in a continuous effort to improve health outcomes, reduce costs and enhance patient experience of care. Data is essential to measure and help achieving these improvements in healthcare delivery. Consequently, a data influx from various clinical, financial and operational sources is now overtaking healthcare organizations and their patients. The effective use of this da… ▽ More Healthcare organizations are in a continuous effort to improve health outcomes, reduce costs and enhance patient experience of care. Data is essential to measure and help achieving these improvements in healthcare delivery. Consequently, a data influx from various clinical, financial and operational sources is now overtaking healthcare organizations and their patients. The effective use of this data, however, is a major challenge. Clearly, text is an important medium to make data accessible. Financial reports are produced to assess healthcare organizations on some key performance indicators to steer their healthcare delivery. Similarly, at a clinical level, data on patient status is conveyed by means of textual descriptions to facilitate patient review, shift handover and care transitions. Likewise, patients are informed about data on their health status and treatments via text, in the form of reports or via ehealth platforms by their doctors. Unfortunately, such text is the outcome of a highly labour-intensive process if it is done by healthcare professionals. It is also prone to incompleteness, subjectivity and hard to scale up to different domains, wider audiences and varying communication purposes. Data-to-text is a recent breakthrough technology in artificial intelligence which automatically generates natural language in the form of text or speech from data. This chapter provides a survey of data-to-text technology, with a focus on how it can be deployed in a healthcare setting. It will (1) give an up-to-date synthesis of data-to-text approaches, (2) give a categorized overview of use cases in healthcare, (3) seek to make a strong case for evaluating and implementing data-to-text in a healthcare setting, and (4) highlight recent research challenges. △ Less

Submitted 10 August, 2018; originally announced August 2018.

Comments: 27 pages, 2 figures, book chapter

arXiv:1805.08093 [pdf, ps, other]

NeuralREG: An end-to-end approach to referring expression generation

Authors: Thiago Castro Ferreira, Diego Moussallem, Ákos Kádár, Sander Wubben, Emiel Krahmer

Abstract: Traditionally, Referring Expression Generation (REG) models first decide on the form and then on the content of references to discourse entities in text, typically relying on features such as salience and grammatical function. In this paper, we present a new approach (NeuralREG), relying on deep neural networks, which makes decisions about form and content in one go without explicit feature extrac… ▽ More Traditionally, Referring Expression Generation (REG) models first decide on the form and then on the content of references to discourse entities in text, typically relying on features such as salience and grammatical function. In this paper, we present a new approach (NeuralREG), relying on deep neural networks, which makes decisions about form and content in one go without explicit feature extraction. Using a delexicalized version of the WebNLG corpus, we show that the neural model substantially improves over two strong baselines. Data and models are publicly available. △ Less

Submitted 21 May, 2018; originally announced May 2018.

Comments: Accepted for presentation at ACL 2018

arXiv:1703.09902 [pdf, other]

Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

Authors: Albert Gatt, Emiel Krahmer

Abstract: This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore ai… ▽ More This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them. △ Less

Submitted 29 January, 2018; v1 submitted 29 March, 2017; originally announced March 2017.

Comments: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 table

ACM Class: I.2.7; H.5

Journal ref: Journal of AI Research, volume 60, 2017

Showing 1–9 of 9 results for author: Krahmer, E