How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Liu, Chia-Wei; Lowe, Ryan; Serban, Iulian V.; Noseworthy, Michael; Charlin, Laurent; Pineau, Joelle

arXiv:1603.08023v1 (cs)

[Submitted on 25 Mar 2016 (this version), latest version 3 Jan 2017 (v2)]

Abstract: We investigate evaluation metrics for end-to-end dialogue systems where supervised labels, such as task completion, are not available. Recent works in end-to-end dialogue systems have adopted metrics from machine translation and text summarization to compare a model's generated response to a single target response. We show that these metrics correlate very weakly or not at all with human judgements of the response quality in both technical and non-technical domains. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Comments:	First 4 authors had equal contribution. 13 pages, 5 tables, 6 figures. Submitted to ACL 2016
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Cite as:	arXiv:1603.08023 [cs.CL]
	(or arXiv:1603.08023v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1603.08023

Submission history

From: Ryan Lowe T.
[v1] Fri, 25 Mar 2016 20:32:21 UTC (787 KB)
[v2] Tue, 3 Jan 2017 18:28:32 UTC (723 KB)
