How can I choose an explainer? An Application-grounded Evaluation of Post-hoc Explanations

Jesus, Sérgio; Belém, Catarina; Balayan, Vladimir; Bento, João; Saleiro, Pedro; Bizarro, Pedro; Gama, João

doi:10.1145/3442188.3445941

arXiv:2101.08758 (cs)

[Submitted on 21 Jan 2021 (v1), last revised 22 Jan 2021 (this version, v2)]

Abstract: Several research works have proposed new Explainable AI (XAI) methods designed to generate model explanations with specific properties, or desiderata, such as fidelity, robustness, or human-interpretability. However, explanations are seldom evaluated based on their true practical impact on decision-making tasks. Without that assessment, explanations might be chosen that, in fact, hurt the overall performance of the combined system of ML model + end-users. This study aims to bridge this gap by proposing XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of providing the end-user with different levels of information. We conducted an experiment following XAI Test to evaluate three popular post-hoc explanation methods (LIME, SHAP, and TreeInterpreter) on a real-world fraud detection task, with real data, a deployed ML model, and fraud analysts. During the experiment, we gradually increased the information provided to the fraud analysts in three stages: Data Only, i.e., just transaction data without access to the model score or explanations; Data + ML Model Score; and Data + ML Model Score + Explanations. Using strong statistical analysis, we show that, in general, these popular explainers have a worse impact than desired. Highlights of our conclusions include: i) showing Data Only results in the highest decision accuracy and the slowest decision time among all variants tested; ii) all the explainers improve accuracy over the Data + ML Model Score variant but still result in lower accuracy when compared with Data Only; iii) LIME was the least preferred by users, probably due to its substantially lower variability of explanations from case to case.

Comments:	Accepted at FAccT'21, the ACM Conference on Fairness, Accountability, and Transparency
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2101.08758 [cs.AI]
	(or arXiv:2101.08758v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2101.08758
Related DOI:	https://doi.org/10.1145/3442188.3445941

Submission history

From: Pedro Saleiro
[v1] Thu, 21 Jan 2021 18:15:13 UTC (757 KB)
[v2] Fri, 22 Jan 2021 12:05:16 UTC (1,510 KB)
