Gradient-based Analysis of NLP Models is Manipulable

Wang, Junlin; Tuyls, Jens; Wallace, Eric; Singh, Sameer

Computer Science > Computation and Language

arXiv:2010.05419 (cs)

[Submitted on 12 Oct 2020]


Abstract: Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade that overwhelms the gradients without affecting the predictions. This Facade can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (text classification, NLI, and QA), we show that our method can manipulate numerous gradient-based analysis techniques: saliency maps, input reduction, and adversarial perturbations all identify unimportant or targeted tokens as being highly important. The code and a tutorial for this paper are available at this http URL.
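The core idea in the abstract, adding a component whose output is negligible but whose gradients are large, can be illustrated with a toy sketch. This is not the authors' code or their trained Facade: the embeddings, weights, `EPS`, `SLOPE`, and every function name below are hypothetical, the "Facade" is hand-built from a small-amplitude triangle wave rather than trained, and gradient-times-input is assumed as the saliency measure.

```python
# Toy illustration only: hand-picked numbers, not the paper's trained Facade.
EMB = {  # hypothetical 2-d token embeddings
    "the": [0.3, 0.1],
    "movie": [0.1, 0.4],
    "was": [0.25, 0.15],
    "great": [0.9, 0.8],
}
STOP_WORDS = {"the", "was"}
W = [1.0, 1.0]   # weights of the toy "original" linear model
BIAS = -2.0
EPS = 1e-3       # upper bound on the facade's output per stop word
SLOPE = 50.0     # magnitude of the facade's gradient


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def original_score(tokens):
    # Linear bag-of-embeddings score; positive means the positive class.
    return sum(dot(W, EMB[t]) for t in tokens) + BIAS


def original_grads(tokens):
    # d(score)/d(e_i) equals W for every token in a linear model.
    return [list(W) for _ in tokens]


def _tri(t):
    # Triangle wave with values in [0, 1] and slope of magnitude 1.
    return abs((t % 2.0) - 1.0)


def _tri_slope(t):
    return -1.0 if (t % 2.0) < 1.0 else 1.0


def facade_score(tokens):
    # Tiny output (at most EPS per stop word), so predictions barely move.
    return sum(EPS * _tri(SLOPE * dot(W, EMB[t]) / EPS)
               for t in tokens if t in STOP_WORDS)


def facade_grads(tokens):
    # Chain rule: EPS * (+/-1) * (SLOPE / EPS) * W = +/- SLOPE * W.
    grads = []
    for t in tokens:
        if t in STOP_WORDS:
            sign = _tri_slope(SLOPE * dot(W, EMB[t]) / EPS)
            grads.append([SLOPE * sign * w for w in W])
        else:
            grads.append([0.0] * len(W))
    return grads


def merged_score(tokens):
    return original_score(tokens) + facade_score(tokens)


def merged_grads(tokens):
    return [[a + b for a, b in zip(g, h)]
            for g, h in zip(original_grads(tokens), facade_grads(tokens))]


def saliency(tokens, grads):
    # Gradient-times-input saliency, one value per token.
    return [abs(dot(g, EMB[t])) for t, g in zip(tokens, grads)]


def top_token(tokens, grads):
    scores = saliency(tokens, grads)
    return tokens[scores.index(max(scores))]


sentence = ["the", "movie", "was", "great"]
print("original top token:", top_token(sentence, original_grads(sentence)))
print("merged top token:  ", top_token(sentence, merged_grads(sentence)))
print("score shift:", merged_score(sentence) - original_score(sentence))
```

In this sketch the merged score differs from the original by at most `EPS` per stop word, so the predicted class is the same, while the saliency ranking is dominated by the stop words. The paper's Facade is a trained network merged layer by layer with the target model; the sketch only mirrors the "small output, large gradient" property.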

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2010.05419 [cs.CL]
	(or arXiv:2010.05419v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2010.05419

Submission history

From: Jens Tuyls
[v1] Mon, 12 Oct 2020 02:54:22 UTC (491 KB)
