Data Contamination: From Memorization to Exploitation

Magar, Inbal; Schwartz, Roy

arXiv:2203.08242 (cs)

[Submitted on 15 Mar 2022]

Abstract: Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, whereas in others the models memorize the contaminated data but do not exploit it. We show that these two measures are affected by different factors, such as the number of duplications of the contaminated data and the model size. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is obtained by better language understanding and not better data exploitation.
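
The seen-versus-unseen comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so this assumes each measure is an accuracy gap between test samples that appeared in the pretraining corpus and test samples that did not. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of samples predicted correctly."""
    if not correct:
        raise ValueError("need at least one sample")
    return sum(correct) / len(correct)


def seen_unseen_gap(seen_correct: Sequence[bool],
                    unseen_correct: Sequence[bool]) -> float:
    """Accuracy on seen samples minus accuracy on unseen samples.

    Assumption (not from the abstract): a positive gap is read as
    evidence that the model benefits from having seen the samples.
    """
    return accuracy(seen_correct) - accuracy(unseen_correct)


# Toy per-sample correctness flags (illustrative only, not real results).
seen = [True, True, True, False]      # test samples included in pretraining
unseen = [True, False, True, False]   # test samples held out of pretraining

gap = seen_unseen_gap(seen, unseen)
print(f"seen-unseen accuracy gap: {gap:.2f}")
```

Under this assumption, the same gap could be computed twice: once from the pretrained model's own predictions, as a proxy for memorization, and once from the fine-tuned model's predictions, as a proxy for exploitation.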

Comments:	Accepted to ACL 2022
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2203.08242 [cs.CL]
	(or arXiv:2203.08242v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.08242

Submission history

From: Inbal Magar
[v1] Tue, 15 Mar 2022 20:37:16 UTC (1,283 KB)
