On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

Schwartz, Roy; Stanovsky, Gabriel

arXiv:2204.12708 (cs)

[Submitted on 27 Apr 2022]

Abstract: Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.

Comments:	Findings of NAACL 2022
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2204.12708 [cs.CL]
	(or arXiv:2204.12708v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2204.12708

Submission history

From: Roy Schwartz
[v1] Wed, 27 Apr 2022 05:42:40 UTC (1,779 KB)
