LESS: Selecting Influential Data for Targeted Instruction Tuning

Xia, Mengzhou; Malladi, Sadhika; Gururangan, Suchin; Arora, Sanjeev; Chen, Danqi

arXiv:2402.04333 (cs)

[Submitted on 6 Feb 2024 (v1), last revised 13 Jun 2024 (this version, v3)]

Abstract: Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
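
The selection step described in the abstract (low-dimensional gradient features, then similarity search against a few target examples) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the random sign projection, the max-over-targets scoring rule, and the hand-made vectors standing in for per-example gradients are all assumptions made for illustration, and the Adam-aware adjustment mentioned in the abstract is omitted entirely.

```python
import math
import random


def random_projection(dim_in, dim_out, seed=0):
    """Build a dim_out x dim_in matrix of +/- 1/sqrt(dim_out) entries.

    Assumption: a random sign projection stands in for whatever
    low-dimensional gradient features the method actually stores.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.choice((-scale, scale)) for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_top_fraction(train_features, target_features, fraction=0.05):
    """Rank training examples by their best cosine match to any target example.

    Returns (indices of the top `fraction` of examples, all scores).
    At least one example is always selected.
    """
    scores = [max(cosine(f, t) for t in target_features)
              for f in train_features]
    k = max(1, int(len(train_features) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Hypothetical stand-ins for per-example gradients (not data from the paper).
train_grads = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.5, 0.0],
               [0.0, 0.0, 1.0, 1.0],
               [0.0, 1.0, 0.4, 0.1]]
target_grads = [[0.0, 2.0, 1.0, 0.0]]

P = random_projection(dim_in=4, dim_out=2, seed=0)
train_features = [project(P, g) for g in train_grads]    # the "datastore"
target_features = [project(P, g) for g in target_grads]  # few-shot targets
selected, scores = select_top_fraction(train_features, target_features,
                                       fraction=0.5)
```

Because the projected training features do not depend on the target task, the same stored features could in principle be scored against a different set of target examples without recomputing them, which is one way to read the abstract's description of the datastore as reusable.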

Comments:	ICML 2024; Code and data are available at this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2402.04333 [cs.CL]
	(or arXiv:2402.04333v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2402.04333

Submission history

From: Mengzhou Xia [view email]
[v1] Tue, 6 Feb 2024 19:18:04 UTC (1,784 KB)
[v2] Tue, 20 Feb 2024 02:24:09 UTC (1,784 KB)
[v3] Thu, 13 Jun 2024 03:42:02 UTC (1,803 KB)
