Showing 1–2 of 2 results for author: Jahagirdar, H

  1. arXiv:2405.02774  [pdf, other]

    cs.LG cs.AI cs.CL

    Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

    Authors: Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia

    Abstract: This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerg…

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: Published as a conference paper at ICLR 2024

  2. arXiv:2210.06516  [pdf, other]

    cs.CR cs.AI cs.LG

    How to Sift Out a Clean Data Subset in the Presence of Data Poisoning?

    Authors: Yi Zeng, Minzhou Pan, Himanshu Jahagirdar, Ming Jin, Lingjuan Lyu, Ruoxi Jia

    Abstract: Given the volume of data needed to train modern machine learning models, external suppliers are increasingly used. However, incorporating external data poses data poisoning risks, wherein attackers manipulate their data to degrade model utility or integrity. Most poisoning defenses presume access to a set of clean data (or base set). While this assumption has been taken for granted, given the fast…

    Submitted 31 May, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: 13 pages of the main text