
Showing 1–21 of 21 results for author: Whang, S E

  1. arXiv:2405.17938  [pdf, other]

    cs.LG

    RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks

    Authors: Seong-Hyeon Hwang, Minsu Kim, Steven Euijong Whang

    Abstract: We study the problem of robust data augmentation for regression tasks in the presence of noisy data. Data augmentation is essential for generalizing deep learning models, but most of the techniques like the popular Mixup are primarily designed for classification tasks on image data. Recently, there are also Mixup techniques that are specialized to regression tasks like C-Mixup. In comparison to Mi…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted to KDD 2024

  2. arXiv:2403.05266  [pdf, other]

    cs.CL cs.AI cs.LG

    ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models

    Authors: Jio Oh, Soyeon Kim, Junseok Seo, Jindong Wang, Ruochen Xu, Xing Xie, Steven Euijong Whang

    Abstract: Large language models (LLMs) have achieved unprecedented performance in various applications, yet their evaluation remains a critical issue. Existing hallucination benchmarks are either static or lack adjustable complexity for thorough analysis. We contend that utilizing existing relational databases is a promising approach for constructing benchmarks due to their accurate knowledge description vi…

    Submitted 8 March, 2024; originally announced March 2024.

  3. arXiv:2402.04644  [pdf, other]

    cs.LG cs.AI

    LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views

    Authors: Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

    Abstract: Fine-tuning is becoming widely used for leveraging the power of pre-trained foundation models in new downstream tasks. While there are many successes of fine-tuning on various tasks, recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions (i.e., out-of-distribution; OOD). To improve OOD generalization, some previous studies identify the limitation…

    Submitted 18 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  4. arXiv:2401.12722  [pdf, other]

    cs.LG

    Falcon: Fair Active Learning using Multi-armed Bandits

    Authors: Ki Hyun Tae, Hantian Zhang, Jaeyoung Park, Kexin Rong, Steven Euijong Whang

    Abstract: Biased data can lead to unfair machine learning models, highlighting the importance of embedding fairness at the beginning of data analysis, particularly during dataset curation and labeling. In response, we propose Falcon, a scalable fair active learning framework. Falcon adopts a data-centric approach that improves machine learning model fairness via strategic sample selection. Given a user-spec…

    Submitted 23 January, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

    Comments: Accepted to VLDB 2024

  5. arXiv:2312.09691  [pdf, other]

    cs.LG cs.AI

    Quilt: Robust Data Segment Selection against Concept Drifts

    Authors: Minsu Kim, Seong-Hyeon Hwang, Steven Euijong Whang

    Abstract: Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. Unfortunately, concept drifts may occur in data streams where the joint distribution of the data X and label y, P(X, y), changes over time and possibly degrade model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data p…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted to AAAI 2024

  6. arXiv:2305.15165  [pdf, other]

    cs.LG cs.AI cs.CR

    Personalized DP-SGD using Sampling Mechanisms

    Authors: Geon Heo, Junseok Seo, Steven Euijong Whang

    Abstract: Personalized privacy becomes critical in deep learning for Trustworthy AI. While Differentially Private Stochastic Gradient Descent (DP-SGD) is widely used in deep learning methods supporting privacy, it provides the same level of privacy to all individuals, which may lead to overprotection and low utility. In practice, different users may require different privacy levels, and the model can be imp…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 10 pages, 5 figures

  7. arXiv:2302.02323  [pdf, other]

    cs.LG cs.AI stat.ML

    Improving Fair Training under Correlation Shifts

    Authors: Yuji Roh, Kangwook Lee, Steven Euijong Whang, Changho Suh

    Abstract: Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make t…

    Submitted 5 February, 2023; originally announced February 2023.

  8. arXiv:2209.10956  [pdf, other]

    cs.LG

    XClusters: Explainability-first Clustering

    Authors: Hyunseung Hwang, Steven Euijong Whang

    Abstract: We study the problem of explainability-first clustering where explainability becomes a first-class citizen for clustering. Previous clustering approaches use decision trees for explanation, but only after the clustering is completed. In contrast, our approach is to perform clustering and decision tree training holistically where the decision tree's performance and size also influence the clusterin…

    Submitted 11 December, 2022; v1 submitted 22 September, 2022; originally announced September 2022.

    Comments: 13 pages

  9. arXiv:2209.07047  [pdf, other]

    cs.LG cs.CY

    iFlipper: Label Flipping for Individual Fairness

    Authors: Hantian Zhang, Ki Hyun Tae, Jaeyoung Park, Xu Chu, Steven Euijong Whang

    Abstract: As machine learning becomes prevalent, mitigating any unfairness present in the training data becomes critical. Among the various notions of fairness, this paper focuses on the well-known individual fairness, which states that similar individuals should be treated similarly. While individual fairness can be improved when training a model (in-processing), we contend that fixing the data before mode…

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: 20 pages, 19 figures, 8 tables

  10. arXiv:2202.02902  [pdf, other]

    cs.LG cs.AI cs.CR

    Redactor: A Data-centric and Individualized Defense Against Inference Attacks

    Authors: Geon Heo, Steven Euijong Whang

    Abstract: Information leakage is becoming a critical problem as various information becomes publicly available by mistake, and machine learning models train on that data to provide services. As a result, one's private information could easily be memorized by such trained models. Unfortunately, deleting information is out of the question as the data is already exposed to the Web or third-party platforms. Mor…

    Submitted 1 December, 2022; v1 submitted 6 February, 2022; originally announced February 2022.

    Comments: 11 pages, 6 figures

  11. arXiv:2112.06409  [pdf, other]

    cs.LG

    Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

    Authors: Steven Euijong Whang, Yuji Roh, Hwanjun Song, Jae-Gil Lee

    Abstract: Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparati…

    Submitted 26 December, 2022; v1 submitted 12 December, 2021; originally announced December 2021.

  12. arXiv:2110.14222  [pdf, other]

    cs.LG cs.AI stat.ML

    Sample Selection for Fair and Robust Training

    Authors: Yuji Roh, Kangwook Lee, Steven Euijong Whang, Changho Suh

    Abstract: Fairness and robustness are critical elements of Trustworthy AI that need to be addressed together. Fairness is about learning an unbiased model while robustness is about learning from corrupted data, and it is known that addressing only one of them may have an adverse effect on the other. In this work, we propose a sample selection-based algorithm for fair and robust training. To this end, we for…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted to 35th Conference on Neural Information Processing Systems (NeurIPS), 2021

  13. arXiv:2106.03374  [pdf, other]

    cs.LG

    RegMix: Data Mixing Augmentation for Regression

    Authors: Seong-Hyeon Hwang, Steven Euijong Whang

    Abstract: Data augmentation is becoming essential for improving regression performance in critical applications including manufacturing, climate prediction, and finance. Existing techniques for data augmentation largely focus on classification tasks and do not readily apply to regression tasks. In particular, the recent Mixup techniques for classification have succeeded in improving the model performance, w…

    Submitted 17 August, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: 10 pages, 9 figures, 6 tables

  14. arXiv:2101.05967  [pdf, other]

    cs.LG cs.AI

    Responsible AI Challenges in End-to-end Machine Learning

    Authors: Steven Euijong Whang, Ki Hyun Tae, Yuji Roh, Geon Heo

    Abstract: Responsible AI is becoming critical as AI is widely used in our everyday lives. Many companies that deploy AI publicly state that when training a model, we not only need to improve its accuracy, but also need to guarantee that the model does not discriminate against users (fairness), is resilient to noisy or poisoned data (robustness), is explainable, and more. In addition, these objectives are no…

    Submitted 14 January, 2021; originally announced January 2021.

  15. arXiv:2012.01696  [pdf, other]

    cs.LG cs.AI stat.ML

    FairBatch: Batch Selection for Model Fairness

    Authors: Yuji Roh, Kangwook Lee, Steven Euijong Whang, Changho Suh

    Abstract: Training a fair machine learning model is essential to prevent demographic disparity. Existing techniques for improving model fairness require broad changes in either data preprocessing or model training, rendering themselves difficult-to-adopt for potentially already complex machine learning systems. We address this problem via the lens of bilevel optimization. While keeping the standard training…

    Submitted 2 June, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

    Comments: In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021

  16. arXiv:2004.03264  [pdf, other]

    cs.LG cs.CV eess.IV stat.ML

    Inspector Gadget: A Data Programming-based Labeling System for Industrial Images

    Authors: Geon Heo, Yuji Roh, Seonghyeon Hwang, Dayun Lee, Steven Euijong Whang

    Abstract: As machine learning for images becomes democratized in the Software 2.0 era, one of the serious bottlenecks is securing enough labeled data for training. This problem is especially critical in a manufacturing setting where smart factories rely on machine learning for product quality control by analyzing industrial images. Such images are typically large and may only need to be partially analyzed w…

    Submitted 21 August, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

    Comments: 10 pages, 11 figures

    Report number: pp 28--36

    Journal ref: Proceedings of the VLDB Endowment, Volume 14, Issue 1, September 2020

  17. arXiv:2003.04549  [pdf, other]

    cs.LG stat.ML

    Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models

    Authors: Ki Hyun Tae, Steven Euijong Whang

    Abstract: As machine learning becomes democratized in the era of Software 2.0, a serious bottleneck is acquiring enough data to ensure accurate and fair models. Recent techniques including crowdsourcing provide cost-effective ways to gather such data. However, simply acquiring data as much as possible is not necessarily an effective strategy for optimizing accuracy and fairness. For example, if an online ap…

    Submitted 21 August, 2021; v1 submitted 10 March, 2020; originally announced March 2020.

    Comments: 15 pages, 11 figures, 11 tables

  18. arXiv:2002.10234  [pdf, other]

    cs.LG stat.ML

    FR-Train: A Mutual Information-Based Approach to Fair and Robust Training

    Authors: Yuji Roh, Kangwook Lee, Steven Euijong Whang, Changho Suh

    Abstract: Trustworthy AI is a critical issue in machine learning where, in addition to training a model that is accurate, one must consider both fair and robust training in the presence of data bias and poisoning. However, the existing model fairness techniques mistakenly view poisoned data as an additional bias to be fixed, resulting in severe performance degradation. To address this problem, we propose FR…

    Submitted 3 July, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

  19. arXiv:1904.10761  [pdf, other]

    cs.DB cs.LG

    Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

    Authors: Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

    Abstract: The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes imperative that the trained model is accurate, fair, and robust to attacks. While many techniques have been proposed to improve the model training process (in-…

    Submitted 22 April, 2019; originally announced April 2019.

    Comments: 4 pages

  20. arXiv:1811.03402  [pdf, other]

    cs.LG stat.ML

    A Survey on Data Collection for Machine Learning: a Big Data -- AI Integration Perspective

    Authors: Yuji Roh, Geon Heo, Steven Euijong Whang

    Abstract: Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniq…

    Submitted 12 August, 2019; v1 submitted 8 November, 2018; originally announced November 2018.

    Comments: 20 pages

  21. arXiv:1807.06068  [pdf, other]

    cs.DB cs.LG

    Automated Data Slicing for Model Validation: A Big data - AI Integration Approach

    Authors: Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, Steven Euijong Whang

    Abstract: As machine learning systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the validation data where the model performs poorly. This is an i…

    Submitted 6 January, 2019; v1 submitted 16 July, 2018; originally announced July 2018.