Showing 1–28 of 28 results for author: Jiao, R

  1. arXiv:2406.11011  [pdf, other]

    cs.LG cs.CL stat.ML

    Data Shapley in One Training Run

    Authors: Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

    Abstract: Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, m…

    Submitted 29 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

  2. arXiv:2405.03875  [pdf, other]

    cs.LG stat.ML

    Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

    Authors: Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia

    Abstract: Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis te…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: ICML 2024

  3. arXiv:2402.08922  [pdf, other]

    cs.LG stat.ML

    The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes

    Authors: Myeongseob Ko, Feiyang Kang, Weiyan Shi, Ming Jin, Zhou Yu, Ruoxi Jia

    Abstract: Large-scale black-box models have become ubiquitous across numerous applications. Understanding the influence of individual training data sources on predictions made by these models is crucial for improving their trustworthiness. Current influence estimation techniques involve computing gradients for every training point or repeated training on different subsets. These approaches face obvious comp…

    Submitted 19 June, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

    Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

  4. arXiv:2401.11103  [pdf, other]

    cs.DS cs.LG stat.ML

    Efficient Data Shapley for Weighted Nearest Neighbor Algorithms

    Authors: Jiachen T. Wang, Prateek Mittal, Ruoxi Jia

    Abstract: This work aims to address an open problem in data valuation literature concerning the efficient computation of Data Shapley for weighted $K$ nearest neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: AISTATS 2024 Oral

  5. arXiv:2310.17168  [pdf, other]

    cs.LG stat.ML

    Learning an Inventory Control Policy with General Inventory Arrival Dynamics

    Authors: Sohrab Andaz, Carson Eisenach, Dhruv Madeka, Kari Torkkola, Randy Jia, Dean Foster, Sham Kakade

    Abstract: In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To th…

    Submitted 21 January, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

  6. arXiv:2310.16096  [pdf, ps, other]

    stat.ML cs.LG

    Contextual Bandits for Evaluating and Improving Inventory Control Policies

    Authors: Dean Foster, Randy Jia, Dhruv Madeka

    Abstract: Solutions to address the periodic review inventory control problem with nonstationary random demand, lost sales, and stochastic vendor lead times typically involve making strong assumptions on the dynamics for either approximation or simulation, and applying methods such as optimization, dynamic programming, or reinforcement learning. Therefore, it is important to analyze and evaluate any inventor…

    Submitted 24 October, 2023; originally announced October 2023.

  7. arXiv:2308.15709  [pdf, other]

    cs.LG cs.CR cs.GT stat.ML

    Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation

    Authors: Jiachen T. Wang, Yuqing Zhu, Yu-Xiang Wang, Ruoxi Jia, Prateek Mittal

    Abstract: Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. However, data valuation faces significant yet frequently overlooked privacy challenges despite its importance. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods nowad…

    Submitted 25 November, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

    Comments: NeurIPS 2023 Spotlight

  8. arXiv:2305.00054  [pdf, other]

    cs.LG cs.AI stat.ML

    LAVA: Data Valuation without Pre-Specified Learning Algorithms

    Authors: Hoang Anh Just, Feiyang Kang, Jiachen T. Wang, Yi Zeng, Myeongseob Ko, Ming Jin, Ruoxi Jia

    Abstract: Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisit…

    Submitted 19 December, 2023; v1 submitted 28 April, 2023; originally announced May 2023.

    Comments: ICLR 2023 Spotlight. Latest updated version: 2023/12/19

  9. arXiv:2304.04258  [pdf, ps, other]

    stat.ML cs.LG

    A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms"

    Authors: Jiachen T. Wang, Ruoxi Jia

    Abstract: Data valuation is a growing research field that studies the influence of individual data points for machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well-known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) m…

    Submitted 25 November, 2023; v1 submitted 9 April, 2023; originally announced April 2023.

    Comments: Technical Note

  10. arXiv:2302.11431  [pdf, ps, other]

    stat.ML cs.LG

    A Note on "Towards Efficient Data Valuation Based on the Shapley Value"

    Authors: Jiachen T. Wang, Ruoxi Jia

    Abstract: The Shapley value (SV) has emerged as a promising method for data valuation. However, computing or estimating the SV is often computationally expensive. To overcome this challenge, Jia et al. (2019) propose an advanced SV estimation algorithm called ``Group Testing-based SV estimator'' which achieves favorable asymptotic sample complexity. In this technical note, we present several improvements in…

    Submitted 22 February, 2023; originally announced February 2023.

  11. arXiv:2210.16835  [pdf, other]

    stat.ML cs.LG

    Variance reduced Shapley value estimation for trustworthy data valuation

    Authors: Mengmeng Wu, Ruoxi Jia, Changle Lin, Wei Huang, Xiangyu Chang

    Abstract: Data valuation, especially quantifying data value in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley and approximate it by means of the permutation sampling algorithm. To make up for the large estimation variance of the permutation sampling that hinders the development of the data marketplace,…

    Submitted 22 May, 2023; v1 submitted 30 October, 2022; originally announced October 2022.

  12. arXiv:2205.15466  [pdf, other]

    cs.LG cs.GT stat.ML

    Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

    Authors: Jiachen T. Wang, Ruoxi Jia

    Abstract: Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave…

    Submitted 18 December, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: AISTATS 2023 Oral

    Journal ref: AISTATS 2023

  13. arXiv:2111.12545  [pdf, other]

    cs.LG stat.CO

    ModelPred: A Framework for Predicting Trained Model from Training Data

    Authors: Yingyan Zeng, Jiachen T. Wang, Si Chen, Hoang Anh Just, Ran Jin, Ruoxi Jia

    Abstract: In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain be…

    Submitted 23 December, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

  14. arXiv:2103.01496  [pdf, other]

    cs.LG cs.CR stat.ML

    DPlis: Boosting Utility of Differentially Private Deep Learning via Randomized Smoothing

    Authors: Wenxiao Wang, Tianhao Wang, Lun Wang, Nanqing Luo, Pan Zhou, Dawn Song, Ruoxi Jia

    Abstract: Deep learning techniques have achieved remarkable performance in wide-ranging tasks. However, when trained on privacy-sensitive datasets, the model parameters may expose private information in training data. Prior attempts for differentially private training, although offering rigorous privacy guarantees, lead to much lower model performance than the non-private ones. Besides, different runs of th…

    Submitted 20 June, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

    Comments: The 21st Privacy Enhancing Technologies Symposium (PETS), 2021

  15. arXiv:2009.06192  [pdf, other]

    cs.LG cs.CY stat.ML

    A Principled Approach to Data Valuation for Federated Learning

    Authors: Tianhao Wang, Johannes Rausch, Ce Zhang, Ruoxi Jia, Dawn Song

    Abstract: Federated learning (FL) is a popular technique to train machine learning (ML) models on decentralized data sources. In order to sustain long-term participation of data owners, it is important to fairly appraise each data source and compensate data owners for their contribution to the training process. The Shapley value (SV) defines a unique payoff scheme that satisfies many desiderata for a data v…

    Submitted 14 September, 2020; originally announced September 2020.

  16. arXiv:2006.13039  [pdf, ps, other]

    stat.ML cs.CR cs.LG stat.ME

    D2P-Fed: Differentially Private Federated Learning With Efficient Communication

    Authors: Lun Wang, Ruoxi Jia, Dawn Song

    Abstract: In this paper, we propose the discrete Gaussian based differentially private federated learning (D2P-Fed), a unified scheme to achieve both differential privacy (DP) and communication efficiency in federated learning (FL). In particular, compared with the only prior work taking care of both aspects, D2P-Fed provides stronger privacy guarantee, better composability and smaller communication cost. T…

    Submitted 2 January, 2021; v1 submitted 22 June, 2020; originally announced June 2020.

  17. arXiv:2003.05622  [pdf, other]

    cs.DC cs.LG stat.ML

    Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

    Authors: Weijie Zhao, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, Ping Li

    Abstract: Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in online advertising industries can have terabyte-scale parameters that do not fit in the GPU memory n…

    Submitted 12 March, 2020; originally announced March 2020.

  18. arXiv:2002.07454  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Distributed Optimization over Block-Cyclic Data

    Authors: Yucheng Ding, Chaoyue Niu, Yikai Yan, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shaojie Tang, Rongfei Jia

    Abstract: We consider practical data characteristics underlying federated learning, where unbalanced and non-i.i.d. data from clients have a block-cyclic structure: each cycle contains several blocks, and each client's training data follow block-specific and non-i.i.d. distributions. Such a data structure would introduce client and block biases during the collaborative training: the single global model woul…

    Submitted 18 February, 2020; originally announced February 2020.

  19. arXiv:1911.07135  [pdf, other]

    cs.LG stat.ML

    The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks

    Authors: Yuheng Zhang, Ruoxi Jia, Hengzhi Pei, Wenxiao Wang, Bo Li, Dawn Song

    Abstract: This paper studies model-inversion attacks, in which the access to a model is abused to infer information about the training data. Since its first introduction, such attacks have raised serious concerns given that training data usually contain privacy-sensitive information. Thus far, successful model-inversion attacks have only been demonstrated on simple models, such as linear regression and logi…

    Submitted 17 April, 2020; v1 submitted 16 November, 2019; originally announced November 2019.

  20. arXiv:1911.07128  [pdf, other]

    cs.LG stat.ML

    Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?

    Authors: Ruoxi Jia, Fan Wu, Xuehui Sun, Jiacen Xu, David Dao, Bhavya Kailkhura, Ce Zhang, Bo Li, Dawn Song

    Abstract: Quantifying the importance of each training point to a learning task is a fundamental problem in machine learning and the estimated importance scores have been leveraged to guide a range of data workflows such as data summarization and domain adaption. One simple idea is to use the leave-one-out error of each training point to indicate its importance. Recent work has also proposed to use the Shapl…

    Submitted 25 April, 2021; v1 submitted 16 November, 2019; originally announced November 2019.

  21. arXiv:1911.02254  [pdf, other]

    cs.LG cs.CR cs.DC stat.ML

    Secure Federated Submodel Learning

    Authors: Chaoyue Niu, Fan Wu, Shaojie Tang, Lifeng Hua, Rongfei Jia, Chengfei Lv, Zhihua Wu, Guihai Chen

    Abstract: Federated learning was proposed with an intriguing vision of achieving collaborative machine learning among numerous clients without uploading their private data to a cloud server. However, the conventional framework requires each client to leverage the full model for learning, which can be prohibitively inefficient for resource-constrained clients and large-scale deep learning tasks. We thus prop…

    Submitted 11 November, 2019; v1 submitted 6 November, 2019; originally announced November 2019.

  22. arXiv:1908.08619  [pdf, other]

    cs.LG stat.ML

    Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

    Authors: Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas J. Spanos, Dawn Song

    Abstract: Given a data set $\mathcal{D}$ containing millions of data points and a data consumer who is willing to pay for \$$X$ to train a machine learning (ML) model over $\mathcal{D}$, how should we distribute this \$$X$ to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world inter…

    Submitted 29 March, 2020; v1 submitted 22 August, 2019; originally announced August 2019.

    Journal ref: PVLDB, 12(11): 1610-1623, 2019

  23. arXiv:1907.07789  [pdf]

    q-bio.GN stat.AP

    Genome-wide Causation Studies of Complex Diseases

    Authors: Rong Jiao, Xiangning Chen, Eric Boerwinkle, Momiao Xiong

    Abstract: Despite significant progress in dissecting the genetic architecture of complex diseases by genome-wide association studies (GWAS), the signals identified by association analysis may not have specific pathological relevance to diseases so that a large fraction of disease causing genetic variants is still hidden. Association is used to measure dependence between two variables or two sets of variable…

    Submitted 17 July, 2019; originally announced July 2019.

    Comments: 61 pages, 5 figures

  24. arXiv:1905.04337  [pdf, other]

    cs.LG stat.ML

    Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management

    Authors: Shipra Agrawal, Randy Jia

    Abstract: We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near-optimality of a simple class of policies called ``base-stock policies'' for the underlying Markov Decision Process (MDP), as well as convexity of long run average-cost under those policie…

    Submitted 10 May, 2019; originally announced May 2019.

  25. arXiv:1902.10275  [pdf, other]

    cs.LG stat.ML

    Towards Efficient Data Valuation Based on the Shapley Value

    Authors: Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gurel, Bo Li, Ce Zhang, Dawn Song, Costas Spanos

    Abstract: "How much is my data worth?" is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits among multiple data contributors and determining prospective compensation when data breaches happen. In this paper, we study the problem of data valuation by utilizing the Shapley value, a popular notion of v…

    Submitted 16 August, 2020; v1 submitted 26 February, 2019; originally announced February 2019.

  26. arXiv:1810.09650  [pdf, other]

    cs.LG cs.CR cs.CV cs.IT stat.ML

    One Bit Matters: Understanding Adversarial Examples as the Abuse of Redundancy

    Authors: Jingkang Wang, Ruoxi Jia, Gerald Friedland, Bo Li, Costas Spanos

    Abstract: Despite the great success achieved in machine learning (ML), adversarial examples have caused concerns with regards to its trustworthiness: A small perturbation of an input results in an arbitrary failure of an otherwise seemingly well-trained ML model. While studies are being conducted to discover the intrinsic properties of adversarial examples, such as their transferability and universality, th…

    Submitted 23 October, 2018; originally announced October 2018.

  27. arXiv:1805.04164  [pdf]

    q-bio.GN stat.ME

    Bivariate Causal Discovery and its Applications to Gene Expression and Imaging Data Analysis

    Authors: Rong Jiao, Nan Lin, Zixin Hu, David A Bennett, Li Jin, Momiao Xiong

    Abstract: The mainstream of research in genetics, epigenetics and imaging data analysis focuses on statistical association or exploring statistical dependence between variables. Despite their significant progresses in genetic research, understanding the etiology and mechanism of complex phenotypes remains elusive. Using association analysis as a major analytical platform for the complex data analysis is a k…

    Submitted 10 May, 2018; originally announced May 2018.

  28. Environmental Sensing by Wearable Device for Indoor Activity and Location Estimation

    Authors: Ming Jin, Han Zou, Kevin Weekly, Ruoxi Jia, Alexandre M. Bayen, Costas J. Spanos

    Abstract: We present results from a set of experiments in this pilot study to investigate the causal influence of user activity on various environmental parameters monitored by occupant carried multi-purpose sensors. Hypotheses with respect to each type of measurements are verified, including temperature, humidity, and light level collected during eight typical activities: sitting in lab / cubicle, indoor w…

    Submitted 22 June, 2014; originally announced June 2014.

    Comments: submitted to the 40th Annual Conference of the IEEE Industrial Electronics Society (IECON)