Showing 1–21 of 21 results for author: Xie, S M

  1. arXiv:2406.02470

    quant-ph cs.LG

    Meta-Designing Quantum Experiments with Language Models

    Authors: Sören Arlt, Haonan Duan, Felix Li, Sang Michael Xie, Yuhuai Wu, Mario Krenn

    Abstract: Artificial Intelligence (AI) has the potential to significantly advance scientific discovery by finding solutions beyond human capabilities. However, these super-human solutions are often unintuitive and require considerable effort to uncover underlying principles, if possible at all. Here, we show how a code-generating language model trained on synthetic data can not only find solutions to specif…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 10+3 pages, 5 figures

  2. arXiv:2402.16827

    cs.CL cs.LG

    A Survey on Data Selection for Language Models

    Authors: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

    Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the am…

    Submitted 8 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: Paper list available at https://github.com/alon-albalak/data-selection-survey

  3. arXiv:2402.03325

    cs.CV cs.LG

    Connect Later: Improving Fine-tuning for Robustness with Targeted Augmentations

    Authors: Helen Qu, Sang Michael Xie

    Abstract: Models trained on a labeled source domain (e.g., labeled images from wildlife camera traps) often generalize poorly when deployed on an out-of-distribution (OOD) target domain (e.g., images from new camera trap locations). In the domain adaptation setting where unlabeled target data is available, self-supervised pretraining (e.g., masked autoencoding or contrastive learning) is a promising method…

    Submitted 21 June, 2024; v1 submitted 8 January, 2024; originally announced February 2024.

    Comments: ICML 2024

  4. arXiv:2305.10429

    cs.CL cs.LG

    DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

    Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu

    Abstract: The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of do…

    Submitted 20 November, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  5. arXiv:2303.00001

    cs.LG cs.AI cs.CL

    Reward Design with Language Models

    Authors: Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh

    Abstract: Reward design in reinforcement learning (RL) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, wher…

    Submitted 27 February, 2023; originally announced March 2023.

    Comments: International Conference on Learning Representations (ICLR) 2023

  6. arXiv:2302.03169

    cs.CL cs.LG

    Data Selection for Language Models via Importance Resampling

    Authors: Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang

    Abstract: Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution given unlabeled target samples. Due to the scale and dimensionality of the raw text data, existing methods use simple heuristics or r…

    Submitted 18 November, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    Comments: NeurIPS 2023

  7. arXiv:2211.09110

    cs.CL cs.AI cs.LG

    Holistic Evaluation of Language Models

    Authors: Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, et al. (25 additional authors not shown)

    Abstract: Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest fo…

    Submitted 1 October, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Project page: https://crfm.stanford.edu/helm/v1.0

    Journal ref: Published in Transactions on Machine Learning Research (TMLR), 2023

  8. arXiv:2210.14199

    cs.LG

    Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

    Authors: Hong Liu, Sang Michael Xie, Zhiyuan Li, Tengyu Ma

    Abstract: Language modeling on large-scale datasets leads to impressive performance gains on various downstream language tasks. The validation pre-training loss (or perplexity in autoregressive language modeling) is often used as the evaluation metric when developing language models since the pre-training loss tends to be well-correlated with downstream performance (which is itself difficult to evaluate com…

    Submitted 25 October, 2022; originally announced October 2022.

  9. arXiv:2204.00570

    cs.LG cs.CV

    Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation

    Authors: Kendrick Shen, Robbie Jones, Ananya Kumar, Sang Michael Xie, Jeff Z. HaoChen, Tengyu Ma, Percy Liang

    Abstract: We consider unsupervised domain adaptation (UDA), where labeled data from a source domain (e.g., photographs) and unlabeled data from a target domain (e.g., sketches) are used to learn a classifier for the target domain. Conventional UDA methods (e.g., domain adversarial training) learn domain-invariant features to improve generalization to the target domain. In this paper, we show that contrastiv…

    Submitted 1 December, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

    Comments: ICML 2022 (Long Talk)

  10. arXiv:2112.05090

    cs.LG cs.AI cs.CV stat.ML

    Extending the WILDS Benchmark for Unsupervised Adaptation

    Authors: Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Levine, Chelsea Finn, Percy Liang

    Abstract: Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribu…

    Submitted 23 April, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

  11. arXiv:2111.02080

    cs.CL cs.LG

    An Explanation of In-context Learning as Implicit Bayesian Inference

    Authors: Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma

    Abstract: Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning ca…

    Submitted 21 July, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

    Comments: ICLR 2022

  12. arXiv:2109.05554

    cs.LG

    No True State-of-the-Art? OOD Detection Methods are Inconsistent across Datasets

    Authors: Fahim Tajwar, Ananya Kumar, Sang Michael Xie, Percy Liang

    Abstract: Out-of-distribution detection is an important component of reliable ML systems. Prior literature has proposed various methods (e.g., MSP (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), Mahalanobis (Lee et al., 2018)), claiming they are state-of-the-art by showing they outperform previous methods on a selected set of in-distribution (ID) and out-of-distribution (OOD) datasets. In this work,…

    Submitted 12 September, 2021; originally announced September 2021.

    Comments: ICML Workshop on Uncertainty & Robustness in Deep Learning, 2021

  13. arXiv:2108.07258

    cs.LG cs.AI cs.CY

    On the Opportunities and Risks of Foundation Models

    Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, et al. (89 additional authors not shown)

    Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their cap…

    Submitted 12 July, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html

  14. arXiv:2106.09226

    cs.LG stat.ML

    Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning

    Authors: Colin Wei, Sang Michael Xie, Tengyu Ma

    Abstract: Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downs…

    Submitted 20 April, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

  15. arXiv:2012.07421

    cs.LG

    WILDS: A Benchmark of in-the-Wild Distribution Shifts

    Authors: Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang

    Abstract: Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in the real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchma…

    Submitted 16 July, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

  16. arXiv:2012.04550

    cs.LG stat.ML

    In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness

    Authors: Sang Michael Xie, Ananya Kumar, Robbie Jones, Fereshte Khani, Tengyu Ma, Percy Liang

    Abstract: Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model which performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically acro…

    Submitted 7 April, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

    Comments: ICLR 2021

  17. arXiv:2006.16205

    cs.LG stat.ML

    Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for Improved Generalization

    Authors: Sang Michael Xie, Tengyu Ma, Percy Liang

    Abstract: We focus on prediction problems with structured outputs that are subject to output validity constraints, e.g. pseudocode-to-code translation where the code must compile. While labeled input-output pairs are expensive to obtain, "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available (e.g. code on GitHub) and provide information about output validity. We can capture the…

    Submitted 24 October, 2023; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: ICML 2021 Long talk

  18. arXiv:2002.10716

    cs.LG stat.ML

    Understanding and Mitigating the Tradeoff Between Robustness and Accuracy

    Authors: Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, Percy Liang

    Abstract: Adversarial training augments the training set with perturbations to improve the robust error (over worst-case perturbations), but it often leads to an increase in the standard error (on unperturbed test inputs). Previous explanations for this tradeoff rely on the assumption that no predictor in the hypothesis class has low standard and robust error. In this work, we precisely characterize the eff…

    Submitted 6 July, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Comments: Appearing at International Conference on Machine Learning (ICML) 2020

  19. arXiv:1906.06032

    cs.LG stat.ML

    Adversarial Training Can Hurt Generalization

    Authors: Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, Percy Liang

    Abstract: While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinit…

    Submitted 26 August, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

  20. arXiv:1901.10517

    cs.LG stat.ML

    Reparameterizable Subset Sampling via Continuous Relaxations

    Authors: Sang Michael Xie, Stefano Ermon

    Abstract: Many machine learning tasks require sampling a subset of items from a collection based on a parameterized distribution. The Gumbel-softmax trick can be used to sample a single item, and allows for low-variance reparameterized gradients with respect to the parameters of the underlying distribution. However, stochastic optimization involving subset sampling is typically not reparameterizable. To ove…

    Submitted 26 February, 2021; v1 submitted 29 January, 2019; originally announced January 2019.

    Comments: IJCAI 2019

  21. arXiv:1805.10407

    cs.LG cs.AI stat.ML

    Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data by Minimizing Predictive Variance

    Authors: Neal Jean, Sang Michael Xie, Stefano Ermon

    Abstract: Large amounts of labeled data are typically required to train deep learning models. For many real-world problems, however, acquiring additional data can be expensive or even impossible. We present semi-supervised deep kernel learning (SSDKL), a semi-supervised regression model based on minimizing predictive variance in the posterior regularization framework. SSDKL combines the hierarchical represe…

    Submitted 4 March, 2019; v1 submitted 25 May, 2018; originally announced May 2018.

    Comments: In Proceedings of Neural Information Processing Systems (NeurIPS) 2018