Showing 1–50 of 76 results for author: Steinhardt, J

Searching in archive cs.
  1. arXiv:2406.20053  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG

    Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

    Authors: Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt

    Abstract: Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious d…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 22 pages

  2. arXiv:2406.19501  [pdf, other]

    cs.CL cs.LG

    Monitoring Latent World States in Language Models with Propositional Probes

    Authors: Jiahai Feng, Stuart Russell, Jacob Steinhardt

    Abstract: Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. W…

    Submitted 27 June, 2024; originally announced June 2024.

  3. arXiv:2406.14595  [pdf, other]

    cs.CR cs.AI cs.LG

    Adversaries Can Misuse Combinations of Safe Models

    Authors: Erik Jones, Anca Dragan, Jacob Steinhardt

    Abstract: Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes thi…

    Submitted 1 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

  4. arXiv:2406.04341  [pdf, other]

    cs.CV

    Interpreting the Second-Order Effects of Neurons in CLIP

    Authors: Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

    Abstract: We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons' function in CLIP. Therefore, we present the "second-order lens", analyzing the effect flowing from a neuron through the l…

    Submitted 23 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: project page: https://yossigandelsman.github.io/clip_neurons/index.html

  5. arXiv:2402.18563  [pdf, other]

    cs.LG cs.AI cs.CL cs.IR

    Approaching Human-Level Forecasting with Language Models

    Authors: Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

    Abstract: Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large data…

    Submitted 28 February, 2024; originally announced February 2024.

  6. arXiv:2402.06627  [pdf, other]

    cs.LG cs.AI cs.CL

    Feedback Loops With Language Models Drive In-Context Reward Hacking

    Authors: Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt

    Abstract: Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the world, which in turn affects subsequent LLM outputs. In this work, we show that feedback loops can cause in-context reward hacking (ICRH), where the LL…

    Submitted 6 June, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: ICML 2024 camera-ready

  7. arXiv:2312.02974  [pdf, other]

    cs.CV cs.CL cs.CY cs.LG

    Describing Differences in Image Sets with Natural Language

    Authors: Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy

    Abstract: How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning. This task takes in im…

    Submitted 26 April, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Oral

  8. arXiv:2310.17191  [pdf, other]

    cs.LG cs.AI cs.CL

    How do Language Models Bind Entities in Context?

    Authors: Jiahai Feng, Jacob Steinhardt

    Abstract: To correctly use in-context information, language models (LMs) must bind entities to their attributes. For example, given a context describing a "green square" and a "blue circle", LMs must bind the shapes to their respective colors. We analyze LM representations and identify the binding ID mechanism: a general mechanism for solving the binding problem, which we observe in every sufficiently large…

    Submitted 6 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

  9. arXiv:2310.05916  [pdf, other]

    cs.CV cs.AI

    Interpreting CLIP's Image Representation via Text-Based Decomposition

    Authors: Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

    Abstract: We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representa…

    Submitted 28 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: Project page and code: https://yossigandelsman.github.io/clip_decomposition/

  10. arXiv:2307.09476  [pdf, other]

    cs.LG cs.AI cs.CL

    Overthinking the Truth: Understanding how Language Models Process False Demonstrations

    Authors: Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt

    Abstract: Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: "overthinking" and "false…

    Submitted 12 March, 2024; v1 submitted 18 July, 2023; originally announced July 2023.

  11. arXiv:2307.08678  [pdf, other]

    cs.CL cs.AI cs.LG

    Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

    Authors: Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown

    Abstract: Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate counterfactual simulatability of natural language explanations: whether an explanation can enable humans to precisely infer the model's…

    Submitted 17 July, 2023; originally announced July 2023.

  12. arXiv:2307.02483  [pdf, other]

    cs.LG cs.CR

    Jailbroken: How Does LLM Safety Training Fail?

    Authors: Alexander Wei, Nika Haghtalab, Jacob Steinhardt

    Abstract: Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and…

    Submitted 5 July, 2023; originally announced July 2023.

  13. arXiv:2306.17105  [pdf, other]

    cs.LG

    Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations

    Authors: Yongyi Yang, Jacob Steinhardt, Wei Hu

    Abstract: Recent work has observed an intriguing "Neural Collapse" phenomenon in well-trained neural networks, where the last-layer representations of training samples with the same label collapse into each other. This appears to suggest that the last-layer representations are completely determined by the labels, and do not depend on the intrinsic structure of input distribution. We provide evidence that…

    Submitted 29 June, 2023; originally announced June 2023.

    Comments: This paper has been accepted as a conference paper at ICML 2023

  14. arXiv:2306.14670  [pdf, other]

    cs.GT cs.CY cs.LG stat.ML

    Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition

    Authors: Meena Jagadeesan, Michael I. Jordan, Jacob Steinhardt, Nika Haghtalab

    Abstract: As the scale of machine learning models increases, trends such as scaling laws anticipate consistent downstream improvements in predictive accuracy. However, these trends take the perspective of a single model-provider in isolation, while in reality providers often compete with each other for users. In this work, we demonstrate that competition can fundamentally alter the behavior of these scaling…

    Submitted 6 February, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

    Comments: Appeared at NeurIPS 2023; this is the full version

  15. arXiv:2306.12105  [pdf, other]

    cs.LG cs.CL cs.SE

    Mass-Producing Failures of Multimodal Systems with Language Models

    Authors: Shengbang Tong, Erik Jones, Jacob Steinhardt

    Abstract: Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures -- generalizable, natural-language descriptions of patterns of model failures. To uncover systematic failures, MultiMon scrapes a corpus for examples of erroneous agreement: inputs that…

    Submitted 1 March, 2024; v1 submitted 21 June, 2023; originally announced June 2023.

    Comments: Under Review

  16. arXiv:2306.07479  [pdf, ps, other]

    cs.GT cs.IR cs.LG stat.ML

    Incentivizing High-Quality Content in Online Recommender Systems

    Authors: Xinyan Hu, Meena Jagadeesan, Michael I. Jordan, Jacob Steinhardt

    Abstract: In content recommender systems such as TikTok and YouTube, the platform's recommendation algorithm shapes content producer incentives. Many platforms employ online learning, which generates intertemporal incentives, since content produced today affects recommendations of future content. We study the game between producers and analyze the content created at equilibrium. We show that standard online…

    Submitted 21 June, 2024; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: Updated version with revised and expanded content

  17. arXiv:2303.08112  [pdf, other]

    cs.LG

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Authors: Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt

    Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique…

    Submitted 26 November, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

  18. arXiv:2303.04381  [pdf, other]

    cs.LG cs.CL

    Automatically Auditing Large Language Models via Discrete Optimization

    Authors: Erik Jones, Anca Dragan, Aditi Raghunathan, Jacob Steinhardt

    Abstract: Auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging. In this work, we cast auditing as an optimization problem, where we automatically search for input-output pairs that match a desired target behavior. For example, we might aim to find a non-toxic input that starts with "Barack Obama" that a model maps to a toxic output.…

    Submitted 8 March, 2023; originally announced March 2023.

  19. arXiv:2302.14233  [pdf, other]

    cs.CL cs.AI cs.LG

    Goal Driven Discovery of Distributional Differences via Language Descriptions

    Authors: Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, Jacob Steinhardt

    Abstract: Mining large corpora can generate useful discoveries but is time-consuming for humans. We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way. The task input is a problem comprising a research goal "comparing the side effects of drug A and drug B" and a corpus pair (two large collections of patients' self-reported reactions a…

    Submitted 24 October, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

  20. arXiv:2302.12349  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws

    Authors: Kush Bhatia, Wenshuo Guo, Jacob Steinhardt

    Abstract: Specifying reward functions for complex tasks like object manipulation or driving is challenging to do by hand. Reward learning seeks to address this by learning a reward model using human feedback on selected query policies. This shifts the burden of reward specification to the optimal design of the queries. We propose a theoretical framework for studying reward learning and the associated optima…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: Accepted to AISTATS 2023

  21. arXiv:2301.05217  [pdf, other]

    cs.LG cs.AI

    Progress measures for grokking via mechanistic interpretability

    Authors: Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

    Abstract: Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: r…

    Submitted 19 October, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

    Comments: 10 page main body, 2 page references, 24 page appendix

  22. arXiv:2212.03827  [pdf, other]

    cs.CL cs.AI cs.LG

    Discovering Latent Knowledge in Language Models Without Supervision

    Authors: Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt

    Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a l…

    Submitted 2 March, 2024; v1 submitted 7 December, 2022; originally announced December 2022.

    Comments: ICLR 2023

  23. arXiv:2211.00593  [pdf, other]

    cs.LG cs.AI cs.CL

    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    Authors: Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt

    Abstract: Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task…

    Submitted 1 November, 2022; originally announced November 2022.

  24. arXiv:2210.10039  [pdf, other]

    cs.CV cs.CY cs.LG

    How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

    Authors: Mantas Mazeika, Eric Tang, Andy Zou, Steven Basart, Jun Shern Chan, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks

    Abstract: In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing human-centric systems that understand not only the content of the video but also how it would affect the wellbeing and emotional state of viewers. To fac…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022; datasets available at https://github.com/hendrycks/emodiversity/

  25. arXiv:2206.15474  [pdf, other]

    cs.LG cs.CL

    Forecasting Future World Events with Neural Networks

    Authors: Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

    Abstract: Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing tho…

    Submitted 9 October, 2022; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022; our dataset is available at https://github.com/andyzoujm/autocast

  26. arXiv:2206.13498  [pdf, other]

    cs.LG cs.AI cs.CV cs.NE

    Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior

    Authors: Jean-Stanislas Denain, Jacob Steinhardt

    Abstract: Model visualizations provide information that outputs alone might miss. But can we trust that model visualizations reflect model behavior? For instance, can they diagnose abnormal behavior such as planted backdoors or overregularization? To evaluate visualization methods, we test whether they assign different visualizations to anomalously trained models and normal models. We find that while existi…

    Submitted 29 May, 2023; v1 submitted 27 June, 2022; originally announced June 2022.

    Comments: Fixed backdoor localization results, made changes to abstract and introduction

  27. arXiv:2206.13489  [pdf, other]

    cs.GT cs.LG econ.GN

    Supply-Side Equilibria in Recommender Systems

    Authors: Meena Jagadeesan, Nikhil Garg, Jacob Steinhardt

    Abstract: Algorithmic recommender systems such as Spotify and Netflix affect not only consumer behavior but also producer incentives. Producers seek to create content that will be shown by the recommendation algorithm, which can impact both the diversity and quality of their content. In this work, we investigate the resulting supply-side equilibria in personalized content recommender systems. We model users…

    Submitted 11 December, 2023; v1 submitted 27 June, 2022; originally announced June 2022.

    Comments: Appeared at NeurIPS 2023; this is the full version

  28. arXiv:2203.06176  [pdf, other]

    cs.LG stat.ML

    More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize

    Authors: Alexander Wei, Wei Hu, Jacob Steinhardt

    Abstract: Of theories for why large-scale machine learning models generalize despite being vastly overparameterized, which of their assumptions are needed to capture the qualitative phenomena of generalization in the real world? On one hand, we find that most theoretical analyses fall short of capturing these qualitative phenomena even for kernel regression, when applied to kernels derived from large-scale…

    Submitted 11 March, 2022; originally announced March 2022.

  29. arXiv:2202.12299  [pdf, other]

    cs.CL cs.AI cs.LG

    Capturing Failures of Large Language Models via Human Cognitive Biases

    Authors: Erik Jones, Jacob Steinhardt

    Abstract: Large language models generate complex, open-ended outputs: instead of outputting a class label they write summaries, generate dialogue, or produce working code. In order to assess the reliability of these open-ended generation systems, we aim to identify qualitative categories of erroneous behavior, beyond identifying individual errors. To hypothesize and test for such qualitative errors, we draw…

    Submitted 23 November, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

    Comments: Published at NeurIPS 2022

  30. arXiv:2202.05834  [pdf, other]

    cs.LG stat.ML

    Predicting Out-of-Distribution Error with the Projection Norm

    Authors: Yaodong Yu, Zitong Yang, Alexander Wei, Yi Ma, Jacob Steinhardt

    Abstract: We propose a metric -- Projection Norm -- to predict a model's performance on out-of-distribution (OOD) data without access to ground truth labels. Projection Norm first uses model predictions to pseudo-label test samples and then trains a new model on the pseudo-labels. The more the new model's parameters differ from an in-distribution model, the greater the predicted OOD error. Empirically, our…

    Submitted 11 February, 2022; originally announced February 2022.

  31. arXiv:2201.12323  [pdf, other]

    cs.CL cs.AI cs.LG

    Describing Differences between Text Distributions with Natural Language

    Authors: Ruiqi Zhong, Charlie Snell, Dan Klein, Jacob Steinhardt

    Abstract: How do two distributions of texts differ? Humans are slow at answering this, since discovering patterns might require tediously reading through hundreds of samples. We propose to automatically summarize the differences by "learning a natural language hypothesis": given two distributions $D_{0}$ and $D_{1}$, we search for a description that is more often true for $D_{1}$, e.g., "is military-related…

    Submitted 18 May, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: International Conference on Machine Learning, 2022

  32. arXiv:2201.03544  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

    Authors: Alexander Pan, Kush Bhatia, Jacob Steinhardt

    Abstract: Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. M…

    Submitted 14 February, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

    Comments: ICLR 2022; 19 pages

  33. arXiv:2112.05135  [pdf, other]

    cs.LG cs.CV

    PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

    Authors: Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt

    Abstract: In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy. These other goals include out-of-distribution (OOD) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. However, improving performance towards these goals is often…

    Submitted 29 March, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

    Comments: CVPR 2022. Code and models are available at https://github.com/andyzoujm/pixmix

  34. arXiv:2112.04094  [pdf, other]

    cs.LG

    The Effect of Model Size on Worst-Group Generalization

    Authors: Alan Pham, Eunice Chan, Vikranth Srivatsa, Dhruba Ghosh, Yaoqing Yang, Yaodong Yu, Ruiqi Zhong, Joseph E. Gonzalez, Jacob Steinhardt

    Abstract: Overparameterization is shown to result in poor test accuracy on rare subgroups under a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) archite…

    Submitted 7 December, 2021; originally announced December 2021.

    Comments: The first four authors contributed equally to the work

  35. arXiv:2110.13136  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    What Would Jiminy Cricket Do? Towards Agents That Behave Morally

    Authors: Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

    Abstract: When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environme…

    Submitted 7 February, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2021. Environments available here https://github.com/hendrycks/jiminy-cricket

  36. arXiv:2109.13916  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Unsolved Problems in ML Safety

    Authors: Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

    Abstract: Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the tec…

    Submitted 16 June, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: Position Paper

  37. arXiv:2108.08843  [pdf, other]

    cs.LG cs.GT stat.ML

    Learning Equilibria in Matching Markets from Bandit Feedback

    Authors: Meena Jagadeesan, Alexander Wei, Yixin Wang, Michael I. Jordan, Jacob Steinhardt

    Abstract: Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. Classical notions of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) are unfortunately of limited value in the learning setting, given that preferences are inherently uncertain and destabilizing while they are being learned. To…

    Submitted 31 January, 2023; v1 submitted 19 August, 2021; originally announced August 2021.

    Comments: Accepted to the Journal of the ACM; conference version appeared at NeurIPS 2021

  38. arXiv:2108.01661  [pdf, other]

    cs.LG stat.ML

    Grounding Representation Similarity with Statistical Testing

    Authors: Frances Ding, Jean-Stanislas Denain, Jacob Steinhardt

    Abstract: To understand neural network behavior, recent works quantitatively compare different networks' learned representations using canonical correlation analysis (CCA), centered kernel alignment (CKA), and other dissimilarity measures. Unfortunately, these widely used measures often disagree on fundamental observations, such as whether deep networks differing only in random initialization learn similar…

    Submitted 3 November, 2021; v1 submitted 3 August, 2021; originally announced August 2021.

    Comments: Accepted at NeurIPS 2021. 10 pages, 3 figures

  39. arXiv:2105.09938  [pdf, other]

    cs.SE cs.CL cs.LG

    Measuring Coding Challenge Competence With APPS

    Authors: Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

    Abstract: While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for c…

    Submitted 8 November, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: NeurIPS 2021. Code and the APPS dataset are available at https://github.com/hendrycks/apps

  40. arXiv:2105.06020  [pdf, other]

    cs.CL cs.AI cs.LG

    Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

    Authors: Ruiqi Zhong, Dhruba Ghosh, Dan Klein, Jacob Steinhardt

    Abstract: Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? Some work suggests larger models have higher out-of-distribution robustness, while other work suggests they have lower accuracy on rare subgroups. To understand these differences, we investigate these models at the level of individual instances. However, one major challenge is that ind…

    Submitted 12 May, 2021; originally announced May 2021.

    Comments: ACL 2021 Findings. Code and data: https://github.com/ruiqi-zhong/acl2021-instance-level

  41. arXiv:2104.08482  [pdf, other]

    cs.LG stat.ML

    Agnostic learning with unknown utilities

    Authors: Kush Bhatia, Peter L. Bartlett, Anca D. Dragan, Jacob Steinhardt

    Abstract: Traditional learning approaches for classification implicitly assume that each mistake has the same cost. In many real-world problems though, the utility of a decision depends on the underlying context $x$ and decision $y$. However, directly incorporating these utilities into the learning objective is often infeasible since these can be quite complex and difficult for humans to specify. We forma…

    Submitted 17 April, 2021; originally announced April 2021.

    Comments: 30 pages; published as a conference paper at ITCS 2021

  42. arXiv:2103.09947  [pdf, other]

    cs.LG stat.ML

    Understanding Generalization in Adversarial Training via the Bias-Variance Decomposition

    Authors: Yaodong Yu, Zitong Yang, Edgar Dobriban, Jacob Steinhardt, Yi Ma

    Abstract: Adversarially trained models exhibit a large generalization gap: they can interpolate the training set even for large perturbation radii, but at the cost of large test error on clean samples. To investigate this gap, we decompose the test risk into its bias and variance components and study their behavior as a function of adversarial training perturbation radii ($\varepsilon$). We find that the bi…

    Submitted 13 June, 2021; v1 submitted 17 March, 2021; originally announced March 2021.

    Comments: V2 adds new results and improves organization and presentation

  43. arXiv:2103.07601  [pdf, other]

    cs.CL cs.AI

    Approximating How Single Head Attention Learns

    Authors: Charlie Snell, Ruiqi Zhong, Dan Klein, Jacob Steinhardt

    Abstract: Why do models often attend to salient words, and how does this evolve throughout training? We approximate model training as a two stage process: early on in training when the attention weights are uniform, the model learns to translate individual input word `i` to `o` if they co-occur frequently. Later, the model learns to attend to `i` while the correct output is `o` because it knows `i` translat…

    Submitted 20 October, 2021; v1 submitted 12 March, 2021; originally announced March 2021.

  44. arXiv:2103.05898  [pdf, other]

    cs.CV cs.AI cs.LG

    Limitations of Post-Hoc Feature Alignment for Robustness

    Authors: Collin Burns, Jacob Steinhardt

    Abstract: Feature alignment is an approach to improving robustness to distribution shift that matches the distribution of feature activations between the training distribution and test distribution. A particularly simple but effective approach to feature alignment involves aligning the batch normalization statistics between the two distributions in a trained neural network. This technique has received renew…

    Submitted 10 March, 2021; originally announced March 2021.

    Comments: Accepted to CVPR 2021

  45. arXiv:2103.03874  [pdf, other]

    cs.LG cs.AI cs.CL

    Measuring Mathematical Problem Solving With the MATH Dataset

    Authors: Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt

    Abstract: Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanati…

    Submitted 8 November, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

    Comments: NeurIPS 2021. Code and the MATH dataset are available at https://github.com/hendrycks/math/

  46. arXiv:2010.11645  [pdf, other]

    cs.LG cs.AI

    Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming

    Authors: Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan, Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian Goodfellow, Percy Liang, Pushmeet Kohli

    Abstract: Convex relaxations have emerged as a promising approach for verifying desirable properties of neural networks like robustness to adversarial perturbations. Widely used Linear Programming (LP) relaxations only work well when networks are trained to facilitate verification. This precludes applications that involve verification-agnostic networks, i.e., networks not specially trained for verification.…

    Submitted 3 November, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

  47. arXiv:2009.03300  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Measuring Massive Multitask Language Understanding

    Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

    Abstract: We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over…

    Submitted 12 January, 2021; v1 submitted 7 September, 2020; originally announced September 2020.

    Comments: ICLR 2021; the test and code are available at https://github.com/hendrycks/test

  48. arXiv:2008.02275  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Aligning AI With Shared Human Values

    Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

    Abstract: We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable…

    Submitted 17 February, 2023; v1 submitted 5 August, 2020; originally announced August 2020.

    Comments: ICLR 2021; the ETHICS dataset is available at https://github.com/hendrycks/ethics/

  49. arXiv:2006.16241  [pdf, other]

    cs.CV cs.LG stat.ML

    The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

    Authors: Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer

    Abstract: We introduce four new real-world distribution shift datasets consisting of changes in image style, image blurriness, geographic location, camera operation, and more. With our new datasets, we take stock of previously proposed methods for improving out-of-distribution robustness and put them to the test. We find that using larger models and artificial data augmentations can improve robustness on re…

    Submitted 24 July, 2021; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: ICCV 2021; Datasets, code, and models available at https://github.com/hendrycks/imagenet-r

  50. arXiv:2005.14073  [pdf, other]

    stat.ML cs.LG eess.SP math.ST stat.CO

    Robust estimation via generalized quasi-gradients

    Authors: Banghua Zhu, Jiantao Jiao, Jacob Steinhardt

    Abstract: We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are non-convex. We study the loss landscape of these robust estimation problems, and identify the existence of "generalized quasi-gradients". Whenever these quasi-gradients exist, a large family of low-regret algorithms are guaranteed to approximate the global…

    Submitted 28 May, 2020; originally announced May 2020.