Showing 1–50 of 64 results for author: Tramèr, F

  1. arXiv:2407.08707  [pdf, other]

    cs.CV cs.LG

    Extracting Training Data from Document-Based VQA Models

    Authors: Francesco Pinto, Nathalie Rauschmayr, Florian Tramèr, Philip Torr, Federico Tombari

    Abstract: Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image). In this work, we show these models can memorize responses for training samples and regurgitate them even when the relevant visual information has been removed. This includes Personal Identifiable Informat…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: ICML 2024

    ACM Class: I.2.7; I.2.10; K.4.1

  2. arXiv:2406.18382  [pdf, other]

    cs.CR cs.LG

    Adversarial Search Engine Optimization for Large Language Models

    Authors: Fredrik Nestaas, Edoardo Debenedetti, Florian Tramèr

    Abstract: Large Language Models (LLMs) are increasingly used in applications where the model selects from competing third-party content, such as in LLM-powered search engines or chatbot plugins. In this paper, we introduce Preference Manipulation Attacks, a new class of attacks that manipulate an LLM's selections to favor the attacker. We demonstrate that carefully crafted website content or plugin document…

    Submitted 2 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

  3. arXiv:2406.16201  [pdf, ps, other]

    cs.CR cs.CL cs.LG

    Blind Baselines Beat Membership Inference Attacks for Foundation Models

    Authors: Debeshee Das, Jie Zhang, Florian Tramèr

    Abstract: Membership inference (MI) attacks try to determine if a data sample was used to train a machine learning model. For foundation models trained on unknown Web data, MI attacks can be used to detect copyrighted training materials, measure test set contamination, or audit machine unlearning. Unfortunately, we find that evaluations of MI attacks for foundation models are flawed, because they sample mem…

    Submitted 23 June, 2024; originally announced June 2024.

  4. arXiv:2406.13352  [pdf, other]

    cs.CR cs.LG

    AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

    Authors: Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr

    Abstract: AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data.…

    Submitted 19 June, 2024; originally announced June 2024.

  5. arXiv:2406.12027  [pdf, other]

    cs.CR

    Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

    Authors: Robert Hönig, Javier Rando, Nicholas Carlini, Florian Tramèr

    Abstract: Artists are increasingly concerned about advancements in image generation models that can closely replicate their unique artistic styles. In response, several protection tools against style mimicry have been developed that incorporate small adversarial perturbations into artworks published online. In this work, we evaluate the effectiveness of popular protections -- with millions of downloads -- a…

    Submitted 17 June, 2024; originally announced June 2024.

  6. arXiv:2406.07954  [pdf, other]

    cs.CR cs.AI

    Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

    Authors: Edoardo Debenedetti, Javier Rando, Daniel Paleka, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, Giovanni Cherubin, Santiago Zanella-Beguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, Lea Schönherr

    Abstract: Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed…

    Submitted 12 June, 2024; originally announced June 2024.

  7. arXiv:2404.17399  [pdf, other]

    cs.CR cs.LG

    Evaluations of Machine Learning Privacy Defenses are Misleading

    Authors: Michael Aerni, Jie Zhang, Florian Tramèr

    Abstract: Empirical defenses for machine learning privacy forgo the provable guarantees of differential privacy in the hope of achieving higher utility while resisting realistic adversaries. We identify severe pitfalls in existing empirical privacy evaluations (based on membership inference attacks) that result in misleading conclusions. In particular, we show that prior evaluations fail to characterize the…

    Submitted 26 April, 2024; originally announced April 2024.

  8. arXiv:2404.14461  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

    Authors: Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr

    Abstract: Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any pro…

    Submitted 6 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Competition Report

  9. arXiv:2404.09932  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    Foundational Challenges in Assuring Alignment and Safety of Large Language Models

    Authors: Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, et al. (13 additional authors not shown)

    Abstract: This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.

    Submitted 15 April, 2024; originally announced April 2024.

  10. arXiv:2404.01318  [pdf, other]

    cs.CR cs.LG

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Authors: Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

    Abstract: Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and suc…

    Submitted 16 June, 2024; v1 submitted 27 March, 2024; originally announced April 2024.

    Comments: JailbreakBench v1.0: more attack artifacts, more test-time defenses, a more accurate jailbreak judge (Llama-3-70B with a custom prompt), a larger dataset of human preferences for selecting a jailbreak judge (300 examples), an over-refusal evaluation dataset (100 benign/borderline behaviors), a semantic refusal judge based on Llama-3-8B

  11. arXiv:2404.00473  [pdf, other]

    cs.CR cs.LG

    Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

    Authors: Shanglun Feng, Florian Tramèr

    Abstract: Practitioners commonly download pretrained machine learning models from open repositories and finetune them to fit specific applications. We show that this practice introduces a new risk of privacy backdoors. By tampering with a pretrained model's weights, an attacker can fully compromise the privacy of the finetuning data. We show how to build privacy backdoors for a variety of models, including…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: Code at https://github.com/ShanglunFengatETHZ/PrivacyBackdoor

  12. arXiv:2403.06634  [pdf, other]

    cs.CR

    Stealing Part of a Production Language Model

    Authors: Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr

    Abstract: We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under \…

    Submitted 9 July, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

  13. arXiv:2402.12329  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Query-Based Adversarial Prompt Generation

    Authors: Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr

    Abstract: Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on…

    Submitted 19 February, 2024; originally announced February 2024.

  14. arXiv:2311.17035  [pdf, other]

    cs.LG cs.CL cs.CR

    Scalable Extraction of Training Data from (Production) Language Models

    Authors: Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee

    Abstract: This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from…

    Submitted 28 November, 2023; originally announced November 2023.

  15. arXiv:2311.14455  [pdf, other]

    cs.AI cs.CL cs.CR cs.LG

    Universal Jailbreak Backdoors from Poisoned Human Feedback

    Authors: Javier Rando, Florian Tramèr

    Abstract: Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the mode…

    Submitted 29 April, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: Accepted as conference paper in ICLR 2024

  16. arXiv:2309.05610  [pdf, other]

    cs.CR cs.LG

    Privacy Side Channels in Machine Learning Systems

    Authors: Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, Florian Tramèr

    Abstract: Most current approaches for protecting privacy in machine learning (ML) assume that models exist in a vacuum, when in reality, ML models are part of larger systems that include components for training data filtering, output monitoring, and more. In this work, we introduce privacy side channels: attacks that exploit these system-level components to extract private information at far higher rates th…

    Submitted 11 September, 2023; originally announced September 2023.

  17. arXiv:2307.14692  [pdf, other]

    cs.CR

    Backdoor Attacks for In-Context Learning with Language Models

    Authors: Nikhil Kandpal, Matthew Jagielski, Florian Tramèr, Nicholas Carlini

    Abstract: Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor…

    Submitted 27 July, 2023; originally announced July 2023.

    Comments: AdvML Frontiers Workshop 2023

  18. arXiv:2306.15447  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Are aligned neural networks adversarially aligned?

    Authors: Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt

    Abstract: Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models rema…

    Submitted 6 May, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

  19. arXiv:2306.09983  [pdf, other]

    cs.LG cs.AI cs.CR stat.ML

    Evaluating Superhuman Models with Consistency Checks

    Authors: Lukas Fluri, Daniel Paleka, Florian Tramèr

    Abstract: If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impos…

    Submitted 19 October, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: 42 pages, 18 figures. Code and data are available at https://github.com/ethz-spylab/superhuman-ai-consistency

  20. arXiv:2306.02895  [pdf, other]

    cs.CR cs.LG stat.ML

    Evading Black-box Classifiers Without Breaking Eggs

    Authors: Edoardo Debenedetti, Nicholas Carlini, Florian Tramèr

    Abstract: Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc). Queries to such systems carry a fundamentally asy…

    Submitted 14 February, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

    Comments: Code at https://github.com/ethz-privsec/realistic-adv-examples. Accepted at IEEE SaTML 2024

  21. arXiv:2302.13464  [pdf, other]

    cs.LG cs.CR

    Randomness in ML Defenses Helps Persistent Attackers and Hinders Evaluators

    Authors: Keane Lucas, Matthew Jagielski, Florian Tramèr, Lujo Bauer, Nicholas Carlini

    Abstract: It is becoming increasingly imperative to design robust ML defenses. However, recent work has found that many defenses that initially resist state-of-the-art attacks can be broken by an adaptive adversary. In this work we take steps to simplify the design of defenses and argue that white-box defenses should eschew randomness when possible. We begin by illustrating a new issue with the deployment o…

    Submitted 26 February, 2023; originally announced February 2023.

  22. arXiv:2302.10149  [pdf, other]

    cs.CR cs.LG

    Poisoning Web-Scale Training Datasets is Practical

    Authors: Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

    Abstract: Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet…

    Submitted 6 May, 2024; v1 submitted 20 February, 2023; originally announced February 2023.

  23. arXiv:2302.07956  [pdf, other]

    cs.LG cs.CR

    Tight Auditing of Differentially Private Machine Learning

    Authors: Milad Nasr, Jamie Hayes, Thomas Steinke, Borja Balle, Florian Tramèr, Matthew Jagielski, Nicholas Carlini, Andreas Terzis

    Abstract: Auditing mechanisms for differential privacy use probabilistic means to empirically estimate the privacy level of an algorithm. For private machine learning, existing auditing mechanisms are tight: the empirical privacy estimate (nearly) matches the algorithm's provable privacy guarantee. But these auditing techniques suffer from two limitations. First, they only give tight estimates under implaus…

    Submitted 15 February, 2023; originally announced February 2023.

  24. arXiv:2301.13188  [pdf, other]

    cs.CR cs.CV cs.LG

    Extracting Training Data from Diffusion Models

    Authors: Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace

    Abstract: Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the…

    Submitted 30 January, 2023; originally announced January 2023.

  25. arXiv:2212.06470  [pdf, ps, other]

    cs.LG cs.CR stat.ML

    Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

    Authors: Florian Tramèr, Gautam Kamath, Nicholas Carlini

    Abstract: The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach. We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pret…

    Submitted 2 June, 2024; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: ICML 2024

  26. arXiv:2210.17546  [pdf, other]

    cs.LG cs.CL

    Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

    Authors: Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, Nicholas Carlini

    Abstract: Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argu…

    Submitted 11 September, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

  27. arXiv:2210.04610  [pdf, other]

    cs.AI cs.CR cs.CV cs.CY cs.LG

    Red-Teaming the Stable Diffusion Safety Filter

    Authors: Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr

    Abstract: Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALLE, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter's limitations a…

    Submitted 10 November, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: ML Safety Workshop NeurIPS 2022

  28. arXiv:2210.03297  [pdf, other]

    cs.CR cs.CV cs.LG

    Preprocessors Matter! Realistic Decision-Based Attacks on Machine Learning Systems

    Authors: Chawin Sitawarin, Florian Tramèr, Nicholas Carlini

    Abstract: Decision-based attacks construct adversarial examples against a machine learning (ML) model by making only hard-label queries. These attacks have mainly been applied directly to standalone neural networks. However, in practice, ML models are just one component of a larger learning system. We find that by adding a single preprocessor in front of a classifier, state-of-the-art query-based attacks ar…

    Submitted 20 July, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: ICML 2023. Code can be found at https://github.com/google-research/preprocessor-aware-black-box-attack

  29. arXiv:2208.12348  [pdf, other]

    cs.LG cs.CR

    SNAP: Efficient Extraction of Private Properties with Poisoning

    Authors: Harsh Chaudhari, John Abascal, Alina Oprea, Matthew Jagielski, Florian Tramèr, Jonathan Ullman

    Abstract: Property inference attacks allow an adversary to extract global properties of the training dataset from a machine learning model. Such attacks have privacy implications for data owners sharing their datasets to train machine learning models. Several existing approaches for property inference attacks against deep neural networks have been proposed, but they all rely on the attacker training a large…

    Submitted 21 June, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

    Comments: 28 pages, 16 figures

  30. arXiv:2207.00099  [pdf, other]

    cs.LG

    Measuring Forgetting of Memorized Training Examples

    Authors: Matthew Jagielski, Om Thakkar, Florian Tramèr, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, Chiyuan Zhang

    Abstract: Machine learning models exhibit two seemingly contradictory phenomena: training data memorization, and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what…

    Submitted 9 May, 2023; v1 submitted 30 June, 2022; originally announced July 2022.

    Comments: Appeared at ICLR '23, 22 pages, 12 figures

  31. arXiv:2206.13991  [pdf, other]

    cs.LG cs.CR cs.CV

    Increasing Confidence in Adversarial Robustness Evaluations

    Authors: Roland S. Zimmermann, Wieland Brendel, Florian Tramer, Nicholas Carlini

    Abstract: Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these defenses held up their claims because correctly evaluating robustness is extremely challenging: Weak attacks often fail to find adversarial examples even if they unknowingly exist, thereby making a vulnerable network look robust. In this pa…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: Oral at CVPR 2022 Workshop (Art of Robustness). Project website https://zimmerrol.github.io/active-tests/

  32. arXiv:2206.10550  [pdf, other]

    cs.LG cs.CR

    (Certified!!) Adversarial Robustness for Free!

    Authors: Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, Mingjie Sun, J. Zico Kolter

    Abstract: In this paper we show how to achieve state-of-the-art certified adversarial robustness to 2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. 2020 by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accura…

    Submitted 6 March, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

  33. arXiv:2206.10469  [pdf, other]

    cs.LG cs.CR

    The Privacy Onion Effect: Memorization is Relative

    Authors: Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, Florian Tramer

    Abstract: Machine learning models trained on private datasets have been shown to leak their private data. While recent work has found that the average data point is rarely leaked, the outlier samples are frequently subject to memorization and, consequently, privacy leakage. We demonstrate and analyse an Onion Effect of memorization: removing the "layer" of outlier points that are most vulnerable to a privac…

    Submitted 22 June, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

  34. arXiv:2204.00032  [pdf, other]

    cs.CR cs.LG stat.ML

    Truth Serum: Poisoning Machine Learning Models to Reveal Their Secrets

    Authors: Florian Tramèr, Reza Shokri, Ayrton San Joaquin, Hoang Le, Matthew Jagielski, Sanghyun Hong, Nicholas Carlini

    Abstract: We introduce a new class of attacks on machine learning models. We show that an adversary who can poison a training dataset can cause models trained on this dataset to leak significant private details of training points belonging to other parties. Our active inference attacks connect two independent lines of work targeting the integrity and privacy of machine learning training data. Our attacks…

    Submitted 6 October, 2022; v1 submitted 31 March, 2022; originally announced April 2022.

    Comments: ACM CCS 2022

  35. arXiv:2202.12219  [pdf, other]

    cs.LG

    Debugging Differential Privacy: A Case Study for Privacy Auditing

    Authors: Florian Tramer, Andreas Terzis, Thomas Steinke, Shuang Song, Matthew Jagielski, Nicholas Carlini

    Abstract: Differential Privacy can provide provable privacy guarantees for training data in machine learning. However, the presence of proofs does not preclude the presence of errors. Inspired by recent advances in auditing which have been used for estimating lower bounds on differentially private algorithms, here we show that auditing can also be used to find flaws in (purportedly) differentially private s…

    Submitted 28 March, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

  36. arXiv:2202.07646  [pdf, other]

    cs.LG cs.CL

    Quantifying Memorization Across Neural Language Models

    Authors: Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

    Abstract: Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe thr…

    Submitted 6 March, 2023; v1 submitted 15 February, 2022; originally announced February 2022.

  37. arXiv:2202.05520  [pdf, other]

    stat.ML cs.CL cs.LG

    What Does it Mean for a Language Model to Preserve Privacy?

    Authors: Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, Florian Tramèr

    Abstract: Natural language reflects our private lives and identities, making its privacy concerns as broad as those of real life. Language models lack the ability to understand the context and sensitivity of text, and tend to memorize phrases present in their training sets. An adversary can exploit this tendency to extract training data. Depending on the nature of the content and the context in which this d…

    Submitted 14 February, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

    Comments: 21 pages, 2 figures

  38. arXiv:2112.12938  [pdf, other]

    cs.CL cs.AI cs.LG

    Counterfactual Memorization in Neural Language Models

    Authors: Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, Nicholas Carlini

    Abstract: Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data. Understanding this memorization is important in real world applications and also from a learning-theoretical perspective. An open question in previous studies of language model memorization is how to filter out "common" memorization. In fact, most memorization cri…

    Submitted 13 October, 2023; v1 submitted 23 December, 2021; originally announced December 2021.

    Comments: NeurIPS 2023; 42 pages, 33 figures

  39. arXiv:2112.03570  [pdf, other]

    cs.CR cs.LG

    Membership Inference Attacks From First Principles

    Authors: Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramer

    Abstract: A membership inference attack allows an adversary to query a trained machine learning model to predict whether or not a particular example was contained in the model's training dataset. These attacks are currently evaluated using average-case "accuracy" metrics that fail to characterize whether the attack can confidently identify any members of the training set. We argue that attacks should instea…

    Submitted 12 April, 2022; v1 submitted 7 December, 2021; originally announced December 2021.

  40. arXiv:2110.05679  [pdf, other]

    cs.LG cs.CL

    Large Language Models Can Be Strong Differentially Private Learners

    Authors: Xuechen Li, Florian Tramèr, Percy Liang, Tatsunori Hashimoto

    Abstract: Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language mod…

    Submitted 10 November, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: 31 pages; update ethics statement to clarify benefits and potential long-term harms

  41. arXiv:2108.07258  [pdf, other]

    cs.LG cs.AI cs.CY

    On the Opportunities and Risks of Foundation Models

    Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, et al. (89 additional authors not shown)

    Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their cap…

    Submitted 12 July, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html

  42. arXiv:2108.07256  [pdf, ps, other]

    cs.CR

    NeuraCrypt is not private

    Authors: Nicholas Carlini, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Florian Tramer

    Abstract: NeuraCrypt (Yara et al. arXiv 2021) is an algorithm that converts a sensitive dataset to an encoded dataset so that (1) it is still possible to train machine learning models on the encoded data, but (2) an adversary who has access only to the encoded dataset cannot learn much about the original sensitive dataset. We break NeuraCrypt privacy claims by perfectly solving the authors' public challen…

    Submitted 16 August, 2021; originally announced August 2021.

  43. arXiv:2107.11630  [pdf, other]

    cs.LG cs.CR stat.ML

    Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

    Authors: Florian Tramèr

    Abstract: Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance ε (in some metric), we can build a similarly robust (but inefficient) cl…

    Submitted 16 June, 2022; v1 submitted 24 July, 2021; originally announced July 2021.

    Comments: ICML 2022 (Long Talk)

  44. arXiv:2106.14851  [pdf, other]

    cs.LG cs.CR

    Data Poisoning Won't Save You From Facial Recognition

    Authors: Evani Radiya-Dixit, Sanghyun Hong, Nicholas Carlini, Florian Tramèr

    Abstract: Data poisoning has been proposed as a compelling defense against facial recognition models trained on Web-scraped pictures. Users can perturb images they post online, so that models will misclassify future (unperturbed) pictures. We demonstrate that this strategy provides a false sense of security, as it ignores an inherent asymmetry between the parties: users' pictures are perturbed once and for…

    Submitted 14 March, 2022; v1 submitted 28 June, 2021; originally announced June 2021.

    Comments: ICLR 2022

  45. arXiv:2106.03408  [pdf, other]

    cs.LG cs.CR

    Antipodes of Label Differential Privacy: PATE and ALIBI

    Authors: Mani Malek, Ilya Mironov, Karthik Prasad, Igor Shilov, Florian Tramèr

    Abstract: We consider the privacy-preserving machine learning (ML) setting where the trained model must satisfy differential privacy (DP) with respect to the labels of the training examples. We propose two novel approaches based on, respectively, the Laplace mechanism and the PATE framework, and demonstrate their effectiveness on standard benchmarks. While recent work by Ghazi et al. proposed Label DP sch…

    Submitted 29 October, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: 2021 Conference on Neural Information Processing Systems (NeurIPS)

  46. arXiv:2012.07805  [pdf, other]

    cs.CR cs.CL cs.LG

    Extracting Training Data from Large Language Models

    Authors: Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel

    Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and ar…

    Submitted 15 June, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

  47. arXiv:2011.11660  [pdf, other]

    cs.LG cs.CR stat.ML

    Differentially Private Learning Needs Better Features (or Much More Data)

    Authors: Florian Tramèr, Dan Boneh

    Abstract: We demonstrate that differentially private machine learning has not yet reached its "AlexNet moment" on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets. To exceed the performance of handcrafted features, we show that private learning requires either much more private data, or access to…

    Submitted 17 February, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: ICLR 2021. Code available at https://github.com/ftramer/Handcrafted-DP

  48. arXiv:2011.05315  [pdf, other]

    cs.CR cs.CV cs.LG

    Is Private Learning Possible with Instance Encoding?

    Authors: Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramer

    Abstract: A private machine learning algorithm hides as much as possible about its training data while still preserving accuracy. In this work, we study whether a non-private learning algorithm can be made private by relying on an instance-encoding mechanism that modifies the training inputs before feeding them to a normal learner. We formalize both the notion of instance encoding and its privacy by providi…

    Submitted 27 April, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

  49. arXiv:2007.14321  [pdf, other]

    cs.CR cs.LG stat.ML

    Label-Only Membership Inference Attacks

    Authors: Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot

    Abstract: Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only gets access to models' predicted labels, witho…

    Submitted 5 December, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

    Comments: 16 pages, 11 figures, 2 tables. Revision 2: 19 pages, 12 figures, 3 tables. Improved text and additional experiments. Final ICML paper

  50. arXiv:2002.08347  [pdf, other]

    cs.LG cs.CR stat.ML

    On Adaptive Attacks to Adversarial Example Defenses

    Authors: Florian Tramer, Nicholas Carlini, Wieland Brendel, Aleksander Madry

    Abstract: Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS (and chosen for illustrative and pedagogical purposes) can be circumvented despite attempting to perform evaluations using adaptive at…

    Submitted 23 October, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

    Comments: NeurIPS 2020