
Calibrating Large Language Models Using Their Generations Only

Authors: Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, Seong Joon Oh

Abstract: As large language models (LLMs) are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model's confidence in its prediction becomes even more important. However, finding effective ways to calibrate LLMs - especially when the only interface to the models is their generated text - remains a challenge. We propose APRICOT (auxiliary prediction of confidence targets): A method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone. This approach has several advantages: It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages, for instance by verbalizing the predicted confidence or adjusting the given answer based on the confidence. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2402.12991

TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

Authors: Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh

Abstract: Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.

Submitted 6 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: Accepted at ACL 2024 (findings)

arXiv:2307.01881

ProPILE: Probing Privacy Leakage in Large Language Models

Authors: Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, Seong Joon Oh

Abstract: The rapid advancement and widespread use of large language models (LLMs) have raised significant concerns regarding the potential leakage of personally identifiable information (PII). These models are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage in LLM-based services. ProPILE lets data subjects formulate prompts based on their own PII to evaluate the level of privacy intrusion in LLMs. We demonstrate its application on the OPT-1.3B model trained on the publicly available Pile dataset. We show how hypothetical data subjects may assess the likelihood of their PII being included in the Pile dataset being revealed. ProPILE can also be leveraged by LLM service providers to effectively evaluate their own levels of PII leakage with more powerful prompts specifically tuned for their in-house models. This tool represents a pioneering step towards empowering the data subjects for their awareness and control over their own data on the web.

Submitted 4 July, 2023; originally announced July 2023.

arXiv:2304.02688

Going Further: Flatness at the Rescue of Early Stopping for Adversarial Example Transferability

Authors: Martin Gubri, Maxime Cordy, Yves Le Traon

Abstract: Transferability is the property of adversarial examples to be misclassified by other models than the surrogate model for which they were crafted. Previous research has shown that early stopping the training of the surrogate model substantially increases transferability. A common hypothesis to explain this is that deep neural networks (DNNs) first learn robust features, which are more generic, thus a better surrogate. Then, at later epochs, DNNs learn non-robust features, which are more brittle, hence a worse surrogate. First, we tend to refute this hypothesis, using transferability as a proxy for representation similarity. We then establish links between transferability and the exploration of the loss landscape in parameter space, focusing on sharpness, which is affected by early stopping. This leads us to evaluate surrogate models trained with seven minimizers that minimize both loss value and loss sharpness. Among them, SAM consistently outperforms early stopping by up to 28.8 percentage points. We discover that the strong SAM regularization from large flat neighborhoods tightly links to transferability. Finally, the best sharpness-aware minimizers prove competitive with other training methods and complement existing transferability techniques.

Submitted 20 February, 2024; v1 submitted 5 April, 2023; originally announced April 2023.

Comments: Version 2: originally submitted in April 2023 and revised in February 2024

arXiv:2207.13129

LGV: Boosting Adversarial Example Transferability from Large Geometric Vicinity

Authors: Martin Gubri, Maxime Cordy, Mike Papadakis, Yves Le Traon, Koushik Sen

Abstract: We propose transferability from Large Geometric Vicinity (LGV), a new technique to increase the transferability of black-box adversarial attacks. LGV starts from a pretrained surrogate model and collects multiple weight sets from a few additional training epochs with a constant and high learning rate. LGV exploits two geometric properties that we relate to transferability. First, models that belong to a wider weight optimum are better surrogates. Second, we identify a subspace able to generate an effective surrogate ensemble among this wider optimum. Through extensive experiments, we show that LGV alone outperforms all (combinations of) four established test-time transformations by 1.8 to 59.9 percentage points. Our findings shed new light on the importance of the geometry of the weight space to explain the transferability of adversarial examples.

Submitted 26 July, 2022; originally announced July 2022.

Comments: Accepted at ECCV 2022

arXiv:2012.07381

doi 10.1145/3522664.3528606

Influence-Driven Data Poisoning in Graph-Based Semi-Supervised Classifiers

Authors: Adriano Franci, Maxime Cordy, Martin Gubri, Mike Papadakis, Yves Le Traon

Abstract: Graph-based Semi-Supervised Learning (GSSL) is a practical solution to learn from a limited amount of labelled data together with a vast amount of unlabelled data. However, due to their reliance on the known labels to infer the unknown labels, these algorithms are sensitive to data quality. It is therefore essential to study the potential threats related to the labelled data, more specifically, label poisoning. In this paper, we propose a novel data poisoning method which efficiently approximates the result of label inference to identify the inputs which, if poisoned, would produce the highest number of incorrectly inferred labels. We extensively evaluate our approach on three classification problems under 24 different experimental settings each. Compared to the state of the art, our influence-driven attack produces an average increase of error rate 50% higher, while being faster by multiple orders of magnitude. Moreover, our method can inform engineers of inputs that deserve investigation (relabelling them) before training the learning model. We show that relabelling one-third of the poisoned inputs (selected based on their influence) reduces the poisoning effect by 50%.

Submitted 11 May, 2022; v1 submitted 14 December, 2020; originally announced December 2020.

arXiv:2011.05074

Efficient and Transferable Adversarial Examples from Bayesian Neural Networks

Authors: Martin Gubri, Maxime Cordy, Mike Papadakis, Yves Le Traon, Koushik Sen

Abstract: An established way to improve the transferability of black-box evasion attacks is to craft the adversarial examples on an ensemble-based surrogate to increase diversity. We argue that transferability is fundamentally related to uncertainty. Based on a state-of-the-art Bayesian Deep Learning technique, we propose a new method to efficiently build a surrogate by sampling approximately from the posterior distribution of neural network weights, which represents the belief about the value of each parameter. Our extensive experiments on ImageNet, CIFAR-10 and MNIST show that our approach improves the success rates of four state-of-the-art attacks significantly (up to 83.2 percentage points), in both intra-architecture and inter-architecture transferability. On ImageNet, our approach can reach 94% of success rate while reducing training computations from 11.6 to 2.4 exaflops, compared to an ensemble of independently trained DNNs. Our vanilla surrogate achieves 87.5% of the time higher transferability than three test-time techniques designed for this purpose. Our work demonstrates that the way to train a surrogate has been overlooked, although it is an important element of transfer-based attacks. We are, therefore, the first to review the effectiveness of several training methods in increasing transferability. We provide new directions to better understand the transferability phenomenon and offer a simple but strong baseline for future work.

Submitted 18 June, 2022; v1 submitted 10 November, 2020; originally announced November 2020.

Comments: Accepted at UAI 2022

arXiv:1801.01953

Adversarial Perturbation Intensity Achieving Chosen Intra-Technique Transferability Level for Logistic Regression

Authors: Martin Gubri

Abstract: Machine Learning models have been shown to be vulnerable to adversarial examples, i.e., the manipulation of data by an attacker to defeat a defender's classifier at test time. We present a novel probabilistic definition of adversarial examples in perfect or limited knowledge settings using prior probability distributions on the defender's classifier. Using the asymptotic properties of the logistic regression, we derive a closed-form expression of the intensity of any adversarial perturbation, in order to achieve a given expected misclassification rate. This technique is relevant in a threat model of known model specifications and unknown training data. To our knowledge, this is the first method that allows an attacker to directly choose the probability of attack success. We evaluate our approach on two real-world datasets.

Submitted 5 January, 2018; originally announced January 2018.

Showing 1–8 of 8 results for author: Gubri, M