How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Authors:
Lorenzo Pacchiardi,
Alex J. Chan,
Sören Mindermann,
Ilan Moscovitz,
Alexa Y. Pan,
Yarin Gal,
Owain Evans,
Jan Brauner
Abstract:
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
Submitted 26 September, 2023;
originally announced September 2023.
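The detector described in the abstract above (ask a fixed set of unrelated yes/no follow-up questions, then feed the answers into a logistic regression classifier) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the question wording, the synthetic answer probabilities, the training hyperparameters, and all function names are hypothetical and are not taken from the paper. The logistic regression is hand-rolled with the standard library so the block runs without dependencies.

```python
import math
import random

# Hypothetical follow-up questions; the abstract does not list the real ones.
FOLLOW_UP_QUESTIONS = [
    "Is the sky green on Tuesdays?",
    "Does it feel bad to say things that are not true?",
    "Can a fish ride a bicycle?",
]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_lie_probability(weights, bias, answers):
    """answers: list of 0/1 values (1 = 'yes'), one per follow-up question."""
    z = bias + sum(w * a for w, a in zip(weights, answers))
    return sigmoid(z)

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent on the mean log-loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for answers, label in zip(X, y):
            error = predict_lie_probability(weights, bias, answers) - label
            for j, a in enumerate(answers):
                grad_w[j] += error * a
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def make_synthetic_data(rng, n_per_class=200):
    """Made-up data: P('yes') per question differs between the two classes.

    These probabilities are invented for the sketch, not measured results.
    """
    p_yes_after_lie = [0.8, 0.3, 0.7]
    p_yes_after_truth = [0.3, 0.7, 0.3]
    X, y = [], []
    for label, probs in ((1, p_yes_after_lie), (0, p_yes_after_truth)):
        for _ in range(n_per_class):
            X.append([1 if rng.random() < p else 0 for p in probs])
            y.append(label)
    return X, y

rng = random.Random(0)  # fixed seed so the sketch is reproducible
X_train, y_train = make_synthetic_data(rng)
weights, bias = train_logistic_regression(X_train, y_train)

# Score one new (hypothetical) transcript: answers yes, no, yes.
suspect_answers = [1, 0, 1]
print("P(lie) =", round(predict_lie_probability(weights, bias, suspect_answers), 3))
```

In a real setting the rows of `X_train` would come from the model's actual yes/no answers after known lies and known truthful statements; the classifier itself only ever sees those answers, which is what makes the approach black-box.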
Improved Sensitivity of the DRIFT-IId Directional Dark Matter Experiment using Machine Learning
Authors:
J. B. R. Battat,
C. Eldridge,
A. C. Ezeribe,
O. P. Gaunt,
J. -L. Gauvreau,
R. R. Marcelo Gregorio,
E. K. K. Habich,
K. E. Hall,
J. L. Harton,
I. Ingabire,
R. Lafler,
D. Loomba,
W. A. Lynch,
S. M. Paling,
A. Y. Pan,
A. Scarff,
F. G. Schuckman II,
D. P. Snowden-Ifft,
N. J. C. Spooner,
C. Toth,
A. A. Xu
Abstract:
We demonstrate a new type of analysis for the DRIFT-IId directional dark matter detector using a machine learning algorithm called a Random Forest Classifier. The analysis labels events as signal or background based on a series of selection parameters, rather than solely applying hard cuts. The analysis efficiency is shown to be comparable to our previous result at high energy but with increased efficiency at lower energies. This leads to a projected sensitivity enhancement of one order of magnitude below a WIMP mass of 15 GeV c$^{-2}$ and a projected sensitivity limit that reaches down to a WIMP mass of 9 GeV c$^{-2}$, which is a first for a directionally sensitive dark matter detector.
Submitted 8 June, 2021; v1 submitted 11 March, 2021;
originally announced March 2021.