-
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Authors:
Jannik Kossen,
Jiatong Han,
Muhammed Razzak,
Lisa Schut,
Shreshth Malik,
Yarin Gal
Abstract:
We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. (2024) proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.
Submitted 22 June, 2024;
originally announced June 2024.
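For readers unfamiliar with the quantity that SEPs are trained to approximate, the following is a minimal sketch of an entropy computed over groups of sampled answers that share a meaning. It assumes the grouping into meaning-equivalent clusters is already given (the `cluster_ids` input and the example labels are hypothetical) and uses empirical cluster frequencies; it is an illustration, not the authors' implementation.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids: one label per sampled generation; generations with the
    same meaning are assumed to carry the same label (hypothetical input).
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled generation")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # p * log(1/p) summed over clusters, with p = count / total.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Hypothetical example: ten sampled answers, labelled by meaning.
consistent = ["paris"] * 10
scattered = ["paris", "lyon", "nice", "paris", "rome",
             "lyon", "oslo", "bern", "nice", "rome"]

print(semantic_entropy(consistent))  # all answers agree
print(semantic_entropy(scattered))   # answers disagree in meaning
```

A probe in the spirit of the abstract would then be trained to predict this value (or a thresholded version of it) from the hidden states of a single generation, which avoids sampling several answers at test time.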
-
The Benefits and Risks of Transductive Approaches for AI Fairness
Authors:
Muhammed Razzak,
Andreas Kirsch,
Yarin Gal
Abstract:
Recently, transductive learning methods, which leverage holdout sets during training, have gained popularity for their potential to improve speed, accuracy, and fairness in machine learning models. Despite this, the composition of the holdout set itself, particularly the balance of sensitive sub-groups, has been largely overlooked. Our experiments on CIFAR and CelebA datasets show that compositional changes in the holdout set can substantially influence fairness metrics. Imbalanced holdout sets exacerbate existing disparities, while balanced holdouts can mitigate issues introduced by imbalanced training data. These findings underline the necessity of constructing holdout sets that are both diverse and representative.
Submitted 17 June, 2024;
originally announced June 2024.
-
Deep Bayesian Active Learning for Preference Modeling in Large Language Models
Authors:
Luckeciano C. Melo,
Panagiotis Tigas,
Alessandro Abate,
Yarin Gal
Abstract:
Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the further development of LLMs. Bayesian Active Learning provides a principled framework for addressing this challenge and has demonstrated remarkable success in diverse settings. However, previous attempts to employ it for Preference Modeling did not meet such expectations. In this work, we identify that naive epistemic uncertainty estimation leads to the acquisition of redundant samples. We address this by proposing the Bayesian Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition policy that not only targets points of high epistemic uncertainty according to the preference model but also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM. Notably, our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies.
Submitted 14 June, 2024;
originally announced June 2024.
-
Estimating the Hallucination Rate of Generative AI
Authors:
Andrew Jesson,
Nicolas Beltran-Velez,
Quentin Chu,
Sweta Karlekar,
Jannik Kossen,
Yarin Gal,
John P. Cunningham,
David Blei
Abstract:
This work is about estimating the hallucination rate for in-context learning (ICL) with Generative AI. In ICL, a conditional generative model (CGM) is prompted with a dataset and asked to make a prediction based on that dataset. The Bayesian interpretation of ICL assumes that the CGM is calculating a posterior predictive distribution over an unknown Bayesian model of a latent parameter and data. With this perspective, we define a hallucination as a generated prediction that has low probability under the true latent parameter. We develop a new method that takes an ICL problem -- that is, a CGM, a dataset, and a prediction question -- and estimates the probability that a CGM will generate a hallucination. Our method only requires generating queries and responses from the model and evaluating its response log probability. We empirically evaluate our method on synthetic regression and natural language ICL tasks using large language models.
Submitted 11 June, 2024;
originally announced June 2024.
-
Challenges and Considerations in the Evaluation of Bayesian Causal Discovery
Authors:
Amir Mohammad Karimi Mamaghan,
Panagiotis Tigas,
Karl Henrik Johansson,
Yarin Gal,
Yashas Annadani,
Stefan Bauer
Abstract:
Representing uncertainty in causal discovery is a crucial component for experimental design, and more broadly, for safe and reliable causal decision making. Bayesian Causal Discovery (BCD) offers a principled approach to encapsulating this uncertainty. Unlike non-Bayesian causal discovery, which relies on a single estimated causal graph and model parameters for assessment, evaluating BCD presents challenges due to the nature of its inferred quantity - the posterior distribution. As a result, the research community has proposed various metrics to assess the quality of the approximate posterior. However, there is, to date, no consensus on the most suitable metric(s) for evaluation. In this work, we reexamine this question by dissecting various metrics and understanding their limitations. Through extensive empirical evaluation, we find that many existing metrics fail to exhibit a strong correlation with the quality of approximation to the true posterior, especially in scenarios with low sample sizes where BCD is most desirable. We highlight the suitability (or lack thereof) of these metrics under two distinct factors: the identifiability of the underlying causal model and the quantity of available data. Both factors affect the entropy of the true posterior, indicating that the current metrics are less fitting in settings of higher entropy. Our findings underline the importance of a more nuanced evaluation of new methods by taking into account the nature of the true posterior, as well as guide and motivate the development of new evaluation procedures for this challenge.
Submitted 5 June, 2024;
originally announced June 2024.
-
Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities
Authors:
Alexander Nikitin,
Jannik Kossen,
Yarin Gal,
Pekka Marttinen
Abstract:
Uncertainty quantification in Large Language Models (LLMs) is crucial for applications where safety and reliability are important. In particular, uncertainty can be used to improve the trustworthiness of LLMs by detecting factually incorrect model responses, commonly called hallucinations. Critically, one should seek to capture the model's semantic uncertainty, i.e., the uncertainty over the meanings of LLM outputs, rather than uncertainty over lexical or syntactic variations that do not affect answer correctness. To address this problem, we propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs. KLE defines positive semidefinite unit trace kernels to encode the semantic similarities of LLM outputs and quantifies uncertainty using the von Neumann entropy. It considers pairwise semantic dependencies between answers (or semantic clusters), providing more fine-grained uncertainty estimates than previous methods based on hard clustering of answers. We theoretically prove that KLE generalizes the previous state-of-the-art method called semantic entropy and empirically demonstrate that it improves uncertainty quantification performance across multiple natural language generation datasets and LLM architectures.
Submitted 30 May, 2024;
originally announced May 2024.
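As a reminder of the quantity named in the Kernel Language Entropy abstract, the von Neumann entropy of a positive semidefinite, unit-trace kernel matrix can be written in terms of its eigenvalues. The symbols $K$, $\lambda_i$ and $n$ are notation chosen here for illustration and do not come from the abstract.

```latex
\mathrm{VNE}(K) \;=\; -\operatorname{Tr}\!\left[ K \log K \right]
\;=\; -\sum_{i=1}^{n} \lambda_i \log \lambda_i ,
\qquad K \succeq 0, \quad \operatorname{Tr}(K) = \sum_{i=1}^{n} \lambda_i = 1 ,
```

with the convention $0 \log 0 = 0$. Because the eigenvalues are nonnegative and sum to one, this is the Shannon entropy of the eigenvalue spectrum of $K$.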
-
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
Authors:
Gunshi Gupta,
Karmesh Yadav,
Yarin Gal,
Dhruv Batra,
Zsolt Kira,
Cong Lu,
Tim G. J. Rudner
Abstract:
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
Submitted 9 May, 2024;
originally announced May 2024.
-
Explaining Explainability: Understanding Concept Activation Vectors
Authors:
Angus Nicolson,
Lisa Schut,
J. Alison Noble,
Yarin Gal
Abstract:
Recent interpretability methods propose using concept-based explanations to translate the internal representations of deep learning models into a language that humans are familiar with: concepts. This requires understanding which concepts are present in the representation space of a neural network. One popular method for finding concepts is Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars. In this work, we investigate three properties of CAVs. CAVs may be: (1) inconsistent between layers, (2) entangled with different concepts, and (3) spatially dependent. Each property provides both challenges and opportunities in interpreting models. We introduce tools designed to detect the presence of these properties, provide insight into how they affect the derived explanations, and provide recommendations to minimise their impact. Understanding these properties can be used to our advantage. For example, we introduce spatially dependent CAVs to test if a model is translation invariant with respect to a specific concept and class. Our experiments are performed on ImageNet and a new synthetic dataset, Elements. Elements is designed to capture a known ground truth relationship between concepts and classes. We release this dataset to facilitate further research in understanding and evaluating interpretability methods.
Submitted 4 April, 2024;
originally announced April 2024.
-
Bayesian Preference Elicitation with Language Models
Authors:
Kunal Handa,
Yarin Gal,
Ellie Pavlick,
Noah Goodman,
Jacob Andreas,
Alex Tamkin,
Belinda Z. Li
Abstract:
Aligning AI systems to users' interests requires understanding and incorporating humans' complex values and preferences. Recently, language models (LMs) have been used to gather information about the preferences of human users. This preference data can be used to fine-tune or guide other LMs and/or AI systems. However, LMs have been shown to struggle with crucial aspects of preference learning: quantifying uncertainty, modeling human mental states, and asking informative questions. These challenges have been addressed in other areas of machine learning, such as Bayesian Optimal Experimental Design (BOED), which focuses on designing informative queries within a well-defined feature space. But these methods, in turn, are difficult to scale and apply to real-world problems where simply identifying the relevant features can be difficult. We introduce OPEN (Optimal Preference Elicitation with Natural language), a framework that uses BOED to guide the choice of informative questions and an LM to extract features and translate abstract BOED queries into natural language questions. By combining the flexibility of LMs with the rigor of BOED, OPEN can optimize the informativity of queries while remaining adaptable to real-world domains. In user studies, we find that OPEN outperforms existing LM- and BOED-based methods for preference elicitation.
Submitted 8 March, 2024;
originally announced March 2024.
-
Continual Learning via Sequential Function-Space Variational Inference
Authors:
Tim G. J. Rudner,
Freddie Bickford Smith,
Qixuan Feng,
Yee Whye Teh,
Yarin Gal
Abstract:
Sequential Bayesian inference over predictive functions is a natural framework for continual learning from streams of data. However, applying it to neural networks has proved challenging in practice. Addressing the drawbacks of existing techniques, we propose an optimization objective derived by formulating continual learning as sequential function-space variational inference. In contrast to existing methods that regularize neural network parameters directly, this objective allows parameters to vary widely during training, enabling better adaptation to new tasks. Compared to objectives that directly regularize neural network predictions, the proposed objective allows for more flexible variational distributions and more effective regularization. We demonstrate that, across a range of task sequences, neural networks trained via sequential function-space variational inference achieve better predictive accuracy than networks trained with related methods while depending less on maintaining a set of representative points from previous tasks.
Submitted 28 December, 2023;
originally announced December 2023.
-
Tractable Function-Space Variational Inference in Bayesian Neural Networks
Authors:
Tim G. J. Rudner,
Zonghao Chen,
Yee Whye Teh,
Yarin Gal
Abstract:
Reliable predictive uncertainty estimation plays an important role in enabling the deployment of neural networks to safety-critical settings. A popular approach for estimating the predictive uncertainty of neural networks is to define a prior distribution over the network parameters, infer an approximate posterior distribution, and use it to make stochastic predictions. However, explicit inference over neural network parameters makes it difficult to incorporate meaningful prior information about the data-generating process into the model. In this paper, we pursue an alternative approach. Recognizing that the primary object of interest in most settings is the distribution over functions induced by the posterior distribution over neural network parameters, we frame Bayesian inference in neural networks explicitly as inferring a posterior distribution over functions and propose a scalable function-space variational inference method that allows incorporating prior information and results in reliable predictive uncertainty estimates. We show that the proposed method leads to state-of-the-art uncertainty estimation and predictive performance on a range of prediction tasks and demonstrate that it performs well on a challenging safety-critical medical diagnosis task in which reliable uncertainty estimation is essential.
Submitted 28 December, 2023;
originally announced December 2023.
-
Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning?
Authors:
Gunshi Gupta,
Tim G. J. Rudner,
Rowan Thomas McAllister,
Adrien Gaidon,
Yarin Gal
Abstract:
Causal confusion is a phenomenon where an agent learns a policy that reflects imperfect spurious correlations in the data. Such a policy may falsely appear to be optimal during training if most of the training data contain such spurious correlations. This phenomenon is particularly pronounced in domains such as robotics, with potentially large gaps between the open- and closed-loop performance of an agent. In such settings, causally confused models may appear to perform well according to open-loop metrics during training but fail catastrophically when deployed in the real world. In this paper, we study causal confusion in offline reinforcement learning. We investigate whether selectively sampling appropriate points from a dataset of demonstrations may enable offline reinforcement learning agents to disambiguate the underlying causal mechanisms of the environment, alleviate causal confusion in offline reinforcement learning, and produce a safer model for deployment. To answer this question, we consider a set of tailored offline reinforcement learning datasets that exhibit causal ambiguity and assess the ability of active sampling techniques to reduce causal confusion at evaluation. We provide empirical evidence that uniform and active sampling techniques are able to consistently reduce causal confusion as training progresses and that active sampling is able to do so significantly more efficiently than uniform sampling.
Submitted 28 December, 2023;
originally announced December 2023.
-
Entanglement Dynamics in Monitored Systems and the Role of Quantum Jumps
Authors:
Youenn Le Gal,
Xhek Turkeshi,
Marco Schirò
Abstract:
Monitored quantum many-body systems display a rich pattern of entanglement dynamics, which is unique to this non-unitary setting. This work studies the effect of quantum jumps on the entanglement dynamics beyond the no-click limit corresponding to a deterministic non-Hermitian evolution. We consider two examples, a monitored SSH model and a quantum Ising chain, for which we show the jumps have remarkably different effects on the entanglement despite having the same statistics as encoded in their waiting-time distribution. To understand this difference, we introduce a new metric, the statistics of entanglement gain and loss due to jumps and non-Hermitian evolution. This insight allows us to build a simple stochastic model of a random walk with partial resetting, which reproduces the entanglement dynamics, and to dissect the mutual role of jumps and non-Hermitian evolution on the entanglement scaling. We demonstrate that significant deviations from the no-click limit arise whenever quantum jumps strongly renormalize the non-Hermitian dynamics, as in the case of the SSH model at weak monitoring or in the Ising chain at large transverse field. On the other hand, we show that the weak monitoring phase of the Ising chain leads to a robust sub-volume logarithmic phase due to weakly renormalized non-Hermitian dynamics.
Submitted 27 June, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design
Authors:
Clare Lyle,
Arash Mehrjou,
Pascal Notin,
Andrew Jesson,
Stefan Bauer,
Yarin Gal,
Patrick Schwab
Abstract:
The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanisms. Existing approaches search over the billions of potential interventions to maximize the expected influence on the target phenotype. However, to reduce the risk of failure in future stages of trials, practical experiment design aims to find a set of interventions that maximally change a target phenotype via diverse mechanisms. We propose DiscoBAX, a sample-efficient method for maximizing the rate of significant discoveries per experiment while simultaneously probing for a wide range of diverse mechanisms during a genomic experiment campaign. We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation covering both synthetic as well as real-world experimental design tasks. DiscoBAX outperforms existing state-of-the-art methods for experimental design, selecting effective and diverse perturbations in biological systems.
Submitted 7 December, 2023;
originally announced December 2023.
-
Revamping AI Models in Dermatology: Overcoming Critical Challenges for Enhanced Skin Lesion Diagnosis
Authors:
Deval Mehta,
Brigid Betz-Stablein,
Toan D Nguyen,
Yaniv Gal,
Adrian Bowling,
Martin Haskett,
Maithili Sashindranath,
Paul Bonnington,
Victoria Mar,
H Peter Soyer,
Zongyuan Ge
Abstract:
The surge in developing deep learning models for diagnosing skin lesions through image analysis is notable, yet their clinical adoption faces challenges. Current dermatology AI models have limitations: a limited number of possible diagnostic outputs, lack of real-world testing on uncommon skin lesions, inability to detect out-of-distribution images, and over-reliance on dermoscopic images. To address these, we present an All-In-One Hierarchical-Out of Distribution-Clinical Triage (HOT) model. For a clinical image, our model generates three outputs: a hierarchical prediction, an alert for out-of-distribution images, and a recommendation for dermoscopy if the clinical image alone is insufficient for diagnosis. When the recommendation is pursued, it integrates both clinical and dermoscopic images to deliver a final diagnosis. Extensive experiments on a representative cutaneous lesion dataset demonstrate the effectiveness and synergy of each component within our framework. Our versatile model provides valuable decision support for lesion diagnosis and sets a promising precedent for medical AI applications.
Submitted 2 November, 2023;
originally announced November 2023.
-
Form follows Function: Text-to-Text Conditional Graph Generation based on Functional Requirements
Authors:
Peter A. Zachares,
Vahan Hovhannisyan,
Alan Mosca,
Yarin Gal
Abstract:
This work focuses on the novel problem setting of generating graphs conditioned on a description of the graph's functional requirements in a downstream task. We pose the problem as a text-to-text generation problem and focus on the approach of fine-tuning a pretrained large language model (LLM) to generate graphs. We propose an inductive bias which incorporates information about the structure of the graph into the LLM's generation process by incorporating message passing layers into an LLM's architecture. To evaluate our proposed method, we design a novel set of experiments using publicly available and widely studied molecule and knowledge graph data sets. Results suggest our proposed approach generates graphs which more closely meet the requested functional requirements, outperforming baselines developed on similar tasks by a statistically significant margin.
Submitted 1 November, 2023;
originally announced November 2023.
-
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Authors:
Lorenzo Pacchiardi,
Alex J. Chan,
Sören Mindermann,
Ilan Moscovitz,
Alexa Y. Pan,
Yarin Gal,
Owain Evans,
Jan Brauner
Abstract:
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
Submitted 26 September, 2023;
originally announced September 2023.
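The detector described in the abstract above feeds yes/no answers to follow-up questions into a logistic regression classifier. The following minimal sketch shows only the scoring step, under loud assumptions: the 1.0/0.0 encoding, the three-question setup, and the `weights` and `bias` values are all hypothetical placeholders, not the paper's features or fitted parameters.

```python
import math


def encode_answers(answers):
    """Map yes/no answers to 1.0/0.0 features (assumed encoding)."""
    mapping = {"yes": 1.0, "no": 0.0}
    return [mapping[a.strip().lower()] for a in answers]


def lie_probability(features, weights, bias):
    """Logistic regression score: sigmoid(bias + weights . features)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters; in practice they would be fitted on
# transcripts labelled as truthful or lying.
weights = [1.2, -0.8, 0.5]
bias = -0.3

# Hypothetical answers to three unrelated follow-up questions.
answers = ["Yes", "no", "yes"]
p = lie_probability(encode_answers(answers), weights, bias)
print(f"estimated probability of a lie: {p:.3f}")
```

Because only the answers are used, a scorer of this shape needs neither model activations nor ground-truth knowledge of the disputed fact, which matches the black-box setting the abstract describes.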
-
Fine-tuning can cripple your foundation model; preserving features may be the solution
Authors:
Jishnu Mukhoti,
Yarin Gal,
Philip H. S. Torr,
Puneet K. Dokania
Abstract:
Pre-trained foundation models, due to their enormous capacity and exposure to vast amounts of data during pre-training, are known to have learned plenty of real-world concepts. An important step in making these pre-trained models effective on downstream tasks is to fine-tune them on related datasets. While various fine-tuning methods have been devised and have been shown to be highly effective, we observe that a fine-tuned model's ability to recognize concepts on tasks $\textit{different}$ from the downstream one is reduced significantly compared to its pre-trained counterpart. This is an undesirable effect of fine-tuning as a substantial amount of resources was used to learn these pre-trained concepts in the first place. We call this phenomenon ''concept forgetting'' and via experiments show that most end-to-end fine-tuning approaches suffer heavily from this side effect. To this end, we propose a simple fix to this problem by designing a new fine-tuning method called $\textit{LDIFS}$ (short for $\ell_2$ distance in feature space) that, while learning new concepts related to the downstream task, allows a model to preserve its pre-trained knowledge as well. Through extensive experiments on 10 fine-tuning tasks we show that $\textit{LDIFS}$ significantly reduces concept forgetting. Additionally, we show that LDIFS is highly effective in performing continual fine-tuning on a sequence of tasks as well, in comparison with both fine-tuning as well as continual learning baselines.
Submitted 1 July, 2024; v1 submitted 25 August, 2023;
originally announced August 2023.
-
In-Context Learning Learns Label Relationships but Is Not Conventional Learning
Authors:
Jannik Kossen,
Yarin Gal,
Tom Rainforth
Abstract:
The predictions of Large Language Models (LLMs) on downstream tasks often improve significantly when including examples of the input--label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works. For example, while Xie et al. (2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022) argue ICL does not even learn label relationships from in-context examples. In this paper, we provide novel insights into how ICL leverages label information, revealing both capabilities and limitations. To ensure we obtain a comprehensive picture of ICL behavior, we study probabilistic aspects of ICL predictions and thoroughly examine the dynamics of ICL as more examples are provided. Our experiments show that ICL predictions almost always depend on in-context labels and that ICL can learn truly novel tasks in-context. However, we also find that ICL struggles to fully overcome prediction preferences acquired from pre-training data and, further, that ICL does not consider all in-context information equally.
Submitted 13 March, 2024; v1 submitted 23 July, 2023;
originally announced July 2023.
-
LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
Authors:
David Glukhov,
Ilia Shumailov,
Yarin Gal,
Nicolas Papernot,
Vardan Papyan
Abstract:
Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship using LLMs, have proven to be fallible, as LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LM to detect undesirable content in LLM outputs. In this paper, we present the theoretical limitations of such semantic censorship approaches. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, highlighting the inherent challenges in censorship that arise due to LLMs' programmatic and instruction-following capabilities. Furthermore, we argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs from a collection of permissible ones. As a result, we propose that the problem of censorship needs to be reevaluated; it should be treated as a security problem which warrants the adaptation of security-based approaches to mitigate potential risks.
Submitted 20 July, 2023;
originally announced July 2023.
-
BatchGFN: Generative Flow Networks for Batch Active Learning
Authors:
Shreshth A. Malik,
Salem Lahlou,
Andrew Jesson,
Moksh Jain,
Nikolay Malkin,
Tristan Deleu,
Yoshua Bengio,
Yarin Gal
Abstract:
We introduce BatchGFN -- a novel approach for pool-based active learning that uses generative flow networks to sample sets of data points proportional to a batch reward. With an appropriate reward function to quantify the utility of acquiring a batch, such as the joint mutual information between the batch and the model parameters, BatchGFN is able to construct highly informative batches for active learning in a principled way. We show our approach enables sampling near-optimal utility batches at inference time with a single forward pass per point in the batch in toy regression problems. This alleviates the computational complexity of batch-aware algorithms and removes the need for greedy approximations to find maximizers for the batch reward. We also present early results for amortizing training across acquisition steps, which will enable scaling to real-world tasks.
Submitted 26 June, 2023;
originally announced June 2023.
-
ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages
Authors:
Andrew Jesson,
Chris Lu,
Gunshi Gupta,
Angelos Filos,
Jakob Nicolaus Foerster,
Yarin Gal
Abstract:
This paper introduces an effective and practical step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. This step manifests as three simple modifications to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables prudent state-aware exploration around the modes of the actor via Thompson sampling. Extensive empirical evaluations on diverse benchmarks reveal the superior performance of our approach compared to existing on- and off-policy algorithms. We demonstrate significant improvements for median and interquartile mean metrics over PPO, SAC, and TD3 on the MuJoCo continuous control benchmark. Moreover, we see improvement over PPO in the challenging ProcGen generalization benchmark.
Submitted 24 November, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
The Curse of Recursion: Training on Generated Data Makes Models Forget
Authors:
Ilia Shumailov,
Zakhar Shumaylov,
Yiren Zhao,
Yarin Gal,
Nicolas Papernot,
Ross Anderson
Abstract:
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
Submitted 14 April, 2024; v1 submitted 27 May, 2023;
originally announced May 2023.
-
Prediction-Oriented Bayesian Active Learning
Authors:
Freddie Bickford Smith,
Andreas Kirsch,
Sebastian Farquhar,
Yarin Gal,
Adam Foster,
Tom Rainforth
Abstract:
Information-theoretic approaches to active learning have traditionally focused on maximising the information gathered about the model parameters, most commonly by optimising the BALD score. We highlight that this can be suboptimal from the perspective of predictive performance. For example, BALD lacks a notion of an input distribution and so is prone to prioritise data of limited relevance. To address this we propose the expected predictive information gain (EPIG), an acquisition function that measures information gain in the space of predictions rather than parameters. We find that using EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models, and thus provides an appealing drop-in replacement.
Submitted 17 April, 2023;
originally announced April 2023.
-
Revisiting Automated Prompting: Are We Actually Doing Better?
Authors:
Yulin Zhou,
Yiren Zhao,
Ilia Shumailov,
Robert Mullins,
Yarin Gal
Abstract:
Current literature demonstrates that Large Language Models (LLMs) are great few-shot learners, and prompting significantly increases their performance on a range of downstream tasks in a few-shot learning setting. An attempt to automate human-led prompting followed, with some progress achieved. In particular, subsequent work demonstrates automation can outperform fine-tuning in certain K-shot learning scenarios.
In this paper, we revisit techniques for automated prompting on six different downstream tasks and a larger range of K-shot learning settings. We find that automated prompting does not consistently outperform simple manual prompts. Our work suggests that, in addition to fine-tuning, manual prompts should be used as a baseline in this line of research.
Submitted 22 June, 2023; v1 submitted 7 April, 2023;
originally announced April 2023.
-
Differentiable Multi-Target Causal Bayesian Experimental Design
Authors:
Yashas Annadani,
Panagiotis Tigas,
Desi R. Ivanova,
Andrew Jesson,
Yarin Gal,
Adam Foster,
Stefan Bauer
Abstract:
We introduce a gradient-based approach for the problem of Bayesian optimal experimental design to learn causal models in a batch setting -- a critical component for causal discovery from finite data where interventions can be costly or risky. Existing methods rely on greedy approximations to construct a batch of experiments while using black-box methods to optimize over a single target-state pair to intervene with. In this work, we completely dispose of the black-box optimization techniques and greedy heuristics and instead propose a conceptually simple end-to-end gradient-based optimization procedure to acquire a set of optimal intervention target-state pairs. Such a procedure enables parameterization of the design space to efficiently optimize over a batch of multi-target-state interventions, a setting which has hitherto not been explored due to its complexity. We demonstrate that our proposed method outperforms baselines and existing acquisition strategies in both single-target and multi-target settings across a number of synthetic datasets.
Submitted 2 June, 2023; v1 submitted 21 February, 2023;
originally announced February 2023.
-
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Authors:
Lorenz Kuhn,
Yarin Gal,
Sebastian Farquhar
Abstract:
We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of "semantic equivalence" -- different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy -- an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.
Submitted 15 April, 2023; v1 submitted 19 February, 2023;
originally announced February 2023.
-
Using uncertainty-aware machine learning models to study aerosol-cloud interactions
Authors:
Maëlys Solal,
Andrew Jesson,
Yarin Gal,
Alyson Douglas
Abstract:
Aerosol-cloud interactions (ACI) include various effects that result from aerosols entering a cloud and affecting cloud properties. In general, an increase in aerosol concentration results in smaller droplet sizes which leads to larger, brighter, longer-lasting clouds that reflect more sunlight and cool the Earth. The strength of the effect is, however, heterogeneous, meaning it depends on the surrounding environment, making ACI one of the most uncertain effects in our current climate models. In our work, we use causal machine learning to estimate ACI from satellite observations by reframing the problem as a treatment (aerosol) and outcome (change in droplet radius). We predict the causal effect of aerosol on clouds with uncertainty bounds depending on the unknown factors that may be influencing the impact of aerosol. Of the three climate models evaluated, we find that only one plausibly recreates the trend, lending more credence to its estimate of cooling due to ACI.
Submitted 30 November, 2022;
originally announced January 2023.
-
On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations
Authors:
Tim G. J. Rudner,
Cong Lu,
Michael A. Osborne,
Yarin Gal,
Yee Whye Teh
Abstract:
KL-regularized reinforcement learning from expert demonstrations has proved successful in improving the sample efficiency of deep reinforcement learning algorithms, allowing them to be applied to challenging physical real-world tasks. However, we show that KL-regularized reinforcement learning with behavioral reference policies derived from expert demonstrations can suffer from pathological training dynamics that can lead to slow, unstable, and suboptimal online learning. We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. Finally, we show that the pathology can be remedied by non-parametric behavioral reference policies and that this allows KL-regularized reinforcement learning to significantly outperform state-of-the-art approaches on a variety of challenging locomotion and dexterous hand manipulation tasks.
Submitted 28 December, 2022;
originally announced December 2022.
-
CLAM: Selective Clarification for Ambiguous Questions with Generative Language Models
Authors:
Lorenz Kuhn,
Yarin Gal,
Sebastian Farquhar
Abstract:
Users often ask dialogue systems ambiguous questions that require clarification. We show that current language models rarely ask users to clarify ambiguous questions and instead provide incorrect answers. To address this, we introduce CLAM: a framework for getting language models to selectively ask for clarification about ambiguous user questions. In particular, we show that we can prompt language models to detect whether a given question is ambiguous, generate an appropriate clarifying question to ask the user, and give a final answer after receiving clarification. We also show that we can simulate users by providing language models with privileged information. This lets us automatically evaluate multi-turn clarification dialogues. Finally, CLAM significantly improves language models' accuracy on mixed ambiguous and unambiguous questions relative to SotA.
Submitted 20 February, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks
Authors:
Neil Band,
Tim G. J. Rudner,
Qixuan Feng,
Angelos Filos,
Zachary Nado,
Michael W. Dusenberry,
Ghassen Jerfel,
Dustin Tran,
Yarin Gal
Abstract:
Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet, existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose the RETINA Benchmark, a set of real-world tasks that accurately reflect such complexities and are designed to assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast and easy benchmarking following reproducibility and software design principles. We provide implementations of all methods included in the benchmark as well as results computed over 100 TPU days, 20 GPU days, 400 hyperparameter configurations, and evaluation on at least 6 random seeds each.
Submitted 23 November, 2022;
originally announced November 2022.
-
Discovering Long-period Exoplanets using Deep Learning with Citizen Science Labels
Authors:
Shreshth A. Malik,
Nora L. Eisner,
Chris J. Lintott,
Yarin Gal
Abstract:
Automated planetary transit detection has become vital to prioritize candidates for expert analysis given the scale of modern telescopic surveys. While current methods for short-period exoplanet detection work effectively due to periodicity in the light curves, a robust approach for detecting single-transit events is lacking. However, volunteer-labelled transits recently collected by the Planet Hunters TESS (PHT) project now provide an unprecedented opportunity to investigate a data-driven approach to long-period exoplanet detection. In this work, we train a 1-D convolutional neural network to classify planetary transits using PHT volunteer scores as training data. We find that using volunteer scores significantly improves performance over synthetic data, and enables the recovery of known planets at a precision and rate matching that of the volunteers. Importantly, the model also recovers transits found by volunteers but missed by current automated methods.
Submitted 13 November, 2022;
originally announced November 2022.
-
Volume-to-Area Law Entanglement Transition in a non-Hermitian Free Fermionic Chain
Authors:
Youenn Le Gal,
Xhek Turkeshi,
Marco Schirò
Abstract:
We consider the dynamics of the non-Hermitian Su-Schrieffer-Heeger model arising as the no-click limit of a continuously monitored free fermion chain where particles and holes are measured on two sublattices. The model has $\mathcal{PT}$-symmetry, which we show to spontaneously break as a function of the strength of measurement backaction, resulting in a spectral transition where quasiparticles acquire a finite lifetime in patches of the Brillouin zone. We compute the entanglement entropy's dynamics in the thermodynamic limit and demonstrate an entanglement transition between volume-law and area-law scaling, which we characterize analytically. Interestingly we show that the entanglement transition and the $\mathcal{PT}$-symmetry breaking do not coincide, the former occurring when the entire decay spectrum of the quasiparticle becomes gapped.
Submitted 22 February, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
Exploring Low Rank Training of Deep Neural Networks
Authors:
Siddhartha Rao Kamalakara,
Acyr Locatelli,
Bharat Venkitesh,
Jimmy Ba,
Yarin Gal,
Aidan N. Gomez
Abstract:
Training deep neural networks in low rank, i.e. with factorised layers, is of particular interest to the community: it offers efficiency over unfactorised training in terms of both memory consumption and training time. Prior work has focused on low rank approximations of pre-trained networks and training in low rank space with additional objectives, offering various ad hoc explanations for chosen practice. We analyse techniques that work well in practice, and through extensive ablations on models such as GPT2 we provide evidence falsifying common beliefs in the field, hinting in the process at exciting research opportunities that still need answering.
Submitted 27 September, 2022;
originally announced September 2022.
-
Skin Lesion Recognition with Class-Hierarchy Regularized Hyperbolic Embeddings
Authors:
Zhen Yu,
Toan Nguyen,
Yaniv Gal,
Lie Ju,
Shekhar S. Chandra,
Lei Zhang,
Paul Bonnington,
Victoria Mar,
Zhiyong Wang,
Zongyuan Ge
Abstract:
In practice, many medical datasets have an underlying taxonomy defined over the disease label space. However, existing classification algorithms for medical diagnoses often assume semantically independent labels. In this study, we aim to leverage class hierarchy with deep learning algorithms for more accurate and reliable skin lesion recognition. We propose a hyperbolic network to learn image embeddings and class prototypes jointly. The hyperbola provably provides a space for modeling hierarchical relations better than Euclidean geometry. Meanwhile, we restrict the distribution of hyperbolic prototypes with a distance matrix that is encoded from the class hierarchy. Accordingly, the learned prototypes preserve the semantic class relations in the embedding space and we can predict the label of an image by assigning its feature to the nearest hyperbolic class prototype. We use an in-house skin lesion dataset which consists of around 230k dermoscopic images on 65 skin diseases to verify our method. Extensive experiments provide evidence that our model can achieve higher accuracy with less severe classification errors than models without considering class relations.
Submitted 13 September, 2022;
originally announced September 2022.
-
Exploring the Limits of Synthetic Creation of Solar EUV Images via Image-to-Image Translation
Authors:
Valentina Salvatelli,
Luiz F. G. dos Santos,
Souvik Bose,
Brad Neuberg,
Mark C. M. Cheung,
Miho Janvier,
Meng Jin,
Yarin Gal,
Atilim Gunes Baydin
Abstract:
The Solar Dynamics Observatory (SDO), a NASA multi-spectral decade-long mission that has been producing terabytes of observational data from the Sun daily, has recently been used as a use-case to demonstrate the potential of machine learning methodologies and to pave the way for future deep-space mission planning. In particular, the idea of using image-to-image translation to virtually produce extreme ultra-violet channels has been proposed in several recent studies, as a way both to enhance missions with fewer available channels and to alleviate the challenges due to the low downlink rate in deep space. This paper investigates the potential and the limitations of such a deep learning approach by focusing on the permutation of four channels and an encoder--decoder based architecture, with particular attention to how morphological traits and brightness of the solar surface affect the neural network predictions. In this work we want to answer the question: can synthetic images of the solar corona produced via image-to-image translation be used for scientific studies of the Sun? The analysis highlights that the neural network produces high-quality images over three orders of magnitude in count rate (pixel intensity) and can generally reproduce the covariance across channels within a 1% error. However, the model performance drastically diminishes for extremely energetic events like flares, and we argue that the reason is related to the rareness of such events posing a challenge to model training.
△ Less
Submitted 19 August, 2022;
originally announced August 2022.
-
Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities
Authors:
Andreas Kirsch,
Yarin Gal
Abstract:
Recently proposed methods in data subset selection, that is active learning and active sampling, use Fisher information, Hessians, similarity matrices based on gradients, and gradient lengths to estimate how informative data is for a model's training. Are these different approaches connected, and if so, how? We revisit the fundamentals of Bayesian optimal experiment design and show that these recently proposed methods can be understood as approximations to information-theoretic quantities: among them, the mutual information between predictions and model parameters, known as expected information gain or BALD in machine learning, and the mutual information between predictions of acquisition candidates and test samples, known as expected predictive information gain. We develop a comprehensive set of approximations using Fisher information and observed information and derive a unified framework that connects seemingly disparate literature. Although Bayesian methods are often seen as separate from non-Bayesian ones, the sometimes fuzzy notion of "informativeness" expressed in various non-Bayesian objectives leads to the same couple of information quantities, which were, in principle, already known by Lindley (1956) and MacKay (1992).
Submitted 6 November, 2022; v1 submitted 31 July, 2022;
originally announced August 2022.
-
Plex: Towards Reliability using Pretrained Large Model Extensions
Authors:
Dustin Tran,
Jeremiah Liu,
Michael W. Dusenberry,
Du Phan,
Mark Collier,
Jie Ren,
Kehang Han,
Zi Wang,
Zelda Mariet,
Huiyi Hu,
Neil Band,
Tim G. J. Rudner,
Karan Singhal,
Zachary Nado,
Joost van Amersfoort,
Andreas Kirsch,
Rodolphe Jenatton,
Nithum Thain,
Honglin Yuan,
Kelly Buchanan,
Kevin Murphy,
D. Sculley,
Yarin Gal,
Zoubin Ghahramani,
Jasper Snoek,
et al. (1 additional author not shown)
Abstract:
A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also puzzling failures. Probing these models' abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 40 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol as it improves the out-of-the-box performance and does not require designing scores or tuning the model for each task. We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples. We also demonstrate Plex's capabilities on challenging tasks including zero-shot open set recognition, active learning, and uncertainty in conversational language understanding.
Submitted 15 July, 2022;
originally announced July 2022.
-
Out-of-Distribution Detection for Long-tailed and Fine-grained Skin Lesion Images
Authors:
Deval Mehta,
Yaniv Gal,
Adrian Bowling,
Paul Bonnington,
Zongyuan Ge
Abstract:
Recent years have witnessed a rapid development of automated methods for skin lesion diagnosis and classification. Due to an increasing deployment of such systems in clinics, it has become important to develop a more robust system towards various Out-of-Distribution (OOD) samples (unknown skin lesions and conditions). However, the current deep learning models trained for skin lesion classification tend to classify these OOD samples incorrectly into one of their learned skin lesion categories. To address this issue, we propose a simple yet strategic approach that improves the OOD detection performance while maintaining the multi-class classification accuracy for the known categories of skin lesion. Specifically, this approach is built upon a realistic scenario of a long-tailed and fine-grained OOD detection task for skin lesion images. In this approach, we first apply mixup among middle and tail classes to address the long-tail problem, and then combine this mixup strategy with prototype learning to address the fine-grained nature of the dataset. The unique contribution of this paper is two-fold, justified by extensive experiments. First, we present a realistic problem setting of an OOD task for skin lesions. Second, we propose an approach to target the long-tailed and fine-grained aspects of the problem setting simultaneously to increase the OOD performance.
Submitted 30 June, 2022;
originally announced June 2022.
-
Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
Authors:
Sören Mindermann,
Jan Brauner,
Muhammed Razzak,
Mrinank Sharma,
Andreas Kirsch,
Winnie Xu,
Benedikt Höltgen,
Aidan N. Gomez,
Adrien Morisot,
Sebastian Farquhar,
Yarin Gal
Abstract:
Training on web-scale data can take months. But most computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.
Submitted 26 September, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Learning Dynamics and Generalization in Reinforcement Learning
Authors:
Clare Lyle,
Mark Rowland,
Will Dabney,
Marta Kwiatkowska,
Yarin Gal
Abstract:
Solving a reinforcement learning (RL) problem poses two competing challenges: fitting a potentially discontinuous value function, and generalizing well to new observations. In this paper, we analyze the learning dynamics of temporal difference algorithms to gain novel insight into the tension between these two objectives. We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization. We corroborate these findings in deep RL agents trained on a range of environments, finding that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods. Finally, we investigate how post-training policy distillation may avoid this pitfall, and show that this approach improves generalization to novel environments in the ProcGen suite and improves robustness to input perturbations.
Submitted 5 June, 2022;
originally announced June 2022.
-
Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval
Authors:
Pascal Notin,
Mafalda Dias,
Jonathan Frazer,
Javier Marchena-Hurtado,
Aidan Gomez,
Debora S. Marks,
Yarin Gal
Abstract:
The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is, however, contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact that many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym, an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
Submitted 27 May, 2022;
originally announced May 2022.
-
Global geomagnetic perturbation forecasting using Deep Learning
Authors:
Vishal Upendran,
Panagiotis Tigas,
Banafsheh Ferdousi,
Teo Bloch,
Mark C. M. Cheung,
Siddha Ganju,
Asti Bhatt,
Ryan M. McGranaghan,
Yarin Gal
Abstract:
Geomagnetically Induced Currents (GICs) arise from spatio-temporal changes to Earth's magnetic field which arise from the interaction of the solar wind with Earth's magnetosphere, and drive catastrophic destruction to our technologically dependent society. Hence, computational models to forecast GICs globally with large forecast horizon, high spatial resolution and temporal cadence are of increasing importance to perform prompt necessary mitigation. Since GIC data is proprietary, the time variability of horizontal component of the magnetic field perturbation (dB/dt) is used as a proxy for GICs. In this work, we develop a fast, global dB/dt forecasting model, which forecasts 30 minutes into the future using only solar wind measurements as input. The model summarizes 2 hours of solar wind measurement using a Gated Recurrent Unit, and generates forecasts of coefficients which are folded with a spherical harmonic basis to enable global forecasts. When deployed, our model produces results in under a second, and generates global forecasts for horizontal magnetic perturbation components at 1-minute cadence. We evaluate our model across models in literature for two specific storms of 5 August 2011 and 17 March 2015, while having a self-consistent benchmark model set. Our model outperforms, or has consistent performance with state-of-the-practice high time cadence local and low time cadence global models, while also outperforming/having comparable performance with the benchmark models. Such quick inferences at high temporal cadence and arbitrary spatial resolutions may ultimately enable accurate forewarning of dB/dt for any place on Earth, resulting in precautionary measures to be taken in an informed manner.
Submitted 12 May, 2022;
originally announced May 2022.
-
Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling
Authors:
Andreas Kirsch,
Jannik Kossen,
Yarin Gal
Abstract:
Principled Bayesian deep learning (BDL) does not live up to its potential when we only focus on marginal predictive distributions (marginal predictives). Recent works have highlighted the importance of joint predictives for (Bayesian) sequential decision making from a theoretical and synthetic perspective. We provide additional practical arguments grounded in real-world applications for focusing on joint predictives: we discuss online Bayesian inference, which would allow us to make predictions while taking into account additional data without retraining, and we propose new challenging evaluation settings using active learning and active sampling. These settings are motivated by an examination of marginal and joint predictives, their respective cross-entropies, and their place in offline and online learning. They are more realistic than previously suggested ones, building on work by Wen et al. (2021) and Osband et al. (2022), and focus on evaluating the performance of approximate BNNs in an online supervised setting. Initial experiments, however, raise questions on the feasibility of these ideas in high-dimensional parameter spaces with current BDL inference techniques, and we suggest experiments that might help shed further light on the practicality of current research for these problems. Importantly, our work highlights previously unidentified gaps in current research and the need for better approximate joint predictives.
Submitted 18 May, 2022;
originally announced May 2022.
-
Scalable Sensitivity and Uncertainty Analysis for Causal-Effect Estimates of Continuous-Valued Interventions
Authors:
Andrew Jesson,
Alyson Douglas,
Peter Manshausen,
Maëlys Solal,
Nicolai Meinshausen,
Philip Stier,
Yarin Gal,
Uri Shalit
Abstract:
Estimating the effects of continuous-valued interventions from observational data is a critically important task for climate science, healthcare, and economics. Recent work focuses on designing neural network architectures and regularization functions to allow for scalable estimation of average and individual-level dose-response curves from high-dimensional, large-sample data. Such methodologies assume ignorability (observation of all confounding variables) and positivity (observation of all treatment levels for every covariate value describing a set of units), assumptions problematic in the continuous treatment regime. Scalable sensitivity and uncertainty analyses to understand the ignorance induced in causal estimates when these assumptions are relaxed are less studied. Here, we develop a continuous treatment-effect marginal sensitivity model (CMSM) and derive bounds that agree with the observed data and a researcher-defined level of hidden confounding. We introduce a scalable algorithm and uncertainty-aware deep models to derive and estimate these bounds for high-dimensional, large-sample observational data. We work in concert with climate scientists interested in the climatological impacts of human emissions on cloud properties using satellite observations from the past 15 years. This problem is known to be complicated by many unobserved confounders.
Submitted 12 October, 2022; v1 submitted 21 April, 2022;
originally announced April 2022.
-
Interventions, Where and How? Experimental Design for Causal Models at Scale
Authors:
Panagiotis Tigas,
Yashas Annadani,
Andrew Jesson,
Bernhard Schölkopf,
Yarin Gal,
Stefan Bauer
Abstract:
Causal discovery from observational and interventional data is challenging due to limited data and non-identifiability: factors that introduce uncertainty in estimating the underlying structural causal model (SCM). Selecting experiments (interventions) based on the uncertainty arising from both factors can expedite the identification of the SCM. Existing methods in experimental design for causal discovery from limited data either rely on linear assumptions for the SCM or select only the intervention target. This work incorporates recent advances in Bayesian causal discovery into the Bayesian optimal experimental design framework, allowing for active causal discovery of large, nonlinear SCMs while selecting both the interventional target and the value. We demonstrate the performance of the proposed method on synthetic graphs (Erdős-Rényi, Scale Free) for both linear and nonlinear SCMs as well as on the in-silico single-cell gene regulatory network dataset, DREAM.
Submitted 21 October, 2022; v1 submitted 3 March, 2022;
originally announced March 2022.
-
Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients
Authors:
Milad Alizadeh,
Shyam A. Tailor,
Luisa M Zintgraf,
Joost van Amersfoort,
Sebastian Farquhar,
Nicholas Donald Lane,
Yarin Gal
Abstract:
Pruning neural networks at initialization would enable us to find sparse models that retain the accuracy of the original network while consuming fewer computational resources for training and inference. However, current methods are insufficient to enable this optimization and lead to a large degradation in model performance. In this paper, we identify a fundamental limitation in the formulation of current methods, namely that their saliency criteria look at a single step at the start of training without taking into account the trainability of the network. While pruning iteratively and gradually has been shown to improve pruning performance, explicit consideration of the training stage that will immediately follow pruning has so far been absent from the computation of the saliency criterion. To overcome the short-sightedness of existing methods, we propose Prospect Pruning (ProsPr), which uses meta-gradients through the first few steps of optimization to determine which weights to prune. ProsPr combines an estimate of the higher-order effects of pruning on the loss and the optimization trajectory to identify the trainable sub-network. Our method achieves state-of-the-art pruning performance on a variety of vision classification tasks, with less data and in a single shot compared to existing pruning-at-initialization methods.
Submitted 5 April, 2022; v1 submitted 16 February, 2022;
originally announced February 2022.
-
Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation
Authors:
Jannik Kossen,
Sebastian Farquhar,
Yarin Gal,
Tom Rainforth
Abstract:
We propose Active Surrogate Estimators (ASEs), a new method for label-efficient model evaluation. Evaluating model performance is a challenging and important problem when labels are expensive. ASEs address this active testing problem using a surrogate-based estimation approach that interpolates the errors of points with unknown labels, rather than forming a Monte Carlo estimator. ASEs actively learn the underlying surrogate, and we propose a novel acquisition strategy, XWED, that tailors this learning to the final estimation task. We find that ASEs offer greater label-efficiency than the current state-of-the-art when applied to challenging model evaluation problems for deep neural networks.
Submitted 18 October, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
EXCESS workshop: Descriptions of rising low-energy spectra
Authors:
P. Adari,
A. Aguilar-Arevalo,
D. Amidei,
G. Angloher,
E. Armengaud,
C. Augier,
L. Balogh,
S. Banik,
D. Baxter,
C. Beaufort,
G. Beaulieu,
V. Belov,
Y. Ben Gal,
G. Benato,
A. Benoît,
A. Bento,
L. Bergé,
A. Bertolini,
R. Bhattacharyya,
J. Billard,
I. M. Bloch,
A. Botti,
R. Breier,
G. Bres,
J-. L. Bret,
et al. (281 additional authors not shown)
Abstract:
Many low-threshold experiments observe sharply rising event rates of yet unknown origins below a few hundred eV, and larger than expected from known backgrounds. Due to the significant impact of this excess on the dark matter or neutrino sensitivity of these experiments, a collective effort has been started to share the knowledge about the individual observations. For this, the EXCESS Workshop was initiated. In its first iteration in June 2021, ten rare event search collaborations contributed to this initiative via talks and discussions. The contributing collaborations were CONNIE, CRESST, DAMIC, EDELWEISS, MINER, NEWS-G, NUCLEUS, RICOCHET, SENSEI and SuperCDMS. They presented data about their observed energy spectra and known backgrounds together with details about the respective measurements. In this paper, we summarize the presented information and give a comprehensive overview of the similarities and differences between the distinct measurements. The provided data is furthermore publicly available on the workshop's data repository together with a plotting tool for visualization.
Submitted 4 March, 2022; v1 submitted 10 February, 2022;
originally announced February 2022.
-
A Note on "Assessing Generalization of SGD via Disagreement"
Authors:
Andreas Kirsch,
Yarin Gal
Abstract:
Several recent works find empirically that the average test error of deep neural networks can be estimated via the prediction disagreement of models, which does not require labels. In particular, Jiang et al. (2022) show for the disagreement between two separately trained networks that this 'Generalization Disagreement Equality' follows from the well-calibrated nature of deep ensembles under the notion of a proposed 'class-aggregated calibration.' In this reproduction, we show that the suggested theory might be impractical because a deep ensemble's calibration can deteriorate as prediction disagreement increases, which is precisely when the coupling of test error and disagreement is of interest, while labels are needed to estimate the calibration on new datasets. Further, we simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context, unlike the original hypothesis space view employed by Jiang et al. (2022).
Submitted 6 November, 2022; v1 submitted 3 February, 2022;
originally announced February 2022.