Search | arXiv e-print repository

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Authors: Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Lyudmila Rvanova, Sergey Petrakov, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov, Artem Shelmanov

Abstract: Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML). The rapid proliferation of large language models (LLMs) has stimulated researchers to seek efficient and effective approaches to UQ in text generation tasks, as in addition to their emerging capabilities, these models have introduced new challenges for building safe applications. As with other ML models, LLMs are prone to make incorrect predictions, "hallucinate" by fabricating claims, or simply generate low-quality output for a given input. UQ is a key element in dealing with these challenges. However, research to date on UQ methods for LLMs has been fragmented, with disparate evaluation methods. In this work, we tackle this issue by introducing a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques by researchers in various text generation tasks. Our benchmark also supports the assessment of confidence normalization methods in terms of their ability to provide interpretable scores. Using our benchmark, we conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks and shed light on the most promising approaches.

Submitted 21 June, 2024; originally announced June 2024.

Comments: Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev contributed equally

arXiv:2403.04696

Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov

Abstract: Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.

Submitted 6 June, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Accepted to ACL-2024 (Findings). Ekaterina Fadeeva, Aleksandr Rubashevskii, and Artem Shelmanov contributed equally

arXiv:2311.07383

LM-Polygraph: Uncertainty Estimation for Language Models

Authors: Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov

Abstract: Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.

Submitted 13 November, 2023; originally announced November 2023.

Comments: Accepted at EMNLP-2023

arXiv:2103.06974

doi 10.1016/j.optlastec.2021.107207

Picosecond Laser Ablation of Millimeter-Wave Subwavelength Structures on Alumina and Sapphire

Authors: Qi Wen, Elena Fadeeva, Shaul Hanany, Jürgen Koch, Tomotake Matsumura, Ryota Takaku, Karl Young

Abstract: We use a 1030 nm laser with 7 ps pulse duration and average power up to 100 W to ablate pyramid-shape subwavelength structures (SWS) on alumina and sapphire. The SWS give an effective and cryogenically robust anti-reflection coating in the millimeter-wave band. We demonstrate average ablation rate of up to 34 mm$^3$/min and 20 mm$^3$/min for structure heights of 900 $μ$m and 750 $μ$m on alumina and sapphire, respectively. These rates are a factor of 34 and 9 higher than reported previously on similar structures. We propose a model that relates structure height to cumulative laser fluence. The model depends on the absorption length $δ$, which is assumed to depend on peak fluence, and on the threshold fluence $φ_{th}$. Using a best-fit procedure we find an average $δ= 630$ nm and 650 nm, and $φ_{th} = 2.0^{+0.5}_{-0.5}$ J/cm$^2$ and $2.3^{+0.1}_{-0.1}$ J/cm$^2$ for alumina and sapphire, respectively, for peak fluence values between 30 and 70 J/cm$^{2}$. With the best fit values, the model and data values for cumulative fluence agree to within 10%. Given inputs for $δ$ and $φ_{th}$ the model is used to predict average ablation rates as a function of SWS height and average laser power.

Submitted 11 March, 2021; originally announced March 2021.

Comments: 15 pages, 11 figures, submitted to Optics & Laser Technology

Journal ref: Optics & Laser Technology, Volume 142, October 2021, 107207

Showing 1–4 of 4 results for author: Fadeeva, E