Search | arXiv e-print repository

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Authors: Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Lyudmila Rvanova, Sergey Petrakov, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov, Artem Shelmanov

Abstract: Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML). The rapid proliferation of large language models (LLMs) has stimulated researchers to seek efficient and effective approaches to UQ in text generation tasks, as in addition to their emerging capabilities, these models have introduced new challenges for building safe applications. As with other ML models, LLMs are prone to make incorrect predictions, "hallucinate" by fabricating claims, or simply generate low-quality output for a given input. UQ is a key element in dealing with these challenges. However, research to date on UQ methods for LLMs has been fragmented, with disparate evaluation methods. In this work, we tackle this issue by introducing a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques by researchers in various text generation tasks. Our benchmark also supports the assessment of confidence normalization methods in terms of their ability to provide interpretable scores. Using our benchmark, we conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks and shed light on the most promising approaches.

Submitted 21 June, 2024; originally announced June 2024.

Comments: Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev contributed equally

arXiv:2311.07383 [pdf, other]

LM-Polygraph: Uncertainty Estimation for Language Models

Authors: Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov

Abstract: Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.

Submitted 13 November, 2023; originally announced November 2023.

Comments: Accepted at EMNLP-2023

arXiv:1606.09406 [pdf]

doi 10.1088/1742-6596/737/1/012032

Superconducting detector of IR single-photons based on thin WSi films

Authors: V. A. Seleznev, A. Divochiy, Yu. B. Vakhtomin, P. V. Morozov, P. I. Zolotov, D. D. Vasilev, K. M. Moiseev, E. I. Malevannaya, K. V. Smirnov

Abstract: We have developed a deposition technology for WSi thin films 4 to 9 nm thick with high superconducting transition temperatures (Tc ~ 4 K). Nanostructures with characteristic planar sizes of ~100 nm were fabricated from the deposited films, and the research revealed that even at the nanoscale the films retain high superconducting transition temperatures (Tc ~ 3.3-3.7 K), which confirms the high quality and homogeneity of the films. The first experiments on creating superconducting single-photon detectors showed that the system detection efficiency (SDE) of the detectors reaches a constant value of ~30% (at 1550 nm) with increasing bias current (Ib), a value defined by the absorption of infrared radiation by the superconducting structure. To enhance radiation absorption by the superconductor, detectors with cavity structures were created, which demonstrated a practically constant quantum efficiency of >65% for bias currents Ib >= 0.6 Ic. The minimum dark count (DC) level was 1 s^-1, limited by background noise. Hence, WSi is the most promising material for creating single-photon detectors with record SDE/DC ratio and noise equivalent power (NEP).

Submitted 30 June, 2016; originally announced June 2016.

Showing 1–3 of 3 results for author: Vasilev, D