-
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph
Authors:
Roman Vashurin,
Ekaterina Fadeeva,
Artem Vazhentsev,
Akim Tsvigun,
Daniil Vasilev,
Rui Xing,
Abdelrahman Boda Sadallah,
Lyudmila Rvanova,
Sergey Petrakov,
Alexander Panchenko,
Timothy Baldwin,
Preslav Nakov,
Maxim Panov,
Artem Shelmanov
Abstract:
Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML). The rapid proliferation of large language models (LLMs) has stimulated researchers to seek efficient and effective approaches to UQ in text generation tasks, as in addition to their emerging capabilities, these models have introduced new challenges for bui…
▽ More
Uncertainty quantification (UQ) is becoming increasingly recognized as a critical component of applications that rely on machine learning (ML). The rapid proliferation of large language models (LLMs) has stimulated researchers to seek efficient and effective approaches to UQ in text generation tasks, as in addition to their emerging capabilities, these models have introduced new challenges for building safe applications. As with other ML models, LLMs are prone to make incorrect predictions, ``hallucinate'' by fabricating claims, or simply generate low-quality output for a given input. UQ is a key element in dealing with these challenges. However research to date on UQ methods for LLMs has been fragmented, with disparate evaluation methods. In this work, we tackle this issue by introducing a novel benchmark that implements a collection of state-of-the-art UQ baselines, and provides an environment for controllable and consistent evaluation of novel techniques by researchers in various text generation tasks. Our benchmark also supports the assessment of confidence normalization methods in terms of their ability to provide interpretable scores. Using our benchmark, we conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks and shed light on the most promising approaches.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
SmurfCat at SemEval-2024 Task 6: Leveraging Synthetic Data for Hallucination Detection
Authors:
Elisei Rykov,
Yana Shishkina,
Kseniia Petrushina,
Kseniia Titova,
Sergey Petrakov,
Alexander Panchenko
Abstract:
In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and an ensemble approaches utilizing several high-performing models. Through these exploration…
▽ More
In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and an ensemble approaches utilizing several high-performing models. Through these explorations, we introduce three distinct methods that exhibit strong performance metrics. To amplify our training data, we generate additional training samples from unlabelled training subset. Furthermore, we provide a detailed comparative analysis of our approaches. Notably, our premier method achieved a commendable 9th place in the competition's model-agnostic track and 17th place in model-aware track, highlighting its effectiveness and potential.
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
Authors:
Ekaterina Fadeeva,
Aleksandr Rubashevskii,
Artem Shelmanov,
Sergey Petrakov,
Haonan Li,
Hamdy Mubarak,
Evgenii Tsymbalov,
Gleb Kuzmin,
Alexander Panchenko,
Timothy Baldwin,
Preslav Nakov,
Maxim Panov
Abstract:
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide an…
▽ More
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.
△ Less
Submitted 6 June, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
LM-Polygraph: Uncertainty Estimation for Language Models
Authors:
Ekaterina Fadeeva,
Roman Vashurin,
Akim Tsvigun,
Artem Vazhentsev,
Sergey Petrakov,
Kirill Fedyanin,
Daniil Vasilev,
Elizaveta Goncharova,
Alexander Panchenko,
Maxim Panov,
Timothy Baldwin,
Artem Shelmanov
Abstract:
Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, m…
▽ More
Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.
△ Less
Submitted 13 November, 2023;
originally announced November 2023.
-
Robust representations of oil wells' intervals via sparse attention mechanism
Authors:
Alina Ermilova,
Nikita Baramiia,
Valerii Kornilov,
Sergey Petrakov,
Alexey Zaytsev
Abstract:
Transformer-based neural network architectures achieve state-of-the-art results in different domains, from natural language processing (NLP) to computer vision (CV). The key idea of Transformers, the attention mechanism, has already led to significant breakthroughs in many areas. The attention has found their implementation for time series data as well. However, due to the quadratic complexity of…
▽ More
Transformer-based neural network architectures achieve state-of-the-art results in different domains, from natural language processing (NLP) to computer vision (CV). The key idea of Transformers, the attention mechanism, has already led to significant breakthroughs in many areas. The attention has found their implementation for time series data as well. However, due to the quadratic complexity of the attention calculation regarding input sequence length, the application of Transformers is limited by high resource demands. Moreover, their modifications for industrial time series need to be robust to missing or noised values, which complicates the expansion of the horizon of their application. To cope with these issues, we introduce the class of efficient Transformers named Regularized Transformers (Reguformers). We implement the regularization technique inspired by the dropout ideas to improve robustness and reduce computational expenses. The focus in our experiments is on oil&gas data, namely, well logs, a prominent example of multivariate time series. The goal is to solve the problems of similarity and representation learning for them. To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells. The experiments show that all variations of Reguformers outperform the previously developed RNNs, classical Transformer model, and robust modifications of it like Informer and Performer in terms of well-intervals' classification and the quality of the obtained well-intervals' representations. Moreover, the sustainability to missing and incorrect data in our models exceeds that of others by a significant margin. The best result that the Reguformer achieves on well-interval similarity task is the mean PR~AUC score equal to 0.983, which is comparable to the classical Transformer and outperforms the previous models.
△ Less
Submitted 6 November, 2023; v1 submitted 29 December, 2022;
originally announced December 2022.