Search | arXiv e-print repository

KazQAD: Kazakh Open-Domain Question Answering Dataset

Authors: Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski

Abstract: We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, an… ▽ More We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389 MRR = 0.382), reading comprehension (EM = 38.5 F1 = 54.2), and full ODQA (EM = 17.8 F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we think that there should still be ample room for improvement. We also show that the current OpenAI's ChatGPTv3.5 is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD. △ Less

Submitted 5 April, 2024; originally announced April 2024.

Comments: To appear in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

arXiv:2310.07008 [pdf, other]

Answer Candidate Type Selection: Text-to-Text Language Model for Closed Book Question Answering Meets Knowledge Graphs

Authors: Mikhail Salnikov, Maria Lysyuk, Pavel Braslavski, Anton Razzhigaev, Valentin Malykh, Alexander Panchenko

Abstract: Pre-trained Text-to-Text Language Models (LMs), such as T5 or BART yield promising results in the Knowledge Graph Question Answering (KGQA) task. However, the capacity of the models is limited and the quality decreases for questions with less popular entities. In this paper, we present a novel approach which works on top of the pre-trained Text-to-Text QA system to address this issue. Our simple y… ▽ More Pre-trained Text-to-Text Language Models (LMs), such as T5 or BART yield promising results in the Knowledge Graph Question Answering (KGQA) task. However, the capacity of the models is limited and the quality decreases for questions with less popular entities. In this paper, we present a novel approach which works on top of the pre-trained Text-to-Text QA system to address this issue. Our simple yet effective method performs filtering and re-ranking of generated candidates based on their types derived from Wikidata "instance_of" property. △ Less

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2310.02166 [pdf, other]

Large Language Models Meet Knowledge Graphs to Answer Factoid Questions

Authors: Mikhail Salnikov, Hai Le, Prateek Rajput, Irina Nikishina, Pavel Braslavski, Valentin Malykh, Alexander Panchenko

Abstract: Recently, it has been shown that the incorporation of structured knowledge into Large Language Models significantly improves the results for a variety of NLP tasks. In this paper, we propose a method for exploring pre-trained Text-to-Text Language Models enriched with additional information from Knowledge Graphs for answering factoid questions. More specifically, we propose an algorithm for subgra… ▽ More Recently, it has been shown that the incorporation of structured knowledge into Large Language Models significantly improves the results for a variety of NLP tasks. In this paper, we propose a method for exploring pre-trained Text-to-Text Language Models enriched with additional information from Knowledge Graphs for answering factoid questions. More specifically, we propose an algorithm for subgraphs extraction from a Knowledge Graph based on question entities and answer candidates. Then, we procure easily interpreted information with Transformer-based models through the linearization of the extracted subgraphs. Final re-ranking of the answer candidates with the extracted information boosts Hits@1 scores of the pre-trained text-to-text language models by 4-6%. △ Less

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2210.11913 [pdf, ps, other]

doi 10.1093/bioinformatics/btad161

NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

Authors: Natalia Loukachevitch, Suresh Manandhar, Elina Baral, Igor Rozhkov, Pavel Braslavski, Vladimir Ivanov, Tatiana Batura, Elena Tutubalina

Abstract: This paper describes NEREL-BIO -- an annotation scheme and corpus of PubMed abstracts in Russian and smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. NEREL-BIO annotation scheme covers both general and biomedical domains making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested… ▽ More This paper describes NEREL-BIO -- an annotation scheme and corpus of PubMed abstracts in Russian and smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. NEREL-BIO annotation scheme covers both general and biomedical domains making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO comprises the following specific features: annotation of nested named entities, it can be used as a benchmark for cross-domain (NEREL -> NEREL-BIO) and cross-language (English -> Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension (MRC) models and report their results. The dataset is freely available at https://github.com/nerel-ds/NEREL-BIO. △ Less

Submitted 21 October, 2022; originally announced October 2022.

Comments: Submitted to Bioinformatics (Publisher: Oxford University Press)

Journal ref: Bioinformatics, Volume 39, Issue 4, April 2023, btad161

arXiv:2204.06457 [pdf, other]

doi 10.1007/978-3-031-28241-6_4

The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer

Authors: Pavel Efimov, Leonid Boytsov, Elena Arslanova, Pavel Braslavski

Abstract: Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a small parallel corpus to make embeddings of related words across languages similar to each other. They showed it to be effective in NLI for five European langu… ▽ More Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a small parallel corpus to make embeddings of related words across languages similar to each other. They showed it to be effective in NLI for five European languages. In contrast we experiment with a typologically diverse set of languages (Spanish, Russian, Vietnamese, and Hindi) and extend their original implementations to new tasks (XSR, NER, and QA) and an additional training regime (continual learning). Our study reproduced gains in NLI for four languages, showed improved NER, XSR, and cross-lingual QA results in three languages (though some cross-lingual QA gains were not statistically significant), while mono-lingual QA performance never improved and sometimes degraded. Analysis of distances between contextualized embeddings of related and unrelated words (across languages) showed that fine-tuning leads to "forgetting" some of the cross-lingual alignment information. Based on this observation, we further improved NLI performance using continual learning. △ Less

Submitted 31 October, 2023; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: Presented at ECIR 2023

arXiv:2108.13112 [pdf, other]

NEREL: A Russian Dataset with Nested Named Entities, Relations and Events

Authors: Natalia Loukachevitch, Ekaterina Artemova, Tatiana Batura, Pavel Braslavski, Ilia Denisov, Vladimir Ivanov, Suresh Manandhar, Alexander Pugachev, Elena Tutubalina

Abstract: In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse le… ▽ More In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse level. NEREL can facilitate development of novel models that can extract relations between nested named entities, as well as relations on both sentence and document levels. NEREL also contains the annotation of events involving named entities and their roles in the events. The NEREL collection is available via https://github.com/nerel-ds/NEREL. △ Less

Submitted 3 September, 2021; v1 submitted 30 August, 2021; originally announced August 2021.

Comments: accepted to RANLP

arXiv:2103.03335 [pdf, other]

doi 10.1145/3404835.3463093

A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models

Authors: Iurii Mokrii, Leonid Boytsov, Pavel Braslavski

Abstract: Due to high annotation costs making the best use of existing human-created training data is an important research direction. We, therefore, carry out a systematic evaluation of transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In co… ▽ More Due to high annotation costs making the best use of existing human-created training data is an important research direction. We, therefore, carry out a systematic evaluation of transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves reliability of our results. Furthermore, since source datasets licences often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels -- possibly with subsequent fine-tuning using a modest number of annotated queries -- can produce a competitive or better model compared to transfer learning. Yet, it is necessary to improve the stability and/or effectiveness of the few-shot training, which, sometimes, can degrade performance of a pretrained model. △ Less

Submitted 21 November, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

Journal ref: SIGIR 2021 (44th International ACM SIGIR Conference on Research and Development in Information Retrieval)

arXiv:2005.10659 [pdf, other]

doi 10.1007/978-3-030-62466-8_7

RuBQ: A Russian Dataset for Question Answering over Wikidata

Authors: Vladislav Korablinov, Pavel Braslavski

Abstract: The paper presents RuBQ, the first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of que… ▽ More The paper presents RuBQ, the first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification. △ Less

Submitted 21 May, 2020; originally announced May 2020.

arXiv:1912.09723 [pdf, other]

doi 10.1007/978-3-030-58219-7_1

SberQuAD -- Russian Reading Comprehension Dataset: Description and Analysis

Authors: Pavel Efimov, Andrey Chertok, Leonid Boytsov, Pavel Braslavski

Abstract: SberQuAD -- a large scale analog of Stanford SQuAD in the Russian language - is a valuable resource that has not been properly presented to the scientific community. We fill this gap by providing a description, a thorough analysis, and baseline experimental results. SberQuAD -- a large scale analog of Stanford SQuAD in the Russian language - is a valuable resource that has not been properly presented to the scientific community. We fill this gap by providing a description, a thorough analysis, and baseline experimental results. △ Less

Submitted 2 May, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

arXiv:1811.05361 [pdf, other]

doi 10.1007/978-3-030-00063-9_9

Personal Names Popularity Estimation and its Application to Record Linkage

Authors: Ksenia Zhagorina, Pavel Braslavski, Vladimir Gusev

Abstract: This study deals with a fairly simply formulated problem -- how to estimate the number of people bearing the same full name in a large population. Estimation of name popularity can leverage personal name matching in databases and be of interest for many other domains. A distinctive feature of large collections of names is that they contain a large number of unique items, which is challenging for s… ▽ More This study deals with a fairly simply formulated problem -- how to estimate the number of people bearing the same full name in a large population. Estimation of name popularity can leverage personal name matching in databases and be of interest for many other domains. A distinctive feature of large collections of names is that they contain a large number of unique items, which is challenging for statistical modeling. We investigate a number of statistical techniques and also propose a simple yet effective method aimed at obtaining more accurate count estimates. In our experiments we use a dataset containing about 20 million name occurrences that correspond to about 13 million real-world persons. We perform a thorough evaluation of the name count estimation methods and a record linkage experiment guided by name popularity estimates. Obtained results suggest that theoretically informed approaches outperform simple heuristics and can be useful in a variety of applications. △ Less

Submitted 13 November, 2018; originally announced November 2018.

Comments: This is an extended version of a short paper presented at ADBIS2018

Showing 1–10 of 10 results for author: Braslavski, P