Search | arXiv e-print repository

Fake News Detection: It's All in the Data!

Authors: Soveatin Kuntur, Anna Wróblewska, Marcin Paprzycki, Maria Ganzha

Abstract: This comprehensive survey serves as an indispensable resource for researchers embarking on the journey of fake news detection. By highlighting the pivotal role of dataset quality and diversity, it underscores the significance of these elements in the effectiveness and robustness of detection models. The survey meticulously outlines the key features of datasets, various labeling systems employed, and prevalent biases that can impact model performance. Additionally, it addresses critical ethical issues and best practices, offering a thorough overview of the current state of available datasets. Our contribution to this field is further enriched by the provision of a GitHub repository, which consolidates publicly accessible datasets into a single, user-friendly portal. This repository is designed to facilitate and stimulate further research and development efforts aimed at combating the pervasive issue of fake news.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2406.16489 [pdf, other]

Deepfake tweets automatic detection

Authors: Adam Frej, Adrian Kaminski, Piotr Marciniak, Szymon Szmajdzinski, Soveatin Kuntur, Anna Wroblewska

Abstract: This study addresses the critical challenge of detecting DeepFake tweets by leveraging advanced natural language processing (NLP) techniques to distinguish between genuine and AI-generated texts. Given the increasing prevalence of misinformation, our research utilizes the TweepFake dataset to train and evaluate various machine learning models. The objective is to identify effective strategies for recognizing DeepFake content, thereby enhancing the integrity of digital communications. By developing reliable methods for detecting AI-generated misinformation, this work contributes to a more trustworthy online information environment.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.14266 [pdf, other]

Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries

Authors: Anna Wróblewska, Marcel Witas, Kinga Frańczak, Arkadiusz Kniaź, Siew Ann Cheong, Tan Seng Chee, Janusz Hołyst, Marcin Paprzycki

Abstract: Recently, multiple applications of machine learning have been introduced. They include various possibilities arising when image analysis methods are applied to, broadly understood, video streams. In this context, a novel tool has been developed for academic educators to enhance the teaching process by automating, summarizing, and offering prompt feedback on conducting lectures. The implemented prototype utilizes machine learning-based techniques to recognise selected didactic and behavioural teachers' features within lecture video recordings. Specifically, users (teachers) can upload their lecture videos, which are preprocessed and analysed using machine learning models. Next, users can view summaries of recognized didactic features through interactive charts and tables. Additionally, stored ML-based prediction results support comparisons between lectures based on their didactic content. The developed application applies text-based models trained on lecture transcriptions, with transcription quality enhanced by adopting an automatic speech recognition solution. Furthermore, the system offers flexibility for (future) integration of new/additional machine-learning models and software modules for image and video analysis.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 9 pages, 6 figures

arXiv:2406.13553 [pdf, other]

Mining United Nations General Assembly Debates

Authors: Mateusz Grzyb, Mateusz Krzyziński, Bartłomiej Sobieski, Mikołaj Spytek, Bartosz Pieliński, Daniel Dan, Anna Wróblewska

Abstract: This project explores the application of Natural Language Processing (NLP) techniques to analyse United Nations General Assembly (UNGA) speeches. Using NLP allows for the efficient processing and analysis of large volumes of textual data, enabling the extraction of semantic patterns, sentiment analysis, and topic modelling. Our goal is to deliver a comprehensive dataset and a tool (interface with descriptive statistics and automatically extracted topics) from which political scientists can derive insights into international relations and have the opportunity to have a nuanced understanding of global diplomatic discourse.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 4 pages, 1 figure, 2 tables

arXiv:2404.05450 [pdf, other]

Raman scattering by carbon nanotubes coupled to quantum dots via dipolar excitonic interaction

Authors: Anna Wroblewska, Niclas S. Mueller, Mariusz Zdrojek, Stephanie Reich, Georgy Gordeev

Abstract: The dipole-dipole interactions between excitons are of paramount importance in nanoscale structures. When two excitons are placed together, they can exchange energy, which can manifest in the resonant Raman cross sections. We provide a theoretical framework for such effects by combining the coupled oscillator model and perturbation theory. We apply this theory to a hybrid film comprising semiconducting quantum dots and metallic carbon nanotubes. The quantum dot exciton has a fixed energy, while the nanotube resonances span a larger range from 1.7 to 1.93 eV. We acquire the resonant Raman profiles of the pristine nanotubes and hybrids and find a relative shift between them. The shift direction depends on the relative energies of the CNT and QD excitons, as predicted by our theory.

Submitted 8 April, 2024; originally announced April 2024.

Comments: 19 Pages, 7 Figures

arXiv:2403.04507 [pdf, other]

NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems

Authors: Martyna Wiącek, Piotr Rybak, Łukasz Pszenny, Alina Wróblewska

Abstract: With the advancements of transformer-based architectures, we observe the rise of natural language preprocessing (NLPre) tools capable of solving preliminary NLP tasks (e.g. tokenisation, part-of-speech tagging, dependency parsing, or morphological analysis) without any external linguistic guidance. It is arduous to compare novel solutions to well-entrenched preprocessing toolkits, relying on rule-based morphological analysers or dictionaries. Aware of the shortcomings of existing NLPre evaluation approaches, we investigate a novel method of reliable and fair evaluation and performance reporting. Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools, while credibly tracking their performance. The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark. Based on this benchmark, we conduct an extensive evaluation of a variety of Polish NLPre systems. To facilitate the construction of benchmarking environments for other languages, e.g. NLPre-GA for Irish or NLPre-ZH for Chinese, we ensure full customization of the publicly released source code of the benchmarking system. The links to all the resources (deployed platforms, source code, trained models, datasets etc.) can be found on the project website: https://sites.google.com/view/nlpre-benchmark.

Submitted 27 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Accepted at LREC-COLING 2024

arXiv:2402.00163 [pdf, other]

Improving Object Detection Quality in Football Through Super-Resolution Techniques

Authors: Karolina Seweryn, Gabriel Chęć, Szymon Łukasik, Anna Wróblewska

Abstract: This study explores the potential of super-resolution techniques in enhancing object detection accuracy in football. Given the sport's fast-paced nature and the critical importance of precise object (e.g. ball, player) tracking for both analysis and broadcasting, super-resolution could offer significant improvements. We investigate how advanced image processing through super-resolution impacts the accuracy and reliability of object detection algorithms in processing football match footage. Our methodology involved applying state-of-the-art super-resolution techniques to a diverse set of football match videos from SoccerNet, followed by object detection using Faster R-CNN. The performance of these algorithms, both with and without super-resolution enhancement, was rigorously evaluated in terms of detection accuracy. The results indicate a marked improvement in object detection accuracy when super-resolution preprocessing is applied. The improvement of object detection through the integration of super-resolution techniques yields significant benefits, especially for low-resolution scenarios, with a notable 12% increase in mean Average Precision (mAP) at an IoU (Intersection over Union) range of 0.50:0.95 for 320x240 size images when increasing the resolution fourfold using RLFN. As the dimensions increase, the magnitude of improvement becomes more subdued; however, a discernible improvement in the quality of detection is consistently evident.
Additionally, we discuss the implications of these findings for real-time sports analytics, player tracking, and the overall viewing experience. The study contributes to the growing field of sports technology by demonstrating the practical benefits and limitations of integrating super-resolution techniques in football analytics and broadcasting.

Submitted 31 January, 2024; originally announced February 2024.

arXiv:2309.12067 [pdf, ps, other]

Survey of Action Recognition, Spotting and Spatio-Temporal Localization in Soccer -- Current Trends and Research Perspectives

Authors: Karolina Seweryn, Anna Wróblewska, Szymon Łukasik

Abstract: Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2305.16750 [pdf, other]

doi 10.1007/978-3-031-36024-4_5

Automating the Analysis of Institutional Design in International Agreements

Authors: Anna Wróblewska, Bartosz Pieliński, Karolina Seweryn, Sylwia Sysko-Romańczuk, Karol Saputa, Aleksandra Wichrowska, Hanna Schreiber

Abstract: This paper explores the automatic knowledge extraction of formal institutional design - norms, rules, and actors - from international agreements. The focus was to analyze the relationship between the visibility and centrality of actors in the formal institutional design in regulating critical aspects of cultural heritage relations. The developed tool utilizes techniques such as collecting legal documents, annotating them with Institutional Grammar, and using graph analysis to explore the formal institutional design. The system was tested against the 2003 UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage.

Submitted 26 May, 2023; originally announced May 2023.

Comments: 11 pages, 8 figures, accepted to ICCS 2023. arXiv admin note: substantial text overlap with arXiv:2209.00944

arXiv:2305.11070 [pdf, other]

Enriching language models with graph-based context information to better understand textual data

Authors: Albert Roethel, Maria Ganzha, Anna Wróblewska

Abstract: A considerable number of texts encountered daily are somehow connected with each other. For example, Wikipedia articles refer to other articles via hyperlinks, scientific papers relate to others via citations or (co)authors, while tweets relate via users that follow each other or reshare content. Hence, a graph-like structure can represent existing connections and be seen as capturing the "context" of the texts. The question thus arises if extracting and integrating such context information into a language model might help facilitate a better automated understanding of the text. In this study, we experimentally demonstrate that incorporating graph-based contextualization into a BERT model enhances its performance on an example of a classification task. Specifically, on the Pubmed dataset, we observed a reduction in error from 8.51% to 7.96%, while increasing the number of parameters just by 1.6%. Our source code: https://github.com/tryptofanik/gc-bert

Submitted 10 May, 2023; originally announced May 2023.

Comments: 12 pages, 5 figures

arXiv:2211.15202 [pdf, other]

Revisiting Distance Metric Learning for Few-Shot Natural Language Classification

Authors: Witold Sosnowski, Anna Wróblewska, Karolina Seweryn, Piotr Gawrysiak

Abstract: Distance Metric Learning (DML) has attracted much attention in image processing in recent years. This paper analyzes its impact on supervised fine-tuning language models for Natural Language Processing (NLP) classification tasks under few-shot learning settings. We investigated several DML loss functions in training RoBERTa language models on known SentEval Transfer Tasks datasets. We also analyzed the possibility of using proxy-based DML losses during model inference. Our systematic experiments have shown that under few-shot learning settings, particularly proxy-based DML losses can positively affect the fine-tuning and inference of a supervised language model. Models tuned with a combination of CCE (categorical cross-entropy loss) and ProxyAnchor Loss have, on average, the best performance and outperform models with only CCE by about 3.27 percentage points -- up to 10.38 percentage points depending on the training dataset.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.15195 [pdf, other]

Distance Metric Learning Loss Functions in Few-Shot Scenarios of Supervised Language Models Fine-Tuning

Authors: Witold Sosnowski, Karolina Seweryn, Anna Wróblewska, Piotr Gawrysiak

Abstract: This paper presents an analysis of the influence of Distance Metric Learning (DML) loss functions on the supervised fine-tuning of language models for classification tasks. We experimented with known datasets from SentEval Transfer Tasks. Our experiments show that applying the DML loss function can increase performance on downstream classification tasks of RoBERTa-large models in few-shot scenarios. Models fine-tuned with the use of SoftTriple loss can achieve better results than models with a standard categorical cross-entropy loss function by about 2.89 percentage points (from 0.04 to 13.48 percentage points, depending on the training dataset). Additionally, we accomplished a comprehensive analysis with explainability techniques to assess the models' reliability and explain their results.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2209.00944 [pdf, other]

Entity Graph Extraction from Legal Acts -- a Prototype for a Use Case in Policy Design Analysis

Authors: Anna Wróblewska, Bartosz Pieliński, Karolina Seweryn, Karol Saputa, Aleksandra Wichrowska, Sylwia Sysko-Romańczuk, Hanna Schreiber

Abstract: This paper presents research on a prototype developed to serve the quantitative study of public policy design. This sub-discipline of political science focuses on identifying actors, relations between them, and tools at their disposal in health, environmental, economic, and other policies. Our system aims to automate the process of gathering legal documents, annotating them with Institutional Grammar, and using hypergraphs to analyse inter-relations between crucial entities. Our system is tested against the UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage from 2003, a legal document regulating essential aspects of international relations securing cultural heritage.

Submitted 2 September, 2022; originally announced September 2022.

Comments: 17 pages, 10 figures

Report number: shortened version with more analysis - https://arxiv.longhoe.net/abs/2305.16750

MSC Class: 68U35

arXiv:2208.06262 [pdf, other]

doi 10.1109/ijcnn55064.2022.9892361

Identifying Substitute and Complementary Products for Assortment Optimization with Cleora Embeddings

Authors: Sergiy Tkachuk, Anna Wróblewska, Jacek Dąbrowski, Szymon Łukasik

Abstract: Recent years brought an increasing interest in the application of machine learning algorithms in e-commerce, omnichannel marketing, and the sales industry. It is due not only to algorithmic advances but also to data availability, representing transactions, users, and background product information. Finding products related in different ways, i.e., substitutes and complements, is essential for users' recommendations at the vendor's site and for the vendor - to perform efficient assortment optimization. The paper introduces a novel method for finding products' substitutes and complements based on the graph embedding Cleora algorithm. We also provide its experimental evaluation with regards to the state-of-the-art Shopper algorithm, studying the relevance of recommendations with surveys from industry experts. It is concluded that the new approach presented here offers suitable choices of recommended products, requiring a minimal amount of additional information. The algorithm can be used in various enterprises, effectively identifying substitute and complementary product options.

Submitted 10 August, 2022; originally announced August 2022.

Comments: 10 pages, 1 figure

Journal ref: revised version: International Joint Conference on Neural Networks (IJCNN 2022)

arXiv:2206.06367 [pdf, other]

Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis

Authors: Maciej Pawłowski, Anna Wróblewska, Sylwia Sysko-Romańczuk

Abstract: Creating a meaningful representation by fusing single modalities (e.g., text, images, or audio) is the core concept of multimodal learning. Although several techniques for building multimodal representations have been proven successful, they have not been compared yet. Therefore it has been ambiguous which technique can be expected to yield the best results in a given scenario and what factors should be considered while choosing such a technique. This paper explores the most common techniques for building multimodal data representations -- the late fusion, the early fusion, and the sketch, and compares them in classification tasks. Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M datasets. In general, our results confirm that multimodal representations are able to boost the performance of unimodal models from 0.919 to 0.969 of accuracy on Amazon Reviews and 0.907 to 0.918 of AUC on MovieLens25M. However, experiments on both MovieLens datasets indicate the importance of the meaningful input data to the given task. In this article, we show that the choice of the technique for building multimodal representation is crucial to obtain the highest possible model's performance, that comes with the proper modalities combination. Such choice relies on: the influence that each modality has on the analyzed machine learning (ML) problem; the type of the ML task; the memory constraints while training and predicting phase.

Submitted 9 June, 2022; originally announced June 2022.

arXiv:2205.15712 [pdf, other]

doi 10.1109/fuzz-ieee55066.2022.9882843

Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

Authors: Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik

Abstract: Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features both in English and Polish languages. We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - training dataset and gold standard for large-scale product matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases, the results were even better. Additionally, we prepared a new dataset entirely in Polish and based on offers in selected categories obtained from several online stores for the research purpose. It is the first open dataset for product matching tasks in Polish, which allows comparing the effectiveness of the pre-trained models. Thus, we also showed the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.

Submitted 1 June, 2022; v1 submitted 31 May, 2022; originally announced May 2022.

Comments: 11 pages, 5 figures

Journal ref: revised version: 2022 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) 2022

arXiv:2205.14919 [pdf, other]

doi 10.1007/978-3-031-11644-5_70

A Deep Learning Approach for Automatic Detection of Qualitative Features of Lecturing

Authors: Anna Wroblewska, Jozef Jasek, Bogdan Jastrzebski, Stanislaw Pawlak, Anna Grzywacz, Cheong Siew Ann, Tan Seng Chee, Tomasz Trzcinski, Janusz Holyst

Abstract: Artificial Intelligence in higher education opens new possibilities for improving the lecturing process, such as enriching didactic materials, helping in assessing students' works or even providing directions to the teachers on how to enhance the lectures. We follow this research path, and in this work, we explore how an academic lecture can be assessed automatically by quantitative features. First, we prepare a set of qualitative features based on teaching practices and then annotate the dataset of academic lecture videos collected for this purpose. We then show how these features could be detected automatically using machine learning and computer vision techniques. Our results show the potential usefulness of our work.

Submitted 30 May, 2022; originally announced May 2022.

Comments: 10 pages, 9 figures

Journal ref: International Conference on Artificial Intelligence in Education, AIED 2022

arXiv:2204.07775 [pdf, other]

TASTEset -- Recipe Dataset and Food Entities Recognition Benchmark

Authors: Ania Wróblewska, Agnieszka Kaliska, Maciej Pawłowski, Dawid Wiśniewski, Witold Sosnowski, Agnieszka Ławrynowicz

Abstract: Food Computing is currently a fast-growing field of research. Natural language processing (NLP) is also increasingly essential in this field, especially for recognising food entities. However, there are still only a few well-defined tasks that serve as benchmarks for solutions in this area. We introduce a new dataset -- called TASTEset -- to bridge this gap. In this dataset, Named Entity Recognition (NER) models are expected to find or infer various types of entities helpful in processing recipes, e.g. food products, quantities and their units, names of cooking processes, physical quality of ingredients, their purpose, taste. The dataset consists of 700 recipes with more than 13,000 entities to extract. We provide a few state-of-the-art baselines of named entity recognition models, which show that our dataset poses a solid challenge to existing models. The best model achieved, on average, 0.95 F1 score, depending on the entity type -- from 0.781 to 0.982. We share the dataset and the task to encourage progress on more in-depth and complex information extraction from recipes.

Submitted 16 April, 2022; originally announced April 2022.

arXiv:2203.06746 [pdf, other]

ProtagonistTagger -- a Tool for Entity Linkage of Persons in Texts from Various Languages and Domains

Authors: Weronika Lajewska, Anna Wroblewska

Abstract: Named entity recognition (NER) and disambiguation (NED) can add semantic context to the recognized named entities in texts. Named entity linkage in texts, regardless of domain, provides links between the entities mentioned in unstructured texts and individual instances of real-world objects. In this poster, we present a tool, protagonistTagger, for person NER and NED in texts. The tool was tested on texts extracted from classic English novels and Polish Internet news. The tool's performance (both precision and recall) ranges from 78% to 88%.

Submitted 13 March, 2022; originally announced March 2022.

arXiv:2203.04831 [pdf, other]

Automatic Language Identification for Celtic Texts

Authors: Olha Dovbnia, Anna Wróblewska

Abstract: Language identification is an important Natural Language Processing task. It has been thoroughly researched in the literature. However, some issues are still open. This work addresses the identification of related low-resource languages, using the Celtic language family as an example. The main goals of this work were: (1) to collect a dataset of three Celtic languages; (2) to prepare a method to identify the languages from the Celtic family, i.e. to train a successful classification model; (3) to evaluate the influence of different feature extraction methods, and explore the applicability of unsupervised models as a feature extraction technique; (4) to experiment with unsupervised feature extraction on a reduced annotated set. We collected a new dataset including Irish, Scottish, Welsh and English records. We tested supervised models such as SVM and neural networks with traditional statistical features alongside the output of clustering, autoencoder, and topic modelling methods. The analysis showed that the unsupervised features could serve as a valuable extension to the n-gram feature vectors. It led to an improvement in performance for more entangled classes. The best model achieved a 98% F1 score and 97% MCC. The dense neural network consistently outperformed the SVM model. Low-resource languages are also challenging due to the scarcity of available annotated training data. To handle this issue, this work evaluated the performance of the classifiers using unsupervised feature extraction on the reduced labelled dataset. The results showed that the unsupervised feature vectors are more robust to the labelled set reduction. Therefore, they help achieve comparable classification performance with much less labelled data.

Submitted 9 March, 2022; originally announced March 2022.

Comments: 14 pages, 6 figures

arXiv:2201.03521 [pdf, other]

doi 10.1017/S1351324923000220

Polish Natural Language Inference and Factivity -- an Expert-based Dataset and Benchmarks

Authors: Daniel Ziembicki, Anna Wróblewska, Karolina Seweryn

Abstract: Despite recent breakthroughs in Machine Learning for Natural Language Processing, Natural Language Inference (NLI) problems still constitute a challenge. To this end, we contribute a new dataset that focuses exclusively on the factivity phenomenon; however, our task remains the same as other NLI tasks, i.e. prediction of entailment, contradiction or neutral (ECN). The dataset contains entirely natural language utterances in Polish and gathers 2,432 verb-complement pairs and 309 unique verbs. The dataset is based on the National Corpus of Polish (NKJP) and is a representative sample with regard to the frequency of main verbs and other linguistic features (e.g. occurrence of internal negation). We found that transformer BERT-based models working on sentences obtained relatively good results ($\approx89\%$ F1 score). Even though better results were achieved using linguistic features ($\approx91\%$ F1 score), this model requires more human labour (humans in the loop) because features were prepared manually by expert linguists. BERT-based models consuming only the input sentences show that they capture most of the complexity of NLI/factivity. Complex cases in the phenomenon, e.g. cases with entailment (E) and non-factive verbs, remain an open issue for further research.

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2112.12913 [pdf, other]

Spoiler in a Textstack: How Much Can Transformers Help?

Authors: Anna Wróblewska, Paweł Rzepiński, Sylwia Sysko-Romańczuk

Abstract: This paper presents our research regarding spoiler detection in reviews. In this use case, we describe the method of fine-tuning and organizing the available text-based model tasks with the latest deep learning achievements and techniques to interpret the models' results. Until now, spoiler research has rarely been described in the literature. We tested the transfer learning approach and several recent transformer architectures on two open datasets with annotated spoilers (ROC AUC above 81% on the TV Tropes Movies dataset and above 88% on the Goodreads dataset). We also collected data and assembled a new dataset with fine-grained annotations. To that end, we employed interpretability techniques and measures to assess the models' reliability and explain their results.

Submitted 23 December, 2021; originally announced December 2021.

MSC Class: 68T50; 68T07 ACM Class: I.2.7

arXiv:2112.08462 [pdf, other]

doi 10.15439/2022F185

Applying SoftTriple Loss for Supervised Language Model Fine Tuning

Authors: Witold Sosnowski, Anna Wroblewska, Piotr Gawrysiak

Abstract: We introduce a new loss function, TripleEntropy, based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general-knowledge pre-trained language models. This loss function can improve the robust RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02% to 2.29%. Thorough tests on popular datasets indicate a steady gain. The fewer samples in the training dataset, the higher the gain: for small datasets it is 0.78%, for medium-sized 0.86%, for large 0.20%, and for extra-large 0.04%.

Submitted 15 December, 2021; originally announced December 2021.

Journal ref: 17th Conference on Computer Science and Intelligence Systems 2022. Series: ACSIS Annals of Computer Science and Information Systems

arXiv:2110.01349 [pdf, other]

Protagonists' Tagger in Literary Domain -- New Datasets and a Method for Person Entity Linkage

Authors: Weronika Łajewska, Anna Wróblewska

Abstract: Semantic annotation of long texts, such as novels, remains an open challenge in Natural Language Processing (NLP). This research investigates the problem of detecting person entities and assigning them unique identities, i.e., recognizing people (especially main characters) in novels. We prepared a method for person entity linkage (named entity recognition and disambiguation) and new testing datasets. The datasets comprise 1,300 sentences from 13 classic novels of different genres that a novel reader had manually annotated. Our process of identifying literary characters in a text, implemented in protagonistTagger, comprises two stages: (1) named entity recognition (NER) of persons, (2) named entity disambiguation (NED) - matching each recognized person with the literary character's full name, based on approximate text matching. The protagonistTagger achieves both precision and recall above 83% on the prepared testing sets. Finally, we gathered a corpus of 13 full-text novels tagged with protagonistTagger that comprises more than 35,000 mentions of literary characters.

Submitted 4 October, 2021; originally announced October 2021.

arXiv:2109.05361 [pdf, other]

COMBO: State-of-the-Art Morphosyntactic Analysis

Authors: Mateusz Klimaszewski, Alina Wróblewska

Abstract: We introduce COMBO - a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features while also exposing their vector representations, extracted from hidden layers. COMBO is an easy-to-install Python package with automatically downloadable pre-trained models for over 40 languages. It maintains a balance between efficiency and quality. As it is an end-to-end system and its modules are jointly trained, its training is competitively fast. As its models are optimised for accuracy, they often achieve better prediction quality than SOTA. The COMBO library is available at: https://gitlab.clarin-pl.eu/syntactic-tools/combo.

Submitted 11 September, 2021; originally announced September 2021.

Comments: Accepted at EMNLP 2021 Demonstrations Program

arXiv:2107.03809 [pdf, other]

COMBO: a new module for EUD parsing

Authors: Mateusz Klimaszewski, Alina Wróblewska

Abstract: We introduce the COMBO-based approach for EUD parsing and its implementation, which took part in the IWPT 2021 EUD shared task. The goal of this task is to parse raw texts in 17 languages into Enhanced Universal Dependencies (EUD). The proposed approach uses COMBO to predict UD trees and EUD graphs. These structures are then merged into the final EUD graphs. Some EUD edge labels are extended with case information using a single language-independent expansion rule. In the official evaluation, the solution ranked fourth, achieving an average ELAS of 83.79%. The source code is available at https://gitlab.clarin-pl.eu/syntactic-tools/combo.

Submitted 8 July, 2021; originally announced July 2021.

Comments: Accepted at IWPT 2021

arXiv:2105.05796 [pdf, other]

doi 10.1007/978-3-030-86549-8_36

Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts

Authors: Tomasz Stanisławek, Filip Graliński, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

Abstract: The Key Information Extraction (KIE) task is increasingly important in natural language processing problems. But there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved F1-scores of 81.77% and 83.57% on the Kleister NDA and Kleister Charity datasets, respectively. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.

Submitted 12 May, 2021; originally announced May 2021.

Comments: accepted to ICDAR 2021

Journal ref: International Conference on Document Analysis and Recognition ICDAR 2021

arXiv:2105.01735 [pdf, other]

HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish

Authors: Robert Mroczkowski, Piotr Rybak, Alina Wróblewska, Ireneusz Gawlik

Abstract: BERT-based models are currently used for solving nearly all Natural Language Processing (NLP) tasks and most often achieve state-of-the-art results. Therefore, the NLP community conducts extensive research on understanding these models, but above all on designing effective and efficient training procedures. Several ablation studies investigating how to train BERT-like models have been carried out, but the vast majority of them concerned only the English language. A training procedure designed for English does not have to be universal and applicable to other, especially typologically different, languages. Therefore, this paper presents the first ablation study focused on Polish, which, unlike the isolating English language, is a fusional language. We design and thoroughly evaluate a pretraining procedure of transferring knowledge from multilingual to monolingual BERT-based models. In addition to multilingual model initialization, other factors that possibly influence pretraining are also explored, i.e. training objective, corpus size, BPE-Dropout, and pretraining length. Based on the proposed procedure, a Polish BERT-based language model -- HerBERT -- is trained. This model achieves state-of-the-art results on multiple downstream tasks.

Submitted 4 May, 2021; originally announced May 2021.

Comments: Published in Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

arXiv:2005.06331 [pdf, other]

doi 10.3390/electronics11091391

Multi-modal Embedding Fusion-based Recommender

Authors: Anna Wroblewska, Jacek Dabrowski, Michal Pastuszak, Andrzej Michalowski, Michal Daniluk, Barbara Rychalska, Mikolaj Wieczorek, Sylwia Sysko-Romanczuk

Abstract: Recommendation systems have lately been popularized globally, with primary use cases in online interaction systems and a significant focus on e-commerce platforms. We have developed a machine learning-based recommendation platform, which can be easily applied to almost any items and/or actions domain. Contrary to existing recommendation systems, our platform natively supports multiple types of interaction data with multiple modalities of metadata. This is achieved through multi-modal fusion of various data representations. We deployed the platform into multiple e-commerce stores of different kinds, e.g. food and beverages, shoes, fashion items, telecom operators. Here, we present our system, its flexibility and performance. We also show benchmark results on open datasets, which significantly outperform state-of-the-art prior work.

Submitted 14 May, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

Comments: 7 pages, 8 figures

Journal ref: revised and improved version: Electronics MDPI - https://www.mdpi.com/2079-9292/11/9/1391

arXiv:2004.12450 [pdf, other]

doi 10.18653/v1/K18-2004

Semi-Supervised Neural System for Tagging, Parsing and Lematization

Authors: Piotr Rybak, Alina Wróblewska

Abstract: This paper describes the ICS PAS system which took part in the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. The system consists of a jointly trained tagger, lemmatizer, and dependency parser which are based on features extracted by a biLSTM network. The system uses both fully connected and dilated convolutional neural architectures. The novelty of our approach is the use of an additional loss function, which reduces the number of cycles in the predicted dependency graphs, and the use of self-training to increase the system performance. The proposed system, i.e. ICS PAS (Warszawa), ranked 3rd/4th in the official evaluation, obtaining the following overall results: 73.02 (LAS), 60.25 (MLAS) and 64.44 (BLEX).

Submitted 26 April, 2020; originally announced April 2020.

arXiv:2003.04094 [pdf, other]

doi 10.1007/978-3-030-63820-7_33

A Strong Baseline for Fashion Retrieval with Person Re-Identification Models

Authors: Mikolaj Wieczorek, Andrzej Michalowski, Anna Wroblewska, Jacek Dabrowski

Abstract: Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image. Difficulties arise from the fine-grained nature of clothing items and from very large intra-class and inter-class variance. Additionally, query and source images for the task usually come from different domains - street photos and catalogue photos respectively. Due to these differences, a significant gap in quality, lighting, contrast, background clutter and item presentation exists between domains. As a result, fashion retrieval is an active field of research both in academia and the industry. Inspired by recent advancements in Person Re-Identification research, we adapt leading ReID models to be used in fashion retrieval tasks. We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results despite a much simpler architecture. We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results. Finally, we propose a cross-domain (cross-dataset) evaluation method to test the robustness of fashion retrieval models.

Submitted 9 March, 2020; originally announced March 2020.

Comments: 33 pages, 14 figures

Journal ref: short paper in Neural Information Processing, Communications in Computer and Information Science, 2020

arXiv:2003.02356 [pdf, other]

Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

Authors: Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

Abstract: State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions still struggle when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers, openings or footers; complex page layout or presence of multiple pages. To encourage progress on deeper and more complex Information Extraction (IE) we introduce a new task (named Kleister) with two new datasets. Utilizing both textual and structural layout features, an NLP system must find the most important information, about various types of entities, in long formal documents. We propose the Pipeline method as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa). Moreover, we checked the most popular PDF processing tools for text extraction (pdf2djvu, Tesseract and Textract) in order to analyze the behavior of an IE system in the presence of errors introduced by these tools.

Submitted 6 March, 2020; v1 submitted 4 March, 2020; originally announced March 2020.

arXiv:1910.02403 [pdf, other]

doi 10.18653/v1/K19-1058

Named Entity Recognition -- Is there a glass ceiling?

Authors: Tomasz Stanislawek, Anna Wróblewska, Alicja Wójcicka, Daniel Ziembicki, Przemyslaw Biecek

Abstract: Recent developments in Named Entity Recognition (NER) have resulted in better and better models. However, is there a glass ceiling? Do we know which types of errors are still hard or even impossible to correct? In this paper, we present a detailed analysis of the types of errors in state-of-the-art machine learning (ML) methods. Our study reveals the weak and strong points of the Stanford, CMU, FLAIR, ELMO and BERT models, as well as their shared limitations. We also introduce new techniques for improving annotation, for training processes and for checking a model's quality and stability. Presented results are based on the CoNLL 2003 data set for the English language. A new enriched semantic annotation of errors for this data set and new diagnostic data sets are attached in the supplementary materials.

Submitted 30 October, 2019; v1 submitted 6 October, 2019; originally announced October 2019.

Comments: Accepted to CoNLL 2019

Journal ref: 23rd Conference on Computational Natural Language Learning, CoNLL 2019

arXiv:1809.03740 [pdf]

doi 10.18653/v1/W18-5436

Does it care what you asked? Understanding Importance of Verbs in Deep Learning QA System

Authors: Barbara Rychalska, Dominika Basaj, Przemyslaw Biecek, Anna Wroblewska

Abstract: In this paper we present the results of an investigation of the importance of verbs in a deep learning QA system trained on the SQuAD dataset. We show that main verbs in questions carry little influence on the decisions made by the system - in over 90% of researched cases swapping verbs for their antonyms did not change the system's decision. We track this phenomenon down to the insides of the net, analyzing the mechanism of self-attention and values contained in hidden layers of the RNN. Finally, we recognize the characteristics of the SQuAD dataset as the source of the problem. Our work refers to the recently popular topic of adversarial examples in NLP, combined with investigating deep net structure.

Submitted 11 September, 2018; originally announced September 2018.

Comments: Accepted to Analyzing and interpreting neural networks for NLP workshop at EMNLP 2018

Journal ref: 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

arXiv:1809.03734 [pdf]

doi 10.18653/v1/W18-5435

How much should you ask? On the question structure in QA systems

Authors: Dominika Basaj, Barbara Rychalska, Przemyslaw Biecek, Anna Wroblewska

Abstract: Datasets that boosted state-of-the-art solutions for Question Answering (QA) systems prove that it is possible to ask questions in a natural-language manner. However, users are still used to query-like systems where they type in keywords to search for an answer. In this study we validate which parts of questions are essential for obtaining a valid answer. To do so, we take advantage of LIME - a framework that explains predictions by local approximation. We find that grammar and natural language are disregarded by the QA system. A state-of-the-art model can answer properly even if 'asked' only with a few words with high coefficients calculated with LIME. To our knowledge, this is the first time that a QA model has been explained by LIME.

Submitted 11 September, 2018; originally announced September 2018.

Comments: Accepted to Analyzing and interpreting neural networks for NLP workshop at EMNLP 2018

Journal ref: 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

arXiv:1304.2920 [pdf, ps, other]

On the key exchange with nonlinear polynomial maps of stable degree

Authors: Vasyl Ustimenko, Aneta Wróblewska

Abstract: We say that the sequence $g_n$, $n\ge 3$, $n \rightarrow \infty$ of polynomial transformation bijective maps of the free module $K^n$ over a commutative ring $K$ is a sequence of stable degree if the order of $g_n$ is growing with $n$ and the degree of each nonidentical polynomial map of kind ${g_n}^k$ is an independent constant $c$. A transformation $b=\tau {g_n}^k \tau^{-1}$, where $\tau$ is an affine bijection, $n$ is large and $k$ is relatively small, can be used as a base of a group-theoretical Diffie-Hellman key exchange algorithm for the Cremona group $C(K^n)$ of all regular automorphisms of $K^n$. The specific feature of this method is that the order of the base may be unknown to the adversary because of the complexity of its computation. The exchange can be implemented by tools of Computer Algebra (symbolic computations). The adversary cannot use the degree of the right-hand side in $b^x=d$ to evaluate unknown $x$ in this form for the discrete logarithm problem. In the paper we introduce the explicit constructions of sequences of elements of stable degree for cases $c=3$ for each commutative ring $K$ containing at least 3 regular elements and discuss the implementation of related key exchange and public key algorithms.

Submitted 10 April, 2013; originally announced April 2013.

Comments: 19 pages

Showing 1–36 of 36 results for author: Wroblewska, A