-
Fake News Detection: It's All in the Data!
Authors:
Soveatin Kuntur,
Anna Wróblewska,
Marcin Paprzycki,
Maria Ganzha
Abstract:
This comprehensive survey serves as an indispensable resource for researchers embarking on the journey of fake news detection. By highlighting the pivotal role of dataset quality and diversity, it underscores the significance of these elements in the effectiveness and robustness of detection models. The survey meticulously outlines the key features of datasets, various labeling systems employed, and prevalent biases that can impact model performance. Additionally, it addresses critical ethical issues and best practices, offering a thorough overview of the current state of available datasets. Our contribution to this field is further enriched by the provision of a GitHub repository, which consolidates publicly accessible datasets into a single, user-friendly portal. This repository is designed to facilitate and stimulate further research and development efforts aimed at combating the pervasive issue of fake news.
Submitted 2 July, 2024;
originally announced July 2024.
-
Deepfake tweets automatic detection
Authors:
Adam Frej,
Adrian Kaminski,
Piotr Marciniak,
Szymon Szmajdzinski,
Soveatin Kuntur,
Anna Wroblewska
Abstract:
This study addresses the critical challenge of detecting DeepFake tweets by leveraging advanced natural language processing (NLP) techniques to distinguish between genuine and AI-generated texts. Given the increasing prevalence of misinformation, our research utilizes the TweepFake dataset to train and evaluate various machine learning models. The objective is to identify effective strategies for recognizing DeepFake content, thereby enhancing the integrity of digital communications. By developing reliable methods for detecting AI-generated misinformation, this work contributes to a more trustworthy online information environment.
Submitted 24 June, 2024;
originally announced June 2024.
-
Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries
Authors:
Anna Wróblewska,
Marcel Witas,
Kinga Frańczak,
Arkadiusz Kniaź,
Siew Ann Cheong,
Tan Seng Chee,
Janusz Hołyst,
Marcin Paprzycki
Abstract:
Recently, multiple applications of machine learning have been introduced. They include various possibilities arising when image analysis methods are applied to broadly understood video streams. In this context, a novel tool has been developed for academic educators to enhance the teaching process by automating, summarizing, and offering prompt feedback on conducting lectures. The implemented prototype utilizes machine learning-based techniques to recognise selected didactic and behavioural teachers' features within lecture video recordings.
Specifically, users (teachers) can upload their lecture videos, which are preprocessed and analysed using machine learning models. Next, users can view summaries of recognized didactic features through interactive charts and tables. Additionally, stored ML-based prediction results support comparisons between lectures based on their didactic content. The developed application applies text-based models trained on lecture transcriptions, with the transcription quality enhanced by adopting an automatic speech recognition solution. Furthermore, the system offers flexibility for (future) integration of new/additional machine-learning models and software modules for image and video analysis.
Submitted 20 June, 2024;
originally announced June 2024.
-
Mining United Nations General Assembly Debates
Authors:
Mateusz Grzyb,
Mateusz Krzyziński,
Bartłomiej Sobieski,
Mikołaj Spytek,
Bartosz Pieliński,
Daniel Dan,
Anna Wróblewska
Abstract:
This project explores the application of Natural Language Processing (NLP) techniques to analyse United Nations General Assembly (UNGA) speeches. Using NLP allows for the efficient processing and analysis of large volumes of textual data, enabling the extraction of semantic patterns, sentiment analysis, and topic modelling. Our goal is to deliver a comprehensive dataset and a tool (an interface with descriptive statistics and automatically extracted topics) from which political scientists can derive insights into international relations and gain a nuanced understanding of global diplomatic discourse.
Submitted 19 June, 2024;
originally announced June 2024.
-
Raman scattering by carbon nanotubes coupled to quantum dots via dipolar excitonic interaction
Authors:
Anna Wroblewska,
Niclas S. Mueller,
Mariusz Zdrojek,
Stephanie Reich,
Georgy Gordeev
Abstract:
The dipole-dipole interactions between excitons are of paramount importance in nanoscale structures. When two excitons are placed together, they can exchange energy, which can manifest in the resonant Raman cross sections. We provide a theoretical framework for such effects by combining the coupled oscillator model and perturbation theory. We apply this theory to a hybrid film comprising semiconducting quantum dots and metallic carbon nanotubes. The quantum dot exciton has a fixed energy, while the nanotube resonances span a larger range, from 1.7 to 1.93 eV. We acquire the resonant Raman profiles of the pristine nanotubes and hybrids and find a relative shift between them. The shift direction depends on the relative energies of the CNT and QD excitons, as predicted by our theory.
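As a rough illustration of why the shift direction can depend on the relative energies, the textbook two-level coupled-oscillator model gives hybridized energies that repel each other. The sketch below is not the authors' framework (which also involves perturbation theory and Raman cross sections); the function names, the coupling strength `g`, and the quantum dot energy used in the usage note are assumptions made only for illustration.

```python
import math


def coupled_energies(e_cnt, e_qd, g):
    """Eigenenergies of two coupled oscillators (textbook 2x2 model).

    e_cnt, e_qd: uncoupled resonance energies in eV
    g: coupling strength in eV (hypothetical value, not from the paper)
    """
    mean = (e_cnt + e_qd) / 2.0
    half_detuning = (e_cnt - e_qd) / 2.0
    split = math.sqrt(g * g + half_detuning * half_detuning)
    return mean - split, mean + split


def cnt_shift(e_cnt, e_qd, g):
    """Shift of the nanotube-like branch relative to its uncoupled energy."""
    lower, upper = coupled_energies(e_cnt, e_qd, g)
    # The branch that starts at the higher energy is pushed up, the other down.
    branch = upper if e_cnt >= e_qd else lower
    return branch - e_cnt
```

With an assumed quantum dot energy of 1.8 eV and `g = 0.05`, `cnt_shift(1.93, 1.8, 0.05)` is positive while `cnt_shift(1.7, 1.8, 0.05)` is negative, so in this toy model the sign of the shift flips with the sign of the detuning.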
Submitted 8 April, 2024;
originally announced April 2024.
-
NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems
Authors:
Martyna Wiącek,
Piotr Rybak,
Łukasz Pszenny,
Alina Wróblewska
Abstract:
With the advancements of transformer-based architectures, we observe the rise of natural language preprocessing (NLPre) tools capable of solving preliminary NLP tasks (e.g. tokenisation, part-of-speech tagging, dependency parsing, or morphological analysis) without any external linguistic guidance. It is arduous to compare novel solutions to well-entrenched preprocessing toolkits, relying on rule-based morphological analysers or dictionaries. Aware of the shortcomings of existing NLPre evaluation approaches, we investigate a novel method of reliable and fair evaluation and performance reporting. Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools, while credibly tracking their performance. The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark. Based on this benchmark, we conduct an extensive evaluation of a variety of Polish NLPre systems. To facilitate the construction of benchmarking environments for other languages, e.g. NLPre-GA for Irish or NLPre-ZH for Chinese, we ensure full customization of the publicly released source code of the benchmarking system. The links to all the resources (deployed platforms, source code, trained models, datasets etc.) can be found on the project website: https://sites.google.com/view/nlpre-benchmark.
Submitted 27 March, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Improving Object Detection Quality in Football Through Super-Resolution Techniques
Authors:
Karolina Seweryn,
Gabriel Chęć,
Szymon Łukasik,
Anna Wróblewska
Abstract:
This study explores the potential of super-resolution techniques in enhancing object detection accuracy in football. Given the sport's fast-paced nature and the critical importance of precise object (e.g. ball, player) tracking for both analysis and broadcasting, super-resolution could offer significant improvements. We investigate how advanced image processing through super-resolution impacts the accuracy and reliability of object detection algorithms in processing football match footage.
Our methodology involved applying state-of-the-art super-resolution techniques to a diverse set of football match videos from SoccerNet, followed by object detection using Faster R-CNN. The performance of these algorithms, both with and without super-resolution enhancement, was rigorously evaluated in terms of detection accuracy.
The results indicate a marked improvement in object detection accuracy when super-resolution preprocessing is applied. The improvement of object detection through the integration of super-resolution techniques yields significant benefits, especially for low-resolution scenarios, with a notable 12% increase in mean Average Precision (mAP) at an IoU (Intersection over Union) range of 0.50:0.95 for 320x240 size images when increasing the resolution fourfold using RLFN. As the dimensions increase, the magnitude of improvement becomes more subdued; however, a discernible improvement in the quality of detection is consistently evident. Additionally, we discuss the implications of these findings for real-time sports analytics, player tracking, and the overall viewing experience. The study contributes to the growing field of sports technology by demonstrating the practical benefits and limitations of integrating super-resolution techniques in football analytics and broadcasting.
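For readers unfamiliar with the metric, the following minimal sketch shows how IoU is computed for two axis-aligned boxes and what an IoU range of 0.50:0.95 means. It assumes the common COCO convention of ten thresholds in steps of 0.05; the abstract does not state the step size. The helper `matched_fraction` is a hypothetical simplification for a single box pair and is not the full mAP computation used in the study.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matched_fraction(pred_box, true_box):
    """Fraction of thresholds at which one prediction counts as a match."""
    score = iou(pred_box, true_box)
    return sum(score >= t for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)
```

A loosely localized box passes only the lower thresholds, which is why averaging over the whole range rewards tighter localization than a single 0.50 cutoff does.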
Submitted 31 January, 2024;
originally announced February 2024.
-
Survey of Action Recognition, Spotting and Spatio-Temporal Localization in Soccer -- Current Trends and Research Perspectives
Authors:
Karolina Seweryn,
Anna Wróblewska,
Szymon Łukasik
Abstract:
Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.
Submitted 21 September, 2023;
originally announced September 2023.
-
Automating the Analysis of Institutional Design in International Agreements
Authors:
Anna Wróblewska,
Bartosz Pieliński,
Karolina Seweryn,
Sylwia Sysko-Romańczuk,
Karol Saputa,
Aleksandra Wichrowska,
Hanna Schreiber
Abstract:
This paper explores the automatic knowledge extraction of formal institutional design - norms, rules, and actors - from international agreements. The focus was to analyze the relationship between the visibility and centrality of actors in the formal institutional design in regulating critical aspects of cultural heritage relations. The developed tool utilizes techniques such as collecting legal documents, annotating them with Institutional Grammar, and using graph analysis to explore the formal institutional design. The system was tested against the 2003 UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage.
Submitted 26 May, 2023;
originally announced May 2023.
-
Enriching language models with graph-based context information to better understand textual data
Authors:
Albert Roethel,
Maria Ganzha,
Anna Wróblewska
Abstract:
A considerable number of texts encountered daily are somehow connected with each other. For example, Wikipedia articles refer to other articles via hyperlinks, scientific papers relate to others via citations or (co)authors, while tweets relate via users that follow each other or reshare content. Hence, a graph-like structure can represent existing connections and be seen as capturing the "context" of the texts. The question thus arises whether extracting and integrating such context information into a language model might help facilitate a better automated understanding of the text. In this study, we experimentally demonstrate that incorporating graph-based contextualization into a BERT model enhances its performance on an example classification task. Specifically, on the PubMed dataset, we observed a reduction in error from 8.51% to 7.96%, while increasing the number of parameters by just 1.6%.
Our source code: https://github.com/tryptofanik/gc-bert
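To put the reported error rates in perspective, the short calculation below separates the absolute change (in percentage points) from the relative reduction. It uses only the two figures quoted in the abstract; the function name is a hypothetical helper.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new model."""
    return (baseline_error - new_error) / baseline_error


baseline, improved = 8.51, 7.96  # error rates in percent, from the abstract
absolute = baseline - improved
relative = relative_error_reduction(baseline, improved)
print(f"absolute: {absolute:.2f} pp, relative: {relative:.1%}")
```

The absolute change is 0.55 percentage points, which corresponds to roughly 6.5% of the baseline error.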
Submitted 10 May, 2023;
originally announced May 2023.
-
Revisiting Distance Metric Learning for Few-Shot Natural Language Classification
Authors:
Witold Sosnowski,
Anna Wróblewska,
Karolina Seweryn,
Piotr Gawrysiak
Abstract:
Distance Metric Learning (DML) has attracted much attention in image processing in recent years. This paper analyzes its impact on supervised fine-tuning language models for Natural Language Processing (NLP) classification tasks under few-shot learning settings. We investigated several DML loss functions in training RoBERTa language models on known SentEval Transfer Tasks datasets. We also analyzed the possibility of using proxy-based DML losses during model inference.
Our systematic experiments have shown that under few-shot learning settings, particularly proxy-based DML losses can positively affect the fine-tuning and inference of a supervised language model. Models tuned with a combination of CCE (categorical cross-entropy loss) and ProxyAnchor Loss have, on average, the best performance and outperform models with only CCE by about 3.27 percentage points -- up to 10.38 percentage points depending on the training dataset.
Submitted 28 November, 2022;
originally announced November 2022.
-
Distance Metric Learning Loss Functions in Few-Shot Scenarios of Supervised Language Models Fine-Tuning
Authors:
Witold Sosnowski,
Karolina Seweryn,
Anna Wróblewska,
Piotr Gawrysiak
Abstract:
This paper presents an analysis of the influence of Distance Metric Learning (DML) loss functions on the supervised fine-tuning of language models for classification tasks. We experimented with known datasets from SentEval Transfer Tasks.
Our experiments show that applying the DML loss function can increase performance on downstream classification tasks of RoBERTa-large models in few-shot scenarios. Models fine-tuned with the use of SoftTriple loss can achieve better results than models with a standard categorical cross-entropy loss function by about 2.89 percentage points, with gains ranging from 0.04 to 13.48 percentage points depending on the training dataset. Additionally, we conducted a comprehensive analysis with explainability techniques to assess the models' reliability and explain their results.
Submitted 28 November, 2022;
originally announced November 2022.
-
Entity Graph Extraction from Legal Acts -- a Prototype for a Use Case in Policy Design Analysis
Authors:
Anna Wróblewska,
Bartosz Pieliński,
Karolina Seweryn,
Karol Saputa,
Aleksandra Wichrowska,
Sylwia Sysko-Romańczuk,
Hanna Schreiber
Abstract:
This paper presents research on a prototype developed to serve the quantitative study of public policy design. This sub-discipline of political science focuses on identifying actors, relations between them, and tools at their disposal in health, environmental, economic, and other policies. Our system aims to automate the process of gathering legal documents, annotating them with Institutional Grammar, and using hypergraphs to analyse inter-relations between crucial entities. Our system is tested against the UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage from 2003, a legal document regulating essential aspects of international relations securing cultural heritage.
Submitted 2 September, 2022;
originally announced September 2022.
-
Identifying Substitute and Complementary Products for Assortment Optimization with Cleora Embeddings
Authors:
Sergiy Tkachuk,
Anna Wróblewska,
Jacek Dąbrowski,
Szymon Łukasik
Abstract:
Recent years have brought increasing interest in the application of machine learning algorithms in e-commerce, omnichannel marketing, and the sales industry. This is due not only to algorithmic advances but also to the availability of data representing transactions, users, and background product information. Finding products related in different ways, i.e., substitutes and complements, is essential both for recommendations to users at the vendor's site and for the vendor, to perform efficient assortment optimization.
The paper introduces a novel method for finding products' substitutes and complements based on the graph embedding Cleora algorithm. We also provide its experimental evaluation with regard to the state-of-the-art Shopper algorithm, studying the relevance of recommendations with surveys from industry experts. It is concluded that the new approach presented here offers suitable choices of recommended products, requiring a minimal amount of additional information. The algorithm can be used in various enterprises, effectively identifying substitute and complementary product options.
Submitted 10 August, 2022;
originally announced August 2022.
-
Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis
Authors:
Maciej Pawłowski,
Anna Wróblewska,
Sylwia Sysko-Romańczuk
Abstract:
Creating a meaningful representation by fusing single modalities (e.g., text, images, or audio) is the core concept of multimodal learning. Although several techniques for building multimodal representations have been proven successful, they have not been compared yet. Therefore, it has been unclear which technique can be expected to yield the best results in a given scenario and what factors should be considered while choosing such a technique. This paper explores the most common techniques for building multimodal data representations -- the late fusion, the early fusion, and the sketch -- and compares them in classification tasks. Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M. In general, our results confirm that multimodal representations are able to boost the performance of unimodal models from 0.919 to 0.969 of accuracy on Amazon Reviews and 0.907 to 0.918 of AUC on MovieLens25M. However, experiments on both MovieLens datasets indicate the importance of input data that is meaningful for the given task. In this article, we show that the choice of the technique for building a multimodal representation is crucial to obtaining the highest possible model performance, which comes with the proper combination of modalities. This choice relies on: the influence that each modality has on the analyzed machine learning (ML) problem; the type of the ML task; and the memory constraints during the training and prediction phases.
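The difference between two of the compared techniques can be sketched in a few lines. This is a generic illustration of early and late fusion, not the authors' implementation: the function names, the equal default weighting, and the example vectors are assumptions, and the sketch technique is not covered.

```python
def early_fusion(text_features, image_features):
    """Early fusion: concatenate per-modality features into one input vector.

    A single downstream model is then trained on the joint vector.
    """
    return list(text_features) + list(image_features)


def late_fusion(text_scores, image_scores, text_weight=0.5):
    """Late fusion: combine per-class scores of separately trained models.

    text_weight is a hypothetical mixing weight; 0.5 means a plain average.
    """
    if len(text_scores) != len(image_scores):
        raise ValueError("both models must score the same set of classes")
    image_weight = 1.0 - text_weight
    return [
        text_weight * t + image_weight * i
        for t, i in zip(text_scores, image_scores)
    ]


joint_input = early_fusion([0.1, 0.4], [0.7])        # one vector, one model
class_scores = late_fusion([0.8, 0.2], [0.6, 0.4])   # two models, merged scores
```

Early fusion increases the input dimensionality of one model, whereas late fusion keeps one model per modality, which is one reason memory constraints enter the choice.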
Submitted 9 June, 2022;
originally announced June 2022.
-
Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
Authors:
Michał Możdżonek,
Anna Wróblewska,
Sergiy Tkachuk,
Szymon Łukasik
Abstract:
Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features both in English and Polish languages. We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - training dataset and gold standard for large-scale product matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases, the results were even better.
Additionally, we prepared a new dataset entirely in Polish and based on offers in selected categories obtained from several online stores for the research purpose. It is the first open dataset for product matching tasks in Polish, which allows comparing the effectiveness of the pre-trained models. Thus, we also showed the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.
Submitted 1 June, 2022; v1 submitted 31 May, 2022;
originally announced May 2022.
-
A Deep Learning Approach for Automatic Detection of Qualitative Features of Lecturing
Authors:
Anna Wroblewska,
Jozef Jasek,
Bogdan Jastrzebski,
Stanislaw Pawlak,
Anna Grzywacz,
Cheong Siew Ann,
Tan Seng Chee,
Tomasz Trzcinski,
Janusz Holyst
Abstract:
Artificial Intelligence in higher education opens new possibilities for improving the lecturing process, such as enriching didactic materials, helping in assessing students' works or even providing directions to the teachers on how to enhance the lectures. We follow this research path, and in this work, we explore how an academic lecture can be assessed automatically by quantitative features. First, we prepare a set of qualitative features based on teaching practices and then annotate the dataset of academic lecture videos collected for this purpose. We then show how these features could be detected automatically using machine learning and computer vision techniques. Our results show the potential usefulness of our work.
Submitted 30 May, 2022;
originally announced May 2022.
-
TASTEset -- Recipe Dataset and Food Entities Recognition Benchmark
Authors:
Ania Wróblewska,
Agnieszka Kaliska,
Maciej Pawłowski,
Dawid Wiśniewski,
Witold Sosnowski,
Agnieszka Ławrynowicz
Abstract:
Food Computing is currently a fast-growing field of research. Natural language processing (NLP) is also increasingly essential in this field, especially for recognising food entities. However, there are still only a few well-defined tasks that serve as benchmarks for solutions in this area. We introduce a new dataset -- called TASTEset -- to bridge this gap. In this dataset, Named Entity Recognition (NER) models are expected to find or infer various types of entities helpful in processing recipes, e.g. food products, quantities and their units, names of cooking processes, physical quality of ingredients, their purpose, taste.
The dataset consists of 700 recipes with more than 13,000 entities to extract. We provide a few state-of-the-art baselines of named entity recognition models, which show that our dataset poses a solid challenge to existing models. The best model achieved, on average, 0.95 F1 score, depending on the entity type -- from 0.781 to 0.982. We share the dataset and the task to encourage progress on more in-depth and complex information extraction from recipes.
Submitted 16 April, 2022;
originally announced April 2022.
-
ProtagonistTagger -- a Tool for Entity Linkage of Persons in Texts from Various Languages and Domains
Authors:
Weronika Lajewska,
Anna Wroblewska
Abstract:
Named entity recognition (NER) and disambiguation (NED) can add semantic context to the recognized named entities in texts. Named entity linkage in texts, regardless of domain, provides links between the entities mentioned in unstructured texts and individual instances of real-world objects. In this poster, we present a tool - protagonistTagger - for person NER and NED in texts. The tool was tested on texts extracted from classic English novels and Polish Internet news. The tool's performance (both precision and recall) ranges from 78% to 88%.
Submitted 13 March, 2022;
originally announced March 2022.
-
Automatic Language Identification for Celtic Texts
Authors:
Olha Dovbnia,
Anna Wróblewska
Abstract:
Language identification is an important Natural Language Processing task. It has been thoroughly researched in the literature. However, some issues are still open. This work addresses the identification of related low-resource languages, using the Celtic language family as an example.
This work's main goals were: (1) to collect the dataset of three Celtic languages; (2) to prepare a method to identify the languages from the Celtic family, i.e. to train a successful classification model; (3) to evaluate the influence of different feature extraction methods, and explore the applicability of the unsupervised models as a feature extraction technique; (4) to experiment with the unsupervised feature extraction on a reduced annotated set.
We collected a new dataset including Irish, Scottish, Welsh and English records. We tested supervised models such as SVM and neural networks with traditional statistical features alongside the output of clustering, autoencoder, and topic modelling methods. The analysis showed that the unsupervised features could serve as a valuable extension to the n-gram feature vectors. It led to an improvement in performance for more entangled classes. The best model achieved a 98\% F1 score and 97\% MCC. The dense neural network consistently outperformed the SVM model.
The low-resource languages are also challenging due to the scarcity of available annotated training data. This work evaluated the performance of the classifiers using the unsupervised feature extraction on the reduced labelled dataset to handle this issue. The results uncovered that the unsupervised feature vectors are more robust to the labelled set reduction. Therefore, they proved to help achieve comparable classification performance with much less labelled data.
Submitted 9 March, 2022;
originally announced March 2022.
-
Polish Natural Language Inference and Factivity -- an Expert-based Dataset and Benchmarks
Authors:
Daniel Ziembicki,
Anna Wróblewska,
Karolina Seweryn
Abstract:
Despite recent breakthroughs in Machine Learning for Natural Language Processing, Natural Language Inference (NLI) problems still constitute a challenge. To this end, we contribute a new dataset that focuses exclusively on the factivity phenomenon; however, our task remains the same as other NLI tasks, i.e. prediction of entailment, contradiction or neutral (ECN). The dataset contains entirely natural language utterances in Polish and gathers 2,432 verb-complement pairs and 309 unique verbs. The dataset is based on the National Corpus of Polish (NKJP) and is a representative sample with regard to the frequency of main verbs and other linguistic features (e.g. occurrence of internal negation). We found that transformer BERT-based models working on sentences obtained relatively good results ($\approx89\%$ F1 score). Even though better results were achieved using linguistic features ($\approx91\%$ F1 score), this model requires more human labour (humans in the loop) because the features were prepared manually by expert linguists. BERT-based models consuming only the input sentences show that they capture most of the complexity of NLI/factivity. Complex cases in the phenomenon - e.g. cases with entailment (E) and non-factive verbs - remain an open issue for further research.
Submitted 10 January, 2022;
originally announced January 2022.
-
Spoiler in a Textstack: How Much Can Transformers Help?
Authors:
Anna Wróblewska,
Paweł Rzepiński,
Sylwia Sysko-Romańczuk
Abstract:
This paper presents our research regarding spoiler detection in reviews. In this use case, we describe the method of fine-tuning and organizing the available text-based model tasks with the latest deep learning achievements and techniques to interpret the models' results.
Until now, spoiler research has rarely been described in the literature. We tested the transfer learning approach and several recent transformer architectures on two open datasets with annotated spoilers (ROC AUC above 81\% on the TV Tropes Movies dataset and above 88\% on the Goodreads dataset). We also collected data and assembled a new dataset with fine-grained annotations. To that end, we employed interpretability techniques and measures to assess the models' reliability and explain their results.
Submitted 23 December, 2021;
originally announced December 2021.
-
Applying SoftTriple Loss for Supervised Language Model Fine Tuning
Authors:
Witold Sosnowski,
Anna Wroblewska,
Piotr Gawrysiak
Abstract:
We introduce a new loss function, TripleEntropy, based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general knowledge pre-trained language models. This loss function can improve the robust RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02% to 2.29%. Thorough tests on popular datasets indicate a steady gain. The fewer samples in the training dataset, the higher the gain: 0.78% for small datasets, 0.86% for medium-sized, 0.20% for large, and 0.04% for extra-large.
Submitted 15 December, 2021;
originally announced December 2021.
-
Protagonists' Tagger in Literary Domain -- New Datasets and a Method for Person Entity Linkage
Authors:
Weronika Łajewska,
Anna Wróblewska
Abstract:
Semantic annotation of long texts, such as novels, remains an open challenge in Natural Language Processing (NLP). This research investigates the problem of detecting person entities and assigning them unique identities, i.e., recognizing people (especially main characters) in novels. We prepared a method for person entity linkage (named entity recognition and disambiguation) and new testing datasets. The datasets comprise 1,300 sentences from 13 classic novels of different genres that a novel reader had manually annotated. Our process of identifying literary characters in a text, implemented in protagonistTagger, comprises two stages: (1) named entity recognition (NER) of persons, (2) named entity disambiguation (NED) - matching each recognized person with the literary character's full name, based on approximate text matching. The protagonistTagger achieves both precision and recall of above 83% on the prepared testing sets. Finally, we gathered a corpus of 13 full-text novels tagged with protagonistTagger that comprises more than 35,000 mentions of literary characters.
Submitted 4 October, 2021;
originally announced October 2021.
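The second stage described in the abstract above (matching a recognized person mention to a character's full name by approximate text matching) can be illustrated with a minimal sketch. This is not the protagonistTagger implementation: the function name `best_match`, the similarity measure (Python's `difflib.SequenceMatcher`), the threshold of 0.5 and the character list are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def best_match(mention, full_names, threshold=0.5):
    """Return the full name most similar to `mention`, or None.

    Hypothetical helper: similarity is the best SequenceMatcher ratio
    between the mention and either the whole name or any single part of it.
    """
    def score(name):
        candidates = [name] + name.split()
        return max(
            SequenceMatcher(None, mention.lower(), c.lower()).ratio()
            for c in candidates
        )

    best = max(full_names, key=score)
    # Reject weak matches so unrelated mentions stay unlinked.
    return best if score(best) >= threshold else None


# Illustrative character list (an assumption, not taken from the datasets).
characters = ["Elizabeth Bennet", "Fitzwilliam Darcy", "Jane Bennet"]

for mention in ["Jane", "Darcy", "zzz"]:
    print(mention, "->", best_match(mention, characters))
```

Comparing against each part of a name as well as the whole name lets a mention that uses only a first name or a surname still reach a perfect score; a real system would also need to handle nicknames and titles, which this sketch does not attempt.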
-
COMBO: State-of-the-Art Morphosyntactic Analysis
Authors:
Mateusz Klimaszewski,
Alina Wróblewska
Abstract:
We introduce COMBO - a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features while also exposing their vector representations, extracted from hidden layers. COMBO is an easy-to-install Python package with automatically downloadable pre-trained models for over 40 languages. It maintains a balance between efficiency and quality. As it is an end-to-end system and its modules are jointly trained, its training is competitively fast. As its models are optimised for accuracy, they often achieve better prediction quality than SOTA. The COMBO library is available at: https://gitlab.clarin-pl.eu/syntactic-tools/combo.
Submitted 11 September, 2021;
originally announced September 2021.
-
COMBO: a new module for EUD parsing
Authors:
Mateusz Klimaszewski,
Alina Wróblewska
Abstract:
We introduce the COMBO-based approach for EUD parsing and its implementation, which took part in the IWPT 2021 EUD shared task. The goal of this task is to parse raw texts in 17 languages into Enhanced Universal Dependencies (EUD). The proposed approach uses COMBO to predict UD trees and EUD graphs. These structures are then merged into the final EUD graphs. Some EUD edge labels are extended with case information using a single language-independent expansion rule. In the official evaluation, the solution ranked fourth, achieving an average ELAS of 83.79%. The source code is available at https://gitlab.clarin-pl.eu/syntactic-tools/combo.
Submitted 8 July, 2021;
originally announced July 2021.
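The abstract above mentions extending some EUD edge labels with case information through a single language-independent rule, without spelling the rule out. The sketch below shows one plausible reading, following the general Enhanced Universal Dependencies convention of appending the lemma of a `case` dependent to the label. The set of extendable relations, the token layout and the function name `extend_with_case` are assumptions, not the authors' rule.

```python
# Relations that receive a case suffix in this sketch (an assumption).
EXTENDABLE = {"obl", "nmod"}


def extend_with_case(tokens):
    """Map token id -> enhanced label, e.g. 'obl' becomes 'obl:in'.

    `tokens` is a list of dicts with keys: id, lemma, head, deprel.
    """
    # Record, for each head, the lemma of its first `case` dependent.
    case_lemma = {}
    for tok in tokens:
        if tok["deprel"] == "case":
            case_lemma.setdefault(tok["head"], tok["lemma"].lower())

    enhanced = {}
    for tok in tokens:
        label = tok["deprel"]
        if label in EXTENDABLE and tok["id"] in case_lemma:
            label = f"{label}:{case_lemma[tok['id']]}"
        enhanced[tok["id"]] = label
    return enhanced


# "She lives in Warsaw" as a hand-built UD tree (illustrative input).
sentence = [
    {"id": 1, "lemma": "she", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "live", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "in", "head": 4, "deprel": "case"},
    {"id": 4, "lemma": "Warsaw", "head": 2, "deprel": "obl"},
]

print(extend_with_case(sentence))
```

Because the rule only looks at relation names and lemmas, it needs no language-specific resources, which is consistent with the "language-independent" description in the abstract.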
-
Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts
Authors:
Tomasz Stanisławek,
Filip Graliński,
Anna Wróblewska,
Dawid Lipiński,
Agnieszka Kaliska,
Paulina Rosalska,
Bartosz Topolski,
Przemysław Biecek
Abstract:
The Key Information Extraction (KIE) task is increasingly important in natural language processing. However, there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved F1-scores of 81.77% and 83.57% on the Kleister NDA and Kleister Charity datasets, respectively. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
Submitted 12 May, 2021;
originally announced May 2021.
-
HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish
Authors:
Robert Mroczkowski,
Piotr Rybak,
Alina Wróblewska,
Ireneusz Gawlik
Abstract:
BERT-based models are currently used for solving nearly all Natural Language Processing (NLP) tasks and most often achieve state-of-the-art results. Therefore, the NLP community conducts extensive research on understanding these models, but above all on designing effective and efficient training procedures. Several ablation studies investigating how to train BERT-like models have been carried out, but the vast majority of them concerned only the English language. A training procedure designed for English is not necessarily universal or applicable to other, especially typologically different, languages. Therefore, this paper presents the first ablation study focused on Polish, which, unlike the isolating English language, is a fusional language. We design and thoroughly evaluate a pretraining procedure of transferring knowledge from multilingual to monolingual BERT-based models. In addition to multilingual model initialization, other factors that possibly influence pretraining are also explored, i.e. training objective, corpus size, BPE-Dropout, and pretraining length. Based on the proposed procedure, a Polish BERT-based language model -- HerBERT -- is trained. This model achieves state-of-the-art results on multiple downstream tasks.
Submitted 4 May, 2021;
originally announced May 2021.
-
Multi-modal Embedding Fusion-based Recommender
Authors:
Anna Wroblewska,
Jacek Dabrowski,
Michal Pastuszak,
Andrzej Michalowski,
Michal Daniluk,
Barbara Rychalska,
Mikolaj Wieczorek,
Sylwia Sysko-Romanczuk
Abstract:
Recommendation systems have lately been popularized globally, with primary use cases in online interaction systems and a significant focus on e-commerce platforms. We have developed a machine learning-based recommendation platform, which can be easily applied to almost any items and/or actions domain. Contrary to existing recommendation systems, our platform natively supports multiple types of interaction data with multiple modalities of metadata. This is achieved through multi-modal fusion of various data representations. We deployed the platform in multiple e-commerce stores of different kinds, e.g. food and beverages, shoes, fashion items, telecom operators. Here, we present our system, its flexibility and performance. We also show benchmark results on open datasets that significantly outperform prior state-of-the-art work.
Submitted 14 May, 2020; v1 submitted 13 May, 2020;
originally announced May 2020.
-
Semi-Supervised Neural System for Tagging, Parsing and Lematization
Authors:
Piotr Rybak,
Alina Wróblewska
Abstract:
This paper describes the ICS PAS system which took part in the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. The system consists of a jointly trained tagger, lemmatizer, and dependency parser, which are based on features extracted by a biLSTM network. The system uses both fully connected and dilated convolutional neural architectures. The novelty of our approach is the use of an additional loss function, which reduces the number of cycles in the predicted dependency graphs, and the use of self-training to increase the system performance. The proposed system, i.e. ICS PAS (Warszawa), ranked 3rd/4th in the official evaluation, obtaining the following overall results: 73.02 (LAS), 60.25 (MLAS) and 64.44 (BLEX).
Submitted 26 April, 2020;
originally announced April 2020.
-
A Strong Baseline for Fashion Retrieval with Person Re-Identification Models
Authors:
Mikolaj Wieczorek,
Andrzej Michalowski,
Anna Wroblewska,
Jacek Dabrowski
Abstract:
Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image. Difficulties arise from the fine-grained nature of clothing items, very large intra-class and inter-class variance. Additionally, query and source images for the task usually come from different domains - street photos and catalogue photos respectively. Due to these differences, a significant gap in quality, lighting, contrast, background clutter and item presentation exists between domains. As a result, fashion retrieval is an active field of research both in academia and the industry.
Inspired by recent advancements in Person Re-Identification research, we adapt leading ReID models to be used in fashion retrieval tasks. We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results despite a much simpler architecture. We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results. Finally, we propose a cross-domain (cross-dataset) evaluation method to test the robustness of fashion retrieval models.
Submitted 9 March, 2020;
originally announced March 2020.
-
Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout
Authors:
Filip Graliński,
Tomasz Stanisławek,
Anna Wróblewska,
Dawid Lipiński,
Agnieszka Kaliska,
Paulina Rosalska,
Bartosz Topolski,
Przemysław Biecek
Abstract:
State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers, openings or footers; complex page layout or presence of multiple pages.
To encourage progress on deeper and more complex Information Extraction (IE), we introduce a new task (named Kleister) with two new datasets. Utilizing both textual and structural layout features, an NLP system must find the most important information about various types of entities in long formal documents. We propose the Pipeline method as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa). Moreover, we checked the most popular PDF processing tools for text extraction (pdf2djvu, Tesseract and Textract) in order to analyze the behavior of an IE system in the presence of errors introduced by these tools.
Submitted 6 March, 2020; v1 submitted 4 March, 2020;
originally announced March 2020.
-
Named Entity Recognition -- Is there a glass ceiling?
Authors:
Tomasz Stanislawek,
Anna Wróblewska,
Alicja Wójcicka,
Daniel Ziembicki,
Przemyslaw Biecek
Abstract:
Recent developments in Named Entity Recognition (NER) have resulted in better and better models. However, is there a glass ceiling? Do we know which types of errors are still hard or even impossible to correct? In this paper, we present a detailed analysis of the types of errors in state-of-the-art machine learning (ML) methods. Our study reveals the weak and strong points of the Stanford, CMU, FLAIR, ELMO and BERT models, as well as their shared limitations. We also introduce new techniques for improving annotation, for training processes and for checking a model's quality and stability. Presented results are based on the CoNLL 2003 data set for the English language. A new enriched semantic annotation of errors for this data set and new diagnostic data sets are attached in the supplementary materials.
Submitted 30 October, 2019; v1 submitted 6 October, 2019;
originally announced October 2019.
-
Does it care what you asked? Understanding Importance of Verbs in Deep Learning QA System
Authors:
Barbara Rychalska,
Dominika Basaj,
Przemyslaw Biecek,
Anna Wroblewska
Abstract:
In this paper we present the results of an investigation of the importance of verbs in a deep learning QA system trained on the SQuAD dataset. We show that main verbs in questions carry little influence on the decisions made by the system - in over 90% of researched cases swapping verbs for their antonyms did not change the system's decision. We track this phenomenon down to the insides of the net, analyzing the mechanism of self-attention and the values contained in hidden layers of the RNN. Finally, we recognize the characteristics of the SQuAD dataset as the source of the problem. Our work refers to the recently popular topic of adversarial examples in NLP, combined with investigating deep net structure.
Submitted 11 September, 2018;
originally announced September 2018.
-
How much should you ask? On the question structure in QA systems
Authors:
Dominika Basaj,
Barbara Rychalska,
Przemyslaw Biecek,
Anna Wroblewska
Abstract:
Datasets that boosted state-of-the-art solutions for Question Answering (QA) systems prove that it is possible to ask questions in natural language. However, users are still used to query-like systems where they type in keywords to search for an answer. In this study we examine which parts of questions are essential for obtaining a valid answer. To do so, we take advantage of LIME - a framework that explains predictions by local approximation. We find that grammar and natural language are disregarded by the QA system. A state-of-the-art model can answer properly even if 'asked' with only a few words with high coefficients calculated with LIME. To our knowledge, this is the first time that a QA model has been explained by LIME.
Submitted 11 September, 2018;
originally announced September 2018.
-
On the key exchange with nonlinear polynomial maps of stable degree
Authors:
Vasyl Ustimenko,
Aneta Wróblewska
Abstract:
We say that a sequence $g_n$, $n\ge 3$, $n \rightarrow \infty$, of bijective polynomial transformations of the free module $K^n$ over a commutative ring $K$ is a sequence of stable degree if the order of $g_n$ grows with $n$ and the degree of each non-identity polynomial map of the form ${g_n}^k$ is an independent constant $c$. A transformation $b=\tau {g_n}^k \tau^{-1}$, where $\tau$ is an affine bijection, $n$ is large and $k$ is relatively small, can be used as the base of a group-theoretic Diffie-Hellman key exchange algorithm for the Cremona group $C(K^n)$ of all regular automorphisms of $K^n$. The specific feature of this method is that the order of the base may be unknown to the adversary because of the complexity of its computation. The exchange can be implemented with Computer Algebra tools (symbolic computations). The adversary cannot use the degree of the right-hand side in $b^x=d$ to evaluate the unknown $x$ in this form of the discrete logarithm problem.
In the paper we introduce explicit constructions of sequences of elements of stable degree for the case $c=3$ for each commutative ring $K$ containing at least 3 regular elements and discuss the implementation of related key exchange and public key algorithms.
Submitted 10 April, 2013;
originally announced April 2013.
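The group-theoretic Diffie-Hellman exchange described in the abstract above can be sketched on a toy group. As a loudly stated assumption, permutations of a 12-element set stand in for the bijective polynomial maps of $K^n$; this keeps the example runnable but offers no security and says nothing about polynomial degree. The names `compose`, `inverse` and `power`, the random seed and the exponents are all illustrative choices.

```python
import random


def compose(f, g):
    """Return the map 'f after g' (apply g first, then f)."""
    return tuple(f[g[i]] for i in range(len(g)))


def inverse(f):
    """Return the inverse of the bijection f."""
    inv = [0] * len(f)
    for i, v in enumerate(f):
        inv[v] = i
    return tuple(inv)


def power(f, k):
    """Compute f^k for k >= 0 by square-and-multiply."""
    result = tuple(range(len(f)))  # identity map
    base = f
    while k > 0:
        if k & 1:
            result = compose(result, base)
        base = compose(base, base)
        k >>= 1
    return result


rng = random.Random(0)
n = 12
g = tuple(rng.sample(range(n), n))    # toy stand-in for g_n
tau = tuple(rng.sample(range(n), n))  # toy stand-in for the affine bijection
k = 3

# Public base b = tau * g^k * tau^{-1}, as in the abstract.
base = compose(compose(tau, power(g, k)), inverse(tau))

x, y = 5, 7                        # private exponents of the two parties
alice_public = power(base, x)      # sent to Bob
bob_public = power(base, y)        # sent to Alice
alice_key = power(bob_public, x)   # (b^y)^x
bob_key = power(alice_public, y)   # (b^x)^y

print(alice_key == bob_key)
```

Both parties arrive at $b^{xy}$ because powers of a single group element commute, which is the only group property the exchange relies on.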