Multi-Task Learning for Features Extraction in Financial Annual Reports

Authors: Syrielle Montariol, Matej Martinc, Andraž Pelicon, Senja Pollak, Boshko Koloski, Igor Lončarski, Aljoša Valentinčič

Abstract: For assessing various performance indicators of companies, the focus is shifting from strictly financial (quantitative) publicly disclosed information to qualitative (textual) information. This textual data can provide valuable weak signals, for example through stylistic features, which can complement the quantitative data on financial performance or on Environmental, Social and Governance (ESG) criteria. In this work, we use various multi-task learning methods for financial text classification with the focus on financial sentiment, objectivity, forward-looking sentence prediction and ESG-content detection. We propose different methods to combine the information extracted from training jointly on different tasks; our best-performing method highlights the positive effect of explicitly adding auxiliary task predictions as features for the final target task during the multi-task training. Next, we use these classifiers to extract textual features from annual reports of FTSE350 companies and investigate the link between ESG quantitative scores and these features.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at MIDAS Workshop at ECML-PKDD 2022

arXiv:2402.16596

Semantic change detection for Slovene language: a novel dataset and an approach based on optimal transport

Authors: Marko Pranjić, Kaja Dobrovoljc, Senja Pollak, Matej Martinc

Abstract: In this paper, we focus on the detection of semantic changes in Slovene, a less resourced Slavic language with two million speakers. Detecting and tracking semantic changes provides insights into the evolution of the language caused by changes in society and culture. Recently, several systems have been proposed to aid in this study, but all depend on manually annotated gold standard datasets for evaluation. In this paper, we present the first Slovene dataset for evaluating semantic change detection systems, which contains aggregated semantic change scores for 104 target words obtained from more than 3000 manually annotated sentence pairs. We evaluate several existing semantic change detection methods on this dataset and also propose a novel approach based on optimal transport that improves on the existing state-of-the-art systems with an error reduction rate of 22.8%.

Submitted 26 February, 2024; originally announced February 2024.

ACM Class: I.2.7

arXiv:2301.06767

The Recent Advances in Automatic Term Extraction: A survey

Authors: Hanh Thi Hong Tran, Matej Martinc, Jaya Caporusso, Antoine Doucet, Senja Pollak

Abstract: Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., information retrieval, machine translation, topic detection, and sentiment analysis. ATE systems, along with annotated datasets, have been studied and developed widely for decades, but recently we observed a surge in novel neural systems for the task at hand. Despite a large amount of new research on ATE, systematic survey studies covering novel neural approaches are lacking. We present a comprehensive survey of deep learning-based approaches to ATE, with a focus on Transformer-based neural models. The study also offers a comparison between these systems and previous ATE approaches, which were based on feature engineering and non-neural supervised learning algorithms.

Submitted 17 January, 2023; originally announced January 2023.

Comments: 25 pages, 4 figures, 3 tables

ACM Class: A.1

arXiv:2212.05696

doi 10.1007/978-3-031-21756-2_7

Ensembling Transformers for Cross-domain Automatic Term Extraction

Authors: Hanh Thi Hong Tran, Matej Martinc, Andraz Pelicon, Antoine Doucet, Senja Pollak

Abstract: Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformers-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models by conducting the intersection or union on the term output sets of different language models. Our experiments have been conducted on the ACTER corpus covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from the related work leveraging multilingual models, regarding all the languages except Dutch and French if the term extraction task excludes the extraction of named entity terms. Furthermore, by combining the outputs of the two best performing models, we achieve significant improvements.

Submitted 11 December, 2022; originally announced December 2022.

Comments: 11 pages including references, 3 figures, 2 tables

Journal ref: International Conference on Asian Digital Libraries (ICADL 2022)

arXiv:2203.16885

A bilingual approach to specialised adjectives through word embeddings in the karstology domain

Authors: Larisa Grčić Simeunović, Matej Martinc, Špela Vintar

Abstract: We present an experiment in extracting adjectives which express a specific semantic relation using word embeddings. The results of the experiment are then thoroughly analysed and categorised into groups of adjectives exhibiting formal or semantic similarity. The experiment and analysis are performed for English and Croatian in the domain of karstology using data sets and methods developed in the TermFrame project. The main original contributions of the article are twofold: firstly, proposing a new and promising method of extracting semantically related words relevant for terminology, and secondly, providing a detailed evaluation of the output so that we gain a better understanding of the domain-specific semantic structures on the one hand and the types of similarities extracted by word embeddings on the other.

Submitted 31 March, 2022; originally announced March 2022.

Comments: The paper is published as part of TOTH 2020 proceedings (https://btk.univ-smb.fr/livres/toth-2020/)

arXiv:2202.06650

Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?

Authors: Boshko Koloski, Senja Pollak, Blaž Škrlj, Matej Martinc

Abstract: Keyword extraction is the task of retrieving words that are essential to the content of a given document. Researchers have proposed various approaches to tackle this problem. At the top-most level, approaches are divided into those that require training (supervised) and those that do not (unsupervised). In this study, we are interested in settings where no training data is available for the language under investigation. More specifically, we explore whether pretrained multilingual language models can be employed for zero-shot cross-lingual keyword extraction on low-resource languages with limited or no available labeled training data and whether they outperform state-of-the-art unsupervised keyword extractors. The comparison is conducted on six news article datasets covering two high-resource languages, English and Russian, and four low-resource languages, Croatian, Estonian, Latvian, and Slovenian. We find that the pretrained models fine-tuned on a multilingual corpus covering languages that do not appear in the test set (i.e. in a zero-shot setting) consistently outscore unsupervised models in all six languages.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2102.00472

Extending Neural Keyword Extraction with TF-IDF tagset matching

Authors: Boshko Koloski, Senja Pollak, Blaž Škrlj, Matej Martinc

Abstract: Keyword extraction is the task of identifying words (or multi-word expressions) that best describe a given document and serve in news portals to link articles of similar topics. In this work we develop and evaluate our methods on four novel data sets covering less represented, morphologically-rich languages in European news media industry (Croatian, Estonian, Latvian and Russian). First, we perform evaluation of two supervised neural transformer-based methods (TNT-KID and BERT+BiLSTM CRF) and compare them to a baseline TF-IDF based unsupervised approach. Next, we show that by combining the keywords retrieved by both neural transformer based methods and extending the final set of keywords with an unsupervised TF-IDF based technique, we can drastically improve the recall of the system, making it appropriate to be used as a recommendation system in the media house environment.

Submitted 14 February, 2022; v1 submitted 31 January, 2021; originally announced February 2021.

Comments: The final formatted version of this publication was published in Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL 2021), Online, April, 2021 and is available online at https://www.aclweb.org/anthology/2021.hackashop-1.4

arXiv:2007.15681

doi 10.1007/978-3-030-61527-7_8

COVID-19 therapy target discovery with context-aware literature mining

Authors: Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, Senja Pollak

Abstract: The abundance of literature related to the widespread COVID-19 pandemic is beyond manual inspection of a single expert. Development of systems, capable of automatically processing tens of thousands of scientific publications with the aim to enrich existing empirical evidence with literature-based associations is challenging and relevant. We propose a system for contextualization of empirical expression data by approximating relations between entities, for which representations were learned from one of the largest COVID-19-related literature corpora. In order to exploit a larger scientific context by transfer learning, we propose a novel embedding generation technique that leverages SciBERT language model pretrained on a large multi-domain corpus of scientific publications and fine-tuned for domain adaptation on the CORD-19 dataset. The conducted manual evaluation by the medical expert and the quantitative evaluation based on therapy targets identified in the related work suggest that the proposed method can be successfully employed for COVID-19 therapy target discovery and that it outperforms the baseline FastText method by a large margin.

Submitted 9 November, 2020; v1 submitted 30 July, 2020; originally announced July 2020.

Comments: Accepted to the 23rd International Conference on Discovery Science (DS 2020)

arXiv:2003.09166

doi 10.1017/S1351324921000127

TNT-KID: Transformer-based Neural Tagger for Keyword Identification

Authors: Matej Martinc, Blaž Škrlj, Senja Pollak

Abstract: With growing amounts of available textual data, development of algorithms capable of automatic analysis, categorization and summarization of these data has become a necessity. In this research we present a novel algorithm for keyword identification, i.e., an extraction of one or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture for a specific task at hand and leveraging language model pretraining on a domain specific corpus, the model is capable of overcoming deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction by offering competitive and robust performance on a variety of different datasets while requiring only a fraction of manually labeled data required by the best performing systems. This study also offers thorough error analysis with valuable insights into the inner workings of the model and an ablation study measuring the influence of specific components of the keyword identification workflow on the overall performance.

Submitted 30 November, 2021; v1 submitted 20 March, 2020; originally announced March 2020.

Comments: Accepted to Natural Language Engineering journal

Journal ref: Martinc, M., Škrlj, B., & Pollak, S. (2021). TNT-KID: Transformer-based neural tagger for keyword identification. Natural Language Engineering, 1-40. doi:10.1017/S1351324921000127

arXiv:2001.06629

doi 10.1145/3366424.3382186

Capturing Evolution in Word Usage: Just Add More Clusters?

Authors: Matej Martinc, Syrielle Montariol, Elaine Zosa, Lidia Pivovarova

Abstract: The way the words are used evolves through time, mirroring cultural or technological evolution of society. Semantic change detection is the task of detecting and analysing word evolution in textual data, even in short periods of time. In this paper we focus on a new set of methods relying on contextualised embeddings, a type of semantic modelling that revolutionised the NLP field recently. We leverage the ability of the transformer-based BERT model to generate contextualised embeddings capable of detecting semantic change of words across time. Several approaches are compared in a common setting in order to establish strengths and weaknesses for each of them. We also propose several ideas for improvements, managing to drastically improve the performance of existing approaches.

Submitted 23 January, 2020; v1 submitted 18 January, 2020; originally announced January 2020.

Journal ref: WWW 20 Companion Proceedings of the Web Conference 2020 (April 2020) p. 343-349

arXiv:1912.01072

Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift

Authors: Matej Martinc, Petra Kralj Novak, Senja Pollak

Abstract: We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time specific word representations from BERT embeddings. The results of our experiments in the domain specific LiverpoolFC corpus suggest that the proposed method has performance comparable to the current state-of-the-art without requiring any time consuming domain adaptation on large corpora. The results on the newly created Brexit news corpus suggest that the method can be successfully used for the detection of a short-term yearly semantic shift. Lastly, the model also shows promising results in a multilingual setting, where the task was to detect differences and similarities between diachronic semantic shifts in different languages.

Submitted 5 March, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: Accepted to Language Resources and Evaluation (LREC 2020)

arXiv:1907.11779

doi 10.1162/coli_a_00398

Supervised and Unsupervised Neural Approaches to Text Readability

Authors: Matej Martinc, Senja Pollak, Marko Robnik-Šikonja

Abstract: We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labelled readability datasets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their strengths and weaknesses, compare their performance to current state-of-the-art classification approaches to readability, which in most cases still rely on extensive feature engineering, and propose possibilities for improvements.

Submitted 11 March, 2021; v1 submitted 26 July, 2019; originally announced July 2019.

Comments: 39 pages, published in the Computational Linguistics journal

arXiv:1902.00438

doi 10.1016/j.csl.2020.101104

tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification

Authors: Blaž Škrlj, Matej Martinc, Jan Kralj, Nada Lavrač, Senja Pollak

Abstract: The use of background knowledge is largely unexploited in text classification tasks. This paper explores word taxonomies as means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short text classification problems: prediction of gender, personality type, age, news topics, drug side effects and drug effectiveness. The constructed semantic features, in combination with fast linear classifiers, tested against strong baselines such as hierarchical attention neural networks, achieve comparable classification results on short text documents. The algorithm's performance is also tested in a few-shot learning setting, indicating that the inclusion of semantic features can improve the performance in data-scarce situations. The tax2vec capability to extract corpus-specific semantic keywords is also demonstrated. Finally, we investigate the semantic space of potential features, where we observe a similarity with the well known Zipf's law.

Submitted 23 April, 2020; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: Accepted at CSL journal

Showing 1–13 of 13 results for author: Martinc, M