Showing 1–34 of 34 results for author: Pollak, S

  1. arXiv:2406.09128  [pdf, other]

    cs.CL

    CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature

    Authors: Julien Delaunay, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Georgeta Bordea, Mathilde Ducos, Nicolas Sidere, Antoine Doucet, Senja Pollak, Olivier De Viron

    Abstract: The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classific…

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2404.07036  [pdf, other]

    cs.CL

    A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media

    Authors: Jaya Caporusso, Damar Hoogland, Mojca Brglez, Boshko Koloski, Matthew Purver, Senja Pollak

    Abstract: Dehumanisation involves the perception and/or treatment of a social group's members as less than human. This phenomenon is rarely addressed with computational linguistic techniques. We adapt a recently proposed approach for English, making it easier to transfer to other languages and to evaluate, introducing a new sentiment resource, the use of zero-shot cross-lingual valence and arousal detection…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: The first authors have contributed equally. Accepted at LREC-COLING

  3. arXiv:2404.05281  [pdf, ps, other]

    cs.CL

    Multi-Task Learning for Features Extraction in Financial Annual Reports

    Authors: Syrielle Montariol, Matej Martinc, Andraž Pelicon, Senja Pollak, Boshko Koloski, Igor Lončarski, Aljoša Valentinčič

    Abstract: For assessing various performance indicators of companies, the focus is shifting from strictly financial (quantitative) publicly disclosed information to qualitative (textual) information. This textual data can provide valuable weak signals, for example through stylistic features, which can complement the quantitative data on financial performance or on Environmental, Social and Governance (ESG) c…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at MIDAS Workshop at ECML-PKDD 2022

  4. arXiv:2402.16596  [pdf, other]

    cs.CL

    Semantic change detection for Slovene language: a novel dataset and an approach based on optimal transport

    Authors: Marko Pranjić, Kaja Dobrovoljc, Senja Pollak, Matej Martinc

    Abstract: In this paper, we focus on the detection of semantic changes in Slovene, a less resourced Slavic language with two million speakers. Detecting and tracking semantic changes provides insights into the evolution of the language caused by changes in society and culture. Recently, several systems have been proposed to aid in this study, but all depend on manually annotated gold standard datasets for e…

    Submitted 26 February, 2024; originally announced February 2024.

    ACM Class: I.2.7

  5. arXiv:2312.15784  [pdf, other]

    cs.CL cs.AI

    AHAM: Adapt, Help, Ask, Model -- Harvesting LLMs for literature mining

    Authors: Boshko Koloski, Nada Lavrač, Bojan Cestnik, Senja Pollak, Blaž Škrlj, Andrej Kastrin

    Abstract: In an era marked by a rapid increase in scientific publications, researchers grapple with the challenge of keeping pace with field-specific advances. We present the 'AHAM' methodology and a metric that guides the domain-specific adaptation of the BERTopic topic modeling framework to improve scientific text analysis. By utilizing the LLaMa2 generative language model, we generate topic defi…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: Submitted to IDA 2024

  6. arXiv:2309.15757  [pdf, other]

    cs.LG cs.AI

    Latent Graphs for Semi-Supervised Learning on Biomedical Tabular Data

    Authors: Boshko Koloski, Nada Lavrač, Senja Pollak, Blaž Škrlj

    Abstract: In the domain of semi-supervised learning, the current approaches insufficiently exploit the potential of considering inter-instance relationships among (un)labeled data. In this work, we address this limitation by providing an approach for inferring latent graphs that capture the intrinsic data relationships. By leveraging graph-based representations, our approach facilitates the seamless propaga…

    Submitted 14 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted at IJCLR 2023

  7. arXiv:2309.06089  [pdf, ps, other]

    cs.CL cs.LG

    Measuring Catastrophic Forgetting in Cross-Lingual Transfer Paradigms: Exploring Tuning Strategies

    Authors: Boshko Koloski, Blaž Škrlj, Marko Robnik-Šikonja, Senja Pollak

    Abstract: The cross-lingual transfer is a promising technique to solve tasks in less-resourced languages. In this empirical study, we compare two fine-tuning approaches combined with zero-shot and full-shot learning approaches for large language models in a cross-lingual setting. As fine-tuning strategies, we compare parameter-efficient adapter methods with fine-tuning of all parameters. As cross-lingual tr…

    Submitted 15 April, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

  8. Detection of depression on social networks using transformers and ensembles

    Authors: Ilija Tavchioski, Marko Robnik-Šikonja, Senja Pollak

    Abstract: As the impact of technology on our lives is increasing, we witness increased use of social media that became an essential tool not only for communication but also for sharing information with community about our thoughts and feelings. This can be observed also for people with mental health disorders such as depression where they use social media for expressing their thoughts and asking for help. T…

    Submitted 9 May, 2023; originally announced May 2023.

  9. XAI in Computational Linguistics: Understanding Political Leanings in the Slovenian Parliament

    Authors: Bojan Evkoski, Senja Pollak

    Abstract: The work covers the development and explainability of machine learning models for predicting political leanings through parliamentary transcriptions. We concentrate on the Slovenian parliament and the heated debate on the European migrant crisis, with transcriptions from 2014 to 2020. We develop both classical machine learning and transformer language models to predict the left- or right-leaning o…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: In 10th Language and Technology Conference (LTC 2023). April 2023

  10. arXiv:2304.05908  [pdf, other]

    q-bio.NC

    Altered Topological Structure of the Brain White Matter in Maltreated Children through Topological Data Analysis

    Authors: Moo K. Chung, Tahmineh Azizi, Jamie L. Hanson, Andrew L. Alexander, Richard J. Davidson, Seth D. Pollak

    Abstract: Childhood maltreatment may adversely affect brain development and consequently influence behavioral, emotional, and psychological patterns during adulthood. In this study, we propose an analytical pipeline for modeling the altered topological structure of brain white matter in maltreated and typically developing children. We perform topological data analysis (TDA) to assess the alteration in the g…

    Submitted 14 November, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

  11. arXiv:2301.06767  [pdf, other]

    cs.CL

    The Recent Advances in Automatic Term Extraction: A survey

    Authors: Hanh Thi Hong Tran, Matej Martinc, Jaya Caporusso, Antoine Doucet, Senja Pollak

    Abstract: Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., in…

    Submitted 17 January, 2023; originally announced January 2023.

    Comments: 25 pages, 4 figures, 3 tables

    ACM Class: A.1

  12. Ensembling Transformers for Cross-domain Automatic Term Extraction

    Authors: Hanh Thi Hong Tran, Matej Martinc, Andraz Pelicon, Antoine Doucet, Senja Pollak

    Abstract: Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformers-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and…

    Submitted 11 December, 2022; originally announced December 2022.

    Comments: 11 pages including references, 3 figures, 2 tables

    Journal ref: International Conference on Asian Digital Libraries (ICADL 2022)

  13. Retrieval-efficiency trade-off of Unsupervised Keyword Extraction

    Authors: Blaž Škrlj, Boshko Koloski, Senja Pollak

    Abstract: Efficiently identifying keyphrases that represent a given document is a challenging task. In the last years, plethora of keyword detection approaches were proposed. These approaches can be based on statistical (frequency-based) properties of e.g., tokens, specialized neural language models, or a graph-based structure derived from a given document. The graph-based methods can be computationally amo…

    Submitted 15 August, 2022; originally announced August 2022.

  14. arXiv:2204.09781  [pdf]

    cs.DL cs.CL cs.IR cs.LG

    Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations

    Authors: Qingyu Chen, Alexis Allot, Robert Leaman, Rezarta Islamaj Doğan, Jingcheng Du, Li Fang, Kai Wang, Shuo Xu, Yuefu Zhang, Parsa Bagherzadeh, Sabine Bergler, Aakash Bhatnagar, Nidhir Bhavsar, Yung-Chun Chang, Sheng-Jie Lin, Wentai Tang, Hongtong Zhang, Ilija Tavchioski, Senja Pollak, Shubo Tian, Jinfeng Zhang, Yulia Otmakhova, Antonio Jimeno Yepes, Hang Dong, Honghan Wu, et al. (14 additional authors not shown)

    Abstract: The COVID-19 pandemic has been severely impacting global society since December 2019. Massive research has been undertaken to understand the characteristics of the virus and design vaccines and drugs. The related findings have been reported in biomedical literature at a rate of about 10,000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretatio…

    Submitted 3 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  15. arXiv:2203.06665  [pdf, other]

    q-bio.NC

    Statistical Analysis on Brain Surfaces

    Authors: Moo K. Chung, Jamie L. Hanson, Seth D. Pollak

    Abstract: In this paper, we review widely used statistical analysis frameworks for data defined along cortical and subcortical surfaces that have been developed in last two decades. The cerebral cortex has the topology of a 2D highly convoluted sheet. For data obtained along curved non-Euclidean surfaces, traditional statistical analysis and smoothing techniques based on the Euclidean metric structure are i…

    Submitted 13 March, 2022; originally announced March 2022.

  16. arXiv:2202.06650  [pdf, other]

    cs.CL cs.LG

    Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?

    Authors: Boshko Koloski, Senja Pollak, Blaž Škrlj, Matej Martinc

    Abstract: Keyword extraction is the task of retrieving words that are essential to the content of a given document. Researchers proposed various approaches to tackle this problem. At the top-most level, approaches are divided into ones that require training - supervised and ones that do not - unsupervised. In this study, we are interested in settings, where for a language under investigation, no training da…

    Submitted 14 February, 2022; originally announced February 2022.

  17. Named entity recognition architecture combining contextual and global features

    Authors: Tran Thi Hong Hanh, Antoine Doucet, Nicolas Sidere, Jose G. Moreno, Senja Pollak

    Abstract: Named entity recognition (NER) is an information extraction technique that aims to locate and classify named entities (e.g., organizations, locations,...) within a document into predefined categories. Correctly identifying these phrases plays a significant role in simplifying information access. However, it remains a difficult task because named entities (NEs) have multiple forms and they are cont…

    Submitted 15 December, 2021; originally announced December 2021.

  18. Knowledge Graph informed Fake News Classification via Heterogeneous Representation Ensembles

    Authors: Boshko Koloski, Timen Stepišnik-Perdih, Marko Robnik-Šikonja, Senja Pollak, Blaž Škrlj

    Abstract: Increasing amounts of freely available data both in textual and relational form offers exploration of richer document representations, potentially improving the model performance and robustness. An emerging problem in the modern era is fake news detection -- many easily available pieces of information are not necessarily factually correct, and can lead to wrong conclusions or are used for manipula…

    Submitted 15 February, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

  19. Prioritization of COVID-19-related literature via unsupervised keyphrase extraction and document representation learning

    Authors: Blaž Škrlj, Marko Jukič, Nika Eržen, Senja Pollak, Nada Lavrač

    Abstract: The COVID-19 pandemic triggered a wave of novel scientific literature that is impossible to inspect and study in a reasonable time frame manually. Current machine learning methods offer to project such body of literature into the vector space, where similar documents are located close to each other, offering an insightful exploration of scientific papers and other knowledge sources associated with…

    Submitted 17 October, 2021; originally announced October 2021.

  20. arXiv:2107.10614  [pdf, ps, other]

    cs.CL

    Evaluation of contextual embeddings on less-resourced languages

    Authors: Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, Marko Robnik-Šikonja

    Abstract: The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysi…

    Submitted 22 July, 2021; originally announced July 2021.

    Comments: 45 pages

  21. JSI at the FinSim-2 task: Ontology-Augmented Financial Concept Classification

    Authors: Timen Stepišnik Perdih, Senja Pollak, Blaž Škrlj

    Abstract: Ontologies are increasingly used for machine reasoning over the last few years. They can provide explanations of concepts or be used for concept classification if there exists a mapping from the desired labels to the relevant ontology. Another advantage of using ontologies is that they do not need a learning process, meaning that we do not need the train data or time before using them. This paper…

    Submitted 16 June, 2021; originally announced June 2021.

  22. arXiv:2102.00472  [pdf, other]

    cs.CL

    Extending Neural Keyword Extraction with TF-IDF tagset matching

    Authors: Boshko Koloski, Senja Pollak, Blaž Škrlj, Matej Martinc

    Abstract: Keyword extraction is the task of identifying words (or multi-word expressions) that best describe a given document and serve in news portals to link articles of similar topics. In this work we develop and evaluate our methods on four novel data sets covering less represented, morphologically-rich languages in European news media industry (Croatian, Estonian, Latvian and Russian). First, we perfor…

    Submitted 14 February, 2022; v1 submitted 31 January, 2021; originally announced February 2021.

    Comments: The final formatted version of this publication was published in Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (EACL 2021), Online, April, 2021 and is available online at https://www.aclweb.org/anthology/2021.hackashop-1.4

  23. Identification of COVID-19 related Fake News via Neural Stacking

    Authors: Boshko Koloski, Timen Stepišnik Perdih, Senja Pollak, Blaž Škrlj

    Abstract: Identification of Fake News plays a prominent role in the ongoing pandemic, impacting multiple aspects of day-to-day life. In this work we present a solution to the shared task titled COVID19 Fake News Detection in English, scoring the 50th place amongst 168 submissions. The solution was within 1.5% of the best performing solution. The proposed solution employs a heterogeneous representation ensem…

    Submitted 30 June, 2021; v1 submitted 11 January, 2021; originally announced January 2021.

    Comments: Published at CONSTRAIN 2021 (AAAI)

  24. COVID-19 therapy target discovery with context-aware literature mining

    Authors: Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, Senja Pollak

    Abstract: The abundance of literature related to the widespread COVID-19 pandemic is beyond manual inspection of a single expert. Development of systems, capable of automatically processing tens of thousands of scientific publications with the aim to enrich existing empirical evidence with literature-based associations is challenging and relevant. We propose a system for contextualization of empirical expre…

    Submitted 9 November, 2020; v1 submitted 30 July, 2020; originally announced July 2020.

    Comments: Accepted to the 23rd International Conference on Discovery Science (DS 2020)

  25. arXiv:2005.05716  [pdf, other]

    cs.LG stat.ML

    AttViz: Online exploration of self-attention for transparent neural language modeling

    Authors: Blaž Škrlj, Nika Eržen, Shane Sheehan, Saturnino Luz, Marko Robnik-Šikonja, Senja Pollak

    Abstract: Neural language models are becoming the prevailing methodology for the tasks of query answering, text classification, disambiguation, completion and translation. Commonly comprised of hundreds of millions of parameters, these neural network models offer state-of-the-art performance at the cost of interpretability; humans are no longer capable of tracing and understanding how decisions are being ma…

    Submitted 12 May, 2020; originally announced May 2020.

  26. TNT-KID: Transformer-based Neural Tagger for Keyword Identification

    Authors: Matej Martinc, Blaž Škrlj, Senja Pollak

    Abstract: With growing amounts of available textual data, development of algorithms capable of automatic analysis, categorization and summarization of these data has become a necessity. In this research we present a novel algorithm for keyword identification, i.e., an extraction of one or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword IDen…

    Submitted 30 November, 2021; v1 submitted 20 March, 2020; originally announced March 2020.

    Comments: Accepted to Natural Language Engineering journal

    Journal ref: Martinc, M., Škrlj, B., & Pollak, S. (2021). TNT-KID: Transformer-based neural tagger for keyword identification. Natural Language Engineering, 1-40. doi:10.1017/S1351324921000127

  27. arXiv:1912.05320  [pdf, other]

    cs.CL

    CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

    Authors: Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik

    Abstract: State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous…

    Submitted 29 October, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    ACM Class: I.2.7

    Journal ref: Proceedings of the 12th Language Resources and Evaluation Conference (2020) 5878-5886

  28. arXiv:1912.01072  [pdf, other]

    cs.CL

    Leveraging Contextual Embeddings for Detecting Diachronic Semantic Shift

    Authors: Matej Martinc, Petra Kralj Novak, Senja Pollak

    Abstract: We propose a new method that leverages contextual embeddings for the task of diachronic semantic shift detection by generating time specific word representations from BERT embeddings. The results of our experiments in the domain specific LiverpoolFC corpus suggest that the proposed method has performance comparable to the current state-of-the-art without requiring any time consuming domain adaptat…

    Submitted 5 March, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Comments: Accepted to Language Resources and Evaluation (LREC 2020)

  29. arXiv:1908.10623  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Emotion Recognition in Low-Resource Settings: An Evaluation of Automatic Feature Selection Methods

    Authors: Fasih Haider, Senja Pollak, Pierre Albert, Saturnino Luz

    Abstract: Research in automatic affect recognition has seldom addressed the issue of computational resource utilization. With the advent of ambient intelligence technology which employs a variety of low-power, resource-constrained devices, this issue is increasingly gaining interest. This is especially the case in the context of health and elderly care technologies, where interventions may rely on monitorin…

    Submitted 29 May, 2020; v1 submitted 28 August, 2019; originally announced August 2019.

  30. Supervised and Unsupervised Neural Approaches to Text Readability

    Authors: Matej Martinc, Senja Pollak, Marko Robnik-Šikonja

    Abstract: We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages and allows adaptation…

    Submitted 11 March, 2021; v1 submitted 26 July, 2019; originally announced July 2019.

    Comments: 39 pages, published in Computational Linguistic Journal

  31. Language comparison via network topology

    Authors: Blaž Škrlj, Senja Pollak

    Abstract: Modeling relations between languages can offer understanding of language characteristics and uncover similarities and differences between languages. Automated methods applied to large textual corpora can be seen as opportunities for novel statistical studies of language development over time, as well as for improving cross-lingual natural language processing techniques. In this work, we first prop…

    Submitted 23 December, 2019; v1 submitted 16 July, 2019; originally announced July 2019.

  32. RaKUn: Rank-based Keyword extraction via Unsupervised learning and Meta vertex aggregation

    Authors: Blaž Škrlj, Andraž Repar, Senja Pollak

    Abstract: Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text can be used to efficiently identify and rank keywords. Introducing meta vertices (aggregates of existing vertices) and…

    Submitted 11 November, 2019; v1 submitted 15 July, 2019; originally announced July 2019.

    Comments: The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-31372-2_26

    Journal ref: Statistical Language and Speech Processing 2019 Proceedings

  33. tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification

    Authors: Blaž Škrlj, Matej Martinc, Jan Kralj, Nada Lavrač, Senja Pollak

    Abstract: The use of background knowledge is largely unexploited in text classification tasks. This paper explores word taxonomies as means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short text classification problems: pre…

    Submitted 23 April, 2020; v1 submitted 1 February, 2019; originally announced February 2019.

    Comments: Accepted at CSL journal

  34. Persistent Homology in Sparse Regression and its Application to Brain Morphometry

    Authors: Moo K. Chung, Jamie L. Hanson, Jieping Ye, Richard J. Davidson, Seth D. Pollak

    Abstract: Sparse systems are usually parameterized by a tuning parameter that determines the sparsity of the system. How to choose the right tuning parameter is a fundamental and difficult problem in learning the sparse system. In this paper, by treating the tuning parameter as an additional dimension, persistent homological structures over the parameter space is introduced and explored. The structures…

    Submitted 9 March, 2015; v1 submitted 30 August, 2014; originally announced September 2014.

    Comments: submitted to IEEE Transactions on Medical Imaging

    Journal ref: IEEE Transactions on Medical Imaging 2015 34:1928-1939