Multi-Label Adaptive Batch Selection by Highlighting Hard and Imbalanced Samples

Authors: Ao Zhou, Bin Liu, ** Wang, Grigorios Tsoumakas

Abstract: Deep neural network models have demonstrated their effectiveness in classifying multi-label data from various domains. Typically, they employ a training mode that combines mini-batches with optimizers, where each sample is randomly selected with equal probability when constructing mini-batches. However, the intrinsic class imbalance in multi-label data may bias the model towards majority labels, since samples relevant to minority labels may be underrepresented in each mini-batch. Meanwhile, during the training process, we observe that instances associated with minority labels tend to induce greater losses. Existing heuristic batch selection methods, such as priority selection of samples with high contribution to the objective function, i.e., samples with high loss, have been proven to accelerate convergence while reducing the loss and test error in single-label data. However, batch selection methods have not yet been applied and validated in multi-label data. In this study, we introduce a simple yet effective adaptive batch selection algorithm tailored to multi-label deep learning models. It adaptively selects each batch by prioritizing hard samples related to minority labels. A variant of our method also takes informative label correlations into consideration. Comprehensive experiments combining five multi-label deep learning models on thirteen benchmark datasets show that our method converges faster and performs better than random batch selection.

Submitted 26 March, 2024; originally announced March 2024.
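
The idea in the abstract above, prioritizing hard samples related to minority labels when building a mini-batch, can be illustrated with a minimal sketch. This is not the authors' algorithm: the weighting formula (loss times mean inverse label frequency), the `alpha` exponent, and the function names `selection_weights` and `select_batch` are all assumptions made for illustration.

```python
import random


def selection_weights(losses, labels, alpha=1.0):
    """Return per-sample sampling probabilities (hypothetical weighting).

    losses: list of per-sample loss values from the previous epoch.
    labels: list of binary label vectors, one per sample.
    alpha:  assumed exponent controlling how strongly rarity is emphasized.
    """
    n_samples = len(labels)
    n_labels = len(labels[0])
    # Fraction of training samples that carry each label.
    freq = [sum(row[j] for row in labels) / n_samples for j in range(n_labels)]
    weights = []
    for loss, row in zip(losses, labels):
        relevant = [j for j in range(n_labels) if row[j] == 1]
        # Rarity: mean inverse frequency of the sample's relevant labels.
        if relevant:
            rarity = sum(1.0 / freq[j] for j in relevant) / len(relevant)
        else:
            rarity = 1.0
        # Small constant keeps every sample selectable even at zero loss.
        weights.append((loss + 1e-12) * rarity ** alpha)
    total = sum(weights)
    return [w / total for w in weights]


def select_batch(losses, labels, batch_size, rng):
    """Draw sample indices with replacement according to the weights."""
    probs = selection_weights(losses, labels)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    toy_losses = [0.1, 0.1, 0.9]
    toy_labels = [[1, 0], [1, 0], [1, 1]]  # second label is the minority label
    print(selection_weights(toy_losses, toy_labels))
    print(select_batch(toy_losses, toy_labels, 4, random.Random(0)))
```

In this toy setting the third sample has both the highest loss and the only occurrence of the minority label, so it receives the largest selection probability. Sampling with replacement is a simplification chosen to keep the sketch short.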

arXiv:2312.05172 [pdf, other]

From Lengthy to Lucid: A Systematic Literature Review on NLP Techniques for Taming Long Sentences

Authors: Tatiana Passali, Efstathios Chatzikyriakidis, Stelios Andreadis, Thanos G. Stavropoulos, Anastasia Matonaki, Anestis Fachantidis, Grigorios Tsoumakas

Abstract: Long sentences have been a persistent issue in written communication for many years since they make it challenging for readers to grasp the main points or follow the initial intention of the writer. This survey, conducted using the PRISMA guidelines, systematically reviews two main strategies for addressing the issue of long sentences: a) sentence compression and b) sentence splitting. An increased trend of interest in this area has been observed since 2005, with significant growth after 2017. Current research is dominated by supervised approaches for both sentence compression and splitting. Yet, there is a considerable gap in weakly and self-supervised techniques, suggesting an opportunity for further research, especially in domains with limited data. In this survey, we categorize and group the most representative methods into a comprehensive taxonomy. We also conduct a comparative evaluation analysis of these methods on common sentence compression and splitting datasets. Finally, we discuss the challenges and limitations of current methods, providing valuable insights for future research directions. This survey is meant to serve as a comprehensive resource for addressing the complexities of long sentences. We aim to enable researchers to make further advancements in the field until long sentences are no longer a barrier to effective communication.

Submitted 8 December, 2023; originally announced December 2023.

Comments: Author's Version, Submitted to ACM CSUR

arXiv:2303.16506 [pdf, other]

Local Interpretability of Random Forests for Multi-Target Regression

Authors: Avraam Bardos, Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

Abstract: Multi-target regression is useful in a plethora of applications. Although random forest models perform well in these tasks, they are often difficult to interpret. Interpretability is crucial in machine learning, especially when it can directly impact human well-being. Although model-agnostic techniques exist for multi-target regression, specific techniques tailored to random forest models are not available. To address this issue, we propose a technique that provides rule-based interpretations for instances made by a random forest model for multi-target regression, influenced by a recent model-specific technique for random forest interpretability. The proposed technique was evaluated through extensive experiments and shown to offer competitive interpretations compared to state-of-the-art techniques.

Submitted 29 March, 2023; originally announced March 2023.

Comments: 8 pages, 1 figure, 2 tables, to be submitted to XAI conference 2023 as an extended abstract

ACM Class: I.2.0; I.2.6

arXiv:2302.13034 [pdf, other]

Does Noise Affect Housing Prices? A Case Study in the Urban Area of Thessaloniki

Authors: Georgios Kamtziridis, Dimitris Vrakas, Grigorios Tsoumakas

Abstract: Real estate markets depend on various methods to predict housing prices, including models that have been trained on datasets of residential or commercial properties. Most studies endeavor to create more accurate machine learning models by utilizing data such as basic property characteristics as well as urban features like distances from amenities and road accessibility. Even though environmental factors like noise pollution can potentially affect prices, the research around this topic is limited. One of the reasons is the lack of data. In this paper, we reconstruct and make publicly available a general purpose noise pollution dataset based on published studies conducted by the Hellenic Ministry of Environment and Energy for the city of Thessaloniki, Greece. Then, we train ensemble machine learning models, like XGBoost, on property data for different areas of Thessaloniki to investigate the way noise influences prices through interpretability evaluation techniques. Our study provides a new noise pollution dataset that not only demonstrates the impact noise has on housing prices, but also indicates that the influence of noise on prices significantly varies among different areas of the same city.

Submitted 25 February, 2023; originally announced February 2023.

Comments: 24 pages, 20 figures, 7 tables

arXiv:2301.09350 [pdf, other]

doi 10.1016/j.jbi.2023.104499

Large-scale investigation of weakly-supervised deep learning for the fine-grained semantic indexing of biomedical literature

Authors: Anastasios Nentidis, Thomas Chatzopoulos, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

Abstract: Objective: Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors with several related but distinct biomedical concepts often grouped together and treated as a single topic. This study proposes a new method for the automated refinement of subject annotations at the level of MeSH concepts. Methods: Lacking labelled data, we rely on weak supervision based on concept occurrence in the abstract of an article, which is also enhanced by dictionary-based heuristics. In addition, we investigate deep learning approaches, making design choices to tackle the particular challenges of this task. The new method is evaluated on a large-scale retrospective scenario, based on concepts that have been promoted to descriptors. Results: In our experiments concept occurrence was the strongest heuristic achieving a macro-F1 score of about 0.63 across several labels. The proposed method improved it further by more than 4pp. Conclusion: The results suggest that concept occurrence is a strong heuristic for refining the coarse-grained labels at the level of MeSH concepts and the proposed method improves it further.

Submitted 5 October, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

Comments: 26 pages, 5 figures, 4 tables. A more concise version

Journal ref: Journal of Biomedical Informatics, Volume 146, 2023, 104499, ISSN 1532-0464

arXiv:2212.03513 [pdf, other]

Truthful Meta-Explanations for Local Interpretability of Machine Learning Models

Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

Abstract: Automated Machine Learning-based systems' integration into a wide range of tasks has expanded as a result of their performance and speed. Although there are numerous advantages to employing ML-based systems, if they are not interpretable, they should not be used in critical, high-risk applications where human lives are at risk. To address this issue, researchers and businesses have been focusing on finding ways to improve the interpretability of complex ML systems, and several such methods have been developed. Indeed, there are so many developed techniques that it is difficult for practitioners to choose the best among them for their applications, even when using evaluation metrics. As a result, the demand for a selection tool, a meta-explanation technique based on a high-quality evaluation metric, is apparent. In this paper, we present a local meta-explanation technique which builds on top of the truthfulness metric, which is a faithfulness-based metric. We demonstrate the effectiveness of both the technique and the metric by concretely defining all the concepts and through experimentation.

Submitted 7 December, 2022; originally announced December 2022.

Comments: 22 pages, 5 figures, 9 tables, submitted to Applied Intelligence Journal

ACM Class: I.2.0; I.2.6

arXiv:2212.00543 [pdf, other]

Fine-Grained Selective Similarity Integration for Drug-Target Interaction Prediction

Authors: Bin Liu, ** Wang, Kaiwei Sun, Grigorios Tsoumakas

Abstract: The discovery of drug-target interactions (DTIs) is a pivotal process in pharmaceutical development. Computational approaches are a promising and efficient alternative to tedious and costly wet-lab experiments for predicting novel DTIs from numerous candidates. Recently, with the availability of abundant heterogeneous biological information from diverse data sources, computational methods have been able to leverage multiple drug and target similarities to boost the performance of DTI prediction. Similarity integration is an effective and flexible strategy to extract crucial information across complementary similarity views, providing a compressed input for any similarity-based DTI prediction model. However, existing similarity integration methods filter and fuse similarities from a global perspective, neglecting the utility of similarity views for each drug and target. In this study, we propose a Fine-Grained Selective similarity integration approach, called FGS, which employs a local interaction consistency-based weight matrix to capture and exploit the importance of similarities at a finer granularity in both similarity selection and combination steps. We evaluate FGS on five DTI prediction datasets under various prediction settings. Experimental results show that our method not only outperforms similarity integration competitors with comparable computational costs, but also achieves better prediction performance than state-of-the-art DTI prediction approaches by collaborating with conventional base models. Furthermore, case studies on the analysis of similarity weights and on the verification of novel predictions confirm the practical ability of FGS.

Submitted 21 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

arXiv:2211.09235 [pdf, other]

Artificial Disfluency Detection, Uh No, Disfluency Generation for the Masses

Authors: T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos, S. Vrochidis

Abstract: Existing approaches for disfluency detection typically require the existence of large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies that can be encountered in real-world scenarios. This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text. LARD can simulate all the different types of disfluencies (repetitions, replacements and restarts) based on the reparandum/interregnum annotation scheme. In addition, it incorporates contextual embeddings into the disfluency generation to produce realistic context-aware artificial disfluencies. Since the proposed method requires only fluent text, it can be used directly for training, bypassing the requirement of annotated disfluent data. Our empirical evaluation demonstrates that LARD can indeed be effectively used when no or only a few data are available. Furthermore, our detailed analysis suggests that the proposed method generates realistic disfluencies and increases the accuracy of existing disfluency detectors.

Submitted 16 November, 2022; originally announced November 2022.

Comments: 10 pages

arXiv:2209.10876 [pdf, other]

An Attention Matrix for Every Decision: Faithfulness-based Arbitration Among Multiple Attention-Based Interpretations of Transformers in Text Classification

Authors: Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

Abstract: Transformers are widely used in natural language processing, where they consistently achieve state-of-the-art performance. This is mainly due to their attention-based architecture, which allows them to model rich linguistic relations between (sub)words. However, transformers are difficult to interpret. Being able to provide reasoning for its decisions is an important property for a model in domains where human lives are affected. With transformers finding wide use in such fields, the need for interpretability techniques tailored to them arises. We propose a new technique that selects the most faithful attention-based interpretation among the several ones that can be obtained by combining different head, layer and matrix operations. In addition, two variations are introduced towards (i) reducing the computational complexity, thus being faster and friendlier to the environment, and (ii) enhancing the performance in multi-label data. We further propose a new faithfulness metric that is more suitable for transformer models and exhibits high correlation with the area under the precision-recall curve based on ground truth rationales. We validate the utility of our contributions with a series of quantitative and qualitative experiments on seven datasets.

Submitted 28 November, 2022; v1 submitted 22 September, 2022; originally announced September 2022.

Comments: 16 pages, 7 figures, 5 tables, Submitted to DAMI Journal (ECMLPKDD2023 Special Issue)

arXiv:2207.01994 [pdf, other]

doi 10.1007/978-3-031-23618-1_25

Local Multi-Label Explanations for Random Forest

Authors: Nikolaos Mylonas, Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

Abstract: Multi-label classification is a challenging task, particularly in domains where the number of labels to be predicted is large. Deep neural networks are often effective at multi-label classification of images and textual data. When dealing with tabular data, however, conventional machine learning algorithms, such as tree ensembles, appear to outperform the competition. Random forest, being a popular ensemble algorithm, has found use in a wide range of real-world problems. Such problems include fraud detection in the financial domain, crime hotspot detection in the legal sector, and in the biomedical field, disease probability prediction when patient records are accessible. Since they have an impact on people's lives, these domains usually require decision-making systems to be explainable. Random forest falls short on this property, especially when a large number of tree predictors are used. This issue was addressed in recent work named LionForests, regarding single-label classification and regression. In this work, we adapt this technique to multi-label classification problems, by employing three different strategies regarding the labels that the explanation covers. Finally, we provide a set of qualitative and quantitative experiments to assess the efficacy of this approach.

Submitted 5 July, 2022; originally announced July 2022.

Comments: 11 pages, 1 figure, 8 tables, submitted to XKDD (workshop of ECML PKDD 2022)

ACM Class: I.2.0; I.2.6

arXiv:2206.04317 [pdf, other]

Topic-Controllable Summarization: Topic-Aware Evaluation and Transformer Methods

Authors: Tatiana Passali, Grigorios Tsoumakas

Abstract: Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. For example, the majority of existing methods are built upon recurrent architectures, which can significantly limit their performance compared to more recent Transformer-based architectures, while they also require modifications to the model's architecture for controlling the topic. At the same time, there is currently no established evaluation metric designed specifically for topic-controllable summarization. This work proposes a new topic-oriented evaluation measure to automatically evaluate the generated summaries based on the topic affinity between the generated summary and the desired topic. The reliability of the proposed measure is demonstrated through appropriately designed human evaluation. In addition, we adapt topic embeddings to work with powerful Transformer architectures and propose a novel and efficient approach for guiding the summary generation through control tokens. Experimental results reveal that control tokens can achieve better performance compared to more complicated embedding-based approaches while also being significantly faster.

Submitted 17 April, 2024; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: Accepted at LREC-COLING 2024

arXiv:2204.14012 [pdf, other]

Local Explanation of Dimensionality Reduction

Authors: Avraam Bardos, Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

Abstract: Dimensionality reduction (DR) is a popular method for preparing and analyzing high-dimensional data. Reduced data representations are less computationally intensive and easier to manage and visualize, while retaining a significant percentage of their original information. Aside from these advantages, these reduced representations can be difficult or impossible to interpret in most circumstances, especially when the DR approach does not provide further information about which features of the original space led to their construction. This problem is addressed by Interpretable Machine Learning, a subfield of Explainable Artificial Intelligence that addresses the opacity of machine learning models. However, current research on Interpretable Machine Learning has been focused on supervised tasks, leaving unsupervised tasks like Dimensionality Reduction unexplored. In this paper, we introduce LXDR, a technique capable of providing local interpretations of the output of DR techniques. Experimental results and two LXDR use case examples are presented to evaluate its usefulness.

Submitted 29 April, 2022; originally announced April 2022.

Comments: 13 Pages, 12 Figures, 6 Tables, Submitted to SETN2022

arXiv:2201.09508 [pdf, other]

doi 10.1093/bib/bbac353

Multiple Similarity Drug-Target Interaction Prediction with Random Walks and Matrix Factorization

Authors: Bin Liu, Dimitrios Papadopoulos, Fragkiskos D. Malliaros, Grigorios Tsoumakas, Apostolos N. Papadopoulos

Abstract: The discovery of drug-target interactions (DTIs) is a very promising area of research with great potential. The accurate identification of reliable interactions among drugs and proteins via computational methods, which typically leverage heterogeneous information retrieved from diverse data sources, can boost the development of effective pharmaceuticals. Although random walk and matrix factorization techniques are widely used in DTI prediction, they have several limitations. Random walk-based embedding generation is usually conducted in an unsupervised manner, while the linear similarity combination in matrix factorization distorts individual insights offered by different views. To tackle these issues, we take a multi-layered network approach to handle diverse drug and target similarities, and propose a novel optimization framework, called Multiple similarity DeepWalk-based Matrix Factorization (MDMF), for DTI prediction. The framework unifies embedding generation and interaction prediction, learning vector representations of drugs and targets that not only retain higher-order proximity across all hyper-layers and layer-specific local invariance, but also approximate the interactions with their inner product. Furthermore, we develop an ensemble method (MDMF2A) that integrates two instantiations of the MDMF model, optimizing the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUC) respectively. The empirical study on real-world DTI datasets shows that our method achieves statistically significant improvement over current state-of-the-art approaches in four different settings. Moreover, the validation of highly ranked non-interacting pairs also demonstrates the potential of MDMF2A to discover novel DTIs.

Submitted 8 August, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Journal ref: Briefings in Bioinformatics, Volume 23, Issue 5, 2022

arXiv:2201.05041 [pdf, other]

LARD: Large-scale Artificial Disfluency Generation

Authors: T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos, S. Vrochidis

Abstract: Disfluency detection is a critical task in real-time dialogue systems. However, despite its importance, it remains a relatively unexplored field, mainly due to the lack of appropriate datasets. At the same time, existing datasets suffer from various issues, including class imbalance, which can significantly affect the performance of the model on rare classes, as demonstrated in this paper. To this end, we propose LARD, a method for generating complex and realistic artificial disfluencies with little effort. The proposed method can handle three of the most common types of disfluencies: repetitions, replacements and restarts. In addition, we release a new large-scale dataset with disfluencies that can be used on four different tasks: disfluency detection, classification, extraction and correction. Experimental results on the LARD dataset demonstrate that the data produced by the proposed method can be effectively used for detecting and removing disfluencies, while also addressing limitations of existing datasets.

Submitted 3 May, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

Comments: Accepted at LREC 2022

arXiv:2110.04480 [pdf, other]

Bayesian Active Summarization

Authors: Alexios Gidiotis, Grigorios Tsoumakas

Abstract: Bayesian Active Learning has had a significant impact on various NLP problems, but nevertheless its application to text summarization has been explored very little. We introduce Bayesian Active Summarization (BAS), as a method of combining active learning methods with state-of-the-art summarization models. Our findings suggest that BAS achieves better and more robust performance, compared to random selection, particularly for small and very small data annotation budgets. Using BAS we showcase that it is possible to leverage large summarization models to effectively solve real-world problems with very limited annotated data.

Submitted 9 October, 2021; originally announced October 2021.

arXiv:2107.03825 [pdf, other]

Short-term Renewable Energy Forecasting in Greece using Prophet Decomposition and Tree-based Ensembles

Authors: Argyrios Vartholomaios, Stamatis Karlos, Eleftherios Kouloumpris, Grigorios Tsoumakas

Abstract: Energy production using renewable sources exhibits inherent uncertainties due to their intermittent nature. Nevertheless, the unified European energy market promotes the increasing penetration of renewable energy sources (RES) by the regional energy system operators. Consequently, RES forecasting can assist in the integration of these volatile energy sources, since it leads to higher reliability and reduced ancillary operational costs for power systems. This paper presents a new dataset for solar and wind energy generation forecast in Greece and introduces a feature engineering pipeline that enriches the dimensional space of the dataset. In addition, we propose a novel method that utilizes the innovative Prophet model, an end-to-end forecasting tool that considers several kinds of nonlinear trends in decomposing the energy time series before a tree-based ensemble provides short-term predictions. The performance of the system is measured through representative evaluation metrics, and by estimating the model's generalization under an industry-provided scheme of absolute error thresholds. The proposed hybrid model competes with baseline persistence models, tree-based regression ensembles, and the Prophet model, managing to outperform them, presenting both lower error rates and more favorable error distribution.

Submitted 8 July, 2021; originally announced July 2021.

Comments: 11 pages, 7 figures

arXiv:2106.00302 [pdf, other]

Harvesting the Public MeSH Note field

Authors: Anastasios Nentidis, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

Abstract: In this document, we report an analysis of the Public MeSH Note field of the new descriptors introduced in the MeSH thesaurus between 2006 and 2020. The aim of this analysis was to extract information about the previous status of these new descriptors as Supplementary Concept Records. The Public MeSH Note field contains information in semi-structured text, meant to be read by humans. Therefore, we adopted a semi-automated approach, based on regular expressions, to extract information from it. In the large majority of cases, we managed to minimize the required manual effort for extracting the previous state of a new descriptor as a Supplementary Concept Record. The source code for this analysis is openly available on GitHub.

Submitted 1 June, 2021; originally announced June 2021.

Comments: 3 pages, 1 figure, 1 table. Technical report

arXiv:2105.10155 [pdf, other]

Should We Trust This Summary? Bayesian Abstractive Summarization to The Rescue

Authors: Alexios Gidiotis, Grigorios Tsoumakas

Abstract: We explore the notion of uncertainty in the context of modern abstractive summarization models, using the tools of Bayesian Deep Learning. Our approach approximates Bayesian inference by first extending state-of-the-art summarization models with Monte Carlo dropout and then using them to perform multiple stochastic forward passes. Based on Bayesian inference we are able to effectively quantify uncertainty at prediction time. Having a reliable uncertainty measure, we can improve the experience of the end user by filtering out generated summaries of high uncertainty. Furthermore, uncertainty estimation could be used as a criterion for selecting samples for annotation, and can be paired nicely with active learning and human-in-the-loop approaches. Finally, Bayesian inference enables us to find a Bayesian summary which performs better than a deterministic one and is more robust to uncertainty. In practice, we show that our Variational Bayesian equivalents of BART and PEGASUS can outperform their deterministic counterparts on multiple benchmark datasets.

Submitted 3 May, 2022; v1 submitted 21 May, 2021; originally announced May 2021.
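
The Monte Carlo dropout procedure summarized in the abstract above (dropout kept active at prediction time, several stochastic forward passes, then a spread measure over the outputs) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a toy linear scorer stands in for a summarization model, and the names `stochastic_forward` and `mc_dropout_predict` are hypothetical rather than taken from the paper.

```python
import random
import statistics


def stochastic_forward(weights, features, drop_prob, rng):
    """One forward pass of a toy linear model with dropout left switched on."""
    keep_prob = 1.0 - drop_prob
    total = 0.0
    for w, x in zip(weights, features):
        # Each input unit survives with probability keep_prob (inverted dropout scaling).
        if rng.random() < keep_prob:
            total += w * x / keep_prob
    return total


def mc_dropout_predict(weights, features, drop_prob=0.2, passes=200, seed=0):
    """Run several stochastic passes; return (mean output, standard deviation).

    The standard deviation serves as a simple uncertainty estimate.
    """
    rng = random.Random(seed)
    outputs = [stochastic_forward(weights, features, drop_prob, rng)
               for _ in range(passes)]
    return statistics.mean(outputs), statistics.pstdev(outputs)


if __name__ == "__main__":
    mean, spread = mc_dropout_predict([1.0, 2.0], [3.0, 4.0], drop_prob=0.5)
    print(f"mean={mean:.3f} uncertainty={spread:.3f}")
```

Outputs whose spread exceeds a chosen threshold could then be filtered out, in the spirit of the filtering step the abstract mentions; the threshold itself would have to be tuned per task.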

arXiv:2105.05527 [pdf, other]

Self-citation Analysis using Sentence Embeddings

Authors: Athanasios Lagopoulos, Grigorios Tsoumakas

Abstract: Citation indexes and metrics are intended to measure scientific innovation and quality for researchers, journals, and institutions. However, those metrics are often prone to abuse and manipulation by excessive and unethical self-citations induced by authors, reviewers, editors, or journals. Whether there are legitimate reasons for self-citations is normally determined during the review process, where the participating parties may have intrinsic incentives, rendering the legitimacy of self-citations, after publication, questionable. In this paper, we conduct a large-scale analysis of journal self-citations while taking into consideration the similarity between a publication and its references. Specifically, we look into PubMed Central articles published since 1990 and compute similarities of article-reference pairs using sentence embeddings. We examine journal self-citations with an aim to distinguish between justifiable and unethical self-citations.

Submitted 12 May, 2021; originally announced May 2021.
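
The article-reference similarity computation mentioned in the abstract above can be sketched as a cosine similarity between embedding vectors. This is only an illustrative sketch: the embeddings are assumed to be precomputed lists of floats, and the function `low_similarity_self_citations` and its `threshold` parameter are hypothetical and not described in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def low_similarity_self_citations(article_vec, references, threshold=0.3):
    """Return ids of self-citations whose similarity to the article is below threshold.

    `references` is a list of (ref_id, embedding, is_self_citation) tuples.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    flagged = []
    for ref_id, ref_vec, is_self in references:
        if is_self and cosine_similarity(article_vec, ref_vec) < threshold:
            flagged.append(ref_id)
    return flagged


if __name__ == "__main__":
    refs = [("r1", [1.0, 0.0], True), ("r2", [0.0, 1.0], True), ("r3", [0.0, 1.0], False)]
    print(low_similarity_self_citations([1.0, 0.0], refs))
```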

arXiv:2105.01545 [pdf, other]

Optimizing Area Under the Curve Measures via Matrix Factorization for Predicting Drug-Target Interaction with Multiple Similarities

Authors: Bin Liu, Grigorios Tsoumakas

Abstract: In drug discovery, identifying drug-target interactions (DTIs) via experimental approaches is a tedious and expensive procedure. Computational methods efficiently predict DTIs and recommend a small part of potential interacting pairs for further experimental confirmation, accelerating the drug discovery process. Although it has been shown that fusing heterogeneous drug and target similarities can improve the prediction ability, the existing similarity combination methods ignore the interaction consistency for neighbour entities which is more crucial for the DTI prediction model. Furthermore, area under the precision-recall curve (AUPR) that emphasizes the accuracy of top-ranked pairs and area under the receiver operating characteristic curve (AUC) that heavily punishes the existence of low ranked interacting pairs are two widely used evaluation metrics in DTI prediction. However, the two metrics are seldom considered as losses within existing DTI prediction methods. This paper first proposes two matrix factorization (MF) methods that optimize AUPR and AUC using convex surrogate losses respectively, and then develops an ensemble MF approach that takes advantage of the two area under the curve metrics by combining the two single metric based MF models. All three proposed approaches incorporate a novel local interaction consistency aware similarity interaction method to generate fused drug and target similarities that preserve vital information from the more reliable view.
Experimental results over five datasets under different prediction settings show that the proposed methods outperform various competitors in terms of the metric(s) they optimize. In addition, the validation on the top ranked novel predictions confirms the ability of our methods to discover potential new DTIs.

Submitted 14 January, 2022; v1 submitted 1 May, 2021; originally announced May 2021.

arXiv:2104.06057 [pdf, other]

doi 10.1007/s10489-022-03351-4

LioNets: A Neural-Specific Local Interpretation Technique Exploiting Penultimate Layer Information

Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

Abstract: Artificial Intelligence (AI) has a tremendous impact on the unexpected growth of technology in almost every aspect. AI-powered systems are monitoring and deciding about sensitive economic and societal issues. The future is towards automation, and it must not be prevented. However, this is a conflicting viewpoint for a lot of people, due to the fear of uncontrollable AI systems. This concern could be reasonable if it was originating from considerations associated with social issues, like gender-biased or obscure decision-making systems. Explainable AI (XAI) has recently been treated as a huge step towards reliable systems, enhancing people's trust in AI. Interpretable machine learning (IML), a subfield of XAI, is also an urgent topic of research. This paper presents a small but significant contribution to the IML community, focusing on a local-based, neural-specific interpretation process applied to textual and time-series data. The proposed methodology introduces new approaches to the presentation of feature-importance-based interpretations, as well as the production of counterfactual words on textual datasets. Eventually, an improved evaluation metric is introduced for the assessment of interpretation techniques, which supports an extensive set of qualitative and quantitative experiments.

Submitted 13 April, 2021; originally announced April 2021.

Comments: 23 pages, 22 figures, 2 tables, submitted to Information Fusion Journal

ACM Class: I.2.0; I.2.6; I.2.7

arXiv:2104.06040 [pdf, other]

Conclusive Local Interpretation Rules for Random Forests

Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

Abstract: In critical situations involving discrimination, gender inequality, economic damage, and even the possibility of casualties, machine learning models must be able to provide clear interpretations for their decisions. Otherwise, their obscure decision-making processes can lead to socioethical issues as they interfere with people's lives. In the aforementioned sectors, random forest algorithms thrive; thus, their ability to explain themselves is an obvious requirement. In this paper, we present LionForests, which relies on a preliminary work of ours. LionForests is a random forest-specific interpretation technique, which provides rules as explanations. It is applicable from binary classification tasks to multi-class classification and regression tasks, and it is supported by a stable theoretical background. Experimentation, including sensitivity analysis and comparison with state-of-the-art techniques, is also performed to demonstrate the efficacy of our contribution. Finally, we highlight a unique property of LionForests, called conclusiveness, that provides interpretation validity and distinguishes it from previous techniques.

Submitted 13 April, 2021; originally announced April 2021.

Comments: 32 pages, 31 figures, 4 Tables, submitted to data mining and knowledge discovery journal

ACM Class: I.2.0; I.2.6

arXiv:2103.17003 [pdf, other]

doi 10.24963/ijcai.2021/713

VisioRed: A Visualisation Tool for Interpretable Predictive Maintenance

Authors: Spyridon Paraschos, Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

Abstract: The use of machine learning rapidly increases in high-risk scenarios where decisions are required, for example in healthcare or industrial monitoring equipment. In crucial situations, a model that can offer meaningful explanations of its decision-making is essential. In industrial facilities, the equipment's well-timed maintenance is vital to ensure continuous operation to prevent money loss. Using machine learning, predictive and prescriptive maintenance attempt to anticipate and prevent eventual system failures. This paper introduces a visualisation tool incorporating interpretations to display information derived from predictive maintenance models, trained on time-series data.

Submitted 14 April, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

Comments: 4 pages, 2 figures, Submitted to IJCAI

ACM Class: I.2.0; I.2.6; H.5.2

arXiv:2103.04156 [pdf, other]

Improving Zero-Shot Entity Retrieval through Effective Dense Representations

Authors: Eleni Partalidou, Despina Christou, Grigorios Tsoumakas

Abstract: Entity Linking (EL) seeks to align entity mentions in text to entries in a knowledge-base and is usually comprised of two phases: candidate generation and candidate ranking. While most methods focus on the latter, it is the candidate generation phase that sets an upper bound to both time and accuracy performance of the overall EL system. This work's contribution is a significant improvement in candidate generation which thus raises the performance threshold for EL, by generating candidates that include the gold entity in the least candidate set (top-K). We propose a simple approach that efficiently embeds mention-entity pairs in dense space through a BERT-based bi-encoder. Specifically, we extend (Wu et al., 2020) by introducing a new pooling function and incorporating entity type side-information. We achieve a new state-of-the-art 84.28% accuracy on top-50 candidates on the Zeshel dataset, compared to the previous 82.06% on the top-64 of (Wu et al., 2020). We report the results from extensive experimentation using our proposed model on both seen and unseen entity datasets. Our results suggest that our method could be a useful complement to existing EL approaches.

Submitted 6 March, 2021; originally announced March 2021.

Comments: 8 pages, 2 figures

ACM Class: I.2.7; H.3.3
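
The top-K candidate generation accuracy reported in the abstract above can be illustrated with a small sketch: mentions and entities are scored by the dot product of their dense vectors, and a mention counts as a hit when its gold entity appears among the K best-scoring entities. The embeddings are assumed to be precomputed (no BERT encoder is involved here), and the helper names `top_k_candidates` and `recall_at_k` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_candidates(mention_vec, entity_vecs, k):
    """Return the ids of the k entities with the highest dot-product score."""
    ranked = sorted(entity_vecs.items(),
                    key=lambda item: dot(mention_vec, item[1]),
                    reverse=True)
    return [entity_id for entity_id, _ in ranked[:k]]


def recall_at_k(mentions, entity_vecs, k):
    """Fraction of (mention_vec, gold_id) pairs whose gold entity is in the top k."""
    if not mentions:
        return 0.0
    hits = sum(1 for vec, gold in mentions
               if gold in top_k_candidates(vec, entity_vecs, k))
    return hits / len(mentions)


if __name__ == "__main__":
    entities = {"e1": [1.0, 0.0], "e2": [0.0, 1.0], "e3": [0.5, 0.5]}
    mentions = [([1.0, 0.0], "e1"), ([0.0, 1.0], "e1")]
    print(recall_at_k(mentions, entities, k=1))
```

Because the ranking stage can only choose among the generated candidates, this recall value bounds the accuracy of the full entity-linking pipeline, which is the point the abstract makes about candidate generation.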

arXiv:2102.01156 [pdf, other]

Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embeddings

Authors: Despina Christou, Grigorios Tsoumakas

Abstract: Distantly-supervised relation extraction (RE) is an effective method to scale RE to large corpora but suffers from noisy labels. Existing approaches try to alleviate noise through multi-instance learning and by providing additional information, but manage to recognize mainly the top frequent relations, neglecting those in the long-tail. We propose REDSandT (Relation Extraction with Distant Supervision and Transformers), a novel distantly-supervised transformer-based RE method, that manages to capture a wider set of relations through highly informative instance and label embeddings for RE, by exploiting BERT's pre-trained model, and the relationship between labels and entities, respectively. We guide REDSandT to focus solely on relational tokens by fine-tuning BERT on a structured input, including the sub-tree connecting an entity pair and the entities' types. Using the extracted informative vectors, we shape label embeddings, which we also use as an attention mechanism over instances to further reduce noise. Finally, we represent sentences by concatenating relation and instance embeddings. Experiments on the NYT-10 dataset show that REDSandT captures a broader set of relations with higher confidence, achieving state-of-the-art AUC (0.424).

Submitted 1 February, 2021; originally announced February 2021.

Comments: 10 pages, 4 figures

ACM Class: I.2.7; H.3.3

arXiv:2101.08293 [pdf, other]

doi 10.1007/s00799-021-00304-z

What is all this new MeSH about? Exploring the semantic provenance of new descriptors in the MeSH thesaurus

Authors: Anastasios Nentidis, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

Abstract: The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual provenance of these new descriptors. In particular, we investigate whether such new descriptors have been previously covered by older descriptors and what their current relation to them is. To this end, we propose a framework to categorize new descriptors based on their current relation to older descriptors. Based on the proposed classification scheme, we quantify, analyse and present the different types of new descriptors introduced in MeSH during the last fifteen years. The results show that only about 25% of new MeSH descriptors correspond to new emerging concepts, whereas the rest were previously covered by one or more existing descriptors, either implicitly or explicitly. Most of them were covered by a single existing descriptor and they usually end up as descendants of it in the current hierarchy, gradually leading towards a more fine-grained MeSH vocabulary. These insights about the dynamics of the thesaurus are useful for the retrospective study of scientific articles annotated with MeSH, but could also be used to inform the policy of updating the thesaurus in the future.

Submitted 27 July, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

Comments: 18 pages, 14 figures, 2 tables

arXiv:2012.12325 [pdf, other]

Drug-Target Interaction Prediction via an Ensemble of Weighted Nearest Neighbors with Interaction Recovery

Authors: Bin Liu, Konstantinos Pliakos, Celine Vens, Grigorios Tsoumakas

Abstract: Predicting drug-target interactions (DTI) via reliable computational methods is an effective and efficient way to mitigate the enormous costs and time of the drug discovery process. Structure-based drug similarities and sequence-based target protein similarities are the commonly used information for DTI prediction. Among numerous computational methods, neighborhood-based chemogenomic approaches that leverage drug and target similarities to perform predictions directly are simple but promising ones. However, existing similarity-based methods need to be re-trained to predict interactions for any new drugs or targets and cannot directly perform predictions for new drugs, new targets, and new drug-target pairs. Furthermore, a large amount of missing (undetected) interactions in current DTI datasets hinders most DTI prediction methods. To address these issues, we propose a new method denoted as Weighted k-Nearest Neighbor with Interaction Recovery (WkNNIR). Not only can WkNNIR estimate interactions of any new drugs and/or new targets without any need for re-training, but it can also recover missing interactions (false negatives). In addition, WkNNIR exploits local imbalance to promote the influence of more reliable similarities on the interaction recovery and prediction processes. We also propose a series of ensemble methods that employ diverse sampling strategies and could be coupled with WkNNIR as well as any other DTI prediction method to improve performance.
Experimental results over five benchmark datasets demonstrate the effectiveness of our approaches in predicting drug-target interactions. Lastly, we confirm the practical prediction ability of proposed methods to discover reliable interactions that were not reported in the original benchmark datasets.

Submitted 9 July, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

arXiv:2011.09752 [pdf, other]

From Protocol to Screening: A Hybrid Learning Approach for Technology-Assisted Systematic Literature Reviews

Authors: Athanasios Lagopoulos, Grigorios Tsoumakas

Abstract: In the medical domain, a Systematic Literature Review (SLR) attempts to collect all empirical evidence that fits pre-specified eligibility criteria, in order to answer a specific research question. The process of preparing an SLR consists of multiple tasks that are labor-intensive and time-consuming, involving large monetary costs. Technology-assisted review (TAR) methods automate the different processes of creating an SLR and they are particularly focused on reducing the burden of screening for reviewers. We present a novel method for TAR that implements a full pipeline from the research protocol to the screening of the relevant papers. Our pipeline overcomes the need for a Boolean query constructed by specialists and consists of three different components: the primary retrieval engine, the inter-review ranker and the intra-review ranker, combining learning-to-rank techniques with a relevance feedback method. In addition, we contribute an updated version of the Task 2 of the CLEF 2019 eHealth Lab dataset, which we make publicly available. Empirical results on this dataset show that our approach can achieve state-of-the-art results.

Submitted 19 November, 2020; originally announced November 2020.

arXiv:2010.07650 [pdf, other]

Altruist: Argumentative Explanations through Local Interpretations of Predictive Models

Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

Abstract: Explainable AI is an emerging field providing solutions for acquiring insights into automated systems' rationale. It has been put on the AI map by suggesting ways to tackle key ethical and societal issues. Existing explanation techniques are often not comprehensible to the end user. Lack of evaluation and selection criteria also makes it difficult for the end user to choose the most suitable technique. In this study, we combine logic-based argumentation with Interpretable Machine Learning, introducing a preliminary meta-explanation methodology that identifies the truthful parts of feature importance oriented interpretations. This approach, in addition to being used as a meta-explanation technique, can be used as an evaluation or selection tool for multiple feature importance techniques. Experimentation strongly indicates that an ensemble of multiple interpretation techniques yields considerably more truthful explanations.

Submitted 29 April, 2022; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: Submitted to SETN2022

ACM Class: I.2.0; I.2.6

arXiv:2008.09513 [pdf, other]

Keywords lie far from the mean of all words in local vector space

Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas, Apostolos N. Papadopoulos

Abstract: Keyword extraction is an important document process that aims at finding a small set of terms that concisely describe a document's topics. The most popular state-of-the-art unsupervised approaches belong to the family of the graph-based methods that build a graph-of-words and use various centrality measures to score the nodes (candidate keywords). In this work, we follow a different path to detect the keywords from a text document by modeling the main distribution of the document's words using local word vector representations. Then, we rank the candidates based on their position in the text and the distance between the corresponding local vectors and the main distribution's center. We confirm the high performance of our approach compared to strong baselines and state-of-the-art unsupervised keyword extraction methods, through an extended experimental study, investigating the properties of the local representations.

Submitted 21 August, 2020; originally announced August 2020.
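
The distance-from-center ranking described in the abstract above can be illustrated with a simplified sketch. It makes several assumptions: the word vectors are given as plain lists of floats, the "main distribution's center" is approximated by the arithmetic mean, Euclidean distance is used, and the positional component of the score mentioned in the abstract is left out. The function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def rank_by_distance_from_mean(word_vectors):
    """Return words sorted by Euclidean distance from the mean vector, farthest first.

    `word_vectors` maps each candidate word to its local vector.
    """
    if not word_vectors:
        return []
    center = centroid(list(word_vectors.values()))

    def distance(vec):
        return math.sqrt(sum((a - c) ** 2 for a, c in zip(vec, center)))

    return sorted(word_vectors, key=lambda w: distance(word_vectors[w]), reverse=True)


if __name__ == "__main__":
    vectors = {"the": [0.0, 0.0], "of": [0.1, 0.0], "and": [0.0, 0.1], "graph": [3.0, 3.0]}
    print(rank_by_distance_from_mean(vectors)[0])
```

In this toy input the three function words cluster near the mean, so the content word ends up at the top of the ranking, matching the intuition in the paper's title.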

arXiv:2006.08328 [pdf, other]

doi 10.1007/s40747-021-00608-2

ETHOS: an Online Hate Speech Detection Dataset

Authors: Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, Grigorios Tsoumakas

Abstract: Online hate speech is a recent problem in our society that is rising at a steady pace by leveraging the vulnerabilities of the corresponding regimes that characterise most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of a posted multimedia context. Nowadays, giant corporations own platforms where millions of users log in every day, and protection from exposure to similar phenomena appears to be necessary in order to comply with the corresponding legislation and maintain a high level of service quality. A robust and reliable system for detecting and preventing the uploading of relevant content will have a significant impact on our digitally interconnected society. Several aspects of our daily lives are undeniably linked to our social profiles, making us vulnerable to abusive behaviours. As a result, the lack of accurate hate speech detection mechanisms would severely degrade the overall user experience, although its erroneous operation would pose many ethical concerns. In this paper, we present 'ETHOS', a textual dataset with two variants: binary and multi-label, based on YouTube and Reddit comments validated using the Figure-Eight crowdsourcing platform. Furthermore, we present the annotation protocol used to create this dataset: an active sampling procedure for balancing our data in relation to the various aspects defined.
Our key assumption is that, even gaining a small amount of labelled data from such a time-consuming process, we can guarantee hate speech occurrences in the examined material.

Submitted 6 July, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: 16 Pages, 3 Figures, 9 Tables, Submitted to the special issue on "Intelligent Systems for Safer Social Media" of Complex & Intelligent Systems

ACM Class: I.2.6; I.2.7; I.5.4; H.2.4

arXiv:2005.07638 [pdf, other]

doi 10.1016/j.ipm.2020.102282

Beyond MeSH: Fine-Grained Semantic Indexing of Biomedical Literature based on Weak Supervision

Authors: Anastasios Nentidis, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

Abstract: In this work, we propose a method for the automated refinement of subject annotations in biomedical literature at the level of concepts. Semantic indexing and search of biomedical articles in MEDLINE/PubMed are based on semantic subject annotations with MeSH descriptors that may correspond to several related but distinct biomedical concepts. Such semantic annotations do not adhere to the level of detail available in the domain knowledge and may not be sufficient to fulfil the information needs of experts in the domain. To this end, we propose a new method that uses weak supervision to train a concept annotator on the literature available for a particular disease. We test this method on the MeSH descriptors for two diseases: Alzheimer's Disease and Duchenne Muscular Dystrophy. The results indicate that concept-occurrence is a strong heuristic for automated subject annotation refinement and its use as weak supervision can lead to improved concept-level annotations. The fine-grained semantic annotations can enable more precise literature retrieval, sustain the semantic integration of subject annotations with other domain resources and ease the maintenance of consistent subject annotations, as new more detailed entries are added in the MeSH thesaurus over time.

Submitted 18 May, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: 36 pages, 8 figures; Dictionary-based baselines added and conclusions updated

Journal ref: Information Processing and Management 57 (2020) 102282

arXiv:2005.03240 [pdf, other]

Multi-Label Sampling based on Local Label Imbalance

Authors: Bin Liu, Konstantinos Blekas, Grigorios Tsoumakas

Abstract: Class imbalance is an inherent characteristic of multi-label data that hinders most multi-label learning methods. One efficient and flexible strategy to deal with this problem is to employ sampling techniques before training a multi-label learning model. Although existing multi-label sampling approaches alleviate the global imbalance of multi-label datasets, it is actually the imbalance level within the local neighbourhood of minority class examples that plays a key role in performance degradation. To address this issue, we propose a novel measure to assess the local label imbalance of multi-label datasets, as well as two multi-label sampling approaches based on the local label imbalance, namely MLSOL and MLUL. By considering all informative labels, MLSOL creates more diverse and better labeled synthetic instances for difficult examples, while MLUL eliminates instances that are harmful to their local region. Experimental results on 13 multi-label datasets demonstrate the effectiveness of the proposed measure and sampling approaches for a variety of evaluation metrics, particularly in the case of an ensemble of classifiers trained on repeated samples of the original data.

Submitted 19 May, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

Comments: arXiv admin note: text overlap with arXiv:1905.00609

arXiv:2004.06190 [pdf, other]

A Divide-and-Conquer Approach to the Summarization of Long Documents

Authors: Alexios Gidiotis, Grigorios Tsoumakas

Abstract: We present a novel divide-and-conquer method for the neural summarization of long documents. Our method exploits the discourse structure of the document and uses sentence similarity to split the problem into an ensemble of smaller summarization problems. In particular, we break a long document and its summary into multiple source-target pairs, which are used for training a model that learns to summarize each part of the document separately. These partial summaries are then combined in order to produce a final complete summary. With this approach we can decompose the problem of long document summarization into smaller and simpler problems, reducing computational complexity and creating more training examples, which at the same time contain less noise in the target summaries compared to the standard approach. We demonstrate that this approach paired with different summarization models, including sequence-to-sequence RNNs and Transformers, can lead to improved summarization performance. Our best models achieve results that are on par with the state-of-the-art in two publicly available datasets of academic articles.

Submitted 23 September, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

arXiv:1911.08780 [pdf, other]

LionForests: Local Interpretation of Random Forests

Authors: Ioannis Mollas, Nick Bassiliades, Ioannis Vlahavas, Grigorios Tsoumakas

Abstract: Towards a future where machine learning systems will integrate into every aspect of people's lives, researching methods to interpret such systems is necessary, instead of focusing exclusively on enhancing their performance. Enriching the trust between these systems and people will accelerate this integration process. Many medical and retail banking/finance applications use state-of-the-art machine learning techniques to predict certain aspects of new instances. Tree ensembles, like random forests, are widely acceptable solutions on these tasks, while at the same time they are avoided due to their black-box uninterpretable nature, creating an unreasonable paradox. In this paper, we provide a methodology for shedding light on the predictions of the misjudged family of tree ensemble algorithms. Using classic unsupervised learning techniques and an enhanced similarity metric, to wander among transparent trees inside a forest following breadcrumbs, the interpretable essence of tree ensembles arises. An interpretation provided by these systems using our approach, which we call "LionForests", can be a simple, comprehensive rule.

Submitted 23 July, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: 8 Pages, 4 Tables, 6 Figures, Submitted to NeHuAI-2020 Workshop of ECAI2020

ACM Class: I.2.0; I.2.6

Journal ref: Proceedings of the First International Workshop on New Foundations for Human-Centered AI (NeHuAI) co-located with 24th European Conference on Artificial Intelligence (ECAI 2020), http://ceur-ws.org/Vol-2659/ [p.17-24]

arXiv:1906.06566 [pdf, other]

doi 10.1007/978-3-030-43823-4_23

LioNets: Local Interpretation of Neural Networks through Penultimate Layer Decoding

Authors: Ioannis Mollas, Nikolaos Bassiliades, Grigorios Tsoumakas

Abstract: Technological breakthroughs on smart homes, self-driving cars, health care and robotic assistants, in addition to reinforced law regulations, have critically influenced academic research on explainable machine learning. A sufficient number of researchers have implemented ways to explain indifferently any black box model for classification tasks. A drawback of building agnostic explanators is that the neighbourhood generation process is universal and consequently does not guarantee true adjacency between the generated neighbours and the instance. This paper explores a methodology on providing explanations for a neural network's decisions, in a local scope, through a process that actively takes into consideration the neural network's architecture on creating an instance's neighbourhood, that assures the adjacency among the generated neighbours and the instance.

Submitted 8 August, 2019; v1 submitted 15 June, 2019; originally announced June 2019.

Comments: Submitted and accepted to AIMLAI-XKDD-ECMLPKDD19

ACM Class: I.2.0; I.2.6; I.2.7

arXiv:1905.07695 [pdf, ps, other]

Structured Summarization of Academic Publications

Authors: Alexios Gidiotis, Grigorios Tsoumakas

Abstract: We propose SUSIE, a novel summarization method that can work with state-of-the-art summarization models in order to produce structured scientific summaries for academic articles. We also created PMC-SA, a new dataset of academic publications, suitable for the task of structured summarization with neural networks. We apply SUSIE combined with three different summarization models on the new PMC-SA dataset and we show that the proposed method improves the performance of all models by as much as 4 ROUGE points.

Submitted 21 June, 2019; v1 submitted 19 May, 2019; originally announced May 2019.

arXiv:1905.05044 [pdf, other]

A Review of Keyphrase Extraction

Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas

Abstract: Keyphrase extraction is a textual information processing task concerned with the automatic extraction of representative and characteristic phrases from a document that express all the key aspects of its content. Keyphrases constitute a succinct conceptual summary of a document, which is very useful in digital information management systems for semantic indexing, faceted search, document clustering and classification. This article introduces keyphrase extraction, provides a well-structured review of the existing work, offers interesting insights on the different evaluation approaches, highlights open issues and presents a comparative experimental study of popular unsupervised techniques on five datasets.

Submitted 30 July, 2019; v1 submitted 13 May, 2019; originally announced May 2019.

Comments: author pre-print version

arXiv:1905.00609 [pdf, other]

Synthetic Oversampling of Multi-Label Data based on Local Label Distribution

Authors: Bin Liu, Grigorios Tsoumakas

Abstract: Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before training the classifier. Existing multi-label sampling methods alleviate the (global) imbalance of multi-label datasets. However, performance degradation is mainly due to rare subconcepts and overlapping of classes that could be analysed by looking at the local characteristics of the minority examples, rather than the imbalance of the whole dataset. We propose a new method for synthetic oversampling of multi-label data that focuses on local label distribution to generate more diverse and better labeled instances. Experimental results on 13 multi-label datasets demonstrate the effectiveness of the proposed approach in a variety of evaluation measures, particularly in the case of an ensemble of classifiers trained on repeated samples of the original data.

Submitted 20 June, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

Journal ref: ECML-PKDD 2019

arXiv:1808.03712 [pdf, other]

Unsupervised Keyphrase Extraction from Scientific Publications

Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas

Abstract: We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection. It starts by training word embeddings on the target document to capture semantic regularities among the words. It then uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics expressed by the dimensions of the learned vector representation. Candidate keyphrases only consist of words that are detected as outliers of this dominant distribution. Empirical results show that our approach outperforms state-of-the-art and recent unsupervised keyphrase extraction methods.

Submitted 12 July, 2020; v1 submitted 10 August, 2018; originally announced August 2018.

Comments: author pre-print version

arXiv:1807.11393 [pdf, other]

Making Classifier Chains Resilient to Class Imbalance

Authors: Bin Liu, Grigorios Tsoumakas

Abstract: Class imbalance is an intrinsic characteristic of multi-label data. Most of the labels in multi-label data sets are associated with a small number of training examples, much smaller compared to the size of the data set. Class imbalance poses a key challenge that plagues most multi-label learning methods. Ensemble of Classifier Chains (ECC), one of the most prominent multi-label learning methods, is no exception to this rule, as each of the binary models it builds is trained from all positive and negative examples of a label. To make ECC resilient to class imbalance, we first couple it with random undersampling. We then present two extensions of this basic approach, where we build a varying number of binary models per label and construct chains of different sizes, in order to improve the exploitation of majority examples with approximately the same computational budget. Experimental results on 16 multi-label datasets demonstrate the effectiveness of the proposed approaches in a variety of evaluation metrics.

Submitted 6 November, 2018; v1 submitted 30 July, 2018; originally announced July 2018.

arXiv:1711.05098 [pdf, other]

Web Robot Detection in Academic Publishing

Authors: Athanasios Lagopoulos, Grigorios Tsoumakas, Georgios Papadopoulos

Abstract: Recent industry reports confirm the rise of web robots, which comprise more than half of the total web traffic. They not only threaten the security, privacy and efficiency of the web but they also distort analytics and metrics, casting doubt on the veracity of the information being promoted. In the academic publishing domain, this can cause articles to be falsely presented as prominent and influential. In this paper, we present our approach on detecting web robots in academic publishing websites. We use different supervised learning algorithms with a variety of characteristics deriving from both the log files of the server and the content served by the website. Our approach relies on the assumption that human users will be interested in specific domains or articles, while web robots crawl a web library incoherently. We experiment with features adopted in previous studies with the addition of novel semantic characteristics which derive after performing a semantic analysis using the Latent Dirichlet Allocation (LDA) algorithm. Our real-world case study shows promising results, pinpointing the significance of semantic features in the web robot detection problem.

Submitted 14 November, 2017; originally announced November 2017.

arXiv:1710.07503 [pdf, other]

Local Word Vectors Guiding Keyphrase Extraction

Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas

Abstract: Automated keyphrase extraction is a fundamental textual information processing task concerned with the selection of representative phrases from a document that summarize its content. This work presents a novel unsupervised method for keyphrase extraction, whose main innovation is the use of local word embeddings (in particular GloVe vectors), i.e., embeddings trained from the single document under consideration. We argue that such local representations of words and keyphrases are able to accurately capture their semantics in the context of the document they are part of, and therefore can help in improving keyphrase extraction quality. Empirical results offer evidence that indeed local representations lead to better keyphrase extraction results compared to both embeddings trained on very large third corpora or larger corpora consisting of several documents of the same scientific field and to other state-of-the-art unsupervised keyphrase extraction methods.

Submitted 13 April, 2018; v1 submitted 20 October, 2017; originally announced October 2017.

Comments: author pre-print version

arXiv:1709.05480 [pdf, other]

Subset Labeled LDA for Large-Scale Multi-Label Classification

Authors: Yannis Papanikolaou, Grigorios Tsoumakas

Abstract: Labeled Latent Dirichlet Allocation (LLDA) is an extension of the standard unsupervised Latent Dirichlet Allocation (LDA) algorithm, to address multi-label learning tasks. Previous work has shown it to perform on par with other state-of-the-art multi-label methods. Nonetheless, with increasing label set sizes LLDA encounters scalability issues. In this work, we introduce Subset LLDA, a simple variant of the standard LLDA algorithm, that not only can effectively scale up to problems with hundreds of thousands of labels but also improves over the LLDA state-of-the-art. We conduct extensive experiments on eight data sets, with label set sizes ranging from hundreds to hundreds of thousands, comparing our proposed algorithm with the previously proposed LLDA algorithms (Prior--LDA, Dep--LDA), as well as the state of the art in extreme multi-label classification. The results show a steady advantage of our method over the other LLDA algorithms and competitive results compared to the extreme multi-label classification algorithms.

Submitted 16 September, 2017; originally announced September 2017.

arXiv:1704.05271 [pdf, other]

Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models

Authors: Yannis Papanikolaou, Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, Ioannis Vlahavas

Abstract: Background: In this paper we present the approaches and methods employed in order to deal with a large scale multi-label semantic indexing task of biomedical papers. This work was mainly implemented within the context of the BioASQ challenge of 2014. Methods: The main contribution of this work is a multi-label ensemble method that incorporates a McNemar statistical significance test in order to validate the combination of the constituent machine learning algorithms. Some secondary contributions include a study on the temporal aspects of the BioASQ corpus (observations apply also to the BioASQ's super-set, the PubMed articles collection) and the proper adaptation of the algorithms used to deal with this challenging classification task. Results: The ensemble method we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus giving positive results. During the BioASQ 2014 challenge we obtained the first place during the first batch and the third in the two following batches. Our success in the BioASQ challenge proved that a fully automated machine-learning approach, which does not implement any heuristics and rule-based approaches, can be highly competitive and outperform other approaches in similar challenging contexts.

Submitted 18 April, 2017; originally announced April 2017.

arXiv:1612.06083 [pdf, other]

Hierarchical Partitioning of the Output Space in Multi-label Data

Authors: Yannis Papanikolaou, Ioannis Katakis, Grigorios Tsoumakas

Abstract: Hierarchy Of Multi-label classifiers (HOMER) is a multi-label learning algorithm that breaks the initial learning task into several easier sub-tasks by first constructing a hierarchy of labels from a given label set and secondly employing a given base multi-label classifier (MLC) to the resulting sub-problems. The primary goal is to effectively address class imbalance and scalability issues that often arise in real-world multi-label classification problems. In this work, we present the general setup for a HOMER model and a simple extension of the algorithm that is suited for MLCs that output rankings. Furthermore, we provide a detailed analysis of the properties of the algorithm, both from an aspect of effectiveness and computational complexity. A secondary contribution involves the presentation of a balanced variant of the k-means algorithm, which serves in the first step of the label hierarchy construction. We conduct extensive experiments on six real-world datasets, studying empirically HOMER's parameters and providing examples of instantiations of the algorithm with different clustering approaches and MLCs. The empirical results demonstrate a significant improvement over the given base MLC.

Submitted 19 December, 2016; originally announced December 2016.

arXiv:1404.5065 [pdf, other]

doi 10.1007/978-3-662-44845-8_15

Multi-Target Regression via Random Linear Target Combinations

Authors: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Aikaterini Vrekou, Ioannis Vlahavas

Abstract: Multi-target regression is concerned with the simultaneous prediction of multiple continuous target variables based on the same set of input variables. It arises in several interesting industrial and environmental application domains, such as ecological modelling and energy forecasting. This paper presents an ensemble method for multi-target regression that constructs new target variables via random linear combinations of existing targets. We discuss the connection of our approach with multi-label classification algorithms, in particular RA$k$EL, which originally inspired this work, and a family of recent multi-label classification algorithms that involve output coding. Experimental results on 12 multi-target datasets show that it performs significantly better than a strong baseline that learns a single model for each target using gradient boosting and compares favourably to the multi-objective random forest approach, which is a state-of-the-art approach. The experiments further show that our approach improves more when stronger unconditional dependencies exist among the targets.

Submitted 20 April, 2014; originally announced April 2014.

Journal ref: ECML PKDD Proceedings, Part III (2014) 225-240

arXiv:1404.4038 [pdf, other]

Discovering and Exploiting Entailment Relationships in Multi-Label Learning

Authors: Christina Papagiannopoulou, Grigorios Tsoumakas, Ioannis Tsamardinos

Abstract: This work presents a sound probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. In particular we focus on discovering two kinds of relationships among the labels. The first one concerns pairwise positive entailment: pairs of labels, where the presence of one implies the presence of the other in all instances of a dataset. The second concerns exclusion: sets of labels that do not coexist in the same instances of the dataset. These relationships are represented with a Bayesian network. Marginal probabilities are entered as soft evidence in the network and adjusted through probabilistic inference. Our approach offers robust improvements in mean average precision compared to the standard binary relevance approach across all 12 datasets involved in our experiments. The discovery process helps interesting implicit knowledge to emerge, which could be useful in itself.

Submitted 17 April, 2014; v1 submitted 15 April, 2014; originally announced April 2014.

arXiv:1211.6581 [pdf, other]

doi 10.1007/s10994-016-5546-z

Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs

Authors: Eleftherios Spyromitros-Xioufis, Grigorios Tsoumakas, William Groves, Ioannis Vlahavas

Abstract: In many practical applications of supervised learning the task involves the prediction of multiple target variables from a common set of input variables. When the prediction targets are binary the task is called multi-label classification, while when the targets are continuous the task is called multi-target regression. In both tasks, target variables often exhibit statistical dependencies and exploiting them in order to improve predictive accuracy is a core challenge. A family of multi-label classification methods address this challenge by building a separate model for each target on an expanded input space where other targets are treated as additional input variables. Despite the success of these methods in the multi-label classification domain, their applicability and effectiveness in multi-target regression has not been studied until now. In this paper, we introduce two new methods for multi-target regression, called Stacked Single-Target and Ensemble of Regressor Chains, by adapting two popular multi-label classification methods of this family. Furthermore, we highlight an inherent problem of these methods - a discrepancy of the values of the additional input variables between training and prediction - and develop extensions that use out-of-sample estimates of the target variables during training in order to tackle this problem. The results of an extensive experimental evaluation carried out on a large and diverse collection of datasets show that, when the discrepancy is appropriately mitigated, the proposed methods attain consistent improvements over the independent regressions baseline. Moreover, two versions of Ensemble of Regressor Chains perform significantly better than four state-of-the-art methods including regularization-based multi-task learning methods and a multi-objective random forest approach.

Submitted 27 January, 2016; v1 submitted 28 November, 2012; originally announced November 2012.

Comments: Accepted for publication in Machine Learning journal. This replacement contains major improvements compared to the previous version, including a deeper theoretical and experimental analysis and an extended discussion of related work

Showing 1–49 of 49 results for author: Tsoumakas, G