
Showing 1–49 of 49 results for author: Tsoumakas, G

  1. arXiv:2403.18192  [pdf, other]

    cs.LG

    Multi-Label Adaptive Batch Selection by Highlighting Hard and Imbalanced Samples

    Authors: Ao Zhou, Bin Liu, ** Wang, Grigorios Tsoumakas

    Abstract: Deep neural network models have demonstrated their effectiveness in classifying multi-label data from various domains. Typically, they employ a training mode that combines mini-batches with optimizers, where each sample is randomly selected with equal probability when constructing mini-batches. However, the intrinsic class imbalance in multi-label data may bias the model towards majority labels, s…

    Submitted 26 March, 2024; originally announced March 2024.

  2. arXiv:2312.05172  [pdf, other]

    cs.CL

    From Lengthy to Lucid: A Systematic Literature Review on NLP Techniques for Taming Long Sentences

    Authors: Tatiana Passali, Efstathios Chatzikyriakidis, Stelios Andreadis, Thanos G. Stavropoulos, Anastasia Matonaki, Anestis Fachantidis, Grigorios Tsoumakas

    Abstract: Long sentences have been a persistent issue in written communication for many years since they make it challenging for readers to grasp the main points or follow the initial intention of the writer. This survey, conducted using the PRISMA guidelines, systematically reviews two main strategies for addressing the issue of long sentences: a) sentence compression and b) sentence splitting. An increase…

    Submitted 8 December, 2023; originally announced December 2023.

    Comments: Author's Version, Submitted to ACM CSUR

  3. arXiv:2303.16506  [pdf, other]

    cs.LG cs.AI

    Local Interpretability of Random Forests for Multi-Target Regression

    Authors: Avraam Bardos, Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

    Abstract: Multi-target regression is useful in a plethora of applications. Although random forest models perform well in these tasks, they are often difficult to interpret. Interpretability is crucial in machine learning, especially when it can directly impact human well-being. Although model-agnostic techniques exist for multi-target regression, specific techniques tailored to random forest models are not…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: 8 pages, 1 figure, 2 tables, to be submitted to XAI conference 2023 as an extended abstract

    ACM Class: I.2.0; I.2.6

  4. arXiv:2302.13034  [pdf, other]

    cs.LG

    Does Noise Affect Housing Prices? A Case Study in the Urban Area of Thessaloniki

    Authors: Georgios Kamtziridis, Dimitris Vrakas, Grigorios Tsoumakas

    Abstract: Real estate markets depend on various methods to predict housing prices, including models that have been trained on datasets of residential or commercial properties. Most studies endeavor to create more accurate machine learning models by utilizing data such as basic property characteristics as well as urban features like distances from amenities and road accessibility. Even though environmental f…

    Submitted 25 February, 2023; originally announced February 2023.

    Comments: 24 pages, 20 figures, 7 tables

  5. Large-scale investigation of weakly-supervised deep learning for the fine-grained semantic indexing of biomedical literature

    Authors: Anastasios Nentidis, Thomas Chatzopoulos, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

    Abstract: Objective: Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors with several related but distinct biomedical concepts often grouped together and treated as a single topic. This study proposes a new method for the automated refinement of subject annotations at the level of MeSH concepts. Methods: Lacking labelled data, we rely on weak supervision based on conc…

    Submitted 5 October, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

    Comments: 26 pages, 5 figures, 4 tables. A more concise version

    Journal ref: Journal of Biomedical Informatics, Volume 146, 2023, 104499, ISSN 1532-0464

  6. arXiv:2212.03513  [pdf, other]

    cs.LG cs.AI cs.LO

    Truthful Meta-Explanations for Local Interpretability of Machine Learning Models

    Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

    Abstract: Automated Machine Learning-based systems' integration into a wide range of tasks has expanded as a result of their performance and speed. Although there are numerous advantages to employing ML-based systems, if they are not interpretable, they should not be used in critical, high-risk applications where human lives are at risk. To address this issue, researchers and businesses have been focusing o…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: 22 pages, 5 figures, 9 tables, submitted to Applied Intelligence Journal

    ACM Class: I.2.0; I.2.6

  7. arXiv:2212.00543  [pdf, other]

    cs.AI

    Fine-Grained Selective Similarity Integration for Drug-Target Interaction Prediction

    Authors: Bin Liu, ** Wang, Kaiwei Sun, Grigorios Tsoumakas

    Abstract: The discovery of drug-target interactions (DTIs) is a pivotal process in pharmaceutical development. Computational approaches are a promising and efficient alternative to tedious and costly wet-lab experiments for predicting novel DTIs from numerous candidates. Recently, with the availability of abundant heterogeneous biological information from diverse data sources, computational methods have bee…

    Submitted 21 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

  8. arXiv:2211.09235  [pdf, other]

    cs.CL

    Artificial Disfluency Detection, Uh No, Disfluency Generation for the Masses

    Authors: T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos, S. Vrochidis

    Abstract: Existing approaches for disfluency detection typically require the existence of large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies that can be encountered in real-world scenarios. This work proposes LARD, a method for automatically generating artificial disfluencies from fluent text. LARD can simulate all…

    Submitted 16 November, 2022; originally announced November 2022.

    Comments: 10 pages

  9. arXiv:2209.10876  [pdf, other]

    cs.CL cs.LG

    An Attention Matrix for Every Decision: Faithfulness-based Arbitration Among Multiple Attention-Based Interpretations of Transformers in Text Classification

    Authors: Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

    Abstract: Transformers are widely used in natural language processing, where they consistently achieve state-of-the-art performance. This is mainly due to their attention-based architecture, which allows them to model rich linguistic relations between (sub)words. However, transformers are difficult to interpret. Being able to provide reasoning for its decisions is an important property for a model in domain…

    Submitted 28 November, 2022; v1 submitted 22 September, 2022; originally announced September 2022.

    Comments: 16 pages, 7 figures, 5 tables, Submitted to DAMI Journal (ECMLPKDD2023 Special Issue)

  10. Local Multi-Label Explanations for Random Forest

    Authors: Nikolaos Mylonas, Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

    Abstract: Multi-label classification is a challenging task, particularly in domains where the number of labels to be predicted is large. Deep neural networks are often effective at multi-label classification of images and textual data. When dealing with tabular data, however, conventional machine learning algorithms, such as tree ensembles, appear to outperform competition. Random forest, being a popular en…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: 11 pages, 1 figure, 8 tables, submitted to XKDD (workshop of ECML PKDD 2022)

    ACM Class: I.2.0; I.2.6

  11. arXiv:2206.04317  [pdf, other]

    cs.CL

    Topic-Controllable Summarization: Topic-Aware Evaluation and Transformer Methods

    Authors: Tatiana Passali, Grigorios Tsoumakas

    Abstract: Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. For example, the majority of existing methods built upon recurrent architectures, which can significantly limit their performance compared to more recent Transformer-based architectures, while they also require modifications to…

    Submitted 17 April, 2024; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: Accepted at LREC-COLING 2024

  12. arXiv:2204.14012  [pdf, other]

    cs.LG cs.AI cs.IR

    Local Explanation of Dimensionality Reduction

    Authors: Avraam Bardos, Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

    Abstract: Dimensionality reduction (DR) is a popular method for preparing and analyzing high-dimensional data. Reduced data representations are less computationally intensive and easier to manage and visualize, while retaining a significant percentage of their original information. Aside from these advantages, these reduced representations can be difficult or impossible to interpret in most circumstances, e…

    Submitted 29 April, 2022; originally announced April 2022.

    Comments: 13 Pages, 12 Figures, 6 Tables, Submitted to SETN2022

  13. arXiv:2201.09508  [pdf, other]

    q-bio.QM cs.LG

    Multiple Similarity Drug-Target Interaction Prediction with Random Walks and Matrix Factorization

    Authors: Bin Liu, Dimitrios Papadopoulos, Fragkiskos D. Malliaros, Grigorios Tsoumakas, Apostolos N. Papadopoulos

    Abstract: The discovery of drug-target interactions (DTIs) is a very promising area of research with great potential. The accurate identification of reliable interactions among drugs and proteins via computational methods, which typically leverage heterogeneous information retrieved from diverse data sources, can boost the development of effective pharmaceuticals. Although random walk and matrix factorizati…

    Submitted 8 August, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

    Journal ref: Briefings in Bioinformatics, Volume 23, Issue 5, 2022

  14. arXiv:2201.05041  [pdf, other]

    cs.CL

    LARD: Large-scale Artificial Disfluency Generation

    Authors: T. Passali, T. Mavropoulos, G. Tsoumakas, G. Meditskos, S. Vrochidis

    Abstract: Disfluency detection is a critical task in real-time dialogue systems. However, despite its importance, it remains a relatively unexplored field, mainly due to the lack of appropriate datasets. At the same time, existing datasets suffer from various issues, including class imbalance issues, which can significantly affect the performance of the model on rare classes, as it is demonstrated in this p…

    Submitted 3 May, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: Accepted at LREC 2022

  15. arXiv:2110.04480  [pdf, other]

    cs.CL

    Bayesian Active Summarization

    Authors: Alexios Gidiotis, Grigorios Tsoumakas

    Abstract: Bayesian Active Learning has had significant impact to various NLP problems, but nevertheless its application to text summarization has been explored very little. We introduce Bayesian Active Summarization (BAS), as a method of combining active learning methods with state-of-the-art summarization models. Our findings suggest that BAS achieves better and more robust performance, compared to random…

    Submitted 9 October, 2021; originally announced October 2021.

  16. arXiv:2107.03825  [pdf, other]

    cs.LG stat.AP

    Short-term Renewable Energy Forecasting in Greece using Prophet Decomposition and Tree-based Ensembles

    Authors: Argyrios Vartholomaios, Stamatis Karlos, Eleftherios Kouloumpris, Grigorios Tsoumakas

    Abstract: Energy production using renewable sources exhibits inherent uncertainties due to their intermittent nature. Nevertheless, the unified European energy market promotes the increasing penetration of renewable energy sources (RES) by the regional energy system operators. Consequently, RES forecasting can assist in the integration of these volatile energy sources, since it leads to higher reliability a…

    Submitted 8 July, 2021; originally announced July 2021.

    Comments: 11 pages, 7 figures

  17. arXiv:2106.00302  [pdf, other]

    cs.DL cs.IR

    Harvesting the Public MeSH Note field

    Authors: Anastasios Nentidis, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

    Abstract: In this document, we report an analysis of the Public MeSH Note field of the new descriptors introduced in the MeSH thesaurus between 2006 and 2020. The aim of this analysis was to extract information about the previous status of these new descriptors as Supplementary Concept Records. The Public MeSH Note field contains information in semi-structured text, meant to be read by humans. Therefore, we…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: 3 pages, 1 figure, 1 table. Technical report

  18. arXiv:2105.10155  [pdf, other]

    cs.CL

    Should We Trust This Summary? Bayesian Abstractive Summarization to The Rescue

    Authors: Alexios Gidiotis, Grigorios Tsoumakas

    Abstract: We explore the notion of uncertainty in the context of modern abstractive summarization models, using the tools of Bayesian Deep Learning. Our approach approximates Bayesian inference by first extending state-of-the-art summarization models with Monte Carlo dropout and then using them to perform multiple stochastic forward passes. Based on Bayesian inference we are able to effectively quantify unc…

    Submitted 3 May, 2022; v1 submitted 21 May, 2021; originally announced May 2021.

  19. arXiv:2105.05527  [pdf, other]

    cs.DL

    Self-citation Analysis using Sentence Embeddings

    Authors: Athanasios Lagopoulos, Grigorios Tsoumakas

    Abstract: The purpose of citation indexes and metrics is intended to be a measure for scientific innovation and quality for researchers, journals, and institutions. However, those metrics are often prone to abuse and manipulation by excessive and unethical self-citations induced by authors, reviewers, editors, or journals. Identifying whether there are or not legitimate reasons for self-citations is normall…

    Submitted 12 May, 2021; originally announced May 2021.

  20. arXiv:2105.01545  [pdf, other]

    q-bio.QM cs.LG

    Optimizing Area Under the Curve Measures via Matrix Factorization for Predicting Drug-Target Interaction with Multiple Similarities

    Authors: Bin Liu, Grigorios Tsoumakas

    Abstract: In drug discovery, identifying drug-target interactions (DTIs) via experimental approaches is a tedious and expensive procedure. Computational methods efficiently predict DTIs and recommend a small part of potential interacting pairs for further experimental confirmation, accelerating the drug discovery process. Although it has been shown that fusing heterogeneous drug and target similarities can…

    Submitted 14 January, 2022; v1 submitted 1 May, 2021; originally announced May 2021.

  21. LioNets: A Neural-Specific Local Interpretation Technique Exploiting Penultimate Layer Information

    Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

    Abstract: Artificial Intelligence (AI) has a tremendous impact on the unexpected growth of technology in almost every aspect. AI-powered systems are monitoring and deciding about sensitive economic and societal issues. The future is towards automation, and it must not be prevented. However, this is a conflicting viewpoint for a lot of people, due to the fear of uncontrollable AI systems. This concern could…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: 23 pages, 22 figures, 2 tables, submitted to Information Fusion Journal

    ACM Class: I.2.0; I.2.6; I.2.7

  22. arXiv:2104.06040  [pdf, other]

    cs.LG cs.AI

    Conclusive Local Interpretation Rules for Random Forests

    Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

    Abstract: In critical situations involving discrimination, gender inequality, economic damage, and even the possibility of casualties, machine learning models must be able to provide clear interpretations for their decisions. Otherwise, their obscure decision-making processes can lead to socioethical issues as they interfere with people's lives. In the aforementioned sectors, random forest algorithms strive…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: 32 pages, 31 figures, 4 Tables, submitted to data mining and knowledge discovery journal

    ACM Class: I.2.0; I.2.6

  23. VisioRed: A Visualisation Tool for Interpretable Predictive Maintenance

    Authors: Spyridon Paraschos, Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

    Abstract: The use of machine learning rapidly increases in high-risk scenarios where decisions are required, for example in healthcare or industrial monitoring equipment. In crucial situations, a model that can offer meaningful explanations of its decision-making is essential. In industrial facilities, the equipment's well-timed maintenance is vital to ensure continuous operation to prevent money loss. Usin…

    Submitted 14 April, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

    Comments: 4 pages, 2 figures, Submitted to IJCAI

    ACM Class: I.2.0; I.2.6; H.5.2

  24. arXiv:2103.04156  [pdf, other]

    cs.CL cs.IR cs.LG

    Improving Zero-Shot Entity Retrieval through Effective Dense Representations

    Authors: Eleni Partalidou, Despina Christou, Grigorios Tsoumakas

    Abstract: Entity Linking (EL) seeks to align entity mentions in text to entries in a knowledge-base and is usually comprised of two phases: candidate generation and candidate ranking. While most methods focus on the latter, it is the candidate generation phase that sets an upper bound to both time and accuracy performance of the overall EL system. This work's contribution is a significant improvement in can…

    Submitted 6 March, 2021; originally announced March 2021.

    Comments: 8 pages, 2 figures

    ACM Class: I.2.7; H.3.3

  25. arXiv:2102.01156  [pdf, other]

    cs.CL cs.IR cs.LG

    Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embeddings

    Authors: Despina Christou, Grigorios Tsoumakas

    Abstract: Distantly-supervised relation extraction (RE) is an effective method to scale RE to large corpora but suffers from noisy labels. Existing approaches try to alleviate noise through multi-instance learning and by providing additional information, but manage to recognize mainly the top frequent relations, neglecting those in the long-tail. We propose REDSandT (Relation Extraction with Distant Supervi…

    Submitted 1 February, 2021; originally announced February 2021.

    Comments: 10 pages, 4 figures

    ACM Class: I.2.7; H.3.3

  26. What is all this new MeSH about? Exploring the semantic provenance of new descriptors in the MeSH thesaurus

    Authors: Anastasios Nentidis, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

    Abstract: The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual provenance of these new descriptors. In particular,…

    Submitted 27 July, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

    Comments: 18 pages, 14 figures, 2 tables

  27. arXiv:2012.12325  [pdf, other]

    cs.LG q-bio.QM

    Drug-Target Interaction Prediction via an Ensemble of Weighted Nearest Neighbors with Interaction Recovery

    Authors: Bin Liu, Konstantinos Pliakos, Celine Vens, Grigorios Tsoumakas

    Abstract: Predicting drug-target interactions (DTI) via reliable computational methods is an effective and efficient way to mitigate the enormous costs and time of the drug discovery process. Structure-based drug similarities and sequence-based target protein similarities are the commonly used information for DTI prediction. Among numerous computational methods, neighborhood-based chemogenomic approaches th…

    Submitted 9 July, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

  28. arXiv:2011.09752  [pdf, other]

    cs.IR

    From Protocol to Screening: A Hybrid Learning Approach for Technology-Assisted Systematic Literature Reviews

    Authors: Athanasios Lagopoulos, Grigorios Tsoumakas

    Abstract: In the medical domain, a Systematic Literature Review (SLR) attempts to collect all empirical evidence, that fit pre-specified eligibility criteria, in order to answer a specific research question. The process of preparing an SLR consists of multiple tasks that are labor-intensive and time-consuming, involving large monetary costs. Technology-assisted review (TAR) methods automate the different pr…

    Submitted 19 November, 2020; originally announced November 2020.

  29. arXiv:2010.07650  [pdf, other]

    cs.LG cs.AI cs.LO

    Altruist: Argumentative Explanations through Local Interpretations of Predictive Models

    Authors: Ioannis Mollas, Nick Bassiliades, Grigorios Tsoumakas

    Abstract: Explainable AI is an emerging field providing solutions for acquiring insights into automated systems' rationale. It has been put on the AI map by suggesting ways to tackle key ethical and societal issues. Existing explanation techniques are often not comprehensible to the end user. Lack of evaluation and selection criteria also makes it difficult for the end user to choose the most suitable techn…

    Submitted 29 April, 2022; v1 submitted 15 October, 2020; originally announced October 2020.

    Comments: Submitted to SETN2022

    ACM Class: I.2.0; I.2.6

  30. arXiv:2008.09513  [pdf, other]

    cs.CL

    Keywords lie far from the mean of all words in local vector space

    Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas, Apostolos N. Papadopoulos

    Abstract: Keyword extraction is an important document process that aims at finding a small set of terms that concisely describe a document's topics. The most popular state-of-the-art unsupervised approaches belong to the family of the graph-based methods that build a graph-of-words and use various centrality measures to score the nodes (candidate keywords). In this work, we follow a different path to detect…

    Submitted 21 August, 2020; originally announced August 2020.

  31. arXiv:2006.08328  [pdf, other]

    cs.CL cs.LG stat.ML

    ETHOS: an Online Hate Speech Detection Dataset

    Authors: Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, Grigorios Tsoumakas

    Abstract: Online hate speech is a recent problem in our society that is rising at a steady pace by leveraging the vulnerabilities of the corresponding regimes that characterise most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of a posted multimedia context. Nowadays, giant corporations own platforms where millions of user…

    Submitted 6 July, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: 16 Pages, 3 Figures, 9 Tables, Submitted to the special issue on "Intelligent Systems for Safer Social Media" of Complex & Intelligent Systems

    ACM Class: I.2.6; I.2.7; I.5.4; H.2.4

  32. Beyond MeSH: Fine-Grained Semantic Indexing of Biomedical Literature based on Weak Supervision

    Authors: Anastasios Nentidis, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras

    Abstract: In this work, we propose a method for the automated refinement of subject annotations in biomedical literature at the level of concepts. Semantic indexing and search of biomedical articles in MEDLINE/PubMed are based on semantic subject annotations with MeSH descriptors that may correspond to several related but distinct biomedical concepts. Such semantic annotations do not adhere to the level of…

    Submitted 18 May, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

    Comments: 36 pages, 8 figures; Dictionary-based baselines added and conclusions updated

    Journal ref: Information Processing and Management 57 (2020) 102282

  33. arXiv:2005.03240  [pdf, other]

    cs.LG stat.ML

    Multi-Label Sampling based on Local Label Imbalance

    Authors: Bin Liu, Konstantinos Blekas, Grigorios Tsoumakas

    Abstract: Class imbalance is an inherent characteristic of multi-label data that hinders most multi-label learning methods. One efficient and flexible strategy to deal with this problem is to employ sampling techniques before training a multi-label learning model. Although existing multi-label sampling approaches alleviate the global imbalance of multi-label datasets, it is actually the imbalance level with…

    Submitted 19 May, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

    Comments: arXiv admin note: text overlap with arXiv:1905.00609

  34. arXiv:2004.06190  [pdf, other]

    cs.CL

    A Divide-and-Conquer Approach to the Summarization of Long Documents

    Authors: Alexios Gidiotis, Grigorios Tsoumakas

    Abstract: We present a novel divide-and-conquer method for the neural summarization of long documents. Our method exploits the discourse structure of the document and uses sentence similarity to split the problem into an ensemble of smaller summarization problems. In particular, we break a long document and its summary into multiple source-target pairs, which are used for training a model that learns to sum…

    Submitted 23 September, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

  35. arXiv:1911.08780  [pdf, other]

    cs.LG cs.AI stat.ML

    LionForests: Local Interpretation of Random Forests

    Authors: Ioannis Mollas, Nick Bassiliades, Ioannis Vlahavas, Grigorios Tsoumakas

    Abstract: Towards a future where machine learning systems will integrate into every aspect of people's lives, researching methods to interpret such systems is necessary, instead of focusing exclusively on enhancing their performance. Enriching the trust between these systems and people will accelerate this integration process. Many medical and retail banking/finance applications use state-of-the-art machine…

    Submitted 23 July, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

    Comments: 8 Pages, 4 Tables, 6 Figures, Submitted to NeHuAI-2020 Workshop of ECAI2020

    ACM Class: I.2.0; I.2.6

    Journal ref: Proceedings of the First International Workshop on New Foundations for Human-Centered AI (NeHuAI) co-located with 24th European Conference on Artificial Intelligence (ECAI 2020), http://ceur-ws.org/Vol-2659/ [p.17-24]

  36. LioNets: Local Interpretation of Neural Networks through Penultimate Layer Decoding

    Authors: Ioannis Mollas, Nikolaos Bassiliades, Grigorios Tsoumakas

    Abstract: Technological breakthroughs on smart homes, self-driving cars, health care and robotic assistants, in addition to reinforced law regulations, have critically influenced academic research on explainable machine learning. A sufficient number of researchers have implemented ways to explain indifferently any black box model for classification tasks. A drawback of building agnostic explanators is that…

    Submitted 8 August, 2019; v1 submitted 15 June, 2019; originally announced June 2019.

    Comments: Submitted and accepted to AIMLAI-XKDD-ECMLPKDD19

    ACM Class: I.2.0; I.2.6; I.2.7

  37. arXiv:1905.07695  [pdf, ps, other]

    cs.CL

    Structured Summarization of Academic Publications

    Authors: Alexios Gidiotis, Grigorios Tsoumakas

    Abstract: We propose SUSIE, a novel summarization method that can work with state-of-the-art summarization models in order to produce structured scientific summaries for academic articles. We also created PMC-SA, a new dataset of academic publications, suitable for the task of structured summarization with neural networks. We apply SUSIE combined with three different summarization models on the new PMC-SA d…

    Submitted 21 June, 2019; v1 submitted 19 May, 2019; originally announced May 2019.

  38. arXiv:1905.05044  [pdf, other]

    cs.CL cs.IR

    A Review of Keyphrase Extraction

    Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas

    Abstract: Keyphrase extraction is a textual information processing task concerned with the automatic extraction of representative and characteristic phrases from a document that express all the key aspects of its content. Keyphrases constitute a succinct conceptual summary of a document, which is very useful in digital information management systems for semantic indexing, faceted search, document clustering…

    Submitted 30 July, 2019; v1 submitted 13 May, 2019; originally announced May 2019.

    Comments: author pre-print version

  39. arXiv:1905.00609  [pdf, other]

    cs.LG stat.ML

    Synthetic Oversampling of Multi-Label Data based on Local Label Distribution

    Authors: Bin Liu, Grigorios Tsoumakas

    Abstract: Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before training the classifier. Existing multilabel sampling methods alleviate the (global) imbalance of multi-label datasets. However, performance degradation is mainly due…

    Submitted 20 June, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

    Journal ref: ECML-PKDD 2019

  40. arXiv:1808.03712  [pdf, other]

    cs.CL

    Unsupervised Keyphrase Extraction from Scientific Publications

    Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas

    Abstract: We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection. It starts by training word embeddings on the target document to capture semantic regularities among the words. It then uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the s…

    Submitted 12 July, 2020; v1 submitted 10 August, 2018; originally announced August 2018.

    Comments: author pre-print version

  41. arXiv:1807.11393  [pdf, other]

    cs.LG stat.ML

    Making Classifier Chains Resilient to Class Imbalance

    Authors: Bin Liu, Grigorios Tsoumakas

    Abstract: Class imbalance is an intrinsic characteristic of multi-label data. Most of the labels in multi-label data sets are associated with a small number of training examples, much smaller compared to the size of the data set. Class imbalance poses a key challenge that plagues most multi-label learning methods. Ensemble of Classifier Chains (ECC), one of the most prominent multi-label learning methods, i…

    Submitted 6 November, 2018; v1 submitted 30 July, 2018; originally announced July 2018.

  42. arXiv:1711.05098  [pdf, other]

    cs.AI cs.DL

    Web Robot Detection in Academic Publishing

    Authors: Athanasios Lagopoulos, Grigorios Tsoumakas, Georgios Papadopoulos

    Abstract: Recent industry reports assure the rise of web robots which comprise more than half of the total web traffic. They not only threaten the security, privacy and efficiency of the web but they also distort analytics and metrics, doubting the veracity of the information being promoted. In the academic publishing domain, this can cause articles to be faulty presented as prominent and influential. In th…

    Submitted 14 November, 2017; originally announced November 2017.

  43. arXiv:1710.07503  [pdf, other]

    cs.CL

    Local Word Vectors Guiding Keyphrase Extraction

    Authors: Eirini Papagiannopoulou, Grigorios Tsoumakas

    Abstract: Automated keyphrase extraction is a fundamental textual information processing task concerned with the selection of representative phrases from a document that summarize its content. This work presents a novel unsupervised method for keyphrase extraction, whose main innovation is the use of local word embeddings (in particular GloVe vectors), i.e., embeddings trained from the single document under…

    Submitted 13 April, 2018; v1 submitted 20 October, 2017; originally announced October 2017.

    Comments: author pre-print version

  44. arXiv:1709.05480  [pdf, other]

    stat.ML cs.LG

    Subset Labeled LDA for Large-Scale Multi-Label Classification

    Authors: Yannis Papanikolaou, Grigorios Tsoumakas

    Abstract: Labeled Latent Dirichlet Allocation (LLDA) is an extension of the standard unsupervised Latent Dirichlet Allocation (LDA) algorithm, to address multi-label learning tasks. Previous work has shown it to perform on par with other state-of-the-art multi-label methods. Nonetheless, with increasing label set sizes, LLDA encounters scalability issues. In this work, we introduce Subset LLDA, a simple var…

    Submitted 16 September, 2017; originally announced September 2017.

  45. arXiv:1704.05271  [pdf, other]

    stat.ML cs.LG

    Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models

    Authors: Yannis Papanikolaou, Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, Ioannis Vlahavas

    Abstract: Background: In this paper we present the approaches and methods employed in order to deal with a large scale multi-label semantic indexing task of biomedical papers. This work was mainly implemented within the context of the BioASQ challenge of 2014. Methods: The main contribution of this work is a multi-label ensemble method that incorporates a McNemar statistical significance test in order to va…

    Submitted 18 April, 2017; originally announced April 2017.

  46. arXiv:1612.06083  [pdf, other]

    stat.ML cs.LG

    Hierarchical Partitioning of the Output Space in Multi-label Data

    Authors: Yannis Papanikolaou, Ioannis Katakis, Grigorios Tsoumakas

    Abstract: Hierarchy Of Multi-label classifiers (HOMER) is a multi-label learning algorithm that breaks the initial learning task into several easier sub-tasks by first constructing a hierarchy of labels from a given label set and secondly applying a given base multi-label classifier (MLC) to the resulting sub-problems. The primary goal is to effectively address class imbalance and scalability issues that of…

    Submitted 19 December, 2016; originally announced December 2016.

  47. Multi-Target Regression via Random Linear Target Combinations

    Authors: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Aikaterini Vrekou, Ioannis Vlahavas

    Abstract: Multi-target regression is concerned with the simultaneous prediction of multiple continuous target variables based on the same set of input variables. It arises in several interesting industrial and environmental application domains, such as ecological modelling and energy forecasting. This paper presents an ensemble method for multi-target regression that constructs new target variables via rand…

    Submitted 20 April, 2014; originally announced April 2014.

    Journal ref: ECML PKDD Proceedings, Part III (2014) 225-240

  48. arXiv:1404.4038  [pdf, other]

    cs.LG

    Discovering and Exploiting Entailment Relationships in Multi-Label Learning

    Authors: Christina Papagiannopoulou, Grigorios Tsoumakas, Ioannis Tsamardinos

    Abstract: This work presents a sound probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. In particular we focus on discovering two kinds of relationships among the labels. The first one concerns pairwise positive entailment: pairs of labels, where the presence of one implies the presence of t…

    Submitted 17 April, 2014; v1 submitted 15 April, 2014; originally announced April 2014.

  49. Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs

    Authors: Eleftherios Spyromitros-Xioufis, Grigorios Tsoumakas, William Groves, Ioannis Vlahavas

    Abstract: In many practical applications of supervised learning the task involves the prediction of multiple target variables from a common set of input variables. When the prediction targets are binary the task is called multi-label classification, while when the targets are continuous the task is called multi-target regression. In both tasks, target variables often exhibit statistical dependencies and exp…

    Submitted 27 January, 2016; v1 submitted 28 November, 2012; originally announced November 2012.

    Comments: Accepted for publication in Machine Learning journal. This replacement contains major improvements compared to the previous version, including a deeper theoretical and experimental analysis and an extended discussion of related work