Showing 1–50 of 60 results for author: Androutsopoulos, I

  1. arXiv:2406.14164  [pdf, other]

    cs.AI cs.CL

    A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning

    Authors: Panagiotis Kaliosis, John Pavlopoulos, Foivos Charalampakos, Georgios Moschovis, Ion Androutsopoulos

    Abstract: Diagnostic Captioning (DC) automatically generates a diagnostic text from one or more medical images (e.g., X-rays, MRIs) of a patient. Treated as a draft, the generated text may assist clinicians, by providing an initial estimation of the patient's condition, speeding up and helping safeguard the diagnostic process. The accuracy of a diagnostic text, however, strongly depends on how well the key…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: [Pre-print] ACL Findings 2024, 17 pages, 7 figures, 7 tables

  2. arXiv:2406.06127  [pdf, other]

    cs.CL cs.AI

    Comparing Data Augmentation Methods for End-to-End Task-Oriented Dialog Systems

    Authors: Christos Vlachos, Themos Stafylakis, Ion Androutsopoulos

    Abstract: Creating effective and reliable task-oriented dialog systems (ToDSs) is challenging, not only because of the complex structure of these systems, but also due to the scarcity of training data, especially when several modules need to be trained separately, each one with its own input/output training examples. Data augmentation (DA), whereby synthetic training examples are added to the training data,…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: There are 25 pages in total, 23 tables, 18 figures. Accepted in ACL 2024

  3. arXiv:2405.08502  [pdf, other]

    cs.CL

    Archimedes-AUEB at SemEval-2024 Task 5: LLM explains Civil Procedure

    Authors: Odysseas S. Chlapanis, Ion Androutsopoulos, Dimitrios Galanis

    Abstract: The SemEval task on Argument Reasoning in Civil Procedure is challenging in that it requires understanding legal concepts and inferring complex arguments. Currently, most Large Language Models (LLM) excelling in the legal realm are principally purposed for classification tasks, hence their reasoning rationale is subject to contention. The approach we advocate involves using a powerful teacher-LLM…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: To be published in SemEval-2024

  4. arXiv:2402.06948  [pdf, other]

    cs.CL

    Should I try multiple optimizers when fine-tuning pre-trained Transformers for NLP tasks? Should I tune their hyperparameters?

    Authors: Nefeli Gkouti, Prodromos Malakasiotis, Stavros Toumpis, Ion Androutsopoulos

    Abstract: NLP research has explored different neural model architectures and sizes, datasets, training objectives, and transfer learning techniques. However, the choice of optimizer during training has not been explored as extensively. Typically, some variant of Stochastic Gradient Descent (SGD) is employed, selected among numerous variants, using unclear criteria, often with minimal or no tuning of the opt…

    Submitted 10 February, 2024; originally announced February 2024.

    Comments: Accepted at EACL 2024

  5. arXiv:2310.13395  [pdf, other]

    cs.CL

    Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models

    Authors: Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos

    Abstract: Prompting Large Language Models (LLMs) performs impressively in zero- and few-shot settings. Hence, small and medium-sized enterprises (SMEs) that cannot afford the cost of creating large task-specific training datasets, but also the cost of pretraining their own LLMs, are increasingly turning to third-party services that allow them to prompt LLMs. However, such services currently require a paymen…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: Short paper (5 pages), accepted at Findings of EMNLP 2023

  6. arXiv:2211.00974  [pdf, other]

    cs.CL

    Processing Long Legal Documents with Pre-trained Transformers: Modding LegalBERT and Longformer

    Authors: Dimitris Mamakas, Petros Tsotsi, Ion Androutsopoulos, Ilias Chalkidis

    Abstract: Pre-trained Transformers currently dominate most NLP tasks. They impose, however, limits on the maximum input length (512 sub-words in BERT), which are too restrictive in the legal domain. Even sparse-attention models, such as Longformer and BigBird, which increase the maximum input length to 4,096 sub-words, severely truncate texts in three of the six datasets of LexGLUE. Simpler linear classifie…

    Submitted 10 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: 9 pages, long paper at NLLP Workshop 2022 proceedings

  7. arXiv:2206.03785  [pdf, other]

    cs.CL

    Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification

    Authors: Stratos Xenouleas, Alexia Tsoukara, Giannis Panagiotakis, Ilias Chalkidis, Ion Androutsopoulos

    Abstract: We consider zero-shot cross-lingual transfer in legal topic classification using the recent MultiEURLEX dataset. Since the original dataset contains parallel documents, which is unrealistic for zero-shot cross-lingual transfer, we develop a new version of the dataset without parallel documents. We use it to show that translation-based methods vastly outperform cross-lingual fine-tuning of multilin…

    Submitted 8 June, 2022; originally announced June 2022.

    Comments: 4 pages, short paper at the 12th Hellenic Conference on Artificial Intelligence (SETN 2022)

  8. arXiv:2204.04711  [pdf, other]

    cs.CL cs.AI

    Data Augmentation for Biomedical Factoid Question Answering

    Authors: Dimitris Pappas, Prodromos Malakasiotis, Ion Androutsopoulos

    Abstract: We study the effect of seven data augmentation (DA) methods in factoid question answering, focusing on the biomedical domain, where obtaining training instances is particularly difficult. We experiment with data from the BioASQ challenge, which we augment with training instances obtained from an artificial biomedical machine reading comprehension dataset, or via back-translation, information retri…

    Submitted 10 April, 2022; originally announced April 2022.

  9. FiNER: Financial Numeric Entity Recognition for XBRL Tagging

    Authors: Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, Georgios Paliouras

    Abstract: Publicly traded companies are required to submit periodic reports with eXtensible Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER…

    Submitted 19 April, 2022; v1 submitted 12 March, 2022; originally announced March 2022.

    Comments: 13 pages, long paper at ACL 2022

  10. arXiv:2111.10223  [pdf, other]

    cs.CL

    Toxicity Detection can be Sensitive to the Conversational Context

    Authors: Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, Leo Laugier

    Abstract: User posts whose perceived toxicity depends on the conversational context are rare in current toxicity detection datasets. Hence, toxicity detectors trained on existing datasets will also tend to disregard context, making the detection of context-sensitive toxicity harder when it does occur. We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels: (i) annotato…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: 13 pages, 8 figures

  11. arXiv:2110.00976  [pdf, other]

    cs.CL

    LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

    Authors: Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, Nikolaos Aletras

    Abstract: Laws and their interpretations, legal arguments and agreements are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeav…

    Submitted 8 November, 2022; v1 submitted 3 October, 2021; originally announced October 2021.

    Comments: 9 pages, long paper at ACL 2022 proceedings. LexGLUE benchmark is available at: https://huggingface.co/datasets/lex_glue. Code is available at: https://github.com/coastalcph/lex-glue. Update TFIDF-SVM scores in the last version

  12. EDGAR-CORPUS: Billions of Tokens Make The World Go Round

    Authors: Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, Prodromos Malakasiotis

    Abstract: We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CO…

    Submitted 1 October, 2021; v1 submitted 29 September, 2021; originally announced September 2021.

    Comments: 6 pages, short paper at ECONLP 2021 Workshop, in conjunction with EMNLP 2021

  13. arXiv:2109.00904  [pdf, other]

    cs.CL

    MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

    Authors: Ilias Chalkidis, Manos Fergadiotis, Ion Androutsopoulos

    Abstract: We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random splits. We use the dataset as a testbed for zer…

    Submitted 6 September, 2021; v1 submitted 2 September, 2021; originally announced September 2021.

    Comments: 9 pages, long paper at EMNLP 2021 proceedings

  14. arXiv:2106.08908  [pdf, other]

    cs.IR cs.LG

    A Neural Model for Joint Document and Snippet Ranking in Question Answering for Large Document Collections

    Authors: Dimitris Pappas, Ion Androutsopoulos

    Abstract: Question answering (QA) systems for large document collections typically use pipelines that (i) retrieve possibly relevant documents, (ii) re-rank them, (iii) rank paragraphs or other snippets of the top-ranked documents, and (iv) select spans of the top-ranked snippets as exact answers. Pipelines are conceptually simple, but errors propagate from one component to the next, without later component…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: 12 pages, 3 figures, 4 tables, ACL-IJCNLP 2021

    MSC Class: 68P20; 68P10; 68T50; 68T07 ACM Class: H.3.3

  15. arXiv:2105.12530  [pdf, ps, other]

    cs.CL

    Deception detection in text and its relation to the cultural dimension of individualism/collectivism

    Authors: Katerina Papantoniou, Panagiotis Papadakos, Theodore Patkos, Giorgos Flouris, Ion Androutsopoulos, Dimitris Plexousakis

    Abstract: Deception detection is a task with many applications both in direct physical and in computer-mediated communication. Our focus is on automatic deception detection in text across cultures. We view culture through the prism of the individualism/collectivism dimension and we approximate culture by using country as a proxy. Having as a starting point recent conclusions drawn from the social psychology…

    Submitted 26 May, 2021; originally announced May 2021.

    Comments: Accepted for publication in Natural Language Engineering journal

  16. arXiv:2103.13084  [pdf, other]

    cs.CL

    Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases

    Authors: Ilias Chalkidis, Manos Fergadiotis, Dimitrios Tsarapatsanis, Nikolaos Aletras, Ion Androutsopoulos, Prodromos Malakasiotis

    Abstract: Interpretability or explainability is an emerging research field in NLP. From a user-centric point of view, the goal is to build models that provide proper justification for their decisions, similar to those of humans, by requiring the models to satisfy additional constraints. To this end, we introduce a new application on legal text where, contrary to mainstream literature targeting word-level ra…

    Submitted 24 March, 2021; originally announced March 2021.

    Comments: 9 pages, long paper at NAACL 2021 proceedings

  17. arXiv:2101.07299  [pdf, other]

    cs.CV

    Diagnostic Captioning: A Survey

    Authors: John Pavlopoulos, Vasiliki Kougia, Ion Androutsopoulos, Dimitris Papamichail

    Abstract: Diagnostic Captioning (DC) concerns the automatic generation of a diagnostic text from a set of medical images of a patient collected during an examination. DC can assist inexperienced physicians, reducing clinical errors. It can also help experienced physicians produce diagnostic reports faster. Following the advances of deep learning, especially in generic image captioning, DC has recently attra…

    Submitted 18 January, 2021; originally announced January 2021.

  18. arXiv:2101.04355  [pdf, other]

    cs.CL

    Neural Contract Element Extraction Revisited: Letters from Sesame Street

    Authors: Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Ion Androutsopoulos

    Abstract: We investigate contract element extraction. We show that LSTM-based encoders perform better than dilated CNNs, Transformers, and BERT in this task. We also find that domain-specific WORD2VEC embeddings outperform generic pre-trained GLOVE embeddings. Morpho-syntactic features in the form of POS tag and token shape embeddings, as well as context-aware ELMO embeddings do not improve performance. Sev…

    Submitted 22 February, 2021; v1 submitted 12 January, 2021; originally announced January 2021.

    Comments: 6 pages

    Journal ref: updated version of the paper presented at Document Intelligence Workshop (NeurIPS 2019 Workshop)

  19. arXiv:2010.02559  [pdf, other]

    cs.CL

    LEGAL-BERT: The Muppets straight out of Law School

    Authors: Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos

    Abstract: BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tunin…

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: 5 pages, short paper in Findings of EMNLP 2020

  20. arXiv:2010.01653  [pdf, other]

    cs.CL

    An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels

    Authors: Ilias Chalkidis, Manos Fergadiotis, Sotiris Kotitsas, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos

    Abstract: Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications and presents interesting challenges. First, not all labels are well represented in the training set, due to the very large label set and the skewed label distributions of LMTC datasets. Also, label hierarchies and differences in human labelling guidelines may affect graph-aware ann…

    Submitted 4 October, 2020; originally announced October 2020.

    Comments: 9 pages, long paper at EMNLP 2020 proceedings

  21. arXiv:2009.13366  [pdf, other]

    cs.LG stat.ML

    Domain Adversarial Fine-Tuning as an Effective Regularizer

    Authors: Giorgos Vernikos, Katerina Margatina, Alexandra Chronopoulou, Ion Androutsopoulos

    Abstract: In Natural Language Processing (NLP), pretrained language models (LMs) that are transferred to downstream tasks have been recently shown to achieve state-of-the-art results. However, standard fine-tuning can degrade the general-domain representations captured during pretraining. To address this issue, we introduce a new regularization technique, AFTER; domain Adversarial Fine-Tuning as an Effectiv…

    Submitted 5 October, 2020; v1 submitted 28 September, 2020; originally announced September 2020.

    Comments: EMNLP 2020, Findings of EMNLP

  22. GREEK-BERT: The Greeks visiting Sesame Street

    Authors: John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis, Ion Androutsopoulos

    Abstract: Transformer-based language models, such as BERT and its variants, have achieved state-of-the-art performance in several downstream natural language processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQUAD, RACE). However, these models have mostly been applied to the resource-rich English language. In this paper, we present GREEK-BERT, a monolingual BERT-based language model for mode…

    Submitted 3 September, 2020; v1 submitted 27 August, 2020; originally announced August 2020.

    Comments: 8 pages, 1 figure, 11th Hellenic Conference on Artificial Intelligence (SETN 2020)

  23. arXiv:2006.00998  [pdf, other]

    cs.CL

    Toxicity Detection: Does Context Really Matter?

    Authors: John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, Ion Androutsopoulos

    Abstract: Moderation is crucial to promoting healthy on-line discussions. Although several 'toxicity' detection datasets and models have been published, most of them ignore the context of the posts, implicitly assuming that comments may be judged independently. We investigate this assumption by focusing on two questions: (a) does context affect the human judgement, and (b) does conditioning on context improv…

    Submitted 1 June, 2020; originally announced June 2020.

  24. arXiv:2005.06376  [pdf, other]

    cs.CL cs.LG stat.ML

    BIOMRC: A Dataset for Biomedical Machine Reading Comprehension

    Authors: Petros Stavropoulos, Dimitris Pappas, Ion Androutsopoulos, Ryan McDonald

    Abstract: We introduce BIOMRC, a large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018). Experiments show that simple heuristics do not perform well on the new dataset, and that two neural MRC models that had been tested on BIOREAD perform much better on BIOMRC, indicating that the new dataset is indeed less noisy or a…

    Submitted 13 May, 2020; originally announced May 2020.

    Comments: 10 pages, 4 figures, 5 tables

  25. arXiv:1909.00578  [pdf, other]

    cs.CL

    SumQE: a BERT-based Summary Quality Estimation Model

    Authors: Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, Ion Androutsopoulos

    Abstract: We propose SumQE, a novel Quality Estimation model for summarization based on BERT. The model addresses linguistic quality aspects that are only indirectly captured by content-based approaches to summary evaluation, without involving comparison with human references. SumQE achieves very high correlations with human ratings, outperforming simpler models addressing these linguistic aspects. Predicti…

    Submitted 2 September, 2019; originally announced September 2019.

    Comments: In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China, 2019

  26. arXiv:1906.07544  [pdf, other]

    cs.CL

    Transfer Learning for Causal Sentence Detection

    Authors: Manolis Kyriakakis, Ion Androutsopoulos, Joan Ginés i Ametllé, Artur Saudabayev

    Abstract: We consider the task of detecting sentences that express causality, as a step towards mining causal relations from texts. To bypass the scarcity of causal instances in relation extraction datasets, we exploit transfer learning, namely ELMO and BERT, using a bidirectional GRU with self-attention (BIGRUATT) as a baseline. We experiment with both generic public relation extraction datasets and a new…

    Submitted 20 June, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

    Comments: 5 pages, short paper at BioNLP 2019 workshop

  27. arXiv:1906.05939  [pdf, other]

    cs.AI cs.CL

    Embedding Biomedical Ontologies by Jointly Encoding Network Structure and Textual Node Descriptors

    Authors: Sotiris Kotitsas, Dimitris Pappas, Ion Androutsopoulos, Ryan McDonald, Marianna Apidianaki

    Abstract: Network Embedding (NE) methods, which map network nodes to low-dimensional feature vectors, have wide applications in network analysis and bioinformatics. Many existing NE methods rely only on network structure, overlooking other information associated with the nodes, e.g., text describing the nodes. Recent attempts to combine the two sources of information only consider local network structure. W…

    Submitted 20 June, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: Proceedings of the 18th Workshop on Biomedical Natural Language Processing (BioNLP 2019) of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 2019

  28. arXiv:1906.02192  [pdf, other]

    cs.CL

    Large-Scale Multi-Label Text Classification on EU Legislation

    Authors: Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Ion Androutsopoulos

    Abstract: We consider Large-Scale Multi-Label Text Classification (LMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, annotated with ~4.3k EUROVOC labels, which is suitable for LMTC, few- and zero-shot learning. Experimenting with several neural classifiers, we show that BIGRUs with label-wise attention perform better than other current state of the art methods. Do…

    Submitted 5 June, 2019; originally announced June 2019.

    Comments: 9 pages, short paper at ACL 2019. arXiv admin note: text overlap with arXiv:1905.10892

  29. arXiv:1906.02059  [pdf, other]

    cs.CL

    Neural Legal Judgment Prediction in English

    Authors: Ilias Chalkidis, Ion Androutsopoulos, Nikolaos Aletras

    Abstract: Legal judgment prediction is the task of automatically predicting the outcome of a court case, given a text describing the case's facts. Previous work on using neural models for this task has focused on Chinese; only feature-based models (e.g., using bags of words and topics) have been considered in English. We release a new English legal judgment prediction dataset, containing cases from the Euro…

    Submitted 5 June, 2019; originally announced June 2019.

    Comments: 7 pages, short paper at ACL 2019

  30. arXiv:1905.13302  [pdf, other]

    cs.CV cs.AI

    A Survey on Biomedical Image Captioning

    Authors: Vasiliki Kougia, John Pavlopoulos, Ion Androutsopoulos

    Abstract: Image captioning applied to biomedical images can assist and accelerate the diagnosis process followed by clinicians. This article is the first survey of biomedical image captioning, discussing datasets, evaluation measures, and state of the art methods. Additionally, we suggest two baselines, a weak and a stronger one; the latter outperforms all current state of the art systems on one of the data…

    Submitted 26 May, 2019; originally announced May 2019.

    Comments: SiVL 2019

  31. arXiv:1905.10892  [pdf, other]

    cs.CL

    Extreme Multi-Label Legal Text Classification: A case study in EU Legislation

    Authors: Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos

    Abstract: We consider the task of Extreme Multi-Label Text Classification (XMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, the European Union's public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. The dataset is substantially larger than previous EURLEX datasets and suitable for XMTC, few-shot and zero-shot learning. Exp…

    Submitted 26 May, 2019; originally announced May 2019.

    Comments: 10 pages, long paper at NLLP Workshop of NAACL-HLT 2019

  32. arXiv:1904.03651  [pdf, other]

    cs.CL

    SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression

    Authors: Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros Potamianos

    Abstract: Neural sequence-to-sequence models are currently the dominant approach in several natural language processing tasks, but require large parallel corpora. We present a sequence-to-sequence-to-sequence autoencoder (SEQ^3), consisting of two chained encoder-decoder pairs, with words used as a sequence of discrete latent variables. We apply the proposed model to unsupervised abstractive sentence compre…

    Submitted 9 June, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

    Comments: Accepted to NAACL 2019

  33. arXiv:1811.00051  [pdf, other]

    cs.CL

    Generating Texts with Integer Linear Programming

    Authors: Gerasimos Lampouras, Ion Androutsopoulos

    Abstract: Concept-to-text generation typically employs a pipeline architecture, which often leads to suboptimal texts. Content selection, for example, may greedily select the most important facts, which may require, however, too many words to express, and this may be undesirable when space is limited or expensive. Selecting other facts, possibly only slightly less important, may allow the lexicalization sta…

    Submitted 31 October, 2018; originally announced November 2018.

  34. arXiv:1810.13414  [pdf, other]

    cs.CL

    Extracting Linguistic Resources from the Web for Concept-to-Text Generation

    Authors: Gerasimos Lampouras, Ion Androutsopoulos

    Abstract: Many concept-to-text generation systems require domain-specific linguistic resources to produce high quality texts, but manually constructing these resources can be tedious and costly. Focusing on NaturalOWL, a publicly available state of the art natural language generator for OWL ontologies, we propose methods to extract from the Web sentence plans and natural language names, two of the most impo…

    Submitted 31 October, 2018; originally announced October 2018.

  35. arXiv:1809.06366  [pdf, other]

    cs.IR cs.CL

    AUEB at BioASQ 6: Document and Snippet Retrieval

    Authors: Georgios-Ioannis Brokos, Polyvios Liosis, Ryan McDonald, Dimitris Pappas, Ion Androutsopoulos

    Abstract: We present AUEB's submissions to the BioASQ 6 document and snippet retrieval tasks (parts of Task 6b, Phase A). Our models use novel extensions to deep learning architectures that operate solely over the text of the query and candidate document/snippets. Our systems scored at the top or near the top for all batches of the challenge, highlighting the effectiveness of deep learning for these tasks.

    Submitted 15 September, 2018; originally announced September 2018.

    Comments: In Proceedings of the workshop BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering, at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, 2018. arXiv admin note: text overlap with arXiv:1809.01682

  36. arXiv:1809.01682  [pdf, other]

    cs.IR cs.CL

    Deep Relevance Ranking Using Enhanced Document-Query Interactions

    Authors: Ryan McDonald, Georgios-Ioannis Brokos, Ion Androutsopoulos

    Abstract: We explore several new models for document relevance ranking, building upon the Deep Relevance Matching Model (DRMM) of Guo et al. (2016). Unlike DRMM, which uses context-insensitive encodings of terms and query-document term interactions, we inject rich context-sensitive encodings throughout our models, inspired by PACRR's (Hui et al., 2017) convolutional n-gram matching features, but extended in…

    Submitted 11 September, 2018; v1 submitted 5 September, 2018; originally announced September 2018.

    Comments: In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, 2018

  37. arXiv:1805.03871  [pdf, other]

    cs.CL

    Obligation and Prohibition Extraction Using Hierarchical RNNs

    Authors: Ilias Chalkidis, Ion Androutsopoulos, Achilleas Michos

    Abstract: We consider the task of detecting contractual obligations and prohibitions. We show that a self-attention mechanism improves the performance of a BILSTM classifier, the previous state of the art for this task, by allowing it to focus on indicative tokens. We also introduce a hierarchical BILSTM, which converts each sentence to an embedding, and processes the sentence embeddings to classify each se…

    Submitted 10 May, 2018; originally announced May 2018.

    Comments: 6 pages, short paper at ACL 2018

  38. arXiv:1709.06518  [pdf, other]

    cs.SI

    Identifying Retweetable Tweets with a Personalized Global Classifier

    Authors: Michail Vougioukas, Ion Androutsopoulos, Georgios Paliouras

    Abstract: In this paper we present a method to identify tweets that a user may find interesting enough to retweet. The method is based on a global, but personalized classifier, which is trained on data from several users, represented in terms of user-specific features. Thus, the method is trained on a sufficient volume of data, while also being able to make personalized decisions, i.e., the same post receiv…

    Submitted 21 August, 2017; originally announced September 2017.

    Comments: This is a long paper version of the extended abstract titled "A Personalized Global Filter To Predict Retweets", of the same authors, which was published in the 25th ACM UMAP conference in Bratislava, Slovakia, in July 2017

  39. arXiv:1708.03699  [pdf, other]

    cs.CL

    Improved Abusive Comment Moderation with User Embeddings

    Authors: John Pavlopoulos, Prodromos Malakasiotis, Juli Bakagianni, Ion Androutsopoulos

    Abstract: Experimenting with a dataset of approximately 1.6M user comments from a Greek news sports portal, we explore how a state of the art RNN-based moderation method can be improved by adding user embeddings, user type embeddings, user biases, or user type biases. We observe improvements in all cases, with user embeddings leading to the biggest performance gains.

    Submitted 11 August, 2017; originally announced August 2017.

  40. arXiv:1705.09993  [pdf, other]

    cs.CL cs.LG

    Deep Learning for User Comment Moderation

    Authors: John Pavlopoulos, Prodromos Malakasiotis, Ion Androutsopoulos

    Abstract: Experimenting with a new dataset of 1.6M user comments from a Greek news portal and existing datasets of English Wikipedia comments, we show that an RNN outperforms the previous state of the art in moderation. A deep, classification-specific attention mechanism improves further the overall performance of the RNN. We also compare against a CNN and a word-list baseline, considering both fully automa…

    Submitted 17 July, 2017; v1 submitted 28 May, 2017; originally announced May 2017.

  41. arXiv:1608.03905  [pdf, other]

    cs.IR

    Using Centroids of Word Embeddings and Word Mover's Distance for Biomedical Document Retrieval in Question Answering

    Authors: Georgios-Ioannis Brokos, Prodromos Malakasiotis, Ion Androutsopoulos

    Abstract: We propose a document retrieval method for question answering that represents documents and questions as weighted centroids of word embeddings and reranks the retrieved documents with a relaxation of Word Mover's Distance. Using biomedical questions and documents from BIOASQ, we show that our method is competitive with PUBMED. With a top-k approximation, our method is fast, and easily portable to…

    Submitted 12 August, 2016; originally announced August 2016.

    Comments: 5 pages, 4 images, presented at BioNLP 2016

  42. arXiv:1505.02251  [pdf, ps, other]

    cs.LG cs.CL cs.IR

    Probabilistic Cascading for Large Scale Hierarchical Classification

    Authors: Aris Kosmopoulos, Georgios Paliouras, Ion Androutsopoulos

    Abstract: Hierarchies are frequently used for the organization of objects. Given a hierarchy of classes, two main approaches are used to automatically classify new instances: flat classification and cascade classification. Flat classification ignores the hierarchy, while cascade classification greedily traverses the hierarchy from the root to the predicted leaf. In this paper we propose a new approach, whi…

    Submitted 9 May, 2015; originally announced May 2015.

  43. arXiv:1503.08581  [pdf, other]

    cs.IR cs.CL cs.LG

    LSHTC: A Benchmark for Large-Scale Text Classification

    Authors: Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, Patrick Galinari

    Abstract: LSHTC is a series of challenges which aims to assess the performance of classification systems in large-scale classification in a large number of classes (up to hundreds of thousands). This paper describes the datasets that have been released along the LSHTC series. The paper details the construction of the datasets and the design of the tracks as well as the evaluation measures that we implemente…

    Submitted 30 March, 2015; originally announced March 2015.

  44. arXiv:1405.6164  [pdf]

    cs.CL cs.AI

    Generating Natural Language Descriptions from OWL Ontologies: the NaturalOWL System

    Authors: Ion Androutsopoulos, Gerasimos Lampouras, Dimitrios Galanis

    Abstract: We present NaturalOWL, a natural language generation system that produces texts describing individuals or classes of OWL ontologies. Unlike simpler OWL verbalizers, which typically express a single axiom at a time in controlled, often not entirely fluent natural language primarily for the benefit of domain experts, we aim to generate fluent and coherent multi-sentence texts for end-users. With a s…

    Submitted 23 April, 2014; originally announced May 2014.

    Journal ref: Journal Of Artificial Intelligence Research, Volume 48, pages 671-715, 2013

  45. Evaluation Measures for Hierarchical Classification: a unified view and novel approaches

    Authors: Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier, Georgios Paliouras, Ion Androutsopoulos

    Abstract: Hierarchical classification addresses the problem of classifying items into a hierarchy of classes. An important issue in hierarchical classification is the evaluation of different classification algorithms, which is complicated by the hierarchical relations among the classes. Several evaluation measures have been proposed for hierarchical classification using the hierarchy in different ways. This…

    Submitted 1 July, 2013; v1 submitted 28 June, 2013; originally announced June 2013.

    Comments: Submitted to journal

  46. A Survey of Paraphrasing and Textual Entailment Methods

    Authors: Ion Androutsopoulos, Prodromos Malakasiotis

    Abstract: Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true.…

    Submitted 30 May, 2010; v1 submitted 18 December, 2009; originally announced December 2009.

    Comments: Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 2010

    ACM Class: I.2.7

    Journal ref: I. Androutsopoulos and P. Malakasiotis, "A Survey of Paraphrasing and Textual Entailment Methods". Journal of Artificial Intelligence Research, 38:135-187, 2010

  47. arXiv:cs/0306062  [pdf]

    cs.CL

    Learning to Order Facts for Discourse Planning in Natural Language Generation

    Authors: Aggeliki Dimitromanolaki, Ion Androutsopoulos

    Abstract: This paper presents a machine learning approach to discourse planning in natural language generation. More specifically, we address the problem of learning the most natural ordering of facts in discourse plans for a specific domain. We discuss our methodology and how it was instantiated using two different machine learning algorithms. A quantitative evaluation performed in the domain of museum e…

    Submitted 13 June, 2003; originally announced June 2003.

    Comments: 8 pages, 4 figures, 1 table

    ACM Class: H.5.2

    Journal ref: Proceedings of EACL 2003 Workshop on Natural Language Generation

  48. arXiv:cs/0205017  [pdf]

    cs.CL

    Ellogon: A New Text Engineering Platform

    Authors: Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Ion Androutsopoulos, Constantine D. Spyropoulos

    Abstract: This paper presents Ellogon, a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon was designed to aid both researchers in natural language processing and companies that produce language engineering systems for the end-user. Ellogon provides a powerful TIPSTER-based infrastructure for managing, storing and exchanging textual data, embedding and m…

    Submitted 13 May, 2002; originally announced May 2002.

    Comments: 7 pages, 9 figures. Will be presented to the Third International Conference on Language Resources and Evaluation - LREC 2002

    ACM Class: I.2.7

  49. arXiv:cs/0110057  [pdf]

    cs.CL cs.AI

    Generating Multilingual Personalized Descriptions of Museum Exhibits - The M-PIRO Project

    Authors: Ion Androutsopoulos, Vassiliki Kokkinaki, Aggeliki Dimitromanolaki, Jo Calder, Jon Oberlander, Elena Not

    Abstract: This paper provides an overall presentation of the M-PIRO project. M-PIRO is developing technology that will allow museums to automatically generate textual or spoken descriptions of exhibits for collections available over the Web or in virtual reality environments. The descriptions are generated in several languages from information in a language-independent database and small fragments of text…

    Submitted 29 October, 2001; originally announced October 2001.

    Comments: 15 pages. Presented at the 29th Conference on Computer Applications and Quantitative Methods in Archaeology, Gotland, Sweden, 2001. A version of the paper with higher quality images can be downloaded from: http://www.iit.demokritos.gr/~ionandr/caa_paper.pdf

    ACM Class: I.2.7; H.5.2; H.5.4; I.7.4

  50. arXiv:cs/0106040  [pdf]

    cs.CL cs.AI

    Stacking classifiers for anti-spam filtering of e-mail

    Authors: G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, P. Stamatopoulos

    Abstract: We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or "spam", floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the eff…

    Submitted 19 June, 2001; originally announced June 2001.

    ACM Class: H.4.3; I.2.6; I.2.7; I.5.4; K.4.1

    Journal ref: Proceedings of "Empirical Methods in Natural Language Processing" (EMNLP 2001), L. Lee and D. Harman (Eds.), pp. 44-50, Carnegie Mellon University, Pittsburgh, PA, 2001