
Showing 1–34 of 34 results for author: Ruas, T

  1. arXiv:2407.03192  [pdf, other]

    cs.DL cs.CL

    CiteAssist: A System for Automated Preprint Citation and BibTeX Generation

    Authors: Lars Benedikt Kaesberg, Terry Ruas, Jan Philip Wahle, Bela Gipp

    Abstract: We present CiteAssist, a system to automate the generation of BibTeX entries for preprints, streamlining the process of bibliographic annotation. Our system extracts metadata, such as author names, titles, publication dates, and keywords, to create standardized annotations within the document. CiteAssist automatically attaches the BibTeX citation to the end of a PDF and links it on the first page…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Published at SDProc @ ACL 2024

  2. arXiv:2407.02302  [pdf, other]

    cs.CL

    Towards Human Understanding of Paraphrase Types in ChatGPT

    Authors: Dominik Meier, Jan Philip Wahle, Terry Ruas, Bela Gipp

    Abstract: Paraphrases represent a human's intuitive ability to understand expressions presented in various different ways. Current paraphrase evaluations of language models primarily use binary approaches, offering limited interpretability of specific text changes. Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes and offer a granular view of the flexibility in linguistic…

    Submitted 2 July, 2024; originally announced July 2024.

    ACM Class: I.2.7

  3. arXiv:2406.19898  [pdf, other]

    cs.CL

    Paraphrase Types Elicit Prompt Engineering Capabilities

    Authors: Jan Philip Wahle, Terry Ruas, Yang Xu, Bela Gipp

    Abstract: Much of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely unknown how variations in the linguistic expression of prompts affect these models. This study systematically and empirically evaluates which linguistic features influence models through paraphrase types, i.e., different linguistic changes at particular positions…

    Submitted 28 June, 2024; originally announced June 2024.

  4. arXiv:2406.07494  [pdf, other]

    cs.CL cs.AI

    CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization

    Authors: Frederic Kirstein, Jan Philip Wahle, Bela Gipp, Terry Ruas

    Abstract: Abstractive dialogue summarization is the task of distilling conversations into informative and concise summaries. Although reviews have been conducted on this topic, there is a lack of comprehensive work detailing the challenges of dialogue summarization, unifying the differing understanding of the task, and aligning proposed techniques, datasets, and evaluation metrics with the challenges. This…

    Submitted 12 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

  5. arXiv:2405.15604  [pdf, other]

    cs.CL

    Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges

    Authors: Jonas Becker, Jan Philip Wahle, Bela Gipp, Terry Ruas

    Abstract: Text generation has become more accessible than ever, and the increasing interest in these systems, especially those using large language models, has spurred an increasing number of related publications. We provide a systematic literature review comprising 244 selected papers between 2017 and 2024. This review categorizes works in text generation into five main tasks: open-ended text generation, s…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 35 pages, 2 figures, 2 tables, Under review

    ACM Class: A.1; I.2.7

  6. arXiv:2404.11124  [pdf, other]

    cs.CL cs.AI

    What's under the hood: Investigating Automatic Metrics on Meeting Summarization

    Authors: Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp

    Abstract: Meeting summarization has become a critical task considering the increase in online interactions. While new techniques are introduced regularly, their evaluation uses metrics not designed to capture meeting-specific errors, undermining effective evaluation. This paper investigates what the frequently used automatic metrics capture and which errors they mask by correlating automatic metric scores w…

    Submitted 17 April, 2024; originally announced April 2024.

  7. arXiv:2403.07910  [pdf, other]

    cs.CY cs.CL

    MAGPIE: Multi-Task Media-Bias Analysis Generalization for Pre-Trained Identification of Expressions

    Authors: Tomáš Horych, Martin Wessel, Jan Philip Wahle, Terry Ruas, Jerome Waßmuth, André Greiner-Petter, Akiko Aizawa, Bela Gipp, Timo Spinde

    Abstract: Media bias detection poses a complex, multifaceted problem traditionally tackled using single-task models and small in-domain datasets, consequently lacking generalizability. To address this, we introduce MAGPIE, the first large-scale multi-task pre-training approach explicitly tailored for media bias detection. To enable pre-training at scale, we present Large Bias Mixture (LBM), a compilation of…

    Submitted 15 March, 2024; v1 submitted 26 February, 2024; originally announced March 2024.

  8. arXiv:2402.12046  [pdf, other]

    cs.DL cs.CL

    Citation Amnesia: NLP and Other Academic Fields Are in a Citation Age Recession

    Authors: Jan Philip Wahle, Terry Ruas, Mohamed Abdalla, Bela Gipp, Saif M. Mohammad

    Abstract: This study examines the tendency to cite older work across 20 fields of study over 43 years (1980--2023). We put NLP's propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to these other fields over time or whether differences can be observed. Our analysis, based on a dataset of approximately 240 million papers, revea…

    Submitted 19 February, 2024; originally announced February 2024.

  9. arXiv:2312.16148  [pdf, other]

    cs.CL

    The Media Bias Taxonomy: A Systematic Literature Review on the Forms and Automated Detection of Media Bias

    Authors: Timo Spinde, Smi Hinterreiter, Fabian Haak, Terry Ruas, Helge Giese, Norman Meuschke, Bela Gipp

    Abstract: The way the media presents events can significantly affect public perception, which in turn can alter people's beliefs and views. Media bias describes a one-sided or polarizing perspective on a topic. This article summarizes the research on computational methods to detect media bias by systematically reviewing 3140 research papers published between 2019 and 2022. To structure our review and suppor…

    Submitted 10 January, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

  10. We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields

    Authors: Jan Philip Wahle, Terry Ruas, Mohamed Abdalla, Bela Gipp, Saif M. Mohammad

    Abstract: Natural Language Processing (NLP) is poised to substantially influence the world. However, significant progress comes hand-in-hand with substantial risks. Addressing them requires broad engagement with various fields of study. Yet, little empirical work examines the state of such engagement (past or current). In this paper, we quantify the degree of influence between 23 fields of study and NLP (on…

    Submitted 1 July, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Published at EMNLP 2023

    Journal ref: EMNLP 2023

  11. Paraphrase Types for Generation and Detection

    Authors: Jan Philip Wahle, Bela Gipp, Terry Ruas

    Abstract: Current approaches in paraphrase generation and detection heavily rely on a single general similarity score, ignoring the intricate linguistic properties of language. This paper introduces two new tasks to address this shortcoming by considering paraphrase types - specific linguistic perturbations at particular text positions. We name these tasks Paraphrase Type Generation and Paraphrase Type Dete…

    Submitted 1 July, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Published at EMNLP 2023

    Journal ref: EMNLP 2023

  12. The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research

    Authors: Mohamed Abdalla, Jan Philip Wahle, Terry Ruas, Aurélie Névéol, Fanny Ducel, Saif M. Mohammad, Karën Fort

    Abstract: Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. As one of the big players in the field of NLP, together with governments and universities, it is important to track the influence of industry on research. In this study, we seek to quantify and characterize industry presence…

    Submitted 1 July, 2024; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: Published at ACL 2023

    Journal ref: ACL 2023

  13. Introducing MBIB -- the first Media Bias Identification Benchmark Task and Dataset Collection

    Authors: Martin Wessel, Tomáš Horych, Terry Ruas, Akiko Aizawa, Bela Gipp, Timo Spinde

    Abstract: Although media bias detection is a complex multi-task problem, there is, to date, no unified benchmark grouping these evaluation tasks. We introduce the Media Bias Identification Benchmark (MBIB), a comprehensive benchmark that groups different types of media bias (e.g., linguistic, cognitive, political) under a common framework to test how prospective detection techniques generalize. After review…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: To be published in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23)

  14. arXiv:2303.13989  [pdf, other]

    cs.CL cs.AI

    Paraphrase Detection: Human vs. Machine Content

    Authors: Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

    Abstract: The growing prominence of large language models, such as GPT-4 and ChatGPT, has led to increased concerns over academic integrity due to the potential for machine-generated content and paraphrasing. Although studies have explored the detection of human- and machine-paraphrased content, the comparison between these types of content remains underexplored. In this paper, we conduct a comprehensive an…

    Submitted 24 March, 2023; originally announced March 2023.

  15. arXiv:2303.03886  [pdf, other]

    cs.CY

    AI Usage Cards: Responsibly Reporting AI-generated Content

    Authors: Jan Philip Wahle, Terry Ruas, Saif M. Mohammad, Norman Meuschke, Bela Gipp

    Abstract: Given AI systems like ChatGPT can generate content that is indistinguishable from human-made work, the responsible use of this technology is a growing concern. Although understanding the benefits and harms of using AI systems requires more time, their rapid and indiscriminate adoption in practice is a reality. Currently, we lack a common framework and language to define and report the responsible…

    Submitted 9 May, 2023; v1 submitted 16 February, 2023; originally announced March 2023.

  16. Exploiting Transformer-based Multitask Learning for the Detection of Media Bias in News Articles

    Authors: Timo Spinde, Jan-David Krieger, Terry Ruas, Jelena Mitrović, Franz Götz-Hahn, Akiko Aizawa, Bela Gipp

    Abstract: Media has a substantial impact on the public perception of events. A one-sided or polarizing perspective on any topic is usually described as media bias. One of the ways how bias in news articles can be introduced is by altering word choice. Biased word choices are not always obvious, nor do they exhibit high context-dependency. Hence, detecting bias is often difficult. We propose a Transformer-ba…

    Submitted 7 November, 2022; originally announced November 2022.

    Journal ref: Proceedings of the iConference 2022

  17. Analyzing Multi-Task Learning for Abstractive Text Summarization

    Authors: Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp

    Abstract: Despite the recent success of multi-task learning and pre-finetuning for natural language understanding, few works have studied the effects of task families on abstractive text summarization. Task families are a form of task grouping during the pre-finetuning stage to learn common skills, such as reading comprehension. To close this gap, we analyze the influence of multi-task learning strategies u…

    Submitted 10 November, 2022; v1 submitted 26 October, 2022; originally announced October 2022.

    Journal ref: EMNLP-GEM 2022

  18. arXiv:2210.06878  [pdf, other]

    cs.CL cs.DL

    CS-Insights: A System for Analyzing Computer Science Research

    Authors: Terry Ruas, Jan Philip Wahle, Lennart Küll, Saif M. Mohammad, Bela Gipp

    Abstract: This paper presents CS-Insights, an interactive web application to analyze computer science publications from DBLP through multiple perspectives. The dedicated interfaces allow its users to identify trends in research activity, productivity, accessibility, author's productivity, venues' statistics, topics of interest, and the impact of computer science research on other fields. CS-Insights is publi…

    Submitted 29 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

  19. How Large Language Models are Transforming Machine-Paraphrased Plagiarism

    Authors: Jan Philip Wahle, Terry Ruas, Frederic Kirstein, Bela Gipp

    Abstract: The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still developing in the literature. This work explores T5 and GPT-3 for machine-…

    Submitted 10 November, 2022; v1 submitted 7 October, 2022; originally announced October 2022.

    Journal ref: EMNLP 2022

  20. Neural Media Bias Detection Using Distant Supervision With BABE -- Bias Annotations By Experts

    Authors: Timo Spinde, Manuel Plank, Jan-David Krieger, Terry Ruas, Bela Gipp, Akiko Aizawa

    Abstract: Media coverage has a substantial effect on the public perception of events. Nevertheless, media outlets are often biased. One way to bias news articles is by altering the word choice. The automatic identification of bias by word choice is challenging, primarily due to the lack of a gold standard data set and high context dependencies. This paper presents BABE, a robust and diverse data set created…

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: substantial text overlap with Ph.D. proposal by same author, part of dissertation arXiv:2112.13352

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2021

  21. A Domain-adaptive Pre-training Approach for Language Bias Detection in News

    Authors: Jan-David Krieger, Timo Spinde, Terry Ruas, Juhi Kulshrestha, Bela Gipp

    Abstract: Media bias is a multi-faceted construct influencing individual behavior and collective decision-making. Slanted news reporting is the result of one-sided and polarized writing which can occur in various forms. In this work, we focus on an important form of media bias, i.e. bias by word choice. Detecting biased word choices is a challenging task due to its linguistic complexity and the lack of repr…

    Submitted 22 May, 2022; originally announced May 2022.

    Journal ref: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries 2022 (JCDL)

  22. arXiv:2204.13384  [pdf, other]

    cs.DL cs.CL

    D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research

    Authors: Jan Philip Wahle, Terry Ruas, Saif M. Mohammad, Bela Gipp

    Abstract: DBLP is the largest open-access repository of scientific articles on computer science and provides metadata associated with publications, authors, and venues. We retrieved more than 6 million publications from DBLP and extracted pertinent metadata (e.g., abstracts, author affiliations, citations) from the publication texts to create the DBLP Discovery Dataset (D3). D3 can be used to identify trend…

    Submitted 10 November, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

    Journal ref: LREC 2022

  23. arXiv:2203.14541  [pdf, other]

    cs.IR cs.CL

    Specialized Document Embeddings for Aspect-based Similarity of Research Papers

    Authors: Malte Ostendorff, Till Blume, Terry Ruas, Bela Gipp, Georg Rehm

    Abstract: Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity that ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Accepted for publication at JCDL 2022

  24. Detecting Cross-Language Plagiarism using Open Knowledge Graphs

    Authors: Johannes Stegmüller, Fabian Bauer-Marquart, Norman Meuschke, Terry Ruas, Moritz Schubotz, Bela Gipp

    Abstract: Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Opposed to other methods, CL-OSA does not require compu…

    Submitted 16 December, 2021; v1 submitted 18 November, 2021; originally announced November 2021.

    Comments: 10 pages, EEKE21, Preprint

  25. Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection

    Authors: Jan Philip Wahle, Nischal Ashok, Terry Ruas, Norman Meuschke, Tirthankar Ghosal, Bela Gipp

    Abstract: A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers proposed many methods for flagging online misinformation related to COVID-19. However, these methods predominantly target specific content types (e.g., n…

    Submitted 10 November, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Journal ref: iConference 2022

  26. arXiv:2106.07967  [pdf, other]

    cs.CL cs.AI

    Incorporating Word Sense Disambiguation in Neural Language Models

    Authors: Jan Philip Wahle, Terry Ruas, Norman Meuschke, Bela Gipp

    Abstract: We present two supervised (pre-)training methods to incorporate gloss definitions from lexical resources into neural language models (LMs). The training improves our models' performance for Word Sense Disambiguation (WSD) but also benefits general language understanding tasks while adding almost no parameters. We evaluate our techniques with seven different neural LMs and find that XLNet is more s…

    Submitted 15 March, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

  27. arXiv:2104.13841  [pdf, other]

    cs.CL cs.IR

    Evaluating Document Representations for Content-based Legal Literature Recommendations

    Authors: Malte Ostendorff, Elliott Ash, Terry Ruas, Bela Gipp, Julian Moreno-Schneider, Georg Rehm

    Abstract: Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. Simultaneously, legal recommender systems are typically evaluated in small-scale user study without any public available benchmark datase…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: Accepted for publication at ICAIL 2021

  28. Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection

    Authors: Jan Philip Wahle, Terry Ruas, Norman Meuschke, Bela Gipp

    Abstract: The rise of language models such as BERT allows for high-quality text paraphrasing. This is a problem to academic integrity, as it is difficult to differentiate between original and machine-generated content. We propose a benchmark consisting of paraphrased articles using recent language models relying on the Transformer architecture. Our contribution fosters future research of paraphrase detectio…

    Submitted 10 November, 2022; v1 submitted 23 March, 2021; originally announced March 2021.

    Journal ref: JCDL 2021

  29. Identifying Machine-Paraphrased Plagiarism

    Authors: Jan Philip Wahle, Terry Ruas, Tomáš Foltýnek, Norman Meuschke, Bela Gipp

    Abstract: Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine-learning classifiers and eight state-of-the-art neural language models. We analyzed preprints of research papers, graduation theses, and Wikipedia article…

    Submitted 25 February, 2023; v1 submitted 22 March, 2021; originally announced March 2021.

    Journal ref: iConference 2022

  30. Enhanced word embeddings using multi-semantic representation through lexical chains

    Authors: Terry Ruas, Charles Henrique Porto Ferreira, William Grosky, Fabrício Olivetti de França, Débora Maria Rossi Medeiros

    Abstract: The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of…

    Submitted 19 December, 2022; v1 submitted 22 January, 2021; originally announced January 2021.

    Journal ref: Information Sciences. Volume 532, September 2020, Pages 16-32

  31. Multi-sense embeddings through a word sense disambiguation process

    Authors: Terry Ruas, William Grosky, Akiko Aizawa

    Abstract: Natural Language Understanding has seen an increasing number of publications in the last few years, especially after robust word embeddings models became prominent, when they proved themselves able to capture and represent semantic relationships from massive amounts of data. Nevertheless, traditional models often fall short in intrinsic issues of linguistics, such as polysemy and homonymy. Any exp…

    Submitted 19 December, 2022; v1 submitted 21 January, 2021; originally announced January 2021.

    Journal ref: Expert Systems with Applications. Volume 136, 1 December 2019, Pages 288-303

  32. arXiv:2010.06395  [pdf, other]

    cs.CL cs.IR

    Aspect-based Document Similarity for Research Papers

    Authors: Malte Ostendorff, Terry Ruas, Till Blume, Bela Gipp, Georg Rehm

    Abstract: Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classifi…

    Submitted 13 October, 2020; originally announced October 2020.

    Comments: Accepted for publication at COLING 2020

  33. arXiv:2003.09881  [pdf, other]

    cs.DL cs.CL cs.IR

    Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles

    Authors: Malte Ostendorff, Terry Ruas, Moritz Schubotz, Georg Rehm, Bela Gipp

    Abstract: Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between do…

    Submitted 22 March, 2020; originally announced March 2020.

    Comments: Accepted at ACM/IEEE Joint Conference on Digital Libraries (JCDL 2020)

  34. arXiv:1905.08359  [pdf, other]

    cs.DL cs.AI cs.IR

    Why Machines Cannot Learn Mathematics, Yet

    Authors: André Greiner-Petter, Terry Ruas, Moritz Schubotz, Akiko Aizawa, William Grosky, Bela Gipp

    Abstract: Nowadays, Machine Learning (ML) is seen as the universal solution to improve the effectiveness of information retrieval (IR) methods. However, while mathematics is a precise and accurate science, it is usually expressed by less accurate and imprecise descriptions, contributing to the relative dearth of machine learning applications for IR in this domain. Generally, mathematical documents communica…

    Submitted 20 May, 2019; originally announced May 2019.

    Comments: Submitted to 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries colocated at the 42nd International ACM SIGIR Conference

    Journal ref: 2019 http://ceur-ws.org/Vol-2414/paper14.pdf