Search | arXiv e-print repository

A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages

Authors: Lisa Raithel, Hui-Syuan Yeh, Shuntaro Yada, Cyril Grouin, Thomas Lavergne, Aurélie Névéol, Patrick Paroubek, Philippe Thomas, Tomohiro Nishiyama, Sebastian Möller, Eiji Aramaki, Yuji Matsumoto, Roland Roller, Pierre Zweigenbaum

Abstract: User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted at LREC-COLING 2024

arXiv:2310.11275

xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization

Authors: Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P. Schapranow

Abstract: Objective: To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English. Materials and Methods: We introduce xMEN, a modular system for cross-lingual medical entity normalization, which performs well in both low- and high-resource scenarios. When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder model if annotations for the target task are available. We also evaluate cross-encoders trained in a weakly supervised manner based on machine-translated datasets from a high resource domain. Our system is publicly available as an extensible Python toolkit. Results: xMEN improves the state-of-the-art performance across a wide range of multilingual benchmark datasets. Weakly supervised cross-encoders are effective when no training data is available for the target task. Through the compatibility of xMEN with the BigBIO framework, it can be easily used with existing and prospective datasets. Discussion: Our experiments show the importance of balancing the output of general-purpose candidate generators with subsequent trainable re-rankers, which we achieve through a rank regularization term in the loss function of the cross-encoder. However, error analysis reveals that multi-word expressions and other complex entities are still challenging. Conclusion: xMEN exhibits strong performance for medical entity normalization in multiple languages, even when no labeled data and few terminology aliases for the target language are available. Its configuration system and evaluation modules enable reproducible benchmarks. Models and code are available online at the following URL: https://github.com/hpi-dhc/xmen

Submitted 17 October, 2023; originally announced October 2023.

Comments: 16 pages, 3 figures

arXiv:2308.08827

Factuality Detection using Machine Translation -- a Use Case for German Clinical Text

Authors: Mohammed Bin Sumait, Aleksandra Gabryszak, Leonhard Hennig, Roland Roller

Abstract: Factuality can play an important role when automatically processing clinical text, as it makes a difference if particular symptoms are explicitly not present, possibly present, not mentioned, or affirmed. In most cases, a sufficient number of examples is necessary to handle such phenomena in a supervised machine learning setting. However, as clinical text might contain sensitive information, data cannot be easily shared. In the context of factuality detection, this work presents a simple solution using machine translation to translate English data to German to train a transformer-based factuality detection model.

Submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted at KONVENS 2023

arXiv:2301.00183

doi 10.1142/S021952592250014X

Modeling social resilience: Questions, answers, open problems

Authors: Frank Schweitzer, Georges Andres, Giona Casiraghi, Christoph Gote, Ramona Roller, Ingo Scholtes, Giacomo Vaccario, Christian Zingg

Abstract: Resilience denotes the capacity of a system to withstand shocks and its ability to recover from them. We develop a framework to quantify the resilience of highly volatile, non-equilibrium social organizations, such as collectives or collaborating teams. It consists of four steps: (i) delimitation, i.e., narrowing down the target systems, (ii) conceptualization, i.e., identifying how to approach social organizations, (iii) formal representation using a combination of agent-based and network models, (iv) operationalization, i.e., specifying measures and demonstrating how they enter the calculation of resilience. Our framework quantifies two dimensions of resilience, the robustness of social organizations and their adaptivity, and combines them in a novel resilience measure. It allows monitoring resilience instantaneously using longitudinal data instead of an ex-post evaluation.

Submitted 31 December, 2022; originally announced January 2023.

arXiv:2209.00262

Which anonymization technique is best for which NLP task? -- It depends. A Systematic Study on Clinical Text Processing

Authors: Iyadh Ben Cheikh Larbi, Aljoscha Burchardt, Roland Roller

Abstract: Clinical text processing has gained more and more attention in recent years. The access to sensitive patient data, on the other hand, is still a big challenge, as text cannot be shared without legal hurdles and without removing personal information. There are many techniques to modify or remove patient related information, each with different strengths. This paper investigates the influence of different anonymization techniques on the performance of ML models using multiple datasets corresponding to five different NLP tasks. Several learnings and recommendations are presented. This work confirms that particularly stronger anonymization techniques lead to a significant drop of performance. In addition to that, most of the presented techniques are not secure against a re-identification attack based on similarity search.

Submitted 1 September, 2022; originally announced September 2022.

arXiv:2208.02031

Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective

Authors: Lisa Raithel, Philippe Thomas, Roland Roller, Oliver Sapina, Sebastian Möller, Pierre Zweigenbaum

Abstract: In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. This and a high topic imbalance make it a very challenging dataset, since often, the same symptom can have several causes and is not always related to a medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.

Submitted 3 August, 2022; originally announced August 2022.

Comments: Accepted at LREC 2022

arXiv:2207.03885

A Medical Information Extraction Workbench to Process German Clinical Text

Authors: Roland Roller, Laura Seiffe, Ammer Ayach, Sebastian Möller, Oliver Marten, Michael Mikhailov, Christoph Alt, Danilo Schmidt, Fabian Halleck, Marcel Naik, Wiebke Duettmann, Klemens Budde

Abstract: Background: In the information extraction and natural language processing domain, accessible datasets are crucial to reproduce and compare results. Publicly available implementations and tools can serve as benchmark and facilitate the development of more complex applications. However, in the context of clinical text processing the number of accessible datasets is scarce -- and so is the number of existing tools. One of the main reasons is the sensitivity of the data. This problem is even more evident for non-English languages. Approach: In order to address this situation, we introduce a workbench: a collection of German clinical text processing models. The models are trained on a de-identified corpus of German nephrology reports. Result: The presented models provide promising results on in-domain data. Moreover, we show that our models can be also successfully applied to other biomedical text in German. Our workbench is made publicly available so it can be used out of the box, as a benchmark or transferred to related problems.

Submitted 15 August, 2022; v1 submitted 8 July, 2022; originally announced July 2022.

Comments: Paper under review since 2021

arXiv:2204.12810

When Performance is not Enough -- A Multidisciplinary View on Clinical Decision Support

Authors: Roland Roller, Klemens Budde, Aljoscha Burchardt, Peter Dabrock, Sebastian Möller, Bilgin Osmanodja, Simon Ronicke, David Samhammer, Sven Schmeier

Abstract: Scientific publications about machine learning in healthcare are often about implementing novel methods and boosting the performance - at least from a computer science perspective. However, beyond such often short-lived improvements, much more needs to be taken into consideration if we want to arrive at a sustainable progress in healthcare. What does it take to actually implement such a system, make it usable for the domain expert, and possibly bring it into practical usage? Targeted at Computer Scientists, this work presents a multidisciplinary view on machine learning in medical decision support systems and covers information technology, medical, as well as ethical aspects. Along with an implemented risk prediction system in nephrology, challenges and lessons learned in a pilot project are presented.

Submitted 27 April, 2022; originally announced April 2022.

Comments: Paper currently under review

arXiv:2005.11494

From Witch's Shot to Music Making Bones -- Resources for Medical Laymen to Technical Language and Vice Versa

Authors: Laura Seiffe, Oliver Marten, Michael Mikhailov, Sven Schmeier, Sebastian Möller, Roland Roller

Abstract: Many people share information in social media or forums, like food they eat, sports activities they do or events which have been visited. This also applies to information about a person's health status. Information we share online unveils directly or indirectly information about our lifestyle and health situation and thus provides a valuable data resource. If we can make advantage of that data, applications can be created that enable e.g. the detection of possible risk factors of diseases or adverse drug reactions of medications. However, as most people are not medical experts, language used might be more descriptive rather than the precise medical expression as medics do. To detect and use those relevant information, laymen language has to be translated and/or linked to the corresponding medical concept. This work presents baseline data sources in order to address this challenge for German. We introduce a new data set which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.

Submitted 23 May, 2020; originally announced May 2020.

Comments: In Proceedings of LREC 2020

arXiv:2004.03822

doi 10.1186/s13321-018-0319-2

SIA: A Scalable Interoperable Annotation Server for Biomedical Named Entities

Authors: Johannes Kirschnick, Philippe Thomas, Roland Roller, Leonhard Hennig

Abstract: Recent years showed a strong increase in biomedical sciences and an inherent increase in publication volume. Extraction of specific information from these sources requires highly sophisticated text mining and information extraction tools. However, the integration of freely available tools into customized workflows is often cumbersome and difficult. We describe SIA (Scalable Interoperable Annotation Server), our contribution to the BeCalm-Technical interoperability and performance of annotation servers (BeCalm-TIPS) task, a scalable, extensible, and robust annotation service. The system currently covers six named entity types (i.e., Chemicals, Diseases, Genes, miRNA, Mutations, and Organisms) and is freely available under Apache 2.0 license at https://github.com/Erechtheus/sia.

Submitted 8 April, 2020; originally announced April 2020.

Comments: 11 pages, 2 figures, published in Journal of Cheminformatics

Journal ref: J Cheminform 10, 63 (2018)

arXiv:1811.03809

Football and Beer - a Social Media Analysis on Twitter in Context of the FIFA Football World Cup 2018

Authors: Roland Roller, Philippe Thomas, Sven Schmeier

Abstract: In many societies alcohol is a legal and common recreational substance and socially accepted. Alcohol consumption often comes along with social events as it helps people to increase their sociability and to overcome their inhibitions. On the other hand we know that increased alcohol consumption can lead to serious health issues, such as cancer, cardiovascular diseases and diseases of the digestive system, to mention a few. This work examines alcohol consumption during the FIFA Football World Cup 2018, particularly the usage of alcohol related information on Twitter. For this we analyse the tweeting behaviour and show that the tournament strongly increases the interest in beer. Furthermore we show that countries who had to leave the tournament at early stage might have done something good to their fans as the interest in beer decreased again.

Submitted 9 November, 2018; originally announced November 2018.

Journal ref: In proceedings of Social Media Mining for Health Applications (SMM4H) @ EMNLP 2018

arXiv:1805.01646

Cross-lingual Candidate Search for Biomedical Concept Normalization

Authors: Roland Roller, Madeleine Kittner, Dirk Weissenborn, Ulf Leser

Abstract: Biomedical concept normalization links concept mentions in texts to a semantically equivalent concept in a biomedical knowledge base. This task is challenging as concepts can have different expressions in natural languages, e.g. paraphrases, which are not necessarily all present in the knowledge base. Concept normalization of non-English biomedical text is even more challenging as non-English resources tend to be much smaller and contain less synonyms. To overcome the limitations of non-English terminologies we propose a cross-lingual candidate search for concept normalization using a character-based neural translation model trained on a multilingual biomedical terminology. Our model is trained with Spanish, French, Dutch and German versions of UMLS. The evaluation of our model is carried out on the French Quaero corpus, showing that it outperforms most teams of CLEF eHealth 2015 and 2016. Additionally, we compare performance to commercial translators on Spanish, French, Dutch and German versions of Mantra. Our model performs similarly well, but is free of charge and can be run locally. This is particularly important for clinical NLP applications as medical documents underlay strict privacy restrictions.

Submitted 4 May, 2018; originally announced May 2018.

arXiv:1710.11154

Creation of an Annotated Corpus of Spanish Radiology Reports

Authors: Viviana Cotik, Darío Filippo, Roland Roller, Hans Uszkoreit, Feiyu Xu

Abstract: This paper presents a new annotated corpus of 513 anonymized radiology reports written in Spanish. Reports were manually annotated with entities, negation and uncertainty terms and relations. The corpus was conceived as an evaluation resource for named entity recognition and relation extraction algorithms, and as input for the use of supervised methods. Biomedical annotated resources are scarce due to confidentiality issues and associated costs. This work provides some guidelines that could help other researchers to undertake similar tasks.

Submitted 30 October, 2017; originally announced October 2017.

Comments: WiNLP Workshop ACL

arXiv:1509.03739

Improving distant supervision using inference learning

Authors: Roland Roller, Eneko Agirre, Aitor Soroa, Mark Stevenson

Abstract: Distant supervision is a widely applied approach to automatic training of relation extraction systems and has the advantage that it can generate large amounts of labelled data with minimal effort. However, this data may contain errors and consequently systems trained using distant supervision tend not to perform as well as those based on manually labelled data. This work proposes a novel method for detecting potential false negative training examples using a knowledge inference method. Results show that our approach improves the performance of relation extraction systems trained using distantly supervised data.

Submitted 12 September, 2015; originally announced September 2015.

Comments: In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Showing 1–14 of 14 results for author: Roller, R