-
A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
Authors:
Lisa Raithel,
Hui-Syuan Yeh,
Shuntaro Yada,
Cyril Grouin,
Thomas Lavergne,
Aurélie Névéol,
Patrick Paroubek,
Philippe Thomas,
Tomohiro Nishiyama,
Sebastian Möller,
Eiji Aramaki,
Yuji Matsumoto,
Roland Roller,
Pierre Zweigenbaum
Abstract:
User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.
Submitted 27 March, 2024;
originally announced March 2024.
-
xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization
Authors:
Florian Borchert,
Ignacio Llorca,
Roland Roller,
Bert Arnrich,
Matthieu-P. Schapranow
Abstract:
Objective: To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English.
Materials and Methods: We introduce xMEN, a modular system for cross-lingual medical entity normalization, which performs well in both low- and high-resource scenarios. When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder model if annotations for the target task are available. We also evaluate cross-encoders trained in a weakly supervised manner based on machine-translated datasets from a high resource domain. Our system is publicly available as an extensible Python toolkit.
Results: xMEN improves the state-of-the-art performance across a wide range of multilingual benchmark datasets. Weakly supervised cross-encoders are effective when no training data is available for the target task. Through the compatibility of xMEN with the BigBIO framework, it can be easily used with existing and prospective datasets.
Discussion: Our experiments show the importance of balancing the output of general-purpose candidate generators with subsequent trainable re-rankers, which we achieve through a rank regularization term in the loss function of the cross-encoder. However, error analysis reveals that multi-word expressions and other complex entities are still challenging.
Conclusion: xMEN exhibits strong performance for medical entity normalization in multiple languages, even when no labeled data and few terminology aliases for the target language are available. Its configuration system and evaluation modules enable reproducible benchmarks. Models and code are available online at the following URL: https://github.com/hpi-dhc/xmen
Submitted 17 October, 2023;
originally announced October 2023.
-
Factuality Detection using Machine Translation -- a Use Case for German Clinical Text
Authors:
Mohammed Bin Sumait,
Aleksandra Gabryszak,
Leonhard Hennig,
Roland Roller
Abstract:
Factuality can play an important role when automatically processing clinical text, as it makes a difference if particular symptoms are explicitly not present, possibly present, not mentioned, or affirmed. In most cases, a sufficient number of examples is necessary to handle such phenomena in a supervised machine learning setting. However, as clinical text might contain sensitive information, data cannot be easily shared. In the context of factuality detection, this work presents a simple solution using machine translation to translate English data to German to train a transformer-based factuality detection model.
Submitted 17 August, 2023;
originally announced August 2023.
-
Modeling social resilience: Questions, answers, open problems
Authors:
Frank Schweitzer,
Georges Andres,
Giona Casiraghi,
Christoph Gote,
Ramona Roller,
Ingo Scholtes,
Giacomo Vaccario,
Christian Zingg
Abstract:
Resilience denotes the capacity of a system to withstand shocks and its ability to recover from them. We develop a framework to quantify the resilience of highly volatile, non-equilibrium social organizations, such as collectives or collaborating teams. It consists of four steps: (i) delimitation, i.e., narrowing down the target systems, (ii) conceptualization, i.e., identifying how to approach social organizations, (iii) formal representation using a combination of agent-based and network models, (iv) operationalization, i.e., specifying measures and demonstrating how they enter the calculation of resilience. Our framework quantifies two dimensions of resilience, the robustness of social organizations and their adaptivity, and combines them in a novel resilience measure. It allows monitoring resilience instantaneously using longitudinal data instead of an ex-post evaluation.
Submitted 31 December, 2022;
originally announced January 2023.
-
Which anonymization technique is best for which NLP task? -- It depends. A Systematic Study on Clinical Text Processing
Authors:
Iyadh Ben Cheikh Larbi,
Aljoscha Burchardt,
Roland Roller
Abstract:
Clinical text processing has gained more and more attention in recent years. The access to sensitive patient data, on the other hand, is still a big challenge, as text cannot be shared without legal hurdles and without removing personal information. There are many techniques to modify or remove patient related information, each with different strengths. This paper investigates the influence of different anonymization techniques on the performance of ML models using multiple datasets corresponding to five different NLP tasks. Several learnings and recommendations are presented. This work confirms that particularly stronger anonymization techniques lead to a significant drop of performance. In addition to that, most of the presented techniques are not secure against a re-identification attack based on similarity search.
Submitted 1 September, 2022;
originally announced September 2022.
-
Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective
Authors:
Lisa Raithel,
Philippe Thomas,
Roland Roller,
Oliver Sapina,
Sebastian Möller,
Pierre Zweigenbaum
Abstract:
In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. This and a high topic imbalance make it a very challenging dataset, since often, the same symptom can have several causes and is not always related to a medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.
Submitted 3 August, 2022;
originally announced August 2022.
-
A Medical Information Extraction Workbench to Process German Clinical Text
Authors:
Roland Roller,
Laura Seiffe,
Ammer Ayach,
Sebastian Möller,
Oliver Marten,
Michael Mikhailov,
Christoph Alt,
Danilo Schmidt,
Fabian Halleck,
Marcel Naik,
Wiebke Duettmann,
Klemens Budde
Abstract:
Background: In the information extraction and natural language processing domain, accessible datasets are crucial to reproduce and compare results. Publicly available implementations and tools can serve as benchmark and facilitate the development of more complex applications. However, in the context of clinical text processing the number of accessible datasets is scarce -- and so is the number of existing tools. One of the main reasons is the sensitivity of the data. This problem is even more evident for non-English languages.
Approach: In order to address this situation, we introduce a workbench: a collection of German clinical text processing models. The models are trained on a de-identified corpus of German nephrology reports.
Result: The presented models provide promising results on in-domain data. Moreover, we show that our models can also be successfully applied to other biomedical text in German. Our workbench is made publicly available so it can be used out of the box, as a benchmark or transferred to related problems.
Submitted 15 August, 2022; v1 submitted 8 July, 2022;
originally announced July 2022.
-
When Performance is not Enough -- A Multidisciplinary View on Clinical Decision Support
Authors:
Roland Roller,
Klemens Budde,
Aljoscha Burchardt,
Peter Dabrock,
Sebastian Möller,
Bilgin Osmanodja,
Simon Ronicke,
David Samhammer,
Sven Schmeier
Abstract:
Scientific publications about machine learning in healthcare are often about implementing novel methods and boosting the performance - at least from a computer science perspective. However, beyond such often short-lived improvements, much more needs to be taken into consideration if we want to arrive at a sustainable progress in healthcare. What does it take to actually implement such a system, make it usable for the domain expert, and possibly bring it into practical usage? Targeted at Computer Scientists, this work presents a multidisciplinary view on machine learning in medical decision support systems and covers information technology, medical, as well as ethical aspects. Along with an implemented risk prediction system in nephrology, challenges and lessons learned in a pilot project are presented.
Submitted 27 April, 2022;
originally announced April 2022.
-
From Witch's Shot to Music Making Bones -- Resources for Medical Laymen to Technical Language and Vice Versa
Authors:
Laura Seiffe,
Oliver Marten,
Michael Mikhailov,
Sven Schmeier,
Sebastian Möller,
Roland Roller
Abstract:
Many people share information in social media or forums, like food they eat, sports activities they do or events which have been visited. This also applies to information about a person's health status. Information we share online unveils directly or indirectly information about our lifestyle and health situation and thus provides a valuable data resource. If we can take advantage of that data, applications can be created that enable e.g. the detection of possible risk factors of diseases or adverse drug reactions of medications. However, as most people are not medical experts, the language used might be more descriptive than the precise medical expressions that medics use. To detect and use this relevant information, laymen language has to be translated and/or linked to the corresponding medical concept. This work presents baseline data sources in order to address this challenge for German. We introduce a new data set which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.
Submitted 23 May, 2020;
originally announced May 2020.
-
SIA: A Scalable Interoperable Annotation Server for Biomedical Named Entities
Authors:
Johannes Kirschnick,
Philippe Thomas,
Roland Roller,
Leonhard Hennig
Abstract:
Recent years showed a strong increase in biomedical sciences and an inherent increase in publication volume. Extraction of specific information from these sources requires highly sophisticated text mining and information extraction tools. However, the integration of freely available tools into customized workflows is often cumbersome and difficult. We describe SIA (Scalable Interoperable Annotation Server), our contribution to the BeCalm-Technical interoperability and performance of annotation servers (BeCalm-TIPS) task, a scalable, extensible, and robust annotation service. The system currently covers six named entity types (i.e., Chemicals, Diseases, Genes, miRNA, Mutations, and Organisms) and is freely available under Apache 2.0 license at https://github.com/Erechtheus/sia.
Submitted 8 April, 2020;
originally announced April 2020.
-
Football and Beer - a Social Media Analysis on Twitter in Context of the FIFA Football World Cup 2018
Authors:
Roland Roller,
Philippe Thomas,
Sven Schmeier
Abstract:
In many societies alcohol is a legal and common recreational substance and socially accepted. Alcohol consumption often comes along with social events as it helps people to increase their sociability and to overcome their inhibitions. On the other hand we know that increased alcohol consumption can lead to serious health issues, such as cancer, cardiovascular diseases and diseases of the digestive system, to mention a few. This work examines alcohol consumption during the FIFA Football World Cup 2018, particularly the usage of alcohol related information on Twitter. For this we analyse the tweeting behaviour and show that the tournament strongly increases the interest in beer. Furthermore we show that countries that had to leave the tournament at an early stage might have done something good for their fans as the interest in beer decreased again.
Submitted 9 November, 2018;
originally announced November 2018.
-
Cross-lingual Candidate Search for Biomedical Concept Normalization
Authors:
Roland Roller,
Madeleine Kittner,
Dirk Weissenborn,
Ulf Leser
Abstract:
Biomedical concept normalization links concept mentions in texts to a semantically equivalent concept in a biomedical knowledge base. This task is challenging as concepts can have different expressions in natural languages, e.g. paraphrases, which are not necessarily all present in the knowledge base. Concept normalization of non-English biomedical text is even more challenging as non-English resources tend to be much smaller and contain fewer synonyms. To overcome the limitations of non-English terminologies we propose a cross-lingual candidate search for concept normalization using a character-based neural translation model trained on a multilingual biomedical terminology. Our model is trained with Spanish, French, Dutch and German versions of UMLS. The evaluation of our model is carried out on the French Quaero corpus, showing that it outperforms most teams of CLEF eHealth 2015 and 2016. Additionally, we compare performance to commercial translators on Spanish, French, Dutch and German versions of Mantra. Our model performs similarly well, but is free of charge and can be run locally. This is particularly important for clinical NLP applications as medical documents are subject to strict privacy restrictions.
Submitted 4 May, 2018;
originally announced May 2018.
-
Creation of an Annotated Corpus of Spanish Radiology Reports
Authors:
Viviana Cotik,
Darío Filippo,
Roland Roller,
Hans Uszkoreit,
Feiyu Xu
Abstract:
This paper presents a new annotated corpus of 513 anonymized radiology reports written in Spanish. Reports were manually annotated with entities, negation and uncertainty terms and relations. The corpus was conceived as an evaluation resource for named entity recognition and relation extraction algorithms, and as input for the use of supervised methods. Biomedical annotated resources are scarce due to confidentiality issues and associated costs. This work provides some guidelines that could help other researchers to undertake similar tasks.
Submitted 30 October, 2017;
originally announced October 2017.
-
Improving distant supervision using inference learning
Authors:
Roland Roller,
Eneko Agirre,
Aitor Soroa,
Mark Stevenson
Abstract:
Distant supervision is a widely applied approach to automatic training of relation extraction systems and has the advantage that it can generate large amounts of labelled data with minimal effort. However, this data may contain errors and consequently systems trained using distant supervision tend not to perform as well as those based on manually labelled data. This work proposes a novel method for detecting potential false negative training examples using a knowledge inference method. Results show that our approach improves the performance of relation extraction systems trained using distantly supervised data.
Submitted 12 September, 2015;
originally announced September 2015.