Search | arXiv e-print repository

Detecting AI-Generated Text: Factors Influencing Detectability with Current Methods

Authors: Kathleen C. Fraser, Hillary Dawkins, Svetlana Kiritchenko

Abstract: Large language models (LLMs) have advanced to a point that even humans have difficulty discerning whether a text was generated by another human, or by a computer. However, knowing whether a text was produced by human or artificial intelligence (AI) is important to determining its trustworthiness, and has applications in many domains including detecting fraud and academic dishonesty, as well as com… ▽ More Large language models (LLMs) have advanced to a point that even humans have difficulty discerning whether a text was generated by another human, or by a computer. However, knowing whether a text was produced by human or artificial intelligence (AI) is important to determining its trustworthiness, and has applications in many domains including detecting fraud and academic dishonesty, as well as combating the spread of misinformation and political propaganda. The task of AI-generated text (AIGT) detection is therefore both very challenging, and highly critical. In this survey, we summarize state-of-the art approaches to AIGT detection, including watermarking, statistical and stylistic analysis, and machine learning classification. We also provide information about existing datasets for this task. Synthesizing the research findings, we aim to provide insight into the salient factors that combine to determine how "detectable" AIGT text is under different scenarios, and to make practical recommendations for future work towards this significant technical and societal challenge. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2405.20152 [pdf, other]

Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals

Authors: Phillip Howard, Kathleen C. Fraser, Anahita Bhiwandiwalla, Svetlana Kiritchenko

Abstract: With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined… ▽ More With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different models under this counterfactual generation setting at scale, producing over 57 million responses from popular LVLMs. Our multi-dimensional analysis reveals that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence the generation of toxic content, competency-associated words, harmful stereotypes, and numerical ratings of depicted individuals. We additionally explore the relationship between social bias in LVLMs and their corresponding LLMs, as well as inference-time strategies to mitigate bias. △ Less

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2404.11845 [pdf, other]

Challenging Negative Gender Stereotypes: A Study on the Effectiveness of Automated Counter-Stereotypes

Authors: Isar Nejadgholi, Kathleen C. Fraser, Anna Kerkhof, Svetlana Kiritchenko

Abstract: Gender stereotypes are pervasive beliefs about individuals based on their gender that play a significant role in sha** societal attitudes, behaviours, and even opportunities. Recognizing the negative implications of gender stereotypes, particularly in online communications, this study investigates eleven strategies to automatically counter-act and challenge these views. We present AI-generated g… ▽ More Gender stereotypes are pervasive beliefs about individuals based on their gender that play a significant role in sha** societal attitudes, behaviours, and even opportunities. Recognizing the negative implications of gender stereotypes, particularly in online communications, this study investigates eleven strategies to automatically counter-act and challenge these views. We present AI-generated gender-based counter-stereotypes to (self-identified) male and female study participants and ask them to assess their offensiveness, plausibility, and potential effectiveness. The strategies of counter-facts and broadening universals (i.e., stating that anyone can have a trait regardless of group membership) emerged as the most robust approaches, while humour, perspective-taking, counter-examples, and empathy for the speaker were perceived as less effective. Also, the differences in ratings were more pronounced for stereotypes about the different targets than between the genders of the raters. Alarmingly, many AI-generated counter-stereotypes were perceived as offensive and/or implausible. Our analysis and the collected dataset offer foundational insight into counter-stereotype generation, guiding future efforts to develop strategies that effectively challenge gender stereotypes in online interactions. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: LREC-COLING2024

arXiv:2404.00166 [pdf, other]

Uncovering Bias in Large Vision-Language Models with Counterfactuals

Authors: Phillip Howard, Anahita Bhiwandiwalla, Kathleen C. Fraser, Svetlana Kiritchenko

Abstract: With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined… ▽ More With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different LVLMs under this counterfactual generation setting and find that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence toxicity and the generation of competency-associated words. △ Less

Submitted 7 June, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

Comments: Accepted to the CVPR 2024 Responsible Generative AI (ReGenAI) Workshop

arXiv:2402.05779 [pdf, other]

Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images

Authors: Kathleen C. Fraser, Svetlana Kiritchenko

Abstract: Following on recent advances in large language models (LLMs) and subsequent chat models, a new wave of large vision-language models (LVLMs) has emerged. Such models can incorporate images as input in addition to text, and perform tasks such as visual question answering, image captioning, story generation, etc. Here, we examine potential gender and racial biases in such systems, based on the percei… ▽ More Following on recent advances in large language models (LLMs) and subsequent chat models, a new wave of large vision-language models (LVLMs) has emerged. Such models can incorporate images as input in addition to text, and perform tasks such as visual question answering, image captioning, story generation, etc. Here, we examine potential gender and racial biases in such systems, based on the perceived characteristics of the people in the input images. To accomplish this, we present a new dataset PAIRS (PArallel Images for eveRyday Scenarios). The PAIRS dataset contains sets of AI-generated images of people, such that the images are highly similar in terms of background and visual content, but differ along the dimensions of gender (man, woman) and race (Black, white). By querying the LVLMs with such images, we observe significant differences in the responses according to the perceived gender or race of the person depicted. △ Less

Submitted 8 February, 2024; originally announced February 2024.

Comments: To appear at EACL 2024

arXiv:2307.01900 [pdf, other]

Concept-Based Explanations to Test for False Causal Relationships Learned by Abusive Language Classifiers

Authors: Isar Nejadgholi, Svetlana Kiritchenko, Kathleen C. Fraser, Esma Balkır

Abstract: Classifiers tend to learn a false causal relationship between an over-represented concept and a label, which can result in over-reliance on the concept and compromised classification accuracy. It is imperative to have methods in place that can compare different models and identify over-reliances on specific concepts. We consider three well-known abusive language classifiers trained on large Englis… ▽ More Classifiers tend to learn a false causal relationship between an over-represented concept and a label, which can result in over-reliance on the concept and compromised classification accuracy. It is imperative to have methods in place that can compare different models and identify over-reliances on specific concepts. We consider three well-known abusive language classifiers trained on large English datasets and focus on the concept of negative emotions, which is an important signal but should not be learned as a sufficient feature for the label of abuse. Motivated by the definition of global sufficiency, we first examine the unwanted dependencies learned by the classifiers by assessing their accuracy on a challenge set across all decision thresholds. Further, recognizing that a challenge set might not always be available, we introduce concept-based explanation metrics to assess the influence of the concept on the labels. These explanations allow us to compare classifiers regarding the degree of false global sufficiency they have learned between a concept and a label. △ Less

Submitted 4 July, 2023; originally announced July 2023.

Comments: Published at WOAH2023 co-located with ACL2023

arXiv:2303.14128 [pdf, other]

The crime of being poor

Authors: Georgina Curto, Svetlana Kiritchenko, Isar Nejadgholi, Kathleen C. Fraser

Abstract: The criminalization of poverty has been widely denounced as a collective bias against the most vulnerable. NGOs and international organizations claim that the poor are blamed for their situation, are more often associated with criminal offenses than the wealthy strata of society and even incur criminal offenses simply as a result of being poor. While no evidence has been found in the literature th… ▽ More The criminalization of poverty has been widely denounced as a collective bias against the most vulnerable. NGOs and international organizations claim that the poor are blamed for their situation, are more often associated with criminal offenses than the wealthy strata of society and even incur criminal offenses simply as a result of being poor. While no evidence has been found in the literature that correlates poverty and overall criminality rates, this paper offers evidence of a collective belief that associates both concepts. This brief report measures the societal bias that correlates criminality with the poor, as compared to the rich, by using Natural Language Processing (NLP) techniques in Twitter. The paper quantifies the level of crime-poverty bias in a panel of eight different English-speaking countries. The regional differences in the association between crime and poverty cannot be justified based on different levels of inequality or unemployment, which the literature correlates to property crimes. The variation in the observed rates of crime-poverty bias for different geographic locations could be influenced by cultural factors and the tendency to overestimate the equality of opportunities and social mobility in specific countries. These results have consequences for policy-making and open a new path of research for poverty mitigation with the focus not only on the poor but on society as a whole. Acting on the collective bias against the poor would facilitate the approval of poverty reduction policies, as well as the restoration of the dignity of the persons affected. △ Less

Submitted 24 March, 2023; originally announced March 2023.

arXiv:2302.07159 [pdf, other]

A Friendly Face: Do Text-to-Image Systems Rely on Stereotypes when the Input is Under-Specified?

Authors: Kathleen C. Fraser, Svetlana Kiritchenko, Isar Nejadgholi

Abstract: As text-to-image systems continue to grow in popularity with the general public, questions have arisen about bias and diversity in the generated images. Here, we investigate properties of images generated in response to prompts which are visually under-specified, but contain salient social attributes (e.g., 'a portrait of a threatening person' versus 'a portrait of a friendly person'). Grounding o… ▽ More As text-to-image systems continue to grow in popularity with the general public, questions have arisen about bias and diversity in the generated images. Here, we investigate properties of images generated in response to prompts which are visually under-specified, but contain salient social attributes (e.g., 'a portrait of a threatening person' versus 'a portrait of a friendly person'). Grounding our work in social cognition theory, we find that in many cases, images contain similar demographic biases to those reported in the stereotype literature. However, trends are inconsistent across different models and further investigation is warranted. △ Less

Submitted 14 February, 2023; originally announced February 2023.

Comments: Appearing in the AAAI 2023 Workshop on Creative AI Across Modalities

arXiv:2210.10689 [pdf, other]

Towards Procedural Fairness: Uncovering Biases in How a Toxic Language Classifier Uses Sentiment Information

Authors: Isar Nejadgholi, Esma Balkır, Kathleen C. Fraser, Svetlana Kiritchenko

Abstract: Previous works on the fairness of toxic language classifiers compare the output of models with different identity terms as input features but do not consider the impact of other important concepts present in the context. Here, besides identity terms, we take into account high-level latent features learned by the classifier and investigate the interaction between these features and identity terms.… ▽ More Previous works on the fairness of toxic language classifiers compare the output of models with different identity terms as input features but do not consider the impact of other important concepts present in the context. Here, besides identity terms, we take into account high-level latent features learned by the classifier and investigate the interaction between these features and identity terms. For a multi-class toxic language classifier, we leverage a concept-based explanation framework to calculate the sensitivity of the model to the concept of sentiment, which has been used before as a salient feature for toxic language detection. Our results show that although for some classes, the classifier has learned the sentiment information as expected, this information is outweighed by the influence of identity terms as input features. This work is a step towards evaluating procedural fairness, where unfair processes lead to unfair outcomes. The produced knowledge can guide debiasing techniques to ensure that important concepts besides identity terms are well-represented in training datasets. △ Less

Submitted 19 October, 2022; originally announced October 2022.

Comments: 13 pages, 2 figures, accepted at the fifth edition of BlackBoxNLP collocated with EMNLP2022

arXiv:2206.03945 [pdf, other]

Challenges in Applying Explainability Methods to Improve the Fairness of NLP Models

Authors: Esma Balkir, Svetlana Kiritchenko, Isar Nejadgholi, Kathleen C. Fraser

Abstract: Motivations for methods in explainable artificial intelligence (XAI) often include detecting, quantifying and mitigating bias, and contributing to making machine learning models fairer. However, exactly how an XAI method can help in combating biases is often left unspecified. In this paper, we briefly review trends in explainability and fairness in NLP research, identify the current practices in w… ▽ More Motivations for methods in explainable artificial intelligence (XAI) often include detecting, quantifying and mitigating bias, and contributing to making machine learning models fairer. However, exactly how an XAI method can help in combating biases is often left unspecified. In this paper, we briefly review trends in explainability and fairness in NLP research, identify the current practices in which explainability methods are applied to detect and mitigate bias, and investigate the barriers preventing XAI methods from being used more widely in tackling fairness issues. △ Less

Submitted 8 June, 2022; originally announced June 2022.

Comments: TrustNLP Workshop at NAACL 2022

arXiv:2205.12771 [pdf, other]

Does Moral Code Have a Moral Code? Probing Delphi's Moral Philosophy

Authors: Kathleen C. Fraser, Svetlana Kiritchenko, Esma Balkir

Abstract: In an effort to guarantee that machine learning model outputs conform with human moral values, recent work has begun exploring the possibility of explicitly training models to learn the difference between right and wrong. This is typically done in a bottom-up fashion, by exposing the model to different scenarios, annotated with human moral judgements. One question, however, is whether the trained… ▽ More In an effort to guarantee that machine learning model outputs conform with human moral values, recent work has begun exploring the possibility of explicitly training models to learn the difference between right and wrong. This is typically done in a bottom-up fashion, by exposing the model to different scenarios, annotated with human moral judgements. One question, however, is whether the trained models actually learn any consistent, higher-level ethical principles from these datasets -- and if so, what? Here, we probe the Allen AI Delphi model with a set of standardized morality questionnaires, and find that, despite some inconsistencies, Delphi tends to mirror the moral principles associated with the demographic groups involved in the annotation process. We question whether this is desirable and discuss how we might move forward with this knowledge. △ Less

Submitted 25 May, 2022; originally announced May 2022.

Comments: To appear at TrustNLP Workshop @ NAACL 2022

arXiv:2205.03302 [pdf, other]

Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection

Authors: Esma Balkir, Isar Nejadgholi, Kathleen C. Fraser, Svetlana Kiritchenko

Abstract: We present a novel feature attribution method for explaining text classifiers, and analyze it in the context of hate speech detection. Although feature attribution models usually provide a single importance score for each token, we instead provide two complementary and theoretically-grounded scores -- necessity and sufficiency -- resulting in more informative explanations. We propose a transparent… ▽ More We present a novel feature attribution method for explaining text classifiers, and analyze it in the context of hate speech detection. Although feature attribution models usually provide a single importance score for each token, we instead provide two complementary and theoretically-grounded scores -- necessity and sufficiency -- resulting in more informative explanations. We propose a transparent method that calculates these values by generating explicit perturbations of the input text, allowing the importance scores themselves to be explainable. We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors, exposing sources of classifier bias against marginalized groups. △ Less

Submitted 6 May, 2022; originally announced May 2022.

Comments: NAACL 2022

arXiv:2204.02261 [pdf, other]

Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors

Authors: Isar Nejadgholi, Kathleen C. Fraser, Svetlana Kiritchenko

Abstract: Robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. New kinds of abusive language continually emerge in online discussions in response to current events (e.g., COVID-19), and the deployed abuse detection systems should be updated regularly to remain accurate. In this paper, we show th… ▽ More Robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. New kinds of abusive language continually emerge in online discussions in response to current events (e.g., COVID-19), and the deployed abuse detection systems should be updated regularly to remain accurate. In this paper, we show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse. Next, we propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language, and use that to explain the generalizability of the model on new data, in this case, COVID-related anti-Asian hate speech. Extending this technique, we introduce a novel metric, Degree of Explicitness, for a single instance and show that the new metric is beneficial in suggesting out-of-domain unlabeled examples to effectively enrich the training data with informative, implicitly abusive texts. △ Less

Submitted 5 April, 2022; originally announced April 2022.

Comments: accepted to be published at ACL2022

arXiv:2110.09421 [pdf, other]

doi 10.1109/MIM.2021.9513645

Measuring Cognitive Status from Speech in a Smart Home Environment

Authors: Kathleen C. Fraser, Majid Komeili

Abstract: The population is aging, and becoming more tech-savvy. The United Nations predicts that by 2050, one in six people in the world will be over age 65 (up from one in 11 in 2019), and this increases to one in four in Europe and Northern America. Meanwhile, the proportion of American adults over 65 who own a smartphone has risen 24 percentage points from 2013-2017, and the majority have Internet in th… ▽ More The population is aging, and becoming more tech-savvy. The United Nations predicts that by 2050, one in six people in the world will be over age 65 (up from one in 11 in 2019), and this increases to one in four in Europe and Northern America. Meanwhile, the proportion of American adults over 65 who own a smartphone has risen 24 percentage points from 2013-2017, and the majority have Internet in their homes. Smart devices and smart home technology have profound potential to transform how people age, their ability to live independently in later years, and their interactions with their circle of care. Cognitive health is a key component to independence and well-being in old age, and smart homes present many opportunities to measure cognitive status in a continuous, unobtrusive manner. In this article, we focus on speech as a measurement instrument for cognitive health. Existing methods of cognitive assessment suffer from a number of limitations that could be addressed through smart home speech sensing technologies. We begin with a brief tutorial on measuring cognitive status from speech, including some pointers to useful open-source software toolboxes for the interested reader. We then present an overview of the preliminary results from pilot studies on active and passive smart home speech sensing for the measurement of cognitive health, and conclude with some recommendations and challenge statements for the next wave of work in this area, to help overcome both technical and ethical barriers to success. △ Less

Submitted 18 October, 2021; originally announced October 2021.

Journal ref: IEEE Instrumentation & Measurement Magazine (Volume: 24, Issue: 6, September 2021)

arXiv:2106.02596 [pdf, other]

Understanding and Countering Stereotypes: A Computational Approach to the Stereotype Content Model

Authors: Kathleen C. Fraser, Isar Nejadgholi, Svetlana Kiritchenko

Abstract: Stereotypical language expresses widely-held beliefs about different social categories. Many stereotypes are overtly negative, while others may appear positive on the surface, but still lead to negative consequences. In this work, we present a computational approach to interpreting stereotypes in text through the Stereotype Content Model (SCM), a comprehensive causal theory from social psychology.… ▽ More Stereotypical language expresses widely-held beliefs about different social categories. Many stereotypes are overtly negative, while others may appear positive on the surface, but still lead to negative consequences. In this work, we present a computational approach to interpreting stereotypes in text through the Stereotype Content Model (SCM), a comprehensive causal theory from social psychology. The SCM proposes that stereotypes can be understood along two primary dimensions: warmth and competence. We present a method for defining warmth and competence axes in semantic embedding space, and show that the four quadrants defined by this subspace accurately represent the warmth and competence concepts, according to annotated lexicons. We then apply our computational SCM model to textual stereotype data and show that it compares favourably with survey-based studies in the psychological literature. Furthermore, we explore various strategies to counter stereotypical beliefs with anti-stereotypes. It is known that countering stereotypes with anti-stereotypical examples is one of the most effective ways to reduce biased thinking, yet the problem of generating anti-stereotypes has not been previously studied. Thus, a better understanding of how to generate realistic and effective anti-stereotypes can contribute to addressing pressing societal concerns of stereoty**, prejudice, and discrimination. △ Less

Submitted 4 June, 2021; originally announced June 2021.

Comments: In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)

arXiv:2012.12305 [pdf, other]

doi 10.1613/jair.1.12590

Confronting Abusive Language Online: A Survey from the Ethical and Human Rights Perspective

Authors: Svetlana Kiritchenko, Isar Nejadgholi, Kathleen C. Fraser

Abstract: The pervasiveness of abusive content on the internet can lead to severe psychological and physical harm. Significant effort in Natural Language Processing (NLP) research has been devoted to addressing this problem through abusive content detection and related sub-areas, such as the detection of hate speech, toxicity, cyberbullying, etc. Although current technologies achieve high classification per… ▽ More The pervasiveness of abusive content on the internet can lead to severe psychological and physical harm. Significant effort in Natural Language Processing (NLP) research has been devoted to addressing this problem through abusive content detection and related sub-areas, such as the detection of hate speech, toxicity, cyberbullying, etc. Although current technologies achieve high classification performance in research studies, it has been observed that the real-life application of this technology can cause unintended harms, such as the silencing of under-represented groups. We review a large body of NLP research on automatic abuse detection with a new focus on ethical challenges, organized around eight established ethical principles: privacy, accountability, safety and security, transparency and explainability, fairness and non-discrimination, human control of technology, professional responsibility, and promotion of human values. In many cases, these principles relate not only to situational ethical codes, which may be context-dependent, but are in fact connected to universal human rights, such as the right to privacy, freedom from discrimination, and freedom of expression. We highlight the need to examine the broad social impacts of this technology, and to bring ethical and human rights considerations to every stage of the application life-cycle, from task formulation and dataset design, to model training and evaluation, to application deployment. Guided by these principles, we identify several opportunities for rights-respecting, socio-technical solutions to detect and confront online abuse, including `nudging', `quarantining', value sensitive design, counter-narratives, style transfer, and AI-driven public education applications. △ Less

Submitted 22 July, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

Comments: published in Journal of Artificial Intelligence Research, 71: 431-478, July 2021

arXiv:2006.05281 [pdf, other]

Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience

Authors: Isar Nejadgholi, Kathleen C. Fraser, Berry De Bruijn

Abstract: When comparing entities extracted by a medical entity recognition system with gold standard annotations over a test set, two types of mismatches might occur, label mismatch or span mismatch. Here we focus on span mismatch and show that its severity can vary from a serious error to a fully acceptable entity extraction due to the subjectivity of span annotations. For a domain-specific BERT-based NER… ▽ More When comparing entities extracted by a medical entity recognition system with gold standard annotations over a test set, two types of mismatches might occur, label mismatch or span mismatch. Here we focus on span mismatch and show that its severity can vary from a serious error to a fully acceptable entity extraction due to the subjectivity of span annotations. For a domain-specific BERT-based NER system, we showed that 25% of the errors have the same labels and overlap** span with gold standard entities. We collected expert judgement which shows more than 90% of these mismatches are accepted or partially accepted by the user. Using the training set of the NER system, we built a fast and lightweight entity classifier to approximate the user experience of such mismatches through accepting or rejecting them. The decisions made by this classifier are used to calculate a learning-based F-score which is shown to be a better approximation of a forgiving user's experience than the relaxed F-score. We demonstrated the results of applying the proposed evaluation metric for a variety of deep learning medical entity recognition models trained with two datasets. △ Less

Submitted 9 June, 2020; originally announced June 2020.

Comments: to appear at BioNLP2020

arXiv:1910.01274 [pdf, other]

Extracting UMLS Concepts from Medical Text Using General and Domain-Specific Deep Learning Models

Authors: Kathleen C. Fraser, Isar Nejadgholi, Berry De Bruijn, Muqun Li, Astha LaPlante, Khaldoun Zine El Abidine

Abstract: Entity recognition is a critical first step to a number of clinical NLP applications, such as entity linking and relation extraction. We present the first attempt to apply state-of-the-art entity recognition approaches on a newly released dataset, MedMentions. This dataset contains over 4000 biomedical abstracts, annotated for UMLS semantic types. In comparison to existing datasets, MedMentions co… ▽ More Entity recognition is a critical first step to a number of clinical NLP applications, such as entity linking and relation extraction. We present the first attempt to apply state-of-the-art entity recognition approaches on a newly released dataset, MedMentions. This dataset contains over 4000 biomedical abstracts, annotated for UMLS semantic types. In comparison to existing datasets, MedMentions contains a far greater number of entity types, and thus represents a more challenging but realistic scenario in a real-world setting. We explore a number of relevant dimensions, including the use of contextual versus non-contextual word embeddings, general versus domain-specific unsupervised pre-training, and different deep learning architectures. We contrast our results against the well-known i2b2 2010 entity recognition dataset, and propose a new method to combine general and domain-specific information. While producing a state-of-the-art result for the i2b2 2010 task (F1 = 0.90), our results on MedMentions are significantly lower (F1 = 0.63), suggesting there is still plenty of opportunity for improvement on this new data. △ Less

Submitted 2 October, 2019; originally announced October 2019.

Comments: 11 pages, accepted at LOUHI2019 workshop

Showing 1–18 of 18 results for author: Fraser, K C