ECBD: Evidence-Centered Benchmark Design for NLP

Authors: Yu Lu Liu, Su Lin Blodgett, Jackie Chi Kit Cheung, Q. Vera Liao, Alexandra Olteanu, Ziang Xiao

Abstract: Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark's measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices -- e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.05860 [pdf, other]

The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels

Authors: Eve Fleisig, Su Lin Blodgett, Dan Klein, Zeerak Talat

Abstract: Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement--some challenged by perspectivist approaches, and some that remain to be addressed--as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2402.04420 [pdf, other]

Measuring machine learning harms from stereotypes: requires understanding who is being harmed by which errors in what ways

Authors: Angelina Wang, Xuechunzi Bai, Solon Barocas, Su Lin Blodgett

Abstract: As machine learning applications proliferate, we need an understanding of their potential for harm. However, current fairness metrics are rarely grounded in human psychological experiences of harm. Drawing on the social psychology of stereotypes, we use a case study of gender stereotypes in image search to examine how people react to machine learning errors. First, we use survey studies to show that not all machine learning errors reflect stereotypes nor are equally harmful. Then, in experimental studies we randomly expose participants to stereotype-reinforcing, -violating, and -neutral machine learning errors. We find stereotype-reinforcing errors induce more experientially (i.e., subjectively) harmful experiences, while having minimal changes to cognitive beliefs, attitudes, or behaviors. This experiential harm impacts women more than men. However, certain stereotype-violating errors are more experientially harmful for men, potentially due to perceived threats to masculinity. We conclude that harm cannot be the sole guide in fairness mitigation, and propose a nuanced perspective depending on who is experiencing what harm and why.

Submitted 6 February, 2024; originally announced February 2024.

Comments: earlier draft non-archival at EAAMO 2023

arXiv:2311.11103 [pdf, other]

Responsible AI Considerations in Text Summarization Research: A Review of Current Practices

Authors: Yu Lu Liu, Meng Cao, Su Lin Blodgett, Jackie Chi Kit Cheung, Alexandra Olteanu, Adam Trischler

Abstract: AI and NLP publication venues have increasingly encouraged researchers to reflect on possible ethical considerations, adverse impacts, and other responsible AI issues their work might engender. However, for specific NLP tasks our understanding of how prevalent such issues are, or when and why these issues are likely to arise, remains limited. Focusing on text summarization -- a common NLP task largely overlooked by the responsible AI community -- we examine research and reporting practices in the current literature. We conduct a multi-round qualitative analysis of 333 summarization papers from the ACL Anthology published between 2020-2022. We focus on how, which, and when responsible AI issues are covered, which relevant stakeholders are considered, and mismatches between stated and realized research goals. We also discuss current evaluation practices and consider how authors discuss the limitations of both prior work and their own work. Overall, we find that relatively few papers engage with possible stakeholders or contexts of use, which limits their consideration of potential downstream adverse impacts or other responsible AI issues. Based on our findings, we make recommendations on concrete practices and research directions.

Submitted 18 November, 2023; originally announced November 2023.

arXiv:2310.15398 [pdf, other]

"One-Size-Fits-All"? Examining Expectations around What Constitute "Fair" or "Good" NLG System Behaviors

Authors: Li Lucy, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, Alexandra Olteanu

Abstract: Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to behave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these case studies, we examine people's expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute "fair" or "good" NLG system behaviors.

Submitted 3 April, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: 36 pages, 24 figures, NAACL 2024

arXiv:2306.05949 [pdf, other]

Evaluating the Social Impact of Generative AI Systems in Systems and Society

Authors: Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Canyu Chen, Hal Daumé III, Jesse Dodge, Isabella Duan, Ellie Evans, Felix Friedrich, Avijit Ghosh, Usman Gohar, Sara Hooker, Yacine Jernite, Ria Kalluri, Alberto Lusoli, Alina Leidinger, Michelle Lin, Xiuzhu Lin, Sasha Luccioni, Jennifer Mickel, Margaret Mitchell, Jessica Newman, et al. (6 additional authors not shown)

Abstract: Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categories: what can be evaluated in a base system independent of context and what can be evaluated in a societal context. Importantly, this refers to base systems that have no predetermined application or deployment context, including a model itself, as well as system components, such as training data. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to listed generative modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in a broader societal context, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm.

Submitted 28 June, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: Forthcoming in Hacker, Engel, Hammer, Mittelstadt (eds), Oxford Handbook on the Foundations and Regulation of Generative AI. Oxford University Press

arXiv:2305.12757 [pdf, other]

This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models

Authors: Seraphina Goldfarb-Tarrant, Eddie Ungless, Esma Balkir, Su Lin Blodgett

Abstract: Bias research in NLP seeks to analyse models for social biases, thus helping NLP practitioners uncover, measure, and mitigate social harms. We analyse the body of work that uses prompts and templates to assess bias in language models. We draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure and how that measurement is carried out. By applying this taxonomy to 90 bias tests, we illustrate qualitatively and quantitatively that core aspects of bias test conceptualisations and operationalisations are frequently unstated or ambiguous, carry implicit assumptions, or are mismatched. Our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched. We offer guidance to enable the community to explore a wider section of the possible bias space, and to better close the gap between desired outcomes and experimental design, both for bias and for evaluating language models more broadly.

Submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted to ACL Findings 2023

arXiv:2305.09022 [pdf, other]

It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance

Authors: Arjun Subramonian, Xingdi Yuan, Hal Daumé III, Su Lin Blodgett

Abstract: Progress in NLP is increasingly measured through benchmarks; hence, contextualizing progress requires understanding when and why practitioners may disagree about the validity of benchmarks. We develop a taxonomy of disagreement, drawing on tools from measurement modeling, and distinguish between two types of disagreement: 1) how tasks are conceptualized and 2) how measurements of model performance are operationalized. To provide evidence for our taxonomy, we conduct a meta-analysis of relevant literature to understand how NLP tasks are conceptualized, as well as a survey of practitioners about their impressions of different factors that affect benchmark validity. Our meta-analysis and survey across eight tasks, ranging from coreference resolution to question answering, uncover that tasks are generally not clearly and consistently conceptualized and benchmarks suffer from operationalization disagreements. These findings support our proposed taxonomy of disagreement. Finally, based on our taxonomy, we present a framework for constructing benchmarks and documenting their limitations.

Submitted 15 May, 2023; originally announced May 2023.

Journal ref: Findings of the Association for Computational Linguistics: ACL 2023

arXiv:2305.01776 [pdf, other]

Taxonomizing and Measuring Representational Harms: A Look at Image Tagging

Authors: Jared Katzman, Angelina Wang, Morgan Scheuerman, Su Lin Blodgett, Kristen Laird, Hanna Wallach, Solon Barocas

Abstract: In this paper, we examine computational approaches for measuring the "fairness" of image tagging systems, finding that they cluster into five distinct categories, each with its own analytic foundation. We also identify a range of normative concerns that are often collapsed under the terms "unfairness," "bias," or even "discrimination" when discussing problematic cases of image tagging. Specifically, we identify four types of representational harms that can be caused by image tagging systems, providing concrete examples of each. We then consider how different computational measurement approaches map to each of these types, demonstrating that there is not a one-to-one mapping. Our findings emphasize that no single measurement approach will be definitive and that it is not possible to infer from the use of a particular measurement approach which type of harm was intended to be measured. Lastly, equipped with this more granular understanding of the types of representational harms that can be caused by image tagging systems, we show that attempts to mitigate some of these types of harms may be in tension with one another.

Submitted 2 May, 2023; originally announced May 2023.

Comments: AAAI-23 Special Track on AI for Social Impact

Journal ref: Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023)

arXiv:2301.05753 [pdf, ps, other]

Fairness and Sequential Decision Making: Limits, Lessons, and Opportunities

Authors: Samer B. Nashed, Justin Svegliato, Su Lin Blodgett

Abstract: As automated decision making and decision assistance systems become common in everyday life, research on the prevention or mitigation of potential harms that arise from decisions made by these systems has proliferated. However, various research communities have independently conceptualized these harms, envisioned potential applications, and proposed interventions. The result is a somewhat fractured landscape of literature focused generally on ensuring decision-making algorithms "do the right thing". In this paper, we compare and discuss work across two major subsets of this literature: algorithmic fairness, which focuses primarily on predictive systems, and ethical decision making, which focuses primarily on sequential decision making and planning. We explore how each of these settings has articulated its normative concerns, the viability of different techniques for these different settings, and how ideas from each setting may have utility for the other.

Submitted 13 January, 2023; originally announced January 2023.

Comments: 10 pages

arXiv:2212.14486 [pdf, other]

Examining Political Rhetoric with Epistemic Stance Detection

Authors: Ankita Gupta, Su Lin Blodgett, Justin H Gross, Brendan O'Connor

Abstract: Participants in political discourse employ rhetorical strategies -- such as hedging, attributions, or denials -- to display varying degrees of belief commitments to claims proposed by themselves or others. Traditionally, political scientists have studied these epistemic phenomena through labor-intensive manual content analysis. We propose to help automate such work through epistemic stance prediction, drawn from research in computational semantics, to distinguish at the clausal level what is asserted, denied, or only ambivalently suggested by the author or other mentioned entities (belief holders). We first develop a simple RoBERTa-based model for multi-source stance predictions that outperforms more complex state-of-the-art modeling. Then we demonstrate its novel application to political science by conducting a large-scale analysis of the Mass Market Manifestos corpus of U.S. political opinion books, where we characterize trends in cited belief holders -- respected allies and opposed bogeymen -- across U.S. political ideologies.

Submitted 5 January, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

Comments: Forthcoming in Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) at EMNLP 2022

arXiv:2205.06828 [pdf, other]

Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications

Authors: Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé III, Kaheer Suleman, Alexandra Olteanu

Abstract: There are many ways to express similar things in text, which makes evaluating natural language generation (NLG) systems difficult. Compounding this difficulty is the need to assess varying quality criteria depending on the deployment setting. While the landscape of NLG evaluation has been well-mapped, practitioners' goals, assumptions, and constraints -- which inform decisions about what, when, and how to evaluate -- are often partially or implicitly stated, or not stated at all. Combining a formative semi-structured interview study of NLG practitioners (N=18) with a survey study of a broader sample of practitioners (N=61), we surface goals, community practices, assumptions, and constraints that shape NLG evaluations, examining their implications and how they embody ethical considerations.

Submitted 13 May, 2022; originally announced May 2022.

Comments: Camera Ready for NAACL 2022 (Main Conference)

arXiv:2110.10024 [pdf, other]

Risks of AI Foundation Models in Education

Authors: Su Lin Blodgett, Michael Madaio

Abstract: If the authors of a recent Stanford report (Bommasani et al., 2021) on the opportunities and risks of "foundation models" are to be believed, these models represent a paradigm shift for AI and for the domains in which they will supposedly be used, including education. Although the name is new (and contested (Field, 2021)), the term describes existing types of algorithmic models that are "trained on broad data at scale" and "fine-tuned" (i.e., adapted) for particular downstream tasks, and is intended to encompass large language models such as BERT or GPT-3 and computer vision models such as CLIP. Such technologies have the potential for harm broadly speaking (e.g., Bender et al., 2021), but their use in the educational domain is particularly fraught, despite the potential benefits for learners claimed by the authors. In section 3.3 of the Stanford report, Malik et al. argue that achieving the goal of providing education for all learners requires more efficient computational approaches that can rapidly scale across educational domains and across educational contexts, for which they argue foundation models are uniquely well-suited. However, evidence suggests that not only are foundation models not likely to achieve the stated benefits for learners, but their use may also introduce new risks for harm.

Submitted 19 October, 2021; originally announced October 2021.

arXiv:2106.11410 [pdf, other]

A Survey of Race, Racism, and Anti-Racism in NLP

Authors: Anjalie Field, Su Lin Blodgett, Zeerak Waseem, Yulia Tsvetkov

Abstract: Despite inextricable ties between race and language, little work has considered race in NLP research and development. In this work, we survey 79 papers from the ACL anthology that mention race. These papers reveal various types of race-related bias in all stages of NLP model development, highlighting the need for proactive consideration of how NLP systems can uphold racial hierarchies. However, persistent gaps in research on race and NLP remain: race has been siloed as a niche topic and remains ignored in many NLP tasks; most work operationalizes race as a fixed single-dimensional variable with a ground-truth label, which risks reinforcing differences produced by historical racism; and the voices of historically marginalized people are nearly absent in NLP literature. By identifying where and how NLP literature has and has not considered race, especially in comparison to related fields, our work calls for inclusion and racial justice in NLP research practices.

Submitted 15 July, 2021; v1 submitted 21 June, 2021; originally announced June 2021.

Comments: Accepted to ACL 2021

arXiv:2105.08847 [pdf, ps, other]

Beyond "Fairness:" Structural (In)justice Lenses on AI for Education

Authors: Michael Madaio, Su Lin Blodgett, Elijah Mayfield, Ezekiel Dixon-Román

Abstract: Educational technologies, and the systems of schooling in which they are deployed, enact particular ideologies about what is important to know and how learners should learn. As artificial intelligence technologies -- in education and beyond -- may contribute to inequitable outcomes for marginalized communities, various approaches have been developed to evaluate and mitigate the harmful impacts of AI. However, we argue in this paper that the dominant paradigm of evaluating fairness on the basis of performance disparities in AI models is inadequate for confronting the systemic inequities that educational AI systems (re)produce. We draw on a lens of structural injustice informed by critical theory and Black feminist scholarship to critically interrogate several widely-studied and widely-adopted categories of educational AI and explore how they are bound up in and reproduce historical legacies of structural injustice and inequity, regardless of the parity of their models' performance. We close with alternative visions for a more equitable future for educational AI.

Submitted 1 November, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: To be published in: The Ethics of Artificial Intelligence in Education: Current Challenges, Practices and Debates, W. Holmes and K. Porayska-Pomsta (Eds.), Routledge. This revision incorporates reviewer feedback and updates the title to reflect the current book chapter title

ACM Class: K.3; K.4; I.2

arXiv:2104.03026 [pdf, ps, other]

How to Write a Bias Statement: Recommendations for Submissions to the Workshop on Gender Bias in NLP

Authors: Christian Hardmeier, Marta R. Costa-jussà, Kellie Webster, Will Radford, Su Lin Blodgett

Abstract: At the Workshop on Gender Bias in NLP (GeBNLP), we'd like to encourage authors to give explicit consideration to the wider aspects of bias and its social implications. For the 2020 edition of the workshop, we therefore requested that all authors include an explicit bias statement in their work to clarify how their work relates to the social context in which NLP systems are used. The programme committee of the workshops included a number of reviewers with a background in the humanities and social sciences, in addition to NLP experts doing the bulk of the reviewing. Each paper was assigned one of those reviewers, and they were asked to pay specific attention to the provided bias statements in their reviews. This initiative was well received by the authors who submitted papers to the workshop, several of whom said they received useful suggestions and literature hints from the bias reviewers. We are therefore planning to keep this feature of the review process in future editions of the workshop.

Submitted 7 April, 2021; originally announced April 2021.

Comments: This document was originally published as a blog post on the web site of GeBNLP 2020

arXiv:2005.14050 [pdf, other]

Language (Technology) is Power: A Critical Survey of "Bias" in NLP

Authors: Su Lin Blodgett, Solon Barocas, Hal Daumé III, Hanna Wallach

Abstract: We survey 146 papers analyzing "bias" in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing "bias" is an inherently normative process. We further find that these papers' proposed quantitative techniques for measuring or mitigating "bias" are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. Based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing "bias" in NLP systems. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of "bias"---i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements---and to center work around the lived experiences of members of communities affected by NLP systems, while interrogating and reimagining the power relations between technologists and such communities.

Submitted 29 May, 2020; v1 submitted 28 May, 2020; originally announced May 2020.

arXiv:1804.06004 [pdf, other]

Monte Carlo Syntax Marginals for Exploring and Using Dependency Parses

Authors: Katherine A. Keith, Su Lin Blodgett, Brendan O'Connor

Abstract: Dependency parsing research, which has made significant gains in recent years, typically focuses on improving the accuracy of single-tree predictions. However, ambiguity is inherent to natural language syntax, and communicating such ambiguity is important for error analysis and better-informed downstream applications. In this work, we propose a transition sampling algorithm to sample from the full joint distribution of parse trees defined by a transition-based parsing model, and demonstrate the use of the samples in probabilistic dependency analysis. First, we define the new task of dependency path prediction, inferring syntactic substructures over part of a sentence, and provide the first analysis of performance on this task. Second, we demonstrate the usefulness of our Monte Carlo syntax marginal method for parser error analysis and calibration. Finally, we use this method to propagate parse uncertainty to two downstream information extraction applications: identifying persons killed by police and semantic role assignment.

Submitted 16 April, 2018; originally announced April 2018.

Comments: To appear in Proceedings of NAACL 2018

arXiv:1707.00061 [pdf, other]

Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English

Authors: Su Lin Blodgett, Brendan O'Connor

Abstract: We highlight an important frontier in algorithmic fairness: disparity in the quality of natural language processing algorithms when applied to language from authors of different social groups. For example, current systems sometimes analyze the language of females and minorities more poorly than they do of whites and males. We conduct an empirical analysis of racial disparity in language identification for tweets written in African-American English, and discuss implications of disparity in NLP.

Submitted 30 June, 2017; originally announced July 2017.

Comments: Presented as a talk at the 2017 Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2017)

arXiv:1608.08868 [pdf, other]

Demographic Dialectal Variation in Social Media: A Case Study of African-American English

Authors: Su Lin Blodgett, Lisa Green, Brendan O'Connor

Abstract: Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and we verify that this language follows well-known AAE linguistic phenomena. In addition, we analyze the quality of existing language identification and dependency parsing tools on AAE-like text, demonstrating that they perform poorly on such text compared to text associated with white speakers. We also provide an ensemble classifier for language identification which eliminates this disparity and release a new corpus of tweets containing AAE-like language.

Submitted 31 August, 2016; originally announced August 2016.

Comments: To be published in EMNLP 2016, 15 pages

arXiv:1606.06352 [pdf, other]

Visualizing textual models with in-text and word-as-pixel highlighting

Authors: Abram Handler, Su Lin Blodgett, Brendan O'Connor

Abstract: We explore two techniques which use color to make sense of statistical text models. One method uses in-text annotations to illustrate a model's view of particular tokens in particular documents. Another uses a high-level, "words-as-pixels" graphic to display an entire corpus. Together, these methods offer both zoomed-in and zoomed-out perspectives into a model's understanding of text. We show how these interconnected methods help diagnose a classifier's poor performance on Twitter slang, and make sense of a topic model on historical political texts.

Submitted 20 June, 2016; originally announced June 2016.

Comments: Presented at 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY

Showing 1–21 of 21 results for author: Blodgett, S L