
GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction

Authors: Virginia K. Felkner, Jennifer A. Thompson, Jonathan May

Abstract: Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.

Submitted 24 May, 2024; originally announced May 2024.

Comments: Accepted to ACL 2024 (main conference)

ACM Class: I.2.7; K.4.2

arXiv:2306.15087 [pdf, other]

WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models

Authors: Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May

Abstract: We present WinoQueer: a benchmark specifically designed to measure whether large language models (LLMs) encode biases that are harmful to the LGBTQ+ community. The benchmark is community-sourced, via application of a novel method that generates a bias benchmark from a community survey. We apply our benchmark to several popular LLMs and find that off-the-shelf models generally do exhibit considerable anti-queer bias. Finally, we show that LLM bias against a marginalized community can be somewhat mitigated by finetuning on data written about or by members of that community, and that social media text written by community members is more effective than news text written about the community by non-members. Our method for community-in-the-loop benchmark development provides a blueprint for future researchers to develop community-driven, harms-grounded LLM benchmarks for other marginalized communities.

Submitted 26 June, 2023; originally announced June 2023.

Comments: Accepted to ACL 2023 (main conference). Camera-ready version

arXiv:2206.11484 [pdf, other]

Towards WinoQueer: Developing a Benchmark for Anti-Queer Bias in Large Language Models

Authors: Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May

Abstract: This paper presents exploratory work on whether and to what extent biases against queer and trans people are encoded in large language models (LLMs) such as BERT. We also propose a method for reducing these biases in downstream tasks: finetuning the models on data written by and/or about queer people. To measure anti-queer bias, we introduce a new benchmark dataset, WinoQueer, modeled after other bias-detection benchmarks but addressing homophobic and transphobic biases. We found that BERT shows significant homophobic bias, but this bias can be mostly mitigated by finetuning BERT on a natural language corpus written by members of the LGBTQ+ community.

Submitted 7 July, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

Comments: Accepted to Queer in AI Workshop @ NAACL 2022. Updated 07/07 with minor typographical fixes

ACM Class: I.2.7

Showing 1–3 of 3 results for author: Felkner, V K