Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models

Dutta, Arka; Khorramrouz, Adel; Dutta, Sujan; KhudaBukhsh, Ashiqur R.

arXiv:2309.06415 (cs)

[Submitted on 8 Sep 2023 (v1), last revised 31 Mar 2024 (this version, v4)]

Abstract: This paper makes three contributions. First, it presents a generalizable, novel framework, dubbed the "toxicity rabbit hole", that iteratively elicits toxic content from a wide suite of large language models. We first conduct a bias audit of PaLM 2 guardrails spanning a set of 1,266 identity groups and present key insights. Next, we report generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, driven by concrete examples, we discuss potential ramifications.

Subjects:	Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:	arXiv:2309.06415 [cs.CL]
	(or arXiv:2309.06415v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2309.06415

Submission history

From: Ashiqur Rahman KhudaBukhsh
[v1] Fri, 8 Sep 2023 03:59:02 UTC (1,699 KB)
[v2] Mon, 18 Sep 2023 16:56:40 UTC (1,699 KB)
[v3] Sat, 23 Dec 2023 06:54:20 UTC (1,701 KB)
[v4] Sun, 31 Mar 2024 02:24:39 UTC (3,391 KB)
