Augmented Datasheets for Speech Datasets and Ethical Decision-Making

Papakyriakopoulos, Orestis; Choi, Anna Seo Gyeong; Andrews, Jerone; Bourke, Rebecca; Thong, William; Zhao, Dora; Xiang, Alice; Koenecke, Allison

doi:10.1145/3593013.3594049

arXiv:2305.04672 (cs)

[Submitted on 8 May 2023]

Abstract: Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment - and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data - commonly built on massive web-crawling and/or publicly available speech - with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets". We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners - ranging from dataset creators to researchers - to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.

Comments:	To appear in 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT '23), June 12-15, Chicago, IL, USA
Subjects:	Computers and Society (cs.CY)
Cite as:	arXiv:2305.04672 [cs.CY]
	(or arXiv:2305.04672v1 [cs.CY] for this version)
	https://doi.org/10.48550/arXiv.2305.04672
Related DOI:	https://doi.org/10.1145/3593013.3594049

Submission history

From: Orestis Papakyriakopoulos
[v1] Mon, 8 May 2023 12:49:04 UTC (1,941 KB)
