
Showing 1–20 of 20 results for author: Talat, Z

Searching in archive cs.
  1. arXiv:2406.11598  [pdf, other]

    cs.CL cs.CY

    Understanding "Democratization" in NLP and ML Research

    Authors: Arjun Subramonian, Vagrant Gautam, Dietrich Klakow, Zeerak Talat

    Abstract: Recent improvements in natural language processing (NLP) and machine learning (ML) and increased mainstream adoption have led to researchers frequently discussing the "democratization" of artificial intelligence. In this paper, we seek to clarify how democratization is understood in NLP and ML publications, through large-scale mixed-methods analyses of papers using the keyword "democra*" published…

    Submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.11073  [pdf, other]

    cs.CL

    Exploring the Limitations of Detecting Machine-Generated Text

    Authors: Jad Doughman, Osama Mohammed Afzal, Hawau Olamide Toyin, Shady Shehata, Preslav Nakov, Zeerak Talat

    Abstract: Recent improvements in the quality of the generations by large language models have spurred research into identifying machine-generated text. Systems proposed for the task often achieve high performance. However, humans and machines can produce text in different styles and in different domains, and it remains unclear whether machine generated-text detection models favour particular styles or domai…

    Submitted 16 June, 2024; originally announced June 2024.

  3. arXiv:2405.05860  [pdf, other]

    cs.LG cs.CL cs.CY

    The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels

    Authors: Eve Fleisig, Su Lin Blodgett, Dan Klein, Zeerak Talat

    Abstract: Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine…

    Submitted 9 May, 2024; originally announced May 2024.

  4. arXiv:2403.04445  [pdf, other]

    cs.CL

    Classist Tools: Social Class Correlates with Performance in NLP

    Authors: Amanda Cercas Curry, Giuseppe Attanasio, Zeerak Talat, Dirk Hovy

    Abstract: Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age…

    Submitted 7 March, 2024; originally announced March 2024.

  5. arXiv:2403.03874  [pdf, ps, other]

    cs.CL cs.AI cs.CY

    Impoverished Language Technology: The Lack of (Social) Class in NLP

    Authors: Amanda Cercas Curry, Zeerak Talat, Dirk Hovy

    Abstract: Since Labov's (1964) foundational work on the social stratification of language, linguistics has dedicated concerted efforts towards understanding the relationships between socio-demographic factors and language production and perception. Despite the large body of evidence identifying significant relationships between socio-demographic factors and language production, relatively few of these facto…

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: Accepted to LREC-COLING 2024

  6. arXiv:2403.02268  [pdf, other]

    cs.CL cs.AI cs.CY

    Subjective $\textit{Isms}$? On the Danger of Conflating Hate and Offence in Abusive Language Detection

    Authors: Amanda Cercas Curry, Gavin Abercrombie, Zeerak Talat

    Abstract: Natural language processing research has begun to embrace the notion of annotator subjectivity, motivated by variations in labelling. This approach understands each annotator's view as valid, which can be highly suitable for tasks that embed subjectivity, e.g., sentiment analysis. However, this construction may be inappropriate for tasks such as hate speech detection, as it affords equal validity…

    Submitted 4 March, 2024; originally announced March 2024.

  7. arXiv:2402.02113  [pdf, other]

    cs.CL

    Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon

    Authors: Fajri Koto, Tilman Beck, Zeerak Talat, Iryna Gurevych, Timothy Baldwin

    Abstract: Improving multilingual language models capabilities in low-resource languages is generally difficult due to the scarcity of large-scale data in those languages. In this paper, we relax the reliance on texts in low-resource languages by using multilingual lexicons in pretraining to enhance multilingual capabilities. Specifically, we focus on zero-shot sentiment analysis tasks across 34 languages, i…

    Submitted 3 February, 2024; originally announced February 2024.

    Comments: Accepted at EACL 2024

  8. arXiv:2307.10223  [pdf, other]

    cs.CY cs.AI

    Bound by the Bounty: Collaboratively Shaping Evaluation Processes for Queer AI Harms

    Authors: Organizers of QueerInAI, Nathan Dennler, Anaelia Ovalle, Ashwin Singh, Luca Soldaini, Arjun Subramonian, Huy Tu, William Agnew, Avijit Ghosh, Kyra Yee, Irene Font Peradejordi, Zeerak Talat, Mayra Russo, Jess de Jesus de Pinho Pinhal

    Abstract: Bias evaluation benchmarks and dataset and model documentation have emerged as central processes for assessing the biases and harms of artificial intelligence (AI) systems. However, these auditing processes have been criticized for their failure to integrate the knowledge of marginalized communities and consider the power dynamics between auditors and the communities. Consequently, modes of bias e…

    Submitted 25 July, 2023; v1 submitted 14 July, 2023; originally announced July 2023.

    Comments: To appear at AIES 2023

    Journal ref: 2023 AAAI/ACM Conference on AI, Ethics, and Society

  9. arXiv:2306.05949  [pdf, other]

    cs.CY cs.AI

    Evaluating the Social Impact of Generative AI Systems in Systems and Society

    Authors: Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Canyu Chen, Hal Daumé III, Jesse Dodge, Isabella Duan, Ellie Evans, Felix Friedrich, Avijit Ghosh, Usman Gohar, Sara Hooker, Yacine Jernite, Ria Kalluri, Alberto Lusoli, Alina Leidinger, Michelle Lin, Xiuzhu Lin, Sasha Luccioni, Jennifer Mickel, Margaret Mitchell, Jessica Newman, et al. (6 additional authors not shown)

    Abstract: Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categor…

    Submitted 28 June, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

    Comments: Forthcoming in Hacker, Engel, Hammer, Mittelstadt (eds), Oxford Handbook on the Foundations and Regulation of Generative AI. Oxford University Press

  10. arXiv:2305.09800  [pdf, other]

    cs.CL

    Mirages: On Anthropomorphism in Dialogue Systems

    Authors: Gavin Abercrombie, Amanda Cercas Curry, Tanvi Dinkar, Verena Rieser, Zeerak Talat

    Abstract: Automated dialogue or conversational systems are anthropomorphised by developers and personified by users. While a degree of anthropomorphism may be inevitable due to the choice of medium, conscious and unconscious design choices can guide users to personify such systems to varying degrees. Encouraging users to relate to automated systems as if they were human can lead to high risk scenarios cause…

    Submitted 23 October, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

    Comments: Accepted for publication at EMNLP. See ACL Anthology for published version

  11. arXiv:2304.08315  [pdf, other]

    cs.CL cs.AI

    Thorny Roses: Investigating the Dual Use Dilemma in Natural Language Processing

    Authors: Lucie-Aimée Kaffee, Arnav Arora, Zeerak Talat, Isabelle Augenstein

    Abstract: Dual use, the intentional, harmful reuse of technology and scientific artefacts, is a problem yet to be well-defined within the context of Natural Language Processing (NLP). However, as NLP technologies continue to advance and become increasingly widespread in society, their inner workings have become increasingly opaque. Therefore, understanding dual use concerns and potential ways of limiting th…

    Submitted 30 October, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

  12. Queer In AI: A Case Study in Community-Led Participatory AI

    Authors: Organizers Of QueerInAI: Anaelia Ovalle, Arjun Subramonian, Ashwin Singh, Claas Voelcker, Danica J. Sutherland, Davide Locatelli, Eva Breznik, Filip Klubička, Hang Yuan, Hetvi J, Huan Zhang, Jaidev Shriram, Kruno Lehman, Luca Soldaini, Maarten Sap, Marc Peter Deisenroth, Maria Leonor Pacheco, Maria Ryskina, Martin Mundt, Milind Agarwal, Nyx McLean, Pan Xu, A Pranav, et al. (26 additional authors not shown)

    Abstract: We present Queer in AI as a case study for community-led participatory design in AI. We examine how participatory design and intersectional tenets started and shaped this community's programs over the years. We discuss different challenges that emerged in the process, look at ways this organization has fallen short of operationalizing participatory and intersectional principles, and then assess th…

    Submitted 8 June, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: To appear at FAccT 2023

    Journal ref: 2023 ACM Conference on Fairness, Accountability, and Transparency

  13. arXiv:2302.09243  [pdf, other]

    cs.LG cs.AI cs.CL

    A Federated Approach for Hate Speech Detection

    Authors: Jay Gala, Deep Gandhi, Jash Mehta, Zeerak Talat

    Abstract: Hate speech detection has been the subject of high research attention, due to the scale of content created on social media. In spite of the attention and the sensitive nature of the task, privacy preservation in hate speech detection has remained under-studied. The majority of research has focused on centralised machine learning infrastructures which risk leaking data. In this paper, we show that…

    Submitted 18 February, 2023; originally announced February 2023.

    Comments: EACL 2023 Main Conference (Short Paper)

  14. arXiv:2211.06401  [pdf, other]

    cs.LG cs.CL

    A Federated Approach to Predicting Emojis in Hindi Tweets

    Authors: Deep Gandhi, Jash Mehta, Nirali Parekh, Karan Waghela, Lynette D'Mello, Zeerak Talat

    Abstract: The use of emojis affords a visual modality to, often private, textual communication. The task of predicting emojis however provides a challenge for machine learning as emoji use tends to cluster into the frequently used and the rarely used emojis. Much of the machine learning research on emoji use has focused on high resource languages and has conceptualised the task of predicting emojis around t…

    Submitted 11 November, 2022; originally announced November 2022.

    Comments: EMNLP2022 Main Track Short Paper

  15. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  16. arXiv:2210.06245  [pdf, other]

    cs.CL

    Back to the Future: On Potential Histories in NLP

    Authors: Zeerak Talat, Anne Lauscher

    Abstract: Machine learning and NLP require the construction of datasets to train and fine-tune models. In this context, previous work has demonstrated the sensitivity of these data sets. For instance, potential societal biases in this data are likely to be encoded and to be amplified in the models we deploy. In this work, we draw from developments in the field of history and take a novel perspective on thes…

    Submitted 12 October, 2022; originally announced October 2022.

  17. arXiv:2206.09917  [pdf, other]

    cs.CL

    Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models

    Authors: Paul Röttger, Haitham Seelawi, Debora Nozza, Zeerak Talat, Bertie Vidgen

    Abstract: Hate speech detection models are typically evaluated on held-out test sets. However, this risks painting an incomplete and potentially misleading picture of model performance because of increasingly well-documented systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, recent research has thus introduced functional tests for hate speech detection models. H…

    Submitted 20 June, 2022; originally announced June 2022.

    Comments: Accepted at WOAH (NAACL 2022)

  18. arXiv:2206.03216  [pdf, other]

    cs.CY cs.AI cs.CL

    Data Governance in the Age of Large-Scale Data-Driven Language Technology

    Authors: Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Gérard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Isaac Johnson, Dragomir Radev, Somaieh Nikpoor, Jörg Frohberg, Aaron Gokaslan, Peter Henderson, Rishi Bommasani, Margaret Mitchell

    Abstract: The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distrib…

    Submitted 2 November, 2022; v1 submitted 3 May, 2022; originally announced June 2022.

    Comments: 32 pages: Full paper and Appendices; Association for Computing Machinery, New York, NY, USA, 2206-2222

    Journal ref: Proceedings of 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22)

  19. arXiv:2201.10066  [pdf, other]

    cs.CL cs.DB

    Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

    Authors: Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien, Yacine Jernite

    Abstract: In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficie…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 8 pages plus appendix and references

  20. arXiv:2111.04158  [pdf, other]

    cs.CL cs.AI

    A Word on Machine Ethics: A Response to Jiang et al. (2021)

    Authors: Zeerak Talat, Hagen Blix, Josef Valvoda, Maya Indira Ganesh, Ryan Cotterell, Adina Williams

    Abstract: Ethics is one of the longest standing intellectual endeavors of humanity. In recent years, the fields of AI and NLP have attempted to wrangle with how learning systems that interact with humans should be constrained to behave ethically. One proposal in this vein is the construction of morality models that can take in arbitrary text and output a moral judgment about the situation described. In this…

    Submitted 7 November, 2021; originally announced November 2021.

    Comments: 11 pages, 2 figures, submitting soon to ACL Rolling Review