Showing 1–4 of 4 results for author: Bollacker, K

  1. arXiv:2404.12241  [pdf, other]

    cs.CL cs.AI

    Introducing v0.5 of the AI Safety Benchmark from MLCommons

    Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, et al. (75 additional authors not shown)

    Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…

    Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

  2. arXiv:2311.13028  [pdf, other]

    cs.LG cs.AI cs.DC eess.SP

    DMLR: Data-centric Machine Learning Research -- Past, Present and Future

    Authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, et al. (13 additional authors not shown)

    Abstract: Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods tow…

    Submitted 1 June, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Published in the Journal of Data-centric Machine Learning Research (DMLR) at https://data.mlr.press/assets/pdf/v01-5.pdf

  3. arXiv:2310.16787  [pdf, other]

    cs.CL cs.AI cs.LG

    The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

    Authors: Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

    Abstract: The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tool…

    Submitted 4 November, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

    Comments: 30 pages (18 main), 6 figures, 5 tables

  4. arXiv:2308.15710  [pdf, ps, other]

    cs.AI cs.LG

    Speech Wikimedia: A 77 Language Multilingual Speech Dataset

    Authors: Rafael Mosquera Gómez, Julián Eusse, Juan Ciro, Daniel Galvez, Ryan Hileman, Kurt Bollacker, David Kanter

    Abstract: The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recogni…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: Data-Centric Machine Learning Workshop at the International Machine Learning Conference 2023 (ICML)