
Showing 1–6 of 6 results for author: Kuchnik, M

  1. arXiv:2404.12241

    cs.CL cs.AI

    Introducing v0.5 of the AI Safety Benchmark from MLCommons

    Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, et al. (75 additional authors not shown)

    Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…

    Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

  2. arXiv:2403.19546

    cs.LG cs.AI cs.DB cs.IR

    Croissant: A Metadata Format for ML-Ready Datasets

    Authors: Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, Carole-Jean Wu

    Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is…

    Submitted 30 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: Published in Proceedings of ACM SIGMOD/PODS'24 Data Management for End-to-End Machine Learning (DEEM) Workshop https://dl.acm.org/doi/10.1145/3650203.3663326

  3. arXiv:2211.15458

    cs.LG cs.CL

    Validating Large Language Models with ReLM

    Authors: Michael Kuchnik, Virginia Smith, George Amvrosiadis

    Abstract: Although large language models (LLMs) have been touted for their ability to generate natural-sounding text, there are growing concerns around possible negative effects of LLMs such as data memorization, bias, and inappropriate language. Unfortunately, the complexity and generation capacities of LLMs make validating (and correcting) such concerns difficult. In this work, we introduce ReLM, a system…

    Submitted 8 May, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

  4. arXiv:2111.04131

    cs.LG cs.PF

    Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines

    Authors: Michael Kuchnik, Ana Klimovic, Jiri Simsa, Virginia Smith, George Amvrosiadis

    Abstract: Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, it is challenging to implement efficient input pipelines, as it requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over two million ML jobs in Google datacenters reveals that a significant fraction of…

    Submitted 21 March, 2022; v1 submitted 7 November, 2021; originally announced November 2021.

  5. arXiv:1911.00472

    cs.LG stat.ML

    Progressive Compressed Records: Taking a Byte out of Deep Learning Data

    Authors: Michael Kuchnik, George Amvrosiadis, Virginia Smith

    Abstract: Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserve bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses compression to reduce the overhead of fetching and transporting data, effe…

    Submitted 11 August, 2021; v1 submitted 1 November, 2019; originally announced November 2019.

  6. arXiv:1810.05222

    cs.LG stat.ML

    Efficient Augmentation via Data Subsampling

    Authors: Michael Kuchnik, Virginia Smith

    Abstract: Data augmentation is commonly used to encode invariances in learning methods. However, this process is often performed in an inefficient manner, as artificial examples are created by applying a number of transformations to all points in the training set. The resulting explosion of the dataset size can be an issue in terms of storage and training costs, as well as in selecting and tuning the optima…

    Submitted 1 March, 2019; v1 submitted 11 October, 2018; originally announced October 2018.