Showing 1–5 of 5 results for author: Brannon, W

  1. arXiv:2404.12691

    cs.AI cs.CY

    Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?

    Authors: Shayne Longpre, Robert Mahari, Naana Obeng-Marnu, William Brannon, Tobin South, Katy Gero, Sandy Pentland, Jad Kabbara

    Abstract: New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy foundation models…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: 9 pages, 2 tables

  2. arXiv:2310.16787

    cs.CL cs.AI cs.LG

    The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

    Authors: Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

    Abstract: The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tool…

    Submitted 4 November, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

    Comments: 30 pages (18 main), 6 figures, 5 tables

  3. arXiv:2305.14321

    cs.CL

    ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings

    Authors: William Brannon, Wonjune Kang, Suyash Fulay, Hang Jiang, Brandon Roy, Deb Roy, Jad Kabbara

    Abstract: Learning on text-attributed graphs (TAGs), in which nodes are associated with one or more texts, has been the subject of much recent work. However, most approaches tend to make strong assumptions about the downstream task of interest, are reliant on hand-labeled data, or fail to equally balance the importance of both text and graph representations. In this work, we propose Contrastive Graph-Text p…

    Submitted 9 July, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: New visualizations, added references, and an application to community detection. To appear at the TextGraphs workshop @ ACL 2024. 21 pages, 5 figures, 13 tables

  4. Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing

    Authors: William Brannon, Yogesh Virkar, Brian Thompson

    Abstract: We investigate how humans perform the task of dubbing video content from one language into another, leveraging a novel corpus of 319.57 hours of video from 54 professionally produced titles. This is the first such large-scale study we are aware of. The results challenge a number of assumptions commonly made in both qualitative literature on human dubbing and machine-learning literature on automati…

    Submitted 22 December, 2022; originally announced December 2022.

    Comments: Accepted at TACL. Pre-MIT Press publication version.

    Journal ref: Transactions of ACL, vol. 11, pp. 419-435 (2023)

  5. RadioTalk: a large-scale corpus of talk radio transcripts

    Authors: Doug Beeferman, William Brannon, Deb Roy

    Abstract: We introduce RadioTalk, a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October of 2018 and March of 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversational analysis, and the social sciences. The corpus encompasses approximately 2.8 billion words of automatically transcribed speech f…

    Submitted 16 July, 2019; originally announced July 2019.

    Comments: 5 pages, 4 figures, accepted by Interspeech 2019

    Journal ref: Proc. Interspeech 2019, 564-568 (2019)