
Showing 1–15 of 15 results for author: Andrews, P

Searching in archive cs.

  1. arXiv:2401.16247  [pdf, other]

    cs.CL cs.CY

    Towards Red Teaming in Multimodal and Multilingual Translation

    Authors: Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, Pierre Andrews, Marta R. Costa-jussà

    Abstract: Assessing performance in Natural Language Processing is becoming increasingly complex. One particular challenge is the potential for evaluation datasets to overlap with training data, either directly or indirectly, which can lead to skewed results and overestimation of model performance. As a consequence, human evaluation is gaining increasing interest as a means to assess the performance and reli…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2312.05187

    ACM Class: I.2.7

  2. arXiv:2401.05060  [pdf, other]

    cs.SD cs.CL eess.AS

    MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector

    Authors: Marta R. Costa-jussà, Mariano Coria Meglioli, Pierre Andrews, David Dale, Prangthip Hansanti, Elahe Kalbassi, Alex Mourachko, Christophe Ropers, Carleigh Wood

    Abstract: Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels. The dataset comprises 20,000 audio u…

    Submitted 27 June, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    ACM Class: I.2.7

  3. arXiv:2312.05187  [pdf, other]

    cs.CL cs.SD eess.AS

    Seamless: Multilingual Expressive and Streaming Speech Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

    Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4…

    Submitted 8 December, 2023; originally announced December 2023.

  4. arXiv:2309.03175  [pdf, other]

    cs.CL

    Gender-specific Machine Translation with Large Language Models

    Authors: Eduardo Sánchez, Pierre Andrews, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà

    Abstract: While machine translation (MT) systems have seen significant improvements, it is still common for translations to reflect societal biases, such as gender bias. Decoder-only Large Language Models (LLMs) have demonstrated potential in MT, albeit with performance slightly lagging behind traditional encoder-decoder Neural Machine Translation (NMT) systems. However, LLMs offer a unique advantage: the a…

    Submitted 16 April, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

  5. arXiv:2308.16871  [pdf, other]

    cs.CL cs.AI

    The Gender-GAP Pipeline: A Gender-Aware Polyglot Pipeline for Gender Characterisation in 55 Languages

    Authors: Benjamin Muller, Belen Alastruey, Prangthip Hansanti, Elahe Kalbassi, Christophe Ropers, Eric Michael Smith, Adina Williams, Luke Zettlemoyer, Pierre Andrews, Marta R. Costa-jussà

    Abstract: Gender biases in language generation systems are challenging to mitigate. One possible source for these biases is gender representation disparities in the training and evaluation data. Despite recent progress in documenting this problem and many attempts at mitigating it, we still lack shared methodology and tooling to report gender representation in large datasets. Such quantitative reporting wil…

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: 15 pages

  6. arXiv:2308.11596  [pdf, other]

    cs.CL

    SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, et al. (43 additional authors not shown)

    Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded s…

    Submitted 24 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    ACM Class: I.2.7

  7. arXiv:2305.13198  [pdf, other]

    cs.CL

    Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale

    Authors: Marta R. Costa-jussà, Pierre Andrews, Eric Smith, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Daniel Licht, Carleigh Wood

    Abstract: We introduce a multilingual extension of the HOLISTICBIAS dataset, the largest English template-based taxonomy of textual people references: MULTILINGUALHOLISTICBIAS. This extension consists of 20,459 sentences in 50 languages distributed across all 13 demographic axes. Source sentences are built from combinations of 118 demographic descriptors and three patterns, excluding nonsensical combination…

    Submitted 22 May, 2023; originally announced May 2023.

    ACM Class: I.2.7

  8. arXiv:2212.08486  [pdf, other]

    cs.CL

    BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric

    Authors: Mingda Chen, Paul-Ambroise Duquenne, Pierre Andrews, Justine Kao, Alexandre Mourachko, Holger Schwenk, Marta R. Costa-jussà

    Abstract: End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR sy…

    Submitted 16 December, 2022; originally announced December 2022.

    ACM Class: I.2.7

  9. arXiv:2208.10087  [pdf]

    cs.CY

    A Trust Framework for Government Use of Artificial Intelligence and Automated Decision Making

    Authors: Pia Andrews, Tim de Sousa, Bruce Haefele, Matt Beard, Marcus Wigan, Abhinav Palia, Kathy Reid, Saket Narayan, Morgan Dumitru, Alex Morrison, Geoff Mason, Aurelie Jacquet

    Abstract: This paper identifies the current challenges of the mechanisation, digitisation and automation of public sector systems and processes, and proposes a modern and practical framework to ensure and assure ethical and high veracity Artificial Intelligence (AI) or Automated Decision Making (ADM) systems in public institutions. This framework is designed for the specific context of the public sector, in…

    Submitted 22 August, 2022; originally announced August 2022.

    Comments: Comments were integrated into the paper from all peer reviewers. Am happy to provide a copied history of comments if useful

  10. arXiv:2207.04672  [pdf]

    cs.CL cs.AI

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Authors: NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, et al. (14 additional authors not shown)

    Abstract: Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality res…

    Submitted 25 August, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

    Comments: 190 pages

    MSC Class: 68T50 ACM Class: I.2.7

  11. arXiv:1911.07201  [pdf]

    cs.CV eess.IV

    Countering Inconsistent Labelling by Google's Vision API for Rotated Images

    Authors: Aman Apte, Aritra Bandyopadhyay, K Akhilesh Shenoy, Jason Peter Andrews, Aditya Rathod, Manish Agnihotri, Aditya Jajodia

    Abstract: Google's Vision API analyses images and provides a variety of output predictions, one such type is context-based labelling. In this paper, it is shown that adversarial examples that cause incorrect label prediction and spoofing can be generated by rotating the images. Due to the black-boxed nature of the API, a modular context-based pre-processing pipeline is proposed consisting of a Res-Net50 mod…

    Submitted 17 November, 2019; originally announced November 2019.

    Comments: 11 pages, 9 figures, Accepted at ICICV 2020 Jaipur India

  12. arXiv:1704.01942  [pdf, other]

    cs.HC stat.ML

    ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models

    Authors: Minsuk Kahng, Pierre Y. Andrews, Aditya Kalro, Duen Horng Chau

    Abstract: While deep learning models have achieved state-of-the-art accuracies for many prediction tasks, understanding these models remains a challenge. Despite the recent interest in developing visual tools to help users interpret deep learning models, the complexity and wide variety of models deployed in industry, and the large-scale datasets that they used, pose unique design challenges that are inadequ…

    Submitted 8 August, 2017; v1 submitted 6 April, 2017; originally announced April 2017.

    Comments: Will be presented at IEEE VAST 2017 and published in IEEE Transactions on Visualization and Computer Graphics, 24(1)

  13. arXiv:1601.07925  [pdf, other]

    cs.LG cs.NE

    Automating biomedical data science through tree-based pipeline optimization

    Authors: Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, Jason H. Moore

    Abstract: Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning---pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and…

    Submitted 28 January, 2016; originally announced January 2016.

    Comments: 16 pages, 5 figures, to appear in EvoBIO 2016 proceedings

  14. arXiv:cs/0412117  [pdf, ps, other]

    cs.CL

    Thematic Annotation: extracting concepts out of documents

    Authors: Pierre Andrews, Martin Rajman

    Abstract: Contrarily to standard approaches to topic annotation, the technique used in this work does not centrally rely on some sort of -- possibly statistical -- keyword extraction. In fact, the proposed annotation algorithm uses a large scale semantic database -- the EDR Electronic Dictionary -- that provides a concept hierarchy based on hyponym and hypernym relations. This concept hierarchy is used to…

    Submitted 29 December, 2004; originally announced December 2004.

    Comments: Technical report EPFL/LIA. 81 pages, 16 figures

    Report number: IC/2004/68 ACM Class: I.2.7; I.7

  15. arXiv:cs/0412114  [pdf]

    cs.CL

    State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"

    Authors: Martin Rajman, Martin Vesely, Pierre Andrews

    Abstract: Several Networks of Excellence have been set up in the framework of the European FP5 research program. Among these Networks of Excellence, the NEMIS project focuses on the field of Text Mining. Within this field, document processing and visualization was identified as one of the key topics and the WG1 working group was created in the NEMIS project, to carry out a detailed survey of techniques…

    Submitted 29 December, 2004; originally announced December 2004.

    Comments: 54 pages, Report of Working Group 1 for the European Network of Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)

    ACM Class: I.2.7; I.7