Showing 1–15 of 15 results for author: Dale, D

Searching in archive cs.
  1. arXiv:2401.16247  [pdf, other]

    cs.CL cs.CY

    Towards Red Teaming in Multimodal and Multilingual Translation

    Authors: Christophe Ropers, David Dale, Prangthip Hansanti, Gabriel Mejia Gonzalez, Ivan Evtimov, Corinne Wong, Christophe Touret, Kristina Pereyra, Seohyun Sonia Kim, Cristian Canton Ferrer, Pierre Andrews, Marta R. Costa-jussà

    Abstract: Assessing performance in Natural Language Processing is becoming increasingly complex. One particular challenge is the potential for evaluation datasets to overlap with training data, either directly or indirectly, which can lead to skewed results and overestimation of model performance. As a consequence, human evaluation is gaining increasing interest as a means to assess the performance and reli…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2312.05187

    ACM Class: I.2.7

  2. arXiv:2401.05060  [pdf, other]

    cs.SD cs.CL eess.AS

    MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector

    Authors: Marta R. Costa-jussà, Mariano Coria Meglioli, Pierre Andrews, David Dale, Prangthip Hansanti, Elahe Kalbassi, Alex Mourachko, Christophe Ropers, Carleigh Wood

    Abstract: Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels. The dataset comprises 20,000 audio u…

    Submitted 27 June, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    ACM Class: I.2.7

  3. arXiv:2312.05187  [pdf, other]

    cs.CL cs.SD eess.AS

    Seamless: Multilingual Expressive and Streaming Speech Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

    Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4…

    Submitted 8 December, 2023; originally announced December 2023.

  4. arXiv:2311.13937  [pdf, other]

    cs.CL

    Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification

    Authors: Daryna Dementieva, Daniil Moskovskiy, David Dale, Alexander Panchenko

    Abstract: Text detoxification is the task of transferring the style of text from toxic to neutral. While there are approaches yielding promising results in monolingual setup, e.g., (Dale et al., 2021; Hallinan et al., 2022), cross-lingual transfer for this task remains a challenging open problem (Moskovskiy et al., 2022). In this work, we present a large-scale study of strategies for cross-lingual text detox…

    Submitted 23 November, 2023; originally announced November 2023.

    Comments: AACL 2023, main conference, long paper

  5. arXiv:2311.06532  [pdf, other]

    cs.CL

    Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation

    Authors: Marta R. Costa-jussà, David Dale, Maha Elbayad, Bokai Yu

    Abstract: Added toxicity in the context of translation refers to the fact of producing a translation output with more toxicity than there exists in the input. In this paper, we present MinTox which is a novel pipeline to identify added toxicity and mitigate this issue which works at inference time. MinTox uses a toxicity detection classifier which is multimodal (speech and text) and works in languages at sc…

    Submitted 11 November, 2023; originally announced November 2023.

    ACM Class: I.2.7

  6. arXiv:2309.11585  [pdf, other]

    cs.CL

    SpeechAlign: a Framework for Speech Translation Alignment Evaluation

    Authors: Belen Alastruey, Aleix Sant, Gerard I. Gállego, David Dale, Marta R. Costa-jussà

    Abstract: Speech-to-Speech and Speech-to-Text translation are currently dynamic areas of research. In our commitment to advance these fields, we present SpeechAlign, a framework designed to evaluate the underexplored field of source-target alignment in speech models. The SpeechAlign framework has two core components. First, to tackle the absence of suitable evaluation datasets, we introduce the Speech Gold…

    Submitted 25 April, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: LREC-COLING 2024

  7. arXiv:2308.11596  [pdf, other]

    cs.CL

    SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, et al. (43 additional authors not shown)

    Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded s…

    Submitted 24 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    ACM Class: I.2.7

  8. arXiv:2308.09055  [pdf, other]

    cs.CL

    Don't lose the message while paraphrasing: A study on content preserving style transfer

    Authors: Nikolay Babakov, David Dale, Ilya Gusev, Irina Krotova, Alexander Panchenko

    Abstract: Text style transfer techniques are gaining popularity in natural language processing, allowing text to be paraphrased in the required form: from toxic to neutral, from formal to informal, from old to the modern English language, etc. Solving the task is not sufficient to generate some neutral/informal/modern text, but it is important to preserve the original content unchanged. This requirement becomes eve…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: Published at the NLDB 2023 conference

  9. arXiv:2305.11746  [pdf, other]

    cs.CL

    HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation

    Authors: David Dale, Elena Voita, Janice Lam, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Loïc Barrault, Marta R. Costa-jussà

    Abstract: Hallucinations in machine translation are translations that contain information completely unrelated to the input. Omissions are translations that do not include some of the input information. While both cases tend to be catastrophic errors undermining user trust, annotated data with these types of pathologies is extremely scarce and is limited to a few high-resource languages. In this work, we re…

    Submitted 5 December, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    ACM Class: I.2.7

    Journal ref: EMNLP 2023

  10. arXiv:2212.08597  [pdf, other]

    cs.CL

    Detecting and Mitigating Hallucinations in Machine Translation: Model Internal Workings Alone Do Well, Sentence Similarity Even Better

    Authors: David Dale, Elena Voita, Loïc Barrault, Marta R. Costa-jussà

    Abstract: While the problem of hallucinations in neural machine translation has long been recognized, so far the progress on its alleviation is very little. Indeed, recently it turned out that without artificially encouraging models to hallucinate, previously existing methods fall short and even the standard sequence log-probability is more informative. It means that characteristics internal to the model ca…

    Submitted 20 December, 2022; v1 submitted 16 December, 2022; originally announced December 2022.

    ACM Class: I.2.7

  11. arXiv:2209.09368  [pdf, other]

    cs.CL

    The first neural machine translation system for the Erzya language

    Authors: David Dale

    Abstract: We present the first neural machine translation system for translation between the endangered Erzya language and Russian and the dataset collected by us to train and evaluate it. The BLEU scores are 17 and 19 for translation to Erzya and Russian respectively, and more than half of the translations are rated as acceptable by native speakers. We also adapt our model to translate between Erzya and 10…

    Submitted 19 September, 2022; originally announced September 2022.

    Comments: Accepted to the Field Matters workshop at the COLING 2022 conference

  12. Studying the role of named entities for content preservation in text style transfer

    Authors: Nikolay Babakov, David Dale, Varvara Logacheva, Irina Krotova, Alexander Panchenko

    Abstract: Text style transfer techniques are gaining popularity in Natural Language Processing, finding various applications such as text detoxification, sentiment, or formality transfer. However, the majority of the existing approaches were tested on such domains as online communications on public platforms, music, or entertainment yet none of them were applied to the domains which are typical for task-ori…

    Submitted 20 June, 2022; originally announced June 2022.

    Journal ref: Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham, pp. 437–448

  13. arXiv:2109.08914  [pdf, other]

    cs.CL cs.LG

    Text Detoxification using Large Pre-trained Neural Models

    Authors: David Dale, Anton Voronov, Daryna Dementieva, Varvara Logacheva, Olga Kozlova, Nikita Semenov, Alexander Panchenko

    Abstract: We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second…

    Submitted 3 November, 2021; v1 submitted 18 September, 2021; originally announced September 2021.

    Comments: Accepted to the EMNLP 2021 conference

  14. arXiv:2105.09052  [pdf, other]

    cs.CL cs.LG

    Methods for Detoxification of Texts for the Russian Language

    Authors: Daryna Dementieva, Daniil Moskovskiy, Varvara Logacheva, David Dale, Olga Kozlova, Nikita Semenov, Alexander Panchenko

    Abstract: We introduce the first study of automatic detoxification of Russian texts to combat offensive language. Such a kind of textual style transfer can be used, for instance, for processing toxic content in social media. While much work has been done for the English language in this field, it has never been solved for the Russian language yet. We test two types of models - unsupervised approach based on…

    Submitted 19 May, 2021; originally announced May 2021.

  15. arXiv:1909.08552  [pdf, other]

    cs.AI

    An Automated Engineering Assistant: Learning Parsers for Technical Drawings

    Authors: Dries Van Daele, Nicholas Decleyre, Herman Dubois, Wannes Meert

    Abstract: From a set of technical drawings and expert knowledge, we automatically learn a parser to interpret such a drawing. This enables automatic reasoning and learning on top of a large database of technical drawings. In this work, we develop a similarity based search algorithm to help engineers and designers find or complete designs more easily and flexibly. This is part of an ongoing effort to build a…

    Submitted 18 September, 2019; originally announced September 2019.