Showing 1–19 of 19 results for author: Elsahar, H

  1. arXiv:2401.17264 [pdf, other]

    cs.SD cs.AI cs.CR

    Proactive Detection of Voice Cloning with Localized Watermarking

    Authors: Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar

    Abstract: In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized waterma…

    Submitted 6 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Published at ICML 2024. Code at https://github.com/facebookresearch/audioseal - webpage at https://pierrefdz.github.io/publications/audioseal/

  2. arXiv:2312.05187 [pdf, other]

    cs.CL cs.SD eess.AS

    Seamless: Multilingual Expressive and Streaming Speech Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

    Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4…

    Submitted 8 December, 2023; originally announced December 2023.

  3. arXiv:2308.11596 [pdf, other]

    cs.CL

    SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, et al. (43 additional authors not shown)

    Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded s…

    Submitted 24 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    ACM Class: I.2.7

  4. arXiv:2211.05100 [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  5. arXiv:2210.15424 [pdf, other]

    cs.CL cs.AI cs.LG

    What Language Model to Train if You Have One Million GPU Hours?

    Authors: Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Launay, Iz Beltagy

    Abstract: The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notabl…

    Submitted 7 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Findings of EMNLP 2022

  6. arXiv:2206.00761 [pdf, other]

    cs.LG cs.CL stat.ML

    On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

    Authors: Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman

    Abstract: The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged t…

    Submitted 14 November, 2022; v1 submitted 1 June, 2022; originally announced June 2022.

  7. arXiv:2201.10066 [pdf, other]

    cs.CL cs.DB

    Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

    Authors: Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien, Yacine Jernite

    Abstract: In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficie…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 8 pages plus appendix and references

  8. arXiv:2112.05702 [pdf, other]

    cs.LG cs.CL cs.NE

    Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs

    Authors: Bryan Eikema, Germán Kruszewski, Hady Elsahar, Marc Dymetman

    Abstract: Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can aid us in obtaining samples if some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples but is often di…

    Submitted 10 December, 2021; originally announced December 2021.

  9. arXiv:2112.00791 [pdf, other]

    cs.LG cs.CL

    Controlling Conditional Language Models without Catastrophic Forgetting

    Authors: Tomasz Korbak, Hady Elsahar, German Kruszewski, Marc Dymetman

    Abstract: Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g., hallucinations in abstractive summarization or style violations in c…

    Submitted 20 June, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

    Comments: ICML 2022

  10. arXiv:2111.02878 [pdf, other]

    cs.CL cs.IR

    Unsupervised and Distributional Detection of Machine-Generated Text

    Authors: Matthias Gallé, Jos Rozen, Germán Kruszewski, Hady Elsahar

    Abstract: The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human or machine-authored. The problem so far has been framed in a standard supervised way and consists in training a classifier on annotated data to predict the origin of one given new document. In this paper, we frame the problem in an unsupervised and distributi…

    Submitted 4 November, 2021; originally announced November 2021.

    Comments: 10 pages

  11. arXiv:2106.04985 [pdf, other]

    cs.LG cs.CL cs.NE cs.SE

    Energy-Based Models for Code Generation under Compilability Constraints

    Authors: Tomasz Korbak, Hady Elsahar, Marc Dymetman, Germán Kruszewski

    Abstract: Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint s…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: Accepted for the First Workshop on Natural Language Processing for Programming, ACL 2021

    ACM Class: I.2.2; I.2.7; I.2.6; I.5.1

  12. arXiv:2102.12511 [pdf, other]

    cs.CL cs.CY cs.DL

    References in Wikipedia: The Editors' Perspective

    Authors: Lucie-Aimée Kaffee, Hady Elsahar

    Abstract: References are an essential part of Wikipedia. Each statement in Wikipedia should be referenced. In this paper, we explore the creation and collection of references for new Wikipedia articles from an editors' perspective. We map out the workflow of editors when creating a new article, emphasising how they select references.

    Submitted 4 March, 2021; v1 submitted 24 February, 2021; originally announced February 2021.

  13. arXiv:2012.11635 [pdf, other]

    cs.CL cs.AI cs.LG

    A Distributional Approach to Controlled Text Generation

    Authors: Muhammad Khalifa, Hady Elsahar, Marc Dymetman

    Abstract: We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach permits to specify, in a single formal framework, both "pointwise" and "distributional" constraints over the target LM -- to our knowledge, the first model with such generality -- while minimizing KL divergence from the initial LM distribution. The optimal target dis…

    Submitted 6 May, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

    Comments: ICLR 2021 camera-ready version

  14. arXiv:2010.02353 [pdf, other]

    cs.CL cs.AI cs.LG

    Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

    Authors: Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon Kabongo, Salomey Osei, Sackey Freshia, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Jane Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, et al. (23 additional authors not shown)

    Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communicat…

    Submitted 6 November, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020; updated benchmarks

  15. arXiv:2004.14754 [pdf, other]

    cs.CL cs.LG

    Self-Supervised and Controlled Multi-Document Opinion Summarization

    Authors: Hady Elsahar, Maximin Coavoux, Matthias Gallé, Jos Rozen

    Abstract: We address the problem of unsupervised abstractive summarization of collections of user generated reviews with self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on standard log-likelihood loss. We address the problem o…

    Submitted 30 April, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: 18 pages including 5 pages appendix

  16. arXiv:1803.07116 [pdf, other]

    cs.CL

    Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata

    Authors: Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl

    Abstract: While Wikipedia exists in 287 languages, its content is unevenly distributed among them. In this work, we investigate the generation of open domain Wikipedia summaries in underserved languages using structured data from Wikidata. To this end, we propose a neural network architecture equipped with copy actions that learns to generate single-sentence and comprehensible textual summaries from Wikidat…

    Submitted 29 April, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

    Comments: NAACL HLT 2018

  17. arXiv:1802.06842 [pdf, other]

    cs.CL

    Zero-Shot Question Generation from Knowledge Graphs for Unseen Predicates and Entity Types

    Authors: Hady Elsahar, Christophe Gravier, Frederique Laforest

    Abstract: We present a neural model for question generation from knowledge base triples in a "Zero-Shot" setup, that is generating questions for triples containing predicates, subject types or object types that were not seen at training time. Our model leverages triples occurrences in the natural language corpus in an encoder-decoder architecture, paired with an original part-of-speech copy action mechanism…

    Submitted 19 February, 2018; originally announced February 2018.

  18. Unsupervised Open Relation Extraction

    Authors: Hady Elsahar, Elena Demidova, Simon Gottschalk, Christophe Gravier, Frederique Laforest

    Abstract: We explore methods to extract relations between named entities from free text in an unsupervised setting. In addition to standard feature extraction, we develop a novel method to re-weight word embeddings. We alleviate the problem of features sparsity using an individual feature reduction. Our approach exhibits a significant improvement by 5.8% over the state-of-the-art relation clustering scoring…

    Submitted 22 January, 2018; originally announced January 2018.

    Comments: 4 pages, published in ESWC 2017

  19. arXiv:1711.00155 [pdf, other]

    cs.CL

    Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples

    Authors: Pavlos Vougiouklis, Hady Elsahar, Lucie-Aimée Kaffee, Christophe Gravier, Frederique Laforest, Jonathon Hare, Elena Simperl

    Abstract: Most people do not interact with Semantic Web data directly. Unless they have the expertise to understand the underlying technology, they need textual or visual interfaces to help them make sense of it. We explore the problem of generating natural language summaries for Semantic Web data. This is non-trivial, especially in an open-domain context. To address this problem, we explore the use of neur…

    Submitted 31 October, 2017; originally announced November 2017.