Showing 1–50 of 70 results for author: Dupoux, E

  1. arXiv:2404.19409  [pdf, other]

    cs.CL

    Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

    Authors: Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, Olivier Pietquin

    Abstract: While Reinforcement Learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, requiring computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization:…

    Submitted 30 April, 2024; originally announced April 2024.

  2. arXiv:2403.11958  [pdf, other]

    cs.CL cs.MA

    Language Evolution with Deep Learning

    Authors: Mathieu Rita, Paul Michel, Rahma Chaabouni, Olivier Pietquin, Emmanuel Dupoux, Florian Strub

    Abstract: Computational modeling plays an essential role in the study of language emergence. It aims to simulate the conditions and learning processes that could trigger the emergence of a structured language within a simulated controlled environment. Several methods have been used to investigate the origin of our language, including agent-based systems, Bayesian agents, genetic algorithms, and rule-based s…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: to appear in the Oxford Handbook of Approaches to Language Evolution

  3. arXiv:2402.05755  [pdf, other]

    cs.CL cs.SD eess.AS

    SpiRit-LM: Interleaved Spoken and Written Language Model

    Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux

    Abstract: We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated…

    Submitted 8 February, 2024; originally announced February 2024.

  4. arXiv:2312.14069  [pdf, other]

    cs.CL cs.SD eess.AS

    EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models

    Authors: Maureen de Seyssel, Antony D'Avirro, Adina Williams, Emmanuel Dupoux

    Abstract: We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a…

    Submitted 21 December, 2023; originally announced December 2023.

  5. arXiv:2311.15930  [pdf, other]

    cs.CL cs.AI

    WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

    Authors: Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent

    Abstract: We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with their own trivial control, which explicitly avoids bias by decorrelating the abstract structure of…

    Submitted 27 November, 2023; originally announced November 2023.

  6. arXiv:2310.05235  [pdf, other]

    cs.CL cs.SD eess.AS

    XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words

    Authors: Robin Algayres, Pablo Diego-Simon, Benoit Sagot, Emmanuel Dupoux

    Abstract: Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models that have proved to quickly adapt to new tasks through fine-tuning, even in low resource conditions. Taking inspiration from semi-supervised learning…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Findings at EMNLP 2023

  7. arXiv:2310.05224  [pdf, other]

    cs.CL cs.LG

    Generative Spoken Language Model based on continuous word-sized audio tokens

    Authors: Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux

    Abstract: In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can g…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Conference paper at EMNLP 2023

  8. arXiv:2309.17020  [pdf, other]

    eess.AS cs.SD

    Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

    Authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

    Abstract: Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TT…

    Submitted 4 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ASRU 2023 SPARKS Workshop

  9. arXiv:2308.05725  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

    Authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

    Abstract: Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthes…

    Submitted 10 August, 2023; originally announced August 2023.

  10. arXiv:2306.01506  [pdf, other]

    cs.CL eess.AS stat.ML

    BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

    Authors: Marvin Lavechin, Yaya Sy, Hadrien Titeux, María Andrea Cruz Blandón, Okko Räsänen, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

    Abstract: Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and b…

    Submitted 8 June, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Proceedings of Interspeech 2023

  11. arXiv:2305.13009  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Textually Pretrained Speech Language Models

    Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

    Abstract: Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model de…

    Submitted 30 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  12. arXiv:2302.12057  [pdf, other]

    cs.CL cs.SD eess.AS

    ProsAudit, a prosodic benchmark for self-supervised speech models

    Authors: Maureen de Seyssel, Marvin Lavechin, Hadrien Titeux, Arthur Thomas, Gwendal Virlet, Andrea Santos Revilla, Guillaume Wisniewski, Bogdan Ludusan, Emmanuel Dupoux

    Abstract: We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inser…

    Submitted 1 June, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: Accepted at Interspeech 2023. 4 pages + references, 1 figure

  13. arXiv:2211.13152  [pdf, other]

    cs.NE cs.LG cs.SD eess.AS

    Introducing topography in convolutional neural networks

    Authors: Maxime Poli, Emmanuel Dupoux, Rachid Riad

    Abstract: Parts of the brain that carry out sensory tasks are organized topographically: nearby neurons are responsive to the same properties of input signals. Thus, in this work, inspired by the neuroscience literature, we proposed a new topographic inductive bias in Convolutional Neural Networks (CNNs). To achieve this, we introduced a new topographic loss and an efficient implementation to topographically or…

    Submitted 28 October, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  14. arXiv:2210.15775  [pdf, other]

    cs.CL cs.SD eess.AS

    Evaluating context-invariance in unsupervised speech representations

    Authors: Mark Hallap, Emmanuel Dupoux, Ewan Dunbar

    Abstract: Unsupervised speech representations have taken off, with benchmarks (SUPERB, ZeroSpeech) demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of "discovering the phonemes" of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is cont…

    Submitted 30 May, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  15. arXiv:2210.15759  [pdf, other]

    cs.CL cs.SD eess.AS

    Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge

    Authors: Ewan Dunbar, Nicolas Hamilakis, Emmanuel Dupoux

    Abstract: Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defi…

    Submitted 27 October, 2022; originally announced October 2022.

    Journal ref: IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1211-1226, October 2022. Print ISSN: 1932-4553; Online ISSN: 1941-0484; DOI: 10.1109/JSTSP.2022.3206084

  16. arXiv:2210.13248  [pdf, other]

    eess.AS cs.SD

    Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

    Authors: Marvin Lavechin, Marianne Métais, Hadrien Titeux, Alodie Boissonnet, Jade Copet, Morgane Rivière, Elika Bergelson, Alejandrina Cristia, Emmanuel Dupoux, Hervé Bredin

    Abstract: Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in…

    Submitted 25 May, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

  17. arXiv:2210.02956  [pdf, other]

    cs.CL

    Are word boundaries useful for unsupervised language learning?

    Authors: Tu Anh Nguyen, Maureen de Seyssel, Robin Algayres, Patricia Roze, Ewan Dunbar, Emmanuel Dupoux

    Abstract: Word or word-fragment based Language Models (LM) are typically preferred over character-based ones in many downstream applications. This may not be surprising as words seem more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: This is an archived version from September 2020

  18. arXiv:2209.15483  [pdf, other]

    cs.CL cs.LG eess.AS

    Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

    Authors: Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, Yossi Adi

    Abstract: Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensi…

    Submitted 29 May, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

  19. arXiv:2209.15342  [pdf, other]

    cs.MA cs.CL cs.IT

    Emergent Communication: Generalization and Overfitting in Lewis Games

    Authors: Mathieu Rita, Corentin Tallec, Paul Michel, Jean-Bastien Grill, Olivier Pietquin, Emmanuel Dupoux, Florian Strub

    Abstract: Lewis signaling games are a class of simple communication games for simulating the emergence of language. In these games, two agents must agree on a communication protocol in order to solve a cooperative task. Previous work has shown that agents trained to play this game with reinforcement learning tend to develop languages that display undesirable properties from a linguistic point of view (lack…

    Submitted 15 October, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

  20. arXiv:2207.10643  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    STOP: A dataset for Spoken Task Oriented Semantic Parsing

    Authors: Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Anh Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed

    Abstract: End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assi…

    Submitted 18 October, 2022; v1 submitted 28 June, 2022; originally announced July 2022.

  21. arXiv:2206.13415  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Is the Language Familiarity Effect gradual? A computational modelling approach

    Authors: Maureen de Seyssel, Guillaume Wisniewski, Emmanuel Dupoux

    Abstract: According to the Language Familiarity Effect (LFE), people are better at discriminating between speakers of their native language. Although this cognitive effect has been widely studied in the literature, experiments have only been conducted on a limited number of language pairs and their results only show the presence of the effect without yielding a gradual measure that may vary across language pair…

    Submitted 27 June, 2022; originally announced June 2022.

    Comments: 8 pages, 2 figures, accepted at CogSci 2022

  22. arXiv:2206.11332  [pdf, other]

    cs.CL

    DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

    Authors: Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux

    Abstract: Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding…

    Submitted 22 June, 2022; originally announced June 2022.

  23. arXiv:2204.12982  [pdf, other]

    cs.MA

    On the role of population heterogeneity in emergent communication

    Authors: Mathieu Rita, Florian Strub, Jean-Bastien Grill, Olivier Pietquin, Emmanuel Dupoux

    Abstract: Populations have often been perceived as a structuring component for language to emerge and evolve: the larger the population, the more structured the language. While this observation is widespread in the sociolinguistic literature, it has not been consistently reproduced in computer simulations with neural agents. In this paper, we thus aim to clarify this apparent contradiction. We explore emerg…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: International Conference on Learning Representations (ICLR) 2022

  24. arXiv:2204.05148  [pdf, other]

    cs.AI

    Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning

    Authors: Robin Algayres, Adel Nabli, Benoit Sagot, Emmanuel Dupoux

    Abstract: We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations, this method can be applied iteratively and yield competitive SSE as evaluated on two tasks: query-by-example of rando…

    Submitted 21 October, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Interspeech 2022. New version on 10/21/23 with appendix data and GitLab link

  25. arXiv:2203.16502  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Generative Spoken Dialogue Language Modeling

    Authors: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech,…

    Submitted 22 November, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

  26. Probing phoneme, language and speaker information in unsupervised speech representations

    Authors: Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski

    Abstract: Unsupervised models of representations based on Contrastive Predictive Coding (CPC)[1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitativ…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 2 figures

  27. Are discrete units necessary for Spoken Language Modeling?

    Authors: Tu Anh Nguyen, Benoit Sagot, Emmanuel Dupoux

    Abstract: Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the…

    Submitted 22 August, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

  28. arXiv:2202.07359  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    textless-lib: a Library for Textless Spoken Language Processing

    Authors: Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed to facilitate research in this research area. We describe the building blocks that the library provides and demonstrate its usability by discussing three differ…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: The library is available here https://github.com/facebookresearch/textlesslib/

  29. arXiv:2112.11911  [pdf, other]

    cs.CL cs.AI

    Towards Interactive Language Modeling

    Authors: Maartje ter Hoeve, Evgeny Kharitonov, Dieuwke Hupkes, Emmanuel Dupoux

    Abstract: Interaction between caregivers and children plays a critical role in human language acquisition and development. Given this observation, it is remarkable that explicit interaction plays little to no role in artificial language modeling -- which also targets the acquisition of human language, yet by artificial models. Moreover, an interactive approach to language modeling has the potential to make…

    Submitted 28 September, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

  30. arXiv:2112.05555  [pdf, other]

    cs.CL cs.SD eess.AS

    Shennong: a Python toolbox for audio speech features extraction

    Authors: Mathieu Bernard, Maxime Poli, Julien Karadayi, Emmanuel Dupoux

    Abstract: We introduce Shennong, a Python toolbox and command-line utility for speech features extraction. It implements a wide range of well-established state-of-the-art algorithms including spectro-temporal filters such as Mel-Frequency Cepstral Filterbanks or Predictive Linear Filters, pre-trained neural networks, pitch estimators as well as speaker normalization methods and post-processing algorithms. Shenn…

    Submitted 10 December, 2021; originally announced December 2021.

    Journal ref: Behavior Research Methods, 2023

  31. arXiv:2111.07402  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Textless Speech Emotion Conversion using Discrete and Decomposed Representations

    Authors: Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, a…

    Submitted 13 December, 2022; v1 submitted 14 November, 2021; originally announced November 2021.

    Comments: Paper was published at EMNLP 2022

  32. arXiv:2109.03264  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Text-Free Prosody-Aware Generative Spoken Language Modeling

    Authors: Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu

    Abstract: Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-lik…

    Submitted 10 May, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

    Comments: ACL 2022

  33. arXiv:2107.06546  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition

    Authors: Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristia, Emmanuel Dupoux, Bertrand Higy, Marvin Lavechin, Okko Räsänen, Chen Yu

    Abstract: We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.

    Submitted 14 July, 2021; originally announced July 2021.

  34. arXiv:2104.14700  [pdf, ps, other]

    cs.CL cs.AI

    The Zero Resource Speech Challenge 2021: Spoken language modelling

    Authors: Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux

    Abstract: We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (C…

    Submitted 9 August, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2011.11588

  35. arXiv:2104.00355  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

    Authors: Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and she…

    Submitted 27 July, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: In Proceedings of Interspeech 2021

  36. arXiv:2103.07125  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Learning spectro-temporal representations of complex sounds with parameterized neural networks

    Authors: Rachid Riad, Julien Karadayi, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux

    Abstract: Deep Learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes on a variety of auditory tasks. Yet, these models often lack interpretability to fully understand the exact computations that have been performed. Here, we proposed a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels…

    Submitted 12 March, 2021; originally announced March 2021.

  37. arXiv:2102.01192  [pdf, other]

    cs.CL

    Generative Spoken Language Modeling from Raw Audio

    Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text u…

    Submitted 9 September, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

  38. arXiv:2101.00390  [pdf, other]

    cs.CL eess.AS

    VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

    Authors: Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux

    Abstract: We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We pro…

    Submitted 27 July, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

    Comments: Accepted to ACL 2021 (long paper)

  39. arXiv:2011.11588  [pdf, other]

    cs.CL cs.SD eess.AS

    The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

    Authors: Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei Baevski, Ewan Dunbar, Emmanuel Dupoux

    Abstract: We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics. We present the results and analyses of a com…

    Submitted 1 December, 2020; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: 14 pages, including references and supplementary material

  40. arXiv:2010.16131  [pdf, other]

    eess.AS cs.CL

    Comparison of Speaker Role Recognition and Speaker Enrollment Protocol for conversational Clinical Interviews

    Authors: Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Agnes Sliwinski, Jennifer Hamet Bagnou, Xuan Nga Cao, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux

    Abstract: Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians' reports. Yet, it is not clear which speech processing pipeline performs best at detecting and identifying the speaker turns, especially for individuals wit…

    Submitted 5 November, 2020; v1 submitted 30 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021; 1 page of supplementary material appears only in the arXiv version

  41. arXiv:2010.05967  [pdf, other]

    cs.CL cs.AI

    The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units

    Authors: Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux

    Abstract: We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speec…

    Submitted 12 October, 2020; originally announced October 2020.

    Journal ref: Proceedings of Interspeech 2020

  42. arXiv:2010.03446  [pdf, other]

    cs.CL cs.AI

    Analogies minus analogy test: measuring regularities in word embeddings

    Authors: Louis Fournier, Emmanuel Dupoux, Ewan Dunbar

    Abstract: Vector space models of words have long been claimed to capture linguistic regularities as simple vector translations, but problems have been raised with this claim. We decompose and empirically analyze the classic arithmetic word analogy test, to motivate two new metrics that address the issues with the standard test, and which distinguish between class-wise offset concentration (similar direction…

    Submitted 7 October, 2020; originally announced October 2020.

    Journal ref: Proceedings of CoNLL 2020

  43. arXiv:2010.01878  [pdf, other]

    cs.CL cs.AI cs.MA

    "LazImpa": Lazy and Impatient neural agents learn to communicate efficiently

    Authors: Mathieu Rita, Rahma Chaabouni, Emmanuel Dupoux

    Abstract: Previous work has shown that artificial neural agents naturally develop surprisingly non-efficient codes. This is illustrated by the fact that in a referential game involving a speaker and a listener neural networks optimizing accurate transmission over a discrete channel, the emergent messages fail to achieve an optimal length. Furthermore, frequent messages tend to be longer than infrequent ones…

    Submitted 5 October, 2020; originally announced October 2020.

    Comments: Accepted to CoNLL 2020

    MSC Class: I.2; ACM Class: I.2

    Journal ref: Proceedings of CoNLL 2020

  44. arXiv:2007.13542  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Evaluating the reliability of acoustic speech embeddings

    Authors: Robin Algayres, Mohamed Salah Zaiem, Benoit Sagot, Emmanuel Dupoux

    Abstract: Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two p…

    Submitted 6 November, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Conference paper at Interspeech 2020

  45. arXiv:2007.00991  [pdf, other]

    eess.AS cs.CL cs.SD

    Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

    Authors: Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

    Abstract: Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library and find that applying augmentation in the past is gene…

    Submitted 2 July, 2020; originally announced July 2020.

  46. arXiv:2006.05365  [pdf, other]

    eess.AS cs.CL cs.SD

    Vocal markers from sustained phonation in Huntington's Disease

    Authors: Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Jennifer Hamet Bagnou, Xuan Nga Cao, Emmanuel Dupoux, Anne-Catherine Bachoud-Lévi

    Abstract: Disease-modifying treatments are currently assessed in neurodegenerative diseases. Huntington's Disease represents a unique opportunity to design automatic sub-clinical markers, even in premanifest gene carriers. We investigated phonatory impairments as potential clinical markers and propose them for both diagnosis and gene carriers follow-up. We used two sets of features: Phonatory features and M…

    Submitted 31 July, 2020; v1 submitted 9 June, 2020; originally announced June 2020.

    Comments: To appear at INTERSPEECH 2020. 1 page of supplementary material appears only in the arXiv version. Code to replicate: https://github.com/bootphon/sustained-phonation-features

  47. arXiv:2005.00069  [pdf, other]

    cs.CV cs.LG eess.IV

    Occlusion resistant learning of intuitive physics from videos

    Authors: Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

    Abstract: To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to t…

    Submitted 30 April, 2020; originally announced May 2020.

  48. arXiv:2004.09124  [pdf, other]

    cs.CL cs.AI cs.LG

    Compositionality and Generalization in Emergent Languages

    Authors: Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, Marco Baroni

    Abstract: Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin t…

    Submitted 20 April, 2020; originally announced April 2020.

  49. arXiv:2003.01472  [pdf, other]

    cs.CL

    Seshat: A tool for managing and verifying annotation campaigns of audio data

    Authors: Hadrien Titeux, Rachid Riad, Xuan-Nga Cao, Nicolas Hamilakis, Kris Madden, Alejandrina Cristia, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux

    Abstract: We introduce Seshat, a new, simple and open-source software to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following sp…

    Submitted 17 February, 2021; v1 submitted 3 March, 2020; originally announced March 2020.

    Journal ref: LREC 2020 - 12th Language Resources and Evaluation Conference, May 2020, Marseille, France. pp.6976-6982

  50. arXiv:2003.01018  [pdf, other]

    cs.CL

    Identification of primary and collateral tracks in stuttered speech

    Authors: Rachid Riad, Anne-Catherine Bachoud-Lévi, Frank Rudzicz, Emmanuel Dupoux

    Abstract: Disfluent speech has been previously addressed from two main perspectives: the clinical perspective focusing on diagnosis, and the Natural Language Processing (NLP) perspective aiming at modeling these events and detecting them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the…

    Submitted 2 March, 2020; originally announced March 2020.

    Comments: To be published in LREC 2020