
Showing 1–37 of 37 results for author: Dupoux, E

  1. arXiv:2402.05755  [pdf, other]

    cs.CL cs.SD eess.AS

    SpiRit-LM: Interleaved Spoken and Written Language Model

    Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux

    Abstract: We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated…

    Submitted 8 February, 2024; originally announced February 2024.

  2. arXiv:2312.14069  [pdf, other]

    cs.CL cs.SD eess.AS

    EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models

    Authors: Maureen de Seyssel, Antony D'Avirro, Adina Williams, Emmanuel Dupoux

    Abstract: We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a…

    Submitted 21 December, 2023; originally announced December 2023.

  3. arXiv:2310.05235  [pdf, other]

    cs.CL cs.SD eess.AS

    XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words

    Authors: Robin Algayres, Pablo Diego-Simon, Benoit Sagot, Emmanuel Dupoux

    Abstract: Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models that have proved to quickly adapt to new tasks through fine-tuning, even in low resource conditions. Taking inspiration from semi-supervised learning…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Findings at EMNLP 2023

  4. arXiv:2309.17020  [pdf, other]

    eess.AS cs.SD

    Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

    Authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

    Abstract: Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TT…

    Submitted 4 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ASRU 2023 SPARKS Workshop

  5. arXiv:2308.05725  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

    Authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

    Abstract: Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthes…

    Submitted 10 August, 2023; originally announced August 2023.

  6. arXiv:2306.01506  [pdf, other]

    cs.CL eess.AS stat.ML

    BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

    Authors: Marvin Lavechin, Yaya Sy, Hadrien Titeux, María Andrea Cruz Blandón, Okko Räsänen, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

    Abstract: Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and b…

    Submitted 8 June, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Proceedings of Interspeech 2023

  7. arXiv:2305.13009  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Textually Pretrained Speech Language Models

    Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

    Abstract: Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model de…

    Submitted 30 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  8. arXiv:2302.12057  [pdf, other]

    cs.CL cs.SD eess.AS

    ProsAudit, a prosodic benchmark for self-supervised speech models

    Authors: Maureen de Seyssel, Marvin Lavechin, Hadrien Titeux, Arthur Thomas, Gwendal Virlet, Andrea Santos Revilla, Guillaume Wisniewski, Bogdan Ludusan, Emmanuel Dupoux

    Abstract: We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inser…

    Submitted 1 June, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: Accepted at Interspeech 2023. 4 pages + references, 1 figure

  9. arXiv:2211.13152  [pdf, other]

    cs.NE cs.LG cs.SD eess.AS

    Introducing topography in convolutional neural networks

    Authors: Maxime Poli, Emmanuel Dupoux, Rachid Riad

    Abstract: Parts of the brain that carry out sensory tasks are organized topographically: nearby neurons are responsive to the same properties of input signals. Thus, in this work, inspired by the neuroscience literature, we proposed a new topographic inductive bias in Convolutional Neural Networks (CNNs). To achieve this, we introduced a new topographic loss and an efficient implementation to topographically or…

    Submitted 28 October, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  10. arXiv:2210.15775  [pdf, other]

    cs.CL cs.SD eess.AS

    Evaluating context-invariance in unsupervised speech representations

    Authors: Mark Hallap, Emmanuel Dupoux, Ewan Dunbar

    Abstract: Unsupervised speech representations have taken off, with benchmarks (SUPERB, ZeroSpeech) demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of "discovering the phonemes" of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is cont…

    Submitted 30 May, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  11. arXiv:2210.15759  [pdf, other]

    cs.CL cs.SD eess.AS

    Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge

    Authors: Ewan Dunbar, Nicolas Hamilakis, Emmanuel Dupoux

    Abstract: Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defi…

    Submitted 27 October, 2022; originally announced October 2022.

    Journal ref: IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1211-1226, October 2022. Print ISSN: 1932-4553; Online ISSN: 1941-0484; DOI: 10.1109/JSTSP.2022.3206084

  12. arXiv:2210.13248  [pdf, other]

    eess.AS cs.SD

    Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

    Authors: Marvin Lavechin, Marianne Métais, Hadrien Titeux, Alodie Boissonnet, Jade Copet, Morgane Rivière, Elika Bergelson, Alejandrina Cristia, Emmanuel Dupoux, Hervé Bredin

    Abstract: Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in…

    Submitted 25 May, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

  13. arXiv:2209.15483  [pdf, other]

    cs.CL cs.LG eess.AS

    Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

    Authors: Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, Yossi Adi

    Abstract: Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensi…

    Submitted 29 May, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

  14. arXiv:2207.10643  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    STOP: A dataset for Spoken Task Oriented Semantic Parsing

    Authors: Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Anh Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed

    Abstract: End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assi…

    Submitted 18 October, 2022; v1 submitted 28 June, 2022; originally announced July 2022.

  15. arXiv:2206.13415  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Is the Language Familiarity Effect gradual? A computational modelling approach

    Authors: Maureen de Seyssel, Guillaume Wisniewski, Emmanuel Dupoux

    Abstract: According to the Language Familiarity Effect (LFE), people are better at discriminating between speakers of their native language. Although this cognitive effect has been widely studied in the literature, experiments have only been conducted on a limited number of language pairs and their results only show the presence of the effect without yielding a gradual measure that may vary across language pair…

    Submitted 27 June, 2022; originally announced June 2022.

    Comments: 8 pages, 2 figures, accepted at CogSci 2022

  16. arXiv:2203.16502  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Generative Spoken Dialogue Language Modeling

    Authors: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech,…

    Submitted 22 November, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

  17. Probing phoneme, language and speaker information in unsupervised speech representations

    Authors: Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski

    Abstract: Unsupervised models of representations based on Contrastive Predictive Coding (CPC)[1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitativ…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 2 figures

  18. arXiv:2202.07359  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    textless-lib: a Library for Textless Spoken Language Processing

    Authors: Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Textless spoken language processing research aims to extend the applicability of the standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area. We describe the building blocks that the library provides and demonstrate its usability by discussing three differ…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: The library is available here https://github.com/facebookresearch/textlesslib/

  19. arXiv:2112.05555  [pdf, other]

    cs.CL cs.SD eess.AS

    Shennong: a Python toolbox for audio speech features extraction

    Authors: Mathieu Bernard, Maxime Poli, Julien Karadayi, Emmanuel Dupoux

    Abstract: We introduce Shennong, a Python toolbox and command-line utility for speech features extraction. It implements a wide range of well-established state-of-the-art algorithms including spectro-temporal filters such as Mel-Frequency Cepstral Filterbanks or Predictive Linear Filters, pre-trained neural networks, pitch estimators as well as speaker normalization methods and post-processing algorithms. Shenn…

    Submitted 10 December, 2021; originally announced December 2021.

    Journal ref: Behavior Research Methods, 2023

  20. arXiv:2111.07402  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Textless Speech Emotion Conversion using Discrete and Decomposed Representations

    Authors: Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, a…

    Submitted 13 December, 2022; v1 submitted 14 November, 2021; originally announced November 2021.

    Comments: Paper was published at EMNLP 2022

  21. arXiv:2109.03264  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Text-Free Prosody-Aware Generative Spoken Language Modeling

    Authors: Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu

    Abstract: Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-lik…

    Submitted 10 May, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

    Comments: ACL 2022

  22. arXiv:2107.06546  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition

    Authors: Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristia, Emmanuel Dupoux, Bertrand Higy, Marvin Lavechin, Okko Räsänen, Chen Yu

    Abstract: We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.

    Submitted 14 July, 2021; originally announced July 2021.

  23. arXiv:2104.00355  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

    Authors: Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and she…

    Submitted 27 July, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: In Proceedings of Interspeech 2021

  24. arXiv:2103.07125  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Learning spectro-temporal representations of complex sounds with parameterized neural networks

    Authors: Rachid Riad, Julien Karadayi, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux

    Abstract: Deep Learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes on a variety of auditory tasks. Yet, these models often lack interpretability to fully understand the exact computations that have been performed. Here, we proposed a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels…

    Submitted 12 March, 2021; originally announced March 2021.

  25. arXiv:2101.00390  [pdf, other]

    cs.CL eess.AS

    VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

    Authors: Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux

    Abstract: We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We pro…

    Submitted 27 July, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

    Comments: Accepted to ACL 2021 (long paper)

  26. arXiv:2011.11588  [pdf, other]

    cs.CL cs.SD eess.AS

    The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

    Authors: Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei Baevski, Ewan Dunbar, Emmanuel Dupoux

    Abstract: We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics. We present the results and analyses of a com…

    Submitted 1 December, 2020; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: 14 pages, including references and supplementary material

  27. arXiv:2010.16131  [pdf, other]

    eess.AS cs.CL

    Comparison of Speaker Role Recognition and Speaker Enrollment Protocol for conversational Clinical Interviews

    Authors: Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Agnes Sliwinski, Jennifer Hamet Bagnou, Xuan Nga Cao, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux

    Abstract: Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians' reports. Yet, it is not clear which speech processing pipeline performs best at detecting and identifying the speaker turns, especially for individuals wit…

    Submitted 5 November, 2020; v1 submitted 30 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021. 1 page of supplementary material appears only in the arXiv version

  28. arXiv:2007.13542  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Evaluating the reliability of acoustic speech embeddings

    Authors: Robin Algayres, Mohamed Salah Zaiem, Benoit Sagot, Emmanuel Dupoux

    Abstract: Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimise the quality of these embeddings in a task-neutral way. Here, we systematically compare two p…

    Submitted 6 November, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Conference paper at Interspeech 2020

  29. arXiv:2007.00991  [pdf, other]

    eess.AS cs.CL cs.SD

    Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

    Authors: Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

    Abstract: Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still underperforms other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library and find that applying augmentation in the past is gene…

    Submitted 2 July, 2020; originally announced July 2020.

  30. arXiv:2006.05365  [pdf, other]

    eess.AS cs.CL cs.SD

    Vocal markers from sustained phonation in Huntington's Disease

    Authors: Rachid Riad, Hadrien Titeux, Laurie Lemoine, Justine Montillot, Jennifer Hamet Bagnou, Xuan Nga Cao, Emmanuel Dupoux, Anne-Catherine Bachoud-Lévi

    Abstract: Disease-modifying treatments are currently assessed in neurodegenerative diseases. Huntington's Disease represents a unique opportunity to design automatic sub-clinical markers, even in premanifest gene carriers. We investigated phonatory impairments as potential clinical markers and propose them for both diagnosis and gene carriers follow-up. We used two sets of features: Phonatory features and M…

    Submitted 31 July, 2020; v1 submitted 9 June, 2020; originally announced June 2020.

    Comments: To appear at INTERSPEECH 2020. 1 page of supplementary material appears only in the arXiv version. Code to replicate: https://github.com/bootphon/sustained-phonation-features

  31. arXiv:2005.12656  [pdf, other]

    eess.AS

    An open-source voice type classifier for child-centered daylong recordings

    Authors: Marvin Lavechin, Ruben Bousbib, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

    Abstract: Spontaneous conversations in real-world settings such as those found in child-centered recordings have been shown to be amongst the most challenging audio files to process. Nevertheless, building speech processing models handling such a wide variety of conditions would be particularly useful for language acquisition studies in which researchers are interested in the quantity and quality of the spe…

    Submitted 22 January, 2021; v1 submitted 26 May, 2020; originally announced May 2020.

    Comments: accepted to Interspeech 2020

    ACM Class: I.2.7

  32. arXiv:2005.00069  [pdf, other]

    cs.CV cs.LG eess.IV

    Occlusion resistant learning of intuitive physics from videos

    Authors: Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

    Abstract: To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to t…

    Submitted 30 April, 2020; originally announced May 2020.

  33. arXiv:2002.02848  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Unsupervised pretraining transfers well across languages

    Authors: Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux

    Abstract: Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervis…

    Submitted 7 February, 2020; originally announced February 2020.

    Comments: 6 pages. Accepted at ICASSP 2020. However the 2 pages of supplementary materials will appear only in the arxiv version

    Journal ref: ICASSP 2020

  34. Libri-Light: A Benchmark for ASR with Limited or No Supervision

    Authors: Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR…

    Submitted 17 December, 2019; originally announced December 2019.

  35. arXiv:1912.00938  [pdf]

    eess.AS cs.SD

    Speaker detection in the wild: Lessons learned from JSALT 2019

    Authors: Paola Garcia, Jesus Villalba, Herve Bredin, Jun Du, Diego Castan, Alejandrina Cristia, Latane Bullock, Ling Guo, Koji Okabe, Phani Sankar Nidadavolu, Saurabh Kataria, Sizhu Chen, Leo Galmant, Marvin Lavechin, Lei Sun, Marie-Philippe Gill, Bar Ben-Yair, Sajjad Abdoli, Xin Wang, Wassim Bouaziz, Hadrien Titeux, Emmanuel Dupoux, Kong Aik Lee, Najim Dehak

    Abstract: This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker dete…

    Submitted 2 December, 2019; originally announced December 2019.

    Comments: Submitted to ICASSP 2020

  36. arXiv:1904.11469  [pdf, other]

    cs.CL cs.SD eess.AS

    The Zero Resource Speech Challenge 2019: TTS without T

    Authors: Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie Miskic, Charlotte Dugrain, Lucas Ondel, Alan W. Black, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux

    Abstract: We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery datase…

    Submitted 7 July, 2019; v1 submitted 25 April, 2019; originally announced April 2019.

    Comments: Interspeech 2019

  37. arXiv:1806.07098  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-End Speech Recognition From the Raw Waveform

    Authors: Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, Emmanuel Dupoux

    Abstract: State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone fi…

    Submitted 21 June, 2018; v1 submitted 19 June, 2018; originally announced June 2018.

    Comments: Accepted for presentation at Interspeech 2018