Showing 1–34 of 34 results for author: Zeghidour, N

  1. arXiv:2404.10419  [pdf, other]

    eess.AS cs.CL

    MAD Speech: Measures of Acoustic Diversity of Speech

    Authors: Matthieu Futeral, Andrea Agostinelli, Marco Tagliasacchi, Neil Zeghidour, Eugene Kharitonov

    Abstract: Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as…

    Submitted 16 April, 2024; originally announced April 2024.

  2. arXiv:2402.04229  [pdf, other]

    cs.LG cs.SD eess.AS

    MusicRL: Aligning Music Generation to Human Preferences

    Authors: Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour, Andrea Agostinelli

    Abstract: We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only this makes supervised training of such…

    Submitted 6 February, 2024; originally announced February 2024.

  3. arXiv:2308.10415  [pdf, other]

    cs.SD cs.LG eess.AS

    TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

    Authors: Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey

    Abstract: We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: INTERSPEECH 2023, project webpage with audio demos at https://google-research.github.io/sound-separation/papers/tokensplit

  4. arXiv:2306.12925  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS stat.ML

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, et al. (5 additional authors not shown)

    Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the…

    Submitted 22 June, 2023; originally announced June 2023.

    Comments: Technical report

  5. arXiv:2305.09636  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundStorm: Efficient Parallel Audio Generation

    Authors: Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi

    Abstract: We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consist…

    Submitted 16 May, 2023; originally announced May 2023.

  6. arXiv:2303.12984  [pdf, other]

    cs.SD eess.AS

    LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

    Authors: Teerapat Jenrungrot, Michael Chinen, W. Bastiaan Kleijn, Jan Skoglund, Zalán Borsos, Neil Zeghidour, Marco Tagliasacchi

    Abstract: We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the tran…

    Submitted 22 March, 2023; originally announced March 2023.

    Comments: 5 pages, accepted to ICASSP 2023, project page: https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec

  7. arXiv:2303.07533  [pdf, other]

    eess.AS cs.SD

    Speech Intelligibility Classifiers from 550k Disordered Speech Samples

    Authors: Subhashini Venugopalan, Jimmy Tobin, Samuel J. Yang, Katie Seaver, Richard J. N. Cave, Pan-Pan Jiang, Neil Zeghidour, Rus Heywood, Jordan Green, Michael P. Brenner

    Abstract: We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers, with a range of self-reported speaking disorders and rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ~94K utterances from 100 speakers. We further found…

    Submitted 15 March, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023 camera-ready

  8. arXiv:2302.05400  [pdf, other]

    cs.LG stat.ML

    DNArch: Learning Convolutional Neural Architectures by Backpropagation

    Authors: David W. Romero, Neil Zeghidour

    Abstract: We present Differentiable Neural Architectures (DNArch), a method that jointly learns the weights and the architecture of Convolutional Neural Networks (CNNs) by backpropagation. In particular, DNArch allows learning (i) the size of convolutional kernels at each layer, (ii) the number of channels at each layer, (iii) the position and values of downsampling layers, and (iv) the depth of the network…

    Submitted 22 July, 2023; v1 submitted 10 February, 2023; originally announced February 2023.

  9. arXiv:2302.03540  [pdf, other]

    cs.SD eess.AS

    Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

    Authors: Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour

    Abstract: We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables…

    Submitted 7 February, 2023; originally announced February 2023.

  10. arXiv:2301.12662  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    SingSong: Generating musical accompaniments from singing

    Authors: Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, Jesse Engel

    Abstract: We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus…

    Submitted 29 January, 2023; originally announced January 2023.

  11. arXiv:2301.11325  [pdf, other]

    cs.SD cs.LG eess.AS

    MusicLM: Generating Music From Text

    Authors: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

    Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous s…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps

  12. arXiv:2209.03143  [pdf, other]

    cs.SD cs.LG eess.AS

    AudioLM: a Language Modeling Approach to Audio Generation

    Authors: Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

    Abstract: We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenizati…

    Submitted 25 July, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

  13. arXiv:2206.05408  [pdf, other]

    cs.SD cs.LG eess.AS

    Multi-instrument Music Synthesis with Spectrogram Diffusion

    Authors: Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel

    Abstract: An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generat…

    Submitted 12 December, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

  14. arXiv:2203.15578  [pdf, other]

    cs.SD cs.LG eess.AS

    Disentangling speech from surroundings with neural embeddings

    Authors: Ahmed Omran, Neil Zeghidour, Zalán Borsos, Félix de Chaumont Quitry, Malcolm Slaney, Marco Tagliasacchi

    Abstract: We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors, where one part of the embedding vector represents the speech signal, and the rest represent the environment. We achieve this by partitioning t…

    Submitted 4 June, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at ICASSP 2023

  15. arXiv:2203.15519  [pdf, other]

    cs.SD cs.LG eess.AS

    Learning neural audio features without supervision

    Authors: Sarthak Yadav, Neil Zeghidour

    Abstract: Deep audio classification, traditionally cast as training a deep neural network on top of mel-filterbanks in a supervised fashion, has recently benefited from two independent lines of work. The first one explores "learnable frontends", i.e., neural modules that produce a learnable time-frequency representation, to overcome limitations of fixed features. The second one uses self-supervised learning…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  16. arXiv:2202.07765  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    General-purpose, long-context autoregressive modeling with Perceiver AR

    Authors: Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel

    Abstract: Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic…

    Submitted 14 June, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: ICML 2022

  17. arXiv:2202.01653  [pdf, other]

    cs.LG

    Learning strides in convolutional neural networks

    Authors: Rachid Riad, Olivier Teboul, David Grangier, Neil Zeghidour

    Abstract: Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of downsamplin…

    Submitted 3 February, 2022; originally announced February 2022.

    Comments: Spotlight at ICLR2022, open-source code available at https://github.com/google-research/diffstride

  18. arXiv:2107.03312  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundStream: An End-to-End Neural Audio Codec

    Authors: Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi

    Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and s…

    Submitted 7 July, 2021; originally announced July 2021.

  19. arXiv:2105.13802  [pdf, other]

    cs.SD cs.LG eess.AS

    DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding

    Authors: Neil Zeghidour, Olivier Teboul, David Grangier

    Abstract: We introduce DIVE, an end-to-end speaker diarization algorithm. Our neural algorithm presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting the voice activity of each speaker conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permut…

    Submitted 28 May, 2021; originally announced May 2021.

  20. arXiv:2103.09879  [pdf, other]

    cs.SD cs.AI eess.AS

    Self-Supervised Learning of Audio Representations from Permutations with Differentiable Ranking

    Authors: Andrew N Carr, Quentin Berthet, Mathieu Blondel, Olivier Teboul, Neil Zeghidour

    Abstract: Self-supervised pre-training using so-called "pretext" tasks has recently shown impressive performance across a wide range of modalities. In this work, we advance self-supervised learning from permutations, by pre-training a model to reorder shuffled parts of the spectrogram of an audio signal, to improve downstream classification performance. We make two main contributions. First, we overcome the…

    Submitted 17 March, 2021; originally announced March 2021.

  21. arXiv:2101.08596  [pdf, other]

    cs.SD cs.LG eess.AS

    LEAF: A Learnable Frontend for Audio Classification

    Authors: Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi

    Abstract: Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used through the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio…

    Submitted 21 January, 2021; originally announced January 2021.

    Comments: Accepted at ICLR 2021

  22. arXiv:2010.13694  [pdf, other]

    eess.SP cs.LG

    Learning from Heterogeneous EEG Signals with Differentiable Channel Reordering

    Authors: Aaqib Saeed, David Grangier, Olivier Pietquin, Neil Zeghidour

    Abstract: We propose CHARM, a method for training a single neural network across inconsistent input channels. Our work is motivated by Electroencephalography (EEG), where data collection protocols from different headsets result in varying channel ordering and number, which limits the feasibility of transferring trained systems across datasets. Our approach builds upon attention mechanisms to estimate a late…

    Submitted 21 October, 2020; originally announced October 2020.

  23. arXiv:2010.10915  [pdf, other]

    cs.SD cs.LG eess.AS

    Contrastive Learning of General-Purpose Audio Representations

    Authors: Aaqib Saeed, David Grangier, Neil Zeghidour

    Abstract: We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio. Our approach is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings. We build on top of recent advances in contrastive learnin…

    Submitted 21 October, 2020; originally announced October 2020.

  24. arXiv:2002.08933  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Wavesplit: End-to-End Speech Separation by Speaker Clustering

    Authors: Neil Zeghidour, David Grangier

    Abstract: We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the model infers a representation for each source and then estimates each source signal given the inferred representations. The model is trained to jointly perform both tasks from the raw waveform. Wavesplit infers a set of source representations via clustering, which addresses the fundamental permutation proble…

    Submitted 2 July, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

  25. arXiv:1905.12909  [pdf, other]

    cs.LG stat.ML

    Deep multi-class learning from label proportions

    Authors: Gabriel Dulac-Arnold, Neil Zeghidour, Marco Cuturi, Lucas Beyer, Jean-Philippe Vert

    Abstract: We propose a learning algorithm capable of learning from label proportions instead of direct data labels. In this scenario, our data are arranged into various bags of a certain size, and only the proportions of each label within a given bag are known. This is a common situation in cases where per-data labeling is lengthy, but a more general label is easily accessible. Several approaches have been…

    Submitted 26 June, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

  26. arXiv:1812.06864  [pdf, other]

    cs.CL

    Fully Convolutional Speech Recognition

    Authors: Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert

    Abstract: Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language mod…

    Submitted 9 April, 2019; v1 submitted 17 December, 2018; originally announced December 2018.

  27. arXiv:1812.03483  [pdf, ps, other]

    cs.LG cs.CL cs.SD eess.AS stat.ML

    To Reverse the Gradient or Not: An Empirical Comparison of Adversarial and Multi-task Learning in Speech Recognition

    Authors: Yossi Adi, Neil Zeghidour, Ronan Collobert, Nicolas Usunier, Vitaliy Liptchinsky, Gabriel Synnaeve

    Abstract: Transcribed datasets typically contain speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: Multi-Task Learning and Adversarial Learning. In multi-task learning, the goal is speaker prediction; we expect a performance improvement with this joint training if the two tasks of speech recognition and speaker recognition share a common…

    Submitted 14 February, 2019; v1 submitted 9 December, 2018; originally announced December 2018.

  28. arXiv:1811.11101  [pdf, other]

    cs.CL

    Learning to detect dysarthria from raw speech

    Authors: Juliette Millet, Neil Zeghidour

    Abstract: Speech classifiers of paralinguistic traits traditionally learn from diverse hand-crafted low-level features, by selecting the relevant information for the task at hand. We explore an alternative to this selection, by learning jointly the classifier, and the feature extraction. Recent work on speech recognition has shown improved performance over speech features by learning from the waveform. We e…

    Submitted 8 January, 2019; v1 submitted 27 November, 2018; originally announced November 2018.

    Comments: 5 pages, 3 figures, submitted to ICASSP

    MSC Class: 68T10

  29. arXiv:1810.09785  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    SING: Symbol-to-Instrument Neural Generator

    Authors: Alexandre Défossez, Neil Zeghidour, Nicolas Usunier, Léon Bottou, Francis Bach

    Abstract: Recent progress in deep learning for audio synthesis opens the way to models that directly produce the waveform, shifting away from the traditional paradigm of relying on vocoders or MIDI synthesizers for speech or music generation. Despite their successes, current state-of-the-art neural audio synthesizers such as WaveNet and SampleRNN suffer from prohibitive training and inference times because…

    Submitted 23 October, 2018; originally announced October 2018.

    Journal ref: Conference on Neural Information Processing Systems (NIPS), Dec 2018, Montréal, Canada

  30. arXiv:1806.07098  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-End Speech Recognition From the Raw Waveform

    Authors: Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, Emmanuel Dupoux

    Abstract: State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone fi…

    Submitted 21 June, 2018; v1 submitted 19 June, 2018; originally announced June 2018.

    Comments: Accepted for presentation at Interspeech 2018

  31. arXiv:1804.11297  [pdf, other]

    cs.CL cs.LG

    Sampling strategies in Siamese Networks for unsupervised speech representation learning

    Authors: Rachid Riad, Corentin Dancette, Julien Karadayi, Neil Zeghidour, Thomas Schatz, Emmanuel Dupoux

    Abstract: Recent studies have investigated siamese network architectures for learning invariant speech representations using same-different side information at the word level. Here we investigate systematically an often ignored component of siamese networks: the sampling procedure (how pairs of same vs. different tokens are selected). We show that sampling strategies taking into account Zipf's Law, the dist…

    Submitted 23 August, 2018; v1 submitted 30 April, 2018; originally announced April 2018.

    Comments: Conference paper at Interspeech 2018

  32. arXiv:1711.01161  [pdf, other]

    cs.CL

    Learning Filterbanks from Raw Speech for Phone Recognition

    Authors: Neil Zeghidour, Nicolas Usunier, Iasonas Kokkinos, Thomas Schatz, Gabriel Synnaeve, Emmanuel Dupoux

    Abstract: We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for seve…

    Submitted 4 April, 2018; v1 submitted 3 November, 2017; originally announced November 2017.

    Comments: Accepted at ICASSP 2018

  33. arXiv:1706.00409  [pdf, other]

    cs.CV

    Fader Networks: Manipulating Images by Sliding Attributes

    Authors: Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, Marc'Aurelio Ranzato

    Abstract: This paper introduces a new encoder-decoder architecture that is trained to reconstruct images by disentangling the salient information of the image and the values of attributes directly in the latent space. As a result, after training, our model can generate different realistic versions of an input image by varying the attribute values. By using continuous attribute values, we can choose how much…

    Submitted 28 January, 2018; v1 submitted 1 June, 2017; originally announced June 2017.

    Comments: NIPS 2017

  34. arXiv:1704.06913  [pdf, other]

    cs.CL cs.LG

    Learning weakly supervised multimodal phoneme embeddings

    Authors: Rahma Chaabouni, Ewan Dunbar, Neil Zeghidour, Emmanuel Dupoux

    Abstract: Recent works have explored deep architectures for learning multimodal speech representation (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing the lips movements, in a weakly supervised way using Siamese networks and lexical same-different side information. In par…

    Submitted 18 October, 2017; v1 submitted 23 April, 2017; originally announced April 2017.