Showing 1–29 of 29 results for author: Kharitonov, E

  1. arXiv:2404.10419  [pdf, other]

    eess.AS cs.CL

    MAD Speech: Measures of Acoustic Diversity of Speech

    Authors: Matthieu Futeral, Andrea Agostinelli, Marco Tagliasacchi, Neil Zeghidour, Eugene Kharitonov

    Abstract: Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as…

    Submitted 16 April, 2024; originally announced April 2024.

  2. arXiv:2308.06265  [pdf]

    econ.GN cs.LG

    Long-term Effects of Temperature Variations on Economic Growth: A Machine Learning Approach

    Authors: Eugene Kharitonov, Oksana Zakharchuk, Lin Mei

    Abstract: This study investigates the long-term effects of temperature variations on economic growth using a data-driven approach. Leveraging machine learning techniques, we analyze global land surface temperature data from Berkeley Earth and economic indicators, including GDP and population data, from the World Bank. Our analysis reveals a significant relationship between average temperature and GDP growth…

    Submitted 17 June, 2023; originally announced August 2023.

  3. arXiv:2306.12925  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS stat.ML

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, et al. (5 additional authors not shown)

    Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the…

    Submitted 22 June, 2023; originally announced June 2023.

    Comments: Technical report

  4. arXiv:2305.09636  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundStorm: Efficient Parallel Audio Generation

    Authors: Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi

    Abstract: We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consist…

    Submitted 16 May, 2023; originally announced May 2023.

  5. arXiv:2302.03540  [pdf, other]

    cs.SD eess.AS

    Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

    Authors: Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour

    Abstract: We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables…

    Submitted 7 February, 2023; originally announced February 2023.

  6. arXiv:2209.03143  [pdf, other]

    cs.SD cs.LG eess.AS

    AudioLM: a Language Modeling Approach to Audio Generation

    Authors: Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

    Abstract: We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenizati…

    Submitted 25 July, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

  7. arXiv:2203.16502  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Generative Spoken Dialogue Language Modeling

    Authors: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech,…

    Submitted 22 November, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

  8. arXiv:2202.07359  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    textless-lib: a Library for Textless Spoken Language Processing

    Authors: Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Textless spoken language processing research aims to extend the applicability of the standard NLP toolset to spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area. We describe the building blocks that the library provides and demonstrate its usability by discussing three differ…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: The library is available here https://github.com/facebookresearch/textlesslib/

  9. arXiv:2112.11911  [pdf, other]

    cs.CL cs.AI

    Towards Interactive Language Modeling

    Authors: Maartje ter Hoeve, Evgeny Kharitonov, Dieuwke Hupkes, Emmanuel Dupoux

    Abstract: Interaction between caregivers and children plays a critical role in human language acquisition and development. Given this observation, it is remarkable that explicit interaction plays little to no role in artificial language modeling -- which also targets the acquisition of human language, yet by artificial models. Moreover, an interactive approach to language modeling has the potential to make…

    Submitted 28 September, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

  10. arXiv:2111.07402  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Textless Speech Emotion Conversion using Discrete and Decomposed Representations

    Authors: Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, a…

    Submitted 13 December, 2022; v1 submitted 14 November, 2021; originally announced November 2021.

    Comments: Paper was published at EMNLP 2022

  11. arXiv:2110.02782  [pdf, other]

    cs.CL

    How BPE Affects Memorization in Transformers

    Authors: Eugene Kharitonov, Marco Baroni, Dieuwke Hupkes

    Abstract: Training data memorization in NLP can both be beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings, various linguistic idiosyncrasies and common knowledge. However, little is known about what affects the memorization behavior of NLP models, as the field tends to foc…

    Submitted 2 December, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

  12. arXiv:2109.03264  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Text-Free Prosody-Aware Generative Spoken Language Modeling

    Authors: Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu

    Abstract: Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-lik…

    Submitted 10 May, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

    Comments: ACL 2022

  13. arXiv:2107.01366  [pdf, other]

    cs.CL cs.AI cs.LG

    Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN

    Authors: Rahma Chaabouni, Roberto Dessì, Eugene Kharitonov

    Abstract: Despite their practical success, modern seq2seq architectures are unable to generalize systematically on several SCAN tasks. Hence, it is not clear if SCAN-style compositional generalization is useful in realistic NLP tasks. In this work, we study the benefit that such compositionality brings about to several machine translation tasks. We present several focused modifications of Transformer that g…

    Submitted 16 September, 2021; v1 submitted 3 July, 2021; originally announced July 2021.

    Comments: BlackboxNLP workshop, EMNLP 2021

  14. arXiv:2106.04258  [pdf, other]

    cs.CL cs.AI cs.LG cs.MA

    Interpretable agent communication from scratch (with a generic visual processor emerging on the side)

    Authors: Roberto Dessì, Eugene Kharitonov, Marco Baroni

    Abstract: As deep networks begin to be deployed as autonomous agents, the issue of how they can communicate with each other becomes important. Here, we train two deep nets from scratch to perform realistic referent identification through unsupervised emergent communication. We show that the largely interpretable emergent protocol allows the nets to successfully communicate even about object types they did n…

    Submitted 15 October, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted at NeurIPS 2021

  15. arXiv:2104.14700  [pdf, ps, other]

    cs.CL cs.AI

    The Zero Resource Speech Challenge 2021: Spoken language modelling

    Authors: Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux

    Abstract: We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting of an encoder based on contrastive predictive coding (C…

    Submitted 9 August, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2011.11588

  16. arXiv:2104.00355  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

    Authors: Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representations, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and she…

    Submitted 27 July, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: In Proceedings of Interspeech 2021

  17. arXiv:2102.01192  [pdf, other]

    cs.CL

    Generative Spoken Language Modeling from Raw Audio

    Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text u…

    Submitted 9 September, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

  18. arXiv:2011.11588  [pdf, other]

    cs.CL cs.SD eess.AS

    The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

    Authors: Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei Baevski, Ewan Dunbar, Emmanuel Dupoux

    Abstract: We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics. We present the results and analyses of a com…

    Submitted 1 December, 2020; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: 14 pages, including references and supplementary material

  19. arXiv:2007.00991  [pdf, other]

    eess.AS cs.CL cs.SD

    Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

    Authors: Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

    Abstract: Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library and find that applying augmentation in the past is gene…

    Submitted 2 July, 2020; originally announced July 2020.

  20. arXiv:2006.14953  [pdf, other]

    cs.CL cs.AI cs.IT

    What they do when in doubt: a study of inductive biases in seq2seq learners

    Authors: Eugene Kharitonov, Rahma Chaabouni

    Abstract: Sequence-to-sequence (seq2seq) learners are widely used, but we still have only limited knowledge about what inductive biases shape the way they generalize. We address that by investigating how popular seq2seq learners generalize in tasks that have high ambiguity in the training data. We use SCAN and three new tasks to study learners' preferences for memorization, arithmetic, hierarchical, and com…

    Submitted 29 March, 2021; v1 submitted 26 June, 2020; originally announced June 2020.

  21. arXiv:2004.09124  [pdf, other]

    cs.CL cs.AI cs.LG

    Compositionality and Generalization in Emergent Languages

    Authors: Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, Marco Baroni

    Abstract: Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin t…

    Submitted 20 April, 2020; originally announced April 2020.

  22. arXiv:2004.03420  [pdf, other]

    cs.CL cs.LG

    Emergent Language Generalization and Acquisition Speed are not tied to Compositionality

    Authors: Eugene Kharitonov, Marco Baroni

    Abstract: Studies of discrete languages emerging when neural agents communicate to solve a joint task often look for evidence of compositional structure. This stems from the expectation that such a structure would allow languages to be acquired faster by the agents and enable them to generalize better. We argue that these beneficial properties are only loosely connected to compositionality. In two experiment…

    Submitted 25 April, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

  23. Libri-Light: A Benchmark for ASR with Limited or No Supervision

    Authors: Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR…

    Submitted 17 December, 2019; originally announced December 2019.

  24. arXiv:1907.00852  [pdf, other]

    cs.CL cs.AI

    EGG: a toolkit for research on Emergence of lanGuage in Games

    Authors: Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni

    Abstract: There is renewed interest in simulating language emergence among deep neural agents that communicate to jointly solve a task, spurred by the practical aim to develop language-enabled interactive AIs, as well as by theoretical questions about the evolution of human language. However, optimizing deep architectures connected by a discrete communication channel (such as that in which language emerges)…

    Submitted 13 October, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: EMNLP 2019 Demo paper

  25. arXiv:1905.13687  [pdf, other]

    cs.CL cs.LG

    Entropy Minimization In Emergent Languages

    Authors: Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, Marco Baroni

    Abstract: There is growing interest in studying the languages that emerge when neural agents are jointly trained to solve tasks requiring communication through a discrete channel. We investigate here the information-theoretic complexity of such languages, focusing on the basic two-agent, one-exchange setup. We find that, under common training procedures, the emergent languages are subject to an entropy mini…

    Submitted 26 June, 2020; v1 submitted 31 May, 2019; originally announced May 2019.

    Comments: Accepted at ICML 2020

  26. arXiv:1905.12561  [pdf, other]

    cs.CL cs.AI cs.LG cs.MA

    Anti-efficient encoding in emergent communication

    Authors: Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, Marco Baroni

    Abstract: Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf's Law of Abbreviation (ZLA), is that more frequent words are efficiently associated to shorter strings. We study whether the same pattern emerges when two n…

    Submitted 15 October, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

  27. arXiv:1905.12330  [pdf, other]

    cs.CL cs.AI cs.LG

    Word-order biases in deep-agent emergent communication

    Authors: Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, Marco Baroni

    Abstract: Sequence-processing neural networks led to remarkable progress on many NLP tasks. As a consequence, there has been increasing interest in understanding to what extent they process language as humans do. We aim here to uncover which biases such models display with respect to "natural" word-order constraints. We train models to communicate about paths in a simple gridworld, using miniature languages…

    Submitted 14 June, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: Conference: Association for Computational Linguistics (ACL)

  28. arXiv:1312.1611  [pdf, ps, other]

    cs.IR

    Intent Models for Contextualising and Diversifying Query Suggestions

    Authors: Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, Iadh Ounis

    Abstract: The query suggestion or auto-completion mechanisms help users to type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in the query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can also be personalised based on the user's context. These two directions…

    Submitted 5 December, 2013; originally announced December 2013.

    Comments: A short version of this paper was presented at CIKM 2013

  29. arXiv:1007.3208  [pdf]

    cs.IR

    Link Graph Analysis for Adult Images Classification

    Authors: Evgeny Kharitonov, Anton Slesarev, Ilya Muchnik, Fedor Romanenko, Dmitry Belyaev, Dmitry Kotlyarov

    Abstract: In order to protect an image search engine's users from undesirable results, an adult image classifier should be built. The information about links from websites to images is employed to create such a classifier. These links are represented as a bipartite website-image graph. Each vertex is equipped with scores of adultness and decentness. The scores for image vertexes are initialized with zero, tho…

    Submitted 19 July, 2010; originally announced July 2010.

    Comments: 7 pages. Young Scientists Conference, 4th Russian Summer School in Information Retrieval