Showing 1–10 of 10 results for author: Mazaré, P

Searching in archive cs.
  1. arXiv:2403.10746  [pdf, other]

    cs.CV cs.DB

    Vector search with small radiuses

    Authors: Gergely Szilvasy, Pierre-Emmanuel Mazaré, Matthijs Douze

    Abstract: In recent years, the dominant accuracy metric for vector search is the recall of a result list of fixed size (top-k retrieval), considering as ground truth the exact vector retrieval results. Although convenient to compute, this metric is distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision need…

    Submitted 15 March, 2024; originally announced March 2024.

  2. arXiv:2401.08281  [pdf, other]

    cs.LG cs.CV cs.SE

    The Faiss library

    Authors: Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, Hervé Jégou

    Abstract: Vector databases manage large collections of embedding vectors. As AI applications are growing rapidly, so are the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This pa…

    Submitted 16 January, 2024; originally announced January 2024.

  3. arXiv:2207.06220  [pdf, other]

    cs.IR cs.AI

    Improving Wikipedia Verifiability with AI

    Authors: Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Pierre-Emmanuel Mazaré, Armand Joulin, Edouard Grave, Sebastian Riedel

    Abstract: Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not supp…

    Submitted 8 July, 2022; originally announced July 2022.

  4. arXiv:2007.00991  [pdf, other]

    eess.AS cs.CL cs.SD

    Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

    Authors: Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

    Abstract: Contrastive Predictive Coding (CPC), based on predicting future segments of speech based on past segments is emerging as a powerful algorithm for representation learning of speech signal. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library and find that applying augmentation in the past is gene…

    Submitted 2 July, 2020; originally announced July 2020.

  5. arXiv:2002.02848  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Unsupervised pretraining transfers well across languages

    Authors: Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux

    Abstract: Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervis…

    Submitted 7 February, 2020; originally announced February 2020.

    Comments: 6 pages. Accepted at ICASSP 2020. However the 2 pages of supplementary materials will appear only in the arxiv version

    Journal ref: ICASSP 2020

  6. Libri-Light: A Benchmark for ASR with Limited or No Supervision

    Authors: Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR…

    Submitted 17 December, 2019; originally announced December 2019.

  7. arXiv:1901.10746  [pdf, other]

    cs.CL

    Reference-less Quality Estimation of Text Simplification Systems

    Authors: Louis Martin, Samuel Humeau, Pierre-Emmanuel Mazaré, Antoine Bordes, Éric Villemonte de La Clergerie, Benoît Sagot

    Abstract: The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between th…

    Submitted 30 January, 2019; originally announced January 2019.

    Journal ref: 1st Workshop on Automatic Text Adaptation (ATA), Nov 2018, Tilburg, Netherlands. https://www.ida.liu.se/~evere22/ATA-18/

  8. arXiv:1901.05415  [pdf, other]

    cs.CL cs.AI cs.HC cs.LG stat.ML

    Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

    Authors: Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazaré, Jason Weston

    Abstract: The majority of conversations a dialogue agent sees over its lifetime occur after it has already been trained and deployed, leaving a vast store of potential training signal untapped. In this work, we propose the self-feeding chatbot, a dialogue agent with the ability to extract new training examples from the conversations it participates in. As our agent engages in conversation, it also estimates…

    Submitted 13 June, 2019; v1 submitted 16 January, 2019; originally announced January 2019.

    Comments: ACL 2019

  9. arXiv:1809.01984  [pdf, other]

    cs.CL

    Training Millions of Personalized Dialogue Agents

    Authors: Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, Antoine Bordes

    Abstract: Current dialogue systems are not very engaging for users, especially when trained end-to-end without relying on proactive reengaging scripted strategies. Zhang et al. (2018) showed that the engagement level of end-to-end dialogue models increases when conditioning them on text personas providing some personalized back-story to the model. However, the dataset used in Zhang et al. (2018) is syntheti…

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018

  10. arXiv:1804.10490  [pdf, other]

    cs.CL

    Weaver: Deep Co-Encoding of Questions and Documents for Machine Reading

    Authors: Martin Raison, Pierre-Emmanuel Mazaré, Rajarshi Das, Antoine Bordes

    Abstract: This paper aims at improving how machines can answer questions directly from text, with the focus of having models that can answer correctly multiple types of questions and from various types of texts, documents or even from large collections of them. To that end, we introduce the Weaver model that uses a new way to relate a question to a textual context by weaving layers of recurrent networks, wi…

    Submitted 27 April, 2018; originally announced April 2018.