Showing 1–16 of 16 results for author: Boito, M Z

  1. arXiv:2406.06371

    cs.CL cs.SD eess.AS

    mHuBERT-147: A Compact Multilingual HuBERT Model

    Authors: Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu

    Abstract: We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and data…

    Submitted 27 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Extended version of the Interspeech 2024 paper of same name

  2. arXiv:2311.01070

    cs.CL cs.SD eess.AS

    Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

    Authors: Thomas Palmeira Ferraz, Marcely Zanon Boito, Caroline Brun, Vassilina Nikoulina

    Abstract: Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performa…

    Submitted 12 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted to IEEE ICASSP 2024

  3. arXiv:2309.05472

    cs.CL cs.AI cs.SD eess.AS

    LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech

    Authors: Titouan Parcollet, Ha Nguyen, Solene Evain, Marcely Zanon Boito, Adrien Pupier, Salima Mdhaffar, Hang Le, Sina Alisamir, Natalia Tomashenko, Marco Dinarelli, Shucong Zhang, Alexandre Allauzen, Maximin Coavoux, Yannick Esteve, Mickael Rouvier, Jerome Goulian, Benjamin Lecouteux, Francois Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier

    Abstract: Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains including computer vision and natural language processing. Speech processing drastically benefitted from SSL as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-…

    Submitted 18 March, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: Published in Computer Speech & Language. Preprint allowed

  4. arXiv:2306.07763

    cs.CL

    NAVER LABS Europe's Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track

    Authors: Edward Gow-Smith, Alexandre Berard, Marcely Zanon Boito, Ioan Calapodescu

    Abstract: This paper presents NAVER LABS Europe's systems for Tamasheq-French and Quechua-Spanish speech translation in the IWSLT 2023 Low-Resource track. Our work attempts to maximize translation quality in low-resource settings using multilingual parameter-efficient solutions that leverage strong pre-trained models. Our primary submission for Tamasheq outperforms the previous state of the art by 7.5 BLEU…

    Submitted 13 June, 2023; originally announced June 2023.

    Comments: IWSLT 2023: Tamasheq-French and Quechua-Spanish challenge winner

  5. arXiv:2205.01987

    cs.CL cs.SD eess.AS

    ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks

    Authors: Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, Yannick Estève

    Abstract: This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation. For the Tunisian Arabic-English dataset (low-resource and dialect tracks), we build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tu…

    Submitted 4 May, 2022; originally announced May 2022.

    Comments: IWSLT 2022 system paper

  6. arXiv:2204.01397

    cs.CL cs.SD eess.AS

    A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems

    Authors: Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko, Yannick Estève

    Abstract: Self-supervised models for speech processing emerged recently as popular foundation blocks in speech processing pipelines. These models are pre-trained on unlabeled audio data and then used in speech processing downstream tasks such as automatic speech recognition (ASR) or speech translation (ST). Since these models are now used in research and industrial systems alike, it becomes necessary to und…

    Submitted 5 July, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022 (Special session Inclusive and Fair Speech Technologies)

  7. arXiv:2201.05051

    cs.CL

    Speech Resources in the Tamasheq Language

    Authors: Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, Yannick Estève

    Abstract: In this paper we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. These two datasets were made available for the IWSLT 2022 low-resource speech translation track, and they consist of collections of radio recordings from daily broadcast news in Niger (Studio Kalangou) and Mali (Studio Tamani). We share (i) a massive amount of unlabeled audio data (671 hours)…

    Submitted 11 April, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: Accepted to LREC 2022

  8. arXiv:2106.04298

    cs.CL cs.SD eess.AS

    Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

    Authors: Marcely Zanon Boito, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier

    Abstract: Documenting languages helps to prevent the extinction of endangered dialects, many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, unsupervised word segmentation (UWS) from speech is a useful, yet challenging, task. It consists in producing time-stamps for slicing utterances into smaller segments corresponding to words, being performed from…

    Submitted 18 May, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted to SIGUL 2022

  9. LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech

    Authors: Solene Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Esteve, Benjamin Lecouteux, Francois Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier

    Abstract: Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing. Recent works also investigated SSL from speech. They were notably successful to improve performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient spee…

    Submitted 10 June, 2021; v1 submitted 23 April, 2021; originally announced April 2021.

    Comments: Will be presented at Interspeech 2021

    Journal ref: Proc. Interspeech 2021

  10. arXiv:2003.13325

    cs.CL

    Investigating Language Impact in Bilingual Approaches for Computational Language Documentation

    Authors: Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier

    Abstract: For endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly. Therefore, it is fundamental to translate them into a widely spoken language to ensure interpretability of the recordings. In this paper we investigate how the choice of translation language affects the posterior documentation work…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: Accepted to 1st Joint SLTU and CCURL Workshop

  11. arXiv:1910.13689

    cs.CL cs.SD eess.AS

    ON-TRAC Consortium End-to-End Speech Translation Systems for the IWSLT 2019 Shared Task

    Authors: Ha Nguyen, Natalia Tomashenko, Marcely Zanon Boito, Antoine Caubriere, Fethi Bougares, Mickael Rouvier, Laurent Besacier, Yannick Esteve

    Abstract: This paper describes the ON-TRAC Consortium translation systems developed for the end-to-end model task of IWSLT Evaluation 2019 for the English-to-Portuguese language pair. ON-TRAC Consortium is composed of researchers from three French academic laboratories: LIA (Avignon Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans Université). A single end-to-end model built as a neural encod…

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: IWSLT 2019 - First two authors contributed equally to this work

  12. arXiv:1910.05154

    cs.CL

    How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages

    Authors: Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier

    Abstract: For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take one hour and a half on average of a linguist's work (Austin and Sallabank, 2013). Recently, collecting aligned translations in well-resourced languages became a popular solution for ensuring posterior interpretability of the recordings (Adda et al. 2016). In this paper we invest…

    Submitted 11 October, 2019; originally announced October 2019.

    Comments: 4 pages, workshop LIFT 2019

  13. arXiv:1907.12895

    cs.CL

    MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

    Authors: Marcely Zanon Boito, William N. Havard, Mahault Garnerin, Éric Le Ferrand, Laurent Besacier

    Abstract: The CMU Wilderness Multilingual Speech Dataset (Black, 2019) is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages is not exploited to date. Ther…

    Submitted 26 February, 2020; v1 submitted 30 July, 2019; originally announced July 2019.

    Comments: Accepted to LREC 2020

  14. arXiv:1907.00184

    cs.CL

    Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings

    Authors: Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier

    Abstract: Since Bahdanau et al. [1] first introduced attention for neural machine translation, most sequence-to-sequence models made use of attention mechanisms [2, 3, 4]. While they produce soft-alignment matrices that could be interpreted as alignment between target and source languages, we lack metrics to quantify their quality, being unclear which approach produces the best alignments. This paper presen…

    Submitted 11 September, 2019; v1 submitted 29 June, 2019; originally announced July 2019.

    Comments: Interspeech 2019

  15. arXiv:1807.10740

    cs.CL

    A small Griko-Italian speech translation corpus

    Authors: Marcely Zanon Boito, Antonios Anastasopoulos, Marika Lekakou, Aline Villavicencio, Laurent Besacier

    Abstract: This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 20 minutes of speech) which have been transcribed and translated in Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also include…

    Submitted 27 July, 2018; originally announced July 2018.

  16. arXiv:1709.05631

    cs.CL

    Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

    Authors: Marcely Zanon Boito, Alexandre Berard, Aline Villavicencio, Laurent Besacier

    Abstract: Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision and another with limited supervision with access to the most frequent words. Obtaine…

    Submitted 19 September, 2017; v1 submitted 17 September, 2017; originally announced September 2017.

    Comments: Accepted to IEEE ASRU 2017