Open-Source Conversational AI with SpeechBrain 1.0

Mirco Ravanelli^1,2,5 Titouan Parcollet^4,6 Adel Moumen³ Sylvain de Langen³ Cem Subakan^7,2,1 Peter Plantinga² Yingzhi Wang⁸ Pooneh Mousavi^1,2 Luca Della Libera^1,2 Artem Ploujnikov^5,2 Francesco Paissan^9,14 Davide Borra¹⁰ Salah Zaiem¹¹ Zeyu Zhao¹² Shucong Zhang⁴ Georgios Karakasidis¹² Sung-Lin Yeh¹² Aku Rouhe^13,17 Rudolf Braun¹⁹ Florian Mai¹⁸ Juan Zuluaga-Gomez^19,20 Seyed Mahed Mousavi¹⁴ Andreas Nautsch³ Xuechen Liu¹⁶ Sangeet Sagar¹⁵ Jarod Duret³ Salima Mdhaffar³ Gaëlle Laperrière³ Renato De Mori^3,21 Yannick Estève³
¹Concordia University ²Mila-Quebec AI Institute ³Avignon University ⁴Samsung AI Center Cambridge ⁵Université de Montréal ⁶University of Cambridge ⁷Laval University ⁸Zaion ⁹Fondazione Bruno Kessler ¹⁰University of Bologna ¹¹Telecom Paris ¹²University of Edinburgh ¹³Aalto University ¹⁴University of Trento ¹⁵Saarland University ¹⁶National Institute of Informatics - Tokyo ¹⁷Silo AI ¹⁸KU Leuven ¹⁹Idiap ²⁰EPFL ²¹McGill University

Abstract

SpeechBrain (https://speechbrain.github.io/) is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete “recipes” of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

Keywords: Conversational AI, open-source, speech processing, deep learning.

1 Introduction

Conversational AI is experiencing extraordinary progress, with Large Language Models (LLMs) and speech assistants rapidly evolving and becoming widely adopted in the daily lives of millions of users (McTear, 2021). However, this quick evolution poses a challenge to a fundamental pillar of science: reproducibility. Replicating recent findings is often difficult or impossible for many researchers due to limited access to data, computational resources, or code (Kapoor and Narayanan, 2023). The open-source community is making a remarkable collective effort to mitigate this “reproducibility crisis”, yet many contributors primarily release pre-trained models only, known as open-weight (Liesenfeld and Dingemanse, 2024). While this is a step forward, it is still very common for the data and algorithms used to train them to remain undisclosed. We helped address this problem by releasing SpeechBrain (Ravanelli et al., 2021), a PyTorch-based open-source toolkit designed for accelerating research in speech, audio, and text processing. We ensure replicability by releasing pre-trained models for various tasks and providing the “recipe” for training them from scratch, conveniently including all necessary algorithms and code. A few other open-source toolkits, like NeMo and ESPnet, also support multiple Conversational AI tasks, each excelling in different applications. NeMo is industry-focused, offering ready-to-use solutions, but may provide less flexibility for extensive customization compared to SpeechBrain, which is more research-oriented. ESPnet also supports various tasks with competitive performance, but SpeechBrain stands out for its comprehensive documentation, beginner-friendly tutorials, simplicity, and lightweight design with fewer dependencies.

This paper introduces SpeechBrain 1.0, a remarkable milestone resulting from years of collaboration between the core development team and our community volunteers. We will outline the key technical updates to support novel learning modalities, LLM integration, and advanced decoding strategies, along with novel models, tasks, and new modalities. We also present our new benchmark repository, which is designed for researchers to evaluate and compare their models against state-of-the-art baselines across various tasks.

Figure 1: Overview of the SpeechBrain architecture. Training is managed by a Python script (train.py) which uses data manifest files (in JSON or CSV formats) and YAML-specified hyperparameters.

2 Overview of SpeechBrain

Since its launch in March 2021, SpeechBrain has grown rapidly and emerged as one of the most popular toolkits for speech processing. It is downloaded 2.5 million times monthly, is used in 1,923 repositories, and has 8.2k GitHub stars and 148 contributors. Despite its constant evolution, we remain faithful to the original design principles. We prioritized replicability by releasing both training recipes and pre-trained models. Moreover, 95% of our recipes utilize freely available data and include comprehensive training logs, checkpoints, and other essential information. We made SpeechBrain easy to use by providing comprehensive documentation, examples, and tutorials. Our modular architecture facilitates easy integration or modification of modules. We built it on PyTorch standard interfaces (e.g., torch.nn.Module, torch.optim, torch.utils.data.Dataset), enabling seamless integration with the PyTorch ecosystem (Rouhe et al., 2022). It is released under the permissive Apache 2.0 license.

2.1 Architecture Overview

Training a model with SpeechBrain involves combining three components, as depicted in Figure 1: the training script, the hyperparameter file, and the data manifest files. For simplicity, the training logic is contained in a single Python script. This script leverages a specialized Brain class, designed to orchestrate training intuitively. Hyperparameters are specified using a modified version of YAML called HyperPyYAML, enabling complex parameter configurations that define objects and their corresponding arguments. Data for training, validation, and testing is specified using CSV or JSON files. The toolkit further accelerates Conversational AI development by implementing popular models, efficient sequence-to-sequence learning, data handling, distributed training, beam search decoding, evaluation metrics, and data augmentation, across over 200 training recipes for widely used research datasets and more than 100 pretrained models.
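
As a rough illustration of how the three components fit together, the following standard-library sketch mimics the pattern with a JSON manifest, a plain dictionary of hyperparameters, and a small orchestration class. Everything in it is hypothetical: `ToyBrain`, `load_manifest`, the manifest fields, and the hyperparameter names are stand-ins chosen for this example, and the toolkit itself relies on its own Brain class and HyperPyYAML rather than on this code.

```python
import json

# Hypothetical data manifest: one entry per example (field names are assumptions).
MANIFEST_JSON = """
{
  "utt1": {"feature": 1.0, "target": 2.0},
  "utt2": {"feature": 2.0, "target": 4.0},
  "utt3": {"feature": 3.0, "target": 6.0}
}
"""

# Hypothetical hyperparameters; a real recipe would keep these in a YAML file.
HPARAMS = {"lr": 0.05, "epochs": 200, "init_weight": 0.0}


def load_manifest(text):
    """Parse the JSON manifest into a list of example dictionaries."""
    return list(json.loads(text).values())


class ToyBrain:
    """Toy orchestrator for a one-parameter linear model y = weight * x."""

    def __init__(self, hparams):
        self.hparams = hparams
        self.weight = hparams["init_weight"]

    def compute_forward(self, example):
        """Return the model prediction for one example."""
        return self.weight * example["feature"]

    def compute_objectives(self, prediction, example):
        """Return the squared error between prediction and target."""
        return (prediction - example["target"]) ** 2

    def fit(self, data):
        """Run plain stochastic gradient descent and return the final mean loss."""
        for _ in range(self.hparams["epochs"]):
            for example in data:
                prediction = self.compute_forward(example)
                # Gradient of the squared error with respect to the weight.
                grad = 2.0 * (prediction - example["target"]) * example["feature"]
                self.weight -= self.hparams["lr"] * grad
        losses = [
            self.compute_objectives(self.compute_forward(ex), ex) for ex in data
        ]
        return sum(losses) / len(losses)


train_data = load_manifest(MANIFEST_JSON)
brain = ToyBrain(HPARAMS)
final_loss = brain.fit(train_data)
print(f"learned weight: {brain.weight:.4f}, final loss: {final_loss:.2e}")
```

The point of the sketch is the division of labour: the data description and the hyperparameters live outside the training logic, so either can be swapped without touching the class that runs the loop.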

Table 1: Summary of the technology supported by SpeechBrain 1.0.

Modality	Task and Techniques
Audio	Vocoding, Audio Augmentation, Feature Extraction, Sound Event Detection/Classification, Beamforming.
Speech	Speech Recognition, Enhancement, Separation, Text-to-Speech, Speaker Recognition, Speech Translation, Speech-to-Speech Translation, Spoken Language Understanding, Voice Activity Detection, Speaker Diarization, Emotion Recognition, Emotion Diarization, Language Identification, Self-Supervised Training, Metric Learning, Forced Alignment.
Text	Language Models Training, LLMs Fine-Tuning, Dialogue Modeling, Response Generation, Grapheme-to-Phoneme.
EEG	Motor imagery, P300, and SSVEP classification.

3 Recent Developments

SpeechBrain has rapidly evolved to support a wide array of tasks. Please refer to Table 1 for a complete list as of June 2024. The main improvements in SpeechBrain 1.0 include:

• Learning Modalities: We expanded the support for emerging deep learning modalities. For continual learning, we implemented methods like Rehearsal, Architecture, and Regularization-based approaches (Della Libera et al., 2023). For interpretability, we developed both post-hoc and design-based methods, including Post-hoc Interpretation via Quantization (Paissan et al., 2023), Listen to Interpret (Parekh et al., 2022), Activation Map Thresholding (AMT) for Focal Networks (Della Libera et al., 2024), and Listenable Maps for Audio Classifiers (Paissan et al., 2024). We also implemented audio generation using standard and latent diffusion techniques, along with DiffWave (Kong et al., 2020) as a novel vocoder based on diffusion. Lastly, efficient fine-tuning strategies have been introduced for faster inference using speech self-supervised models (Zaiem et al., 2023a). We implemented wav2vec2 SSL pretraining from scratch as described by Baevski et al. (2020). This enabled efficient training of a 1-billion-parameter SSL model for French on 14,000 hours of speech using over 100 A100 GPUs, showcasing the scalability of SpeechBrain (Parcollet et al., 2024).
• Models and Tasks: We developed several new models and expanded support for various tasks. For speech recognition, we introduced new alternatives to the Transformer architecture like HyperConformer (Mai et al., 2023) and Branchformer (Peng et al., 2022), along with a Streamable Conformer Transducer. We support models for discrete audio tokens (e.g., discrete wav2vec, HuBERT, WavLM, EnCodec, DAC, and Speech Tokenizer), which form the basis for modern multimodal LLMs (Mousavi et al., 2024a). Additionally, we introduced technology for Speech Emotion Diarization (Wang et al., 2023). To improve usability and flexibility, we refactored speech augmentation techniques (Ravanelli and Omologo, 2014, 2015). In terms of new modalities, SpeechBrain 1.0 now supports electroencephalographic (EEG) signal processing. The similarity between EEG and speech allows us to reuse many techniques originally developed for speech processing, enabling tasks like motor imagery, P300, and SSVEP classification with popular models such as EEGNet (Lawhern et al., 2018), ShallowConvNet (Schirrmeister et al., 2017), and EEGConformer (Song et al., 2023).
• Decoding Strategies: We improved beam search algorithms for tasks like speech recognition and translation. The update simplifies the code by separating scoring from search, which allows easy integration of various scorers, including n-gram language models and custom heuristics. Additionally, we support pure CTC training, RNN-T latency-controlled beam search (Jain et al., 2019), batch and GPU decoding (Kim et al., 2017), and N-best hypothesis output with neural language model rescoring (Salazar et al., 2019). We also offer an interface to Kaldi2 (k2) for search based on Finite State Transducers (FST) (Kang et al., 2023) and KenLM for fast language model rescoring (Heafield, 2011).
• Integration with LLMs: LLMs are crucial in modern Conversational AI. We enhanced our interfaces with popular models like GPT-2 (Radford et al., 2019) and Llama 2/3 (Touvron et al., 2023), enabling easy fine-tuning for tasks such as dialogue modeling and response generation (Mousavi et al., 2024c). We also implemented LTU-AS (Gong et al., 2023), a speech LLM designed to jointly understand audio and speech. Additionally, LLMs can be used to rescore n-best hypotheses provided by speech recognizers (Salazar et al., 2019).
• Benchmarks: We have launched a new benchmark repository for facilitating community standardization across various areas of broad interest. Currently, we host four benchmarks: CL-MASR for multilingual ASR continual learning (Della Libera et al., 2023), MP3S for speech self-supervised models with customizable probing heads (Zaiem et al., 2023b), DASB for discrete audio token assessment (Mousavi et al., 2024b), and SpeechBrain-MOABB for fair evaluation of deep learning models on EEG datasets. The goal of these benchmarks is to provide researchers with a common framework, baselines, and evaluation protocols for tasks of significant research interest.
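
The separation between search and scoring mentioned under Decoding Strategies can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`beam_search`, `acoustic_scorer`, `bigram_scorer`), the toy probability tables, and the weighting scheme are all hypothetical and do not reproduce SpeechBrain's actual decoding interface.

```python
import math

VOCAB = ["a", "b", "<eos>"]

# Toy per-step token probabilities standing in for an acoustic model (assumed values).
STEP_PROBS = [
    {"a": 0.6, "b": 0.3, "<eos>": 0.1},
    {"a": 0.4, "b": 0.5, "<eos>": 0.1},
    {"a": 0.1, "b": 0.1, "<eos>": 0.8},
]

# Toy bigram language model that favours repeating the previous token (assumed values).
BIGRAM_PROBS = {
    "<s>": {"a": 0.5, "b": 0.4, "<eos>": 0.1},
    "a": {"a": 0.6, "b": 0.1, "<eos>": 0.3},
    "b": {"a": 0.1, "b": 0.6, "<eos>": 0.3},
}


def acoustic_scorer(prefix, token):
    """Log-probability of `token` at the current step, ignoring the prefix content."""
    step = min(len(prefix), len(STEP_PROBS) - 1)
    return math.log(STEP_PROBS[step][token])


def bigram_scorer(prefix, token):
    """Log-probability of `token` given the previous token."""
    previous = prefix[-1] if prefix else "<s>"
    return math.log(BIGRAM_PROBS[previous][token])


def beam_search(scorers, beam_size, max_len, n_best=1):
    """Generic search loop; `scorers` is a list of (function, weight) pairs."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token in VOCAB:
                # The search never looks inside a scorer; it only sums weighted scores.
                step_score = sum(w * fn(prefix, token) for fn, w in scorers)
                candidates.append((prefix + (token,), score + step_score))
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beams = []
        for hyp in candidates[:beam_size]:
            (finished if hyp[0][-1] == "<eos>" else beams).append(hyp)
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses cut off at max_len
    finished.sort(key=lambda hyp: hyp[1], reverse=True)
    return finished[:n_best]


acoustic_only = beam_search([(acoustic_scorer, 1.0)], beam_size=3, max_len=3)
with_lm = beam_search(
    [(acoustic_scorer, 1.0), (bigram_scorer, 1.0)], beam_size=3, max_len=3, n_best=2
)
print(acoustic_only[0][0], with_lm[0][0])
```

With these toy tables, the acoustic-only search ranks the sequence a, b, <eos> first, while adding the bigram scorer shifts the top hypothesis to a, a, <eos>. The search loop is unchanged between the two calls, which is the practical benefit of keeping scorers pluggable. The sketch applies no length normalization, so it should not be read as a complete decoder.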

4 Conclusion

We presented SpeechBrain 1.0, a significant advancement in the evolution of the SpeechBrain project. We outlined the main updates, including novel learning modalities, models, tasks, and decoding strategies, alongside our efforts in benchmarking initiatives. For an overview of further improvements, please visit the project website. Looking ahead, we plan to keep serving our community with future advancements on large-scale, small-footprint, and multi-modal models.

Acknowledgment

We would like to thank our sponsors: HuggingFace, Samsung AI Center Cambridge, Baidu, OVHCloud, ViaDialog, and Naver Labs Europe. A special thank you to all the contributors who made SpeechBrain 1.0 possible. We thank the Torchaudio team (Hwang et al., 2023) for helpful discussion and support. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Digital Research Alliance of Canada (alliancecan.ca). We also thank Jean Zay GENCI-IDRIS for their support in computing (Grant 2024-A0161015099 and Grant 2022-A0111012991), and the LIAvignon Partnership Chair in AI.

References

Baevski et al. (2020) A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 2020.
Della Libera et al. (2023) L. Della Libera, P. Mousavi, S. Zaiem, C. Subakan, and M. Ravanelli. CL-MASR: A Continual Learning Benchmark for Multilingual ASR. CoRR, abs/2310.16931, 2023.
Della Libera et al. (2024) L. Della Libera, C. Subakan, and M. Ravanelli. Focal modulation networks for interpretable sound classification. In Proceedings of the ICASSP Workshop on Explainable AI for Speech and Audio (XAI-SA), 2024.
Gong et al. (2023) Y. Gong, A. H. Liu, H. Luo, L. Karlinsky, and J. Glass. Joint audio and speech understanding. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
Heafield (2011) K. Heafield. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT), 2011.
Hwang et al. (2023) J. Hwang, M. Hira, C. Chen, X. Zhang, Z. Ni, G. Sun, P. Ma, R. Huang, V. Pratap, Y. Zhang, A. Kumar, C.-Y. Yu, C. Zhu, C. Liu, J. Kahn, M. Ravanelli, P. Sun, S. Watanabe, Y. Shi, Y. Tao, R. Scheibler, S. Cornell, S. Kim, and S. Petridis. Torchaudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
Jain et al. (2019) M. Jain, K. Schubert, J. Mahadeokar, C. Yeh, K. Kalgaonkar, A. Sriram, C. Fuegen, and M. L. Seltzer. RNN-T for latency controlled ASR with improved beam search. CoRR, abs/1911.01629, 2019.
Kang et al. (2023) W. Kang, L. Guo, F. Kuang, L. Lin, M. Luo, Z. Yao, X. Yang, P. Żelasko, and D. Povey. Fast and Parallel Decoding for Transducer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
Kapoor and Narayanan (2023) S. Kapoor and A. Narayanan. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 2023.
Kim et al. (2017) S. Kim, T. Hori, and S. Watanabe. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835–4839, 2017.
Kong et al. (2020) Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. Diffwave: A Versatile Diffusion Model for Audio Synthesis. CoRR, abs/2009.09761, 2020.
Lawhern et al. (2018) V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance. EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. Journal of Neural Engineering, 15(5), July 2018.
Liesenfeld and Dingemanse (2024) A. Liesenfeld and M. Dingemanse. Rethinking open source generative AI: open washing and the EU AI Act. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2024.
Mai et al. (2023) F. Mai, J. Zuluaga-Gomez, T. Parcollet, and P. Motlicek. Hyperconformer: Multi-head hypermixer for efficient speech recognition. In Proceedings of Interspeech, 2023.
McTear (2021) M. McTear. Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots. Synthesis lectures on human language technologies. Morgan & Claypool Publishers, 2021.
Mousavi et al. (2024a) P. Mousavi, J. Duret, S. Zaiem, L. D. Libera, A. Ploujnikov, C. Subakan, and M. Ravanelli. How should we extract discrete audio tokens from self-supervised models? In Proceedings of Interspeech, 2024a.
Mousavi et al. (2024b) P. Mousavi, L. D. Libera, J. Duret, A. Ploujnikov, C. Subakan, and M. Ravanelli. DASB-Discrete Audio and Speech Benchmark. CoRR, abs/2406.14294, 2024b.
Mousavi et al. (2024c) S. M. Mousavi, G. Roccabruna, S. Alghisi, M. Rizzoli, M. Ravanelli, and G. Riccardi. Are LLMs Robust for Spoken Dialogues? In Proceedings of the International Workshop on Spoken Dialogue Systems Technology (IWSDS), 2024c.
Paissan et al. (2023) F. Paissan, C. Subakan, and M. Ravanelli. Posthoc Interpretation via Quantization. CoRR, abs/2303.12659, 2023.
Paissan et al. (2024) F. Paissan, M. Ravanelli, and C. Subakan. Listenable Maps for Audio Classifiers. In Proceedings of the International Conference on Machine Learning (ICML), 2024.
Parcollet et al. (2024) T. Parcollet, H. Nguyen, S. Evain, M. Zanon Boito, A. Pupier, S. Mdhaffar, H. Le, S. Alisamir, N. Tomashenko, M. Dinarelli, S. Zhang, A. Allauzen, M. Coavoux, Y. Estève, M. Rouvier, J. Goulian, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier. LeBenchmark 2.0: A standardized, replicable and enhanced framework for self-supervised representations of French speech. Computer Speech & Language, 86:101622, 2024.
Parekh et al. (2022) J. Parekh, S. Parekh, P. Mozharovskyi, F. d'Alché-Buc, and G. Richard. Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 2022.
Peng et al. (2022) Y. Peng, S. Dalmia, I. R. Lane, and S. Watanabe. Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding. In Proceedings of the International Conference on Machine Learning (ICML), 2022.
Radford et al. (2019) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
Ravanelli and Omologo (2014) M. Ravanelli and M. Omologo. On the selection of the impulse responses for distant-speech recognition based on contaminated speech training. In Proceedings of Interspeech, 2014.
Ravanelli and Omologo (2015) M. Ravanelli and M. Omologo. Contaminated speech training methods for robust DNN-HMM distant speech recognition. In Proceedings of Interspeech, 2015.
Ravanelli et al. (2021) M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, et al. SpeechBrain: A general-purpose speech toolkit. CoRR, abs/2106.04624, 2021.
Rouhe et al. (2022) A. Rouhe, M. Ravanelli, T. Parcollet, and P. Plantinga. A SpeechBrain for Everything: State of the PyTorch Ecosystem for Speech Technologies. Interspeech Tutorial Presentation, September 2022.
Salazar et al. (2019) J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff. Masked language model scoring. CoRR, abs/1910.14659, 2019.
Schirrmeister et al. (2017) R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, Aug. 2017.
Song et al. (2023) Y. Song, Q. Zheng, B. Liu, and X. Gao. EEG conformer: Convolutional transformer for EEG decoding and visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710–719, 2023.
Touvron et al. (2023) H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
Wang et al. (2023) Y. Wang, M. Ravanelli, and A. Yacoubi. Speech Emotion Diarization: Which Emotion Appears When? In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
Zaiem et al. (2023a) S. Zaiem, R. Algayres, T. Parcollet, E. Slim, and M. Ravanelli. Fine-tuning strategies for faster inference using speech self-supervised models: A comparative study. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSP), 2023a.
Zaiem et al. (2023b) S. Zaiem, Y. Kemiche, T. Parcollet, S. Essid, and M. Ravanelli. Speech Self-Supervised Representation Benchmarking: Are We Doing it Right? In Proceedings of Interspeech, 2023b.