Search | arXiv e-print repository

Open-Source Conversational AI with SpeechBrain 1.0

Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, et al. (5 additional authors not shown)

Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

Submitted 29 June, 2024; originally announced July 2024.

Comments: Submitted to JMLR (Machine Learning Open Source Software)

arXiv:2405.06937 [pdf, other]

High-Order Synchrosqueezed Chirplet Transforms for Multicomponent Signal Analysis

Authors: Yi-Ju Yen, De-Yan Lu, Sing-Yuan Yeh, Jian-Jiun Ding, Chun-Yen Shen

Abstract: This study focuses on the analysis of signals containing multiple components with crossover instantaneous frequencies (IF). This problem was initially solved with the chirplet transform (CT). Also, it can be sharpened by adding the synchrosqueezing step, which is called the synchrosqueezed chirplet transform (SCT). However, we found that the SCT goes wrong with the high chirp modulation signal due to the wrong estimation of the IF. In this paper, we present the improvement of the post-transformation of the CT. The main goal of this paper is to amend the estimation introduced in the SCT and carry out the high-order synchrosqueezed chirplet transform. The proposed method reduces the wrong estimation when facing a stronger variety of chirp-modulated multi-component signals. The theoretical analysis of the new reassignment ingredient is provided. Numerical experiments on some synthetic signals are presented to verify the effectiveness of the proposed high-order SCT.

Submitted 11 May, 2024; originally announced May 2024.

MSC Class: 65T99; 42C99; 42A38

arXiv:2401.08833 [pdf, other]

Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective

Authors: Alexander H. Liu, Sung-Lin Yeh, James Glass

Abstract: Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models for different applications. However, the quality of these models is often measured by the performance of different downstream tasks. How well the representations access the information of interest is less studied. In this work, we take a closer look into existing self-supervised methods of speech from an information-theoretic perspective. We aim to develop metrics using mutual information to help practical problems such as model design and selection. We use linear probes to estimate the mutual information between the target information and learned representations, showing another insight into the accessibility to the target information from speech representations. Further, we explore the potential of evaluating representations in a self-supervised fashion, where we estimate the mutual information between different parts of the data without using any labels. Finally, we show that both supervised and unsupervised measures echo the performance of the models on layer-wise linear probing and speech recognition.

Submitted 16 January, 2024; originally announced January 2024.

Comments: ICASSP 2024

arXiv:2312.10547 [pdf, other]

Advancing RAN Slicing with Offline Reinforcement Learning

Authors: Kun Yang, Shu-ping Yeh, Menglei Zhang, Jerry Sydir, Jing Yang, Cong Shen

Abstract: Dynamic radio resource management (RRM) in wireless networks presents significant challenges, particularly in the context of Radio Access Network (RAN) slicing. This technology, crucial for catering to varying user requirements, often grapples with complex optimization scenarios. Existing Reinforcement Learning (RL) approaches, while achieving good performance in RAN slicing, typically rely on online algorithms or behavior cloning. These methods necessitate either continuous environmental interactions or access to high-quality datasets, hindering their practical deployment. Towards addressing these limitations, this paper introduces offline RL to solving the RAN slicing problem, marking a significant shift towards more feasible and adaptive RRM methods. We demonstrate how offline RL can effectively learn near-optimal policies from sub-optimal datasets, a notable advancement over existing practices. Our research highlights the inherent flexibility of offline RL, showcasing its ability to adjust policy criteria without the need for additional environmental interactions. Furthermore, we present empirical evidence of the efficacy of offline RL in adapting to various service-level requirements, illustrating its potential in diverse RAN slicing scenarios.

Submitted 16 December, 2023; originally announced December 2023.

Comments: 9 pages, 6 figures

arXiv:2311.11423 [pdf, other]

Offline Reinforcement Learning for Wireless Network Optimization with Mixture Datasets

Authors: Kun Yang, Cong Shen, Jing Yang, Shu-ping Yeh, Jerry Sydir

Abstract: The recent development of reinforcement learning (RL) has boosted the adoption of online RL for wireless radio resource management (RRM). However, online RL algorithms require direct interactions with the environment, which may be undesirable given the potential performance loss due to the unavoidable exploration in RL. In this work, we first investigate the use of offline RL algorithms in solving the RRM problem. We evaluate several state-of-the-art offline RL algorithms, including behavior constrained Q-learning (BCQ), conservative Q-learning (CQL), and implicit Q-learning (IQL), for a specific RRM problem that aims at maximizing a linear combination of sum and 5-percentile rates via user scheduling. We observe that the performance of offline RL for the RRM problem depends critically on the behavior policy used for data collection, and further propose a novel offline RL solution that leverages heterogeneous datasets collected by different behavior policies. We show that with a proper mixture of the datasets, offline RL can produce a near-optimal RL policy even when all involved behavior policies are highly suboptimal.

Submitted 19 November, 2023; originally announced November 2023.

Comments: This paper is the camera ready version for Asilomar 2023

arXiv:2309.01007 [pdf]

Comparative Analysis of Deep Learning Architectures for Breast Cancer Diagnosis Using the BreaKHis Dataset

Authors: İrem Sayın, Muhammed Ali Soydaş, Yunus Emre Mert, Arda Yarkataş, Berk Ergun, Selma Sözen Yeh, Hüseyin Üvet

Abstract: Cancer is an extremely difficult and dangerous health problem because it manifests in so many different ways and affects so many different organs and tissues. The primary goal of this research was to evaluate deep learning models' ability to correctly identify breast cancer cases using the BreaKHis dataset. The BreaKHis dataset covers a wide range of breast cancer subtypes through its huge collection of histopathological pictures. In this study, we use and compare the performance of five well-known deep learning models for cancer classification: VGG, ResNet, Xception, Inception, and InceptionResNet. The results placed the Xception model at the top, with an F1 score of 0.9 and an accuracy of 89%. At the same time, the Inception and InceptionResNet models both hit accuracy of 87%. However, the F1 score for the Inception model was 87, while that for the InceptionResNet model was 86. These results demonstrate the importance of deep learning methods in making correct breast cancer diagnoses. This highlights the potential to provide improved diagnostic services to patients. The findings of this study not only improve current methods of cancer diagnosis, but also make significant contributions to the creation of new and improved cancer treatment strategies. In a nutshell, the results of this study represent a major advancement in the direction of achieving these vital healthcare goals.

Submitted 10 September, 2023; v1 submitted 2 September, 2023; originally announced September 2023.

Comments: 7 pages, 1 figure, 2 tables

MSC Class: 68T01

arXiv:2210.15793 [pdf, ps, other]

Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution

Authors: Chin-Yun Yu, Sung-Lin Yeh, György Fazekas, Hao Tang

Abstract: Recently, diffusion models (DMs) have been increasingly used in audio processing tasks, including speech super-resolution (SR), which aims to restore high-frequency content given low-resolution speech utterances. This is commonly achieved by conditioning the network of noise predictor with low-resolution audio. In this paper, we propose a novel sampling algorithm that communicates the information of the low-resolution audio via the reverse sampling process of DMs. The proposed method can be a drop-in replacement for the vanilla sampling process and can significantly improve the performance of the existing works. Moreover, by coupling the proposed sampling method with an unconditional DM, i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize it to a wide range of SR setups. We also attain state-of-the-art results on the VCTK Multi-Speaker benchmark with this novel formulation.

Submitted 24 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023

arXiv:2106.04624 [pdf, other]

SpeechBrain: A General-Purpose Speech Toolkit

Authors: Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung-Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, François Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, Yoshua Bengio

Abstract: SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

Submitted 8 June, 2021; originally announced June 2021.

Comments: Preprint

arXiv:2006.13372 [pdf, other]

Handling Spontaneous Traffic Variations in 5G+ via Offloading onto mmWave-Capable UAV 'Bridges'

Authors: Nikita Tafintsev, Dmitri Moltchanov, Sergey Andreev, Shu-ping Yeh, Nageen Himayat, Yevgeni Koucheryavy, Mikko Valkama

Abstract: Unmanned aerial vehicles (UAVs) are increasingly employed for numerous public and civil applications, such as goods delivery, medicine, surveillance, and telecommunications. For the latter, UAVs with onboard communication equipment may help temporarily offload traffic onto the neighboring cells in fifth-generation networks and beyond (5G+). In this paper, we propose and evaluate the use of UAVs traveling over the area of interest to relieve congestion in 5G+ systems under spontaneous traffic fluctuations. To this end, we assess two inherently different offloading schemes, named routed and controlled UAV 'bridging'. Using the tools of renewal theory and stochastic geometry, we analytically characterize these schemes in terms of the fraction of traffic demand that can be offloaded onto the UAV 'bridge' as our parameter of interest. This framework accounts for the unique features of millimeter-wave (mmWave) radio propagation and city deployment types with potential line-of-sight (LoS) link blockage by buildings. We also introduce enhancements to the proposed schemes that significantly improve the offloading gains. Our findings offer evidence that the UAV 'bridges' may be used for efficient traffic offloading in various urban scenarios.

Submitted 23 June, 2020; originally announced June 2020.

Comments: This work has been accepted for publication in the IEEE Transactions on Vehicular Technology

arXiv:2005.12076 [pdf]

An Effective Entropy-assisted Mind-wandering Detection System with EEG Signals based on MM-SART Database

Authors: Yi-Ta Chen, Hsing-Hao Lee, Ching-Yen Shih, Zih-Ling Chen, Win-Ken Beh, Su-Ling Yeh, An-Yeu Wu

Abstract: Mind-wandering (MW), which is usually defined as a lapse of attention and occurs between 20% and 40% of the time, has negative effects on our daily life. Therefore, detecting when MW occurs can protect us from the negative outcomes resulting from MW, such as failing to keep track of a course during learning. In this work, we first collect a multi-modal Sustained Attention to Response Task (MM-SART) database for detecting MW. Eighty-two participants' data are collected in our experiments. For each participant, we collect measures of 32-channel electroencephalogram (EEG) signals, photoplethysmography (PPG) signals, galvanic skin response (GSR) signals, eye tracker signals, and several questionnaires for detailed analyses. Then, we propose an effective MW detection system based on the collected EEG signals. To explore the non-linear characteristics of EEG signals, we utilize entropy-based features in the time, frequency, and wavelet domains. The experimental results show that we can reach a 0.712 AUC score by using the random forest (RF) classifier with leave-one-subject-out cross-validation. Moreover, to lower the overall computational complexity of the MW detection system, we apply techniques of channel selection and feature selection. By using only the two most significant EEG channels, we can reduce the training time of the classifier by 44.16%. By performing correlation importance feature elimination (CIFE) on the feature set, we can further improve the AUC score to 0.725 but with only 14.6% of the selection time compared with the recursive feature elimination (RFE) method. By proposing the MW detection engine, the current work can be applied to educational scenarios, especially in the present era of remote learning.

Submitted 27 November, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

Comments: 15 pages, Journal version

arXiv:1903.09893 [pdf, other]

Full-duplex in 5G Small Cell Access: System Design and Performance Aspects

Authors: Jingwen Bai, Shu-ping Yeh, Feng Xue, Yang-seok Choi, Ping Wang, Shilpa Talwar

Abstract: Recent achievement in self-interference cancellation algorithms enables potential application of full-duplex (FD) in 5G radio access systems. The exponential growth of data traffic in 5G can be supported by having more spectrum and higher spectral efficiency. FD communication promises to double the spectral efficiency by enabling simultaneous uplink and downlink transmissions in the same frequency band. Yet for cellular access network with FD base stations (BS) serving multiple users (UE), additional BS-to-BS and UE-to-UE interferences due to FD operation could diminish the performance gain if not tackled properly. In this article, we address the practical system design aspects to exploit FD gain at network scale. We propose efficient reference signal design, low-overhead channel state information feedback and signalling mechanisms to enable FD operation, and develop low-complexity power control and scheduling algorithms to effectively mitigate new interference introduced by FD operation. We extensively evaluate FD network-wide performance in various deployment scenarios and traffic environment with detailed LTE PHY/MAC modelling. We demonstrate that FD can achieve not only appreciable throughput gains (1.9x), but also significant transmission latency reduction (5-8x) compared with the half-duplex system.

Submitted 23 March, 2019; originally announced March 2019.

Comments: Submitted to IEEE Communications Magazine

Showing 1–11 of 11 results for author: Yeh, S