Search | arXiv e-print repository

Mamba in Speech: Towards an Alternative to Self-Attention

Authors: Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps

Abstract: Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results exhibit the superiority of bidirectional Mamba (BiMamba) over vanilla Mamba for speech processing. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in Transformer and its derivatives, particularly for the semantic-aware task. The crucial technologies for transferring Mamba to speech are then summarized in ablation studies and the discussion section to offer insights for future research.

Submitted 30 June, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2402.13276

When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection

Authors: Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps

Abstract: Depression is a critical concern in global mental health, prompting extensive research into AI-based detection methods. Among various AI technologies, Large Language Models (LLMs) stand out for their versatility in mental healthcare applications. However, their primary limitation arises from their exclusive dependence on textual input, which constrains their overall capabilities. Furthermore, the utilization of LLMs in identifying and analyzing depressive states is still relatively untapped. In this paper, we present an innovative approach to integrating acoustic speech information into the LLMs framework for multimodal depression detection. We investigate an efficient method for depression detection by integrating speech signals into LLMs utilizing Acoustic Landmarks. By incorporating acoustic landmarks, which are specific to the pronunciation of spoken words, our method adds critical dimensions to text transcripts. This integration also provides insights into the unique speech patterns of individuals, revealing the potential mental states of individuals. Evaluations of the proposed approach on the DAIC-WOZ dataset reveal state-of-the-art results when compared with existing Audio-Text baselines. In addition, this approach is not only valuable for the detection of depression but also represents a new perspective in enhancing the ability of LLMs to comprehend and process speech signals.

Submitted 17 February, 2024; originally announced February 2024.

arXiv:2311.07037

Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

Authors: Mostafa Shahin, Julien Epps, Beena Ahmed

Abstract: The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods relying on analysing phonemes can only detect categorical errors of phonemes that have an adequate amount of training data to be modelled. With the unpredictable nature of the pronunciation errors of non-native or disordered speakers and the scarcity of training datasets, it is infeasible to model all types of mispronunciations. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback to the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as a core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from English learners from different native languages. The proposed speech attribute MDD method was further compared to the traditional phoneme-level MDD and achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all speech attributes.

Submitted 12 November, 2023; originally announced November 2023.

arXiv:2310.10922

Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio

Authors: Antoni Dimitriadis, Siqi Pan, Vidhyasaharan Sethu, Beena Ahmed

Abstract: Self-supervised learning has been used to leverage unlabelled data, improving accuracy and generalisation of speech systems through the training of representation models. While many recent works have sought to produce effective representations across a variety of acoustic domains, languages, modalities and even simultaneous speakers, these studies have all been limited to single-channel audio recordings. This paper presents Spatial HuBERT, a self-supervised speech representation model that learns both acoustic and spatial information pertaining to a single speaker in a potentially noisy environment by using multi-channel audio inputs. Spatial HuBERT learns representations that outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks, particularly in reverberant and noisy environments. We also demonstrate the utility of the representations learned by Spatial HuBERT on a speech localisation downstream task. Along with this paper, we publicly release a new dataset of 100 000 simulated first-order ambisonics room impulse responses.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2211.07769

Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations

Authors: Renee Lu, Mostafa Shahin, Beena Ahmed

Abstract: Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies. The major challenge impeding progress in this domain is the lack of adequate child speech corpora; however, recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity. In this paper, we leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition. We assess the performance of fine-tuning on both native and non-native children's speech, examine the effect of cross-domain child corpora, and investigate the minimum amount of child speech required to fine-tune a model which outperforms a state-of-the-art adult model. We also analyze speech recognition performance across children's ages. Our results demonstrate that fine-tuning with cross-domain child corpora leads to relative improvements of up to 46.08% and 45.53% for native and non-native child speech respectively, and absolute improvements of 14.70% and 31.10%. We also show that with as little as 5 hours of transcribed children's speech, it is possible to fine-tune a children's speech recognition system that outperforms a state-of-the-art adult model fine-tuned on 960 hours of adult speech.

Submitted 14 November, 2022; originally announced November 2022.

Comments: Under-review @ Speech Communication Journal

arXiv:2210.10231

Speaker- and Age-Invariant Training for Child Acoustic Modeling Using Adversarial Multi-Task Learning

Authors: Mostafa Shahin, Beena Ahmed, Julien Epps

Abstract: One of the major challenges in acoustic modelling of child speech is the rapid change in children's articulators as they grow up, their differing growth rates and the resulting high variability within the same age group. These high acoustic variations, along with the scarcity of child speech corpora, have impeded the development of a reliable speech recognition system for children. In this paper, a speaker- and age-invariant training approach based on adversarial multi-task learning is proposed. The system consists of one shared generator network that learns to generate speaker- and age-invariant features, connected to three discrimination networks for phoneme, age, and speaker. The generator network is trained to minimize the phoneme-discrimination loss and maximize the speaker- and age-discrimination losses in an adversarial multi-task learning fashion. The generator network is a Time Delay Neural Network (TDNN) architecture while the three discriminators are feed-forward networks. The system was applied to the OGI speech corpora and achieved a 13% reduction in the WER of the ASR.

Submitted 6 November, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP2023

arXiv:2102.04300

Deep Learning Models May Spuriously Classify Covid-19 from X-ray Images Based on Confounders

Authors: Kaoutar Ben Ahmed, Lawrence O. Hall, Dmitry B. Goldgof, Gregory M. Goldgof, Rahul Paul

Abstract: Identifying who is infected with the Covid-19 virus is critical for controlling its spread. X-ray machines are widely available worldwide and can quickly provide images that can be used for diagnosis. A number of recent studies claim it may be possible to build highly accurate models, using deep learning, to detect Covid-19 from chest X-ray images. This paper explores the robustness and generalization ability of convolutional neural network models in diagnosing Covid-19 disease from frontal-view (AP/PA), raw chest X-ray images that were lung field cropped. Some concerning observations are made about high performing models that have learned to rely on confounding features related to the data source, rather than the patient's lung pathology, when differentiating between Covid-19 positive and negative labels. Specifically, these models likely made diagnoses based on confounding factors such as patient age or image processing artifacts, rather than medically relevant information.

Submitted 8 January, 2021; originally announced February 2021.

arXiv:2002.11188

IoT Based Real Time Noise Mapping System for Urban Sound Pollution Study

Authors: Sakib Ahmed, Touseef Saleh Bin Ahmed, Sumaiya Jafreen, Jannatul Tajrin, Jia Uddin

Abstract: This paper describes the development of a system that enables real-time visualization of sound intensity data via a webapp, using multiple node devices connected through the internet. The prototypes were realized using ATmega328 (Arduino Nano) and ESP8266 hardware modules, the NodeMCU Arduino wrapper library, the Google Maps and Firebase APIs, and a JavaScript webapp. The system architecture is such that multiple node devices will be installed in different locations of the target area. On each node device, an Arduino Nano interfaced with a sound sensor measures the ambient sound intensity, and an ESP8266 Wi-Fi module transmits the data to a database via a web API. The webapp plots all the real-time data from the devices over Google Maps according to the locations of the node devices. The logged data can then be used to carry out research on sound pollution in targeted areas.

Submitted 25 February, 2020; originally announced February 2020.

Comments: Appendix by Sakib Ahmed Accepted as Conference Paper at ICIEV and icIVPR, 2018, Student Conference on Informatics, Electronics & Vision (SCIEV): Paper ID 175

arXiv:1803.02159

Exogenous Approach to Grid Cost Allocation in Peer-to-Peer Electricity Markets

Authors: T. Baroche, P. Pinson, R. Le Goff Latimier, H. Ben Ahmed

Abstract: The deployment of distributed energy resources, combined with a more proactive demand side, is inducing a new paradigm in power system operation and electricity markets. Within a consumer-centric market framework, peer-to-peer approaches have gained substantial interest. Peer-to-peer markets rely on multi-bilateral direct negotiation among all players to match supply and demand, and with product differentiation. These markets can yield a complete mapping of exchanges onto the grid, hence allowing us to rethink our approach to sharing costs related to usage of common infrastructure and services. We propose here to attribute such costs in a number of alternative ways that reflect different views on usage of the grid and on cost allocation, i.e., uniformly and based on the electrical distance between players. Since attribution mechanisms are defined in an exogenous manner and made transparent, they eventually affect the trades of the market participants and related grid usage. The interest of our approach is illustrated on a test case using the IEEE 39 bus test system, underlining the impact of attribution mechanisms on trades and grid usage.

Submitted 6 March, 2018; originally announced March 2018.

arXiv:1404.6389

Computing an Optimal Control Policy for an Energy Storage

Authors: Pierre Haessig, Thibaut Kovaltchouk, Bernard Multon, Hamid Ben Ahmed, Stéphane Lascaud

Abstract: We introduce StoDynProg, a small library created to solve Optimal Control problems arising in the management of Renewable Power Sources, in particular when coupled with an Energy Storage System. The library implements generic Stochastic Dynamic Programming (SDP) numerical methods which can solve a large class of Dynamic Optimization problems. We demonstrate the library capabilities with a prototype problem: smoothing the power of an Ocean Wave Energy Converter. First we use time series analysis to derive a stochastic Markovian model of this system since it is required by Dynamic Programming. Then, we briefly describe the "policy iteration" algorithm we have implemented and the numerical tools being used. We show how the API design of the library is generic enough to address Dynamic Optimization problems outside the field of Energy Management. Finally, we solve the power smoothing problem and compare the optimal control with a simpler heuristic control.

Submitted 25 April, 2014; originally announced April 2014.

Comments: Part of the Proceedings of the 6th European Conference on Python in Science (EuroSciPy 2013), Pierre de Buyl and Nelle Varoquaux editors, (2014)

Report number: euroscipy-proceedings2013-08

Showing 1–10 of 10 results for author: Ahmed, B