Search | arXiv e-print repository

arXiv:2406.16128 [pdf, other]

Exploiting Foundation Models and Speech Enhancement for Parkinson's Disease Detection from Speech in Real-World Operative Conditions

Authors: Moreno La Quatra, Maria Francesca Turco, Torbjørn Svendsen, Giampiero Salvi, Juan Rafael Orozco-Arroyave, Sabato Marco Siniscalchi

Abstract: This work is concerned with devising a robust Parkinson's (PD) disease detector from speech in real-world operating conditions using (i) foundational models, and (ii) speech enhancement (SE) methods. To this end, we first fine-tune several foundational-based models on the standard PC-GITA (s-PC-GITA) clean data. Our results demonstrate superior performance to previously proposed models. Second, we… ▽ More This work is concerned with devising a robust Parkinson's (PD) disease detector from speech in real-world operating conditions using (i) foundational models, and (ii) speech enhancement (SE) methods. To this end, we first fine-tune several foundational-based models on the standard PC-GITA (s-PC-GITA) clean data. Our results demonstrate superior performance to previously proposed models. Second, we assess the generalization capability of the PD models on the extended PC-GITA (e-PC-GITA) recordings, collected in real-world operative conditions, and observe a severe drop in performance moving from ideal to real-world conditions. Third, we align training and testing conditions applaying off-the-shelf SE techniques on e-PC-GITA, and a significant boost in performance is observed only for the foundational-based models. Finally, combining the two best foundational-based models trained on s-PC-GITA, namely WavLM Base and Hubert Base, yielded top performance on the enhanced e-PC-GITA. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2404.16547 [pdf, other]

Develo** Acoustic Models for Automatic Speech Recognition in Swedish

Authors: Giampiero Salvi

Abstract: This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified… ▽ More This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context independent models and two variations of context dependent models. Furthermore many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies showing a remarkable improvement. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: 16 pages, 7 figures

MSC Class: 68T10 ACM Class: I.5.0; I.2.0; I.2.7

Journal ref: European Student Journal of Language and Speech, 1999

arXiv:2401.06588 [pdf, other]

doi 10.1016/j.specom.2005.05.005

Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints

Authors: Giampiero Salvi

Abstract: This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolut… ▽ More This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency. △ Less

Submitted 12 January, 2024; originally announced January 2024.

ACM Class: I.5.0; I.2.7; E.4

Journal ref: Speech Communication Volume 48, Issue 7, July 2006, Pages 802-818

arXiv:2401.05717 [pdf, other]

doi 10.1016/j.specom.2006.07.009

Segment Boundary Detection via Class Entropy Measurements in Connectionist Phoneme Recognition

Authors: Giampiero Salvi

Abstract: This article investigates the possibility to use the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of th… ▽ More This article investigates the possibility to use the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of this measure is its simplicity as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural network based procedure. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 msec of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries. △ Less

Submitted 12 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

ACM Class: I.5.0; I.2.7; E.4

Journal ref: Speech Communication Volume 48, Issue 12, December 2006, Pages 1666-1676

arXiv:2106.06147 [pdf, other]

doi 10.1109/TPAMI.2022.3194311

NAAQA: A Neural Architecture for Acoustic Question Answering

Authors: Jerome Abdelnour, Jean Rouat, Giampiero Salvi

Abstract: The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling of var… ▽ More The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs. These include handling of variable duration scenes, and scenes built with elementary sounds that differ between training and test set. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The use of 1D convolutions in time and frequency to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. We show that time coordinate maps augment temporal localization capabilities which enhance performance of the network by ~17 percentage points. On the other hand, frequency coordinate maps have little influence on this task. NAAQA achieves 79.5% of accuracy on the AQA task with ~4 times fewer parameters than the previously explored VQA model. We evaluate the perfomance of NAAQA on an independent data set reconstructed from DAQA. We also test the addition of a MALiMo module in our model on both CLEAR2 and DAQA. We provide a detailed analysis of the results for the different question types. We release the code to produce CLEAR2 as well as NAAQA to foster research in this newly emerging machine learning task. △ Less

Submitted 12 January, 2024; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) in April 2021 (first revision February 2022)

ACM Class: I.2.7; I.2.10; I.5.0

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, Volume: 45 Issue: 4, Page(s): 4997-5009

arXiv:2009.05184 [pdf, other]

STEP-GAN: A Step-by-Step Training for Multi Generator GANs with application to Cyber Security in Power Systems

Authors: Mohammad Adiban, Arash Safari, Giampiero Salvi

Abstract: In this study, we introduce a novel unsupervised countermeasure for smart grid power systems, based on generative adversarial networks (GANs). Given the pivotal role of smart grid systems (SGSs) in urban life, their security is of particular importance. In recent years, however, advances in the field of machine learning, have raised concerns about cyber attacks on these systems. Power systems, amo… ▽ More In this study, we introduce a novel unsupervised countermeasure for smart grid power systems, based on generative adversarial networks (GANs). Given the pivotal role of smart grid systems (SGSs) in urban life, their security is of particular importance. In recent years, however, advances in the field of machine learning, have raised concerns about cyber attacks on these systems. Power systems, among the most important components of urban infrastructure, have, for example, been widely attacked by adversaries. Attackers disrupt power systems using false data injection attacks (FDIA), resulting in a breach of availability, integrity, or confidential principles of the system. Our model simulates possible attacks on power systems using multiple generators in a step-by-step interaction with a discriminator in the training phase. As a consequence, our system is robust to unseen attacks. Moreover, the proposed model considerably reduces the well-known mode collapse problem of GAN-based models. Our method is general and it can be potentially employed in a wide range of one of one-class classification tasks. The proposed model has low computational complexity and outperforms baseline systems about 14% and 41% in terms of accuracy on the highly imbalanced publicly available industrial control system (ICS) cyber attack power system dataset. △ Less

Submitted 10 September, 2020; originally announced September 2020.

arXiv:1902.11280 [pdf, ps, other]

From Visual to Acoustic Question Answering

Authors: Jerome Abdelnour, Giampiero Salvi, Jean Rouat

Abstract: We introduce the new task of Acoustic Question Answering (AQA) to promote research in acoustic reasoning. The AQA task consists of analyzing an acoustic scene composed by a combination of elementary sounds and answering questions that relate the position and properties of these sounds. The kind of relational questions asked, require that the models perform non-trivial reasoning in order to answer… ▽ More We introduce the new task of Acoustic Question Answering (AQA) to promote research in acoustic reasoning. The AQA task consists of analyzing an acoustic scene composed by a combination of elementary sounds and answering questions that relate the position and properties of these sounds. The kind of relational questions asked, require that the models perform non-trivial reasoning in order to answer correctly. Although similar problems have been extensively studied in the domain of visual reasoning, we are not aware of any previous studies addressing the problem in the acoustic domain. We propose a method for generating the acoustic scenes from elementary sounds and a number of relevant questions for each scene using templates. We also present preliminary results obtained with two models (FiLM and MAC) that have been shown to work for visual reasoning. △ Less

Submitted 28 February, 2019; originally announced February 2019.

arXiv:1811.10561 [pdf, other]

CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning

Authors: Jerome Abdelnour, Giampiero Salvi, Jean Rouat

Abstract: We introduce the task of acoustic question answering (AQA) in the area of acoustic reasoning. In this task an agent learns to answer questions on the basis of acoustic context. In order to promote research in this area, we propose a data generation paradigm adapted from CLEVR (Johnson et al. 2017). We generate acoustic scenes by leveraging a bank elementary sounds. We also provide a number of func… ▽ More We introduce the task of acoustic question answering (AQA) in the area of acoustic reasoning. In this task an agent learns to answer questions on the basis of acoustic context. In order to promote research in this area, we propose a data generation paradigm adapted from CLEVR (Johnson et al. 2017). We generate acoustic scenes by leveraging a bank elementary sounds. We also provide a number of functional programs that can be used to compose questions and answers that exploit the relationships between the attributes of the elementary sounds in each scene. We provide AQA datasets of various sizes as well as the data generation code. As a preliminary experiment to validate our data, we report the accuracy of current state of the art visual question answering models when they are applied to the AQA task without modifications. Although there is a plethora of question answering tasks based on text, image or video data, to our knowledge, we are the first to propose answering questions directly on audio streams. We hope this contribution will facilitate the development of research in the area. △ Less

Submitted 26 November, 2018; originally announced November 2018.

Comments: NeurIPS 2018 Visually Grounded Interaction and Language (ViGIL) Workshop

Showing 1–8 of 8 results for author: Salvi, G