-
Federated learning for secure development of AI models for Parkinson's disease detection using speech from different languages
Authors:
Soroosh Tayebi Arasteh,
Cristian David Rios-Urrego,
Elmar Noeth,
Andreas Maier,
Seung Hee Yang,
Jan Rusz,
Juan Rafael Orozco-Arroyave
Abstract:
Parkinson's disease (PD) is a neurological disorder impacting a person's speech. Among automatic PD assessment methods, deep learning models have gained particular interest. Recently, the community has explored cross-pathology and cross-language models which can improve diagnostic accuracy even further. However, strict patient data privacy regulations largely prevent institutions from sharing patient speech data with each other. In this paper, we employ federated learning (FL) for PD detection using speech signals from 3 real-world language corpora of German, Spanish, and Czech, each from a separate institution. Our results indicate that the FL model outperforms all the local models in terms of diagnostic accuracy, while not performing very differently from the model based on centrally combined training sets, with the advantage of not requiring any data sharing among collaborators. This will simplify inter-institutional collaborations, resulting in enhancement of patient outcomes.
Submitted 21 August, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Representation Learning Strategies to Model Pathological Speech: Effect of Multiple Spectral Resolutions
Authors:
Gabriel Figueiredo Miller,
Juan Camilo Vásquez-Correa,
Juan Rafael Orozco-Arroyave,
Elmar Nöth
Abstract:
This paper considers a representation learning strategy to model speech signals from patients with Parkinson's disease and cleft lip and palate. In particular, it compares different parametrized representation types such as wideband and narrowband spectrograms, and wavelet-based scalograms, with the goal of quantifying the representation capacity of each. Methods for quantification include the ability of the proposed model to classify different pathologies and the associated disease severity. Additionally, this paper proposes a novel fusion strategy called multi-spectral fusion that combines wideband and narrowband spectral resolutions using a representation learning strategy based on autoencoders. The proposed models are able to classify the speech from Parkinson's disease patients with accuracy up to 95%. The proposed models were also able to assess the dysarthria severity of Parkinson's disease patients with a Spearman correlation up to 0.75. These results outperform those observed in the literature where the same problem was addressed with the same corpus.
Submitted 17 September, 2022;
originally announced September 2022.
-
Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices
Authors:
Abner Hernandez,
Paula Andrea Pérez-Toro,
Juan Camilo Vásquez-Correa,
Juan Rafael Orozco-Arroyave,
Andreas Maier,
Seung Hee Yang
Abstract:
Collecting speech data is an important step in training speech recognition systems and other speech-based machine learning models. However, the issue of privacy protection is an increasing concern that must be addressed. The current study investigates the use of voice conversion as a method for anonymizing voices. In particular, we train several voice conversion models using self-supervised speech representations including Wav2Vec2.0, Hubert and UniSpeech. Converted voices retain a low word error rate within 1% of the original voice. Equal error rate increases from 1.52% to 46.24% on the LibriSpeech test set and from 3.75% to 45.84% on speakers from the VCTK corpus which signifies degraded performance on speaker verification. Lastly, we conduct experiments on dysarthric speech data to show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices for discriminating between healthy and pathological speech.
Submitted 4 April, 2022;
originally announced April 2022.
-
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition
Authors:
Abner Hernandez,
Paula Andrea Pérez-Toro,
Elmar Nöth,
Juan Rafael Orozco-Arroyave,
Andreas Maier,
Seung Hee Yang
Abstract:
State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, the performance on impaired speech remains an issue. The current study explores the usefulness of Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech. Dysarthric speech recognition is particularly difficult as several aspects of speech such as articulation, prosody and phonation can be impaired. Specifically, we train an acoustic model with features extracted from Wav2Vec, Hubert, and the cross-lingual XLSR model. Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance. In particular, features from the multilingual model led to lower WERs than filterbanks (Fbank) or models trained on a single language. Improvements were observed in English speakers with dysarthria caused by cerebral palsy (UASpeech corpus), Spanish speakers with Parkinsonian dysarthria (PC-GITA corpus) and Italian speakers with paralysis-based dysarthria (EasyCall corpus). Compared to using Fbank features, XLSR-based features reduced WERs by 6.8%, 22.0%, and 7.0% for the UASpeech, PC-GITA, and EasyCall corpora, respectively.
Submitted 4 April, 2022;
originally announced April 2022.
-
Common Phone: A Multilingual Dataset for Robust Acoustic Modelling
Authors:
Philipp Klumpp,
Tomás Arias-Vergara,
Paula Andrea Pérez-Toro,
Elmar Nöth,
Juan Rafael Orozco-Arroyave
Abstract:
Current state-of-the-art acoustic models can easily comprise more than 100 million parameters. This growing complexity demands larger training datasets to maintain a decent generalization of the final decision function. An ideal dataset is not necessarily large in size, but large with respect to the number of unique speakers, utilized hardware and varying recording conditions. This enables a machine learning model to explore as much of the domain-specific input space as possible during parameter estimation. This work introduces Common Phone, a gender-balanced, multilingual corpus recorded from more than 11,000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation. A Wav2Vec 2.0 acoustic model was trained with Common Phone to perform phonetic symbol recognition and validate the quality of the generated phonetic annotation. The architecture achieved a PER of 18.1% on the entire test set, computed with all 101 unique phonetic symbols, showing slight differences between the individual languages. We conclude that Common Phone provides sufficient variability and reliable phonetic annotation to help bridge the gap between research and application of acoustic models.
Submitted 31 January, 2022; v1 submitted 15 January, 2022;
originally announced January 2022.
-
The Phonetic Footprint of Parkinson's Disease
Authors:
Philipp Klumpp,
Tomás Arias-Vergara,
Juan Camilo Vásquez-Correa,
Paula Andrea Pérez-Toro,
Juan Rafael Orozco-Arroyave,
Anton Batliner,
Elmar Nöth
Abstract:
As one of the most prevalent neurodegenerative disorders, Parkinson's disease (PD) has a significant impact on the fine motor skills of patients. The complex interplay of different articulators during speech production and realization of required muscle tension become increasingly difficult, thus leading to a dysarthric speech. Characteristic patterns such as vowel instability, slurred pronunciation and slow speech can often be observed in the affected individuals and were analyzed in previous studies to determine the presence and progression of PD. In this work, we used a phonetic recognizer trained exclusively on healthy speech data to investigate how PD affected the phonetic footprint of patients. We rediscovered numerous patterns that had been described in previous contributions although our system had never seen any pathological speech previously. Furthermore, we could show that intermediate activations from the neural network could serve as feature vectors encoding information related to the disease state of individuals. We were also able to directly correlate the expert-rated intelligibility of a speaker with the mean confidence of phonetic predictions. Our results support the assumption that pathological data is not necessarily required to train systems that are capable of analyzing PD speech.
Submitted 21 December, 2021;
originally announced December 2021.
-
Classification of Emotions and Evaluation of Customer Satisfaction from Speech in Real World Acoustic Environments
Authors:
Luis Felipe Parra-Gallego,
Juan Rafael Orozco-Arroyave
Abstract:
This paper focuses on finding suitable features to robustly recognize emotions and evaluate customer satisfaction from speech in real acoustic scenarios. The classification of emotions is based on standard and well-known corpora and the evaluation of customer satisfaction is based on recordings of real opinions given by customers about the received service during phone calls with call-center agents. The feature sets considered in this study include two speaker models, namely x-vectors and i-vectors, and also the well known feature set introduced in the Interspeech 2010 Paralinguistics Challenge (I2010PC). Additionally, we introduce the use of phonation, articulation and prosody features extracted with the DisVoice framework as alternative feature sets to robustly model emotions and customer satisfaction from speech. The results indicate that the I2010PC feature set is the best approach to classify emotions in the standard databases typically used in the literature. When considering the recordings collected in the call-center, without any control over the acoustic conditions, the best results are obtained with our articulation features. The I2010PC feature set includes 1584 measures while the articulation approach only includes 488 measures. We think that the proposed approach is more suitable for real-world applications where the acoustic conditions are not controlled and also it is potentially more convenient for industrial applications.
Submitted 26 August, 2021;
originally announced August 2021.
-
Gender Recognition in Informal and Formal Language Scenarios via Transfer Learning
Authors:
Daniel Escobar-Grisales,
Juan Camilo Vasquez-Correa,
Juan Rafael Orozco-Arroyave
Abstract:
The interest in demographic information retrieval based on text data has increased in the research community because applications have shown success in different sectors such as security, marketing, health care, and others. Recognition and identification of demographic traits such as gender, age, location, or personality based on text data can help to improve different marketing strategies. For instance, it makes it possible to segment and to personalize offers, so that products and services are exposed to the group of greatest interest. This type of technology has been discussed widely in documents from social media. However, the methods have been poorly studied in data with a more formal structure, where there is no access to emoticons, mentions, and other linguistic phenomena that are only present in social media. This paper proposes the use of recurrent and convolutional neural networks, and a transfer learning strategy for gender recognition in documents that are written in informal and formal languages. Models are tested in two different databases consisting of Tweets and call-center conversations. Accuracies of up to 75% are achieved for both databases. The results also indicate that it is possible to transfer the knowledge from a system trained on a specific type of expressions or idioms such as those typically used in social media into a more formal type of text data, where the amount of data is more scarce and its structure is completely different.
Submitted 23 June, 2021;
originally announced July 2021.
-
Exploring Facial Expressions and Affective Domains for Parkinson Detection
Authors:
Luis Felipe Gomez-Gomez,
Aythami Morales,
Julian Fierrez,
Juan Rafael Orozco-Arroyave
Abstract:
Parkinson's Disease (PD) is a neurological disorder that affects facial movements and non-verbal communication. Patients with PD present a reduction in facial movements called hypomimia which is evaluated in item 3.2 of the MDS-UPDRS-III scale. In this work, we propose to use facial expression analysis from face images based on affective domains to improve PD detection. We propose different domain adaptation techniques to exploit the latest advances in face recognition and Face Action Unit (FAU) detection. The principal contributions of this work are: (1) a novel framework to exploit deep face architectures to model hypomimia in PD patients; (2) we experimentally compare PD detection based on single images vs. image sequences while various facial expressions are evoked in the patients; (3) we explore different domain adaptation techniques to exploit existing models initially trained either for Face Recognition or to detect FAUs for the automatic discrimination between PD patients and healthy subjects; and (4) a new approach to use triplet-loss learning to improve hypomimia modeling and PD detection. The results on real face images from PD patients show that we are able to properly model evoked emotions using image sequences (neutral, onset-transition, apex, offset-transition, and neutral) with accuracy improvements up to 5.5% (from 72.9% to 78.4%) with respect to single-image PD detection. We also show that our proposed affective-domain adaptation provides improvements in PD detection up to 8.9% (from 78.4% to 87.3% detection accuracy).
Submitted 11 December, 2020;
originally announced December 2020.
-
Comparison of user models based on GMM-UBM and i-vectors for speech, handwriting, and gait assessment of Parkinson's disease patients
Authors:
J. C. Vasquez-Correa,
T. Bocklet,
J. R. Orozco-Arroyave,
E. Nöth
Abstract:
Parkinson's disease is a neurodegenerative disorder characterized by the presence of different motor impairments. Information from speech, handwriting, and gait signals has been considered to evaluate the neurological state of the patients. On the other hand, user models based on Gaussian mixture models - universal background models (GMM-UBM) and i-vectors are considered the state of the art in biometric applications like speaker verification because they are able to model specific speaker traits. This study introduces the use of GMM-UBM and i-vectors to evaluate the neurological state of Parkinson's patients using information from speech, handwriting, and gait. The results show the importance of different feature sets from each type of signal in the assessment of the neurological state of the patients.
Submitted 13 February, 2020;
originally announced February 2020.
-
Analysis and Evaluation of Handwriting in Patients with Parkinson's Disease Using kinematic, Geometrical, and Non-linear Features
Authors:
C. D. Rios-Urrego,
J. C. Vásquez-Correa,
J. F. Vargas-Bonilla,
E. Nöth,
F. Lopera,
J. R. Orozco-Arroyave
Abstract:
Background and objectives: Parkinson's disease is a neurological disorder that affects the motor system producing lack of coordination, resting tremor, and rigidity. Impairments in handwriting are among the main symptoms of the disease. Handwriting analysis can help in supporting the diagnosis and in monitoring the progress of the disease. This paper aims to evaluate the importance of different groups of features to model handwriting deficits that appear due to Parkinson's disease; and how those features are able to discriminate between Parkinson's disease patients and healthy subjects.
Methods: Features based on kinematic, geometrical and non-linear dynamics analyses were evaluated to classify Parkinson's disease and healthy subjects. Classifiers based on K-nearest neighbors, support vector machines, and random forest were considered.
Results: Accuracies of up to $93.1\%$ were obtained in the classification of patients and healthy control subjects. A relevance analysis of the features indicated that those related to speed, acceleration, and pressure are the most discriminant. The automatic classification of patients in different stages of the disease shows $κ$ indexes between $0.36$ and $0.44$. Accuracies of up to $83.3\%$ were obtained in a different dataset used only for validation purposes.
Conclusions: The results confirmed the negative impact of aging in the classification process when we considered different groups of healthy subjects. In addition, the results reported with the separate validation set comprise a step towards the development of automated tools to support the diagnosis process in clinical practice.
Submitted 13 February, 2020;
originally announced February 2020.
-
Convolutional Neural Networks and a Transfer Learning Strategy to Classify Parkinson's Disease from Speech in Three Different Languages
Authors:
J. C. Vásquez-Correa,
T. Arias-Vergara,
C. D. Rios-Urrego,
M. Schuster,
J. Rusz,
J. R. Orozco-Arroyave,
E. Nöth
Abstract:
Parkinson's disease patients develop different speech impairments that affect their communication capabilities. The automatic assessment of the speech of the patients allows the development of computer-aided tools to support the diagnosis and the evaluation of the disease severity. This paper introduces a methodology to classify Parkinson's disease from speech in three different languages: Spanish, German, and Czech. The proposed approach considers convolutional neural networks trained with time-frequency representations and a transfer learning strategy among the three languages. The transfer learning scheme aims to improve the accuracy of the models when the weights of the neural network are initialized with utterances from a different language than the one used for the test set. The results suggest that the proposed strategy improves the accuracy of the models by up to 8% when the base model used to initialize the weights of the classifier is robust enough. In addition, the results obtained after the transfer learning are in most cases more balanced in terms of specificity-sensitivity than those trained without the transfer learning strategy.
Submitted 11 February, 2020;
originally announced February 2020.
-
Characterization of the Handwriting Skills as a Biomarker for Parkinson Disease
Authors:
R. Castrillon,
A. Acien,
J. R. Orozco-Arroyave,
A. Morales,
J. F. Vargas,
R. Vera-Rodríguez,
J. Fierrez,
J. Ortega-Garcia,
A. Villegas
Abstract:
In this paper we evaluate the suitability of handwriting patterns as potential biomarkers to model Parkinson disease (PD). Although the study of PD is attracting the interest of many researchers around the world, databases to evaluate handwriting patterns are scarce and knowledge about patterns associated with PD is limited and biased toward the existing datasets. This paper introduces a database with a total of 935 handwriting tasks collected from 55 PD patients and 94 healthy controls (45 young and 49 old). Three feature sets are extracted from the signals: neuromotor, kinematic, and nonlinear dynamic. Different classifiers are used to discriminate between PD and healthy subjects: support vector machines, k-nearest neighbors, and a multilayer perceptron. The proposed features and classifiers enable the detection of PD with accuracies between 81% and 97%. Additionally, new insights are presented on the utility of the studied features for monitoring and detecting PD.
Submitted 19 March, 2019;
originally announced March 2019.