Moreno La Quatra (1), Maria Francesca Turco (1), Torbjørn Svendsen (2), Giampiero Salvi (2), Juan Rafael Orozco-Arroyave (3), Sabato Marco Siniscalchi (2, 4)

Exploiting Foundation Models and Speech Enhancement for Parkinson’s Disease Detection from Speech in Real-World Operative Conditions

Abstract

This work is concerned with devising a robust Parkinson’s disease (PD) detector from speech in real-world operating conditions using (i) foundational models, and (ii) speech enhancement (SE) methods. To this end, we first fine-tune several foundational-based models on the standard PC-GITA (s-PC-GITA) clean data. Our results demonstrate superior performance to previously proposed models. Second, we assess the generalization capability of the PD models on the extended PC-GITA (e-PC-GITA) recordings, collected in real-world operative conditions, and observe a severe drop in performance moving from ideal to real-world conditions. Third, we align training and testing conditions by applying off-the-shelf SE techniques to e-PC-GITA, and a significant boost in performance is observed only for the foundational-based models. Finally, combining the two best foundational-based models trained on s-PC-GITA, namely WavLM Base and HuBERT Base, yielded top performance on the enhanced e-PC-GITA.

keywords:

Parkinson Detection, Foundational Models, Deep Neural Networks, Speech for Health

1 Introduction

Parkinson’s disease (PD) is a neurodegenerative condition marked by the progressive decline of dopaminergic neurons in the mid-brain [1, 2, 3]. This leads to a range of motor and non-motor symptoms, for instance, tremors, bradykinesia, cognitive deterioration, and depression [1, 4, 5]. During the prodromal stages of PD, patients may exhibit speech disorders, with these symptoms emerging as early as five years before the onset of significant motor impairments [6, 7]. Research indicates a rapid escalation in neuronal damage during the four years following diagnosis [8]. The evaluation of PD through speech analysis has thus gained attention among researchers due to its automated, cost-effective, and non-intrusive nature, making it a promising method for early PD detection.

In recent years, several works have thus been proposed to detect speech impairments and predict PD progression by leveraging solutions based on both shallow and deep models. For example, some studies considered models based on Convolutional Neural Networks (CNN) using spectrograms as input to classify PD patients and Healthy Controls (HCs) or to detect dysarthria and predict its severity level [9, 10]. A model combining unidimensional-CNN (1D-CNN) and bidimensional-CNN (2D-CNN) to capture frequency and time information was implemented in [11, 12]. Other works have focused on representations that aim to model acoustic cues of PD; in [13, 14], the authors used different feature sets to model different speech dimensions such as phonation, prosody, and articulation, and then used these feature sets to classify PD patients and HCs. A similar approach was proposed in [15], where the authors modeled the phoneme articulation precision in PD patients using phonetic information. In [16], prosody, articulation, and phonemic information features were combined to classify PD patients and HCs, showing that phonemic information was useful in discriminating PD patients with cognitive impairment from control subjects in a retelling task. In [17], a traditional pipeline solution based on Support Vector Machines (SVMs) and a 1D-CNN end-to-end approach were compared, showing that the latter attains better results. An exemplar-based sparse representation was investigated in [18] to avoid the time-consuming training phase of traditional machine learning solutions and to obtain a representation more robust to redundancy and noise in the data. A multimodal solution based on linguistic and acoustic cues was proposed in [19] with promising results.

Despite the growing attention within the speech community, the majority of suggested solutions are typically evaluated under optimal conditions, i.e., training and testing data are matched, and often recordings are collected in a sound-proof booth using a professional audio setting. A few recent works consider moderately challenging scenarios, e.g., [20, 21], where training and testing conditions are not aligned. Nonetheless, only sustained vowel phonations and isolated words were evaluated. Consequently, the experimental setup may not accurately reflect a real production scenario in which automatic speech analysis would occur, since speech production is typically evaluated by neurologists in continuous speech [22]. This work seeks to overcome these limitations in experimentation by advocating for the exploration of foundational models and speech enhancement (SE) techniques to improve PD detection from speech in a real-world evaluation scenario. First, we demonstrate that foundational-based Parkinson’s detectors are a viable solution in controlled conditions like those in the standard PC-GITA [23] corpus (s-PC-GITA), whose recordings were collected in ideal conditions. Next, we show that foundational-based models trained on s-PC-GITA generalize better than both pipeline techniques [16] and end-to-end convolutional solutions [17] when tested on the realistic conditions provided by the extended PC-GITA [20] (e-PC-GITA) dataset, whose recordings were collected in a real-world operating environment. Finally, we investigate the effect of off-the-shelf SE techniques on the automatic analysis of Parkinson’s speech, where the goal is to align non-ideal testing recordings to the ideal training ones. (Footnote 1: In this work, we use the term speech enhancement in a broad sense to indicate any pre-processing step that improves the quality of the speech recordings, not only the reduction of additive background noise.)
Specifically, we process the e-PC-GITA recordings by performing the following three steps: (i) voice activity detection (VAD), (ii) speech dereverberation, and (iii) speech denoising. Experimental evidence demonstrates that speech enhancement is not only beneficial for deploying robust foundational-based PD detectors but also helps balance the sensitivity and specificity of the best self-supervised models. Finally, we significantly boosted the PD results on the enhanced e-PC-GITA corpus by combining the two best foundational-based models, i.e., WavLM Base and HuBERT Base, trained on s-PC-GITA data. To the best of the authors’ knowledge, this is the first work assessing PD detection from speech in a real-world operating environment.

2 Proposed Foundational-based PD Detector & Speech Enhancement

Figure 1: Proposed foundational-based solution. Training in the upper panel, and testing in the bottom panel. Feature extraction is activated for SL models. At testing time e-PC-GITA data can be optionally enhanced; whereas training uses clean s-PC-GITA data only. The single-node output layer uses a Sigmoid. $F$ blocks indicate speech frames.

In Figure 1, the proposed foundational-based model is depicted in the training (top panel) and testing (bottom panel) stages. In training, only s-PC-GITA data and labels are used. When testing on e-PC-GITA, recordings can be optionally enhanced (see Section 2.2). Models evaluated in this work are based on the Transformer architecture, and both self-supervised learning (SSL) and (weakly) supervised learning (SL) pre-trained models are investigated for the task of PD detection from speech. SSL models are effective in learning representations from large amounts of unlabelled data [24, 25, 26], but SL-based foundational models have also recently been shown to learn generalizable representations from large amounts of curated data [27, 28].

2.1 PD Detector

Foundational pre-trained encoders represent the backbone of the proposed PD detection model and serve to extract high-level representations from the input speech data. Through a weighted sum operation over frame representations across various layers, the backbone dynamically adjusts the importance of features extracted at different levels of abstraction, enhancing the model’s ability to capture relevant information. The foundational encoder is followed by an attention pooling head to aggregate the frame-level representations into a single vector representation. This component employs learnable attention scores to focus on specific frames of the input, enabling the model to prioritize relevant information for the detection task. After attention pooling, two fully connected linear layers with ReLU activation functions are used to facilitate feature adaptation and transformation, enabling the model to refine its representations and capture discriminative features from the input speech data. The single-node output layer uses a Sigmoid activation function. Our PD model is trained end-to-end, so that both the fully connected layers and the backbone itself are adapted to the final task. Hence the model learns task-specific features and optimizes its performance for PD detection. (Footnote 2: Code and pre-trained models are available at https://github.com/K-STMLab/SSL4PR/.)
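The classification head described above can be sketched in a few lines. The following is a minimal, dependency-free illustration, not the authors' implementation: the function names, the parameter dictionary, and the use of a single learnable score vector for attention pooling are assumptions made for this example, and a real system would use a tensor library and train all parameters jointly with the backbone.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layers, layer_logits):
    """Mix encoder layers with softmax-normalised learnable weights.

    layers: list of L layer outputs, each a list of T frames of D floats.
    """
    weights = softmax(layer_logits)
    n_frames, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layers)) for d in range(dim)]
        for t in range(n_frames)
    ]


def attention_pool(frames, score_vector):
    """Collapse T frames into one vector using learnable attention scores."""
    scores = [sum(x * v for x, v in zip(frame, score_vector)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]


def linear(x, weight, bias):
    """Fully connected layer: weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def pd_probability(layers, params):
    """Weighted layer sum -> attention pooling -> two ReLU layers -> sigmoid node."""
    frames = weighted_layer_sum(layers, params["layer_logits"])
    pooled = attention_pool(frames, params["attn"])
    hidden = relu(linear(pooled, params["w1"], params["b1"]))
    hidden = relu(linear(hidden, params["w2"], params["b2"]))
    logit = linear(hidden, params["w_out"], params["b_out"])[0]
    return 1.0 / (1.0 + math.exp(-logit))
```

With all layer logits equal, the weighted sum reduces to a plain average over layers, and with a zero score vector attention pooling reduces to mean pooling over frames; training moves both away from these uniform defaults.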

2.1.1 Foundational Backbone

SSL models: The employed SSL models operate directly on the raw waveform and consist of an encoder network, which includes a CNN followed by a stack of transformer layers. Three different SSL solutions were considered: Wav2Vec2 [26], HuBERT [29], and WavLM [24]. Despite sharing a similar underlying architecture, these models differ in their pre-training objectives, offering distinct approaches to representation learning. Wav2Vec2 utilizes a contrastive pre-training task on frame representations obtained by masking input audio, focusing on learning discriminative features. HuBERT also masks the input but, instead of a contrastive task, predicts discrete target units for the masked frames; these targets are initially obtained by clustering MFCC features. This approach is designed to learn more robust representations by leveraging the complementary information provided by MFCCs. WavLM introduces masked speech denoising and prediction tasks during pre-training, enhancing the model’s robustness to noise and variations in the input data. By considering these diverse pre-training tasks, we aim to explore the effectiveness of different strategies for learning representations from speech data and their impact on PD detection. Finally, we also investigate the effects of multilingual pre-training by testing the larger version of WavLM and XLSR [25], both pre-trained on multilingual data. XLSR is a larger variant of Wav2Vec2.

SL models: We consider Whisper [27], a family of models that have been pre-trained on large-scale, curated datasets. Whisper is an encoder-decoder model designed to process log-mel spectrograms of speech signals and is trained end-to-end in a supervised manner to generate text transcriptions. Since our goal is PD detection, we focus solely on the encoder component of the Whisper model. The Whisper encoder comprises a CNN block that aggregates frame-level representations, specifically derived from log-mel representations, and a stack of transformer layers. To ensure a fair comparison in terms of model capacity, we consider the small version of Whisper, namely Whisper Small. The encoder component of this model has a number of parameters comparable to that of Wav2Vec2 and HuBERT, which allows a fair comparison between SSL and SL models.

2.2 Speech Enhancement Module

Audio recordings collected in real-world conditions, e.g., patients visiting the clinic for a screening, usually contain background noise, reverberation, and interfering speech (for instance, the interviewer quietly suggesting the next word to be spoken). To handle different real-world acoustic conditions and improve the robustness of our PD detectors, we leverage speech enhancement techniques. In particular, we first remove the interviewer’s interfering background speech using rVADfast, an unsupervised segment-based voice activity detector (VAD) presented in [30]. Next, we use a diffusion-based model for dereverberation [31]. Finally, MP-SENet [32] is used to mitigate background noise. Speech enhancement takes place only in the testing phase, as shown in the bottom panel of Figure 1.
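The order of the three enhancement steps can be made explicit with a small pipeline sketch. This is only an illustration of the chaining: `run_enhancement`, `toy_energy_vad`, and `identity` are hypothetical names introduced here, the energy gate is a toy stand-in and not rVADfast, and the dereverberation and denoising stages are left as placeholders rather than guessing at the interfaces of the models in [31] and [32].

```python
from typing import Callable, List, Sequence

Signal = List[float]
Stage = Callable[[Signal], Signal]


def run_enhancement(signal: Sequence[float], stages: Sequence[Stage]) -> Signal:
    """Apply the enhancement stages in order: VAD, dereverberation, denoising."""
    out = list(signal)
    for stage in stages:
        out = stage(out)
    return out


def toy_energy_vad(signal: Signal, frame_len: int = 4, threshold: float = 0.01) -> Signal:
    """Toy stand-in for a VAD: keep only frames whose mean energy exceeds a threshold."""
    kept: Signal = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            kept.extend(frame)
    return kept


def identity(signal: Signal) -> Signal:
    """Placeholder for a stage (dereverberation or denoising) not modelled here."""
    return list(signal)


# Test-time only: training recordings from s-PC-GITA are left untouched.
ENHANCEMENT_STAGES = [toy_energy_vad, identity, identity]
```

Keeping each stage behind the same signal-in, signal-out interface makes it straightforward to swap a placeholder for a real model or to disable a stage when studying its individual contribution.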

Table 1: PD results averaged over 10 folds on s-PC-GITA. Mean value and standard deviation are reported.

Model	Accuracy	F1-score	ROC-AUC	Sensitivity	Specificity
1D-CNN [17]	64.48 $\pm$ 12.03	60.15 $\pm$ 15.88	64.49 $\pm$ 12.02	53.67 $\pm$ 31.63	75.32 $\pm$ 30.15
Phonemic [19]	68.76 $\pm$ 8.80	68.52 $\pm$ 8.85	68.77 $\pm$ 8.80	65.33 $\pm$ 10.69	72.20 $\pm$ 13.74
Whisper Small [27]	78.04 $\pm$ 11.22	77.89 $\pm$ 11.34	78.04 $\pm$ 11.21	77.11 $\pm$ 13.32	78.98 $\pm$ 13.60
Wav2Vec2 Base [26]	80.04 $\pm$ 8.75	79.80 $\pm$ 8.90	80.05 $\pm$ 8.74	74.56 $\pm$ 12.20	85.54 $\pm$ 12.59
XLSR Large [25]	78.04 $\pm$ 9.25	77.83 $\pm$ 9.39	78.05 $\pm$ 9.24	73.33 $\pm$ 13.07	82.77 $\pm$ 11.42
HuBERT Base [29]	81.15 $\pm$ 8.84	80.84 $\pm$ 9.17	81.16 $\pm$ 8.84	76.44 $\pm$ 12.46	85.88 $\pm$ 14.94
WavLM Base [24]	82.21 $\pm$ 8.17	81.99 $\pm$ 8.34	82.22 $\pm$ 8.17	75.22 $\pm$ 12.33	89.21 $\pm$ 9.22
WavLM Large [24]	79.59 $\pm$ 9.14	79.15 $\pm$ 9.46	79.61 $\pm$ 9.13	70.78 $\pm$ 14.27	88.44 $\pm$ 13.12

3 Experiments

3.1 Dataset

Standard PC-GITA (s-PC-GITA): This dataset comprises samples collected from a total of 100 individuals, evenly divided into two groups: 50 with PD and 50 HC. There are 25 men and 25 women in each group. The patients were diagnosed by a neurologist. The healthy controls are free of any reported PD symptoms or other neurodegenerative disorders. The speakers’ ages are also matched and range from 31 to 86 years. The recordings were performed in a sound-proof booth at Clínica Noel of Medellín in Colombia. The recordings’ sampling frequency is 44.1 kHz with a 16-bit resolution. All speech signals were downsampled to 16 kHz in this work, as in [17]. More information about the dataset can be found in [23].

Extended PC-GITA (e-PC-GITA): It consists of recordings from 20 PD patients and 20 HC. As in the case of s-PC-GITA, age and gender are matched. The speakers were recorded in the Parkinson’s foundation of Medellín. In this case, however, the acoustic conditions were not ideal, since recordings were collected in real-world operating scenarios. Recordings were sampled at 16 kHz with a 16-bit resolution. Patients were evaluated by the same neurologist who performed the evaluations in the s-PC-GITA. e-PC-GITA is used as an independent dataset to evaluate PD solutions under realistic conditions, i.e., speakers simulate patients visiting the clinic for a screening.

The speech tasks from both datasets considered in this work are: diadochokinetic (DDK) exercises, read sentences, and monologues. (Footnote 3: The DDK exercises consist of repetitions of the syllable sequences /pa-ta-ka/, /pe-ta-ka/, /pa-ka-ta/, /pa/, /ka/, and /ta/.)

Table 2: In the upper part, results on the original e-PC-GITA are shown. In the lower part, results on the enhanced e-PC-GITA are given. The average performance of the models trained on s-PC-GITA across individual folds and their standard deviations are reported.

Original e-PC-GITA Recordings
Model	Accuracy	F1-score	ROC-AUC	Sensitivity	Specificity
1D-CNN [17]	64.17 $\pm$ 10.85	59.64 $\pm$ 16.95	64.17 $\pm$ 10.85	84.83 $\pm$ 9.53	43.50 $\pm$ 27.76
Phonemic [19]	50.17 $\pm$ 1.38	46.47 $\pm$ 0.93	50.17 $\pm$ 1.38	76.33 $\pm$ 3.93	24.00 $\pm$ 1.70
Whisper Small [27]	74.83 $\pm$ 5.65	73.83 $\pm$ 7.21	74.83 $\pm$ 5.65	87.33 $\pm$ 6.67	62.33 $\pm$ 16.55
Wav2Vec2 Base [26]	67.75 $\pm$ 4.33	67.17 $\pm$ 4.60	67.75 $\pm$ 4.33	77.00 $\pm$ 6.57	58.50 $\pm$ 12.42
HuBERT Base [29]	67.33 $\pm$ 7.95	64.10 $\pm$ 10.37	67.33 $\pm$ 7.95	92.00 $\pm$ 4.64	42.67 $\pm$ 19.89
WavLM Base [24]	68.58 $\pm$ 5.96	65.98 $\pm$ 7.65	68.58 $\pm$ 5.96	90.50 $\pm$ 9.25	46.67 $\pm$ 19.54
Enhanced e-PC-GITA Recordings
1D-CNN [17]	57.00 $\pm$ 7.70	50.50 $\pm$ 13.28	57.00 $\pm$ 7.70	60.67 $\pm$ 31.81	53.33 $\pm$ 35.06
Phonemic [19]	69.50 $\pm$ 2.36	69.40 $\pm$ 2.37	69.50 $\pm$ 2.36	66.00 $\pm$ 4.90	73.00 $\pm$ 5.52
Whisper Small [27]	76.25 $\pm$ 4.51	75.70 $\pm$ 5.52	76.25 $\pm$ 4.51	80.50 $\pm$ 8.63	72.00 $\pm$ 16.22
Wav2Vec2 Base [26]	76.75 $\pm$ 2.80	75.80 $\pm$ 3.09	76.75 $\pm$ 2.80	57.33 $\pm$ 5.07	96.17 $\pm$ 1.50
HuBERT Base [29]	82.75 $\pm$ 1.94	82.68 $\pm$ 1.96	82.75 $\pm$ 1.94	81.50 $\pm$ 5.45	84.00 $\pm$ 7.35
WavLM Base [24]	83.33 $\pm$ 3.48	83.13 $\pm$ 3.66	83.33 $\pm$ 3.48	77.17 $\pm$ 9.04	89.50 $\pm$ 8.53

Table 3: Results with HuBERT Base (H) and WavLM Base (W) trained on all s-PC-GITA data and tested on enhanced e-PC-GITA. H + W indicates system combination.

Model	Accuracy	F1-score	ROC-AUC	Sensitivity	Specificity
HuBERT Base (H)	86.67	86.63	86.67	81.67	91.67
WavLM Base (W)	85.00	85.00	85.00	83.33	86.67
H + W	88.33	88.32	88.33	85.00	91.67

3.2 Experimental Setup

All PD detection models studied in this work are trained on s-PC-GITA using a 10-fold speaker-independent cross-validation approach following [17], unless stated otherwise. In each fold, speakers are partitioned such that a set of speakers balanced in terms of both gender and class is allocated for testing, while the remaining speakers are used for training. Every speaker is used exactly once for testing, and no speaker appears in both training and testing within the same fold. The foundational-based models were trained and evaluated using audio samples with a fixed duration of 10 seconds, with silence padding applied when needed.
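A speaker-independent split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the fold definition used in the paper: `speaker_folds`, `train_test_speakers`, and `pad_or_crop` are hypothetical helpers, speakers are dealt round-robin from each (class, gender) stratum, and cropping recordings longer than 10 seconds is an assumption, since the text only mentions silence padding.

```python
import random
from collections import defaultdict


def speaker_folds(speakers, n_folds=10, seed=0):
    """Assign each speaker to exactly one test fold.

    speakers: iterable of (speaker_id, label, gender) tuples.
    Speakers are shuffled within each (label, gender) stratum and dealt
    round-robin, so every fold is as balanced as the counts allow.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for speaker_id, label, gender in speakers:
        strata[(label, gender)].append(speaker_id)

    folds = [[] for _ in range(n_folds)]
    position = 0  # running counter shared by all strata keeps fold sizes even
    for key in sorted(strata):
        ids = list(strata[key])
        rng.shuffle(ids)
        for speaker_id in ids:
            folds[position % n_folds].append(speaker_id)
            position += 1
    return folds


def train_test_speakers(folds, test_index):
    """Return (train_ids, test_ids); the two sets never share a speaker."""
    test_ids = list(folds[test_index])
    train_ids = [s for i, fold in enumerate(folds) if i != test_index for s in fold]
    return train_ids, test_ids


def pad_or_crop(samples, sample_rate=16000, seconds=10.0):
    """Force a fixed duration: crop long inputs (assumption), zero-pad short ones."""
    target = int(sample_rate * seconds)
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))
```

With the 100 speakers of s-PC-GITA (25 per class and gender), this sketch yields 10 test speakers per fold with exactly 5 PD and 5 HC; gender counts per fold come out close to, but not always exactly, 5 and 5.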

Foundational-based models are initialized with pre-trained weights and fine-tuned end-to-end for the PD detection task. The Base models and Whisper Small have $\sim$95M parameters, whereas the large models have $\sim$300M parameters. We employed the Adam optimizer [33] with a peak learning rate of $10^{-4}$, which is linearly warmed up and then linearly decayed over the course of training, with the warmup phase comprising 10% of the total training steps. The batch size is set to 32, and training continues for up to 10 epochs. Performance is evaluated using accuracy, F1-score, ROC-AUC, sensitivity, and specificity. Results are presented as mean values and standard deviations across the 10 folds of cross-validation. Fine-tuning and evaluation procedures are conducted on a single NVIDIA® A100 GPU with 80GB of memory. This setup facilitates the efficient exploration of various pre-trained models within a standardized experimental framework. To assess the effectiveness of the proposed solution in real operative conditions, all models are also tested on the e-PC-GITA dataset. Finally, we investigated the effect of the SE techniques, discussed in Section 2.2, on the e-PC-GITA data.
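The learning-rate schedule described above can be written as a single function of the training step. The sketch below is one plausible reading of the description, assuming the rate rises linearly from zero to the peak during warmup and then decays linearly to zero at the last step; the function name and the zero end points are assumptions of this example.

```python
def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_fraction=0.1):
    """Linear warmup to peak_lr, followed by linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Warmup phase: ramp up proportionally to the step index.
        return peak_lr * step / warmup_steps
    # Decay phase: shrink proportionally to the number of steps left.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)
```

For a run of 1000 steps, the rate peaks at step 100 and is back to half the peak value at step 550, midway through the decay phase.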

Baselines: To better assess the proposed foundational-based solutions, we have built two baseline systems, which were proposed to tackle PD detection using the s-PC-GITA data. The first model leverages a 1-dimensional CNN (1D-CNN) as proposed in [17]. It serves as a comparison against other deep models. The 1D-CNN model has 60K trainable parameters and was deployed following the protocol outlined in the original paper [17] for training, validation, and testing. (Footnote 4: Note that [17] did not provide the definition of the folds, which might explain differences in the standard deviations with respect to our results.) The second model employs an SVM-based classifier trained on phonemic features [19], which are based on the posterior probability of a speech frame belonging to one (or more) of 18 phonological classes [34]. This model has demonstrated high efficacy in addressing the PD detection task [19].

3.3 Results

Ideal Testing Conditions - s-PC-GITA: Table 1 presents experimental results obtained under matched training and testing conditions. Foundational-based models consistently outperform baseline solutions, with the best-performing foundational model, WavLM Base, achieving accuracy, F1-score, and ROC-AUC of 82.21%, 81.99%, and 82.22%, respectively. Interestingly, SSL solutions demonstrate better performance compared to the SL model in most cases, underscoring the superior capabilities of SSL-based features. The SL model, trained for a specific task, may tend to discard information available in the input speech, potentially limiting its effectiveness. Another interesting outcome in Table 1 is that base models outperform larger multilingual models (by roughly 2% absolute in terms of average accuracy), indicating that multilingual pre-training may not significantly benefit the PD detection task. Given the limited amount of training data, larger models may also be more prone to over-fitting, potentially resulting in decreased overall performance. In sum, foundational-based models outperform both CNN-based and SVM-based baselines under ideal testing conditions, indicating the effectiveness of leveraging pre-trained models for PD detection. While SSL models generally exhibit better performance, the impact of multilingual pre-training on performance remains limited, suggesting careful consideration in model selection.

Real-World Testing Conditions - e-PC-GITA: To evaluate the generalization capability of the foundational-based models presented above, we assess their performance on the e-PC-GITA dataset. Table 2 reports the average performance of the models trained through the 10-fold cross-validation scheme on the s-PC-GITA dataset and tested (without re-training) on the e-PC-GITA data. The upper part of Table 2 shows results on the original e-PC-GITA data; a substantial reduction in performance can be observed across all models when compared to the ideal testing conditions on s-PC-GITA. However, foundational-based models still consistently outperform both CNN- and SVM-based baselines. (Footnote 5: In s-PC-GITA tests, variability comes from both the model parameters and the testing data. In e-PC-GITA, variability comes only from the model parameters. That may explain the lower standard deviations despite the more challenging testing conditions represented by e-PC-GITA.) Notably, SSL-based models show a more pronounced performance drop than Whisper Small. This may be due to the inherent multi-conditional training of the SL model, trained on diverse datasets covering a broad distribution of audio.

The latter observation suggested that we could boost the performance of the studied models, especially SSL-based solutions, by improving the quality of the testing recordings to better resemble the ideal training conditions. An inspection of the lower part of Table 2, which lists the experimental results on the enhanced e-PC-GITA data, confirms our intuition. Foundational-based solutions, particularly SSL-based ones, outperform the two baseline systems. Moreover, HuBERT Base and WavLM Base, i.e., two SSL-based models, achieve the highest performance, with WavLM Base reaching accuracy, F1-score, and ROC-AUC of 83.33%, 83.13%, and 83.33%, respectively, and they also show more balanced sensitivity and specificity. Interestingly, while the foundational-based models show considerable improvement, the 1D-CNN model exhibits a notable decrease in performance.

System Combination: In the above experiments on e-PC-GITA, we have used the 10 models built on the 10-fold s-PC-GITA in order to ease the comparison between ideal and real-world operating conditions. Nonetheless, better models can be deployed by using all s-PC-GITA data to train a single model. We thus retrained the best-performing models, namely HuBERT Base (H) and WavLM Base (W), using the s-PC-GITA dataset for training and validation, and assessed these two models on the enhanced version of e-PC-GITA. Comparing the first two rows in Table 3 with the corresponding entries in Table 2 confirms our expectations. Finally, combining the predictions of these two models at inference time using a simple averaging approach resulted in a system combination (H + W) that outperformed the individual models, achieving an accuracy, F1-score, and ROC-AUC of 88.33%, 88.32%, and 88.33%, respectively (Table 3).
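A simple averaging combination can be sketched as below. The sketch assumes that the quantities being averaged are the per-recording sigmoid outputs of the two models and that a 0.5 threshold turns the average into a decision; both are assumptions of this example, as are the function names, and the probabilities in the usage note are made-up toy values rather than model outputs.

```python
def combine_by_averaging(probs_a, probs_b):
    """Late fusion: average the PD probabilities of two models per recording."""
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]


def evaluate(probs, labels, threshold=0.5):
    """Accuracy, sensitivity, and specificity from thresholded probabilities.

    labels: 1 for PD, 0 for healthy control.
    """
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy illustration with invented probabilities for four recordings.
toy_labels = [1, 1, 0, 0]
toy_model_a = [0.9, 0.4, 0.2, 0.6]
toy_model_b = [0.7, 0.8, 0.6, 0.3]
toy_combined = combine_by_averaging(toy_model_a, toy_model_b)
```

In this toy case each model misclassifies at least one recording, while the averaged scores classify all four correctly, which illustrates how averaging can help when the two models make different errors.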

4 Conclusion

In this work, we have first shown that foundational models can be effectively fine-tuned to the PD detection task, outperforming conventional models previously reported in the literature. Next, we have shown that (weakly) supervised foundational models fine-tuned on s-PC-GITA attain the best results on unseen real-world conditions, represented by e-PC-GITA. We have speculated that this is due to Whisper’s inherent multi-condition training. A more severe drop in performance was observed for all other evaluated models. Therefore, we used off-the-shelf speech enhancement techniques to align training and testing conditions. As expected, foundational-based models improved significantly, and HuBERT Base and WavLM Base, which are SSL-based models, reported the best results on the enhanced e-PC-GITA. Finally, a simple system combination of HuBERT Base and WavLM Base significantly boosted the PD detection results on the enhanced e-PC-GITA dataset.

References

[1] O. Hornykiewicz, “Biochemical aspects of Parkinson’s disease,” Neurology, vol. 51, no. 2 Suppl 2, pp. S2–S9, 1998.
[2] W. Poewe et al., “Parkinson disease,” Nature Rev. Dis. Primers, vol. 23, 2017.
[3] M. C. de Rijk et al., “Prevalence of parkinson’s disease in europe: A collaborative study of population-based cohorts. neurologic diseases in the elderly research group,” Neurology, vol. 54, pp. S21–S23, 2020.
[4] J. Mu et al., “Parkinson’s disease subtypes identified from cluster analysis of motor and non-motor symptoms,” Frontiers in aging neuroscience, vol. 9, p. 301, 2017.
[5] J. Jankovic, “Parkinson’s disease: Clinical features and diagnosis,” J. Neurol. Neurosurg. Psychiatry, vol. 79, pp. 368–376, 2008.
[6] S. Pinto et al., “Treatments for dysarthria in parkinson’s disease,” The Lancet Neurology, vol. 3, no. 9, pp. 547–556, 2004.
[7] J. Rusz et al., “Imprecise vowel articulation as a potential early marker of Parkinson’s disease: effect of speaking task,” Journal of the Acoustical Society of America, vol. 134, no. 3, pp. 2171–2181, 2013.
[8] M. C. Rodriguez-Oroz et al., “Initial clinical manifestations of parkinson’s disease: Features and pathophysiological mechanisms,” Lancet Neurol, vol. 8, pp. 1128–1139, 2009.
[9] B. Sonawane and P. Sharma, “Speech-based solution to Parkinson’s disease management,” Multimedia Tools and Applications, vol. 80, no. 19, pp. 29 437–29 451, 2021.
[10] C. Rios-Urrego et al., “End-to-end Parkinson’s disease detection using a deep convolutional recurrent network,” in International Conference on Text, Speech, and Dialogue. Springer, 2022, pp. 326–338.
[11] J. Vásquez-Correa et al., “Transfer learning helps to improve the accuracy to classify patients with different speech disorders in different languages,” Pattern Recognition Letters, vol. 150, pp. 272–279, 2021.
[12] C. Quan et al., “End-to-end deep learning approach for Parkinson’s disease detection from speech signals,” Biocybernetics and Biomedical Engineering, vol. 42, no. 2, pp. 556–574, 2022.
[13] J. C. Vásquez-Correa et al., “Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,” Journal of communication disorders, vol. 76, pp. 21–36, 2018.
[14] Y. Liu et al., “Automatic assessment of parkinson’s disease using speech representations of phonation and articulation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 242–255, 2022.
[15] P. Klumpp et al., “The phonetic footprint of parkinson’s disease,” Computer Speech & Language, vol. 72, p. 101321, 2022.
[16] A. M. García et al., “Cognitive determinants of dysarthria in parkinson’s disease: an automated machine learning approach,” Movement Disorders, vol. 36, no. 12, pp. 2862–2873, 2021.
[17] N. Narendra, B. Schuller, and P. Alku, “The detection of parkinson’s disease from speech using voice source information,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1925–1936, 2021.
[18] M. K. Reddy and P. Alku, “Exemplar-based sparse representations for detection of parkinson’s disease from speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1386–1396, 2023.
[19] D. Escobar-Grisales, T. Arias-Vergara, C. D. Ríos-Urrego, E. Nöth, A. M. García, and J. R. Orozco-Arroyave, “An Automatic Multimodal Approach to Analyze Linguistic and Acoustic Cues on Parkinson’s Disease Patients,” in Proc. INTERSPEECH, 2023, pp. 1703–1707.
[20] B. Karan, S. S. Sahu, J. R. Orozco-Arroyave, and K. Mahto, “Non-negative matrix factorization-based time-frequency feature extraction of voice signal for parkinson’s disease prediction,” Computer Speech & Language, vol. 69, 2021.
[21] “Robust language independent voice data driven parkinson’s disease detection,” Engineering Applications of Artificial Intelligence, vol. 129, 2024.
[22] P. Hecker, N. Steckhan, F. Eyben, B. Schuller, and B. Arnrich, “Voice analysis for neurological disorder recognition-a systematic review and perspective on emerging trends,” Front Digit Health, vol. 4, 2022.
[23] J. Orozco-Arroyave, J. Arias-Londoño, J. Vargas-Bonilla, M. Gonzalez-Rativa, and E. Nöth, “New Spanish speech corpus database for the analysis of people suffering from Parkinson’s disease,” in Proc. LREC, 2014, pp. 342–347.
[24] S. Chen et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[25] A. Babu et al., “Xls-r: Self-supervised cross-lingual speech representation learning at scale,” arXiv preprint arXiv:2111.09296, 2021.
[26] A. Baevski et al., “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020.
[27] A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28 492–28 518.
[28] Y. Gong et al., “Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers,” in Proc. INTERSPEECH 2023, 2023, pp. 2798–2802.
[29] W.-N. Hsu et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[30] Z.-H. Tan, A. kr. Sarkar, and N. Dehak, “rvad: An unsupervised segment-based robust voice activity detection method,” Computer Speech & Language, vol. 59, pp. 1–21, 2020.
[31] S. Welker, J. Richter, and T. Gerkmann, “Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain,” in Proc. Interspeech 2022, 2022, pp. 2928–2932.
[32] Y.-X. Lu, Y. Ai, and Z.-H. Ling, “MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra,” in Proc. INTERSPEECH 2023, 2023, pp. 3834–3838.
[33] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2018.
[34] J. Vásquez-Correa et al., “Phonet: A Tool Based on Gated Recurrent Neural Networks to Extract Phonological Posteriors from Speech,” in Proc. Interspeech 2019, 2019, pp. 549–553.