Liming Wang^1, Yuan Gong^1, Nauman Dawalatabad^1, Marco Vilela^2, Katerina Placek^2, Brian Tracey^2, Yishu Gong^2, Alan Premasiri^3, Fernando Vieira^3, James Glass^1

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer

Abstract

Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative to manual approaches. We propose ALS longitudinal speech transformer (ALST), a neural network-based automatic predictor of ALS disease progression from longitudinal speech recordings of ALS patients. By taking advantage of high-quality pretrained speech features and longitudinal information in the recordings, our best model achieves 91.0% AUC, improving upon the previous best model by 5.6% relative on the ALS TDI dataset. Careful analysis reveals that ALST is capable of fine-grained and interpretable predictions of ALS progression, especially for distinguishing between rarer and more severe cases. Code is publicly available.

1 Introduction

Amyotrophic lateral sclerosis (ALS) is a progressive multisystem neurodegenerative disease characterized by muscular atrophy due to motor neuron degeneration and, in up to 50% of patients, cognitive-behavioral impairment due to frontotemporal neuronal degeneration [1, 2]. More than 5,000 individuals are diagnosed with ALS annually in the US, and the typical lifespan from diagnosis is 2-5 years. As yet, there is no cure for ALS, and only limited treatment options exist, aimed at slowing disease progression or managing symptoms. Precisely predicting ALS disease progression is crucial for early diagnosis and for developing new treatments. The standard-of-care evaluation for monitoring ALS disease progression is the ALS Functional Rating Scale - Revised (ALSFRS-R) [3]. The ALSFRS-R uses clinician evaluation to ascertain ALS symptom severity across various functional domains including walking and speech. Importantly, a lower ALSFRS-R score at diagnosis as well as a faster decline on the ALSFRS-R over time is linked to shorter survival from symptom onset [4, 5].

In this paper, we examine the utility of an automated system to predict ALS disease progression from digitally-collected speech samples, inspired by previous research devoted to automatic ALS disease progression prediction [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. These systems predict patient survival rates or temporal changes in ALSFRS-R scores using machine and deep learning algorithms such as random forests [6], dynamic Bayesian networks [7, 10, 11, 15] and neural networks [12, 14]. They leverage diverse information sources from clinical documents [6, 7, 10, 11, 12, 14] to various sensory signals such as magnetic resonance imaging scans [8], motion [9, 13, 15] and speech [13]. The successful application of these methods could offer a more objective and consistent evaluation of ALS disease progression, potentially reducing the need for extensive human involvement in the process.

Focusing on predicting ALS disease progression from speech has four main advantages. First, the collection of speech samples is economical and straightforward and does not require any medical expertise. This makes speech-based systems not only more accessible but also a cost-effective approach for tracking ALS disease progression. Second, speech impairments are significant indicators of ALS progression. Significant speech impairment, such as lower speaking rate and intelligibility, is a well-known symptom of ALS, especially in bulbar-onset cases, where motor control of the tongue and lips is initially affected [17]. Third, speech-based systems discern latent variables that can aid in the prediction of ALS disease progression. For example, prior studies in patients with bulbar-onset ALS indicate that the physiologic biomechanics of speech production [18] as well as the acoustic analysis of speech lead to increased sensitivity to motor speech impairment [19]. Last, recent advances in self-supervised speech representation [20, 21] and large-scale automatic speech recognition systems such as Whisper [22] have exploited the large amount of speech data online to obtain effective features that capture both linguistic and paralinguistic information in speech. Such speech features have demonstrated remarkable transferability in various related tasks such as cognitive disorder detection [23], pronunciation assessment [24] and audio classification [25].

This paper’s main contributions are threefold. First, we redefine ALS progression prediction as a longitudinal sequence prediction problem. Second, we introduce a novel model architecture that utilizes the latest pretrained speech representation models for predicting ALS disease progression on the ALSFRS-R. Lastly, we provide an extensive evaluation, examining the effects of different model features, architectural choices, and training objectives, as well as the impact of various phonemes on model predictions.

2 Related work

Previous studies [13, 19, 15, 26, 27, 28, 23, 29] have shown the potential of machine learning models in predicting the progression of ALS and other neurodegenerative diseases like Alzheimer’s from speech. For ALS, a multi-class classifier was trained to predict ALSFRS-R scores using speech and accelerometer data, with mel spectrograms and a convolutional neural network (CNN) [13]. In addition, [19] has proposed automatic measures for ALS severity based on forced alignment and formant analysis of spontaneous speech. In Alzheimer’s research, various models were developed for detection and progression prediction using different techniques such as decision trees [26], recurrent neural nets [27], wav2vec 2.0 [28] and Whisper [23]. However, these Alzheimer’s models mainly focus on cross-sectional binary classification, unlike the longitudinal, multi-class setting in ALS research.

3 Methods

3.1 Dataset

We use the ALS Therapy Development Institute (ALS TDI) dataset for evaluation, which is part of the ALS Research Collaborative (formerly known as the Precision Medicine Program) [15, 13]. This dataset comprises speech recordings accompanied by self-assessed ALSFRS-R scores from a longitudinal study of ALS patients, primarily based in the U.S. and distributed evenly across states [13]. Participants in the study were asked to repeat the sentence “I owe you a yoyo today” five times in each recording. Additionally, participants provided their ALSFRS-R ratings for 12 different cognitive and behavioral functions at the time of recording. However, our experiments only utilize the ALSFRS-R speech sub-score. Due to missing recordings and labels and the quality of text transcriptions, our experiments focus on a subset of 2,851 voice samples from 425 patients. We divide the dataset into two distinct groups: a training set with 2,124 utterances from 344 patients and a testing set with 727 utterances from 81 patients. The average time since diagnosis is 2.8 years and the average tracking period is 0.8 years, with an average gap of about 50 days between successive recordings from the same patient. About 12% and 88% of the recordings are from bulbar-onset and limb-onset ALS patients respectively, and the average ALSFRS-R score at baseline measurement is 3.3.

3.2 ALS transformer

For each patient, we have a list of utterances $[\mathbf{x}^{1},\cdots,\mathbf{x}^{n}]$ labeled with self-rated ALSFRS-R scores on a 5-point scale, $[y_{1},\cdots,y_{n}]\in\{0,\cdots,4\}^{n}$ and phoneme transcripts $[\mathbf{\phi}^{1},\cdots,\mathbf{\phi}^{n}]$. We also have access to the longitudinal information of the utterances, i.e., the patient identities and recording dates $[d_{1},\cdots,d_{n}]$ of each utterance.

Instead of predicting each ALSFRS-R score independently, we formulate the problem as a sequence prediction problem and jointly predict the whole sequence of self-rated ALSFRS-R scores for each patient. To this end, we propose a novel architecture, ALS transformer (ALST), capable of modeling longitudinal context. As shown in Fig. 1, ALST consists of four main components: a pretrained feature extractor, a phoneme aligner, a longitudinal speech encoder and an ALSFRS-R scorer.

The pretrained feature extractor, either a wav2vec 2.0 [20] or Whisper encoder [22], converts each raw speech waveform into 20ms frame-level features $[\mathbf{z}^{1},\cdots,\mathbf{z}^{n}]$. We then concatenate the speech features for all the utterances of each patient in increasing order of their recording dates and extract the phoneme-level forced alignment boundaries as $[s_{1}^{1},\cdots,s_{m_{1}}^{1},\cdots,s_{1}^{n},\cdots,s_{m_{n}}^{n}]$, where $m_{i}$ is the total number of phonemes in utterance $i$ and $s_{j}^{i}$ is the starting frame index of the $j$-th phoneme in utterance $i$. The phoneme aligner then converts the frame-level features to phoneme-level representation using the forced alignments:

	$\displaystyle\mathbf{z}^{\mathrm{phn},i}_{j}$	$\displaystyle=\frac{1}{s_{j+1}^{i}-s_{j}^{i}}\sum_{s_{j}^{i}\leq t<s_{j+1}^{i}}\mathbf{z}_{t},$		(1)

for $1\leq j\leq m_{i},\,1\leq i\leq n$, with $m_{i}\equiv 1$ when no alignment is used.
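The pooling in Eq. (1) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `phoneme_mean_pool` is hypothetical, and it assumes that the last phoneme segment runs to the final frame of the utterance, since the end boundary of the last phoneme is not specified in the text.

```python
def phoneme_mean_pool(frames, starts):
    """Average frame-level features within each phoneme segment (Eq. 1).

    frames: list of T feature vectors (each a list of floats).
    starts: sorted starting frame indices of the phonemes. Segment j spans
            [starts[j], starts[j+1]); the last segment is ASSUMED to run
            to the final frame.
    """
    if not starts:
        # No alignment: treat the whole utterance as a single segment.
        starts = [0]
    bounds = list(starts) + [len(frames)]
    pooled = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if hi <= lo:
            raise ValueError("segment boundaries must be strictly increasing")
        segment = frames[lo:hi]
        dim = len(segment[0])
        # Mean over the frames of this segment, one dimension at a time.
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled
```

For example, four 2-dimensional frames with phoneme starts `[0, 1, 3]` yield three phoneme-level vectors, while an empty `starts` list yields a single vector averaged over the whole utterance.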

The longitudinal speech encoder further contextualizes the phoneme-level features by jointly encoding features across the longitudinal sequence. To this end, it first creates a sequence of trainable positional embeddings using the recording dates and a sequence of phoneme label embeddings:

	$\displaystyle\mathbf{e}_{j}^{\mathrm{pos},i}$	$\displaystyle=\mathrm{Embed}(j)+\mathrm{Embed}(d_{i}),$		(2)
	$\displaystyle\mathbf{e}^{\mathrm{phn},i}_{j}$	$\displaystyle=\mathrm{Embed}(\phi^{i}_{j}),\,1\leq j\leq m_{i},1\leq i\leq n.$		(3)

We experiment with two main types of position embeddings for dates, namely, the order embedding to simply encode the ranking order of the dates and the day embedding to encode the number of days of the recording since the diagnosis date of the patient. Next, it combines the phoneme-level speech features and the embeddings using multiple standard transformer blocks:

	$\displaystyle\bar{\mathbf{h}}_{j}^{i}$	$\displaystyle=\mathrm{Transformer}_{j}(\mathrm{Linear}(\mathbf{z}^{\mathrm{phn},i}_{1:m_{i}})+\mathbf{e}_{1:m_{i}}^{\mathrm{pos},i})$		(4)
	$\displaystyle\mathbf{h}_{j}^{i}$	$\displaystyle=\mathrm{Linear}(\bar{\mathbf{h}}_{j}^{i})+\mathbf{e}_{j}^{\mathrm{phn},i},\,1\leq j\leq m_{i},\,1\leq i\leq n.$		(5)
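To make the difference between the order embedding and the day embedding concrete, the sketch below computes the integer index that each variant would feed into $\mathrm{Embed}(d_{i})$. The function name `order_and_day_indices` and the example dates are hypothetical, and tied dates or recordings made before diagnosis are not handled.

```python
from datetime import date


def order_and_day_indices(recording_dates, diagnosis_date):
    """Return (order_idx, day_idx) for each recording, in input order.

    order_idx: rank of the recording among the patient's recording dates.
    day_idx:   number of days between the diagnosis date and the recording.
    """
    ranked = sorted(range(len(recording_dates)), key=lambda i: recording_dates[i])
    order_idx = [0] * len(recording_dates)
    for rank, i in enumerate(ranked):
        order_idx[i] = rank
    day_idx = [(d - diagnosis_date).days for d in recording_dates]
    return order_idx, day_idx


# Hypothetical patient with three recordings.
dates = [date(2021, 3, 1), date(2021, 1, 10), date(2021, 4, 20)]
order_idx, day_idx = order_and_day_indices(dates, date(2021, 1, 1))
```

The order index discards the length of the gaps between visits, whereas the day index keeps it at the cost of a much larger and sparser index range.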

The ALSFRS-R scorer first converts the hidden encodings from the encoder to the utterance level via mean pooling:

	$\displaystyle\mathbf{z}_{i}^{\mathrm{utt}}$	$\displaystyle=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\mathbf{h}_{j}^{i}.$		(6)

Next, it maps each utterance-level feature to two predictions, a continuous estimated score $\hat{y}_{i}$ and an estimated probability over score classes $\hat{\mathbf{p}}_{i}$:

	$\displaystyle[\hat{y}_{i},\mathbf{o}_{i}]$	$\displaystyle=\mathrm{Linear}(\mathbf{z}_{i}^{\mathrm{utt}}),$		(7)
	$\displaystyle\hat{\mathbf{p}}_{i}$	$\displaystyle=\text{Softmax}(\mathbf{o}_{i}),\,1\leq i\leq n.$		(8)
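The scorer in Eqs. (6)-(8) can be illustrated with a minimal plain-Python sketch. The names `score_utterance`, `weight` and `bias` are hypothetical, and the layout of the linear layer (row 0 for the continuous score, remaining rows for the class logits) is an assumption made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_utterance(phoneme_feats, weight, bias):
    """Mean-pool phoneme features, apply one linear layer, split the output.

    weight has 1 + C rows for C score classes: row 0 gives the continuous
    score, the remaining rows give the class logits (ASSUMED layout).
    """
    dim = len(phoneme_feats[0])
    # Eq. (6): utterance-level feature via mean pooling.
    z_utt = [sum(h[d] for h in phoneme_feats) / len(phoneme_feats) for d in range(dim)]
    # Eq. (7): a single linear layer producing [y_hat, o].
    out = [sum(w * z for w, z in zip(row, z_utt)) + b for row, b in zip(weight, bias)]
    y_hat, logits = out[0], out[1:]
    # Eq. (8): class probabilities.
    return y_hat, softmax(logits)
```

Sharing one linear layer between the two outputs means the regression and classification branches see exactly the same utterance-level feature.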

The training objective for ALST is then a linear combination of cross entropy and mean-squared error (MSE):

	$\displaystyle\mathcal{L}_{\text{ALST}}$	$\displaystyle=\sum_{i=1}^{n}\left(|\hat{y}_{i}-y_{i}|^{2}-\lambda_{\text{CE}}\log\hat{p}_{i}(y_{i})\right).$		(9)
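As a worked illustration of this objective, the sketch below evaluates the combined loss for one patient, taking the cross entropy as the negative log-probability of the true class. The function name `alst_loss` is hypothetical, and the loss is summed (not averaged) over utterances.

```python
import math


def alst_loss(y_hat, probs, y_true, lambda_ce=1.0):
    """Sum over utterances of squared error plus weighted cross entropy.

    y_hat:  predicted continuous scores, one per utterance.
    probs:  predicted class probabilities, one list per utterance.
    y_true: integer ALSFRS-R speech scores in {0, ..., 4}.
    """
    total = 0.0
    for pred, p, y in zip(y_hat, probs, y_true):
        mse_term = (pred - y) ** 2
        ce_term = -math.log(p[y])  # cross entropy for a one-hot target
        total += mse_term + lambda_ce * ce_term
    return total
```

Setting `lambda_ce=0.0` recovers pure MSE training, which corresponds to the regression-only end of the loss-weight ablation.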

Figure 1: Architecture of the proposed ALST. The Whisper encoder is frozen during training.

3.3 Preprocessing

For the pretrained feature extractor, we use either the wav2vec 2.0 (large) [20] pretrained and finetuned on the 960-hour LibriSpeech dataset [30], or the encoder output of Whisper [22]. We have experimented with outputs from all encoder layers for both types of encoders, as we will discuss in the analysis section. For the longitudinal speech encoder, we use a two-layer, single-head transformer [31] with 512 hidden nodes. We set $\lambda_{\text{CE}}=1$ by default and train the ALST using the Adam optimizer [32] with a batch size of 32, 100 warmup steps and an initial learning rate of $10^{-4}$ for 100 epochs. We also use a multi-step learning rate decay starting from the 20th epoch with a step size of 5 and a decay rate of 0.5. On an NVIDIA RTX A5000 GPU, the training takes about 2.2 minutes. All our models have around 7 million parameters. All reported test results are from the models at the last epoch.
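One possible reading of this learning-rate schedule is sketched below. Several details are assumptions not stated in the text: the warmup is taken to be linear over optimizer steps, epochs are 0-based, and the first decay is applied at epoch 20 and then every 5 epochs. The function name `learning_rate` is hypothetical.

```python
def learning_rate(step, epoch, base_lr=1e-4, warmup_steps=100,
                  decay_start=20, decay_every=5, decay_rate=0.5):
    """Learning rate at a given optimizer step and 0-based epoch.

    ASSUMPTIONS: linear warmup over the first `warmup_steps` steps, and a
    multi-step decay applied at epochs 20, 25, 30, ...
    """
    warmup = min(1.0, (step + 1) / warmup_steps)
    if epoch < decay_start:
        n_decays = 0
    else:
        n_decays = (epoch - decay_start) // decay_every + 1
    return base_lr * warmup * decay_rate ** n_decays
```

Under these assumptions the rate is halved 16 times over 100 epochs, so the final epochs run at a very small step size, which may help explain why last-epoch models are stable enough to report.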

3.4 Evaluation metrics

To evaluate the ALSFRS-R speech score classification performance of our models, we report both macro F1, i.e., one-versus-all F1 averaged over classes, and accuracy. To account for inter-patient inconsistency of the self-rated ALSFRS-R speech scores and to assess how well the model captures longitudinal variation in the speech, we also report intra-patient ranking measures such as Spearman’s $\rho$, Kendall’s $\tau$ and the pairwise accuracy score. Using such measures, we can evaluate how well the model ranks the voice samples by the ALS severity of the patient rather than by the absolute values of the scores, which can have large variability among patients. For the pairwise score, we label a pair of scores $(y_{1},y_{2})$, taken in increasing order of their recording dates, as $1$ if $y_{1}>y_{2}$, $-1$ if $y_{1}<y_{2}$ and $0$ otherwise. To assess how well our models provide a more fine-grained, continuous ALS progression measure, we also compute the MSE between predicted and true scores on the test set.
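The pairwise score can be sketched as follows. This is one possible reading rather than the exact evaluation code: it assumes that all pairs of recordings from the same patient are compared (not only successive ones) and that predicted continuous scores are compared directly. The names `pair_label` and `pairwise_accuracy` are hypothetical.

```python
def pair_label(a, b):
    """1 if the score drops from a to b, -1 if it rises, 0 if unchanged."""
    return (a > b) - (a < b)


def pairwise_accuracy(y_true, y_pred):
    """Fraction of within-patient pairs whose predicted label matches.

    Both lists hold one patient's scores sorted by recording date.
    ASSUMPTION: every pair (i, j) with i < j is counted.
    """
    matches, total = 0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            total += 1
            matches += pair_label(y_true[i], y_true[j]) == pair_label(y_pred[i], y_pred[j])
    return matches / total if total else float("nan")
```

Because only the sign of each within-patient difference matters, a model that is systematically offset for one patient (for example, due to that patient's rating standard) is not penalized by this metric.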

4 Results

Table 1: Overall ALSFRS-R speech score prediction results. $^{*}$ denotes results obtained from a larger dataset containing our dataset along with additional voice samples. ALST is used for AUC, while ALST (R) is used for other metrics.

| Features | Scorer | Macro F1 $\uparrow$ | AUC $\uparrow$ | Spearman $\rho$ $\uparrow$ | Kendall $\tau$ $\uparrow$ | Pairwise Acc. $\uparrow$ | MSE $\downarrow$ |
|---|---|---|---|---|---|---|---|
| FBank | CNN [13] | - | .865$^{*}$ | - | - | - | - |
| wav2vec 2.0 | SVM | .578$\pm$.000 | .906$\pm$.000 | .445$\pm$.000 | .428$\pm$.000 | .751$\pm$.000 | - |
| wav2vec 2.0 | ALST | .544$\pm$.075 | .900$\pm$.011 | .526$\pm$.061 | .513$\pm$.060 | .801$\pm$.017 | .785$\pm$.009 |
| whisper-medium | SVM | .549$\pm$.000 | .910$\pm$.000 | .470$\pm$.000 | .458$\pm$.000 | .760$\pm$.000 | - |
| whisper-medium | ALST | .577$\pm$.054 | .906$\pm$.007 | .594$\pm$.022 | .575$\pm$.024 | .818$\pm$.015 | .780$\pm$.004 |
| whisper-large v2 | SVM | .552$\pm$.000 | .908$\pm$.000 | .446$\pm$.000 | .429$\pm$.000 | .773$\pm$.000 | - |
| whisper-large v2 | ALST | .561$\pm$.032 | .907$\pm$.013 | .552$\pm$.053 | .537$\pm$.053 | .804$\pm$.019 | .777$\pm$.011 |

Table 2: ALSFRS-R speech score prediction results vs. different variants of ALSTs. ALST (R) stands for the regression branch of the ALST; ALST (FA) stands for ALST with forced aligned phoneme-level input features and outputs from the regression branch.

| Features | Variant | Macro F1 $\uparrow$ | AUC $\uparrow$ | Spearman $\rho$ $\uparrow$ | Kendall $\tau$ $\uparrow$ | Pairwise Acc. $\uparrow$ | MSE $\downarrow$ |
|---|---|---|---|---|---|---|---|
| wav2vec 2.0 | ALST | .482$\pm$.037 | .900$\pm$.011 | .583$\pm$.029 | .567$\pm$.031 | .806$\pm$.015 | - |
| wav2vec 2.0 | ALST (R) | .544$\pm$.075 | - | .526$\pm$.061 | .513$\pm$.060 | .801$\pm$.017 | .785$\pm$.009 |
| wav2vec 2.0 | ALST (FA) | .529$\pm$.028 | .898$\pm$.007 | .448$\pm$.045 | .432$\pm$.044 | .732$\pm$.013 | .795$\pm$.005 |
| whisper-medium | ALST | .560$\pm$.016 | .906$\pm$.007 | .585$\pm$.020 | .566$\pm$.019 | .822$\pm$.012 | - |
| whisper-medium | ALST (R) | .577$\pm$.054 | - | .594$\pm$.022 | .575$\pm$.024 | .818$\pm$.015 | .780$\pm$.004 |
| whisper-medium | ALST (FA) | .571$\pm$.069 | .907$\pm$.009 | .469$\pm$.070 | .451$\pm$.067 | .731$\pm$.012 | .783$\pm$.010 |
| whisper-large v2 | ALST | .486$\pm$.049 | .907$\pm$.013 | .470$\pm$.041 | .458$\pm$.039 | .799$\pm$.012 | - |
| whisper-large v2 | ALST (R) | .561$\pm$.032 | - | .552$\pm$.053 | .537$\pm$.053 | .804$\pm$.019 | .777$\pm$.011 |
| whisper-large v2 | ALST (FA) | .594$\pm$.035 | .909$\pm$.005 | .520$\pm$.049 | .502$\pm$.048 | .753$\pm$.027 | .780$\pm$.009 |

Table 3: ALSFRS-R speech score prediction results vs. different setups for ALST (R)s with whisper-medium encoders. The setup “longitudinal” uses longitudinal sequence without positional embeddings.

| Setup | Macro F1 $\uparrow$ | AUC $\uparrow$ | Spearman $\rho$ $\uparrow$ | Kendall $\tau$ $\uparrow$ | Pairwise Acc. $\uparrow$ | MSE $\downarrow$ |
|---|---|---|---|---|---|---|
| No longitudinal | .581$\pm$.017 | .916$\pm$.003 | .546$\pm$.081 | .529$\pm$.080 | .796$\pm$.022 | .768$\pm$.005 |
| Longitudinal | .552$\pm$.008 | .912$\pm$.010 | .611$\pm$.046 | .592$\pm$.045 | .826$\pm$.015 | .772$\pm$.009 |
| Order, trainable | .558$\pm$.017 | .914$\pm$.011 | .564$\pm$.083 | .543$\pm$.081 | .813$\pm$.014 | .774$\pm$.009 |
| Order, sinusoid | .520$\pm$.028 | .912$\pm$.009 | .561$\pm$.107 | .543$\pm$.106 | .805$\pm$.022 | .771$\pm$.006 |
| Day, sinusoid | .536$\pm$.023 | .903$\pm$.005 | .522$\pm$.019 | .506$\pm$.018 | .803$\pm$.017 | .782$\pm$.006 |

The overall results of our models are shown in Table 1. We compare ALST with two baseline models: a CNN-based model with mel filterbank (FBank) features as inputs [13] and a linear SVM with the same speech features as the ALST. Models using pretrained speech representations consistently outperform the FBank-based baseline by around 5.6% relative AUC. Among the pretrained speech representations, the whisper-medium encoder performs the best in all the ranking-based metrics, even better than the larger whisper-large v2 encoder in all except the macro F1 score. While this might be due to training instability or scaling issues for whisper-large v2, another possible explanation is that whisper-large v2 has filtered out some of the paralinguistic information due to its irrelevance to the original task of ASR.

Our experiments also demonstrate the advantage of nonlinear models such as transformers [31] over linear models for ALS prediction. With the same speech representation, the best ALSTs generally outperform the linear SVMs in all comparable metrics, especially ranking-based metrics, with around 5% relative improvements in pairwise accuracy and around 25% relative improvements in both Spearman $\rho$ and Kendall $\tau$. For the macro F1, the Whisper-based ALSTs outperform the SVM by an average of 6.5% relative, while the wav2vec 2.0-based ALSTs perform worse than the SVM by 5.9% relative. Moreover, as shown in Table 2, the regression branch of ALST consistently outperforms the classification branch in macro F1 for all pretrained speech representations, as well as in all the ranking-based metrics for Whisper representations. The use of phoneme segmentation helps whisper-large v2 in macro F1 and AUC, but not in ranking-based metrics or with other pretrained speech representations.
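For readers who want to reproduce the relative figures from the tables, the arithmetic is a plain relative change with the SVM as baseline. The helper name `relative_change` is hypothetical; the inputs below are taken from Table 1.

```python
def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# wav2vec 2.0 macro F1: ALST .544 vs. SVM .578
print(f"{relative_change(0.544, 0.578):+.1%}")  # -5.9%

# whisper-medium Spearman rho: ALST .594 vs. SVM .470
print(f"{relative_change(0.594, 0.470):+.1%}")  # +26.4%
```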

To better understand the role of longitudinal information for ALS prediction, we experiment with different setups including using no longitudinal information, using longitudinal information without position embeddings and with different types of position embeddings. Our experiments indicate that longitudinal information significantly improves ranking-based metrics, but not classification metrics. The results demonstrate that the two types of metrics do measure different aspects of ALS prediction performance, and that longitudinal information helps ALST to better estimate the disease progression for a single speaker, which is closer to our goal than predicting the absolute values of the ALS speech scores due to inter-speaker variability in rating standards. However, we observe that both the order-based and the day-based position embeddings degrade the performance relative to using longitudinal information without position embeddings. Designing a better position embedding for longitudinal ALS prediction will be a future research direction.

We also visualize the confusion matrix generated by our best model in terms of macro F1 in Fig. 2(a). ALST performs the best for recordings with an ALSFRS-R speech score of 4, the most frequent score class, and the worst for a score of 0, the rarest score class. Interestingly, most mistakes made by ALST are between similar score classes such as 0 vs. 1 and 3 vs. 4, many of which may be due to individual differences in rating standards among patients. Therefore, macro F1 potentially underestimates the ability of ALST for ALS prediction. Another ablation study is on the effect of relative weights between the classification (CE) and the regression (MSE) losses, shown in Fig. 2(b). Our results suggest that training the model with MSE alone leads to poor performance, and combining CE and MSE losses with $\lambda_{\text{CE}}\approx 0.9$ leads to the best result. The benefit of using the CE loss saturates after $\lambda_{\text{CE}}$ exceeds around 0.6. In Fig. 2(c), we estimate the importance of a given phoneme in ALS prediction. To do so, we leave one segment of that phoneme intact and mask out the rest of the recording as input to a trained ALST (FA) system, and measure the macro F1 of its prediction. While we observe high variance across the phonemes, we can see that vowels such as “UW1” and “OW1” lead to higher macro F1s than consonants such as “T” and “D”.
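The phoneme-importance probe can be illustrated with the masking step alone. The sketch below is an assumption-laden illustration: the text does not say how the rest of the recording is masked, so frames outside the kept segment are simply replaced with a constant fill value, and the function name `keep_one_segment` is hypothetical.

```python
def keep_one_segment(frames, start, end, fill=0.0):
    """Return a copy of `frames` with everything outside [start, end) masked.

    frames: list of frame-level feature vectors (lists of floats).
    start, end: frame range of the single phoneme segment to keep.
    fill: value written into masked frames (ASSUMED; not given in the text).
    """
    return [
        list(f) if start <= t < end else [fill] * len(f)
        for t, f in enumerate(frames)
    ]
```

Running a trained model on such masked inputs, once per phoneme segment, and scoring the predictions gives a per-phoneme macro F1 of the kind reported in Fig. 2(c).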

The final ablation concerns the importance of pretrained speech representations from different encoder layers. We train an ALST and a linear SVM for each layer of the pretrained speech encoders, as shown in Fig. 3. For F1 score (Fig. 3(a)), later layers generally play a bigger role in all three pretrained encoders, whereas for pairwise accuracy (Fig. 3(b)), earlier layers play a bigger role.

5 Conclusions

This paper introduces ALST, a speech-based automatic system for predicting ALS disease progression. ALST stands out for its ability to learn from longitudinal speech data, enabling hassle-free, long-term monitoring tailored to individual ALS patients. Notably, ALST surpasses the existing speech-based system by 5.6% relative AUC. Future work includes a more in-depth comparison with other ALS disease progression models and enhanced ways to process and interpret longitudinal speech data.

6 Acknowledgements

This research was supported by Takeda Development Center Americas, Inc. (successor in interest to Millennium Pharmaceuticals, Inc.). This research would not have been possible without the immense contributions from people living with ALS who have participated for months and years in the ALS Research Collaborative Study. We thank the ALS patient and caregiver community, in particular Augie’s Quest for a Cure, for financially supporting ALS TDI’s data collection work.

References

[1] A. Chiò et al., “Cognitive impairment across ALS clinical stages in a population-based cohort,” Neurology, vol. 93, no. 10, pp. 984–994, 2019.
[2] M. J. Strong et al., “Consensus criteria for the diagnosis of frontotemporal cognitive and behavioural syndromes in amyotrophic lateral sclerosis,” Amyotroph Lateral Scler, vol. 10, no. 3, pp. 131–146, 2009.
[3] J. M. Cedarbaum et al., “The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function. BDNF ALS Study Group (Phase III),” J. Neurol. Sci., vol. 169, pp. 13–21, 1999.
[4] P. Kaufmann, G. Levy, J. L. Thompson, M. L. DelBene, V. Battista, P. H. Gordon, L. P. Rowland, B. Levin, and H. Mitsumoto, “The ALSFRS-R predicts survival time in an ALS clinic population,” Neurology, vol. 64, no. 1, pp. 38–43, 2005. [Online]. Available: https://www.neurology.org/doi/abs/10.1212/01.WNL.0000148648.38313.64
[5] K. Kollewe, U. Mauss, K. Krampfl, S. Petri, R. Dengler, and B. Mohammadi, “ALSFRS-R score and its ratio: A useful predictor for ALS-progression,” Journal of the Neurological Sciences, vol. 275, no. 1-2, pp. 69–73, 2008. [Online]. Available: https://doi.org/10.1016/j.jns.2008.07.016
[6] T. Hothorn and H. H. Jung, “RandomForest4Life: A Random Forest for predicting ALS disease progression,” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 15, no. 5-6, pp. 444–452, 2014.
[7] R. Gomeni and M. Fava, “Amyotrophic lateral sclerosis disease progression model,” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 15, no. 1-2, pp. 119–129, 2014.
[8] H. K. van der Burgh, R. Schmidt, H.-J. Westeneng, M. A. de Reus, L. H. van den Berg, and M. P. van den Heuvel, “Deep learning predictions of survival based on MRI in amyotrophic lateral sclerosis,” NeuroImage: Clinical, vol. 13, pp. 361–369, 2017.
[9] A. Bandini et al., “Kinematic features of jaw and lips distinguish symptomatic from presymptomatic stages of bulbar decline in amyotrophic lateral sclerosis,” J Speech Lang Hear Res, vol. 61, no. 5, pp. 1118–1129, 2018. [Online]. Available: https://pubs.asha.org/doi/10.1044/2018_JSLHR-S-17-0262
[10] H.-J. Westeneng et al., “Prognosis for patients with amyotrophic lateral sclerosis: Development and validation of a personalised prediction model,” Lancet Neurol, vol. 17, pp. 423–433, 2018.
[11] V. Grollemund et al., “Manifold learning for amyotrophic lateral sclerosis functional loss assessment: Development and validation of a prognosis model,” J Neurol, vol. 268, no. 3, pp. 825–850, 2021.
[12] C. Pancotti et al., “Deep learning methods to predict amyotrophic lateral sclerosis disease progression,” Sci Rep, vol. 12, no. 1, p. 13738, 2022. [Online]. Available: https://www.nature.com/articles/s41598-022-17805-9
[13] F. Vieira, S. Venugopalan, A. Premasiri et al., “A machine-learning based objective measure for ALS disease severity,” NPJ Digital Medicine, vol. 5, p. 45, 2022. [Online]. Available: https://doi.org/10.1038/s41746-022-00588-8
[14] M. A. D. A. Jabbar et al., “Predicting amyotrophic lateral sclerosis (ALS) progression with machine learning,” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 0, no. 0, pp. 1–14, 2023.
[15] A. S. Gupta, S. Patel, A. Premasiri et al., “At-home wearables and machine learning sensitively capture disease progression in amyotrophic lateral sclerosis,” Nature Communications, vol. 14, p. 5080, 2023. [Online]. Available: https://doi.org/10.1038/s41467-023-40917-3
[16] E. Tavazzi et al., “Artificial intelligence and statistical methods for stratification and prediction of progression in amyotrophic lateral sclerosis: A systematic review,” Artificial Intelligence in Medicine, vol. 142, p. 102588, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0933365723001021
[17] L. J. Ball, “Timing of speech deterioration in people with amyotrophic lateral sclerosis,” Journal of Medical Speech-Language Pathology, vol. 10, pp. 231–235, 2002.
[18] S. Shellikeri, J. R. Green, M. Kulkarni, P. Rong, R. Martino, L. Zinman, and Y. Yunusova, “Speech movement measures as markers of bulbar disease in amyotrophic lateral sclerosis,” Journal of Speech, Language, and Hearing Research, vol. 59, no. 5, pp. 887–899, 2016.
[19] S. Shellikeri et al., “Digital markers of motor speech impairments in spontaneous speech of patients with ALS-FTD spectrum disorders,” Amyotroph Lateral Scler Frontotemporal Degener, vol. 2023, no. December 5, pp. 1–9, 2023.
[20] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NeurIPS, 2020.
[21] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 29, pp. 3451–3460, 2021.
[22] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
[23] J. Li, K. Song, J. Li, B. Zheng, D. Li, X. Wu, X. Liu, and H. Meng, “Leveraging pretrained representations with task-related keywords for Alzheimer’s disease detection,” in ArXiv, 2023. [Online]. Available: https://arxiv.longhoe.net/pdf/2303.08019.pdf
[24] Y.-W. Chen, Z. Yu, and J. Hirschberg, “multiPA: A multi-task speech pronunciation assessment system for a closed and open response scenario,” in Interspeech, 2023.
[25] Y. Gong, S. Khurana, L. Karlinsky, and J. Glass, “Whisper-AT: Noise-robust automatic speech recognizers are also strong audio event taggers,” in Proc. Interspeech 2023, 2023.
[26] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, “Detecting cognitive decline using speech only: The ADReSSo challenge,” in Interspeech, 2021, pp. 3780–3784.
[27] M. Rohanian, J. Hough, and M. Purver, “Alzheimer’s dementia recognition using acoustic, lexical, disfluency and speech pause features robust to noisy inputs,” in Interspeech, 2021, pp. 3820–3824.
[28] A. Balagopalan and J. Novikova, “Comparing acoustic-based approaches for Alzheimer’s disease detection,” in Interspeech, 2021, pp. 3800–3804.
[29] M. Bowden et al., “A systematic review and narrative analysis of digital speech biomarkers in motor neuron disease,” NPJ Digit Med, vol. 6, no. 1, p. 228, 2023. [Online]. Available: https://www.nature.com/articles/s41746-023-00959-9
[30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015, pp. 5206–5210.
[31] A. Vaswani et al., “Attention is all you need,” in NeurIPS, 2017, pp. 6000–6010.
[32] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015. [Online]. Available: https://arxiv.longhoe.net/pdf/1412.6980.pdf