Liming Wang¹, Yuan Gong¹, Nauman Dawalatabad¹, Marco Vilela², Katerina Placek², Brian Tracey², Yishu Gong², Alan Premasiri³, Fernando Vieira³, James Glass¹

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer

Abstract

Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative to manual approaches. We propose the ALS longitudinal speech transformer (ALST), a neural network-based automatic predictor of ALS disease progression from longitudinal speech recordings of ALS patients. By taking advantage of high-quality pretrained speech features and longitudinal information in the recordings, our best model achieves 91.0% AUC, improving upon the previous best model by 5.6% relative on the ALS TDI dataset. Careful analysis reveals that ALST is capable of fine-grained and interpretable predictions of ALS progression, especially for distinguishing between rarer and more severe cases. Code is publicly available.

1 Introduction

Amyotrophic lateral sclerosis (ALS) is a progressive multisystem neurodegenerative disease characterized by muscular atrophy due to motor neuron degeneration and, in up to 50% of patients, cognitive-behavioral impairment due to frontotemporal neuronal degeneration [1, 2]. More than 5,000 individuals are diagnosed with ALS annually in the US, and the typical lifespan from diagnosis is 2-5 years. As yet, there is no cure for ALS, and treatment options aimed at slowing disease progression or managing symptoms are limited. Precisely predicting ALS disease progression is crucial for early diagnosis and for developing new treatments. The standard-of-care evaluation for monitoring ALS disease progression is the ALS Functional Rating Scale - Revised (ALSFRS-R) [3]. The ALSFRS-R uses clinician evaluation to ascertain ALS symptom severity across various functional domains, including walking and speech. Importantly, a lower ALSFRS-R score at diagnosis, as well as a faster decline on the ALSFRS-R over time, is linked to shorter survival from symptom onset [4, 5].

In this paper, we examine the utility of an automated system to predict ALS disease progression from digitally-collected speech samples, inspired by previous research devoted to automatic ALS disease progression prediction [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. These systems predict patient survival rates or temporal changes in ALSFRS-R scores using machine and deep learning algorithms such as random forests [6], dynamic Bayesian networks [7, 10, 11, 15] and neural networks [12, 14]. They leverage diverse information sources from clinical documents [6, 7, 10, 11, 12, 14] to various sensory signals such as magnetic resonance imaging scans [8], motion [9, 13, 15] and speech [13]. The successful application of these methods could offer a more objective and consistent evaluation of ALS disease progression, potentially reducing the need for extensive human involvement in the process.

Focusing on predicting ALS disease progression from speech has four main advantages. First, the collection of speech samples is economical and straightforward and requires no medical expertise. This makes speech-based systems both more accessible and more cost-effective for tracking ALS disease progression. Second, speech impairments are significant indicators of ALS progression. Speech impairment, such as reduced speaking rate and intelligibility, is a well-known symptom of ALS, especially in bulbar-onset cases, where motor control of the tongue and lips is affected first [17]. Third, speech-based systems can discern latent variables that aid in the prediction of ALS disease progression. For example, prior studies in patients with bulbar-onset ALS indicate that the physiologic biomechanics of speech production [18] as well as the acoustic analysis of speech lead to increased sensitivity to motor speech impairment [19]. Last, recent advances in self-supervised speech representation [20, 21] and large-scale automatic speech recognition systems such as Whisper [22] have exploited the large amount of speech data online to obtain effective features that capture both linguistic and paralinguistic information in speech. Such speech features have demonstrated remarkable transferability in various related tasks such as cognitive disorder detection [23], pronunciation assessment [24] and audio classification [25].

This paper’s main contributions are threefold. First, we redefine ALS progression prediction as a longitudinal sequence prediction problem. Second, we introduce a novel model architecture that utilizes the latest pretrained speech representation models for predicting ALS disease progression on the ALSFRS-R. Lastly, we provide an extensive evaluation, examining the effects of different model features, architectural choices, and training objectives, as well as the impact of various phonemes on model predictions.

2 Related work

Previous studies [13, 19, 15, 26, 27, 28, 23, 29] have shown the potential of machine learning models in predicting the progression of ALS and other neurodegenerative diseases like Alzheimer’s from speech. For ALS, a multi-class classifier was trained to predict ALSFRS-R scores using speech and accelerometer data, with mel spectrograms and a convolutional neural network (CNN) [13]. In addition, [19] has proposed automatic measures for ALS severity based on forced alignment and formant analysis of spontaneous speech. In Alzheimer’s research, various models were developed for detection and progression prediction using different techniques such as decision trees [26], recurrent neural nets [27], wav2vec 2.0 [28] and Whisper [23]. However, these Alzheimer’s models mainly focus on cross-sectional binary classification, unlike the longitudinal, multi-class setting in ALS research.

3 Methods

3.1 Dataset

We use the ALS Therapy Development Institute (ALS TDI) dataset for evaluation, which is part of the ALS Research Collaborative (formerly known as the Precision Medicine Program) [15, 13]. This dataset comprises speech recordings accompanied by self-assessed ALSFRS-R scores from a longitudinal study of ALS patients, primarily based in the U.S. and distributed evenly across states [13]. Participants in the study were asked to repeat the sentence “I owe you a yoyo today” five times in each recording. Additionally, participants provided their ALSFRS-R ratings for 12 different cognitive and behavioral functions at the time of recording. However, our experiments only utilize the ALSFRS-R speech sub-score. Due to missing recordings and labels and the quality of text transcriptions, our experiments focus on a subset of 2,851 voice samples from 425 patients. We divide the dataset into two distinct groups: a training set with 2,124 utterances from 344 patients and a testing set with 727 utterances from 81 patients. The average time since diagnosis is 2.8 years and the average tracking period is 0.8 years, with an average gap of about 50 days between successive recordings from the same patient. About 12% and 88% of the recordings are from bulbar-onset and limb-onset ALS patients respectively, and the average ALSFRS-R score at baseline measurement is 3.3.

3.2 ALS transformer

For each patient, we have a list of utterances $[\mathbf{x}^{1},\cdots,\mathbf{x}^{n}]$ labeled with self-rated ALSFRS-R scores on a 5-point scale, $[y_{1},\cdots,y_{n}]\in\{0,\cdots,4\}^{n}$, and phoneme transcripts $[\mathbf{\phi}^{1},\cdots,\mathbf{\phi}^{n}]$. Further, we also have access to the longitudinal information of the utterances, i.e., the patient identities and recording dates $[d_{1},\cdots,d_{n}]$ of each utterance.

Instead of predicting each ALSFRS-R score independently, we formulate the problem as a sequence prediction problem and jointly predict the whole sequence of self-rated ALSFRS-R scores for each patient. To this end, we propose a novel architecture, ALS transformer (ALST), capable of modeling longitudinal context. As shown in Fig. 1, ALST consists of four main components: a pretrained feature extractor, a phoneme aligner, a longitudinal speech encoder and an ALSFRS-R scorer.

The pretrained feature extractor, either a wav2vec 2.0 [20] or Whisper encoder [22], converts each raw speech waveform into 20ms frame-level features $[\mathbf{z}^{1},\cdots,\mathbf{z}^{n}]$. We then concatenate the speech features for all the utterances of each patient in increasing order of their recording dates and extract the phoneme-level forced alignment boundaries as $[s_{1}^{1},\cdots,s_{m_{1}}^{1},\cdots,s_{1}^{n},\cdots,s_{m_{n}}^{n}]$, where $m_{i}$ is the total number of phonemes in utterance $i$ and $s_{j}^{i}$ is the starting frame index of the $j$-th phoneme in utterance $i$. The phoneme aligner then converts the frame-level features to phoneme-level representations using the forced alignments:

\begin{equation}
\mathbf{z}^{\mathrm{phn},i}_{j}=\frac{1}{s_{j+1}^{i}-s_{j}^{i}}\sum_{s_{j}^{i}\leq t<s_{j+1}^{i}}\mathbf{z}_{t}, \tag{1}
\end{equation}

for $1\leq j\leq m_{i},\,1\leq i\leq n$, with $m_{i}\equiv 1$ for no alignment.
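The segment-wise averaging in Eq. (1) can be illustrated with a minimal pure-Python sketch. The function name `pool_segments` and the convention that the boundary list carries one extra entry marking the end of the last phoneme are assumptions made for illustration and are not taken from the released code.

```python
from typing import List, Sequence


def pool_segments(frames: Sequence[Sequence[float]],
                  starts: Sequence[int]) -> List[List[float]]:
    """Average the frame vectors inside each segment [starts[j], starts[j+1]).

    `starts` holds the starting frame index of every phoneme plus one final
    entry marking the end of the last phoneme (an assumed convention).
    """
    pooled = []
    for begin, end in zip(starts[:-1], starts[1:]):
        if end <= begin:
            raise ValueError("segment boundaries must be strictly increasing")
        segment = frames[begin:end]
        dim = len(segment[0])
        # Mean over the frames of this phoneme, one feature dimension at a time.
        pooled.append([sum(f[k] for f in segment) / len(segment) for k in range(dim)])
    return pooled


# Toy input: 5 frames of dimension 2; two phonemes cover frames [0, 2) and [2, 5).
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 3.0], [0.0, 6.0], [0.0, 9.0]]
phoneme_means = pool_segments(frames, starts=[0, 2, 5])

# Without alignment, the whole utterance is treated as a single segment.
utterance_mean = pool_segments(frames, starts=[0, len(frames)])
```

Treating the unaligned case as one segment spanning the whole utterance mirrors the $m_{i}\equiv 1$ convention stated above.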

The longitudinal speech encoder further contextualizes the phoneme-level features by jointly encoding features across the longitudinal sequence. To this end, it first creates a sequence of trainable positional embeddings using the recording dates and a sequence of phoneme label embeddings:

\begin{align}
\mathbf{e}_{j}^{\mathrm{pos},i} &= \mathrm{Embed}(j)+\mathrm{Embed}(d_{i}), \tag{2}\\
\mathbf{e}^{\mathrm{phn},i}_{j} &= \mathrm{Embed}(\phi^{i}_{j}),\,1\leq j\leq m_{i},\,1\leq i\leq n. \tag{3}
\end{align}

We experiment with two main types of position embeddings for dates, namely, the order embedding to simply encode the ranking order of the dates and the day embedding to encode the number of days of the recording since the diagnosis date of the patient. Next, it combines the phoneme-level speech features and the embeddings using multiple standard transformer blocks:

\begin{align}
\bar{\mathbf{h}}_{j}^{i} &= \mathrm{Transformer}_{j}(\mathrm{Linear}(\mathbf{z}^{\mathrm{phn},i}_{1:m})+\mathbf{e}_{1:m}^{\mathrm{pos},i}) \tag{4}\\
\mathbf{h}_{j}^{i} &= \mathrm{Linear}(\bar{\mathbf{h}}_{j}^{i})+\mathbf{e}_{j}^{\mathrm{phn},i},\,1\leq j\leq m,\,1\leq i\leq n. \tag{5}
\end{align}
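As a small illustration of the two date encodings described above, the sketch below computes the integer index that would be fed to the date embedding for each recording of one patient: the rank of the recording date for the order embedding, and the number of days since diagnosis for the day embedding. The function names and the example dates are hypothetical, and how the index is then turned into a vector (trainable lookup or sinusoid) is left out.

```python
from datetime import date
from typing import List


def order_indices(recording_dates: List[date]) -> List[int]:
    """Rank of each recording among a patient's recordings (0 = earliest)."""
    by_date = sorted(range(len(recording_dates)), key=lambda i: recording_dates[i])
    ranks = [0] * len(recording_dates)
    for rank, i in enumerate(by_date):
        ranks[i] = rank
    return ranks


def day_indices(recording_dates: List[date], diagnosis_date: date) -> List[int]:
    """Number of days between the diagnosis date and each recording."""
    return [(d - diagnosis_date).days for d in recording_dates]


# Hypothetical patient with three recordings, listed out of chronological order.
dates = [date(2021, 3, 1), date(2021, 1, 10), date(2021, 4, 20)]
order_idx = order_indices(dates)                   # ranks by recording date
day_idx = day_indices(dates, date(2020, 12, 31))   # days since diagnosis
```

The order index discards the spacing between visits, whereas the day index keeps it, which is the distinction the two embedding types are meant to probe.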

The ALSFRS-R scorer first converts the hidden encodings from the encoder to the utterance level via mean pooling:

\begin{equation}
\mathbf{z}_{i}^{\mathrm{utt}}=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\mathbf{h}_{j}^{i}. \tag{6}
\end{equation}

Next, it combines the scores sequence with the utterance-level features into two predictions, a continuous estimated score $\hat{y}_{i}$ and an estimated probability over score classes $\hat{\mathbf{p}}_{i}$:

\begin{align}
[\hat{y}_{i},\mathbf{o}_{i}] &= \mathrm{Linear}(\mathbf{z}_{i}^{\mathrm{utt}}), \tag{7}\\
\hat{\mathbf{p}}_{i} &= \mathrm{Softmax}(\mathbf{o}_{i}),\,1\leq i\leq n. \tag{8}
\end{align}
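The scorer in Eqs. (6)-(8) can be sketched in plain Python as follows: phoneme-level encodings are mean-pooled to one utterance vector, a single linear layer produces one regression output plus five class logits, and a softmax turns the logits into class probabilities. The weights below are arbitrary toy values, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equally sized vectors (Eq. 6)."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (Eq. 8)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_utterance(phoneme_encodings: Sequence[Sequence[float]],
                    weight: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> Tuple[float, List[float]]:
    """Return (continuous score, class probabilities) for one utterance (Eq. 7)."""
    z_utt = mean_pool(phoneme_encodings)
    outputs = [sum(w * x for w, x in zip(row, z_utt)) + b
               for row, b in zip(weight, bias)]
    # First output is the regression score, the remaining five are class logits.
    return outputs[0], softmax(outputs[1:])


# Toy example: two phoneme encodings of dimension 2, and a 6 x 2 linear layer.
encodings = [[1.0, 0.0], [0.0, 1.0]]
weight = [[4.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
bias = [0.0] * 6
y_hat, probs = score_utterance(encodings, weight, bias)
```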

The training objective for ALST is then a linear combination of cross entropy and mean-squared error (MSE):

\begin{equation}
\mathcal{L}_{\text{ALST}}=\sum_{i=1}^{n}|\hat{y}_{i}-y_{i}|^{2}-\lambda_{\text{CE}}\log\hat{p}_{i}(y_{i}). \tag{9}
\end{equation}
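A minimal sketch of this objective for one patient is given below, with the cross-entropy term taken as the negative log-probability of the true class. The function name `alst_loss` and the toy inputs are illustrative assumptions.

```python
import math
from typing import Sequence


def alst_loss(y_hat: Sequence[float],
              probs: Sequence[Sequence[float]],
              y_true: Sequence[int],
              lambda_ce: float = 1.0) -> float:
    """Sum over utterances of squared error plus weighted cross entropy."""
    total = 0.0
    for pred, p, y in zip(y_hat, probs, y_true):
        squared_error = (pred - y) ** 2
        cross_entropy = -math.log(p[y])  # negative log-likelihood of the true class
        total += squared_error + lambda_ce * cross_entropy
    return total


# One utterance with true score 4, predicted score 3.5 and P(class 4) = 0.5.
loss = alst_loss(y_hat=[3.5], probs=[[0.1, 0.1, 0.1, 0.2, 0.5]], y_true=[4])
```

Setting `lambda_ce=0.0` recovers a pure regression objective, which is the setting the later ablation on the loss weight compares against.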
Figure 1: Architecture of the proposed ALST. The Whisper encoder is frozen during training.

3.3 Preprocessing

For the pretrained feature extractor, we use either the wav2vec 2.0 (large) [20] pretrained and finetuned on the 960-hour LibriSpeech dataset [30], or the encoder output of Whisper [22]. We have experimented with outputs from all encoder layers for both types of encoders, as we will discuss in the analysis section. For the longitudinal speech encoder, we use a two-layer, single-head transformer [31] with 512 hidden nodes. We set $\lambda_{\text{CE}}=1$ by default and train the ALST using the Adam optimizer [32] with a batch size of 32, 100 warmup steps and an initial learning rate of $10^{-4}$ for 100 epochs. We also use a multi-step learning rate decay starting from the 20th epoch with a step size of 5 and a decay rate of 0.5. On an NVIDIA RTX A5000 GPU, the training takes about 2.2 minutes. All our models have around 7 million parameters. All reported test results are obtained with the models from the last epoch.
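One way to read the learning rate schedule described above is sketched below. Several details are assumptions, since the text does not specify them: the warmup is taken to be linear over optimizer steps, epochs are counted from 0, and the first halving is applied at epoch 20 and then every 5 epochs.

```python
def learning_rate(step: int,
                  epoch: int,
                  base_lr: float = 1e-4,
                  warmup_steps: int = 100,
                  decay_start: int = 20,
                  decay_every: int = 5,
                  decay_rate: float = 0.5) -> float:
    """Learning rate at a given optimizer step and epoch (both 0-indexed)."""
    # Linear warmup (assumed shape) over the first `warmup_steps` updates.
    warmup = min(1.0, (step + 1) / warmup_steps)
    # Multi-step decay: one halving at `decay_start`, then one every `decay_every` epochs.
    if epoch < decay_start:
        n_decays = 0
    else:
        n_decays = (epoch - decay_start) // decay_every + 1
    return base_lr * warmup * decay_rate ** n_decays


lr_mid_warmup = learning_rate(step=49, epoch=0)       # halfway through warmup
lr_before_decay = learning_rate(step=5000, epoch=19)  # full base rate
lr_after_two_decays = learning_rate(step=5000, epoch=25)
```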

3.4 Evaluation metrics

To evaluate the ALSFRS-R speech score classification performance of our models, we report both macro F1, i.e., one-versus-all F1 averaged over classes, and accuracy. To account for inter-patient inconsistency of the self-rated ALSFRS-R speech scores and to assess how well the model captures longitudinal variation in the speech, we also report intra-patient ranking measures such as Spearman’s $\rho$, Kendall’s $\tau$ and the pairwise accuracy score. Using such measures, we can evaluate how well the model ranks the voice samples by the ALS severity of the patient rather than by the absolute values of the scores, which can have large variability among patients. For the pairwise score, we label a pair of scores $(y_{1},y_{2})$, taken in increasing order of their recording dates, as $1$ if $y_{1}>y_{2}$, $-1$ if $y_{1}<y_{2}$ and $0$ otherwise. To assess how well our models provide a more fine-grained, continuous ALS progression measure, we also compute the MSE between predicted and true scores on the test set.
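The pairwise labeling rule and the macro F1 can be made concrete with a short sketch. It assumes that all pairs of recordings from the same patient are compared and that predictions are discrete scores; neither detail is stated in the text, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Sequence


def pair_label(earlier: int, later: int) -> int:
    """1 if the score drops over time, -1 if it rises, 0 if it is unchanged."""
    if earlier > later:
        return 1
    if earlier < later:
        return -1
    return 0


def pairwise_accuracy(true_scores: Sequence[int], pred_scores: Sequence[int]) -> float:
    """Fraction of within-patient pairs whose predicted label matches the true one.

    Both sequences must be sorted by recording date and belong to one patient.
    """
    pairs = list(combinations(range(len(true_scores)), 2))
    if not pairs:
        raise ValueError("need at least two recordings")
    hits = sum(pair_label(true_scores[a], true_scores[b]) ==
               pair_label(pred_scores[a], pred_scores[b]) for a, b in pairs)
    return hits / len(pairs)


def macro_f1(true_labels: Sequence[int], pred_labels: Sequence[int],
             classes: Iterable[int] = range(5)) -> float:
    """One-versus-all F1 averaged over classes (0.0 for a class that never occurs)."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true_labels, pred_labels))
        fp = sum(t != c and p == c for t, p in zip(true_labels, pred_labels))
        fn = sum(t == c and p != c for t, p in zip(true_labels, pred_labels))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy patient with four recordings in chronological order.
true_scores = [4, 4, 3, 2]
pred_scores = [4, 3, 3, 2]
pair_acc = pairwise_accuracy(true_scores, pred_scores)
f1 = macro_f1(true_scores, pred_scores, classes=[2, 3, 4])
```

In this toy case four of the six pairs receive the correct label even though one of the four absolute scores is wrong, which illustrates why the ranking measures and the classification measures can disagree.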

4 Results

Table 1: Overall ALSFRS-R speech score prediction results. denotes results obtained from a larger dataset containing our dataset along with additional voice samples. ALST is used for AUC, while ALST (R) is used for other metrics.

| Features | Scorer | Macro F1 ↑ | AUC ↑ | Spearman $\rho$ ↑ | Kendall $\tau$ ↑ | Pairwise Acc. ↑ | MSE ↓ |
|---|---|---|---|---|---|---|---|
| FBank | CNN [13] | - | .865 | - | - | - | - |
| wav2vec 2.0 | SVM | .578 ± .000 | .906 ± .000 | .445 ± .000 | .428 ± .000 | .751 ± .000 | - |
| wav2vec 2.0 | ALST | .544 ± .075 | .900 ± .011 | .526 ± .061 | .513 ± .060 | .801 ± .017 | .785 ± .009 |
| whisper-medium | SVM | .549 ± .000 | .910 ± .000 | .470 ± .000 | .458 ± .000 | .760 ± .000 | - |
| whisper-medium | ALST | .577 ± .054 | .906 ± .007 | .594 ± .022 | .575 ± .024 | .818 ± .015 | .780 ± .004 |
| whisper-large v2 | SVM | .552 ± .000 | .908 ± .000 | .446 ± .000 | .429 ± .000 | .773 ± .000 | - |
| whisper-large v2 | ALST | .561 ± .032 | .907 ± .013 | .552 ± .053 | .537 ± .053 | .804 ± .019 | .777 ± .011 |

Table 2: ALSFRS-R speech score prediction results vs. different variants of ALSTs. ALST (R) stands for the regression branch of the ALST; ALST (FA) stands for ALST with forced aligned phoneme-level input features and outputs from the regression branch.

| Features | Variant | Macro F1 ↑ | AUC ↑ | Spearman $\rho$ ↑ | Kendall $\tau$ ↑ | Pairwise Acc. ↑ | MSE ↓ |
|---|---|---|---|---|---|---|---|
| wav2vec 2.0 | ALST | .482 ± .037 | .900 ± .011 | .583 ± .029 | .567 ± .031 | .806 ± .015 | - |
| wav2vec 2.0 | ALST (R) | .544 ± .075 | - | .526 ± .061 | .513 ± .060 | .801 ± .017 | .785 ± .009 |
| wav2vec 2.0 | ALST (FA) | .529 ± .028 | .898 ± .007 | .448 ± .045 | .432 ± .044 | .732 ± .013 | .795 ± .005 |
| whisper-medium | ALST | .560 ± .016 | .906 ± .007 | .585 ± .020 | .566 ± .019 | .822 ± .012 | - |
| whisper-medium | ALST (R) | .577 ± .054 | - | .594 ± .022 | .575 ± .024 | .818 ± .015 | .780 ± .004 |
| whisper-medium | ALST (FA) | .571 ± .069 | .907 ± .009 | .469 ± .070 | .451 ± .067 | .731 ± .012 | .783 ± .010 |
| whisper-large v2 | ALST | .486 ± .049 | .907 ± .013 | .470 ± .041 | .458 ± .039 | .799 ± .012 | - |
| whisper-large v2 | ALST (R) | .561 ± .032 | - | .552 ± .053 | .537 ± .053 | .804 ± .019 | .777 ± .011 |
| whisper-large v2 | ALST (FA) | .594 ± .035 | .909 ± .005 | .520 ± .049 | .502 ± .048 | .753 ± .027 | .780 ± .009 |

Table 3: ALSFRS-R speech score prediction results vs. different setups for ALST (R)s with whisper-medium encoders. The setup “longitudinal” uses longitudinal sequence without positional embeddings.

| Setup | Macro F1 ↑ | AUC ↑ | Spearman $\rho$ ↑ | Kendall $\tau$ ↑ | Pairwise Acc. ↑ | MSE ↓ |
|---|---|---|---|---|---|---|
| No longitudinal | .581 ± .017 | .916 ± .003 | .546 ± .081 | .529 ± .080 | .796 ± .022 | .768 ± .005 |
| Longitudinal | .552 ± .008 | .912 ± .010 | .611 ± .046 | .592 ± .045 | .826 ± .015 | .772 ± .009 |
| Order, trainable | .558 ± .017 | .914 ± .011 | .564 ± .083 | .543 ± .081 | .813 ± .014 | .774 ± .009 |
| Order, sinusoid | .520 ± .028 | .912 ± .009 | .561 ± .107 | .543 ± .106 | .805 ± .022 | .771 ± .006 |
| Day, sinusoid | .536 ± .023 | .903 ± .005 | .522 ± .019 | .506 ± .018 | .803 ± .017 | .782 ± .006 |

The overall results of our models are shown in Table 1. We compare ALST with two baseline models: a CNN-based model with mel filterbank (FBank) features as inputs [13] and a linear SVM with the same speech features as the ALST. Models using pretrained speech representations consistently outperform the FBank-based baseline by around 5.6% relative AUC. Among the pretrained speech representations, the whisper-medium encoder performs the best in all the ranking-based metrics, and outperforms the larger whisper-large v2 encoder in all metrics except macro F1. While this might be due to training instability or scaling issues for whisper-large v2, another possible explanation is that whisper-large v2 has filtered out some of the paralinguistic information because it is irrelevant to its original task of ASR.

Our experiments also demonstrate the advantage of nonlinear models such as transformers [31] over linear models for ALS prediction. With the same speech representation, the best ALSTs generally outperform the linear SVMs in all comparable metrics, especially ranking-based metrics, with around 5% relative improvements in pairwise accuracy and around 25% relative improvements in both Spearman $\rho$ and Kendall $\tau$. For macro F1, the Whisper-based ALSTs outperform the SVM by an average of 6.5% relative, while the wav2vec 2.0-based ALSTs perform worse than the SVM by 5.9% relative. Moreover, as shown in Table 2, the regression branch of ALST consistently outperforms the classification branch in macro F1 for all pretrained speech representations, as well as in all the ranking-based metrics for Whisper representations. The use of phoneme segmentation helps whisper-large v2 in macro F1 and AUC, but does not help in ranking-based metrics or with other pretrained speech representations.

Figure 2: (a) Confusion matrix of ALST (R) trained using whisper-medium speech encoders (layer 22); (b) Effect of $\lambda_{\text{CE}}$ on ALSFRS-R speech score prediction performance; (c) Single-phoneme ALS prediction F1 vs. phoneme used for the ALST models with different pretrained speech representations. We merge results for the phonemes ‘AH0’ and ‘UW0’ due to the multiple possible pronunciations of the second phoneme of the word ‘today’.

Figure 3: Score prediction performance vs. layer used for feature extraction of the pretrained speech representation models. (a) Macro F1 vs. layer of the pretrained model; (b) Pairwise accuracy vs. layer of the pretrained model.

To better understand the role of longitudinal information for ALS prediction, we experiment with different setups, including using no longitudinal information, using longitudinal information without position embeddings, and using it with different types of position embeddings. Our experiments indicate that longitudinal information significantly improves ranking-based metrics, but not classification metrics. The results demonstrate that the two types of metrics do measure different aspects of ALS prediction performance, and that longitudinal information helps ALST to better estimate the disease progression for a single speaker. This is closer to our goal than predicting the absolute values of the ALS speech scores, given the inter-speaker variability in rating standards. However, we observe that both the order-based and the day-based position embeddings degrade the performance obtained using longitudinal information without position embeddings. Designing a better position embedding for longitudinal ALS prediction will be a future research direction.

We also visualize the confusion matrix generated by our best model in terms of macro F1 in Fig. 2(a). ALST performs the best for recordings with an ALSFRS-R speech score of 4, the most frequent score class, and the worst for a score of 0, the rarest score class. Interestingly, most mistakes made by ALST are between similar score classes such as 0 vs. 1 and 3 vs. 4, many of which may be due to individual differences in rating standards among patients. Therefore, macro F1 potentially underestimates the ability of ALST for ALS prediction. Another ablation study is on the effect of relative weights between the classification (CE) and the regression (MSE) losses, shown in Fig. 2(b). Our results suggest that training the model with MSE alone leads to poor performance, and combining CE and MSE losses with $\lambda_{\text{CE}}\approx 0.9$ leads to the best result. The benefit of using the CE loss saturates after $\lambda_{\text{CE}}$ exceeds around 0.6. In Fig. 2(c), we estimate the importance of a given phoneme in ALS prediction. To do so, we leave one segment of that phoneme intact and mask out the rest of the recording as input to a trained ALST (FA) system, and measure the macro F1 of its prediction. While we observe high variance across the phonemes, we can see that vowels such as “UW1” and “OW1” lead to higher macro F1s than consonants such as “T” and “D”.
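The single-phoneme probing procedure can be sketched as follows. The text does not say how the rest of the recording is masked, so replacing the masked frames with zero vectors is an assumption here, as are the function name `keep_one_segment` and the boundary convention with one extra end entry.

```python
from typing import List, Sequence


def keep_one_segment(frames: Sequence[Sequence[float]],
                     starts: Sequence[int],
                     keep: int) -> List[List[float]]:
    """Keep the frames of phoneme `keep` and zero out every other frame.

    `starts` lists the starting frame of each phoneme plus a final end index.
    """
    begin, end = starts[keep], starts[keep + 1]
    dim = len(frames[0])
    return [list(f) if begin <= t < end else [0.0] * dim
            for t, f in enumerate(frames)]


# Toy recording with 5 frames and two phonemes covering [0, 2) and [2, 5).
frames = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
masked = keep_one_segment(frames, starts=[0, 2, 5], keep=1)
```

The masked recording would then be passed through the trained model, and the macro F1 over all test recordings probed with the same phoneme gives the per-phoneme score.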

The final ablation concerns the importance of pretrained speech representations from different encoder layers. We train an ALST and a linear SVM on each layer of the pretrained speech encoders, as shown in Fig. 3. For F1 score (Fig. 3(a)), later layers generally play a bigger role in all three pretrained encoders, whereas for pairwise accuracy (Fig. 3(b)), earlier layers play a bigger role.
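The layer-wise procedure amounts to fitting one probe per encoder layer and comparing the resulting scores. The sketch below shows that loop only: to stay dependency-free it swaps in a nearest-centroid classifier for the linear SVM, scores each layer with macro F1, and uses hypothetical one-dimensional toy features, so it is not the configuration behind Fig. 3.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in probe (NOT a linear SVM): assign each test vector to the
    class whose mean training vector is closest in squared distance."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda y: dist(x, centroids[y])) for x in test_x]


def probe_layers(train_feats, train_y, test_feats, test_y):
    """train_feats[l] and test_feats[l] hold the feature vectors of layer l.
    Returns one macro F1 per layer."""
    return [
        macro_f1(test_y, nearest_centroid_predict(tr, train_y, te))
        for tr, te in zip(train_feats, test_feats)
    ]


# Hypothetical toy features: layer 0 carries no label information,
# layer 1 separates the two classes.
train_y, test_y = [0, 0, 1, 1], [0, 1]
train_feats = [[[0.0], [1.0], [0.0], [1.0]], [[0.0], [0.1], [1.0], [0.9]]]
test_feats = [[[0.2], [0.8]], [[0.2], [0.8]]]
scores = probe_layers(train_feats, train_y, test_feats, test_y)
print(scores)
```

Replacing `macro_f1` with a ranking metric in `probe_layers` would give the pairwise-accuracy counterpart of the same per-layer comparison.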

5 Conclusions

This paper introduces ALST, a speech-based automatic system for predicting ALS disease progression. ALST stands out for its ability to learn from longitudinal speech data, enabling hassle-free, long-term monitoring tailored to individual ALS patients. Notably, ALST improves upon the previous best speech-based system by 5.6% relative AUC. Future work includes a more in-depth comparison with other ALS disease progression models and improved ways to process and interpret longitudinal speech data.

6 Acknowledgements

This research was supported by Takeda Development Center Americas, INC. (successor in interest to Millennium Pharmaceuticals, INC.). This research would not have been possible without the immense contributions from people living with ALS who have participated for months and years in the ALS Research Collaborative Study. We thank the ALS patient and caregiver community, in particular Augie’s Quest for a Cure, for financially supporting ALS TDI’s data collection work.

References

  • [1] A. Chiò et al., “Cognitive impairment across ALS clinical stages in a population-based cohort,” Neurology, vol. 93, no. 10, pp. 984–994, 2019.
  • [2] M. J. Strong et al., “Consensus criteria for the diagnosis of frontotemporal cognitive and behavioural syndromes in amyotrophic lateral sclerosis,” Amyotroph Lateral Scler, vol. 10, no. 3, pp. 131–146, 2009.
  • [3] J. M. Cedarbaum et al., “The ALSFRS-R: a revised ALS functional rating scale that incorporates assessments of respiratory function. BDNF ALS Study Group (Phase III),” J. Neurol. Sci., vol. 169, pp. 13–21, 1999.
  • [4] P. Kaufmann, G. Levy, J. L. Thompson, M. L. DelBene, V. Battista, P. H. Gordon, L. P. Rowland, B. Levin, and H. Mitsumoto, “The ALSFRS-R predicts survival time in an ALS clinic population,” Neurology, vol. 64, no. 1, pp. 38–43, 2005. [Online]. Available: https://www.neurology.org/doi/abs/10.1212/01.WNL.0000148648.38313.64
  • [5] K. Kollewe, U. Mauss, K. Krampfl, S. Petri, R. Dengler, and B. Mohammadi, “ALSFRS-R score and its ratio: A useful predictor for ALS-progression,” Journal of the Neurological Sciences, vol. 275, no. 1-2, pp. 69–73, 2008. [Online]. Available: https://doi.org/10.1016/j.jns.2008.07.016
  • [6] T. Hothorn and H. H. Jung, “RandomForest4Life: A Random Forest for predicting ALS disease progression,” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 15, no. 5-6, pp. 444–452, 2014.
  • [7] R. Gomeni and M. Fava, “Amyotrophic lateral sclerosis disease progression model,” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 15, no. 1-2, pp. 119–129, 2014.
  • [8] H. K. van der Burgh, R. Schmidt, H.-J. Westeneng, M. A. de Reus, L. H. van den Berg, and M. P. van den Heuvel, “Deep learning predictions of survival based on MRI in amyotrophic lateral sclerosis,” NeuroImage: Clinical, vol. 13, pp. 361–369, 2017.
  • [9] A. Bandini et al., “Kinematic features of jaw and lips distinguish symptomatic from presymptomatic stages of bulbar decline in amyotrophic lateral sclerosis,” J Speech Lang Hear Res, vol. 61, no. 5, pp. 1118–1129, 2018. [Online]. Available: https://pubs.asha.org/doi/10.1044/2018_JSLHR-S-17-0262
  • [10] H.-J. Westeneng et al., “Prognosis for patients with amyotrophic lateral sclerosis: Development and validation of a personalised prediction model,” Lancet Neurol, vol. 17, pp. 423–433, 2018.
  • [11] V. Grollemund et al., “Manifold learning for amyotrophic lateral sclerosis functional loss assessment: Development and validation of a prognosis model,” J Neurol, vol. 268, no. 3, pp. 825–850, 2021.
  • [12] C. Pancotti et al., “Deep learning methods to predict amyotrophic lateral sclerosis disease progression,” Sci Rep, vol. 12, no. 1, p. 13738, 2022. [Online]. Available: https://www.nature.com/articles/s41598-022-17805-9
  • [13] F. Vieira, S. Venugopalan, A. Premasiri et al., “A machine-learning based objective measure for ALS disease severity,” NPJ Digital Medicine, vol. 5, p. 45, 2022. [Online]. Available: https://doi.org/10.1038/s41746-022-00588-8
  • [14] M. A. D. A. Jabbar et al., “Predicting amyotrophic lateral sclerosis (ALS) progression with machine learning,” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, vol. 0, no. 0, pp. 1–14, 2023.
  • [15] A. S. Gupta, S. Patel, A. Premasiri et al., “At-home wearables and machine learning sensitively capture disease progression in amyotrophic lateral sclerosis,” Nature Communications, vol. 14, p. 5080, 2023. [Online]. Available: https://doi.org/10.1038/s41467-023-40917-3
  • [16] E. Tavazzi et al., “Artificial intelligence and statistical methods for stratification and prediction of progression in amyotrophic lateral sclerosis: A systematic review,” Artificial Intelligence in Medicine, vol. 142, p. 102588, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0933365723001021
  • [17] L. J. Ball, “Timing of speech deterioration in people with amyotrophic lateral sclerosis,” Journal of Medical Speech-Language Pathology, vol. 10, pp. 231–235, 2002.
  • [18] S. Shellikeri, J. R. Green, M. Kulkarni, P. Rong, R. Martino, L. Zinman, and Y. Yunusova, “Speech movement measures as markers of bulbar disease in amyotrophic lateral sclerosis,” Journal of Speech, Language, and Hearing Research, vol. 59, no. 5, pp. 887–899, 2016.
  • [19] S. Shellikeri et al., “Digital markers of motor speech impairments in spontaneous speech of patients with ALS-FTD spectrum disorders,” Amyotroph Lateral Scler Frontotemporal Degener, vol. 2023, no. December 5, pp. 1–9, 2023.
  • [20] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NeurIPS, 2020.
  • [21] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 29, pp. 3451–3460, 2021.
  • [22] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
  • [23] J. Li, K. Song, J. Li, B. Zheng, D. Li, X. Wu, X. Liu, and H. Meng, “Leveraging pretrained representations with task-related keywords for Alzheimer’s disease detection,” in ArXiv, 2023. [Online]. Available: https://arxiv.longhoe.net/pdf/2303.08019.pdf
  • [24] Y.-W. Chen, Z. Yu, and J. Hirschberg, “multiPA: A multi-task speech pronunciation assessment system for a closed and open response scenario,” in Interspeech, 2023.
  • [25] Y. Gong, S. Khurana, L. Karlinsky, and J. Glass, “Whisper-AT: Noise-robust automatic speech recognizers are also strong audio event taggers,” in Interspeech, 2023.
  • [26] S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, “Detecting cognitive decline using speech only: The ADReSSo challenge,” in Interspeech, 2021, pp. 3780–3784.
  • [27] M. Rohanian, J. Hough, and M. Purver, “Alzheimer’s dementia recognition using acoustic, lexical, disfluency and speech pause features robust to noisy inputs,” in Interspeech, 2021, pp. 3820–3824.
  • [28] A. Balagopalan and J. Novikova, “Comparing acoustic-based approaches for Alzheimer’s disease detection,” in Interspeech, 2021, pp. 3800–3804.
  • [29] M. Bowden et al., “A systematic review and narrative analysis of digital speech biomarkers in motor neuron disease,” NPJ Digit Med, vol. 6, no. 1, p. 228, 2023. [Online]. Available: https://www.nature.com/articles/s41746-023-00959-9
  • [30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015, pp. 5206–5210.
  • [31] A. Vaswani et al., “Attention is all you need,” in NeurIPS, 2017, pp. 6000–6010.
  • [32] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015. [Online]. Available: https://arxiv.longhoe.net/pdf/1412.6980.pdf