WavRx: a Disease-Agnostic, Generalizable, and Privacy-Preserving Speech Health Diagnostic Model

Yi Zhu, and Tiago Falk Authors are with INRS. E-mail: [email protected] received XXX.

Abstract

Speech is known to carry health-related attributes, which has emerged as a novel venue for remote and long-term health monitoring. However, existing models are usually tailored for a specific type of disease, and have been shown to lack generalizability across datasets. Furthermore, concerns have been raised recently towards the leakage of speaker identity from health embeddings. To mitigate these limitations, we propose WavRx, a speech health diagnostics model that captures the respiration and articulation related dynamics from a universal speech representation. Our in-domain and cross-domain experiments on six pathological speech datasets demonstrate WavRx as a new state-of-the-art health diagnostic model. Furthermore, we show that the amount of speaker identity entailed in the WavRx health embeddings is significantly reduced without extra guidance during training. An in-depth analysis of the model was performed, thus providing physiological interpretation of its improved generalizability and privacy-preserving ability.

Index Terms:

Health embeddings, speech, diagnostics, generalizability, privacy-preserving

I Introduction

During recent years, speech has emerged as a promising modality for disease diagnosis and remote health monitoring. Speech health diagnostics is typically based on the assumption that diseases causing abnormalities in articulatory and/or respiratory systems would lead to an atypical pattern in the human voice signal [1]. Such abnormality could be due to a variety of reasons, such as impaired neuromuscular control or an inflammation in the vocal tract and lungs [1]. While the impact on the speech signal may sometimes be imperceptible to humans, a machine learning (ML) model could be trained to detect certain disease-related vocal biomarkers.

Over the years, there has been a substantial body of work that has explored the use of speech processing for diagnostics, including but not limited to COVID-19 [2], dysarthria [3], Parkinson’s [4] and Alzheimer’s disease [5], as well as many other general respiratory symptoms [6]. Several challenges related to speech-based health diagnostics have also emerged, showing the impact of deep learning [7]. Despite these many published works, very few systems exist commercially or are used in real-world settings. There may be several reasons for this. First, existing system architectures are usually tailored for a single type of disease, i.e., are disease-dependent. While disease-related biomarkers can be well captured by the models, other health attributes are likely to be overlooked. For example, systems that focus on speech intelligibility may be useful at diagnosing dysarthria, but may fail at detecting COVID-19 infections. As recent innovations in self-supervised learning (SSL) are showing [8, 9, 10], it is possible to learn universal representations that can be used across many different downstream tasks [11]. The same is desirable for health diagnostics, where the same architecture can be applied to a variety of diseases (i.e., disease-agnostic) where only downstream fine-tuning is needed.

Second, it is expected that a well-trained diagnostic model will generalize well across datasets that share the same or similar pathology. However, recent studies have reported severe degradation in performance across several state-of-the-art diagnostic systems when tested on unseen data from different patients with the same disease [12]. This has been attributed to different confounding factors (e.g., noise level, gender) generated unwarily during data collection [13, 14]. These factors could lead to models, especially deep learning-based ones, to overfit to a certain database property (e.g., changes in sampling frequency for different disease labels [15]) and not necessarily to diagnostic information. The lack of generalizability makes the reliability of existing models questionable and further exacerbates the criticism around the lack of explainability and the “black-box” aspect of deep neural networks.

Lastly, since voice carries personal identity attributes, such as gender, age, and race [16], uploading the voice signal to an online platform for model training and evaluation is dangerous, especially considering the rapidly growing voice cloning techniques [17]. One method to alleviate this privacy concern is to extract a speech representation locally, then upload only the representation itself. However, studies have shown that health information is likely to entangle with speaker identity in most widely used speech representations (e.g., openSMILE features, ECAPA-TDNN embeddings, and universal representations) [18, 19, 20], suggesting that existing health representations still suffer from speaker leakage. While some privacy-preserving methods have been proposed as an alternative, including adversarial training [21] and voice anonymization [22], such methods may alter the speech signal, thus potentially removing disease-discriminatory details; this was recently shown to be the case for COVID-19 detection [18].

To tackle these three major limitations, in this paper we propose a new speech health diagnostic model that is disease-agnostic, generalizable across datasets, and privacy-preserving. The proposed model, termed WavRx, is built on top of the well-known WavLM model [9] and incorporates a novel modulation dynamics module, which mixes the high-resolution temporal WavLM representation with the long-term modulation dynamics of speech. While the WavLM representation can carry both linguistic and paralinguistic attributes [9] at a $50\text{\,}\mathrm{H}\mathrm{z}$ rate, these attributes focus more on transient temporal changes. Articulation and respiration related abnormalities, on the other hand, may modulate these short-time features at a much lower rate. As such, the proposed modulation dynamics block is designed to capture long-term variability and to provide complementary information to the temporal details.

Our main contributions in this paper can be summarized as follows:

1.

We propose a new speech health diagnostics model, WavRx, that mixes the universal temporal representation with long-term modulation dynamics. WavRx is tested on six different pathological speech datasets, spanning four different speech pathologies, and achieves state-of-the-art (SOTA) performance on 4 out of 6, with the highest average performance among six benchmark models.
2.

We show that the modulation dynamics block, while being parameter-free, can significantly improve the overall model generalizability across datasets and diseases that share similar symptoms.
3.

We demonstrate that the modulation dynamics block helps to markedly remove the speaker attributes from the health embeddings learned by WavRx, without the need for extra guidance during training.
4.

We find that the health embeddings learned from the dynamics representation are twice more sparse than from the temporal representation, which helps to remove disease-irrelevant information.

II Related work

II-A Speech-based diagnostic models

Earlier works have focused on knowledge-based features to characterize the underlying speech pathology. Besides conventional speech features, such as mel-spectrograms or mel-frequency cepstral coefficients (MFCCs), studies have examined a wide variety of features associated with health status. The openSMILE ComParE set [23], for example, has been used as a baseline across several challenges, such as the 2021 COVID-19 detection challenge [24], the 2017 cold&snoring recognition challenge [25], and the 2012 pathology sub-challenge that predicts speech intelligibility for individuals that received cervical cancer surgeries [24].

Other studies have proposed features designed specifically for certain types of diseases, such as phonation and articulation features for Parkinson’s disease [26, 27], linear prediction (LP) based features for COVID-19 [28], and voice quality features for depression [29], just to name a few. These hand-crafted features aim to capture certain aspects of the speech signal affected by the disease using signal processing techniques. The engineered features are then fed into classical ML classifiers, such as support vector machine or random forest classifiers. Major advantages of hand-crated features are that they provide some explainability and interpretability (e.g., LP residuals represent vocal cord vibration patterns), are suitable for small datasets, which are typically the case in healthcare settings, and tend to generalize better across datasets [30].

More recently, models based on deep learning (DL) have started to burgeon [7]. These models typically take as input the speech waveform or some variant of the spectrogram (e.g., a mel-scaled spectrogram) and learn the underlying biomarkers via a data-driven approach. DL models are usually designed for one specific type of disease. For example, convolutional (CNN) and recurrent (RNN) networks have been used for COVID-19 detection using cough, speech, and breathing signals as input [31, 32, 33]. These models were trained from scratch using a limited amount of data, hence their power is yet to be fully explored. To address this issue, some studies have investigated transfer learning with large-size pre-trained models, such as the VGGish networks [34], ECAPA-TDNN [35], and audio transformers [36]. Studies have shown that pre-training on out-of-domain data (e.g., image datasets, speaker verification datasets, audio events) could also benefit speech diagnostics performance [6, 37, 38, 39].

While pre-training is usually conducted in a supervised manner, there have been some initial attempts to leverage SSL pre-trained models for diagnostics [20, 40, 41]. The underlying assumption is that the universal speech representation resultant from the models, such as Wav2vec [10], carries a variety of speech information, including linguistic, paralinguistic, and diagnostic information [10, 8, 9]. It has been shown that self-supervised pre-training is less biased by the upstream datasets than by supervised training [42], thus making universal speech representations an excellent candidate for diagnostic tasks.

However, while most existing works have taken different universal representations directly as the feature input to downstream diagnostic classifiers (e.g., [20, 40, 41]), we argue that existing universal representations are suboptimal for diagnostics tasks due to two main reasons. First, SSL models, such as Wav2vec [10], WavLM [9], and HuBERT [8], aggregate the input waveform into short segments by a convolutional layer before feeding into the transformer layers. In the case of WavLM, the receptive field of each unit in the CNN output is around $20\text{\,}\mathrm{m}\mathrm{s}$ , similar to the frame lengths used in the mel-spectrogram. While this is short enough to capture linguistic content (e.g., phonemes) as well as other temporal details (e.g., speaker details), more longer-term dynamics, such as speaking rate, respiration, and emotions, may not be well encoded. This corroborates with the improved performance achieved by appending different downstream layers to the temporal representation, such as 1D CNNs ([43]) and RNNs ([44]). Second, the existence of the linguistic content in temporal representation may bias the diagnostic models, as the disease biomarkers should be independent of spoken content. Given these limitations, we propose a new representation that also captures long-term dynamics, following the widely-used concept of the speech modulation spectrum, but instead, applied to a universal speech representation.

II-B Modulation dynamics of speech

Speech is produced by the vibration of vocal cords, the vibration is then transmitted through the vocal tract and modulated by the articulatory movement and respiration, generating the speech hearable by humans [45, 46]. Typical speech analysis focuses on short-time analysis to capture transient changes caused by changes in phonemes. For example, the window size for the short-time Fourier transform (STFT) is usually $8\text{\,}$ to $32\text{\,}\mathrm{m}\mathrm{s}$ [47]. Speech modulation, in turn, changes at a much lower rate due to the limit of human physiology. For articulatory movement, for instance, between 2 and 10 syllables are being uttered per second for most of the languages [48]. However, such underlying modulation is not well captured by a spectrogram and measures such as delta and double-delta cepstral parameters have been used for decades as measures of velocity and acceleration of changes in the cepstral parameters over somewhat larger window durations.

To address this issue, several researchers have relied on the so-called modulation spectrum (e.g., [49, 50]), which applies a second STFT to each frequency component obtained from the spectrogram. This extends the conventional spectrogram to a 3-dimensional space with an added modulation frequency axis. With a window size of over $128\text{\,}\mathrm{m}\mathrm{s}$ , the modulation spectrum analyzes the hidden periodicity of human speech. While most of the linguistic content is lost in the modulation frequency domain, other vocal characteristics such as speaking rate [48], vocal hoarseness [51], and whispering [52] may be better manifested. Features derived from such representation have been previously applied in the detection of dysarthric speech [53], whispered speech [52], voice pathologies [51], COVID-19 [28], and emotional speech [54], to name a few. Motivated by the idea of the modulation spectrum, we here applied the modulation transformation to the universal representations to better capture their health-related attributes.

III Proposed model architecture

The proposed WavRx comprises three main components: (1) a pre-trained encoder to extract temporal representations from the raw waveform; (2) the modulation dynamics block to capture long-term dynamics of the encoded temporal representations; and (3) attentive statistic pooling and output layers to fuse representations from the previous two blocks and generate a final decision. Details about each component are described in the following subsections.

The model architecture is depicted in Figure 1. Considering the privacy requirement in real-world applications, WavRx encoder can be deployed locally to extract health embeddings, which are then uploaded to a central cloud server for decision-making. In later sections, we show that the health embeddings entail minimal speaker identity information, hence preventing the leakage of user identity. Our code is made publicly available on GitHub¹¹1https://github.com/zhu00121/WavRx. Owing to the data sharing terms of the employed datasets, pre-trained model backbones are released upon requests.

Refer to caption — Figure 1: Architecture of the proposed WavRx model.

III-A Temporal representation encoder

The proposed model builds on top of the pre-trained WavLM as the temporal representation encoder [9]. WavLM takes a raw speech waveform as input and firstly feeds it into a CNN block comprised 7 temporal CNN layers with 512 channels, cascaded by layer normalization and a GELU activation layer. Each time step in the output from CNN block represents $25\text{\,}\mathrm{m}\mathrm{s}$ of audio with $20\text{\,}\mathrm{m}\mathrm{s}$ hop length. The CNN output is then sent into a transformer backbone, which comprises 13 layers with 768-dimensional hidden states. We employed the WavLM Base+ version²²2HuggingFace link: https://huggingface.co/microsoft/wavlm-base-plus. Accessed May 23rd, 2024. which was pre-trained on 60K hours of Libri-light [55], 10K hours of Gigaspeech [56], and 24K hours of VoxPopuli [57].

Previous studies have shown that later transformer layers in WavLM carry more linguistic content, while early layers are likely to encode paralinguistic information [58]. For diagnostics, it remains unclear which layers are more crucial. Hence, we aggregated outputs from all 12 layers (with the first input layer excluded) by assigning weights to each of them. These weights were learned through supervised training on downstream tasks. The layer-weighted output from WavLM encoder is a time by feature representation $\{\mathbf{T}\times\mathbf{F}\}$ , which can be seen as a temporal representation showing how each feature changes over time. Given the temporal pooling configurations of the CNN layers, the resultant temporal representation has a temporal resolution of $50\text{\,}\mathrm{H}\mathrm{z}$ . However, speech production is modulated at a lower rate and the temporal representation may carry redundant linguistic information that is less essential for disease diagnosis. Thus, we proposed the modulation dynamics block to provide complementary information that is missing from the temporal representation.

III-B Modulation dynamics block

A visual demonstration of the modulation dynamics block is provided in Fig. 2. Given an output $\mathbf{T(m,n)}$ from WavLM (i.e., the weight sum of twelve transformer layer outputs), where $\mathbf{m}$ represents the number of time windows and $\mathbf{n}$ represents the number of features, we applied a short-time Fourier transform (STFT) to each feature channel, leading to a 3-dimensional modulation dynamics representation $\mathbf{D_{n}(j,f_{j})}$ :

\mathbf{D_{n}(j,f_{j})}=\left|\mathcal{STFT}\mathbf{(T(m,n))}\right|^{2},

(1)

where $\mathbf{j}$ refers to the number of time frames used for STFT and $\mathbf{f_{j}}$ to the number of modulation frequency channels. The results of the STFT include both real and imaginary parts; here, we keep only the real part by taking the absolute value operation (denoted by $|\cdot|$ ) and calculate the power.

For phoneme-level speech applications (e.g., speech recognition), the STFT usually relies on short time windows (e.g., $16\text{\,}$ - $32\text{\,}\mathrm{m}\mathrm{s}$ ) [47], enabling the temporal resolution high enough to discriminate transitory events. The articulatory movement and respiration, on the other hand, are relatively steady and change at a much lower rate than the vibration of vocal cords. Therefore, we extended the window length to $\geq$ $128\text{\,}\mathrm{m}\mathrm{s}$ with a hop length $\geq$ $32\text{\,}\mathrm{m}\mathrm{s}$ to capture the dynamics at a wider range. To achieve the optimal performance, we experimented window length values from $128\text{\,}\mathrm{m}\mathrm{s}$ to $1\text{\,}\mathrm{s}$ with 25% hop length, and found the best to be around $256\text{\,}\mathrm{m}\mathrm{s}$ . The effects of window sizes are detailed in Section V.C. The resultant dynamics has three different axes, namely feature, time, and modulation frequency, where each slice along the time axis carries the decomposed modulation frequency values for all features.

III-C Downstream components

Similar to speaker embeddings, we assume that the health embeddings correspond to utterance-level characteristics. Thus, both temporal and dynamics representations require a temporal pooling operation to obtain the time-invariant embeddings. We compared average pooling and attentive statistic pooling (ASP) and found the latter to be a better suited for the task at hand. The original ASP aims to integrate the frame-level attention when calculating mean and standard deviation as follows:

\mu=\sum_{t}^{T}\alpha_{t}h_{t},

(2)

\sigma=\sqrt{\sum_{t}^{T}\alpha_{t}h_{t}\odot h_{t}-\mu\odot\mu},

(3)

where $\alpha_{t}$ represents the weight assigned to the $t$ th time frame.

The ASP can be used directly on temporal features to flatten them into a 1-dimensional vector. With modulation dynamics, we first computed the average along the time axis, which leads to the shape $\{\mathbf{Freq}\times\mathbf{F}\}$ , where $\mathbf{Freq}$ stands for frequency and $\mathbf{F}$ for features. We then applied attention to different frequency channels, and calculated the attentive mean and standard deviation. The temporal and dynamics vectors were firstly concatenated then fed into a fully-connected (FC) layer to map into a 768-dimension vector, which was used as the health embedding. A dropout layer and a LeakyReLU with the negative slope of 0.1 were appended after. The second FC layer maps the health embeddings to a single value as the final decision. Additionally, we applied pruning on top of the last FC layer, where the percentage of neurons to be pruned was set as a hyperparameter.

IV Experimental setup

IV-A Dataset

To diversify the types of speech pathologies to be tested, we used six publicly available datasets covering four different speech-related abnormalities. Since they all differ in data collection procedures and some were originally designed for other purposes (e.g., ASR), we here outline the details for each dataset regarding the data collection procedure, data composition and demographics, and data partitions for our downstream ML tasks.

IV-A1 Respiratory Symptoms Datasets

Respiratory symptoms refer to the symptoms induced by infections in the respiratory system, such as coughs, fever, sore throat, etc. [59]. The appearance of respiratory symptoms is commonly seen with asthma, obstructive pulmonary disease (COPD), and pneumonia, just to name a few. At the time of writing, the largest publicly available speech database with various respiratory symptoms is the COVID-19 Sounds [6]. It contains a total of $552\text{\,}\mathrm{h}\mathrm{o}\mathrm{u}\mathrm{r}\mathrm{s}$ of audio data recorded remotely from 36,116 individuals around the globe via an app interface. During data collection, volunteers were prompted to conduct three tasks: (1) scripted speech, where all participants uttered the same sentence – ‘I hope my data can help to manage the virus pandemic’ – three times in their mother tongue; (2) voluntary cough for three times; and (3) deep breathing through the mouth for three to five minutes. In addition, they also self-reported their COVID-status along with certain metadata information (e.g., gender, age, pre-existing medication condition, respiratory symptoms). In our study, we used only the speech signals and the metadata. It should also be emphasized that not all participants had conducted a PCR test before recording, hence the COVID-status was in the form of a subjective evaluation (e.g., ‘I think never had COVID-19’) rather than a binary label (i.e., positive vs negative). Such ambiguity in COVID-19 labels motivated us to use this database for respiratory abnormality detection instead of COVID-19 prediction.

Although the COVID-19 sounds database is advantageous in its size, it may not be the optimal version to train a diagnostics model considering that multiple factors were not controlled, such as language, sampling rate, or acoustic environment. Hence, we set up two subsets from the original database by screening out several potential confounding factors. The first subset was released along with the original database, which was used as the benchmark data for the respiratory symptom prediction task in the COVID-19 Sounds paper [6]. This subset is henceforth referred to as CS-Res. CS-Res contains English samples from 6,623 individuals with respiratory symptoms (e.g., sore throat, cough, etc.), resulting in a total of $31.3\text{\,}\mathrm{h}$ speech data. The sampling rates varied upon different devices used, with the majority sampled at $44.1\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ (67.4%) and $16\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ (29.8%). CS-Res was carefully curated so that the recording quality and class balance were controlled. The second subset is similar to CS-Res (in that only English samples are used) but without controlling for the other factors. This subset is referred to as CS-Res-L, with a total of $123.1\text{\,}\mathrm{h}$ of speech, of which 57.1% were sampled at $16\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ and 40.4% at $44.1\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ and the rest (2.5%) were sampled at $8\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ and $12\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ . For both subsets, participants were labelled into two classes, namely the positive ones who reported at least one respiratory symptom, and the negative ones reporting no symptoms at all. With CS-Res, we followed the official partitions as described in [6]. With CS-Res-L, a customized speaker-independent split was performed with a ratio of 7:1:2 (train:validation:test). Meanwhile, we ensured that the distribution of symptom labels, gender, and age were similar in all three splits.

IV-A2 DiCOVA2 Dataset

This dataset contains speech data used in the Second Diagnosing COVID-19 using Acoustics challenge organized in India [32]. DiCOVA2 collected multi-modal acoustic data (i.e., speech, cough, and breathing) remotely from a total of 965 participants via Android and Web apps. Participants were advised to keep the device $10\text{\,}\mathrm{c}\mathrm{m}$ from their mouth during recording. For the speech track, participants did number counting from 1 to 20 in a normal pace in English. The recordings were sampled at $48\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ . Furthermore, participants self-reported their metadata, such as gender, experienced symptoms, and COVID-19 status which was grouped into binary labels (either positive or negative). Since the test labels were not made accessible to the public, we used the validation data as the new test set, and partitioned the original training data into the new training set and validation set (8:2).

IV-A3 TORGO Dataset

This dataset consists of speech recordings and synchronized 3D articulatory features collected from healthy controls and speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), the two most prevalent causes of dysarthria [60]. TORGO was originally designed to develop ASR models for dysarthric individuals. The publicly available version of TORGO includes 8 individuals with dysarthria and 7 healthy controls. During data collection, all subjects were asked to read English text from the screen. The speech data were recorded from two microphones, one facing the participant at a distance of $61\text{\,}\mathrm{c}\mathrm{m}$ with a sampling rate of $22.1\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ while the other is head-mounted with a sampling rate of $44.1\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ . Only the data from the front-facing microphone were employed herein. All subjects conducted four different reading tasks: (1) non-words (e.g., high- and low-pitch vowels); (2) short words (e.g. ‘yes’, ‘no’, ‘back’, etc.); (3) restricted sentence (e.g., “The quick brown fox jumps over the lazy dog”); (4) unrestricted sentence (e.g., spontaneously describe 30 images from the Webber Photo Cards). We included data from all four tasks in our analysis. As there were no official data partitions, we followed the speaker-independent principle to split all 15 subjects into three sets¹¹1‘F’ and ‘M’ stand for female and male; ‘C’ stands for healthy controls.: (1) training set (‘FC02’,‘F03’,‘F01’,‘MC04’,‘MC03’,‘M02’); (2) validation set (‘MC02’,‘FC01’,‘M03’,‘M01’); and (3) test set (‘FC03’,‘F04’,‘MC01’,‘M05’,‘M04’). The average dysarthria severity was made similar for all three sets.

IV-A4 Nemours Dataset

This is a collection of speech recordings from 12 males, 11 with different levels of dysarthria and 1 healthy control [61]. Each participant was asked to record 74 nonsense sentences of the form “The $X$ is $Y$ ing the $Z$ .” ( $X\neq Z$ ). Sentences were generated by randomly selecting $X$ and $Z$ without replacement from a set of 74 monosyllabic nouns and selecting $Y$ without replacement from a set of 37 disyllabic verbs. All recordings were collected in a small sound dampened room with one table-mounted microphone, and digitized subsequently using a $16\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ sampling rate. Apart from recording sessions, Nemours also included a perception session where 5 listeners tried to identify the words of the nonsense sentences. The average number of correct identifications was calculated per speaker and the Frenchay speaker assessment scores were reported, which reflects the severity of dysarthria. The average assessment score of the dysarthric speakers is 74.68 with a standard deviation of 14.54. We labelled all speakers into two classes, namely the relatively severe individuals with scores lower than 74.68 (6 dysarthria speakers), and the mild ones with scores higher than 74.68 (5 dysarthria speakers plus 1 healthy control).

IV-A5 NCSC Dataset

This refers to the “NKI CCRT Speech Corpus” [62]. NCSC contains speech recordings and perceptual evaluations of 55 speakers (10 female and 45 male), who underwent concomitant chemo-radiation treatment (CCRT) for cancer of the head and neck region. Recordings and evaluations were made at three moments: (1) before CCRT; (2) 10-weeks after CCRT; and (3) 12-months after CCRT. All subjects read a 189-word passage from a Dutch fairy tale in a sound-treated room. Speech data were collected using a microphone with a $44.1\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ sampling rate at a distance of $30\text{\,}\mathrm{c}\mathrm{m}$ from mouth. 13 speech pathologists rated the intelligibility of these speech recordings on a scale of 1 to 7. We employed the NCSC data released by the INTERSPEECH 2012 Pathology Sub-Challenge [63], where all recordings were labelled either as ‘intelligible’ or ‘non-intelligible’, and were split into three independent sets for model training and evaluation. However, since the test labels were not accessible to the public, we used the validation set as the new test set and split the original training set into the new training and validation set with a ratio of 8:2.

An overview of the data set-up can be found in Table I. For reproducibility, we also report if the data split was official or customized, and if a baseline model was released together with the dataset. We further release all data partition details in our code repository for future comparisons. Note that issues were seen with a few samples during our exploratory analysis, such as an empty recording or failures during loading. The file names of these recordings can be found here³³3TORGO/FC01/Session1/wav_arrayMic/0256.wav is an empty recording; Failed to load Nemours/RL/WAV/JPRL39.WAV with torchaduio.. These error files were discarded in our experiments.

TABLE I: Employed pathological speech datasets. For reproducibility, we also report if the data split was official and if a baseline model was released together with the dataset, which are indicated by the ‘Official’ and ‘Baseline’ columns. Pos/Neg represents the positive to negative ratio.

Pathology	Dataset	Lang	#hours	#spk	#utt	ave_dur ( $\text{\,}\mathrm{s}$ )	Data split					Baseline
Pathology	Dataset	Lang	#hours	#spk	#utt	ave_dur ( $\text{\,}\mathrm{s}$ )	Official	Pos/Neg	Train	Valid	Test	Baseline
Resp symptom	CS-Res	EN	31.3	6,623	9,456	11.93 $\pm$ 4.66	✓	1.05	6,648	1,914	894	✓
Resp symptom	CS-Res-L	EN	123.1	24,134	37,140	11.94 $\pm$ 4.97	✗	0.78	22,308	7,969	3,863	✗
COVID-19	DiCOVA2	EN	3.93	975	975	14.33 $\pm$ 4.15	✓	0.20	617	154	193	✓
Dysarthria	TORGO	EN	8.1	15	9,417	3.09 $\pm$ 2.13	✗	0.51	4,564	1,753	3,100	✗
Dysarthria	Nemours	EN	1.5	12	1,628	3.35 $\pm$ 2.79	✗	0.38	1,184	148	296	✗
Chemo treated	NCSC	NL	1.4	55	1647	3.13 $\pm$ 1.74	✓	1.27	701	200	746	✓

IV-B Benchmark models

As mentioned in Section IV, some challenge datasets were released with a baseline model, namely mel-spectrogram+VGG16 for CS-Res [6], mel-spectrogram deltas+BiLSTM for DiCOVA2 [32], and openSMILE+RandomForest for NCSC [63]. Though performing well on one dataset, studies have shown that these models lack generalizability across datasets, even within the same type of disease [12]. For simplicity, we group the best performance reported by these baseline models in one row (bottom row in Table III). Recent work has reported better performance achieved with larger speech models, such as TDNN and transformer-based ones [35, 39].

In our study, we compared WavRx to five state-of-the-art speech classification baselines, namely two that leverage SSL encoders, including Wav2vec [10] and Hubert [8], two different versions of AST pre-trained with speech and audio data respectively [36] (denoted as AST_speech and AST_audio), and ECAPA-TDNN [35]. Modifications were made to these baseline models for compatibility with our tasks. The same ASP layer and classification head implemented in WavRx were appended to Wav2vec and Hubert encoders, and a single FC layer was applied to ECAPA-TDNN embeddings to map these pre-trained representations to a binary output. AST_speech and AST_audio were already compatible with our tasks, hence no modifications were made. Three versions of WavRx were compared, namely the original version fusing temporal and dynamics information, and two simplified versions removing either one of the two branches. Details about the baseline models can be found in their corresponding references.

IV-C Tasks

To test cross-disease, cross-dataset, and privacy-preserving properties of the proposed method, three tasks are proposed. A fourth task is also included to enhance interpretability. These four tasks include:

IV-C1 Task 1 – In-domain diagnostic

This task aims to compare the proposed WavRx to the other baseline models in an in-domain setting. Models were trained and evaluated within each of the six datasets. An ablation study is also conducted to demonstrate the effects of different model components of WavRx.

IV-C2 Task 2 – Zero-shot diagnostic

This task investigates the model generalizability in a stringent setting, where models were trained on one dataset and made predictions on unseen datasets. During inference, both the health embedding encoder and classification head were fixed. This task emulates a scenario where no training data is available from the target domain (e.g., an unseen disease).

IV-C3 Task 3 – Privacy of health embeddings

This task examines if the speaker identity is concealed in the WavRx health embeddings by running an automatic speaker verification (ASV) task on top. Since ASV requires multiple recordings from each single individual, TORGO (15 speakers) and Nemours (10 speakers) were selected for this task. With each individual, 10% of the speech samples were used for training and the remaining 90% were used for testing. We first extracted the health embeddings using the pre-trained WavRx from Task 1, then applied LDA as the speaker classifier. The WavLM model fine-tuned on Voxceleb 1&2 [64] was used as the baseline speaker embedding encoder for comparison purposes.

IV-C4 Task 4 – Analysis/interpretability of the modulation dynamics block

Previous tasks have quantified the changes in diagnostic performance, generalizability, and speaker privacy when integrating the modulation dynamics block. This task aims to explore the reason behind these changes by analyzing the characteristics of modulation dynamics and how it shaped the information learned by the upstream WavLM encoder.

IV-D Training and evaluation details

For training efficiency, we limited all input recordings to be within $10\text{\,}\mathrm{s}$ by cutting off the over-length part. For those with left and right channels, we took the average to obtain a single-channel audio. All recordings were re-sampled to $16\text{\,}\mathrm{k}\mathrm{H}\mathrm{z}$ and the amplitude was normalized between -1 and 1. Since the STFT operation in the modulation block requires a minimum of $1\text{\,}\mathrm{s}$ -signal, short audios were zero-padded to $1\text{\,}\mathrm{s}$ . The aforementioned pre-processing was achieved using the Torchaudio library [65].

Regarding data augmentation, we injected two types of environmental corruptions in each training batch, namely noise and reverberation, and concatenated the augmented samples with the original samples. Furthermore, we added speed perturbations by slightly speeding up (105%) and down (95%) the signal. These approaches were implemented via the SpeechBrain toolkit [66].

We used the same hyperparameters for training WavRx on all six datasets, changing only the data augmentation and pruning parameters. These hyperparameters are reported in Table II. Data augmentation was only used when trained on DiCOVA2 and TORGO; the optimal pruning percentage was set to 90% for DiCOVA2 and NCSC, and 0% for the others. With the baseline models, we employed the same data augmentation methods used to train WavRx, and tuned the hyperparameters separately for each one of them.

Diagnostic performance is measured by two metrics, namely the area under the receiver operating curve (AUC-ROC) and the F1 score. The former has been used widely in disease detection tasks as a baseline metric [32, 6]. However, AUC-ROC has been shown to be over-optimistic when evaluating on extremely imbalanced datasets [67]. F1, on the other side, is more robust in an imbalanced setting. With both metrics, we calculated for each class and took the unweighted mean (i.e., macro). This is because positive samples (i.e. symptomatic) are usually the minority class, but the missed prediction of a positive sample is more disastrous than that of a negative sample. Hence, the macro average is more suitable than the weighted average. Furthermore, we found that a model could perform decently on the test set but poorly on the validation set (or vice versa). As such, we report F1 scores achieved with both test and validation sets, where the difference between these two can indicate the model robustness.

Experiments were conducted on the Compute Canada platform [68] with four NVIDIA V100-SXM2 ( $32\text{\,}\mathrm{G}\mathrm{B}$ RAM per GPU). The training time with WavRx was approximately 3-4 hours for CS-Res and CS-Res-L, and less than 2 hours for the other datasets (excluding job waiting time). The shell scripts are also provided in our code repository for simpler replication.

TABLE II: Optimal hyperparameters set for WavRx.

Category	Hyper-parameter	Adopted value
Training	Batch size	1
	Learning rate scheduler	Linear
	lr_start	$1e^{-}4$
	lr_end	$1e^{-}5$
	Epochs	30
	Optimizer	AdamW
	Early-stop	✓
	limit_start	2
	limit_stop	3
Model	STFT window size	$256\text{\,}\mathrm{m}\mathrm{s}$
	STFT hope length	$64\text{\,}\mathrm{m}\mathrm{s}$
	N_fft	400
	Window type	Hamming
	Pad type	Zero-padding
	FC dropout	0.25
	Pruning percentage^‡	90%
Data aug^†	Prob_noise	1
	Prob_reverb	1
	SNR_min	$0\text{\,}\mathrm{d}\mathrm{B}$
	SNR_max	$15\text{\,}\mathrm{d}\mathrm{B}$
	Speed_min	0.95
	Speed_max	1.05

•

^‡Pruning was used with DiCOVA2 and NCSC. ^†Data augmentation was applied when training on DiCOVA2 and TORGO.

V Results and Discussion

V-A Task 1: In-domain diagnostic performance

In-domain diagnostics usually indicates the highest performance that can be achieved by each model in an ideal setting, where training and evaluation data share the same distribution. As shown in Table III, the proposed WavRx obtains the highest test F1 scores in 4 out of 6 datasets, along with the highest average F1 score of 0.744 (combining test and validation) among all models. With the three datasets that were released with official baseline systems (i.e., CS-Res, DiCOVA2, and NCSC), WavRx markedly outperforms the baselines. When using only the modulation dynamics branch for detection, while the overall performance is not competitive as other benchmarks, it is shown to be the top-performer in the Nemours dysarthria detection task. Together, these results suggest that the dynamics of universal representations is crucial for disease detection.

When comparing different model categories, SSL models (i.e., WavRx, Wav2vec, Hubert) in general outperform those pre-trained in a supervised manner (i.e., AST, ECAPA-TDNN), though both did not include pathological speech during the pre-training. This again demonstrates the benefits of SSL pre-training when evaluated on a variety of downstream tasks. Interestingly, as the only backbone that was pre-trained not on speech data, the AST_audio outperforms its speech version. Since existing speech foundation models are usually trained with only speech data, potential improvement might be achieved when adding audio data to the pre-training stage, such as music and other non-speech acoustic events.

TABLE III: Comparison of model performance on six speech diagnostics datasets. Note that only CS-Res, DiCOVA2, and NCSC had official baselines. ‘ROC’ corresponds to AUC-ROC; ‘F1_$t$’ refers to test F1 score and ‘F1_$v$’ to validation F1 score. The last ’Ave’ column is the average of F1_$t$ and F1_$v$ across all datasets. For all three metrics, higher values suggest better performance. Highlighted values represent the best performing model (s) for the metric.

Model	Respiratory abnormality						COVID-19			Dysarthria						Cancer			Ave
	CS-Res			CS-Res-L			DiCOVA2			TORGO			Nemours			NCSC
	ROC	F1_t	F1_v	ROC	F1_t	F1_v	ROC	F1_t	F1_v	ROC	F1_t	F1_v	ROC	F1_t	F1_v	ROC	F1_t	F1_v
WavRx	\cellcolorlightgray.815	\cellcolorlightgray.730	\cellcolorlightgray.725	\cellcolorlightgray.694	\cellcolorlightgray.655	\cellcolorlightgray.624	\cellcolorlightgray.878	\cellcolorlightgray.600	.524	.918	.767	\cellcolorlightgray.756	.939	.959	.946	\cellcolorlightgray.774	\cellcolorlightgray.737	.910	\cellcolorlightgray.744
WavRx_mod	.740	.645	.650	.620	.571	.504	.801	.550	.466	.810	.659	.636	\cellcolorlightgray.961	\cellcolorlightgray.980	.932	.753	.716	.804	.676
WavRx_tem	.807	.721	.720	.691	.649	.594	.861	.589	.478	.918	.768	\cellcolorlightgray.756	.855	.872	.916	.735	.695	.910	.722
Wav2vec	.798	.712	.707	.682	.640	.591	.841	.576	.480	.827	.677	.624	.945	.966	\cellcolorlightgray.959	.759	.721	.905	.713
Hubert	.796	.711	.707	.689	.645	.592	.829	.568	.479	\cellcolorlightgray.931	\cellcolorlightgray.782	.687	.843	.858	.973	.705	.666	\cellcolorlightgray .928	.717
AST_speech	.683	.582	.595	.610	.554	.507	.738	.510	\cellcolorlightgray.571	.762	.611	.612	.727	.736	.676	.636	.598	.804	.613
AST_audio	.722	.625	.661	.613	.548	.584	.539	.383	.462	.770	.620	.603	.902	.919	.703	.639	.610	.822	.628
ECAPA-TDNN	.687	.571	.638	.582	.523	.555	.755	.498	.543	.636	.487	.499	.640	.636	.637	.703	.663	.712	.580
Baselines	.695	.594	.620	$-$	$-$	$-$	.817	.561	.544	$-$	$-$	$-$	$-$	$-$	$-$	.699	.658	.710	$-$

V-B Task 1: Ablation study

Next, we carefully examined the improvements brought by the different components of WavRx. Figure 3 shows the improvement in F1 scores averaged across six datasets with different design choices. The largest improvement was seen when all layer outputs were used instead of relying on the last single layer. This corroborates existing SSL model layer analyses, which suggest that later layers likely encode speech semantics while a higher percentage of paralinguistic attributes (e.g., speaker identity, emotion, prosody) are encoded in the early and middle layers [69, 9, 10, 70, 71]. For health diagnostics, the importance of utterance-level characteristics is expected to outweigh frame-level details, since the respiration and articulation patterns do not change from frame to frame. Hence, relying on only the last layer output is suboptimal for health diagnostic tasks. We also explored different WavLM backbone versions, with some further fine-tuned for other tasks, such as ASR and ASV. However, no major difference was seen compared to the raw backbone. The dropout rate is also shown to be important, which helps with model generalization. Since data augmentation is not a major focus of this study, we explored only adding noise and reverberation to the waveforms, which led to minor improvements. Lastly, once other components were optimized, addition of the modulation dynamics branch markedly boosted overall performance, demonstrating its complementarity to the temporal SSL representation.

V-C Task 2: Zero-shot diagnostic performance

When applied in real-world settings, the amount of data collected from one disease is usually quite limited, as can be seen from the size of the existing pathological speech datasets [6, 32, 61, 25, 63, 24]. Hence, it can be beneficial when a diagnostic model can generalize to unseen diseases with similar symptoms or pathological origins. As the top-performers in Task 1, we systematically tested WavRx as well as its two individual branches in a cross-dataset setting. Table IV reports the AUC-ROC scores achieved for the model trained on one disease and tested across unseen diseases, as well as the average over all unseen diseases. When comparing different test diseases, respiratory abnormality is shown as the pathology that is distinct from the others, which can be seen from the lowest AUC-ROC score (bottom row in each sub-table). The two dysarthric speech datasets, on the other hand, can lead to decent generalization to each other, although the speech content and data collection protocols differ. Models trained with dysarthric speech can also benefit the detection of COVID-19, as well as chemo-treated speech, which indicates that neuromuscular deficiency can be a shared characteristic among these three pathologies. When comparing the three sub-tables, significant improvements can be seen for all five pathologies when combining modulation dynamics with temporal embeddings. Together with Task 1 results, findings here suggest that integrating modulation dynamics of universal representations can help capture the disease-related biomarkers and improve the model generalizability to diseases sharing similar pathological origins.

TABLE IV: Cross-disease zero-shot prediction performance using different representations. Values reported are AUC-ROC scores. For each train-test disease combination, the most generalizable representation is color-shaded. Scores without significant difference between the three or below chance-level are ignored. The diagonal values represent the in-domain diagnostic performance.

		Test set					Ave
		Resp	COVID	Dys-1	Dys-2	Cancer	Ave
Dyn+Tem	Resp	.815	.369	.489	.836	.554	.613
	COVID	.493	.878	.567	.693	.454	.617
	Dys-1	.504	.684	.918	.984	.690	.756
	Dys-2	.542	.659	.763	.989	.614	.713
	Chemo	.478	.504	.638	.878	.774	.654
	Ave	.566	.619	.675	.876	.617
Dynamics	Resp	.700	.447	.652	.708	.631	.628
	COVID	.498	.827	.635	.934	.346	.648
	Dys-1	.510	.759	.821	.978	.631	.740
	Dys-2	.533	.798	.750	.998	.495	.715
	Chemo	.490	.337	.419	.391	.753	.647
	Ave	.546	.634	.655	.802	.571
Temporal	Resp	.721	.598	.568	.756	.552	.639
	COVID	.492	.861	.580	.783	.437	.631
	Dys-1	.522	.600	.916	.993	.679	.742
	Dys-2	.550	.406	.682	.968	.563	.634
	Chemo	.495	.378	.598	.760	.746	.595
	Ave	.556	.569	.669	.852	.595

V-D Task 3: Do WavRx health embeddings carry speaker identities?

Given the system shown in Fig. 1, the health embeddings encoded by the local model are expected to carry minimal speaker identity attributes while maximally representing the health information. In this task, we investigate if the health embeddings encoded by the temporal representation alone carry speaker identities, and if the modulation dynamics block can help tackle this issue. The speaker verification accuracies and diagnostic AUC-ROC scores are shown side-by-side in Table V. Ideally, privacy-preserving health embeddings should have a low ASV accuracy and a high diagnostic AUC-ROC score. As can be seen from row 4 and 7, when relying on only the temporal representation, the learned health embeddings carry a higher amount of speaker identity information than the baseline speaker embeddings. This is likely because that pathological speech follow a different feature distribution than the healthy speech in Voxceleb, hence leading to suboptimal performance of the pre-trained speaker embeddings. The health embeddings obtained from the temporal representation may encode both speaker identity and health attributes, therefore resulting in high ASV accuracies. The modulation dynamics representation, on the other hand, decreases the ASV accuracies by an average rate of 31.9% and 13.5% for TORGO and Nemours respectively (row 3 and 6). When fusing the two branches together, the resultant health embeddings lead to the best diagnostic performance, while maintaining the leakage of speaker identity at a lower level than the baseline speaker embeddings (row 2 and 5).

We further visualize the health embeddings learned from temporal and dynamics representations, which are shown, respectively, in Fig. 4. Colors represent different speakers and marker types represent disease states. While in both plots, the positive and negative classes can be well separated apart, a more clear distinction between speaker clusters can be seen in the left plot (temporal) than the right one (modulation dynamics), suggesting that speaker identities are better concealed by the proposed modulation dynamics representation.

TABLE V: Speaker verification accuracy and diagnostic AUC-ROC scores obtained by different representations. For ideal health embeddings, we expect lower speaker accuracy and higher diagnostic score.

Representation (Model_{finetuned dataset})	TORGO		Nemours
Representation (Model_{finetuned dataset})	ACC_spk $\downarrow$	AUC_Diag $\uparrow$	ACC_spk $\downarrow$	AUC_Diag $\uparrow$
WavLM_Voxceleb	.715	-	.951	-
WavRx_TORGO	.711	.918	.898	.984
WavRx-dynamics_TORGO	.602	.821	.831	.978
WavRx-temporal_TORGO	.902	.916	.990	.991
WavRx_Nemours	.609	.763	.873	.989
WavRx-dynamics_Nemours	.594	.750	.847	.998
WavRx-temporal_Nemours	.857	.682	.955	.967

V-E Task 4: Modulation dynamics analysis and interpretability

While the modulation dynamics branch is shown to improve the diagnostic performance and generalizability, it is crucial to investigate the characteristics of such representation to understand the reasons behind the improvements. To this end, we start by extracting the modulation dynamic representations from both positive and negative classes, then compute the Fisher’s F-ratio [72] between the two groups. Since the representation is 2-dimensional (feature by modulation frequency), the F-ratio is calculated per pixel, where the higher value suggests more discrimination between two classes. We further filtered out F-ratio values below 1 since those regions were statistically insignificant. This process was repeated for all six datasets. The F-ratio plots for all tasks can be seen in Fig. 5.

With the given hop length of the STFT ( $64\text{\,}\mathrm{ms}$ ), the maximal modulation frequency is $8.3\text{\,}\mathrm{Hz}$ with the resolution of $0.125\text{\,}\mathrm{Hz}$ . For all six datasets, the majority of the difference is observed below $2\text{\,}\mathrm{Hz}$ , with peaks seen between $0.1\text{\,}$ - $0.5\text{\,}\mathrm{Hz}$ , corresponding to a $2\text{\,}\mathrm{t}$ o $5\text{\,}\mathrm{s}$ -period modulation. Such slow rate of modulation aligns with our initial hypothesis that long-term dynamics of universal representations are crucial for disease detection. While the physiological origin of such modulations still needs to be investigated, it is likely to be associated with slower respiratory and articulatory movement. For example, the automatic contraction of respiratory muscles has been shown to take place once every five seconds during dialogues [73, 74]; an average of 15-25 breathing cycles per minute (equivalent to $2.4\text{\,}$ to $4\text{\,}\mathrm{s}$ per cycle) has been reported for adults and the elderly [75].

Another important phenomena noticed is the sparsity of the F-ratio plots, where only very few features among a total of 768 are shown with statistical significance. Based on this observation, we further calculated the sparsity of the 768-dimensional health embeddings learned from dynamic representations and compared with those learned from temporal representations. The sparsity values below 1% of the per-sample-maximum were thresholded to zeros. The final results are reported in Table VI. As can be seen, it is found that the health embeddings learned from temporal representations have an average sparsity of 35.8% across six datasets with a standard deviation of 9.1% across samples, while the average sparsity doubles to 76.7% with only 0.8% standard deviation for those learned from dynamic representations. Fusing the two together leads to an average sparsity of 64.1%. Findings here demonstrate that disease-related information is encoded more efficiently by the modulation dynamics, where roughly only half of the features are required for accurately detecting a disease. This not only provides insights into the reasons behind the improved generalizability across diseases, but also helps explain the improved privacy-preserving property of the proposed WavRx model. When learning the health embeddings from the fused representations, health-irrelevant information was likely discarded, which may include speaker attributes, such as gender and age.

TABLE VI: Sparsity of health embeddings learned for each dataset. Sparsity is calculated after thresholding the embedding values.

Embedding	Sparsity						Average
Embedding	Cam-Res	Cam-Res-L	DiCOVA2	TORGO	Nemours	NCSC	Average
Temporal	33.6 $\pm$ 5.6	48.0 $\pm$ 5.0	45.7 $\pm$ 10.8	28.3 $\pm$ 8.9	38.8 $\pm$ 16.3	20.6 $\pm$ 8.2	35.8 $\pm$ 9.1
Dynamics	88.5 $\pm$ 0.7	94.2 $\pm$ 0.1	86.6 $\pm$ 0.9	65.9 $\pm$ 1.3	49.4 $\pm$ 1.2	75.4 $\pm$ 0.8	76.7 $\pm$ 0.8
Combined	72.8 $\pm$ 1.7	76.6 $\pm$ 1.9	64.2 $\pm$ 2.4	58.0 $\pm$ 1.4	61.2 $\pm$ 1.9	52.0 $\pm$ 1.6	64.1 $\pm$ 1.8

V-F Task 4: Layer analysis

Similar to a group of studies which performed layer analysis on SSL models for speech applications [9, 10, 70, 71, 69], we investigated the impact of modulation dynamics block on the learned layer weights. Figure 6 compares the layer weights learned with and without the modulation dynamics block. As seen, using only the temporal representation, early layers (0 to 5) are shown to be more crucial, where similar patterns were reported for speaker and emotion recognition tasks [9, 10, 70, 71, 69]. After adding the modulation dynamics, weights are found to shift from early layers to middle layers, with peaks typically seen between layer 6 to 8. Meanwhile, later layers (layer 8 to 11) were also assigned with higher weights. Recent works have suggested that very early layers encode speaker identities [70], while middle layers were found most useful for predicting articulation trace [76]. Combined with our findings, it is likely that modulation dynamics guided the model to focus on articulation-related attributes rather than speaker identities, which led to such shift in layer weights and to the privacy-preserving property observed.

V-G Limitations and Future Study

One potential limitation of our evaluation is the existence of confounding factors in the employed datasets, which have been reported previously [13]. Though we have carefully partitioned the datasets so that groups with reported metadata labels are balanced, there might be other underlying factors, such as the noise level, which could lead to over-optimistic results. Furthermore, while the proposed WavRx obtained SOTA performance on majority of the tasks, the robustness to unseen conditions (e.g., in-the-wild speech) can be further improved. This can be seen from the lowest in-domain diagnostic performance achieved with COVID-19 detection, where speech samples were collected in an uncontrolled setting. Meanwhile, recent works have shown the potential of using speech for mental disease detection [77]. While our study did not include such datasets for evaluation, the mechanism of our proposed model could enable its usage for other pathological conditions.

VI Conclusion

In this study, we describe a novel speech health diagnostic model termed WavRx, by integrating modulation dynamics with a universal speech representation. Our proposed model achieves SOTA performance on five out of six pathological speech datasets and demonstrate zero-shot generalizability across diseases sharing similar symptoms. Furthermore, the leakage of speaker identities is significantly decreased after adding the innovated modulation dynamics block, thus providing the model with privacy-preserving properties needed in healthcare. An in-depth analysis of the modulation dynamic representation shows that low-frequency modulations below $2\text{\,}\mathrm{Hz}$ are crucial to discriminate pathological samples. Sparsity and layer analyses help further explain the reasons behind the improvements seen in generalizability and privacy-preserving abilities. In general, WavRx demonstrates generalizability across diseases while minimizing the leakage of speaker identities, hence can be established as a new benchmark model for health diagnostic tasks.

Disclaimer and Acknowledgement

The authors would like to acknowledge organizations and research groups that made their datasets available to the public. The data holders do not bear any responsibility for the analysis and results presented in this paper. All results and interpretation only represent the view of the authors. Authors also acknowledge funding from INRS, NSERC, and CIHR.

References

[1] G. Fagherazzi, A. Fischer, M. Ismael, and V. Despotovic, “Voice for health: the use of vocal biomarkers from research to clinical practice,” Digital biomarkers, vol. 5, no. 1, pp. 78–88, 2021.
[2] G. Deshpande and B. Schuller, “An overview on audio, signal, speech, & language processing for COVID-19,” arXiv preprint arXiv:2005.08579, 2020.
[3] J. Millet and N. Zeghidour, “Learning to detect dysarthria from raw speech,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5831–5835.
[4] A. S. Gullapalli and V. K. Mittal, “Early detection of parkinson’s disease through speech features and machine learning: A review,” ICT with Intelligent Applications: Proceedings of ICTIS 2021, Volume 1, pp. 203–212, 2022.
[5] A. König, A. Satt, A. Sorin, R. Hoory, O. Toledo-Ronen, A. Derreumaux, V. Manera, F. Verhey, P. Aalten, P. H. Robert et al., “Automatic speech analysis for the assessment of patients with predementia and alzheimer’s disease,” Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, vol. 1, no. 1, pp. 112–124, 2015.
[6] T. Xia, D. Spathis, J. Ch, A. Grammenos, J. Han, A. Hasthanasombat, E. Bondareva, T. Dang, A. Floto, P. Cicuta et al., “Covid-19 sounds: a large-scale audio dataset for digital respiratory screening,” in Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2), 2021.
[7] N. Cummins, A. Baird, and B. W. Schuller, “Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning,” Methods, vol. 151, pp. 41–54, 2018.
[8] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[9] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[10] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020.
[11] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., “Superb: Speech processing universal performance benchmark,” arXiv preprint arXiv:2105.01051, 2021.
[12] Y. Zhu, A. Mariakakis, E. De Lara, and T. H. Falk, “How generalizable and interpretable are speech-based COVID-19 detection systems?: A comparative analysis and new system proposal,” in 2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 2022, pp. 1–5.
[13] Y. Zhu, M. Imoussaıne-Aıkous, C. Côté-Lussier, and T. H. Falk, “Investigating biases in covid-19 diagnostic systems processed with automated speech anonymization algorithms,” in Proc. 3rd Symposium on Security and Privacy in Speech Communication, 2023, pp. 46–54.
[14] G. Schu, P. Janbakhshi, and I. Kodrasi, “On using the ua-speech and torgo databases to validate automatic dysarthric speech classification approaches,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[15] H. Coppock, A. Akman, C. Bergler, M. Gerczuk, C. Brown, J. Chauhan, A. Grammenos, A. Hasthanasombat, D. Spathis, T. Xia et al., “A summary of the compare COVID-19 challenges,” Frontiers in Digital Health, vol. 5, p. 1058163, 2023.
[16] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. MüLler, and S. Narayanan, “Paralinguistics in speech and language—state-of-the-art and the challenge,” Computer Speech & Language, vol. 27, no. 1, pp. 4–39, 2013.
[17] D. Dagar and D. K. Vishwakarma, “A literature review and perspectives in deepfakes: generation, detection, and applications,” International journal of multimedia information retrieval, vol. 11, no. 3, pp. 219–289, 2022.
[18] Y. Zhu, M. Imoussaïne-Aïkous, C. Côté-Lussier, and T. H. Falk, “On the impact of voice anonymization on speech diagnostic applications: a case study on COVID-19 detection,” IEEE Transactions on Information Forensics and Security, 2024.
[19] F. Javanmardi, S. Tirronen, M. Kodali, S. R. Kadiri, and P. Alku, “Wav2vec-based detection and severity level classification of dysarthria from speech,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[20] X.-Y. Chen, Q.-S. Zhu, J. Zhang, and L.-R. Dai, “Supervised and self-supervised pretraining based covid-19 detection using acoustic breathing/cough/speech signals,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 561–565.
[21] V. Ravi, J. Wang, J. Flint, and A. Alwan, “A step towards preserving speakers’ identity while detecting depression via speaker disentanglement,” in Interspeech, vol. 2022. NIH Public Access, 2022, p. 3338.
[22] A. Das, S. Ghosh, T. Polzehl, and S. Stober, “Stargan-vc++: Towards emotion preserving voice conversion using deep embeddings,” arXiv preprint arXiv:2309.07592, 2023.
[23] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462.
[24] B. W. Schuller, A. Batliner, C. Bergler, C. Mascolo, J. Han, I. Lefter, H. Kaya, S. Amiriparian, A. Baird, L. Stappen et al., “The interspeech 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates,” arXiv preprint arXiv:2102.13468, 2021.
[25] B. Schuller, S. Steidl, A. Batliner, E. Bergelson, J. Krajewski, C. Janott, A. Amatuni, M. Casillas, A. Seidl, M. Soderstrom et al., “The interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring,” in Computational Paralinguistics Challenge (ComParE), Interspeech 2017, 2017, pp. 3442–3446.
[26] T. Arias-Vergara, J. C. Vásquez-Correa, and J. R. Orozco-Arroyave, “Parkinson’s disease and aging: analysis of their effect in phonation and articulation of speech,” Cognitive Computation, vol. 9, pp. 731–748, 2017.
[27] J. C. Vásquez-Correa, J. Orozco-Arroyave, T. Bocklet, and E. Nöth, “Towards an automatic evaluation of the dysarthria level of patients with parkinson’s disease,” Journal of communication disorders, vol. 76, pp. 21–36, 2018.
[28] Y. Zhu, A. Tiwari, J. Monteiro, S. Kshirsagar, and T. H. Falk, “COVID-19 detection via fusion of modulation spectrum and linear prediction speech features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[29] A. Afshan, J. Guo, S. J. Park, V. Ravi, J. Flint, and A. Alwan, “Effectiveness of voice quality features in detecting depression,” Interspeech 2018, 2018.
[30] Y. Zhu and T. H. Falk, “Fusion of modulation spectral and spectral features with symptom metadata for improved speech-based covid-19 detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 8997–9001.
[31] H. Coppock, A. Gaskell, P. Tzirakis, A. Baird, L. Jones, and B. Schuller, “End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study,” BMJ innovations, vol. 7, no. 2, 2021.
[32] N. K. Sharma, S. R. Chetupalli, D. Bhattacharya, D. Dutta, P. Mote, and S. Ganapathy, “The second dicova challenge: Dataset and performance analysis for diagnosis of covid-19 using acoustics,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 556–560.
[33] Y. Zhu and T. H. Falk, “Spectral–temporal saliency masks and modulation tensorgrams for generalizable covid-19 detection,” Computer Speech & Language, vol. 86, p. 101620, 2024.
[34] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[35] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” arXiv preprint arXiv:2005.07143, 2020.
[36] Y. Gong, Y.-A. Chung, and J. Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.
[37] S. M. Sekhar, G. Kashyap, A. Bhansali, K. Singh et al., “Dysarthric-speech detection using transfer learning with convolutional neural networks,” ICT Express, vol. 8, no. 1, pp. 61–64, 2022.
[38] L. Jeancolas, D. Petrovska-Delacrétaz, G. Mangone, B.-E. Benkelfat, J.-C. Corvol, M. Vidailhet, S. Lehéricy, and H. Benali, “X-vectors: New quantitative biomarkers for early parkinson’s disease detection from speech,” Frontiers in Neuroinformatics, vol. 15, p. 578369, 2021.
[39] H. Coppock, G. Nicholson, I. Kiskin, V. Koutra, K. Baker, J. Budd, R. Payne, E. Karoune, D. Hurley, A. Titcomb et al., “Audio-based ai classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers,” arXiv preprint arXiv:2212.08570, 2022.
[40] H. Xue and F. D. Salim, “Exploring self-supervised representation ensembles for covid-19 cough classification,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 1944–1952.
[41] L. P. Violeta, W.-C. Huang, and T. Toda, “Investigating self-supervised pretraining frameworks for pathological speech recognition,” arXiv preprint arXiv:2203.15431, 2022.
[42] Y. Lu, L. Wen, J. Liu, Y. Liu, and X. Tian, “Self-supervision can be a good few-shot learner,” in European Conference on Computer Vision. Springer, 2022, pp. 740–758.
[43] L. Pepino, P. Riera, and L. Ferrer, “Emotion recognition from speech using wav2vec 2.0 embeddings,” arXiv preprint arXiv:2104.03502, 2021.
[44] P. Zhang, M. Wu, H. Dinkel, and K. Yu, “Depa: Self-supervised audio embedding for depression detection,” in Proceedings of the 29th ACM international conference on multimedia, 2021, pp. 135–143.
[45] E. D. Casserly and D. B. Pisoni, “Speech perception and production,” Wiley Interdisciplinary Reviews: Cognitive Science, vol. 1, no. 5, pp. 629–647, 2010.
[46] J. J. Ohala, “Respiratory activity in speech,” in Speech production and speech modelling. Springer, 1990, pp. 23–53.
[47] L. R. Rabiner, R. W. Schafer et al., “Introduction to digital speech processing,” Foundations and Trends® in Signal Processing, vol. 1, no. 1–2, pp. 1–194, 2007.
[48] H. Hermansky, “Modulation spectrum in speech processing,” in Signal Analysis and Prediction. Springer, 1998, pp. 395–406.
[49] S. Greenberg and B. E. Kingsbury, “The modulation spectrogram: In pursuit of an invariant representation of speech,” in 1997 IEEE international conference on acoustics, speech, and signal processing, vol. 3. IEEE, 1997, pp. 1647–1650.
[50] T. H. Falk, C. Zheng, and W.-Y. Chan, “A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1766–1774, 2010.
[51] M. Markaki and Y. Stylianou, “Voice pathology detection and discrimination based on modulation spectral features,” IEEE Transactions on audio, speech, and language processing, vol. 19, no. 7, pp. 1938–1948, 2011.
[52] M. Sarria-Paja and T. H. Falk, “Whispered speech detection in noise using auditory-inspired modulation spectrum features,” IEEE Signal Processing Letters, vol. 20, no. 8, pp. 783–786, 2013.
[53] T. H. Falk, W.-Y. Chan, and F. Shein, “Characterization of atypical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility,” Speech Communication, vol. 54, no. 5, pp. 622–631, 2012.
[54] S. Wu, T. H. Falk, and W.-Y. Chan, “Automatic speech emotion recognition using modulation spectral features,” Speech communication, vol. 53, no. 5, pp. 768–785, 2011.
[55] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-light: A benchmark for asr with limited or no supervision,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7669–7673.
[56] G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021.
[57] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” arXiv preprint arXiv:2101.00390, 2021.
[58] A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[59] “Respiratory tract infections (RTIs),” https://www.nhs.uk/conditions/respiratory-tract-infection/, accessed: 2023-12-13.
[60] F. Rudzicz, A. K. Namasivayam, and T. Wolff, “The torgo database of acoustic and articulatory speech from speakers with dysarthria,” Language Resources and Evaluation, vol. 46, pp. 523–541, 2012.
[61] X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio, and H. T. Bunnell, “The nemours database of dysarthric speech,” in Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP’96, vol. 3. IEEE, 1996, pp. 1962–1965.
[62] R. P. Clapham, L. van der Molen, R. van Son, M. W. van den Brekel, F. J. Hilgers et al., “Nki-ccrt corpus-speech intelligibility before and after advanced head and neck cancer treated with concomitant chemoradiotherapy.” in LREC, vol. 4. Citeseer, 2012, pp. 3350–3355.
[63] B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. Van Son, F. Weninger, F. Eyben, T. Bocklet et al., “The interspeech 2012 speaker trait challenge,” in INTERSPEECH 2012, Portland, OR, USA, 2012.
[64] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[65] Y.-Y. Yang, M. Hira, Z. Ni, A. Chourdia, A. Astafurov, C. Chen, C.-F. Yeh, C. Puhrsch, D. Pollack, D. Genzel, D. Greenberg, E. Z. Yang, J. Lian, J. Mahadeokar, J. Hwang, J. Chen, P. Goldsborough, P. Roy, S. Narenthiran, S. Watanabe, S. Chintala, V. Quenneville-Bélair, and Y. Shi, “Torchaudio: Building blocks for audio and speech processing,” arXiv preprint arXiv:2110.15018, 2021.
[66] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A general-purpose speech toolkit,” 2021, arXiv:2106.04624.
[67] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Learning from imbalanced data sets. Springer, 2018, vol. 10.
[68] S. Baldwin, “Compute canada: advancing computational research,” in Journal of Physics: Conference Series, vol. 341, no. 1. IOP Publishing, 2012, p. 012001.
[69] A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921.
[70] T. Ashihara, M. Delcroix, T. Moriya, K. Matsuura, T. Asami, and Y. Ijima, “What do self-supervised speech and speaker models learn? new findings from a cross model layer-wise analysis,” arXiv preprint arXiv:2401.17632, 2024.
[71] G.-T. Lin, C.-L. Feng, W.-P. Huang, Y. Tseng, T.-H. Lin, C.-A. Li, H.-y. Lee, and N. G. Ward, “On the utility of self-supervised models for prosody-related tasks,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1104–1111.
[72] R. A. Fisher, “Statistical methods for research workers,” in Breakthroughs in statistics: Methodology and distribution. Springer, 1970, pp. 66–70.
[73] A. Rochet-Capellan and S. Fuchs, “Take a breath and take the turn: how breathing meets turns in spontaneous dialogue,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 369, no. 1658, p. 20130399, 2014.
[74] A. L. Winkworth, P. J. Davis, E. Ellis, and R. D. Adams, “Variability and consistency in speech breathing during reading: Lung volumes, speech intensity, and linguistic factors,” Journal of Speech, Language, and Hearing Research, vol. 37, no. 3, pp. 535–556, 1994.
[75] K. E. Barrett, Ganong’s review of medical physiology. : McGraw Hill Education, 2019, no. 1.
[76] C. J. Cho, P. Wu, A. Mohamed, and G. K. Anumanchipalli, “Evidence of vocal tract articulation in self-supervised learning of speech,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[77] S. Koops, S. G. Brederoo, J. N. de Boer, F. G. Nadema, A. E. Voppel, and I. E. Sommer, “Speech as a biomarker for depression,” CNS & Neurological Disorders-Drug Targets (Formerly Current Drug Targets-CNS & Neurological Disorders), vol. 22, no. 2, pp. 152–160, 2023.