¹ ICube laboratory, University of Strasbourg, CNRS, France
Email: {d.wang, kyuan, f.blanc, npadoy, seo}@unistra.fr
² Hôpital de la Robertsau, France
Email: {candice.muller, frederic.blanc}@chru-strasbourg.fr

Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model †

† Supported by French National project “ANR ArtIC: AI for Care”, French Minister.

Diwei Wang¹, Kun Yuan¹, Candice Muller², Frédéric Blanc¹,², Nicolas Padoy¹, Hyewon Seo¹

Abstract

We present a knowledge augmentation strategy for assessing the diagnostic groups and gait impairment from monocular gait videos. Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos, through a collective learning across three distinct modalities: gait videos, class-specific descriptions, and numerical gait parameters. Our specific contributions are two-fold: First, we adopt a knowledge-aware prompt tuning strategy to utilize the class-specific medical description in guiding the text prompt learning. Second, we integrate the paired gait parameters in the form of numerical texts to enhance the numeracy of the textual representation. Results demonstrate that our model not only significantly outperforms state-of-the-art (SOTA) in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions using the vocabulary of quantitative gait parameters. The code and the model will be made available at our project page: https://anonymous.4open.science/r/GaitAnalysisVLM-CC83.

Keywords: Pathological gait classification · MDS-UPDRS Gait score · Knowledge-aware prompt tuning · Numeracy for language model

1 Introduction

While quantitative gait impairment analysis is an established method for assessing neurodegenerative diseases and gauging their severity [pathologicalSignature, QuantativeGA, ImpactofEnvironment, muller2018correlation], current clinical assessments are used in highly restricted contexts, posing significant challenges: not only do they often require specialized equipment, such as force plates or IMU sensors, but they also struggle to capture moments with prominent symptoms during clinical visits, which are somewhat special occasions for patients. Analysing motor symptoms from video offers new possibilities, enabling cost-effective monitoring and remote surveillance without the need for frequent in-person clinic visits, thereby facilitating timely and personalized assessment.
Naturally, there have been recent efforts to develop gait analysis systems based on a single 2D RGB camera, with the majority leveraging advancements in deep learning. Albuquerque et al. [St-LSTMGaitClassify] develop a spatiotemporal deep learning approach by producing a gait representation that combines image features extracted by Convolutional Neural Networks (CNNs), chained with a temporal encoding based on an LSTM (Long Short-Term Memory) network. Sabo et al. [sabo2022estimating] have shown that Spatiotemporal-Graph Convolution Network models operating on 3D joint trajectories outperform earlier models. In the work by Lu et al. [lu2020miccai], 3D body mesh and pose are extracted and tracked from video frames, and the sequence of 3D poses is classified based on MDS-UPDRS gait scores [mds-updrs2008] using a temporal CNN. Wang et al. [max-gr2023] have developed a dedicated 3D skeleton reconstructor tailored for gait motion, incorporating a gait parameter estimator from videos and a multihead attention Transformer for similar classification tasks. Among methods for non-pathological gait analysis, GaitBase [opengait2023] combines improved spatial feature extraction and temporal gait modeling for appearance-based gait recognition, in both indoor and outdoor settings.
Existing works face challenges in handling insufficient pathological gait data and imbalances with normal data, prompting strategies such as a self-supervised pretraining stage prior to the task-specific supervision [sabo2022estimating], or the employment of crafted loss functions [lu2020miccai]. Nevertheless, data-efficient approaches with superior performance remain crucial in video-based pathological gait classification. Meanwhile, the recent emergence of large-scale pre-trained vision-language models (VLMs) has demonstrated remarkable performance and transferability to different types of visual recognition tasks [clip2021, miech2020end], thanks to their generalizable visual and textual representations of natural concepts. In the context of medical image analysis, VLMs have been tailored to various medical imaging tasks via finetuning [huang2023visual], multimodal global and local representation learning [huang2021gloria], knowledge-based prompt learning [qin2022medical, kapt2023], knowledge-based contrastive learning on decoupled image and text modality [wang2022medclip], and large-scale noisy video-text pretraining [yuan2023learning].
Inspired by these works, we propose a new approach to transfer and improve representations of VLMs for the pathological gait classification task in neurodegenerative diseases. Concretely, we model the prompt’s context with learnable vectors, which are initialized with domain-specific knowledge. Additionally, numerical gait parameters paired with videos are encoded and aligned with the text representation through contrastive learning. During training, the model learns visual and text representations capable of understanding both the class-discriminating and numerical features of gait videos. To our knowledge, our work represents the first attempt to deploy a VLM for the analysis of pathological gait videos.

2 Method

An overview of our method is shown in Fig.1. We utilize three distinct modalities to enhance the accuracy and the reliability of the VLM in classifying medical concepts: gait videos, class-specific medical descriptions, and numerical gait parameters. Our knowledge augmentation strategy consists of two parts: First, we adopt a knowledge-aware prompt learning strategy to exploit class-specific descriptions when generating the text prompts, while leveraging the pre-aligned video-text latent space (Sec.2.2). Second, we incorporate the associated numerical gait parameters as numerical texts to enhance the numeracy within the latent space of the text (Sec.2.3).

2.1 Dataset and preprocessing

Dataset. Our study leverages a dataset comprising 92 gait videos from 40 patients diagnosed with neurodegenerative disorders and 3 healthy controls, as detailed in [max-gr2023]. Moreover, 28 gait video clips featuring healthy elderly individuals have been added, chosen from the TOAW archive [toaw2022] based on specific criteria (Berg Balance Scale $\geq 45$, no falls during the last 6 months, etc.), bringing the total number to 120 clips. All the videos are recorded at 30 fps, each capturing a one-way walking path of an individual. The patients were instructed to walk back and forth on a GAITRite (https://www.gaitrite.com/) pressure-sensitive walkway, providing a set of 29 gait parameters as outlined in the gait parameter table in the Supplementary Material.

Figure 1: Overview of our cross-modality model for video-based clinical gait analysis (left), alongside clinical gait notions and per-class descriptions of gait classes utilized for prompt initialization (right). Three colored blocks represent the text- and video encoding pipelines, and the text embedding of numerical gait parameters, respectively.

Preprocessing. We crop the original videos based on bounding boxes, and employ a sliding window scheme (window size: 70 frames) to generate subsequences, with a stride of 25 for training and 0 for validation. This process results in approximately 900 clips of 70 frames for each cross-validation fold. To effectively incorporate the gait parameters into text space, we formulate sentences by combining four gait parameters with “and”, connecting names and values with “is”, as illustrated in Fig.2. The choice of four parameters per sentence is based on our observation that, in practice, neurologists often label a video by using only a few prominent or representative visual clues rather than exhaustively listing all the evidence. Out of the 29 parameters available, we select 438 combinations, each containing 4 parameters whose Pearson correlation coefficients are within the range of $[-0.4,0.4]$.
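The preprocessing steps above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the function names and the example parameter names (`cadence`, `velocity`) are hypothetical, the correlation criterion is read here as "every pair inside a group has a Pearson coefficient in $[-0.4,0.4]$", and only a positive stride (the training setting) is covered.

```python
from itertools import combinations
from math import sqrt


def sliding_windows(num_frames, window=70, stride=25):
    """Return (start, end) frame index pairs of fixed-length subsequences."""
    if stride <= 0:
        raise ValueError("this sketch only covers a positive stride")
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def weakly_correlated_groups(columns, size=4, limit=0.4):
    """Keep parameter groups whose pairwise correlations all lie in [-limit, limit].

    `columns` maps a parameter name to its list of measured values.
    """
    keep = []
    for group in combinations(sorted(columns), size):
        pairs = combinations(group, 2)
        if all(abs(pearson(columns[a], columns[b])) <= limit for a, b in pairs):
            keep.append(group)
    return keep


def to_sentence(values):
    """Join 'name is value' clauses with 'and', as in the sentence template."""
    return " and ".join(f"{name} is {value}" for name, value in values.items())
```

For instance, a 120-frame video yields three training windows starting at frames 0, 25, and 50, and `to_sentence({"cadence": 98.5, "velocity": 1.1})` produces `"cadence is 98.5 and velocity is 1.1"`.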

2.2 VLM fine-tuning with visual and knowledge-aware prompts

We adopt the prompt learning strategy, keeping the pre-trained VLM frozen to preserve its general representation and leverage the pre-aligned multi-modal latent space. Taking inspiration from KAPT [kapt2023], we introduce gait-specific knowledge-based prompts by feeding per-class descriptions $\{Desc_{i}\}$ (see our project page) into the text prompts. These clinical gait notions have been generated using ChatGPT-4 [2023gpt4], then filtered, modified, and validated by a neurologist. To devise learnable prompts, we apply KEPLER [2021kepler], as in [kapt2023], to the class descriptions; the resulting embeddings are projected through per-class multi-layer perceptrons (MLPs) and added to the learnable parameters $\{X^{k}_{i}\}$ to form learnable prompts $\{C^{k}_{i}\}$:

$$\{C^{k}_{i}\}_{i=1,\dots,N_{cls}}=\textit{Proj}^{k}_{\phi}(\textit{KEPLER}(\{Desc_{i}\}))+\{X^{k}_{i}\},\quad k=1,\dots,8 \tag{1}$$

where $N_{cls}$ is the number of classes, and $C^{k}_{i}\in\mathbb{R}^{512}$ and $X^{k}_{i}\in\mathbb{R}^{512}$ represent the $k$-th learnable prompt and parameter associated with the $i$-th class, respectively. For the automatic prompt $\{D_{i}\}$, we extract keywords from $\{Desc_{i}\}$, as illustrated in Fig.1. More examples can be found on the project page. These selected texts then undergo the standard tokenization of the frozen CLIP text encoder $\textit{FCLIP}_{T}$ to obtain $\{D_{i}\}$. Similarly, we pass the class names $\{T_{i}\}$ into the tokenizer of $\textit{FCLIP}_{T}$ to generate the class tokens $\{tok_{i}^{cls}\}$. As shown in Fig.1, we concatenate $\{C_{i}\}$, $\{D_{i}\}$ and $\{tok_{i}^{cls}\}$ and feed them into $\textit{FCLIP}_{T}$ to obtain the text features $\{F^{T}_{i}\}$:

$$\{F_{i}^{T}\}=\textit{FCLIP}_{T}([\{C_{i}\},\{D_{i}\},\{tok_{i}^{cls}\}]). \tag{2}$$
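To make the data flow of Eqs. (1) and (2) concrete, the following plain-Python sketch builds the learnable prompts and the input sequence of the text encoder. It is a toy illustration under several assumptions: a single bias-free linear layer stands in for each projection MLP, one projection is used per prompt index $k$ (following the superscript in Eq. (1)), the knowledge embeddings are given as plain vectors rather than computed by KEPLER, and all function names are hypothetical.

```python
def linear(weights, vec):
    """Apply a bias-free dense layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def build_prompts(desc_embeddings, proj_weights, learnable):
    """Eq. (1): C[i][k] = Proj_k(knowledge embedding of class i) + X[i][k].

    desc_embeddings: one knowledge embedding per class (stand-in for KEPLER output)
    proj_weights:    one weight matrix per prompt index k (stand-in for an MLP)
    learnable:       learnable[i][k] is the free parameter vector X_i^k
    """
    prompts = []
    for i, emb in enumerate(desc_embeddings):
        per_class = []
        for k, weights in enumerate(proj_weights):
            projected = linear(weights, emb)
            per_class.append([p + x for p, x in zip(projected, learnable[i][k])])
        prompts.append(per_class)
    return prompts


def text_encoder_input(prompts_i, keyword_tokens_i, class_token_i):
    """Input of Eq. (2): learnable prompts, then keyword tokens, then the class token."""
    return prompts_i + keyword_tokens_i + [class_token_i]
```

In the setting described above there would be 8 prompt vectors of dimension 512 per class; the sketch works for any consistent sizes, so it can be checked with two-dimensional toy vectors.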

On the video side, each frame of the input video $V$ goes through the tokenization of the Vision Transformer (ViT) [2020vit] and forms a sequence of per-frame representations $z_{t}^{(0)}$. The visual prompts for the $l$-th layer of the pretrained CLIP vision encoder $\textit{FCLIP}_{V}$ are derived by applying the video prompt learner of Vita-CLIP [vita2023] ($\textit{VitaVPL}$) to the output of the previous layer $\{z_{t}^{(l-1)}\}$:

$$[S^{(l)},G^{(l)},L^{(l)}]_{l=1,\dots,12}=\textit{VitaVPL}_{\theta}(\{z_{t}^{(l-1)}\}), \tag{3}$$

where $S^{(l)}$ , $G^{(l)}$ , and $L^{(l)}$ respectively denote the learnable summary, global, and local prompt tokens at layer $l$ . As suggested in [vita2023], these prompt tokens are appended to $\{z_{t}^{(l-1)}\}$ and subsequently fed into $\textit{FCLIP}_{V}$ to obtain $F^{V}$ :

$$F^{V}=\textit{FCLIP}_{V}([\{z_{t}^{(l-1)}\},S^{(l)},G^{(l)},L^{(l)}]). \tag{4}$$

Moreover, to combat class imbalance, we employ a multi-class focal loss [lu2020miccai] to maximize the cosine similarity of positive pairs:

$$L_{k}=\sum_{i=1}^{N_{cls}}-\alpha(1-p_{i})^{\gamma}y_{i}\log(p_{i}),\quad p_{i}=\frac{\exp(<F^{T}_{i}|F^{V}>/\tau)}{\sum^{N_{cls}}_{j=1}\exp(<F^{T}_{j}|F^{V}>/\tau)}, \tag{5}$$

where $y$ denotes the one-hot encoded label, $<\cdot|\cdot>$ the cosine similarity, and $\tau=0.01$ the temperature parameter. We set the weighting factor $\alpha=0.25$ and the focusing parameter $\gamma=2$.
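As a worked illustration of Eq. (5), the sketch below computes the class probabilities from cosine similarities and evaluates the focal loss for one sample, using only the Python standard library. The function names, the max-subtraction for numerical stability, and the lower clamp on the probability are implementation assumptions that do not come from the text.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def class_probabilities(text_feats, video_feat, tau=0.01):
    """Softmax over temperature-scaled cosine similarities (p_i in Eq. 5)."""
    logits = [cosine(f, video_feat) / tau for f in text_feats]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(probs, label, alpha=0.25, gamma=2.0):
    """Multi-class focal loss for a one-hot label: only the true class contributes."""
    p = max(probs[label], 1e-12)  # clamp to avoid log(0); an assumption of this sketch
    return -alpha * (1.0 - p) ** gamma * log(p)
```

With a uniform two-class prediction, `focal_loss([0.5, 0.5], 0)` evaluates to $0.25\cdot 0.5^{2}\cdot\ln 2\approx 0.0433$, whereas a confident correct prediction is down-weighted towards zero by the factor $(1-p_{i})^{\gamma}$.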

2.3 Contrastive learning with numerical text embeddings

Text embedding of numerical gait parameters. Starting from the set of sentences each containing four gait parameters, we employ a two-step encoding process as illustrated in Fig.3. Initially, sentences without numerical values are fed into the CLIP text encoder, resulting in a descriptive embedding of the textual content $\{F_{gp}^{T}\}$. As illustrated in Fig.3, we treat the linking word “is” separately to generate the text embedding [IS]. Subsequently, number embeddings are generated by multiplying the dedicated embedding base [NUM] with the associated numerical values $\{\omega_{gp}\}$. The chosen specialized embedding base is designed to be orthogonal to the position encoding [2023xval], ensuring the efficient transmission of numerical information through the self-attention blocks of the Transformer. The numerical text embedding $F^{num}$ is then obtained by applying $\textit{FCLIP}_{T}$ to the concatenated sentence:

$$F^{num}=\textit{FCLIP}_{T}(\{[F^{T}_{gp},\textbf{[IS]},\omega_{gp}\cdot\textbf{[NUM]}]\}),\quad gp\in\{1,2,3,4\}. \tag{6}$$

The figure comparing number embeddings in the Supplementary Material shows the cosine similarities of $F^{num}$ across number values ranging from 0 to 200 for a selected gait parameter. Our numerical encoding scheme, in contrast to using position encoding or direct digit and numerical text encoding, produces continuous embeddings that best reflect the numerical domain. Given that most gait parameter values are positive, we designate the mean value among healthy controls as the zero reference: $V_{norm}=\alpha\cdot\frac{(V-\overline{V}_{healthy})}{\sigma}$, where $\sigma$ is the variance of the gait parameter values, and $\alpha$ is the scaling factor to adjust the data range to [-2.5, 2.5], the dynamic range of layer normalization within the self-attention block. The numerical text embeddings for the dementia grouping task are visualized in Fig.4.
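The normalization and the number embedding $\omega_{gp}\cdot\textbf{[NUM]}$ can be illustrated with a short plain-Python sketch. Several points are assumptions made for illustration only: the scaling factor $\alpha$ is fitted here so that the largest absolute normalized value reaches 2.5, the function names are hypothetical, and the two-dimensional `num_base` is a toy stand-in for the actual [NUM] embedding base.

```python
def fit_alpha(values, healthy_mean, sigma, bound=2.5):
    """Choose alpha so that the normalized values span at most [-bound, bound]."""
    largest = max(abs((v - healthy_mean) / sigma) for v in values)
    return bound / largest


def normalize(value, healthy_mean, sigma, alpha):
    """V_norm = alpha * (V - mean over healthy controls) / sigma."""
    return alpha * (value - healthy_mean) / sigma


def number_embedding(v_norm, num_base):
    """Scale the dedicated [NUM] base vector by the normalized value."""
    return [v_norm * b for b in num_base]
```

Because the embedding is a scalar multiple of a fixed base vector, nearby values map to nearby embeddings, and the healthy mean maps to the zero vector. This reflects the continuity property discussed above.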

Cross-modal contrastive learning. To better align the multimodal representation with our task, we exploit all accessible modalities, incorporating them into cross-modal contrastive learning: gait videos, class-specific descriptions, and numerical gait parameters. In our dataset, although gait videos are not always paired with a corresponding gait parameter set, each set of gait parameters is linked to a video and is therefore assigned a class label. To this end, we transform these gait parameters into numerical text embeddings using the encoding method described earlier, and introduce classification tasks thereon.
As illustrated in Fig.1, the projection of the generated text features $\{F^{T}_{i}\}$ and that of the numerical embedding $F^{num}$ are trained so that the cosine similarity between $P^{num}$ and the projected text feature of its ground-truth class $P^{T}$ is maximized, using a cross-entropy objective $L_{gp}$. The global loss function becomes: $L=L_{k}+\omega\cdot L_{gp}$. We set $\omega=0.05$ through heuristic analysis. To demonstrate alignment of numerical embeddings with the multi-modal space, we visualize the embedding spaces before and after learning in Fig.4.
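A minimal sketch of the objective $L_{gp}$ and of the global loss is given below in plain Python. The temperature applied to the similarities is an assumption (the text does not state one for $L_{gp}$; the value 0.01 is borrowed from Eq. (5)), and the function names are hypothetical.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def gait_param_cross_entropy(p_num, p_text, label, tau=0.01):
    """Cross-entropy over similarities between P^num and each projected class feature."""
    logits = [cosine(p_num, p) / tau for p in p_text]
    top = max(logits)
    log_sum = top + log(sum(exp(z - top) for z in logits))  # stable log-sum-exp
    return log_sum - logits[label]


def global_loss(l_k, l_gp, omega=0.05):
    """L = L_k + omega * L_gp, with the weight reported in the text."""
    return l_k + omega * l_gp
```

Minimizing this cross-entropy raises the similarity between a numerical embedding and the text feature of its own class relative to the other classes, which is the alignment described above.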

Interpreting the per-class text embedding. We aim to translate the per-class text features $\{F^{T}_{i}\}$ into natural language expressions using the vocabulary of numerical gait parameters. To this end, we train a text decoder from scratch to transform numerical text embeddings $F^{num}$ back into their corresponding gait parameters, reversing the encoding scheme shown in Fig.3. A 4-layer transformer decoder $\textit{D}_{T}$ is employed for text decoding. In line with recent developments in text-only decoder pre-training [2023decap], we train $\textit{D}_{T}$ using prefix language modeling. Specifically, given a sentence composed of gait parameters $\textbf{s}=\{\text{word}_{1},\text{word}_{2},...,\text{word}_{L}\}$, we generate a sequence of token IDs using the dictionary of $\textit{FCLIP}_{T}$. The numbers, previously normalized to [-2.5, 2.5], are rescaled to a graduated integer scale of [0, $N_{num}$]. The token ID $tok$ of a number [num] is defined as: $tok=[\textbf{EOS}]+\textit{scale}([\textbf{num}])$, where $[\textbf{EOS}]=49407$. $\textit{D}_{T}$ learns to reconstruct the sequence of token IDs $\{tok_{j}\}$ starting from the numerical text embedding $F^{num}$. In addition to the vanilla cross-entropy loss [2023decap], we leverage an ordinal cross-entropy loss to further penalize the reconstruction error of the number values:

$$L_{num}=-\frac{|\hat{tok}-tok|}{[\textbf{EOS}]+N_{num}-1}\sum^{[\textbf{EOS}]+N_{num}}_{m=1}y_{m}\log(p_{m}), \tag{7}$$

where $N_{num}=200$ , $|\hat{tok}-tok|$ represents the absolute distance between the ground-truth token ID $tok$ and the estimation $\hat{tok}$ , $y$ denotes the one-hot encoded ground-truth label, and $p$ the estimated probability.
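The number tokenization and Eq. (7) can be sketched in plain Python as follows. The linear mapping with rounding used for `scale`, the choice of the most probable token as $\hat{tok}$, the sparse dictionary of probabilities, and the clamp inside the logarithm are all assumptions of this sketch. Only $[\textbf{EOS}]=49407$ and $N_{num}=200$ come from the text.

```python
from math import log

EOS = 49407   # EOS token ID given in the text
N_NUM = 200   # number of graduations given in the text


def scale(v_norm, n_num=N_NUM, bound=2.5):
    """Map a normalized value in [-bound, bound] to an integer in [0, n_num]."""
    clipped = max(-bound, min(bound, v_norm))
    return round((clipped + bound) / (2 * bound) * n_num)


def number_token_id(v_norm):
    """tok = [EOS] + scale([num])."""
    return EOS + scale(v_norm)


def ordinal_cross_entropy(probs, tok):
    """Eq. (7) for a one-hot target: cross-entropy weighted by the token distance.

    `probs` maps token IDs to predicted probabilities; the estimate is taken
    to be the most probable token.
    """
    tok_hat = max(probs, key=probs.get)
    weight = abs(tok_hat - tok) / (EOS + N_NUM - 1)
    return -weight * log(max(probs.get(tok, 0.0), 1e-12))
```

Under these assumptions, a normalized value of 0 (the healthy mean) maps to graduation 100, that is, token ID 49507. As Eq. (7) is written, the term vanishes whenever the estimated token is correct and grows with the distance between the estimated and ground-truth tokens, so it complements rather than replaces the vanilla cross-entropy.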
Benefiting from the proposed cross-modal contrastive learning scheme, $\{F^{T}_{i}\}$ can be represented as a linear combination of the numerical text embeddings $\{F^{num}\}$, with weights computed by measuring the cosine similarity between $\{P_{i}\}$ and $P^{num}$. Subsequently, we apply $\textit{D}_{T}$ to the resulting combinations $\{\hat{F}^{num}_{i}\}$ to generate natural language descriptions: $\{\hat{\textit{Desc}}_{i}\}=\textit{D}_{T}(\{\hat{F}^{num}_{i}\})$.
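The similarity-weighted combination described above can be sketched as follows in plain Python. The text only states that the weights are computed from cosine similarities; turning them into a normalized distribution with a temperature-scaled softmax is an assumption of this sketch, as are the function names and the temperature value.

```python
from math import exp, sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def combine_numerical_embeddings(p_class, p_nums, f_nums, tau=0.01):
    """Approximate a class feature as a weighted sum of numerical text embeddings.

    p_class: projected text feature of one class
    p_nums:  projected numerical embeddings, one per gait-parameter sentence
    f_nums:  the corresponding unprojected numerical text embeddings
    Returns the combined embedding and the weights used.
    """
    sims = [cosine(p_class, p) for p in p_nums]
    top = max(sims)
    exps = [exp((s - top) / tau) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(f_nums[0])
    combined = [sum(w * f[d] for w, f in zip(weights, f_nums)) for d in range(dim)]
    return combined, weights
```

The combined vector would then be passed to the text decoder; with a small temperature, the result is dominated by the gait-parameter sentences closest to the class feature.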

3 Experiments and Results

Our study includes two classification tests: gait scoring, which estimates the severity of a patient’s condition based on a 4-class gait score (normal–0, slight–1, mild–2, and moderate–3) following MDS-UPDRS III [mds-updrs2008], and dementia subtyping, which distinguishes between different dementia groups: normal/DLB (Dementia with Lewy Bodies)/AD (Alzheimer’s Disease). See the project page for detailed clinical gait descriptions of each class. Due to its limited size (a total of 120 videos), we divide our video dataset into training and validation sets and conduct 10-fold cross-validation for each classification task. Confusion matrices are provided in Fig.3 of Supplementary Material.
