License: arXiv.org perpetual non-exclusive license
arXiv:2403.13756v1 [cs.CV] 20 Mar 2024
1 ICube laboratory, University of Strasbourg, CNRS, France
Email: {d.wang, kyuan, f.blanc, npadoy, seo}@unistra.fr
2 Hôpital de la Robertsau, France
Email: {candice.muller, frederic.blanc}@chru-strasbourg.fr

Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model

Supported by French National project “ANR ArtIC: AI for Care”, French Minister.

Diwei Wang (1), Kun Yuan (1), Candice Muller (2), Frédéric Blanc (1, 2), Nicolas Padoy (1), Hyewon Seo (1)
Abstract

We present a knowledge augmentation strategy for assessing the diagnostic groups and gait impairment from monocular gait videos. Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos, through a collective learning across three distinct modalities: gait videos, class-specific descriptions, and numerical gait parameters. Our specific contributions are two-fold: First, we adopt a knowledge-aware prompt tuning strategy to utilize the class-specific medical description in guiding the text prompt learning. Second, we integrate the paired gait parameters in the form of numerical texts to enhance the numeracy of the textual representation. Results demonstrate that our model not only significantly outperforms state-of-the-art (SOTA) in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions using the vocabulary of quantitative gait parameters. The code and the model will be made available at our project page: https://anonymous.4open.science/r/GaitAnalysisVLM-CC83.

Keywords:
Pathological gait classification · MDS-UPDRS Gait score · Knowledge-aware prompt tuning · Numeracy for language model

1 Introduction

While quantitative gait impairment analysis has proven to be an established method for assessing neurodegenerative diseases and gauging their severity [pathologicalSignature, QuantativeGA, ImpactofEnvironment, muller2018correlation], current clinical assessments are used in highly restricted contexts, posing significant challenges: not only do they often require specialized equipment, such as force plates or IMU sensors, but they also struggle to capture moments with prominent symptoms during clinical visits, which are somewhat special occasions for patients. Analysing motor symptoms from video offers new possibilities, enabling cost-effective monitoring and remote surveillance without the need for frequent in-person clinic visits, thereby facilitating timely and personalized assessment.
Naturally, there have been recent efforts to develop a single 2D-RGB-camera-based gait analysis system, with the majority leveraging advancements in deep learning. Albuquerque et al. [St-LSTMGaitClassify] develop a spatiotemporal deep learning approach by producing a gait representation that combines image features extracted by Convolutional Neural Networks (CNNs), chained with a temporal encoding based on an LSTM (Long Short-Term Memory) network. Sabo et al. [sabo2022estimating] have shown that Spatiotemporal-Graph Convolution Network models operating on 3D joint trajectories outperform earlier models. In the work by Lu et al. [lu2020miccai], 3D body mesh and pose are extracted and tracked from video frames, and the sequence of 3D poses is classified based on MDS-UPDRS gait scores [mds-updrs2008] using a temporal CNN. Wang et al. [max-gr2023] have developed a dedicated 3D skeleton reconstructor tailored for gait motion, incorporating a gait parameter estimator from videos and a multihead attention Transformer for similar classification tasks. Among methods for non-pathological gait analysis, GaitBase [opengait2023] combines improved spatial feature extraction and temporal gait modeling for appearance-based gait recognition, in both indoor and outdoor settings.
Existing works face challenges in handling insufficient pathological gait data and imbalances with normal data, prompting strategies such as a self-supervised pretraining stage prior to the task-specific supervision [sabo2022estimating], or the employment of crafted loss functions [lu2020miccai]. Nevertheless, data-efficient approaches with superior performance remain crucial in video-based pathological gait classification. Meanwhile, recently emerged large-scale pre-trained vision-language models (VLMs) have demonstrated remarkable performance and transferability to different types of visual recognition tasks [clip2021, miech2020end], thanks to their generalizable visual and textual representations of natural concepts. In the context of medical image analysis, VLMs have been tailored to various medical imaging tasks via finetuning [huang2023visual], multimodal global and local representation learning [huang2021gloria], knowledge-based prompt learning [qin2022medical, kapt2023], knowledge-based contrastive learning on decoupled image and text modality [wang2022medclip], and large-scale noisy video-text pretraining [yuan2023learning].
Inspired by these works, we propose a new approach to transfer and improve representations of VLMs for the pathological gait classification task in neurodegenerative diseases. Concretely, we model the prompt’s context with learnable vectors, which are initialized with domain-specific knowledge. Additionally, numerical gait parameters paired with videos are encoded and aligned with the text representation through contrastive learning. During training, the model learns visual and text representations capable of understanding both the class-discriminating and numerical features of gait videos. To our knowledge, our work represents the first attempt to deploy a VLM for the analysis of pathological gait videos.

2 Method

An overview of our method is shown in Fig.1. We utilize three distinct modalities to enhance the accuracy and the reliability of the VLM in classifying medical concepts: gait videos, class-specific medical descriptions and numerical gait parameters. Our knowledge augmentation strategy consists of two parts: First, we adopt a knowledge-aware prompt learning strategy to exploit class-specific description in the text prompts generation, while leveraging the pre-aligned video-text latent space (Sec.2.2). Second, we incorporate the associated numerical gait parameters as numerical texts to enhance the numeracy within the latent space of the text (Sec.2.3).

2.1 Dataset and preprocessing

Dataset. Our study leverages a dataset comprising 92 gait videos from 40 patients diagnosed with neurodegenerative disorders and 3 healthy controls, as detailed in [max-gr2023]. In addition, 28 gait video clips featuring healthy elderly individuals have been added, chosen from the TOAW archive [toaw2022] based on specific criteria (Berg Balance Scale $\geq 45$, no falls during the last 6 months, etc.), bringing the total number to 120 clips. All the videos are recorded at 30 fps, each capturing a one-way walking path of an individual. The patients were instructed to walk back and forth on a GAITRite (https://www.gaitrite.com/) pressure-sensitive walkway, providing a set of gait parameters as outlined in the table of 29 gait parameters in the Supplementary Material.

Figure 1: Overview of our cross-modality model for video-based clinical gait analysis (left), alongside clinical gait notions and per-class descriptions of gait classes utilized for prompt initialization (right). Three colored blocks represent the text- and video encoding pipelines, and the text embedding of numerical gait parameters, respectively.

Preprocessing. We crop the original videos based on bounding boxes, and employ a sliding window scheme (window size: 70 frames) to generate subsequences, with a stride of 25 for training and 0 for validation. This process results in approximately 900 clips of 70 frames for each cross-validation fold. To effectively incorporate the gait parameters into the text space, we formulate sentences by combining four gait parameters with “and”, connecting names and values with “is”, as illustrated in Fig.2. The choice of four parameters per sentence is based on our observation that, in practice, neurologists often label a video by using only a few prominent or representative visual clues rather than exhaustively listing all the evidence. Out of the total 29 parameters available, we select 438 combinations, each containing 4 parameters whose Pearson correlation coefficients are within the range of $[-0.4, 0.4]$.
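The preprocessing steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the sentence template follows the textual description of Fig.2 only loosely, and the correlation criterion is read here as "every pair within a 4-parameter set has a Pearson coefficient in $[-0.4, 0.4]$", which the text does not spell out.

```python
from itertools import combinations

def sliding_windows(num_frames, window=70, stride=25):
    """Return (start, end) frame indices of fixed-length subsequences."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

def params_to_sentence(params):
    """Connect each name and value with "is", then chain the pairs with "and"."""
    return " and ".join(f"{name} is {value}" for name, value in params)

def weakly_correlated_sets(names, corr, size=4, limit=0.4):
    """Keep parameter sets whose pairwise correlations all lie in [-limit, limit].

    `corr` maps frozenset({name_a, name_b}) to a Pearson coefficient
    (hypothetical data layout chosen for this sketch).
    """
    keep = []
    for combo in combinations(names, size):
        pairs = combinations(combo, 2)
        if all(abs(corr[frozenset(pair)]) <= limit for pair in pairs):
            keep.append(combo)
    return keep
```

For instance, `sliding_windows(170)` yields five overlapping 70-frame windows starting at frames 0, 25, 50, 75 and 100, and `params_to_sentence([("cadence", 98.0), ("velocity", 85.2)])` produces "cadence is 98.0 and velocity is 85.2" (the parameter names and values are made up).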

Figure 2: Translation of gait parameters into text.

2.2 VLM fine-tuning with visual and knowledge-aware prompts

We adopt the prompt learning strategy, keeping the pre-trained VLM frozen to preserve its general representation and leverage the pre-aligned multi-modal latent space. Taking inspiration from KAPT [kapt2023], we introduce gait-specific knowledge-based prompts by feeding per-class descriptions $\{Desc_i\}$ (see our project page) into the text prompts. These clinical gait notions have been generated using ChatGPT-4 [2023gpt4], then filtered, modified, and validated by a neurologist. To devise learnable prompts, we apply KEPLER [2021kepler] to the class descriptions, similar to [kapt2023]; the resulting embeddings are projected through per-class multi-layer perceptrons (MLPs) and added to the learnable parameters $\{X^k_i\}$ to form the learnable prompts $\{C^k_i\}$:

$$\{C^{k}_{i}\}_{i=1,\dots,N_{cls}} = Proj^{k}_{\phi}(\textit{KEPLER}(\{Desc_{i}\})) + \{X^{k}_{i}\}, \quad k=1,\dots,8 \tag{1}$$

where $N_{cls}$ is the number of classes, and $C^k_i \in \mathbb{R}^{512}$ and $X^k_i \in \mathbb{R}^{512}$ represent the $k$-th learnable prompt and parameter associated with the $i$-th class, respectively. For the automatic prompt $\{D_i\}$, we extract keywords from $\{Desc_i\}$, as illustrated in Fig.1. More examples can be found on the project page. These selected texts then undergo the standard tokenization of the frozen CLIP text encoder $\textit{FCLIP}_T$ to obtain $\{D_i\}$. Similarly, we pass the class names $\{T_i\}$ into the tokenizer of $\textit{FCLIP}_T$ to generate the class token $tok_i^{cls}$.
As shown in Fig.1, we concatenate $\{C_i\}$, $\{D_i\}$ and $\{tok_i^{cls}\}$ and feed them into $\textit{FCLIP}_T$ to obtain the text features $\{F^T_i\}$:

$$\{F_{i}^{T}\} = \textit{FCLIP}_{T}([\{C_{i}\},\{D_{i}\},\{tok_{i}^{cls}\}]). \tag{2}$$
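To make the shapes in Eqs. (1) and (2) concrete, the following sketch assembles the text-encoder input for a single class. It is an illustration under assumptions rather than the actual implementation: a single linear layer stands in for each per-class MLP, the knowledge embedding is passed in as a plain list instead of being computed by KEPLER, and all function names are hypothetical.

```python
def linear(vec, weight, bias):
    """One linear layer; `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def knowledge_prompts(desc_embedding, projections, learnable):
    """Eq. (1) for one class: C^k = Proj^k(knowledge embedding) + X^k.

    `projections` is a list of (weight, bias) pairs, one per prompt index k
    (a linear stand-in for the per-class MLPs); `learnable` holds the X^k.
    """
    prompts = []
    for (weight, bias), x_k in zip(projections, learnable):
        projected = linear(desc_embedding, weight, bias)
        prompts.append([p + x for p, x in zip(projected, x_k)])
    return prompts

def build_text_input(prompts, keyword_tokens, class_tokens):
    """Input of Eq. (2) for one class: the sequence [C_i, D_i, tok_i^cls]."""
    return prompts + keyword_tokens + class_tokens
```

In the paper's setting there would be 8 prompt vectors of dimension 512 per class; the sketch works for any consistent dimensions, and the frozen text encoder itself is deliberately left out.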

On the video side, each frame of the input video $V$ goes through the tokenization of the Vision Transformer (ViT) [2020vit] and forms a sequence of per-frame representations $z_t^{(0)}$. The visual prompts for the $l$-th layer of the pretrained CLIP vision encoder $\textit{FCLIP}_V$ are derived by applying the video prompt learner of Vita-CLIP [vita2023] ($\textit{VitaVPL}$) to the output of the previous layer $\{z_t^{(l-1)}\}$:

$$[S^{(l)},G^{(l)},L^{(l)}]_{l=1,\dots,12} = \textit{VitaVPL}_{\theta}(\{z_{t}^{(l-1)}\}), \tag{3}$$

where $S^{(l)}$, $G^{(l)}$, and $L^{(l)}$ respectively denote the learnable summary, global, and local prompt tokens at layer $l$. As suggested in [vita2023], these prompt tokens are appended to $\{z_t^{(l-1)}\}$ and subsequently fed into $\textit{FCLIP}_V$ to obtain $F^V$:

$$F^{V} = \textit{FCLIP}_{V}([\{z_{t}^{(l-1)}\},S^{(l)},G^{(l)},L^{(l)}]). \tag{4}$$

Moreover, to combat class imbalance, we employ a multi-class focal loss [lu2020miccai] to maximize the cosine similarity of positive pairs:

$$L_{k} = \sum_{i=1}^{N_{cls}} -\alpha (1-p_{i})^{\gamma} y_{i} \log(p_{i}), \quad p_{i} = \frac{\exp(\langle F^{T}_{i} | F^{V} \rangle / \tau)}{\sum_{j=1}^{N_{cls}} \exp(\langle F^{T}_{j} | F^{V} \rangle / \tau)}, \tag{5}$$

where $y$ denotes the one-hot encoded label, $\langle \cdot | \cdot \rangle$ the cosine similarity, and $\tau = 0.01$ the temperature parameter. We set the weighting factor $\alpha = 0.25$ and the focusing parameter $\gamma = 2$.
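As a worked illustration of Eq. (5), the sketch below computes the class probabilities from cosine similarities and then the focal loss for a single video. It is a plain-Python sketch with hypothetical function names; features are ordinary lists of floats, and because the label is one-hot only the ground-truth term of the sum is evaluated.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def class_probabilities(text_feats, video_feat, tau=0.01):
    """Softmax over cosine similarities between each F_i^T and F^V."""
    logits = [cosine(f, video_feat) / tau for f in text_feats]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(probs, label, alpha=0.25, gamma=2.0):
    """Eq. (5) with a one-hot label: only the ground-truth class contributes."""
    p = probs[label]
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

The factor $(1-p_i)^{\gamma}$ shrinks the contribution of samples that are already classified with high confidence, which is how the focal loss shifts attention toward hard and under-represented classes.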

2.3 Contrastive learning with numerical text embeddings

Figure 3: Our numerical text encoding (NTE) paradigm.

Text embedding of numerical gait parameters. Starting from the set of sentences each containing four gait parameters, we employ a two-step encoding process as illustrated in Fig.3. Initially, sentences without numerical values are fed into the CLIP text encoder, resulting in a descriptive embedding of the textual content $\{F_{gp}^T\}$. We treat the logical conjunction “is” separately to generate the text embedding [IS]. Subsequently, number embeddings are generated by multiplying the dedicated embedding base [NUM] with the associated numerical values $\{\omega_{gp}\}$. The embedding base is designed to be orthogonal to the position encoding [2023xval], ensuring the efficient transmission of numerical information through the self-attention blocks of the Transformer. The numerical text embedding $F^{num}$ is then obtained by applying $\textit{FCLIP}_T$ to the concatenated sentence:

$$F^{num} = \textit{FCLIP}_{T}(\{[F^{T}_{gp}, \textbf{[IS]}, \omega_{gp} \cdot \textbf{[NUM]}]\}), \quad gp \in \{1,2,3,4\}. \tag{6}$$

The number-embedding comparison figure in the Supplementary Material shows the cosine similarities of $F^{num}$ across number values ranging from 0 to 200 for a selected gait parameter. Our numerical encoding scheme, in contrast to using position encoding or direct digit and numerical text encoding, produces continuous embeddings that best reflect the numerical domain. Given that most gait parameter values are positive, we designate the mean value among healthy controls as the zero reference: $V_{norm} = \alpha \cdot \frac{V - \overline{V}_{healthy}}{\sigma}$, where $\sigma$ is the variance of the gait parameter values, and $\alpha$ is the scaling factor that adjusts the data range to [-2.5, 2.5], the dynamic range of layer normalization within the self-attention block. The numerical text embeddings for the dementia grouping task are visualized in Fig.4.
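The normalization and the number embedding can be sketched as follows. This is a hedged illustration: `fit_alpha` is a hypothetical helper that picks $\alpha$ so that the largest deviation in a set of values lands exactly on the bound of 2.5 (the text does not say how $\alpha$ is chosen), `sigma` is simply passed in as the spread term of the formula, and `num_base` stands for the [NUM] embedding base as a plain list.

```python
def normalize(value, healthy_mean, sigma, alpha):
    """V_norm = alpha * (V - mean over healthy controls) / sigma."""
    return alpha * (value - healthy_mean) / sigma

def fit_alpha(values, healthy_mean, sigma, bound=2.5):
    """Assumed rule: scale so the largest |deviation| maps onto the bound."""
    largest = max(abs(v - healthy_mean) / sigma for v in values)
    return bound / largest

def number_embedding(v_norm, num_base):
    """Multiply the shared [NUM] base vector by the normalized value."""
    return [v_norm * b for b in num_base]
```

Because the embedding is a scalar multiple of one fixed vector, nearby values give nearby embeddings, which is the continuity property the paragraph above refers to; a value equal to the healthy mean maps to the zero vector.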

(a) Original text embeddings.
(b) Embeddings projected by MLPs.
Figure 4: Feature visualization using UMAP (number of components = 3) for numerical text embeddings derived from gait parameters. Yellow points in (b) represent the projections of the learned per-class text features. Images rendered with Polyscope.

Cross-modal contrastive learning. To better align the multimodal representation with our task, we exploit all accessible modalities, incorporating them into cross-modal contrastive learning: gait videos, class-specific descriptions, and numerical gait parameters. In our dataset, although gait videos are not always paired with a corresponding gait parameter set, each set of gait parameters is linked to a video and is therefore assigned a class label. We thus transform these gait parameters into numerical text embeddings using the encoding method described earlier, and introduce classification tasks thereon.
As illustrated in Fig.1, the projection of the generated text features $F^T_i$ and that of the numerical embedding $F^{num}$ are trained so that the cosine similarity between $P^{num}$ and the projected text feature of its ground-truth class $P^T$ is maximized, using a cross-entropy objective $L_{gp}$. The global loss function becomes $L = L_k + \omega \cdot L_{gp}$. We set $\omega = 0.05$ through heuristic analysis. To demonstrate the alignment of numerical embeddings with the multi-modal space, we visualize the embedding spaces before and after learning in Fig.4.
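A minimal sketch of the gait-parameter objective and the global loss is given below. The function names are hypothetical, and the temperature `tau` applied to the similarities is an assumption of this sketch (the text does not state whether $L_{gp}$ uses the same $\tau$ as Eq. (5)), so it is exposed as a parameter.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def gait_param_loss(p_num, p_text, label, tau=0.01):
    """Cross-entropy over similarities between one projected numerical
    embedding P^num and every projected class text feature."""
    logits = [cosine(p_num, p_t) / tau for p_t in p_text]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]  # equals -log softmax(logits)[label]

def global_loss(l_k, l_gp, omega=0.05):
    """L = L_k + omega * L_gp."""
    return l_k + omega * l_gp
```

Minimizing `gait_param_loss` raises the similarity to the ground-truth class relative to the other classes, which is how the numerical embeddings are pulled toward the matching class text feature.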

Interpreting the per-class text embedding. We aim at the effective translation of per-class text features $\{F^T_i\}$ into natural language expressions using the vocabulary of numerical gait parameters. To this end, we train a text decoder from scratch to transform numerical text embeddings $F^{num}$ back into their corresponding gait parameters, reverting the encoding scheme shown in Fig.3. A 4-layer transformer decoder $\textit{D}_T$ is employed for text decoding. In line with recent developments in text-only decoder pre-training [2023decap], we train $\textit{D}_T$ using prefix language modeling. Specifically, given a sentence composed of gait parameters $\mathbf{s} = \{\text{word}_1, \text{word}_2, \dots, \text{word}_L\}$, we generate a sequence of token IDs using the dictionary of $\textit{FCLIP}_T$. The numbers, previously normalized to [-2.5, 2.5], are rescaled to a graduated integer scale of $[0, N_{num}]$.
The token ID $tok$ of a number [num] is defined as $tok = [\textbf{EOS}] + \textit{scale}([\textbf{num}])$, where $[\textbf{EOS}] = 49407$. $\textit{D}_T$ learns to reconstruct the sequence of token IDs $\{tok_j\}$ starting from the numerical text embedding $F^{num}$. In addition to the vanilla cross-entropy loss [2023decap], we leverage an ordinal cross-entropy loss to further penalize the reconstruction error of the number values:

$$L_{num} = -\frac{|\hat{tok} - tok|}{[\textbf{EOS}] + N_{num} - 1} \sum_{m=1}^{[\textbf{EOS}] + N_{num}} y_{m} \log(p_{m}), \tag{7}$$

where $N_{num} = 200$, $|\hat{tok} - tok|$ represents the absolute distance between the ground-truth token ID $tok$ and the estimation $\hat{tok}$, $y$ denotes the one-hot encoded ground-truth label, and $p$ the estimated probability.
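The number-token mapping and Eq. (7) can be illustrated as follows. This is a sketch under assumptions: the linear rescaling with rounding inside `scale` is one plausible reading of "graduated integer scale", the predicted token ID is passed in directly (for example an argmax over the decoder output), and `probs` is a hypothetical dictionary from token ID to predicted probability.

```python
import math

EOS = 49407   # [EOS] token ID stated in the text
N_NUM = 200   # size of the integer scale, N_num

def scale(v_norm, low=-2.5, high=2.5, n_num=N_NUM):
    """Map a normalized value to an integer in [0, n_num] (assumed linear)."""
    clipped = min(max(v_norm, low), high)
    return round((clipped - low) / (high - low) * n_num)

def number_token_id(v_norm):
    """tok = [EOS] + scale([num])."""
    return EOS + scale(v_norm)

def ordinal_ce(probs, tok_true, tok_pred, n_num=N_NUM):
    """Eq. (7): cross-entropy weighted by the normalized token-ID distance.

    With a one-hot label, only the ground-truth token contributes to the sum.
    """
    weight = abs(tok_pred - tok_true) / (EOS + n_num - 1)
    return -weight * math.log(probs[tok_true])
```

Since the weight grows with the distance between predicted and true token IDs, a prediction that is numerically far off is penalized more than a near miss, while an exact prediction contributes nothing and is covered by the vanilla cross-entropy term alone.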
Benefiting from the proposed cross-modal contrastive learning scheme, $\{F^T_i\}$ can be represented as a linear combination $\{\hat{F}^{num}_i\}$ of the numerical text embeddings $F^{num}$, with weights computed by measuring the cosine similarity between the projected per-class text features $\{P_i\}$ and $P^{num}$. Subsequently, we apply $\textit{D}_T$ on $\{\hat{F}^{num}_i\}$ to generate natural language descriptions: $\{\hat{\textit{Desc}}_i\} = \textit{D}_T(\{\hat{F}^{num}_i\})$.
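The similarity-weighted combination can be sketched as below. The function names are hypothetical, and using the raw cosine similarities as weights, without any normalization or top-k selection, is an assumption of this sketch; the text only states that the weights come from cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def combine_numeric_embeddings(p_class, p_nums, f_nums):
    """Weighted sum of numerical text embeddings F^num for one class.

    The weight of each embedding is the cosine similarity between the
    projected class feature P_i and the matching projection P^num.
    """
    weights = [cosine(p_class, p) for p in p_nums]
    dim = len(f_nums[0])
    return [sum(w * f[d] for w, f in zip(weights, f_nums)) for d in range(dim)]
```

The resulting vector plays the role of $\hat{F}^{num}_i$ and would then be handed to the text decoder, which is not part of this sketch.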

3 Experiments and Results

Our study includes two classification tests: gait scoring, to estimate the severity of a patient’s condition based on a 4-class gait scoring (normal–0, slight–1, mild–2, and moderate–3) following MDS-UPDRS III [mds-updrs2008], and dementia subtyping, to distinguish between different dementia groups: normal / DLB (Dementia with Lewy Bodies) / AD (Alzheimer’s Disease). See the project page for detailed clinical gait descriptions of each class. Due to its limited size (a total of 120 videos), we divide our video dataset into training and validation sets and conduct 10-fold cross-validation for each classification task. Confusion matrices are provided in Fig.3 of Supplementary Material.

Table 1: Comparative analysis on two classification tasks: gait scoring (‘Gait scoring’) and dementia subtyping (‘Dem. group’). Model performance is evaluated using top-1 accuracy (‘Acc.’, %) and F1-score (‘Fscore’, %).

(a) Different model configurations

Model configurations    Gait scoring          Dem. group
                        Acc.      Fscore      Acc.      Fscore
Baseline                64.78     60.75       86.27     79.24
Baseline+KAPT           65.98     61.97       87.29     78.48
Baseline+NTE            64.44     57.64       88.26     81.34
Ours                    67.76     62.59       90.08     83.86

(b) SOTA methods

(b) Diagnostic groups.
Figure 2: Descriptions generated from per-class text features through the pretrained text decoder. Key criteria are highlighted in the respective class color.

(a) Gait score classification.
(b) Dementia subtype classification.
Figure 3: Confusion matrices for the classification tasks.