VicTR: Video-conditioned Text Representations for Activity Recognition
Abstract
Vision-Language models (VLMs) have excelled in the image domain, especially in zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image → video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary: better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.
1 Introduction
Video understanding poses significant challenges, often adding to the complications of the image domain such as model complexity and annotation costs. The additional temporal dimension and different modalities of data introduce useful cues, but can also be redundant, raising interesting questions about trade-offs. Activity Recognition (i.e., classification) in particular, as the prominent task in video understanding, has long been explored by the community in these research directions. Whether it is efficient architecture variants ranging from CNNs [31, 13, 57] to Transformers [2, 5, 12], training schemes from fully-supervised [7, 14] to self-supervised [47, 15, 53] or data regimes from unimodal [76, 64] to multimodal [20, 39], the progress has been steady and exciting. More recently, with the availability of internet-scale paired image-text data, the direction of vision-language models (VLMs) [50, 23] has emerged as dominant, achieving strong generalization across numerous benchmarks. However, the progress of VLMs in the video domain has yet to reach its full potential.
Following the seminal VLMs such as CLIP [50] and ALIGN [23], there have been significant strides in tasks such as image classification [90, 92, 83], open-vocabulary object detection [18, 38], text-to-image retrieval [61, 86] and robot manipulation [24, 91]. Such models are usually pretrained on paired image-text data based on a contrastive learning framework. The idea is to have two separate backbones— an Image Encoder and a Text Encoder, that generate embeddings in a joint latent space. To optimize this space, the corresponding pairs of embeddings are drawn closer, by increasing their similarity (i.e., Affinity). The key advantage of such models is that, at inference, any semantic concept (given as a text input) can be embedded in the same space, giving intriguing zero-shot or few-shot transfer capabilities [91, 1]. For instance, CLIP [50] excels at classifying unseen attribute categories (e.g. objects, scenes), or even counting such occurrences [91]. However, these VLMs do not perform well in tasks that require specialized knowledge, such as localizing (e.g. detection/segmentation) or temporal reasoning (e.g. activity recognition), at least not out-of-the-box, as their training objective has not seen any location or temporal cues. Yet, with task-specific finetuning, such models can readily be adapted to specialized domains [18, 40].
In the video domain, training VLMs from scratch may show limited success [77], while also being expensive, due to the lack of paired data at scale. As a compromise, the common practice is to adapt pretrained image-VLMs to video, by introducing temporal information. Such methods either insert temporal modules within the image backbone itself to have cross-frame interactions [40], or use a post-processing video head on top of the image backbone [36, 66, 4, 33]. In both cases, image embeddings are enhanced as video embeddings. However, the use of text embeddings varies among different approaches. Text may either be discarded [33], kept frozen [36, 66], used as conditioning [4] (to further enhance video embeddings), or fully updated jointly with video [40]. More often than not, the main focus is on visual embeddings (i.e., converting image → video), and the impact of updating text has been limited.
Nevertheless, video models benefit from semantic information [22, 91, 70]. In fact, certain attributes (e.g. objects, scene or human subjects) are directly tied with specific activities, and can simplify their recognition. For instance, the presence of attributes such as [rope, gym, one-person] can narrow down the potential activity to battling ropes or rope climbing. VLMs are especially suited to take advantage of such semantics. Any concept represented as text can be visually-grounded based on paired embeddings (in zero-shot), to extract relevant attributes for a given input that benefit recognition tasks. Such visually-grounded semantics are cheap in-terms of both annotation and compute costs, yet highly-useful.
Motivated by the above, we propose VicTR, focusing on adapting text information to the video domain. More specifically, we generate Video-conditioned Text embeddings (see Fig. 1), while jointly training both textual and visual features generated by an image-VLM. By finetuning text embeddings, we observe significant gains in our framework, compared to just finetuning visual embeddings (similar to the observations in [92]). We can also make use of freely-available auxiliary semantic information, represented in the form of visually-grounded text embeddings. Fig. 2 shows an overview of the proposed architecture. Our video-conditioned text embeddings are unique to each video, allowing more flexibility to move in the latent space and generalize to complex downstream tasks. Optionally, our video-conditioned auxiliary text can further help optimize this latent space. We evaluate VicTR on few-shot, zero-shot, short-form and long-form activity recognition, validating its strong generalization capabilities among video-VLMs.
2 Related Work
Video understanding
is about reasoning based on spatio-temporal inputs. Compared to image inputs, videos bring additional useful cues such as motion or multiple modalities (e.g. audio) into play, but also any associated complications such as increased compute requirements and redundancy in data. Convolutional networks (CNNs) [7, 76, 64, 68] and Recurrent models [11, 87] have been the state-of-the-art in video modeling, prior to the rise of Transformers [2, 5, 34, 55]. Multi-stream models [7, 14] that make use of different spatio-temporal views [14, 53] or modalities (e.g. optical-flow [7, 20], audio [39, 21, 54]) have emerged, tackling benchmark tasks such as activity recognition [28, 29], localization [60, 17, 87] or text-to-video retrieval [78]. To handle longer video inputs, models have focused on efficient temporal modeling [45, 44, 26], or memory mechanisms [72, 73, 58]. While Neural Architecture Search (NAS) has enabled efficient model designs [13, 57, 56], self-supervised methods [53, 20, 47, 15] have alleviated the high demand for annotated data. More recently, language-supervision has been of interest for video understanding due to the strong generalization capabilities shown in the image domain.
Vision-Language Models (VLMs)
are usually trained on internet-scale paired visual-language (e.g. image-text) data. Seminal works such as CLIP [50] and ALIGN [23] have shed light on the capabilities of such models, especially for zero-shot transfer. Since then, VLM literature has flourished, with applications in open-vocabulary object detection [18, 38], open-set classification [48], retrieval [61, 86, 3], captioning [85], segmentation [79, 51], robot manipulation [91, 24, 27] and many other domains. Although VLMs are generally trained on image-text data, there are intuitive variants which are trained either only on images [65] or only on text [41]. The commonly-used similarity-based objective of VLMs has also been repurposed to specialized domains, through prompt learning [95] or engineering [18, 42]. The text encoder of VLMs can be a powerful mapping from semantic concepts to latent embeddings [37]. Many foundation models [90, 1, 88] follow similar design principles as VLMs, thriving in zero-shot [19] or few-shot [95] settings. Recent work combining Large Language Models (LLMs) with VLMs shows how language can act as a communication medium between models [91, 94, 70]. In [37], the authors use an LLM to represent object classes as a set of their semantic attributes, to learn a better classifier.
As for video-VLMs, they are either trained from scratch on video-text data [77, 85], or more-often than not, finetuned initializing from a pretrained image-VLM [9, 83, 32]. Some are even trained on both image and video data paired with text [3]. The success of VLMs in the image domain has fueled similar research directions in the video domain.
Adapting image-text models to video
is a common practice when designing video-VLMs. A general and effective recipe for such adaptation is proposed in [9]. It consists of temporal modeling, multi-modal fusion, auxiliary training objectives, and both image/video data at scale. All others usually make use of a subset of these concepts. CLIP-ViP [81] is trained with different sources of data and multiple cross-modal training objectives. VideoCoCa [83] extends CoCa [88] with attention-pooled frame embeddings, which are used to decode text captions in a generative framework. MOV [48] is trained with additional audio/flow encoders through cross-modal attention, keeping image-text encoders frozen. Video-specific prompts can also be learned with such frozen encoders [25]. Vi-Fi [52] shows that simply finetuning CLIP image-text encoders without any specialized modules can generate video representations efficiently.
Apart from the above, there exists a body of prior work that closely relates to VicTR. ActionCLIP [66] upgrades its CLIP image encoder with (1) parameter-free temporal layers (TSM [31]) within the backbone, and (2) a temporal transformer head, while keeping the text encoder fixed. Similarly, CLIP4clip [36] just uses a temporal transformer head to update visual embeddings. CLIP Hitchhiker’s [4] generates text-conditioned video embeddings by temporally pooling frame embeddings, conditioned on each text query. In this case, a given video generates multiple different visual embeddings, one per text embedding. EVL [33] completely discards text. It acts as an initialization for a visual-only backbone, consisting of a CLIP image encoder and a temporal, class-conditioned decoder. X-CLIP [40] introduces trainable temporal layers within its backbone image encoder, and generates video-specific text prompts. That is, it finetunes both encoders, similar to ours. However, it does not allow interaction among text embeddings, nor with fine-grained visual information (but only with temporally-aggregated information). Hence, it shows limited gains from adapting text to the video domain. In contrast, our video-conditioned text embeddings, which are unique to each video, interact with both fine-grained visual embeddings and other text embeddings, to enable a better contrastive framework and, in turn, a more flexible alignment in the latent space.
3 Background: image-VLMs to video
In this section, we introduce the generic framework for adapting image-VLMs to video, and discuss how prior work fits into it. We consider CLIP [50] as the image-VLM, which is widely adopted thanks to its convincing performance and open-source models. It consists of two encoders: Image and Text, optimized together on internet-scale paired image-text data. The Image Encoder is a ViT [10]. Given an input image, it is broken down into patch embeddings (i.e., tokens) and processed through multiple transformer layers. The class token is sampled as the visual embedding. The Text Encoder is a causal transformer, operating on tokenized text. Each class-label (or, any semantic concept) given as text is first converted into a prompt based on a template such as “a photo of {class}.”, and tokenized with Byte Pair Encoding (BPE) [59] at the input of the Text Encoder. Following multiple causal transformer layers, the [EOS] (i.e., end-of-sequence) token is extracted as the text embedding.
The two encoders are jointly optimized with a Cross-Entropy loss, where logits are computed based on the similarities (i.e., affinities) between visual and text embeddings. The corresponding pairs of embeddings (i.e., positives) are drawn together (↑ affinity) in a joint embedding space, whereas the others (i.e., negatives) are pushed apart (↓ affinity).
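As an illustration of the objective described above, the following NumPy sketch computes cosine affinities between paired visual and text embeddings and applies a symmetric cross-entropy over them. The function names, the fixed temperature of 0.07 and the random toy inputs are assumptions made for this example; it is not the CLIP implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def affinity(visual, text):
    """Cosine similarity between every visual and every text embedding."""
    return l2_normalize(visual) @ l2_normalize(text).T

def cross_entropy(logits, targets):
    """Mean cross-entropy of integer targets under a row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def contrastive_loss(visual, text, temperature=0.07):
    """Symmetric loss: row i of `visual` is the positive for row i of `text`."""
    logits = affinity(visual, text) / temperature
    targets = np.arange(len(visual))
    return 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))

# Toy data (hypothetical): aligned pairs vs. the same pairs in the wrong order.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
aligned_visual = text + 0.01 * rng.normal(size=(4, 16))
loss_aligned = contrastive_loss(aligned_visual, text)
loss_mismatched = contrastive_loss(aligned_visual[::-1], text)
```

With these toy inputs, the loss is small when positives have high affinity and grows when the pairing is broken, which is the behaviour the objective is meant to enforce.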
When adapting this framework to the video domain, the above Image Encoder, Text Encoder and the learning objective usually stay the same. But now, video frames become inputs to the Image Encoder (each being processed separately), and further go through a Video Head to induce temporal reasoning capabilities. Optionally, text embeddings may also be updated or used as conditioning within the Video Head.
This Video Head may just be a temporal pooling layer or a temporal transformer as in [36, 66], or may even consist of more specialized modules. Text embeddings, which are an optional input to the Video Head, could either be discarded as in [33], used as conditioning as in [4], or jointly updated with video embeddings as in [40]. Finally, logits are computed based on video-text affinities if text is not discarded, or as a linear mapping of video embeddings if text is discarded. This generic framework is shown in Fig. 3 (top-left), along with variations of prior work in Fig. 3 (bottom-left).
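The generic framework can be sketched in a few lines, assuming the simplest possible Video Head (temporal mean pooling) and affinity-based logits. All names, shapes and the toy inputs below are hypothetical; frame and text embeddings are random stand-ins for the outputs of the image-VLM encoders.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def video_head_mean_pool(frame_embeddings):
    """Simplest Video Head: average the per-frame embeddings over time."""
    return frame_embeddings.mean(axis=0)

def video_text_logits(frame_embeddings, text_embeddings, video_head=video_head_mean_pool):
    """Frames (T, D) and class texts (N, D) -> one affinity-based logit per class (N,)."""
    video_embedding = video_head(frame_embeddings)
    return normalize(text_embeddings) @ normalize(video_embedding)

# Toy inputs: every frame of the video lies close to the text embedding of class 2.
rng = np.random.default_rng(0)
T, N, D = 8, 5, 32
text_embeddings = rng.normal(size=(N, D))
frame_embeddings = text_embeddings[2] + 0.1 * rng.normal(size=(T, D))
logits = video_text_logits(frame_embeddings, text_embeddings)
predicted_class = int(np.argmax(logits))
```

Swapping `video_head` for a temporal transformer, or letting it also consume the text embeddings, covers the other variants discussed above.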
4 Video-conditioned Text Representations
In VicTR, we adapt a pretrained image-VLM (e.g. CLIP [50]) to video, focusing more on text representations. Refer to Fig. 3 (right) for a detailed view. The image-VLM has not seen any temporal information during training. While this obviously affects the temporal reasoning capabilities of the visual embeddings, which most prior work focuses on addressing, it also affects the text embeddings. The learnt latent space (and the affinity-based objective) depends on both these embeddings. Thus, we consider text equally important, if not more so, in contrast to prior work.
VicTR consists of a joint video-text model as its Video Head, which consumes both visual and text embeddings from the image-VLM. It outputs text embeddings uniquely specified for each video, i.e., Video-conditioned Text embeddings. It relies on three main components: (1) Token-boosting, (2) Cross-modal attention, and (3) Affinity (re-)weighting. Optionally, it can also benefit from any semantic concept available as auxiliary text, to optimize its latent space. The following subsections look at each of these in detail.
Let us first describe the inputs to our Video Head. Consider a fixed vocabulary of activity classes, and an optional vocabulary of auxiliary semantic concepts, each represented by a text embedding from the Text Encoder. Also, given an input video, each of its frames is represented by an image embedding from the Image Encoder. The inputs to our Video Head are these frame tokens, activity-text tokens and auxiliary-text tokens. As visual embeddings are extracted per frame and text embeddings per prompt, there is no interaction among frame tokens, among text tokens, or across frame-text tokens up to this point.
4.1 Token-boosting
To introduce video-conditioned text embeddings, we first create a dedicated set of text tokens per video, by replicating the outputs of the backbone text encoder. Going further, we also create text tokens for each frame. This is done by weighting text tokens with the corresponding frame-text affinities. Formally, given one text token per class and one visual token per frame, we end up with (number of frames) × (number of classes) dedicated text tokens per video, at the input of our Video Head. Refer to Fig. 3 (right).
The weights, SigAffinity(·), correspond to affinity-weights normalized to the [0, 1] range. We convert the values given by Affinity(·), which lie in [-1, 1], into affinity-weights by scaling with a learnable weight and feeding through a sigmoid activation.
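A minimal sketch of token-boosting, under stated assumptions: affinities are cosine similarities, the learnable scale is replaced by a fixed constant (`scale=5.0` is an arbitrary illustrative value), and the names `sig_affinity` and `boost_text_tokens` are our own.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sig_affinity(frame_tokens, text_tokens, scale):
    """Frame-text cosine affinities in [-1, 1], squashed into (0, 1). Shape (T, N)."""
    affinity = normalize(frame_tokens) @ normalize(text_tokens).T
    return sigmoid(scale * affinity)

def boost_text_tokens(frame_tokens, text_tokens, scale=5.0):
    """Replicate the N text tokens for each of the T frames, weighted by affinity."""
    weights = sig_affinity(frame_tokens, text_tokens, scale)   # (T, N)
    return weights[:, :, None] * text_tokens[None, :, :]       # (T, N, D)

# Toy inputs: all frames lie close to text token 1, so that token is boosted most.
rng = np.random.default_rng(0)
T, N, D = 4, 3, 8
text_tokens = rng.normal(size=(N, D))
frame_tokens = text_tokens[1] + 0.1 * rng.normal(size=(T, D))
weights = sig_affinity(frame_tokens, text_tokens, scale=5.0)
boosted = boost_text_tokens(frame_tokens, text_tokens)
```

Each frame now carries its own copy of every text token, with tokens that match the frame kept close to full magnitude and the rest scaled down.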
Although such affinity-weights based on the original image-VLM embeddings are not ideal for temporal reasoning, they initialize a noisy version of our video-conditioned text embeddings that gets updated iteratively, later in the network. Such token-boosting brings multiple other benefits. (1) More tokens mean higher model capacity. This can help learn better representations, but also adds a compute overhead (which we handle through other measures, as discussed later). (2) It also highlights relevant text tokens by grounding text on visual embeddings, while diminishing irrelevant ones. Subsequent attention mechanisms attend less to such diminished tokens, simplifying the gradient flow during learning. In other words, it acts as a soft selection of relevant semantics, specific to each video. (3) Finally, it enables our model to capture variations of semantic categories over time. How certain attributes appear (or disappear) over time is an important motion cue for activity recognition.
Next, we concatenate such boosted text tokens with visual tokens (one per frame), and feed all tokens to the subsequent layers.
These tokens go through a stack of transformer layers in our Video Head. Each layer consists of cross-modal attention, temporal attention, affinity (re-)weighting and linear (MLP) layers.
4.2 Cross-modal and Temporal attention
We consider our token representation to be two-dimensional (i.e., cross-modal and temporal), and apply divided self-attention (MSA) on each axis as in [2, 5]. First, we have a Cross-modal attention layer. Here, each visual token can attend to all text tokens at the same timestep, and each text token can attend to both the visual token and other text tokens at the same timestep. Since text tokens are already affinity-weighted, attention weights do not draw information from irrelevant semantic classes. Next, we have a Temporal attention layer. Here, both visual and text tokens go through a shared set of parameters, learning temporal cues in the visual modality, and modeling variations of semantics across time in the textual modality.
LayerNorm is applied within each attention block. Having divided attention across the two axes, instead of joint attention, eases the compute requirement of our Video Head.
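The divided attention described in this subsection can be sketched as follows. This is a simplified illustration, not the actual layer: it assumes single-head attention, a pre-LayerNorm residual layout, no learnable LayerNorm parameters, and random projection matrices; all names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the second-to-last axis of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def divided_attention(tokens, cross_params, temporal_params):
    """tokens: (T, S, D), with S = 1 visual token + N text tokens per frame."""
    # Cross-modal attention: tokens of the same timestep attend to each other.
    tokens = tokens + self_attention(layer_norm(tokens), *cross_params)
    # Temporal attention: each token slot attends across timesteps (shared weights).
    per_slot = np.swapaxes(tokens, 0, 1)                      # (S, T, D)
    per_slot = per_slot + self_attention(layer_norm(per_slot), *temporal_params)
    return np.swapaxes(per_slot, 0, 1)                        # back to (T, S, D)

rng = np.random.default_rng(0)
T, N, D = 4, 3, 8
S = 1 + N
tokens = rng.normal(size=(T, S, D))
cross_params = tuple(0.1 * rng.normal(size=(D, D)) for _ in range(3))
temporal_params = tuple(0.1 * rng.normal(size=(D, D)) for _ in range(3))
updated = divided_attention(tokens, cross_params, temporal_params)

# Pairwise attention scores: divided (per axis) vs. joint over all T * S tokens.
divided_pairs = T * S * S + S * T * T
joint_pairs = (T * S) ** 2
```

Even at this toy size the divided variant computes fewer pairwise scores than joint attention, and the gap widens as the number of frames or text tokens grows.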
4.3 Affinity (re-)weighting
As previously discussed, the original affinities based on the image-VLM embeddings can be noisy in the context of temporal reasoning. Now that we have updated both our visual (i.e., video) and text tokens with cross-modal and temporal information, they are in a better state to re-compute affinities. Hence, we compute new affinity values and re-weight the text tokens accordingly. Refer to Fig. 3 (rightmost). First, we split the tokens back into video and text tokens.
Next, we temporally pool the text tokens to come up with a compressed representation, on which we perform affinity re-weighting. This is similar to token-boosting, but done with updated video-text embeddings that are already video-conditioned. Without loss of generality, the same operations apply to auxiliary text tokens.
Finally, such affinity (re-)weighted text tokens are concatenated with visual tokens, and go through an MLP.
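One possible reading of the three steps above is sketched below. It assumes that the temporally pooled text tokens are re-weighted against each frame's updated visual token (mirroring token-boosting), that temporal pooling is a mean, and that the learnable scale is a fixed constant; the trailing MLP is omitted. These are assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def affinity_reweight(tokens, scale=5.0):
    """tokens: (T, 1 + N, D); slot 0 holds the visual token of each frame."""
    # Step 1: split video and text tokens.
    video_tokens, text_tokens = tokens[:, 0, :], tokens[:, 1:, :]   # (T, D), (T, N, D)
    # Step 2: temporally pool text, then re-weight it with fresh affinities.
    pooled_text = text_tokens.mean(axis=0)                          # (N, D)
    affinity = normalize(video_tokens) @ normalize(pooled_text).T   # (T, N)
    weights = sigmoid(scale * affinity)
    reweighted = weights[:, :, None] * pooled_text[None, :, :]      # (T, N, D)
    # Step 3: concatenate with the visual tokens again.
    return np.concatenate([video_tokens[:, None, :], reweighted], axis=1)

rng = np.random.default_rng(0)
T, N, D = 4, 3, 8
tokens = rng.normal(size=(T, 1 + N, D))
reweighted_tokens = affinity_reweight(tokens)
```

The output keeps the input layout, so the same operation can be repeated in every layer of the Video Head.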
4.4 Classifier
Following the transformer layers in our Video Head, we temporally pool all tokens. We end up with a single video embedding, one activity-text embedding per activity class and one aux-text embedding per auxiliary class. We further aggregate auxiliary embeddings, leaving a single embedding for each of the semantic categories (e.g. object, scene, human-subjects). Finally, we compute logits based on affinity, similar to the CLIP [50] objective, and use a Cross-Entropy loss for optimization.
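The classifier can be sketched as below, assuming mean pooling over time, cosine affinities as logits, and a fixed temperature of 0.07. How the auxiliary embeddings enter the loss is not reproduced here; the sketch only shows the per-category averaging. Names, shapes and the toy category assignment are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(tokens, num_classes, aux_category_ids):
    """tokens: (T, 1 + N + M, D) -> activity logits (N,), aux category embeddings (C, D)."""
    pooled = tokens.mean(axis=0)                        # temporal pooling
    video = pooled[0]
    activity_text = pooled[1:1 + num_classes]
    aux_text = pooled[1 + num_classes:]
    # Average the auxiliary embeddings that belong to the same semantic category.
    categories = np.unique(aux_category_ids)
    aux_embeddings = np.stack(
        [aux_text[aux_category_ids == c].mean(axis=0) for c in categories]
    )
    activity_logits = normalize(activity_text) @ normalize(video)
    return activity_logits, aux_embeddings

def cross_entropy(logits, target):
    """Cross-entropy of a single example under a softmax over logits."""
    shifted = logits - logits.max()
    return float(np.log(np.exp(shifted).sum()) - shifted[target])

rng = np.random.default_rng(0)
T, N, D = 4, 5, 16
aux_category_ids = np.array([0, 0, 0, 1, 1, 2])  # e.g. 3 objects, 2 scenes, 1 people-count
tokens = rng.normal(size=(T, 1 + N + len(aux_category_ids), D))
logits, aux_embeddings = classify(tokens, N, aux_category_ids)
loss = cross_entropy(logits / 0.07, target=2)
```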
4.5 Discussion on design decisions
Auxiliary semantic information:
We rely on optional semantics (or, attributes) in the form of visually-grounded auxiliary text, to improve our video-conditioned text embeddings. This is guided by a loss on the auxiliary embeddings. The vocabulary of such auxiliary text is fixed (i.e., common for all videos) per dataset. On Charades, we consider 97 auxiliary text classes, and on Kinetics-400, we use 88 classes (refer to the appendix for more details). To highlight only the relevant semantics for a given video, we visually ground them via (1) cross-modal attention with visual embeddings, and (2) affinity weighting. Finally, to compute this loss, we create one representative embedding per semantic category, by average pooling aux embeddings within a category (4 categories for Charades and 3 for Kinetics-400).
Alternative weighting schemes:
Our text (re-)weighting method is similar to a contrastive training objective (as in CLIP [50]), which is based on visual-text affinities. We find this complementary nature beneficial. It highlights relevant text (and diminishes irrelevant text) within each intermediate layer of our Video Head. This iterative process fixes the initial noisy affinities resulting from the original image-VLM embeddings, when fused with better temporal cues in subsequent layers. We also explored other weighting schemes such as learnable weights or attention-based weights, which are not directly connected to the training objective. They do not provide any improvements.
Visual-only or Text-only classifiers:
We also explored different classifiers (i.e., how we compute logits), considering (1) a visual-only classifier as in [33], (2) a text-only classifier, or (3) an affinity-based classifier as in [50, 40]. The last performs the best. Even though we primarily focus on updating text embeddings, it still makes sense to rely on video-text affinities to be the training objective (or, classifier), as it is complementary to the components within our Video Head.
5 Experiments
To validate the merits of VicTR, we experiment on few-shot and zero-shot activity recognition (on HMDB-51 [29] and UCF-101 [62]), as well as short-form (on Kinetics-400 [28]) and long-form recognition (on Charades [60]). The following subsections detail our implementation, evaluation settings, datasets and results.
Implementation details:
We use a pretrained CLIP [50] as our image-VLM backbone. Our Video Head is randomly-initialized having 4 transformer blocks similar to [66], which is applied on-top of CLIP backbones. We consider an embedding dimension of 512/768 (w/ heads 8/12) corresponding to CLIP B/16 and L/14 backbone variants. Our output video-text embeddings are further mapped into 256-dimensional embeddings prior to computing affinity-based logits. We use an AdamW [35] optimizer with a cosine schedule for training. On Kinetics-400 [28], we finetune our model for 30 epochs with a batch size of 256 using 8e-6/8e-5 learning rates for backbone/newly-initialized parameters, similar to [40]. On Charades [60], we finetune for 50k iterations with a batch size of 64 using 5e-7/5e-4 learning rates for backbone/newly-initialized parameters, similar to [4]. We use augmentations and input sampling strategies similar to [40] for Kinetics-400 and similar to [33] for Charades.
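To make the two-rate training setup concrete, here is a small sketch of a cosine schedule applied to separate backbone and Video Head learning rates, using the Charades values quoted above. The decay-to-zero form and the absence of warm-up are assumptions, and the group names are hypothetical.

```python
import math

def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr to zero; warm-up is omitted for brevity."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Charades settings from the text: backbone / newly-initialized parameters.
BASE_LRS = {"backbone": 5e-7, "video_head": 5e-4}
TOTAL_STEPS = 50_000

def learning_rates(step):
    """Learning rate of each parameter group at a given training iteration."""
    return {name: cosine_lr(lr, step, TOTAL_STEPS) for name, lr in BASE_LRS.items()}

lrs_start = learning_rates(0)
lrs_middle = learning_rates(TOTAL_STEPS // 2)
lrs_end = learning_rates(TOTAL_STEPS)
```

Keeping the backbone rate three orders of magnitude below the head rate lets the randomly-initialized Video Head train quickly while the pretrained encoders move only slightly.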
Evaluation settings:
In our experiments, we compare against prior art VLMs on each dataset. Since the direction of adapting image-VLMs to video is relatively recent, their absolute performance may not be the state-of-the-art in some cases (e.g. long-form recognition), but we report numbers in comparable settings. For each experiment, we report pretraining settings, #frames-per-view, #views-at-inference and compute-per-view (GFLOPs) as supplementary metrics. We evaluate single-label activity recognition performance with Top-1 accuracy (%), and multi-label recognition with mean Average Precision (mAP %). When reporting FLOPs, we consider the cost of computing a single affinity-based logit (i.e., the cost for one video-text pair) similar to [40].
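For reference, the two metrics can be computed as in the generic sketch below. It is not the official evaluation script of any benchmark (which may handle ties or empty classes differently), and the toy scores are made up for illustration.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Percentage of samples whose highest-scoring class is the labelled one."""
    return float((logits.argmax(axis=1) == labels).mean() * 100.0)

def average_precision(scores, targets):
    """AP of one class: mean of precision@k over the ranks k of the positives."""
    order = np.argsort(-scores)
    hits = targets[order]
    if hits.sum() == 0:
        return float("nan")
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(scores, targets):
    """Mean of per-class AP (in %), skipping classes without any positive."""
    aps = [average_precision(scores[:, c], targets[:, c]) for c in range(scores.shape[1])]
    return float(np.nanmean(aps) * 100.0)

# Single-label toy example: 2 of 3 predictions are correct.
toy_logits = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
toy_labels = np.array([0, 1, 1])
toy_top1 = top1_accuracy(toy_logits, toy_labels)

# Multi-label toy example: 4 videos, 2 classes.
toy_scores = np.array([[0.9, 0.9], [0.2, 0.2], [0.6, 0.6], [0.1, 0.1]])
toy_targets = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_map = mean_average_precision(toy_scores, toy_targets)
```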
5.1 Few-shot and Zero-shot Transfer
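Table 1: Few-shot transfer on HMDB-51 and UCF-101 (Top-1 %), with 2, 4, 8 or 16 training clips per class.

Model | HMDB-51 (2) | HMDB-51 (4) | HMDB-51 (8) | HMDB-51 (16) | UCF-101 (2) | UCF-101 (4) | UCF-101 (8) | UCF-101 (16) |
---|---|---|---|---|---|---|---|---|
Methods w/o image-text pretraining | | | | | | | | |
TSM [31] | 17.5 | 20.9 | 18.4 | 31.0 | 25.3 | 47.0 | 64.4 | 61.0 |
TimeSformer [5] | 19.6 | 40.6 | 49.4 | 55.4 | 48.5 | 75.6 | 83.7 | 89.4 |
Video-Swin-B [34] | 20.9 | 41.3 | 47.9 | 56.1 | 53.3 | 74.1 | 85.8 | 88.7 |
Methods w/ image-text pretraining | | | | | | | | |
X-CLIP [40] | 53.0 | 57.3 | 62.8 | 64.0 | 76.4 | 83.4 | 88.3 | 91.4 |
X-Florence [40] | 51.6 | 57.8 | 64.1 | 64.2 | 84.0 | 88.5 | 92.5 | 94.8 |
VicTR (B/16) | 60.0 | 63.2 | 66.6 | 70.7 | 87.7 | 92.3 | 93.6 | 95.8 |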
Data:
We consider the downstream datasets HMDB-51 [29] and UCF-101 [62] to evaluate few-shot and zero-shot performance of our model. UCF-101 is a classification dataset collected from YouTube. It contains 13k clips annotated with 101 action classes. HMDB-51 is relatively small and contains 7k clips with 51 annotated classes. Both datasets have three splits of training/test data. In few-shot evaluation, we randomly sample 2, 4, 8, or 16 clips per class to create our training sets, same as in [40]. We use a model pretrained on Kinetics-400 [28] for 10 epochs and finetune on few-shot examples for 50 epochs, using 32-frames per view as in [40].
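The few-shot training sets can be built with a simple per-class sampler, sketched below. The clip identifiers, the seed handling and the function name are hypothetical; only the "2, 4, 8 or 16 clips per class" protocol comes from the text.

```python
import random
from collections import defaultdict

def sample_few_shot(clips, shots, seed=0):
    """clips: iterable of (clip_id, class_label). Keeps up to `shots` clips per class."""
    by_class = defaultdict(list)
    for clip_id, label in clips:
        by_class[label].append(clip_id)
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_class):
        chosen = rng.sample(by_class[label], min(shots, len(by_class[label])))
        subset.extend((clip_id, label) for clip_id in chosen)
    return subset

# Toy dataset: 3 classes with 20 clips each.
toy_clips = [(f"clip_{c}_{i}", c) for c in range(3) for i in range(20)]
few_shot_sets = {k: sample_few_shot(toy_clips, k) for k in (2, 4, 8, 16)}
```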
Few-shot results:
In Table 1, we report top-1 accuracy on the first of the three test splits in each dataset, using a single view at inference. VicTR significantly outperforms prior art, either w/o image-text pretraining (TSM [31], TimeSformer [5], Video-Swin [34]) or w/ such pretraining (X-CLIP [40], X-Florence [40]). Although our method uses backbones similar to X-CLIP, it even outperforms X-Florence (an extension of a more generic foundation model) on both datasets consistently. This shows the effectiveness of our video-conditioned text embeddings when generalizing to downstream tasks with few training samples.
Table 2: Zero-shot transfer on HMDB-51 and UCF-101 (Top-1 %, mean ± standard deviation over three splits).

Model | #Frames | HMDB-51 | UCF-101 |
---|---|---|---|
Methods w/o image-text pretraining | | | |
MTE [80] | - | 19.7 ± 1.6 | 15.8 ± 1.3 |
ASR [67] | 16 | 21.8 ± 0.9 | 24.4 ± 1.0 |
ZSECOC [49] | - | 22.6 ± 1.2 | 15.1 ± 1.7 |
UR [96] | 1 | 24.4 ± 1.6 | 17.5 ± 1.6 |
TS-GCN [16] | 16 | 23.2 ± 3.0 | 34.2 ± 3.1 |
E2E [6] | 16 | 32.7 | 48.0 |
ER-ZSAR [8] | - | 35.3 ± 4.6 | 51.8 ± 2.9 |
Methods w/ image-text pretraining | | | |
ActionCLIP [66] | 32 | 40.8 ± 5.4 | 58.3 ± 3.4 |
X-CLIP [40] | 32 | 44.6 ± 5.2 | 72.0 ± 2.3 |
VicTR (B/16) | 32 | 51.0 ± 1.3 | 72.4 ± 0.3 |
Zero-shot results:
We report zero-shot transfer performance in Table 2. We use a model pretrained for 10 epochs on Kinetics-400 [28] with 32 frames per view, similar to [40], and transfer to the downstream datasets. We report the mean and standard deviation over three splits. VicTR-B/16 outperforms X-CLIP [40] by 6.4 points on HMDB-51 and by 0.4 points on UCF-101. Also, the performance of our model is more stable across splits. This validates that the learned video-conditioned text embeddings can generalize, even w/o seeing the downstream categories during pretraining.
5.2 Short-form Activity Recognition
Table 3: Short-form activity recognition on Kinetics-400 (Top-1 %). GFLOPs are reported per view.

Model | Pretrain | #Frames | #Views | GFLOPs | Top-1 |
---|---|---|---|---|---|
Methods w/o image-text pretraining | | | | | |
Video-Swin-L (384) [34] | IN-21K | 32 | 10×5 | 2107 | 84.9 |
TimeSformer-L [5] | IN-21K | 96 | 1×3 | 2380 | 80.7 |
MTV-L [82] | JFT-300M | 32 | 4×3 | 1504 | 84.3 |
Video-SwinV2-G (384) [34] | IN-21K+ | 8 | 4×5 | - | 86.8 |
MViTv2-L (312) [30] | - | 40 | 5×3 | 2828 | 86.1 |
ViViT-L FE [2] | JFT-300M | 32 | 1×3 | 3980 | 83.5 |
TokenLearner [55] | JFT-300M | 64 | 4×3 | 4076 | 85.4 |
CoVeR-L [93] | JFT-3B | - | 1×3 | - | 87.2 |
Methods w/ image-text pretraining | | | | | |
ST-Adapter [43] | CLIP | 32 | 1×3 | 2749 | 87.2 |
Text4Vis [74] | CLIP | 32 | 1×3 | 1662 | 87.1 |
EVL [33] | CLIP | 8 | 1×3 | 674 | 86.3 |
X-CLIP [40] | CLIP | 8 | 4×3 | 658 | 87.1 |
VicTR (L/14) | CLIP | 8 | 4×3 | 656 | 87.0 |
Data:
Kinetics-400 [28] is a large-scale activity recognition dataset, with 240k training and 20k validation videos. Each clip is short (about 10s) and carries a video-level annotation for a single activity out of 400 categories.
Results:
We report the performance of VicTR on Kinetics-400 short-form activity recognition in Table 3. We consider L/14 with 8 frames per view, while using 4×3 such views at inference similar to [40]. Our method shows competitive performance at a similar footprint to closely-related video-VLMs [40, 33]. It is also competitive with CoVeR-L [93], which is trained with 10× more data. VicTR outperforms MTV [82] by 2.7 points, ViViT [2] by 3.5 points and TokenLearner [55] by 1.6 points, all trained on a similar scale of data, while being more efficient.
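Multi-view inference can be sketched as follows, assuming that 4×3 denotes 4 temporal clips × 3 spatial crops and that per-view softmax scores are averaged; the aggregation rule is a common convention and an assumption here, not a detail stated in the text. The random logits are placeholders for model outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_views(view_logits):
    """view_logits: (temporal_views, spatial_crops, num_classes) -> (num_classes,)."""
    return softmax(view_logits, axis=-1).mean(axis=(0, 1))

# Placeholder logits for 4 x 3 views over 400 classes; class 7 is made dominant.
rng = np.random.default_rng(0)
view_logits = rng.normal(size=(4, 3, 400))
view_logits[..., 7] += 6.0
video_scores = aggregate_views(view_logits)
predicted_class = int(video_scores.argmax())

# Total inference cost for VicTR (L/14): views x GFLOPs-per-view from the table.
total_gflops = 4 * 3 * 656
```

Since the tables report compute per view, the total cost of a prediction scales with the number of views (here 12 × 656 = 7872 GFLOPs).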
5.3 Long-form Activity Recognition
Table 4: Long-form activity recognition on Charades (mAP %). GFLOPs are reported per view.

Model | Pretrain | #Frames | #Views | GFLOPs | mAP |
---|---|---|---|---|---|
Methods w/o image-text pretraining | | | | | |
I3D + NL [68] | K400 | 128 | 10×3 | 544 | 37.5 |
EvaNet [46] | K400 | 64 | - | - | 38.1 |
LFB-101 [71] | K400 | 32 | 10×3 | 529 | 42.5 |
SlowFast-50 [14] | K400 | 8+32 | 10×3 | 66 | 38.0 |
SlowFast-101 + NL [14] | K400 | 16+64 | 10×3 | 234 | 42.5 |
X3D-XL (312) [13] | K400 | 16 | 10×3 | 48 | 43.4 |
MViT [12] | K400 | 32 | 10×3 | 237 | 47.7 |
AssembleNet-101 [57] | - | 128 | 5×1 | 1200 | 58.6 |
Methods w/ image-text pretraining | | | | | |
ActionCLIP [66] | CLIP | 32 | 10×3 | 563 | 44.6 |
CLIP4clip [36] | CLIP | 32 | 1×1 | - | 32.0 |
CLIP Hitchhiker’s [4] | CLIP | 32 | 1×1 | - | 44.9 |
VicTR (B/16) | CLIP | 32 | 4×1 | 567 | 50.1 |
VicTR (L/14) | CLIP | 32 | 4×1 | 2602 | 57.6 |
Data:
Charades [60] is a small-yet-challenging activity recognition dataset with 9.8k long-form videos. It comes with frame-level annotations of 157 daily household activities. Yet, the benchmark setting requires making video-level predictions. The data is split as 7.9k for training and 1.8k for validation. Each video contains multiple overlapping activities, and has an average duration of 30s.
Results:
We report the performance of VicTR on Charades long-form activity recognition in Table 4. Here, we consider both B/16 and L/14 model variants with 32 frames per view, while having 4×1 such views at inference. Our method outperforms prior video-VLMs by a considerable margin. In fact, VicTR-B/16 shows a 5.2 mAP boost over CLIP Hitchhiker’s [4], and a 5.5 mAP boost over ActionCLIP [66] with a similar footprint. This is a significant improvement considering the challenging Charades settings. Our method is also competitive with non-VLMs, whereas other video-VLMs lag behind. It highlights the limitations of current VLMs in long-context temporal modeling.
5.4 Ablation Study
Table 5: Ablation on auxiliary text and on the embeddings being updated.

Model | Kinetics-400 (Top-1 %) | Charades (mAP %) |
---|---|---|
VicTR | 84.4 | 50.1 |
VicTR (No Aux. Text) | 84.2 | 49.8 |
VicTR (w/ CLIP Visual emb.) | 84.0 | 49.7 |
VicTR (w/ CLIP Text emb.) | 83.3 | 41.7 |

Table 6: Ablation on design decisions, on Charades.

Model | mAP (%) |
---|---|
VicTR | 50.1 |
VicTR (No Affinity weighting) | 48.8 |
VicTR (w/ joint-attention) | 44.8 |
VicTR (Text Classifier) | 41.2 |
VicTR (Visual Classifier) | 43.1 |
In Table 5, we provide evidence to validate our main hypotheses. Namely, we evaluate the impact of auxiliary semantics and the effectiveness of updating text embeddings.
Auxiliary semantics do help.
We rely on extra semantic information to guide our latent embedding space. We see that such auxiliary text gives a 0.2 point gain on Kinetics-400 and a 0.3 mAP gain on Charades. This conveys the potential of semantics, but also the limitations of not having ground-truth annotations corresponding to them.
Updating text embeddings is more effective.
To evaluate which of our embeddings (video or video-conditioned text) are critical, we replace them with the corresponding original CLIP [50] embeddings (i.e., temporally-pooled frame, or text). We see that the proposed video-conditioned text embeddings are significantly more effective: when they are replaced, the performance drops by 1.1 points on Kinetics-400 and by 8.4 mAP on Charades. In contrast, when our video embeddings are replaced, the performance drops by only 0.4 points and 0.4 mAP, respectively. That is, the CLIP frame embeddings are on par with our video embeddings, but our video-conditioned text embeddings are significantly improved.
In Table 6, we ablate and justify our design decisions. Namely, we evaluate our affinity weighting mechanism, divided attention, and affinity-based classifier.
Affinity-weighting and divided attention do help.
We see a 1.3 mAP performance gain from our affinity (re-)weighting mechanism. While joint attention may be more expressive than divided attention, it can incur training difficulties. As a result, we see divided attention enjoying a significant 5.3 mAP boost.
Affinity-based classifier is required.
As previously discussed, our affinity weighting mechanism makes more sense in the context of the same affinity-based loss formulation. To verify this, we replace the affinity-based logits with text-only or visual-only logits, which are just linear mappings of the corresponding embeddings. These significantly underperform, dropping by 8.9 mAP and 7.0 mAP, respectively (41.2 and 43.1 vs. 50.1).
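To make the contrast concrete, the following is a minimal sketch of the two kinds of logits compared in this ablation. It assumes a CLIP-style cosine similarity for the affinity-based logits. The function names, the `temperature` parameter and the toy vectors are hypothetical and may differ from the actual implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def affinity_logits(video_emb, text_embs, temperature=1.0):
    """Affinity-based logits (assumed form): cosine similarity between one
    video embedding and each class text embedding, divided by a temperature."""
    v = l2_normalize(video_emb)
    return [
        sum(a * b for a, b in zip(v, l2_normalize(t))) / temperature
        for t in text_embs
    ]


def linear_logits(emb, weights, bias):
    """Single-modality baseline: a plain linear mapping of one embedding."""
    return [sum(w * x for w, x in zip(row, emb)) + b for row, b in zip(weights, bias)]


# Toy example with 2-D embeddings and two classes.
video = [1.0, 0.0]
texts = [[2.0, 0.0], [0.0, 3.0]]
aff = affinity_logits(video, texts)
lin = linear_logits(video, weights=[[0.5, 0.0], [0.0, 0.5]], bias=[0.0, 0.0])
print(aff, lin)
```

The affinity-based variant depends on both modalities at once, whereas the linear baseline sees only one embedding, which is the distinction this ablation probes.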
6 Conclusion
In this paper, we introduced VicTR, a framework for adapting image-VLMs to video, with a focus on video-conditioned text embeddings. It can also benefit from freely-available auxiliary semantic information in the form of visually-grounded text, to guide the learned latent space. Our evaluations verified the importance of updating text embeddings, across multiple activity recognition benchmarks, under few-shot, zero-shot, short-form and long-form settings. We believe that this work reveals the importance of using language embeddings for temporal reasoning.
Appendix
Details on auxiliary text classes:
On Charades [60], we use 97 auxiliary classes: 43 objects, 15 places, 5 people-counts and 34 atomic-actions. People-count prompts are manually-selected, whereas the others are already annotated in the dataset. On Kinetics-400 [28], we use 88 auxiliary classes: 40 objects, 43 places and 5 people-counts. Atomic-actions on Kinetics-400 are too diverse to be categorized as a concise set, and thus omitted. On Kinetics-400, people-counts are similarly selected, and the others are generated by prompting ChatGPT3.5 with the set of 400 activity classes. The auxiliary vocabulary for each dataset is given below.
On Charades [60], we have the following:
Objects: bag, bed, blanket, book, box, broom, chair, closet, cabinet, clothes, cup, glass, bottle, dish, door, doorknob, doorway, floor, food, groceries, hair, hands, laptop, light, medicine, mirror, paper, notebook, phone, camera, picture, pillow, refrigerator, sandwich, shelf, shoe, sofa, couch, table, television, towel, vacuum, window.
Places: basement, garage, pantry, recreation room, walk-in closet, laundry room, stairs, hallway, dining room, entryway, home office, bathroom, kitchen, bedroom, living room.
People: no people, one person, two people, three people, several people.
Atomic-actions: doing nothing, awakening, closing, cooking, dressing, drinking, eating, fixing, grasping, holding, laughing, lying, making, opening, photographing, playing, pouring, putting, running, sitting, smiling, sneezing, snuggling, standing, taking, talking, throwing, tidying, turning, undressing, walking, washing, watching, working.
On Kinetics-400 [28], we have the following:
Objects: bow and arrow, flowers, leaves or tree, computer, bed or baby crib, glass or bottle, dumbbell, treadmill or gym equipment, trampoline, mechanical bull or roller skates, bowling ball, cabinet or windows or dining table, sailboat or jet ski, fishing rod, cleaning supplies, grooming tools, pool, shoes, toilet, rope or ladder, barbecue grill or campfire, makeup tools, shovel, laundry or clothes, books or drawing materials, baseball, basketball or golf club, gymnastics mat, ice skates, dessert, fruits or vegetables, food items, fire extinguisher, hammer or meat grinder, musical instruments, board game, sporting equipment, gas pump, shopping cart, newspaper, animals, car, tractor or bicycle, rock climbing gear, electric sharpener or shredder.
Places: home, living room, dining room, bathroom, kitchen, bedroom, backyard or garden, staircase, hair salon, restaurant, outdoor, mountain or cliff, grass field, snow or ice, river or sea, sky, gym or fitness center, supermarket, foundry or workshop, forest, sports field, stadium, court or arena, massage parlor, dance floor or stage, road or sidewalk, swimming pool, restaurant or bar, entrance or doorway, hospital or emergency room, bowling alley, building or skyscraper, theatre or auditorium, farm, recording studio or music room, news room, repair shop, garage, archery or shooting range, beach, underwater or sea bed, office or workspace, park, arcade or casino, school or classroom.
People: no people, one person, two people, three people, several people.
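As an illustration, the Charades vocabulary listed above can be organized as a small dictionary and checked against the stated count of 97 auxiliary classes (43 objects, 15 places, 5 people-counts and 34 atomic-actions). The variable names and the `build_prompts` helper are hypothetical. The prompt template in particular is a placeholder, since the exact wording of the prompts is not specified here.

```python
# Charades auxiliary vocabulary, transcribed from the lists above.
CHARADES_AUX = {
    "objects": (
        "bag, bed, blanket, book, box, broom, chair, closet, cabinet, clothes, "
        "cup, glass, bottle, dish, door, doorknob, doorway, floor, food, "
        "groceries, hair, hands, laptop, light, medicine, mirror, paper, "
        "notebook, phone, camera, picture, pillow, refrigerator, sandwich, "
        "shelf, shoe, sofa, couch, table, television, towel, vacuum, window"
    ).split(", "),
    "places": (
        "basement, garage, pantry, recreation room, walk-in closet, laundry room, "
        "stairs, hallway, dining room, entryway, home office, bathroom, kitchen, "
        "bedroom, living room"
    ).split(", "),
    "people": (
        "no people, one person, two people, three people, several people"
    ).split(", "),
    "atomic_actions": (
        "doing nothing, awakening, closing, cooking, dressing, drinking, eating, "
        "fixing, grasping, holding, laughing, lying, making, opening, "
        "photographing, playing, pouring, putting, running, sitting, smiling, "
        "sneezing, snuggling, standing, taking, talking, throwing, tidying, "
        "turning, undressing, walking, washing, watching, working"
    ).split(", "),
}


def build_prompts(vocab, template="{}"):
    """Flatten the vocabulary into one prompt per class.

    The default template passes each term through unchanged; any richer
    template would be an assumption about the prompt wording.
    """
    return [template.format(term) for terms in vocab.values() for term in terms]


prompts = build_prompts(CHARADES_AUX)
counts = {group: len(terms) for group, terms in CHARADES_AUX.items()}
print(counts, len(prompts))
```

Because the same prompt list is shared by every video in the dataset, it only needs to be built and embedded once.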
On the selection of datasets:
In the literature, activity recognition is considered the prominent video classification task. To understand the effectiveness of our video-conditioned text representations, we tackle a variety of activity recognition benchmarks. This includes few-shot and zero-shot activity recognition (on HMDB-51 [29], UCF-101 [62]), short-form recognition (on Kinetics-400 [28]) and long-form recognition (on Charades [60]). It is worth noting that Kinetics-400 usually contains single-person activities, whereas Charades includes multiple people and complex overlapping activities. Together, these provide a thorough spread of scenarios for both single-label and multi-label classification. Our evaluation setting is similar to that of many prior works that evaluate on classification [66, 40, 33], yet extensive, as it includes diverse contexts.
Compute requirement:
Token-boosting increases the footprint of our model. However, our Video-Head is still lightweight, requiring minimal additional computation. In fact, it accounts for only 0.2% (0.5B) of total FLOPs in the B/16 16-frame model (285B), and only 0.1% (0.6B) in the L/14 8-frame model (656B). This is because of three reasons: (1) having fewer layers (i.e., 4 layers vs. 12/24 layers) and lightweight attention modules (i.e., temporal and cross-modal attention vs. spatial attention) compared to the image-VLM backbone [50], (2) processing significantly fewer tokens (i.e., only temporal and text-class tokens remain), and (3) doing text-conditioning only after the backbone (i.e., for the most part, all text embeddings go through shared computations). Overall, VicTR has a comparable footprint to prior work such as [33, 40, 66], providing a fair comparison (see respective GFLOPs in Table 5.2 and Table 5.3).
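The quoted percentages can be checked with a line of arithmetic each. The helper name `flops_share` is illustrative; the inputs are the FLOP counts (in billions) quoted above.

```python
def flops_share(head_flops_b, total_flops_b):
    """Share of total FLOPs spent in the Video-Head, as a percentage."""
    return 100.0 * head_flops_b / total_flops_b


b16 = flops_share(0.5, 285)  # B/16, 16-frame model
l14 = flops_share(0.6, 656)  # L/14, 8-frame model
print(f"B/16, 16 frames: {b16:.1f}%")
print(f"L/14, 8 frames: {l14:.1f}%")
```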
Other forms of semantic information:
In our framework, we use a fixed vocabulary of auxiliary prompts as semantic inputs, which is specific to each dataset. Another way of providing semantic information is in the form of captions. If available, a detailed set of captions may provide better semantic supervision. However, they come at a significant cost, since they need to be annotated per video. In contrast, our auxiliary prompts are freely available and can be selected with minimal effort, as they are common to all videos in a dataset. Our model learns to highlight relevant information for a given video implicitly, via affinity weighting, without needing any ground-truth annotations.
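One plausible reading of this implicit highlighting is sketched below: each auxiliary text embedding is scaled by a softmax over its affinity to the video embedding, so that prompts relevant to the video dominate. This is an assumption made for illustration only. The function names, the softmax normalization and the `temperature` value are hypothetical and are not taken from the actual model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_weight(video_emb, aux_embs, temperature=0.1):
    """Scale each auxiliary embedding by its (assumed) softmax affinity to the video."""
    weights = softmax([cosine(video_emb, t) / temperature for t in aux_embs])
    weighted = [[w * x for x in t] for w, t in zip(weights, aux_embs)]
    return weights, weighted


# Toy example: the first auxiliary prompt is aligned with the video, the second is not.
weights, weighted = affinity_weight([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights)
```

No per-video annotation enters this computation: the weights come only from the embeddings themselves, which is what allows a shared, dataset-level vocabulary to be used.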