EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

Baoqi Pei1,2∗, Guo Chen3,1∗, Jilan Xu4,1∗, Yuping He3∗, Yicheng Liu3∗, Kanghua Pan3∗,
Yifei Huang1,5∗, Yali Wang1,6, Tong Lu3, Limin Wang1,3, Yu Qiao2

1Shanghai AI Laboratory, 2Zhejiang University, 3Nanjing University,
4Fudan University, 5The University of Tokyo, 6SIAT, CAS
{502023330020,522023330056,522023330071}@smail.nju.edu.cn
{lutong,lmwang}@nju.edu.cn
Abstract

In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle various tasks including Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In addition, we also participate in the EPIC-Kitchens challenge, where we engage in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.

* These authors contributed equally.

1 Introduction

In computer vision research, egocentric video understanding represents a pivotal task aimed at enabling machines to comprehend videos including human activities from a first-person perspective. Unlike traditional third-person viewpoint analysis, egocentric video understanding focuses on understanding human activities as they occur from the camera wearer’s viewpoint, often captured through wearable cameras or head-mounted devices. This task holds significant implications across various domains, including healthcare [37], virtual/augmented reality [40], and human-computer interaction [6]. Egocentric video action understanding facilitates applications ranging from assistive technologies for the visually impaired to immersive experiences in virtual environments [4, 16]. Additionally, it fosters advancements in personalized assistance systems [35], sports analytics [1], and surveillance technologies [7], thereby underscoring its multifaceted impact on both academic research and practical applications.

In recent years, action recognition methods have undergone significant advancements, propelled by the surge in deep learning techniques and the availability of large-scale annotated datasets [3, 12, 5]. With the advent of convolutional neural networks (CNNs) [29, 3] and recurrent neural networks (RNNs), action recognition has witnessed a paradigm shift towards end-to-end trainable models capable of automatically learning discriminative features from video clips. The integration of attention mechanisms, spatial-temporal modeling, and graph-based representations has further enhanced the performance of action recognition systems. Benefiting from large-scale vision-language datasets [2, 30, 12], a variety of video foundation models [31, 33] have been designed to learn general video representations, which have been shown to benefit a series of downstream action recognition tasks [3, 11, 5]. However, as most of these video foundation models are trained on videos recorded in third-person view, the learned representations turn out to be sub-optimal for egocentric video understanding [32, 17, 13, 34, 18].

Figure 1: The workflow of the training process of EgoVideo. It includes 3 stages: in the first stage, we filter and select high-quality egocentric video and text pairs from multiple existing datasets. Then we perform post-pretraining on the data from stage 1 using standard video-text contrastive learning. Finally, we adapt the pretrained EgoVideo model to different downstream tasks.

To tackle the challenge, we propose a 3-stage training paradigm for egocentric video understanding, covering multiple tasks such as natural language grounding, domain adaptation, and multi-instance retrieval. Specifically, we first filter and select high-quality egocentric video and text pairs from multiple existing datasets [12, 23, 27, 17]. These high-quality data serve as the foundational data for transferring models learned from general domains to the egocentric domain. We adopt a video foundation model [31] that is pre-trained on large-scale video-language datasets [30]. With the help of rich vision features and a wide range of action-aware knowledge, this model is capable of extracting general video feature representations, acting as a good starting point for subsequent feature learning. In the second stage, to mitigate the domain gap between web-scale video datasets and egocentric videos, we perform post-training on the selected data, effectively transferring the general video feature representations to the egocentric domain. We term the resulting model EgoVideo, consisting of a strong egocentric video encoder EgoVideo-V and a text encoder EgoVideo-T. In the third stage, we conduct task-specific fine-tuning of EgoVideo-V and EgoVideo-T on different egocentric video understanding tasks, e.g., natural language queries, domain adaptation for action recognition, and multi-instance retrieval.

Experimental results show that our 3-stage strategy has led to a remarkable improvement in overall model performance. The model excels at understanding fine-grained, action-specific information, demonstrating strong performance in action recognition and multi-instance retrieval. Moreover, benefiting from the multi-stage training, our model exhibits video understanding ability across a wide range of actions.

In the remainder of this report, we detail our solutions along with experiments for each track we joined. Finally, we discuss the limitations of our work and conclude this technical report.

2 Training Process of EgoVideo

2.1 Stage 1: Augmented Data Selection

To better transfer the video foundation model learned in the general video domain to the egocentric domain, we collect a broad range of egocentric video-text pairs from public video datasets, such as Ego4D [12], HowTo100M [23], EgoExoLearn [17], and Ego4D GoalStep [27], using automatic filtering techniques. We do this to cover a wider range of egocentric data while maintaining the quality of the pretraining data. This results in around 7M video-text pairs.

2.2 Stage 2: Egocentric Video Post-training

In this work, we adopt InternVideo2 [31], a novel video foundation model that is pre-trained on millions of video-text pairs [30]. InternVideo2 is built through a progressive learning scheme, consisting of feature distillation, multi-modal alignment, and vision-language connection. The pre-trained video foundation model thus acts as a strong starting point for the subsequent feature learning process. More details about the foundation model can be found in [31, 21].

We then perform the post-pretraining process and train the model for 5 epochs on the hybrid data from Stage 1 to improve its egocentric video understanding ability. The model is optimized via a standard visual-text contrastive loss. During training, we also examine the model's egocentric video understanding ability on the EPIC-Kitchens-100 zero-shot multi-instance retrieval benchmark [5], and the results are shown in Table 9. We term this egocentric video foundation model EgoVideo, consisting of a strong egocentric video encoder EgoVideo-V and a text encoder EgoVideo-T.
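To make the training objective concrete, the following is a minimal sketch of a symmetric video-text contrastive loss of the kind commonly used for two-tower models. The function names, the temperature value of 0.07, and the use of plain Python lists are illustrative assumptions; the report does not specify the exact loss implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Guard against a zero vector to avoid division by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_emb[i] and text_emb[i] are assumed to describe the same clip;
    all other pairs in the batch act as negatives. The temperature is an
    assumed value, not one taken from the report.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    v2t = [[dot(vi, tj) / temperature for tj in t] for vi in v]
    t2v = [list(col) for col in zip(*v2t)]  # transpose for text-to-video
    return 0.5 * (cross_entropy_diag(v2t) + cross_entropy_diag(t2v))
```

With correctly paired embeddings the loss is close to zero, while a batch of indistinguishable embeddings gives the logarithm of the batch size, which is the chance-level value.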

2.3 Stage 3: Egocentric Downstream Adaptation

After stage 2 training, we obtain a video foundation model EgoVideo tailored for the egocentric domain. We use this model to initialize the models in stage 3. In this stage, we conduct task-specific fine-tuning on the training sets. We put the detailed task-specific fine-tuning process of each task in the following section.

3 Task-specific Finetuning

3.1 Task 1: Natural Language Queries @ Ego4D

Task Definition Given a video clip and a natural language query, the Ego4D [12] Natural Language Queries task aims to identify the temporal window corresponding to the query’s answer.

Approach Our solution builds upon GroundNLQ[14] and employs our EgoVideo to extract video and text features. GroundNLQ proposes a multi-modal multiscale transformer encoder module to encode both video and text features and then efficiently fuse them. Following GroundNLQ, we first pretrain on NaQ [25] data and then fine-tune on NLQ data.

Implementation Details. 1) Feature Extraction: We leverage ViT-1B of EgoVideo to extract video features for each snippet, which contains $s=16$ consecutive frames with interval $\delta=16$. The text features are extracted by BERT-Large of EgoVideo. 2) Training Setup: In the pretraining phase, we set the batch size to 8 and the total epochs to 10, with a warmup of 4 epochs, employing a maximum learning rate of 2e-4. In the fine-tuning phase, we set the batch size to 2 and the total epochs to 10, with a warmup of 4 epochs, with a maximum learning rate of 5e-5.
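As an illustration of the snippet layout, the sketch below enumerates the frame indices covered by each snippet. It assumes that the interval of 16 is the stride between consecutive snippet start frames, so that snippets of 16 frames do not overlap, and that an incomplete trailing snippet is dropped; both are assumptions, and the function name is hypothetical.

```python
def snippet_frame_indices(num_frames, snippet_len=16, stride=16):
    """Return one list of frame indices per snippet.

    Assumes `stride` is the offset between snippet start frames and that a
    trailing snippet with fewer than `snippet_len` frames is discarded.
    """
    snippets = []
    start = 0
    while start + snippet_len <= num_frames:
        snippets.append(list(range(start, start + snippet_len)))
        start += stride
    return snippets

# Example: a 40-frame clip yields two full snippets, frames 0-15 and 16-31.
clip_snippets = snippet_frame_indices(40)
```

Each snippet would then be passed through the video encoder to produce one feature vector, giving a feature sequence whose length is the number of snippets.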

| # | Method | Feature | Val R1@0.3 | Val R1@0.5 | Val R5@0.3 | Val R5@0.5 | Test R1@0.3 | Test R1@0.5 | Test R5@0.3 | Test R5@0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | GroundNLQ [14] | EgoVLP | 26.98 | 18.83 | 53.56 | 40.00 | 24.50 | 17.31 | 40.46 | 29.17 |
| B | GroundNLQ† [14] | EgoVLP | 27.20 | 18.91 | 54.42 | 39.98 | 25.67 | 18.18 | 42.05 | 29.80 |
| C | GroundVQA [8] | EgoVLP | 29.70 | - | - | - | 26.67 | 17.63 | 39.94 | 27.70 |
| D | GroundNLQ | EgoVideo | 28.65 | 19.73 | 53.30 | 40.42 | 25.07 | 17.31 | 40.88 | 29.67 |
| E | C+D | Ensemble | - | - | - | - | 28.05 | 19.31 | 44.16 | 31.37 |

Table 1: The Natural Language Queries performance.

Results. Table 1 presents the results on NLQ. #A and #D employ an identical model and training strategy, while our single model's features (#D) outperform the ensemble of EgoVLP and InternVideo (#A). #E combines predictions from GroundNLQ, GroundNLQ*, and GroundVQA [8]. GroundNLQ* is a variant of GroundNLQ, distinguished by the integration of a cross-modal layer within the encoder. GroundVQA leverages a large language model to encode visual and language features. Ensembling further enhances performance.

| Method | Val R1@0.3 | Val R1@0.5 | Test R1@0.3 | Test R1@0.5 |
|---|---|---|---|---|
| VSLNet [36] | - | - | 19.04 | 12.04 |
| Ours | 28.02 | 23.66 | 32.99 | 25.92 |
| Ours (Ensemble) | - | - | 34.06 | 26.97 |

Table 2: The Step Grounding performance. Our method is the combination of GroundNLQ and EgoVideo.

3.2 Task 2: GoalStep - Step Grounding @ Ego4D

Task Definition Step grounding aims to identify the temporal segment in an untrimmed egocentric video corresponding to a given natural language description of the step.

Approach Similar to NLQ, we use GroundNLQ as the grounding model for Step Grounding and adopt EgoVideo to extract video and text features.

Implementation Details. We adopt the same feature extraction configuration as in NLQ. During the fine-tuning phase, we use a batch size of 8, apply dropout with a probability of 0.2, and set the drop path rate to 0.2. Other hyperparameters remain the same as in NLQ.

Results. Table 2 displays our results on Step Grounding. The official baseline uses VSLNet [36] as the grounding model and Omnivore [10] features. In contrast, our solution leverages stronger video and text features along with advanced grounding models, resulting in notable improvements. After ensembling results from GroundNLQ and GroundNLQ*, we achieve further gains.

3.3 Task 3: Moment Queries @ Ego4D

Task Definition Given an egocentric video and a specific action category, the Moment Queries task aims to retrieve all temporal segments corresponding to this action category. The action categories are pre-defined and specific to first-person activities.

| # | Feature | Val Average mAP | Val R1@0.5 | Test Average mAP | Test R1@0.5 |
|---|---|---|---|---|---|
| A | InternVideo + EgoVLP | 27.85 | 46.98 | - | - |
| B | Slowfast + Omnivore + EgoVLP + InternVideo | - | - | 29.34 | 48.50 |
| C | EgoVideo-MQ | 28.53 | 46.07 | - | - |
| D | InternVideo + EgoVideo-V | 31.30 | 50.21 | 31.52 | 49.22 |
| E | InternVideo + EgoVideo-MQ | 31.00 | 49.28 | - | - |
| F | InternVideo + EgoVideo-V + EgoVideo-MQ | 32.48 | 51.04 | 32.50 | 50.07 |

Table 3: The Moment Queries performance.

Approach We adopt ASL [26] as our task-specific solution. ASL divides the task into two subtasks: classification and localization. It incorporates an action sensitivity evaluator module to assess the significance of each frame relative to the action, guiding the learning process for each subtask.

| Fast Backbone | Noun | Noun_Verb | Noun_TTC | Overall |
|---|---|---|---|---|
| X3D-M | 25.06 | 13.29 | 9.14 | 5.12 |
| EgoVideo-V | 31.08 | 16.18 | 12.41 | 7.21 |

Table 4: Short-term object-interaction anticipation performance.

Implementation Details. 1) Feature Extraction: For further enhancing vision-only performance, we finetune the video encoder of EgoVideo-V on MQ data and the resulting model is termed as EgoVideo-MQ. Consistent with the configuration of NLQ and GoalStep, we adopt EgoVideo-V and EgoVideo-MQ to extract two types of video features. 2) Training Setup: InternVideo, EgoVideo-V, and EgoVideo-MQ features are all projected to 512 dimensions, and other hyperparameters remain consistent with ASL.

Results. Table 3 displays the results for MQ. Comparing #A and #C, our single model’s features outperform the ensemble of EgoVLP and InternVideo, demonstrating the superior performance of EgoVideo. #D combines InternVideo with EgoVideo-V features. Specifically, we project each feature and concatenate them. #E incorporates InternVideo with EgoVideo-MQ features. #F combines predictions from #D and #E by averaging the output logits for classification and localization from each model. Compared with ASL, our solution leverages multiple complementary features and achieves better results.
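The two combination strategies described above, feature-level fusion by projection and concatenation (#D, #E) and prediction-level fusion by averaging logits (#F), can be sketched as follows. The helper names and the toy dimensions are hypothetical; in the report the features are projected to 512 dimensions with learned weights, which this sketch replaces with fixed example matrices.

```python
def project(feature, weight):
    """Apply a linear projection; `weight` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]

def fuse_features(features, weights):
    """Project each backbone's feature vector, then concatenate the results."""
    fused = []
    for feature, weight in zip(features, weights):
        fused.extend(project(feature, weight))
    return fused

def average_logits(logits_per_model):
    """Element-wise mean of the logits produced by several models."""
    n = len(logits_per_model)
    return [sum(vals) / n for vals in zip(*logits_per_model)]

# Toy usage: two backbones with 2-d and 1-d features (stand-ins for the
# InternVideo and EgoVideo features), each with its own projection matrix.
fused = fuse_features(
    [[1.0, 2.0], [3.0]],
    [[[1.0, 0.0], [0.0, 1.0]], [[2.0]]],
)
ensembled = average_logits([[1.0, 3.0], [3.0, 5.0]])
```

Feature-level fusion lets a single detector see all backbones at once, whereas logit averaging combines independently trained detectors, which is why the two can be stacked as in #F.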

| Model | Noun Top1 | Verb Top1 | Action Top1 |
|---|---|---|---|
| EgoVLP | 45.53 | 40.32 | 20.63 |
| EgoVideo-V | 52.21 | 43.65 | 27.64 |

Table 5: Action recognition performance on the validation set.

3.4 Task 4: Short-term Object-interaction Anticipation @ Ego4D

Task Definition Short-term object interaction anticipation task aims to predict the next human-object interaction happening after a given timestamp [12]. Given an input video, the model is required to anticipate at what time and in what location, what kind of object interaction will happen.

Approach We choose Stillfast [24] as our downstream solution. This approach separately extracts high-resolution, low-frame-rate image information and low-resolution, high-frame-rate video information, and then fuses them to obtain multi-modal spatio-temporal features. Stillfast [24] uses X3D-M [9] as the backbone for video feature extraction. We replace X3D-M with our stronger EgoVideo-V. Differing from the original Stillfast framework, which fuses multiple multi-scale intermediate layers of X3D-M (fast) and ResNet (still), we interpolate the last-layer feature map of EgoVideo-V into different sizes and fuse them into the multi-scale still features generated by ResNet.
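The multi-scale fusion described above can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the report: feature maps are single-channel 2D grids, resizing uses nearest-neighbour interpolation, the temporal dimension has already been pooled away, and fusion is element-wise addition. The function names are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2D feature map given as a list of rows."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def fuse_multiscale(still_pyramid, fast_last):
    """Resize the single fast feature map to every still level and add them.

    `still_pyramid` is a list of 2D maps at different resolutions (standing in
    for the ResNet pyramid); `fast_last` is the last-layer map of the video
    backbone. Additive fusion is an assumption made for illustration.
    """
    fused = []
    for still in still_pyramid:
        h, w = len(still), len(still[0])
        fast = resize_nearest(fast_last, h, w)
        fused.append(
            [[s + f for s, f in zip(srow, frow)] for srow, frow in zip(still, fast)]
        )
    return fused
```

The point of the sketch is only that a single-resolution output, as produced by a plain ViT-style backbone, can still feed a pyramid-based detector once it is resampled to each pyramid level.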

Implementation Details. We adopt the training setup consistent with Stillfast. The difference is that we set the drop path rate to 0.3, layer-wise lr decay to 0.9. Meanwhile, we enable BF16 for stable training.
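For readers unfamiliar with layer-wise learning-rate decay, a minimal sketch of one common convention follows: the topmost layer keeps the base learning rate and each layer below it is scaled by a further factor of the decay (0.9 here, as in the setup above). The base learning rate and the number of layers in the usage line are placeholder values, not values from the report.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.9):
    """Return per-layer learning rates, ordered from the first to the last layer.

    The last layer uses `base_lr`; every earlier layer is multiplied by one
    more factor of `decay`, so early layers change more slowly.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Placeholder values for illustration only.
rates = layerwise_learning_rates(base_lr=1e-4, num_layers=4, decay=0.9)
```

This keeps the low-level pretrained features close to their initial values while letting the task-specific upper layers adapt faster.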

Results. Table 4 displays the results for Short-term object-interaction anticipation on the test set. The results indicate that our EgoVideo-V is also suitable for direct transfer to forecasting tasks. In particular, the predictions of Verb and TTC are challenging to substantiate with direct evidence and often rely on advanced cognitive reasoning abilities.

3.5 Task 5: Long-term Action Anticipation @ Ego4D

Task Definition Long-term action anticipation is a task that aims to predict multiple future actions following a given action. Each action is composed of a verb and a noun. Given an input video up to a particular timestamp, which corresponds to the last visible action, the goal is to predict a list of the twenty subsequent actions.

Approach Recent methods [38, 20] leveraging Large Language Models (LLMs) have shown superior performance in LTA tasks by converting video actions into natural language sequences, which LLMs then use to predict future actions. For LLM-based methods, better classification prediction and stronger LLM intuitively bring stronger language comprehension and prediction capabilities.

Video Clip Classification. Previous methods typically used video encoders like EgoVLP[38, 20] or CLIP[38] combined with a Transformer-based classification head to obtain verbs and nouns. We simply finetune EgoVideo-V on LTA data to replace the previous classification predictions with our better inference results.

Anticipation with LLMs. We employed the Vicuna-7B [42] model as the LLM. During fine-tuning, we fixed the historical action sequence length to 8 and used the subsequent 20 actions as labels. We used EgoVLP [22] to extract features and augment the training set.
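The construction of one fine-tuning example, with 8 observed actions as input and the following 20 as the label, might look like the sketch below. The textual format ("verb noun" joined by commas) and the function name are assumptions; the report does not describe the actual prompt template.

```python
def build_lta_example(actions, history_len=8, future_len=20):
    """Split a sequence of (verb, noun) pairs into a prompt and a target string.

    The first `history_len` actions form the prompt and the next `future_len`
    actions form the label. The text format is an assumed one.
    """
    if len(actions) < history_len + future_len:
        raise ValueError("sequence too short for one training example")
    history = actions[:history_len]
    future = actions[history_len:history_len + future_len]

    def to_text(seq):
        return ", ".join(f"{verb} {noun}" for verb, noun in seq)

    return to_text(history), to_text(future)

# Toy usage with placeholder actions.
toy_actions = [("take", f"item{i}") for i in range(28)]
prompt, target = build_lta_example(toy_actions)
```

At inference time, the history would come from the predictions of the clip classifier rather than from ground-truth labels, which is why classification quality matters for anticipation.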

Experiments Implementation Details. Following [38], during the fine-tuning phase, we set the learning rate to 3e-4, gamma to 0.85, batch size to 32, and the number of epochs to 3 for all models. We also use LoRA [15] to improve the speed and efficiency of fine-tuning.
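As a reminder of what LoRA [15] changes, the sketch below computes the effective weight after adaptation: the frozen weight W plus a scaled low-rank product, W + (alpha / r) * B A. Only A and B are trained. The rank and alpha values in the usage lines are placeholders, since the report does not state them, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w: frozen weight (out x in); lora_b: (out x rank); lora_a: (rank x in).
    """
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [wij + scale * dij for wij, dij in zip(wrow, drow)]
        for wrow, drow in zip(w, delta)
    ]

# Placeholder example with rank 1 and alpha 2.
w = [[1.0, 0.0], [0.0, 1.0]]
adapted = lora_effective_weight(w, [[1.0], [2.0]], [[3.0, 4.0]], alpha=2.0, rank=1)
```

Because B is typically initialised to zero, the adapted model starts out identical to the pretrained one, and only the small matrices A and B need gradients and optimizer state.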

Action Recognition Results. Table 5 shows the accuracy of action recognition on the validation set. The results show that EgoVideo-V yields more accurate action predictions, which serve as the input to the subsequent long-term anticipation step.

| LLM | Classification Model | Val Noun ED↓ | Val Verb ED↓ | Val Action ED↓ | Test Noun ED↓ | Test Verb ED↓ | Test Action ED↓ |
|---|---|---|---|---|---|---|---|
| LLaMA2-7B [28] | CLIP | 67.55 | 67.28 | 89.31 | - | - | - |
| LLaMA2-7B [28] | EgoVLP | 65.97 | 67.30 | 88.83 | - | - | - |
| LLaMA2-7B [28] | EgoVideo-V | 65.09 | 66.78 | 87.93 | 67.04 | 65.07 | 87.39 |
| Mistral-7B [19] | EgoVideo-V | 65.02 | 70.08 | 88.69 | 65.00 | 68.07 | 87.70 |
| LLaMA3-8B | EgoVideo-V | - | - | - | 64.54 | 67.77 | 87.78 |
| Vicuna-13B [42] | EgoVideo-V | 65.01 | 69.15 | 88.52 | - | - | - |
| Vicuna-7B [42] | EgoVideo-V | 62.64 | 65.76 | 86.19 | 63.67 | 63.54 | 85.04 |

Table 6: Results on the validation and test sets of the Long-term Action Anticipation challenge.

Action Anticipation Results. Table 6 shows the LTA results on the validation and test sets. With LLaMA2-7B as the anticipation model, using the classification results of EgoVideo-V improves anticipation performance over EgoVLP [22] on all three validation metrics. Furthermore, we tested various LLMs, including LLaMA2-7B [28], LLaMA3-8B, Vicuna-7B [42], Vicuna-13B [42], and Mistral-7B [19]. Vicuna-7B achieved the lowest edit distance on every reported metric.
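The ED columns in Table 6 are edit-distance scores, where lower is better. As a rough illustration of the idea, the sketch below computes a length-normalised Levenshtein distance between a predicted and a ground-truth action sequence. This is a simplified stand-in: the official challenge evaluation may differ, for example in how transpositions are handled or in how several sampled predictions are aggregated.

```python
def edit_distance(pred, target):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # delete from pred
                            curr[j - 1] + 1,      # insert into pred
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred, target):
    """Edit distance divided by the longer sequence length (0.0 if both empty)."""
    longest = max(len(pred), len(target))
    return edit_distance(pred, target) / longest if longest else 0.0

# Toy usage on verb sequences.
score = normalized_edit_distance(["cut", "wash"], ["cut", "peel"])
```

The same function can be applied separately to verb sequences, noun sequences, and (verb, noun) action sequences to obtain the three kinds of scores reported in the table.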

3.6 Task 6: Action Recognition @ EPIC

Task definition. Action recognition considers a short video clip and requires the model to predict the verb/noun/action classes of the action in this segment. The evaluation metrics include Top-1/5 accuracy.
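A minimal sketch of the top-k accuracy metric is given below. It treats each of the verb, noun, and action heads as an ordinary classification problem; how action scores are derived from verb and noun scores in the official evaluation is not covered here, and the function name is an assumption.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores[i][c] is the score of class c for sample i; labels[i] is the true class.
    """
    hits = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

# Toy usage with three samples and three classes.
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```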

Training. Following prior works [39, 41], we train our model for 100 epochs on the training set with a learning rate of 1e-5 and batch size of 48. We conduct warm-up training for 2 epochs using the cross-entropy loss. The model is trained on 16 A100 GPUs.

Results. Table 7 presents the fine-tuned model's performance on EK100 action recognition. Our method surpasses state-of-the-art approaches on the Verb, Noun, and Action top-1 scores. Our single EgoVideo-V achieves 72.9%/68.7%/56.2% Verb/Noun/Action top-1 accuracy on the test set, ahead of the previous challenge champion, whose ensembled Verb/Noun/Action top-1 results are 71.7%/65.8%/54.3%. Ensembling three different models yields a further improvement of +0.2%/+1.1%/+0.6%, for final test results of 73.1%/69.8%/56.8%. Overall, the results underscore the effectiveness of our approach in enhancing the understanding of daily human activities captured in egocentric views, highlighting its potential for advancing research in activity recognition domains.

Table 7: Action Recognition Top-1 performance on EK100 dataset.

| Method | Val Verb | Val Noun | Val Action | Test Verb | Test Noun | Test Action |
|---|---|---|---|---|---|---|
| LaViLA [41] | 72.0 | 62.9 | 51.0 | - | - | - |
| AVION [39] | 73.0 | 65.4 | 54.4 | - | - | - |
| AVION [39] (Ensemble) | - | - | - | 71.7 | 65.8 | 54.3 |
| EgoVideo-V | - | - | - | 72.9 | 68.7 | 56.2 |
| EgoVideo-V (Ensemble) | - | - | - | 73.1 | 69.8 | 56.8 |

Table 8: Unsupervised Domain Adaptation for Action Recognition Top-1 performance on the target domain. The model is finetuned on the source domain.

| Method | Val Verb | Val Noun | Val Action | Test Verb | Test Noun | Test Action |
|---|---|---|---|---|---|---|
| Previous Top-1 | - | - | - | 58.2 | 40.3 | 30.1 |
| EgoVideo-V | - | - | - | 61.3 | 56.2 | 43.2 |

Table 9: Zero-shot multi-instance retrieval performance on EK100 dataset.

| Method | Backbone | Average mAP | Average nDCG |
|---|---|---|---|
| EgoVLP [22] | ViT-B | 16.6 | 23.1 |
| LaViLA [41] | ViT-B | 30.9 | 32.0 |
| AVION [39] | ViT-B | 30.9 | 32.0 |
| LaViLA [41] | ViT-L | 36.1 | 34.6 |
| AVION [39] | ViT-L | 37.6 | 35.3 |
| EgoVideo | EgoVideo-1B | 47.6 | 39.4 |

Table 10: Multi-instance retrieval performance on EK100 dataset.

| Method | mAP Avg. | mAP T2V | mAP V2T | nDCG Avg. | nDCG T2V | nDCG V2T |
|---|---|---|---|---|---|---|
| EgoVLP [22] | 45.0 | 40.5 | 49.9 | 59.4 | 57.9 | 60.9 |
| LaViLA [41] | 50.9 | 47.1 | 54.7 | 66.5 | 64.9 | 68.1 |
| AVION [39] | 54.5 | 51.1 | 57.9 | 69.0 | 67.6 | 70.4 |
| EgoVideo | 63.3 | 58.9 | 67.6 | 73.2 | 71.5 | 75.0 |

3.7 Task 7: Multi-instance Retrieval @ EPIC

Task definition: The primary objective of the EPIC-Kitchens Multi-Instance Retrieval task is to develop models capable of accurately retrieving relevant video segments from the EPIC-Kitchens-100 dataset given a query in the form of a textual description of the action or activity. The evaluation metrics include Mean Average Precision (mAP) and normalized Discounted Cumulative Gain (nDCG). More detailed information can be found in [5].
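To clarify the two retrieval metrics, the sketch below computes average precision and nDCG for a single query from a ranked result list. It is a simplified illustration: the benchmark defines its own relevance scores between captions and clips [5], whereas here the relevance values are simply given as inputs, and the function names are assumptions.

```python
import math

def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is 1 if the i-th result is relevant."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def ndcg(ranked_gains):
    """nDCG for one query; ranked_gains[i] is the graded relevance at rank i + 1."""
    dcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ranked_gains, start=1))
    ideal = sorted(ranked_gains, reverse=True)
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Toy usage: relevant items at ranks 1 and 3 out of 3.
ap = average_precision([1, 0, 1])
```

Averaging AP over all queries gives mAP; the T2V and V2T columns correspond to using text or video as the query side, respectively.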

Training: Following prior works [39, 41], we train our model for 50 epochs on the training set with a learning rate of 1e-5 and batch size of 8. We conduct warm-up training for 1 epoch using the classic video-text contrastive loss. The model is trained on 8 A100 GPUs for 12 hours.

Results. Tables 9 and 10 present the zero-shot and fine-tuned models' performance on EK100 multi-instance retrieval. Our method surpasses state-of-the-art approaches in both mAP and nDCG scores. As shown in Table 9, our stage 2 model (after post-training) achieves strong zero-shot retrieval performance compared with EgoVLP and LaViLA, indicating the strength of our backbone model and the effectiveness of the multi-stage training strategy. Through task-specific training, our model achieves 63.3% and 73.2% average mAP and nDCG, respectively, exhibiting substantial improvements in both text-to-video and video-to-text retrieval tasks. This indicates superior performance in capturing fine-grained action semantics within the kitchen domain. Overall, the results underscore the effectiveness of our approach in enhancing the understanding of daily human activities captured in egocentric views, highlighting its potential for advancing research in activity recognition and video retrieval domains.

3.8 Task 8: Domain Adaptation for Action Recognition @ EPIC

Task definition. Domain adaptation is defined as using a labelled source domain to train an action recognition model that is capable of adapting to an unlabelled target domain. According to the data source [5], this task poses additional challenges due to the discrepancy in location, hardware, and long-term temporal offsets. The evaluation metrics include Top-1/5 accuracy.

Training. Similar to the training setting of action recognition, our approach differs in that we only train the model on the source domain.

Results. Table 8 presents the model's performance on EK100 domain adaptation for action recognition. Notably, our model is only finetuned on the source domain, yet achieves 61.3%/56.2%/43.2% Verb/Noun/Action top-1 accuracy, much higher than the previous leading results of 58.2%/40.3%/30.1%. This highlights the performance improvement brought by well-pretrained models.

4 Limitation and Conclusion

Although our solution achieved good results in the competition, there are still some limitations worth noting. Firstly, we use a large video-language model and A100 GPUs during the training process, which requires expensive computing resources and results in higher carbon emissions. Secondly, we employ feature-based approaches to solve the temporal localization problem, which often fail to obtain the optimal solution. Finally, we find that in Long-Term Action Anticipation (LTA) tasks, training and prediction based on LLMs have high uncertainty, and the final prediction performance may not be proportional to the capability of the LLM itself.

In conclusion, we have presented our solutions to 8 tracks in the EgoVis CVPR2024 Challenge. We find a larger video-language model can still give an advantage to egocentric task performance. This reveals that there is still ample room for exploration in egocentric video understanding.

References

  • [1] Hamed Habibi Aghdam, Elnaz Jahani Heravi, and Domenec Puig. An unsupervised method for summarizing egocentric sport videos. In Eighth international conference on machine vision (ICMV 2015), volume 9875, pages 337–341. SPIE, 2015.
  • [2] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  • [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [4] Tejo Chalasani, Jan Ondrej, and Aljosa Smolic. Egocentric gesture recognition for head-mounted ar devices. In 2018 IEEE international symposium on mixed and augmented reality adjunct (ISMAR-Adjunct), pages 109–114. IEEE, 2018.
  • [5] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2022.
  • [6] Dima Damen, Teesid Leelasawassuk, Osian Haines, Andrew Calway, and Walterio W Mayol-Cuevas. You-do, i-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In BMVC, volume 2, page 3. Citeseer, 2014.
  • [7] Gerwin de Haan, Josef Scheuer, Raymond de Vries, and Frits H Post. Egocentric navigation for video surveillance in 3d virtual environments. In 2009 IEEE Symposium on 3D User Interfaces, pages 103–110. IEEE, 2009.
  • [8] Shangzhe Di and Weidi Xie. Grounded question-answering in long egocentric videos. arXiv preprint arXiv:2312.06505, 2023.
  • [9] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020.
  • [10] Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens Van Der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022.
  • [11] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017.
  • [12] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
  • [13] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. arXiv preprint arXiv:2311.18259, 2023.
  • [14] Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, and Mike Zheng Shou. Groundnlq@ ego4d natural language queries challenge 2023. arXiv preprint arXiv:2306.15255, 2023.
  • [15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • [16] Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In Proceedings of the European conference on computer vision (ECCV), pages 754–769, 2018.
  • [17] Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, et al. Egoexolearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22072–22086, 2024.
  • [18] Yifei Huang, Yusuke Sugano, and Yoichi Sato. Improving action segmentation via graph-based temporal reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14024–14034, 2020.
  • [19] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • [20] Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc Van Gool, and Xi Wang. Lalm: Long-term action anticipation with language models. arXiv preprint arXiv:2311.17944, 2023.
  • [21] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.
  • [22] Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z XU, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35:7575–7586, 2022.
  • [23] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019.
  • [24] Francesco Ragusa, Giovanni Maria Farinella, and Antonino Furnari. Stillfast: An end-to-end approach for short-term object interaction anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3635–3644, 2023.
  • [25] Santhosh Kumar Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Naq: Leveraging narrations as queries to supervise episodic memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6694–6703, 2023.
  • [26] Jiayi Shao, Xiaohan Wang, Ruijie Quan, and Yi Yang. Action sensitivity learning for the ego4d episodic memory challenge 2023. arXiv preprint arXiv:2306.09172, 2023.
  • [27] Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. Ego4d goal-step: Toward hierarchical understanding of procedural activities. Advances in Neural Information Processing Systems, 36, 2024.
  • [28] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [29] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015.
  • [30] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
  • [31] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
  • [32] Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. arXiv preprint arXiv:2401.00789, 2024.
  • [33] Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, and Jiahui Yu. Videococa: Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979, 2022.
  • [34] Lijin Yang, Yifei Huang, Yusuke Sugano, and Yoichi Sato. Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14722–14732, 2022.
  • [35] Yu Yao, Mingze Xu, Chiho Choi, David J Crandall, Ella M Atkins, and Behzad Dariush. Egocentric vision-based future vehicle localization for intelligent driving assistance systems. In 2019 International Conference on Robotics and Automation (ICRA), pages 9711–9717. IEEE, 2019.
  • [36] Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Span-based localizing network for natural language video localization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6543–6554. Association for Computational Linguistics, 2020.
  • [37] Qing Zhang, Giulia Barbareschi, Yifei Huang, Juling Li, Yun Suen Pai, Jamie Ward, and Kai Kunze. Seeing our blind spots: smart glasses-based simulation to increase design students’ awareness of visual impairment. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–14, 2022.
  • [38] Qi Zhao, Ce Zhang, Shijie Wang, Changcheng Fu, Nakul Agarwal, Kwonjoon Lee, and Chen Sun. Antgpt: Can large language models help long-term action anticipation from videos? arXiv preprint arXiv:2307.16368, 2023.
  • [39] Yue Zhao and Philipp Krähenbühl. Training a large video model on a single machine in a day. arXiv preprint arXiv:2309.16669, 2023.
  • [40] Yunhan Zhao, Haoyu Ma, Shu Kong, and Charless Fowlkes. Instance tracking in 3d scenes from egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21933–21944, 2024.
  • [41] Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597, 2023.
  • [42] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.