Event-aware Video Corpus Moment Retrieval

Danyang Hou (CAS Key Laboratory of AI Security, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China), Liang Pang (CAS Key Laboratory of AI Security, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China), Huawei Shen (CAS Key Laboratory of AI Security, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China) and Xueqi Cheng (CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China)

(2018)

Abstract.

Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos based on maximum frame similarity. However, this approach overlooks the semantic structure embedded within the information between frames, namely, the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding. The event reasoning module groups consecutive and visually similar frame representations into events, while the hierarchical event encoding encodes information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks show the effectiveness and efficiency of EventFormer in VCMR, achieving new state-of-the-art results. Additionally, the effectiveness of EventFormer is also validated on the partially relevant video retrieval task.

Video Corpus Moment Retrieval, Video Retrieval, Event Retrieval

CCS Concepts: Information systems → Video search

Figure 1. In VCMR, the relevant part corresponding to the query is the moment. While the frame-aware method utilizes frames for retrieval, our event-aware approach adopts events as the retrieval unit, ensuring a more comprehensive capture of moment information.

Table 1. The overlap between the moments predicted by models or the extracted events with ground truth moments. The metric is the ratio of predicted moments to ground truth moments with an IoU greater than 0.5 or 0.7.

Model	IoU=0.5	IoU=0.7
XML (Lei et al., 2020) (ECCV’20)	29.56	13.05
ReLoCLNet (Zhang et al., 2021a) (SIGIR’21)	31.65	14.80
HERO (Li et al., 2020) (EMNLP’20)	32.2	15.30
Event	35.04	14.21

1. Introduction

With the widespread use of mobile devices and video-sharing applications, online video content has surged to unprecedented levels, encompassing extensive, untrimmed content, including TV series and instructional videos. An advanced video retrieval system should efficiently pinpoint specific moments within a vast corpus for users, be it a classic shot from a movie or a crucial step in an instructional video, thereby minimizing the user’s browsing time. Addressing this need, the recently proposed Video Corpus Moment Retrieval task (VCMR) (Escorcia et al., 2019; Lei et al., 2020) requires retrieving semantically relevant video moments from a corpus of untrimmed videos by a natural language query, where the moment is a continuous temporal segment.

A distinctive feature of VCMR, setting it apart from typical text-to-video retrieval, lies in the nature of video relevance to the query. Unlike trimmed videos in text-to-video retrieval (Chen and Dolan, 2011), where the entire video aligns with the text query, in an untrimmed video only a small part of the content (the relevant moment) is related to the query, as shown in Figure 1. Existing works (Lei et al., 2020; Li et al., 2020; Zhang et al., 2021a) employ frame-aware video retrieval to capture the partial relevance between the query and video. This entails calculating the similarity between the query and all video frames in the corpus and ranking the videos by the maximum frame similarity within each video. However, these works overlook the semantic structure embedded within the information between video frames, i.e., the event. Cognitive science research (Tversky and Zacks, 2013) suggests that human perception of visual information primarily revolves around the concept of events, with the event being the most fundamental unit of visual perception for humans. In the realm of videos, a sequence depicting a consistent action, object, or environment is termed an event (Shou et al., 2021), composed of frames that are both similar and consecutive. A single frame, used as the retrieval unit, carries less information than the events through which humans perceive a video. While an event may not overlap exactly with the relevant moment, it covers more complete information than a frame, as shown in Figure 1.

To further evaluate the helpfulness of the event for the VCMR task, we measure the overlap between the events extracted using the unsupervised method in (Kang et al., 2022) from the video and the ground truth moment. The results on the TVR validation set are shown in Table 1. The model results are based on the predicted moments, and the event extraction results are derived from video events with the highest overlap with the correct moment (the ideal case). Notably, with the threshold set at 0.5, the optimal extracted events outperform the predicted moments of all models. Given that the events are extracted without any training, these results highlight the utility of event information in video for VCMR. If the event can be utilized effectively, it will enhance the accuracy of retrieval. Retrieval efficiency will also increase because fewer units reduce the amount of computation in retrieval.
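
The overlap measurement behind Table 1 can be illustrated with a small sketch. It assumes that events and moments are given as (start, end) intervals in seconds; the function names `temporal_iou` and `best_event_iou` and the example timestamps are hypothetical and not taken from the paper.

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_event_iou(events, moment):
    """Ideal case: IoU of the extracted event that overlaps the moment most."""
    return max(temporal_iou(e, moment) for e in events)


# Hypothetical events of one video and a ground-truth moment.
events = [(0.0, 6.0), (6.0, 15.0), (15.0, 21.0)]
moment = (7.5, 15.0)
iou = best_event_iou(events, moment)
print(iou >= 0.5, iou >= 0.7)  # counts as a hit at both thresholds
```

Averaging such hits over all queries would give a ratio of the kind reported in Table 1, under the assumption that the best-overlapping event is selected per query.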

However, the frame-aware method, which simply encodes the frame representations with a Transformer, struggles to utilize events for retrieval. There are three main reasons. (1) The event information is not explicitly extracted from the frames. Although contextual relevance is captured by the Transformer, each frame representation is dominated by its own information, which makes it hard to capture the overall information of an event. Hence, directly using a frame as an event is insufficient. (2) It lacks event-level information interaction. Events encapsulate more comprehensive semantic information, and strong semantic associations typically exist between events, as seen in examples such as two correlated steps in an instructional video. (3) The model's attention is not adequately concentrated. The range of attention in the vanilla Transformer (Vaswani et al., 2017) is the entire video, but not all content in an untrimmed, information-rich video is relevant. The most relevant content tends to be intra-event, i.e., adjacent.

To this end, we propose EventFormer to explicitly leverage event information to help VCMR. The model contains two main components for event learning: event reasoning and hierarchical event encoding. The event reasoning module plays a pivotal role in extracting event information from the video based on the frame representation. In reference to the works on generic event boundary detection (Shou et al., 2021), we introduce three event extraction strategies, contrastive convolution, K-means, and window, aimed at aggregating visually similar and consecutive frames into events. The hierarchical event encoding module captures interactions not only at the frame level but also at the event level, obtaining a more semantically relevant representation of events. To encourage the model to focus attention on adjacent content, anchor multi-head self-attention is introduced to augment the Transformer. The VCMR task includes two sub-tasks: video retrieval (VR) and single video moment retrieval (SVMR). In VR, the objective is to retrieve the most pertinent untrimmed video from a large corpus using a natural language description as the query, while SVMR focuses on pinpointing the start and end times of the relevant moment within the retrieved video. For the two sub-tasks, EventFormer adopts distinct training strategies: two-branch contrastive learning for VR and dual optimization for SVMR. Both strategies essentially integrate frame and event into the training process.

We evaluate our proposed EventFormer on three benchmarks, TVR (Lei et al., 2020), ANetCaps (Caba Heilbron et al., 2015), and DiDeMo (Anne Hendricks et al., 2017). The results show the effectiveness and efficiency of EventFormer, achieving new state-of-the-art results. Additionally, we validate the effectiveness of our model in the partially relevant video retrieval (PRVR) task.

Our main contributions are as follows:

• We propose an event-aware model EventFormer for VCMR, motivated by human perception for visual information.
• We adopt event reasoning and hierarchical event encoding for event learning, and anchor multi-head self-attention to enhance close-range dependencies.
• Experiments on three benchmarks show the effectiveness and efficiency, achieving new state-of-the-art results on VCMR. We also validate the effectiveness of the model in PRVR task.

2. Related work

In this section, we first introduce works on two related tasks, text-to-video retrieval and natural language video localization. Then we review works on VCMR. Finally, we present works on generic event boundary detection.

Text-to-video retrieval Similar to VR, text-to-video retrieval aims to find relevant videos from a corpus based on a natural language query. However, the distinction lies in the nature of query-video relevance. In text-to-video retrieval, the video is trimmed to precisely match the entire content of the video with the text query. Text-to-video retrieval methods are broadly categorized into two types: two-tower models (Bain et al., 2021; Gabeur et al., 2020; Ge et al., 2022; Ging et al., 2020; Liu et al., 2019a; Miech et al., 2020; Rouditchenko et al., 2020; Xu et al., 2021b) and one-tower models (Fu et al., 2021; Lei et al., 2021b; Sun et al., 2019; Xu et al., 2021a; Chen et al., 2020c; Han et al., 2021; Wang et al., 2021). Two-tower models utilize separate encoders for obtaining video and query representations, employing a simple similarity function like cosine to measure relevance. These methods are efficient due to decomposable computations of query and video representations. On the other hand, one-tower models leverage cross-modal attention (Bahdanau et al., 2015; Vaswani et al., 2017) for deep interactions between query and video, enhancing retrieval accuracy. Some works (Miech et al., 2021; Liu et al., 2021b; Yu et al., 2022; Lei et al., 2022) combine the strengths of both methods by employing a two-tower model for fast retrieval of potentially relevant videos in the initial stage, followed by a one-tower model to accurately rank the retrieved videos in the subsequent stage.

Natural language video localization The objective of the natural language video localization task is to pinpoint a moment semantically linked to the query. This task bears similarities to SVMR and can be viewed as a specialized case of VCMR, wherein the corpus comprises only one video, and the video must contain the target moment. Early works can be broadly classified into two categories: proposal-based (Liu et al., 2018; Xu et al., 2019; Chen and Jiang, 2019; Xiao et al., 2021; Chen et al., 2018; Zhang et al., 2019, 2021b; Liu et al., 2021a) models and proposal-free (Yuan et al., 2019; Chen et al., 2020b; Zeng et al., 2020; Li et al., 2021; Ghosh et al., 2019; Chen et al., 2019; Zhang et al., 2020b) models. Proposal-based methods first generate moment proposals as candidates and then rank these proposals based on their similarity to the query. On the other hand, proposal-free methods take a direct approach by predicting the start and end positions of the target moment in the video based on the query. Drawing inspiration from the success of the Transformer, particularly in object detection with DETR (DEtection TRansformer) (Carion et al., 2020), recent works propose DETR-based methods (Lei et al., 2021a; Moon et al., 2023; Liu et al., 2022; Cao et al., 2021) for moment localization. These approaches simplify the post-processing of previous predictions into an end-to-end process.

Video corpus moment retrieval VCMR is first proposed by Escorcia et al. (Escorcia et al., 2019), introducing the VR task on top of natural language video localization, with benchmarks derived from localization datasets such as ANetCaps. Lei et al. (Lei et al., 2020) propose the TVR dataset for VCMR, where the videos provide subtitles. Similar to the taxonomy applied to text-to-video retrieval, existing VCMR works fall into one-tower, two-tower, and two-stage methods. One-tower (Zhang et al., 2020a; Yoon et al., 2022) and two-tower methods (Lei et al., 2020; Zhang et al., 2021a; Li et al., 2020), essentially treated as one-stage approaches, address VCMR as a multi-task problem, utilizing a shared backbone model with distinct heads for VR and SVMR. HAMMER (Zhang et al., 2020a) is the first one-tower model with hierarchical fine-grained cross-modal interactions. SQuiDNet (Yoon et al., 2022) utilizes causal inference to keep the model from learning bad retrieval biases. The two-tower method demonstrates superior retrieval efficiency, especially when dealing with numerous videos in the corpus. To capture partial relevance in VR, frame-aware retrieval methods are commonly employed. XML (Lei et al., 2020) is a pioneering work in VCMR using frame-aware retrieval, followed by enhancements in ReLoCLNet (Zhang et al., 2021a), which leverages contrastive learning. Li et al. (Li et al., 2020) introduce HERO, a video-language pre-trained model, significantly improving overall performance. The two-stage method combines one-tower and two-tower approaches, utilizing the two-tower model for VR to quickly retrieve videos and the one-tower model for SVMR to precisely localize the moment. CONQUER (Hou et al., 2021), DMFAT (Zhang et al., 2023) and CKCN (Chen et al., 2023) are two-stage models that employ HERO for video retrieval and propose one-tower models as the moment localizer. CONQUER introduces a moment localizer based on context-query attention (CQA) (Yu et al., 2018). DMFAT innovates with multi-scale deformable attention for multi-granularity feature fusion, and CKCN introduces a calibration network to refine important modality features. Our model also adopts a two-stage approach, differing by integrating an event-aware retrieval strategy. Recently, Dong et al. (Dong et al., 2022) introduce a new task, partially relevant video retrieval (PRVR), which is a weakly supervised version of VR where the relevant moment is not provided.

Generic event boundary detection Generic event boundary detection (GEBD) (Shou et al., 2021) is a video understanding task designed to identify boundaries, dividing the video into several meaningful units that humans perceive as events. Typically, the frames within an event exhibit visual similarity and continuity, with event boundaries aligning with changes in action, subject, and environment. The task provides supervised and unsupervised settings, where the unsupervised setting is suitable to be generalized across various video understanding scenarios. UBoCo (Kang et al., 2022) is a representative work for unsupervised GEBD that leverages contrastive convolution to identify frames with drastic visual variations as event boundaries from the temporal self-similarity matrix (TSM) of video frames. We integrate the method into the event reasoning of the proposed EventFormer and implement two other strategies.

3. Method

In this section, we detail the proposed event-aware retrieval model EventFormer for the VCMR task. We first formulate the VCMR task and its sub-tasks in Section 3.1. Then, we describe the feature extraction of video and query in Section 3.2. Next, we introduce the two main modules of EventFormer, the video retriever and the moment localizer, in Section 3.3 and Section 3.4 respectively. Finally, we present the training and inference of the model on the VCMR task in Section 3.5.

3.1. Task Formulation

Given a video corpus $\mathcal{V}=\{v_{1},v_{2},...,v_{M}\}$, the goal of VCMR is to retrieve the most relevant moment $m_{*}$ using a natural language query $q=\{w^{1},w^{2},...,w^{L}\}$ which consists of a sequence of words. The retrieval can be formulated as:

(1)

m_{*}=\mathop{\rm argmax}\limits_{m}P(m|q,\mathcal{V}).

VCMR can be decomposed into two sub-tasks, VR and SVMR. The goal of VR is to find the video $v_{*}$ that potentially contains the target moment from the corpus:

(2)

v_{*}=\mathop{\rm argmax}\limits_{v}P(v|q).

And SVMR aims to localize moment from the retrieved video:

(3)

m_{*}=\mathop{\rm argmax}\limits_{m}P(m|v_{*},q),

where the predicted moment is decided by the start and end times:

(4)

P(m|v_{*},q)=P(\tau_{st}|v_{*},q)\cdot P(\tau_{ed}|v_{*},q).
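
To make Eq. 4 concrete, the following sketch picks the moment that maximizes the product of start and end probabilities over frame indices. The constraint that the start index does not exceed the end index, the function name `best_span`, and the toy probabilities are assumptions for illustration.

```python
def best_span(p_start, p_end):
    """Return (score, st, ed) maximizing p_start[st] * p_end[ed] with st <= ed."""
    best = (-1.0, 0, 0)
    for st, ps in enumerate(p_start):
        for ed in range(st, len(p_end)):
            score = ps * p_end[ed]
            if score > best[0]:
                best = (score, st, ed)
    return best


# Hypothetical per-frame probabilities for a 4-frame video.
p_start = [0.1, 0.6, 0.2, 0.1]
p_end = [0.5, 0.1, 0.3, 0.1]
score, st, ed = best_span(p_start, p_end)
print(st, ed)  # the unconstrained optimum (st=1, ed=0) is excluded
```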

In video $v_{*}$ , only the segment of the target moment $m_{*}$ holds relevance to the query. As a result, many prior methods typically adopt frame-aware retrieval for VR. We introduce a simple yet effective event-aware retrieval model for VCMR.
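
The retrieval rule described above, scoring each video by its best-matching unit, can be sketched as follows. The same routine applies whether the units are frame or event representations; the names `cosine` and `rank_videos` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def rank_videos(query, corpus):
    """Rank videos by the maximum query-unit similarity.

    corpus maps a video id to its list of unit vectors (frames or events).
    """
    scores = {vid: max(cosine(query, u) for u in units)
              for vid, units in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)


corpus = {
    "a": [[0.0, 1.0], [1.0, 0.1]],  # only one unit is relevant to the query
    "b": [[0.5, 0.5]],
}
print(rank_videos([1.0, 0.0], corpus))
```

With events as units, each video contributes fewer vectors than with frames, which is the source of the efficiency argument made in the introduction.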

3.2. Feature Extractor

The initial features of the model are extracted by pre-trained networks. The visual features (frame features) of a video are encoded by 2D and 3D CNNs, i.e., ResNet (He et al., 2016) and SlowFast (Feichtenhofer et al., 2019), to extract semantic and action features respectively. The textual features of the subtitles in a video and of the text query are encoded by RoBERTa (Liu et al., 2019b). In particular, the feature of a frame is obtained by max-pooling the visual features over a short duration (1.5 seconds); if subtitles are available at the corresponding time, the subtitle feature is obtained by max-pooling the word features of the subtitle over the same duration. The visual features of frames in the $i$-th video $v_{i}$ are formulated as $\bm{F}_{i}=\{\bm{f}_{i}^{1},\bm{f}_{i}^{2},...,\bm{f}_{i}^{T}\}$, and the subtitle features are $\bm{S}_{i}=\{\bm{s}_{i}^{1},\bm{s}_{i}^{2},...,\bm{s}_{i}^{T}\}$. If no subtitle is available at a given time in the video, the corresponding text feature is a vector of zeros. The query feature is $\bm{Q}=\{\bm{w}^{1},\bm{w}^{2},...,\bm{w}^{L}\}$. In this paper, we use bold symbols for vectors, distinguishing them from normal symbols such as $v_{i}$ that indicate a video. Before being fed into the model, all features are mapped by fully connected layers to a space of dimension $D$.

3.3. Event-aware Video Retriever

We propose a two-tower event-aware retriever that utilizes the event representations of the videos as the retrieval units. The extraction of event representations involves event reasoning and hierarchical event encoding shown in Figure 2.

3.3.1. Event Reasoning

We segment the video into units perceived by humans as events, emphasizing the gathering of consecutive and visually similar frames to form events. A representative work for event extraction is UBoCo (Kang et al., 2022) which leverages contrastive convolution to identify event boundaries. We draw on this approach but simplify the process to make it more adaptable to VCMR. In addition, we also adopt two extra event extraction strategies, K-means and window.

Contrastive convolution Utilizing frame representations $\bar{\bm{F}}=\{\bar{\bm{f}}^{1},\bar{\bm{f}}^{2},...,\bar{\bm{f}}^{T}\}$, we compute self-similarities among frames, thereby constructing a Temporal Self-Similarity Matrix (TSM), as shown in Figure 2. A contrastive kernel is convolved along the diagonal of the TSM to compute event boundary scores. The results at the diagonal elements serve as boundary scores, where a higher score indicates a greater likelihood that the frame is a boundary used to split the video into events. The $i$-th frame is taken as a boundary if its score exceeds the mean of all scores by more than a threshold $\delta$.
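
A minimal sketch of this boundary scoring is given below. The exact kernel is not specified in the text, so the sketch assumes a simple one: within a window centred on a diagonal position, similarities between frames on the same side of the candidate boundary count positively and similarities across it count negatively. The window half-width `half`, the normalization by kernel size, and all function names are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def boundary_scores(frames, half=2):
    """Slide an assumed contrastive kernel along the TSM diagonal."""
    T = len(frames)
    tsm = [[cosine(a, b) for b in frames] for a in frames]
    scores = [0.0] * T
    # Only positions where the full window fits are scored.
    for t in range(half, T - half + 1):
        total = 0.0
        for i in range(t - half, t + half):
            for j in range(t - half, t + half):
                same_side = (i < t) == (j < t)
                total += tsm[i][j] if same_side else -tsm[i][j]
        scores[t] = total / (2 * half) ** 2
    return scores


def detect_boundaries(scores, delta):
    """A frame is a boundary if its score exceeds the mean by more than delta."""
    mean = sum(scores) / len(scores)
    return [t for t, s in enumerate(scores) if s - mean > delta]


# Toy video: four frames of one scene followed by four frames of another.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
scores = boundary_scores(frames, half=2)
print(detect_boundaries(scores, delta=0.3))  # frame 4 starts the second event
```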

Kmeans We employ TSM column vectors as features for K-means clustering, partitioning the video into $k$ segments to represent distinct events. To ensure consecutiveness within each segment, we include the frame index as an additional feature.
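
The clustering step can be sketched with a small, dependency-free Lloyd iteration. The deterministic initialization from evenly spaced frames, the normalized frame index as the extra feature, and the conversion of labels into runs of consecutive frames are assumptions of this sketch; the names `kmeans_labels` and `labels_to_events` are hypothetical.

```python
def kmeans_labels(tsm_columns, k, iters=10):
    """Cluster frames using TSM columns plus a normalized frame index."""
    T = len(tsm_columns)
    X = [list(col) + [t / max(T - 1, 1)] for t, col in enumerate(tsm_columns)]
    # Assumed initialization: the centre frame of each of k equal chunks.
    centroids = [list(X[min(T - 1, (2 * c + 1) * T // (2 * k))]) for c in range(k)]
    labels = [0] * T
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
            for x in X
        ]
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def labels_to_events(labels):
    """Split the frame sequence into (start, end) runs of equal labels."""
    events, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            events.append((start, t))
            start = t
    return events


# TSM columns of a toy video with two visually distinct halves.
a = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
labels = kmeans_labels([a] * 4 + [b] * 4, k=2)
print(labels_to_events(labels))
```

The index feature only encourages temporal contiguity; if a cluster still ends up fragmented, splitting into runs as above would yield more than $k$ events.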

Window A fixed-size window divides the video evenly into pieces as events. The window size $w$ is a hyper-parameter.

This paper focuses on extracting visual events and still employs a frame-aware approach for subtitles. Extracting textual events poses more challenges as subtitle information is non-continuous, and subtitles with high similarity may not belong to the same event, as observed in the topic model (Larochelle and Lauly, 2012). The extraction of textual events is left for future work.

3.3.2. Hierarchical Event Encoding

We employ a hierarchical structure to encode event representations, initially focusing on frame representation and subsequently on the event, ensuring the interactions of contextual information at both levels. Transformers for frame and event are augmented with anchor attention, encouraging them to focus on the correlations of neighboring content. And the video retriever is trained by two-branch contrastive learning.

Anchor Attention Untrimmed video contains abundant information, where not all frames or events exhibit strong correlations, and there is a tendency for higher correlations within close ranges. To this end, we introduce anchor multi-head self-attention (AMHSA) to enhance the relevance between neighboring content.

We review vanilla multi-head self-attention (MHSA) (Vaswani et al., 2017):

(5)

{\rm MultiHead}(Q,K,V)={\rm Concat}({\rm head}_{1},...{\rm head}_{h}),

(6)

{\rm head}_{i}={\rm Attention}(Q,K,V),

(7)

\alpha=\frac{QK^{t}}{\sqrt{D}},\ \ {\rm Attention}={\rm softmax}(\alpha)V,

where $\alpha$ is the attention score before softmax normalization. In each attention head, an element in the input computes attention scores with all elements. Instead, we introduce a constraint, allowing an element to calculate attention scores only with a finite number of its neighboring elements. For instance, for the $i$-th frame, attention score computation is limited to the 2 frames before and after, forming a range of $[i-2, i+2]$. Different attention heads can utilize various ranges, such as 2, 3, 4, or all frames (ensuring globality), as shown in Figure 2, capturing multi-scale neighborhood correlations. These ranges for attention computation serve as anchors, leading us to term it “anchor attention.” We use AnchorFormer to denote the Transformer enhanced with anchor attention.
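
One way to realize this constraint is a banded attention mask per head, sketched below for a single head. The sketch omits the learned per-head projections and uses plain lists; `anchor_mask`, `anchor_attention`, and the encoding of the global range as `None` are assumptions.

```python
import math


def anchor_mask(T, r):
    """mask[i][j] is True if position j lies in [i - r, i + r]; r=None is global."""
    return [[r is None or abs(i - j) <= r for j in range(T)] for i in range(T)]


def anchor_attention(Q, K, V, r):
    """Scaled dot-product attention restricted to an anchor range r."""
    T, D = len(Q), len(Q[0])
    mask = anchor_mask(T, r)
    out = []
    for i in range(T):
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(D)
            if mask[i][j] else -math.inf
            for j in range(T)
        ]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]  # masked positions become 0
        z = sum(w)
        out.append([sum(w[j] / z * V[j][d] for j in range(T))
                    for d in range(len(V[0]))])
    return out


# Different heads may use different ranges, e.g. 2, 3, 4 and global.
T = 5
eye = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]
for r in (2, 3, 4, None):
    row0 = anchor_attention(eye, eye, eye, r)[0]
    print(r, [round(x, 3) for x in row0])
```

Because the values here form an identity matrix, each printed row is the attention distribution of frame 0, with zeros outside the anchor range.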

Hierarchical Video Encoder The hierarchical encoding of the event involves frame and event encodings. We first encode frame representations. For the $i$-th video, we input visual features of the video and textual features of subtitles, along with positional embeddings and modality embeddings, into a multi-modal AnchorFormer. This allows for the simultaneous capture of both intra-modal and inter-modal contextual dependencies. The output contextual representations of the visual and textual modalities are $\bar{\bm{F}}_{i}=\{\bar{\bm{f}}_{i}^{1},\bar{\bm{f}}_{i}^{2},...,\bar{\bm{f}}_{i}^{T}\}$ and $\bar{\bm{S}}_{i}=\{\bar{\bm{s}}_{i}^{1},\bar{\bm{s}}_{i}^{2},...,\bar{\bm{s}}_{i}^{T}\}$ respectively.

After event reasoning, the video is partitioned into $N$ events. The initial event representation $\bar{\bm{E}}_{i}=\{\bar{\bm{e}}_{i}^{1},\bar{\bm{e}}_{i}^{2},...,\bar{\bm{e}}_{i}^{N}\}$ is obtained by max-pooling the frame representations contained in each event. Considering that events carry richer semantic information than frames and that tight semantic associations frequently exist between events, we employ an additional AnchorFormer to capture contextual dependencies at the event level. The input of this AnchorFormer is the event representations $\bar{\bm{E}}_{i}$ and subtitle representations $\bar{\bm{S}}_{i}$, and the output is the contextual representations $\hat{\bm{E}}_{i}=\{\hat{\bm{e}}_{i}^{1},\hat{\bm{e}}_{i}^{2},...,\hat{\bm{e}}_{i}^{N}\}$ and $\hat{\bm{S}}_{i}=\{\hat{\bm{s}}_{i}^{1},\hat{\bm{s}}_{i}^{2},...,\hat{\bm{s}}_{i}^{T}\}$.
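
The step from frames to initial event representations can be sketched as follows, here using the window strategy to produce the event segments. Representing events as half-open (start, end) frame ranges and the names `window_events` and `max_pool_events` are assumptions of this sketch.

```python
def window_events(T, w):
    """Split T frames into consecutive windows of size w (the last may be shorter)."""
    return [(s, min(s + w, T)) for s in range(0, T, w)]


def max_pool_events(frames, events):
    """Element-wise max over the frame vectors inside each event."""
    return [[max(col) for col in zip(*frames[s:e])] for s, e in events]


# Five toy frame representations of dimension 2, window size w = 2.
frames = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0], [5.0, 5.0]]
events = window_events(len(frames), w=2)
print(events)
print(max_pool_events(frames, events))
```

The same pooling applies unchanged to segments produced by the contrastive convolution or K-means strategies.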

Query Encoder Token features of a query are processed through a vanilla Transformer to yield token representations $\bar{\bm{w}}^{j}$. Words in a query match the two modalities unevenly. In the query in Figure 2, for example, “…turns and leaves the room” emphasizes the visual modality with its action description, while “Foreman gives Chase a negative answer to his question…” leans towards the textual modality. We therefore adopt modality-specific pooling to create two query representations for the two modalities, denoted as $\bm{Q}_{F}$ (frame) and $\bm{Q}_{S}$ (subtitle). Specifically, we calculate the weight of each word for a modality, followed by a weighted sum of word representations:

(8)

o^{j}=\bm{W}_{d}\bar{\bm{w}}^{j},\quad \alpha^{j}=\frac{\exp(o^{j})}{\sum\limits_{i=1}^{L}\exp(o^{i})},\quad \bm{Q}_{d}=\sum\limits_{j=1}^{L}\alpha^{j}\bar{\bm{w}}^{j},

where $\bm{W}_{d}\in\mathbb{R}^{D\times 1}$ is a fully connected layer that outputs a scalar $o^{j}$, and $d\in\{F,S\}$ denotes the frame or subtitle modality. $\alpha^{j}$ is the softmax-normalized weight of the $j$-th word, and $\bm{Q}_{d}$ is the modality-specific query representation.
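
The pooling in Eq. 8 amounts to a softmax over per-word scalar scores followed by a weighted sum. The sketch below treats the fully connected layer as a single weight vector without bias, which is an assumption; `modality_pool` and the toy vectors are hypothetical.

```python
import math


def modality_pool(words, w_d):
    """Return (alpha, pooled) for word vectors `words` and weight vector `w_d`."""
    o = [sum(a * b for a, b in zip(w_d, w)) for w in words]  # scalar score per word
    m = max(o)
    e = [math.exp(x - m) for x in o]                         # stable softmax
    z = sum(e)
    alpha = [x / z for x in e]
    pooled = [sum(a * w[d] for a, w in zip(alpha, words))
              for d in range(len(words[0]))]
    return alpha, pooled


# Two toy word representations; separate weights give separate query vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
alpha_f, q_f = modality_pool(words, w_d=[10.0, 0.0])  # favours the first word
alpha_s, q_s = modality_pool(words, w_d=[0.0, 10.0])  # favours the second word
print([round(a, 3) for a in alpha_f], [round(a, 3) for a in alpha_s])
```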

Table 2. VR results on TVR validation set, ANetCaps validation set, and DiDeMo test set. $\dagger$: the fine-tuned model before pre-training. The hyper-parameters ($\delta$, $k$, $w$) of the three event extraction strategies are set to (0.3, 10, 5), (0.1, 7, 8), and (0.1, 5, 4) for three datasets respectively. The results of XML and ReLoCLNet are reproduced by us using the same features.

Model	TVR				ANetCaps				DiDeMo
Model	R@1	R@5	R@10	R@100	R@1	R@5	R@10	R@100	R@1	R@5	R@10	R@100
XML (Lei et al., 2020) (ECCV’20)	18.52	41.36	53.15	89.59	6.14	20.69	32.45	75.92	6.23	19.35	29.95	74.16
ReLoCLNet (Zhang et al., 2021a) (SIGIR’21)	22.63	46.54	57.91	90.65	6.66	22.18	34.07	75.59	5.53	18.25	27.96	71.42
HERO (Li et al., 2020) (EMNLP’20)	19.44	42.08	52.34	84.94	4.70	16.77	27.01	67.42	5.11	16.35	33.11	68.38
${\rm HERO}^{{\dagger}}$ (Li et al., 2020) (EMNLP’20)	29.01	52.82	63.07	89.91	6.46	21.45	32.61	73.00	8.46	23.43	34.86	75.36
SQuiDNet (Yoon et al., 2022) (ECCV’22)	31.61	-	65.32	-	-	-	-	-	-	-	-	-
EventFormer (Frame)	25.56	50.14	61.33	91.79	7.50	24.20	37.10	77.97	7.63	24.06	35.06	77.63
EventFormer (Convolution)	28.44	52.92	64.11	92.92	7.97	25.51	37.97	77.62	8.19	23.77	35.33	77.67
EventFormer (Kmeans)	27.51	52.80	64.01	92.54	8.36	26.03	38.42	78.00	7.99	24.08	35.61	77.57
EventFormer (Window)	27.59	52.50	64.07	92.38	8.16	25.76	37.93	77.96	8.39	25.39	35.76	77.72

Two-branch Contrastive Learning We introduce a two-branch contrastive learning method focusing on both frame and event representations for event representation learning. The additional frame representation learning aims to acquire representations that better fit the query, considering that events are composed of frames. A key aspect of representation learning is the selection of positive and negative samples (Chen et al., 2020a). As shown in Figure 3, we sample the positive sample from the range of the correct moment, as this part is explicitly relevant to the query. Given the contextual coherence of the video, content beyond the range of the moment, such as content preceding and following the moment, might possess implicit relevance to the query. We therefore also take a weak positive sample from the content of the video outside the target moment. The negative samples are drawn from videos irrelevant to the query.

Specifically, in the frame branch, the positive sample and the weak positive sample are the frames exhibiting the highest query similarity within and outside the target moment, respectively. For negative frames, we employ the hardest sample mining technique (Faghri et al., 2017), wherein the frame within each negative video exhibiting the highest similarity to the query is chosen as the negative sample. The sampling for subtitles is the same as for frames. We apply the InfoNCE (Oord et al., 2018) loss, taking the positive sample as an example:

(9)

\mathcal{L}^{f}=-\log\frac{\exp(rf^{+}/t)}{\exp(rf^{+}/t)+\sum\limits_{z=1}^{n}\exp(rf^{-}_{z}/t)},

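where $t$ is the temperature set to 0.01, $n$ is the number of negative videos, $rf^{-}$ is the similarity for the negative sample drawn from each negative video, and $rf^{+}$ is the average of the cosine similarities between the query and the positive frame and subtitle: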

(10)

rf^{+}=\frac{1}{2}({\rm cos}(\bm{Q}_{F},\bar{\bm{f}}^{+})+{\rm cos}(\bm{Q}_{S},\bar{\bm{s}}^{+})),

where subtitle similarity ${\rm cos}(\bm{Q}_{S},\bar{\bm{s}}^{+})$ is optional. The computation of weak positive frame loss $\mathcal{L}^{f}_{w}$ is identical to that of the positive frame. In addition to query-to-frame loss, following most works on cross-modal retrieval that employ bidirectional loss, we incorporate frame-to-query loss $\mathcal{L}^{q}$ . The bidirectional loss for frame branch is:

(11)

\mathcal{L}_{F}=\mathcal{L}^{f}+\omega*\mathcal{L}^{f}_{w}+\mathcal{L}^{q},

where $\omega$ is a hyper-parameter set to 0.5.
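
The loss terms in Eq. 9 and Eq. 11 can be sketched numerically as below. The similarities are passed in as plain numbers, so the sampling of positives and negatives is outside the sketch; `info_nce`, `frame_branch_loss`, and the example values are hypothetical.

```python
import math


def info_nce(pos, negs, t=0.01):
    """InfoNCE for one positive similarity and a list of negative similarities."""
    logits = [pos / t] + [n / t for n in negs]
    m = max(logits)
    # log-sum-exp computed stably, since 1/t = 100 makes the logits large
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - pos / t


def frame_branch_loss(l_f, l_fw, l_q, omega=0.5):
    """Frame-branch loss: positive, weighted weak positive, and frame-to-query terms."""
    return l_f + omega * l_fw + l_q


negs = [0.10, 0.20]                 # hardest frame of each negative video
l_f = info_nce(0.90, negs)          # positive frame inside the moment
l_fw = info_nce(0.15, negs)         # weak positive frame outside the moment
l_q = info_nce(0.90, [0.30])        # frame-to-query direction
print(l_f, l_fw, frame_branch_loss(l_f, l_fw, l_q))
```

With a temperature of 0.01, even small gaps between positive and negative similarities dominate the loss, as the weak positive term in this toy example shows.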

For the event branch, the positive event is the event that contains the positive frame, which keeps the two branches of contrastive learning consistent. The negative events are sampled similarly to those in the frame branch. The overall loss for two-branch contrastive learning is:

(12)

\mathcal{L}=\lambda*\mathcal{L}_{F}+\mathcal{L}_{E},

where $\mathcal{L}_{E}$ is InfoNCE loss of event representation learning between query representations $\bm{Q}_{F}$ / $\bm{Q}_{S}$ and event and subtitle representations $\hat{\bm{e}}$ / $\hat{\bm{s}}$ , and $\lambda$ is a hyper-parameter set to 0.8.

Table 3. SVMR and VCMR results on TVR validation and test set. The results of the test set are obtained by submitting predictions to the evaluation system. $*$: reproduced results. $\dagger$: the two-stage models that use HERO as the video retriever.

Model	SVMR (val), IoU=0.7			VCMR (val), IoU=0.7			VCMR (test), IoU=0.7		
Model	R@1	R@10	R@100	R@1	R@10	R@100	R@1	R@10	R@100
HAMMER (Zhang et al., 2020a) (Arxiv’20)	-	-	-	5.13	11.38	16.71	-	-	-
SQuiDNet (Yoon et al., 2022) (ECCV’22)	24.74	-	-	8.52	-	-	10.09	31.22	46.05
XML (Lei et al., 2020) (ECCV’20)	$13.05^{*}$	$38.80^{*}$	$63.13^{*}$	$2.91^{*}$	$10.12^{*}$	$25.10^{*}$	3.32	13.41	30.52
ReLoCLNet (Zhang et al., 2021a) (SIGIR’21)	$14.80^{*}$	$45.85^{*}$	$72.39^{*}$	$4.11^{*}$	$14.41^{*}$	$32.94^{*}$	-	-	-
HERO (Li et al., 2020) (EMNLP’20)	15.30	40.84	63.45	5.13	16.26	24.55	6.21	19.34	36.66
${\rm CONQUER}^{{\dagger}}$ (Hou et al., 2021) (MM’21)	22.84	$53.98^{*}$	$79.24^{*}$	7.76	22.49	35.17	9.24	28.67	41.98
${\rm DMFAT}^{{\dagger}}$ (Zhang et al., 2023) (TCSVT’23)	23.26	-	-	7.99	23.81	36.89	-	-	-
${\rm CKCN}^{{\dagger}}$ (Chen et al., 2023) (TMM’23)	23.18	-	-	7.92	22.00	39.87	-	-	-
EventFormer	25.45	62.87	80.41	10.12	27.54	42.88	11.11	32.78	46.18

3.4. Event-aware Moment Localizer

The moment localizer shown in Figure 4 focuses on accurately pinpointing the location of the target moment. We follow prior VCMR works (Lei et al., 2020; Zhang et al., 2021a; Hou et al., 2021; Li et al., 2020) in using a proposal-free method, i.e., directly learning to predict the start and end positions of the moment. We incorporate event information into the proposal-free method, enhancing the model’s ability to discriminate the start and end positions of moments through dual optimization of frame and event.

Architecture We introduce a one-tower event-aware moment localizer that has a similar structure to the retriever and also leverages AMHSA and event reasoning, but the video encoding requires cross-attention with the query. The query encoder is the same as in the video retriever, employing the vanilla Transformer to encode query words. However, the distinction is that the overall query representation $\bm{Q}_{F}$ or $\bm{Q}_{S}$ is unnecessary. We emphasize that the architecture of the localizer is not novel; however, our innovation lies in the utilization of event and dual optimization.

Dual Optimization As shown in Figure 3, for frame optimization, the objective is to maximize the confidence scores of the frames that are the start and end boundaries of the ground-truth moment. The confidence scores are derived from the output of AnchorFormer. Concretely, we begin by summing the visual and textual outputs at the same index, creating a sequence of multi-modal features of length $T$ . Subsequently, the features are fed to two different 1D convolution networks to generate confidence scores for the start $cf^{st}\in\mathbb{R}^{1}$ and end $cf^{ed}$ boundaries, respectively. The 1D convolutions are used to capture dependencies among neighboring frames. The optimization is based on the cross-entropy loss:

(13)

\mathcal{L}_{F}^{st}=-\log\frac{\exp(lf^{st})}{\sum\limits_{i=1}^{T}\exp(lf_{1}^{i})},\ \ \ \mathcal{L}_{F}^{ed}=-\log\frac{\exp(lf^{ed})}{\sum\limits_{i=1}^{T}\exp(lf_{2}^{i})},

(14)

\mathcal{L}_{F}=\mathcal{L}_{F}^{st}+\mathcal{L}_{F}^{ed},

where $lf_{1}^{i}$ and $lf_{2}^{i}$ are the $i$ -th outputs of the two convolution networks, and $lf^{st}$ and $lf^{ed}$ are the outputs at the ground-truth start and end positions. For event optimization, we expect high confidence scores for events that contain the correct moment boundaries. To obtain the confidence of an event, we use the output event representations and subtitle representations as features. As in frame optimization, text features are also needed for event prediction: we perform max-pooling over the subtitle representations within the scope of an event to obtain its textual event features. The sum of the visual and textual features of an event is fed to two distinct fully connected networks to predict confidence scores that the moment boundaries lie in the event. The optimization is the same as that in Eq. 13 and Eq. 14. The overall loss is:

(15)

\mathcal{L}=\mathcal{L}_{F}+\gamma*\mathcal{L}_{E},

where $\mathcal{L}_{E}$ is the event loss and $\gamma$ is a hyper-parameter set to 0.8.
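The dual optimization above can be sketched in a few lines. This is a minimal illustration, assuming the start and end logits for frames and events are already available as plain Python lists; the helper names (`cross_entropy`, `boundary_loss`, `dual_loss`) and the toy numbers are hypothetical and not taken from the authors' implementation.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy: -log(exp(l[target]) / sum_i exp(l[i]))."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def boundary_loss(start_logits, end_logits, start_idx, end_idx):
    """Pattern of Eq. 13 and Eq. 14: start loss plus end loss."""
    return cross_entropy(start_logits, start_idx) + cross_entropy(end_logits, end_idx)

def dual_loss(frame, event, gamma=0.8):
    """Eq. 15: L = L_F + gamma * L_E.

    frame and event are (start_logits, end_logits, start_idx, end_idx) tuples.
    """
    return boundary_loss(*frame) + gamma * boundary_loss(*event)

# Toy example: 6 frames grouped into 3 events (hypothetical grouping and scores).
frame = ([0.1, 2.0, 0.3, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.4, 2.5, 0.1], 1, 4)
event = ([1.5, 0.2, 0.1], [0.1, 0.3, 1.8], 0, 2)
loss = dual_loss(frame, event)
```

The event targets are the indices of the events that contain the ground-truth start and end frames, so the event branch supervises the same boundaries at a coarser granularity.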

3.5. Training and Inference

We employ a stage-wise training strategy for the two modules. First, the video retriever is trained using in-batch negative sampling (Karpukhin et al., 2020), where all other videos in a batch serve as negative videos. Subsequently, the moment localizer is trained using shared normalization (Shared-Norm) (Clark and Gardner, 2018), a technique widely applied in open-domain QA (Chen et al., 2017). This technique raises the confidence that the moment appears in the correct video while lowering the confidence for wrong videos. Specifically, the softmax normalization in the loss function of Eq. 13 covers confidence scores not only for frames or events in the correct video but also for those in incorrect videos, which serve as negative samples. The negative videos are sampled from the training set based on high similarity to the query, with the similarity computed by the trained video retriever.
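A minimal sketch of the Shared-Norm idea follows, assuming boundary logits are given as Python lists for one positive video and any number of negative videos. The function name `shared_norm_loss` is hypothetical; the only point illustrated is that the softmax denominator pools logits across videos.

```python
import math

def shared_norm_loss(pos_logits, target, neg_logits_per_video):
    """Cross-entropy whose softmax spans the positive video and all negatives.

    pos_logits: boundary logits of the correct video.
    target: index of the ground-truth boundary in pos_logits.
    neg_logits_per_video: one list of logits per negative video.
    """
    pooled = list(pos_logits)
    for neg in neg_logits_per_video:
        pooled.extend(neg)  # negatives only enlarge the denominator
    m = max(pooled)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in pooled))
    return log_z - pos_logits[target]
```

With an empty list of negatives this reduces to the ordinary per-video cross-entropy; a confident boundary in a negative video increases the loss, which pushes the model to keep scores comparable across videos.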

In inference, we first use the video retriever to retrieve the top-10 videos from the corpus based on the average of the highest query-event and query-subtitle similarities $re_{i}$ in the video $v_{i}$ . The moment localizer then predicts the position of the moment in these 10 videos, relying on the confidence scores ( $lf^{st}_{i}$ and $lf^{ed}_{i}$ ) that indicate whether a frame serves as a start or end boundary. The event branch of the moment localizer is excluded from the prediction, as moment localization requires fine-grained frame-level predictions. The confidence score $cm$ for moment prediction consists of the video retrieval score and the moment localization score:

(16)

cm=\frac{re_{i}}{t}+lf^{st}_{i}+lf^{ed}_{i},

where $t$ is the temperature used in contrastive learning, consistent with the training objective, and $cm$ is used to rank the candidate moments.
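The scoring in Eq. 16 can be illustrated with a small sketch. It assumes per-video similarities and frame-level start/end logits are already computed; the dictionary layout, the constraint that the start does not come after the end, the `max_len` cap, and the temperature value are all assumptions made for this example rather than details given in the text.

```python
def retrieval_score(event_sims, subtitle_sims):
    """re_i: average of the highest query-event and query-subtitle similarity."""
    return (max(event_sims) + max(subtitle_sims)) / 2.0

def rank_moments(videos, temperature=0.01, max_len=16):
    """Score every (start, end) pair with start <= end by Eq. 16 and sort."""
    candidates = []
    for vid, v in videos.items():
        re_i = retrieval_score(v["event_sims"], v["subtitle_sims"])
        end_logits = v["end_logits"]
        for s, start_logit in enumerate(v["start_logits"]):
            # max_len (hypothetical) bounds the moment length in frames
            for e in range(s, min(s + max_len, len(end_logits))):
                cm = re_i / temperature + start_logit + end_logits[e]
                candidates.append((cm, vid, s, e))
    return sorted(candidates, reverse=True)

# Toy corpus of two retrieved videos (all numbers are made up).
videos = {
    "a": {"event_sims": [0.2, 0.9], "subtitle_sims": [0.5],
          "start_logits": [0.0, 3.0, 0.0], "end_logits": [0.0, 0.0, 4.0]},
    "b": {"event_sims": [0.1], "subtitle_sims": [0.1],
          "start_logits": [5.0, 0.0], "end_logits": [5.0, 0.0]},
}
best = rank_moments(videos)[0]
```

In this toy corpus a small temperature lets the retrieval score dominate, so video "a" wins, whereas with `temperature=1.0` the localizer logits of video "b" outweigh it.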

Table 4. SVMR and VCMR results on the ANetCaps validation set and the DiDeMo test set. The metric is R@1, IoU=0.5, 0.7.

Dataset	Model	SVMR IoU=0.5	SVMR IoU=0.7	VCMR IoU=0.5	VCMR IoU=0.7
ANetCaps	HAMMER (Zhang et al., 2020a)	41.45	24.27	2.94	1.74
	ReLoCLNet (Zhang et al., 2021a)	-	-	3.09	1.82
	CONQUER (Hou et al., 2021)	$35.63^{*}$	$20.08^{*}$	$2.14^{*}$	$1.33^{*}$
	EventFormer	45.21	27.98	4.32	2.75
DiDeMo	XML (Lei et al., 2020)	-	-	2.36	1.59
	ReLoCLNet (Zhang et al., 2021a)	$34.81^{*}$	$26.71^{*}$	$2.28^{*}$	$1.71^{*}$
	HERO (Li et al., 2020)	$\textbf{39.20}^{*}$	$30.19^{*}$	$3.42^{*}$	$2.79^{*}$
	CONQUER (Hou et al., 2021)	38.17	29.9	3.31	2.79
	DMFAT (Zhang et al., 2023)	-	-	3.44	2.89
	CKCN (Chen et al., 2023)	36.54	28.89	3.22	2.69
	EventFormer	39.02	30.91	3.53	3.12

4. Experiments

4.1. Experimental Details

Datasets We evaluate EventFormer on three benchmarks. TV shows retrieval (TVR) (Lei et al., 2020) is built on TV shows whose videos come with subtitles. The training, validation, and testing sets of TVR consist of 17,435, 2,179, and 1,089 videos, respectively. Each video contains 5 moments for retrieval. The average durations of the videos and moments are 76.2 seconds and 9.1 seconds, respectively. ActivityNet Captions (ANetCaps) (Caba Heilbron et al., 2015) comprises approximately 20K videos, which contain only visual information without subtitles. We follow the setup in (Zhang et al., 2020a; Yoon et al., 2022) with 10,009 videos for training and 4,917 videos for testing, resulting in 37,421 and 17,505 moments, respectively. The average durations of the videos and moments are 120 seconds and 36.18 seconds, respectively. The videos of Distinct Describable Moments (DiDeMo) (Anne Hendricks et al., 2017) come from YFCC100M (Thomee et al., 2016) and also contain only visual information. The dataset is divided into 8,395, 1,065, and 1,004 videos for training, validation, and testing, respectively. Most videos last approximately 30 seconds and are uniformly segmented into 5-second intervals, so moment boundaries always align with multiples of 5 seconds.

Implementation For TVR and DiDeMo, we utilize the 768D RoBERTa features provided by (Lei et al., 2020) for queries and subtitles, and the 4352D SlowFast+ResNet features provided by (Li et al., 2020) as the frame features. Each frame feature covers 1.5 seconds of video sampled at 3 FPS. We follow the feature extraction procedures in (Lei et al., 2020) and (Li et al., 2020) to extract features for ANetCaps. In inference, we first retrieve the top-10 videos and then localize the moment within the retrieved videos. Non-maximum suppression (Girshick et al., 2014) is applied in moment localization to remove overlapping predictions. For Shared-Norm in the moment localizer, the number of negative videos is set to 5, sampled from the top-100 training videos ranked by the video retriever. Anchor sizes for AnchorFormer are set to {3, 6, 9, all} at the frame level and {1, 2, 3, all} at the event level.
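For readers unfamiliar with non-maximum suppression on temporal segments, a minimal greedy sketch is given here. The IoU threshold of 0.5 and the function names are assumptions for illustration; the text does not state the threshold that was used.

```python
def temporal_iou(a, b):
    """IoU of two segments given as (start, end, ...) with start <= end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(moments, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) tuples.

    Moments are visited in descending score order; a moment is kept only if
    its IoU with every previously kept moment is at most iou_threshold.
    """
    kept = []
    for m in sorted(moments, key=lambda x: x[2], reverse=True):
        if all(temporal_iou(m, k) <= iou_threshold for k in kept):
            kept.append(m)
    return kept
```

For example, of two predictions covering 0 to 10 seconds and 1 to 11 seconds, only the higher-scoring one survives, while a prediction at 20 to 30 seconds is unaffected.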

Evaluation Metrics Following (Lei et al., 2020), the metrics for VR are the same as those for text-to-video retrieval, i.e., $R@K$ ( $K=1,5,10,100$ ), the fraction of queries for which the correct video appears in the top $K$ of the ranking list. For SVMR and VCMR, the metrics are $R@K,IoU=\mu$ ( $\mu=0.5,0.7$ ), which additionally require the intersection over union (IoU) between the predicted moment and the ground truth to exceed $\mu$ . SVMR is evaluated only within the correct video for a query, while VCMR is evaluated over all videos in the corpus.
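The VCMR metric described above can be made concrete with a short sketch. The data layout (dictionaries keyed by query) and the function names are assumptions for this example; treating an IoU exactly equal to the threshold as a hit is also an assumption.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truth, k, iou_threshold):
    """R@K, IoU=mu for VCMR.

    predictions: {query: ranked list of (video_id, start, end)}.
    ground_truth: {query: (video_id, start, end)}.
    A query counts as a hit if any of its top-k predictions is in the
    correct video and overlaps the ground truth with IoU >= iou_threshold
    (assumption: ties at the threshold count as hits).
    """
    hits = 0
    for query, (gt_vid, gt_start, gt_end) in ground_truth.items():
        for vid, start, end in predictions.get(query, [])[:k]:
            if vid == gt_vid and temporal_iou((start, end), (gt_start, gt_end)) >= iou_threshold:
                hits += 1
                break
    return hits / len(ground_truth)

# Toy example with two queries (made-up values).
ground_truth = {"q1": ("v1", 10.0, 20.0), "q2": ("v2", 0.0, 5.0)}
predictions = {
    "q1": [("v3", 10.0, 20.0), ("v1", 11.0, 20.0)],
    "q2": [("v2", 3.0, 9.0)],
}
```

SVMR uses the same IoU test but restricts the candidate list to moments from the ground-truth video, so the video check is always satisfied.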

Baselines We compare our model with existing models for the VCMR task, covering one-tower, two-tower, and two-stage methods. One-tower: HAMMER (Zhang et al., 2020a), SQuiDNet (Yoon et al., 2022). Two-tower: XML (Lei et al., 2020), ReLoCLNet (Zhang et al., 2021a), HERO (Li et al., 2020). Two-stage: CONQUER (Hou et al., 2021), DMFAT (Zhang et al., 2023), CKCN (Chen et al., 2023). Additionally, a frame-aware baseline, denoted as EventFormer (Frame), is introduced for VR. This baseline shares the same model architecture and number of parameters as the proposed EventFormer but lacks the anchor attention, event reasoning, and event encoding modules.

4.2. Main Results

VR The results of the VR task on the three datasets are reported in Table 2. Except for the one-tower model SQuiDNet and the large pre-trained video-language model HERO, our event-aware retrieval model surpasses the frame-aware retrieval models, namely XML, ReLoCLNet, and the frame-aware version of EventFormer. SQuiDNet leverages fine-grained cross-modal interaction between video and query for better matching. However, this one-tower method suffers from low retrieval efficiency, as it requires fine-grained interactive matching between the query and every video in the corpus. HERO is pre-trained on HowTo100M (Miech et al., 2019) and TVR, which provides external knowledge for retrieval. However, HERO’s performance on ANetCaps and DiDeMo is sub-optimal, likely due to a domain gap between the videos in these two datasets and HERO’s pre-training data. Among the three event extraction strategies, convolution surpasses the other two on TVR, demonstrating superior adaptability to varying numbers of events in videos and dynamic event spans. Kmeans performs better on ANetCaps, which we attribute to the fact that the majority of videos in ANetCaps come from YouTube and are user-shot, one-shot, continuous sequences, unlike TV shows with explicit scene transitions. On DiDeMo, the window strategy excels because the query-related portion of a video has a consistently fixed size, as detailed in (Anne Hendricks et al., 2017). Future work can explore more robust event extraction methods for diverse datasets.

Table 5. Ablation of the video retriever on the TVR validation set. ’S’: subtitle. ’ER’: event reasoning. ’EI’: event interaction. ’AMHSA’: anchor multi-head self-attention. ’FCL’: frame contrastive learning. ’ECL’: event contrastive learning. ’WP’: weak positive sample.

S	ER	EI	AMHSA	FCL	ECL	WP	R@1	R@5	R@10	R@100
$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	28.44	52.92	64.11	92.92
	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	17.15	38.84	50.30	86.97
$\surd$			$\surd$	$\surd$		$\surd$	25.56	50.14	61.33	91.79
$\surd$	$\surd$		$\surd$	$\surd$	$\surd$	$\surd$	26.54	51.43	62.01	91.91
$\surd$	$\surd$	$\surd$		$\surd$	$\surd$	$\surd$	26.29	51.55	62.18	92.13
$\surd$	$\surd$	$\surd$	$\surd$	$\surd$		$\surd$	26.84	51.48	62.55	91.90
$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$		27.49	52.53	64.01	92.69

Table 6. Ablation of the moment localizer on the TVR validation set. ’EO’: event optimization. ’SN’: Shared-Norm.

S	AMHSA	EO	SN	SVMR (R@1) IoU=0.5	SVMR (R@1) IoU=0.7	VCMR (R@1) IoU=0.5	VCMR (R@1) IoU=0.7
$\surd$	$\surd$	$\surd$	$\surd$	47.14	25.45	17.79	10.12
	$\surd$	$\surd$	$\surd$	41.75	22.07	14.80	8.00
$\surd$		$\surd$	$\surd$	44.64	23.89	16.49	9.23
$\surd$	$\surd$		$\surd$	46.25	25.12	17.12	9.80
$\surd$	$\surd$	$\surd$		43.52	23.07	14.14	8.14

SVMR and VCMR The results of SVMR and VCMR on the three datasets are reported in Table 3 and Table 4. In both tasks, EventFormer outperforms the baselines on almost all metrics regardless of their architecture (one-tower, two-tower, or two-stage); the one exception in Table 4 is SVMR at IoU=0.5 on DiDeMo, where the reproduced HERO is slightly higher. The one-tower (HAMMER, SQuiDNet) and two-stage (CONQUER, DMFAT, and CKCN) models perform better than the two-tower (XML, ReLoCLNet, HERO) models. We attribute this to their fine-grained cross-modal interaction, which enables deep matching between the query and the video. In contrast, two-tower models rely solely on the similarity between frames and the query to determine the boundaries of the target moment. EventFormer has a similar structure to the other two-stage models, leveraging Transformer for multi-modal fusion. However, our model surpasses these models, which we attribute to AMHSA and dual optimization.

4.3. Ablation Study

The results on TVR validation set for video retriever and moment localizer are reported in Table 5 and Table 6, respectively.

Video retriever Subtitles play a crucial role, as many queries in TVR include character names such as “Sheldon”, which align more effectively with the textual information than with the visual content. Retrieval accuracy benefits significantly from event reasoning, event interaction, and anchor attention, which supports the three reasons highlighted in the Introduction for why frame-aware methods fail to leverage event information for video retrieval. The frame branch of two-branch contrastive learning is also effective, demonstrating that query-related frame representations contribute to the learning of event representations. Moreover, weak positive samples enhance learning by exploiting frames or events that are implicitly related to the query.

Moment localizer Subtitle information also helps, but the improvement is not as pronounced as in video retriever. This is because moment localization involves precisely identifying the action described by the query, placing more emphasis on visual information, with text typically playing a supporting role. AMHSA is also effective for moment localization. Event optimization enhances retrieval accuracy, even without direct involvement in the prediction, indicating the beneficial impact of additional optimization. Notably, Shared-Norm exerts a substantial influence on moment localization, particularly in VCMR, as this technique empowers the model with the capability to distinguish moments in different videos.

Table 7. Results of moment localization directly using the events extracted by the three extraction strategies.

Strategy	SVMR IoU=0.5	SVMR IoU=0.7	VCMR IoU=0.5	VCMR IoU=0.7
Window ( $w$ = 5)	20.59	7.86	6.63	2.74
Kmeans ( $k$ = 10)	21.15	9.03	7.11	3.23
Convolution ( $\delta$ = 0.3)	21.31	9.88	6.91	3.51

4.4. Event Reasoning

We evaluate the three event extraction strategies on the TVR validation set. Specifically, we predict the moment by directly using the event in the video that has the highest similarity to the query, as computed by the video retriever. The results are presented in Table 7. While the accuracy falls short of that of the optimal events in the ideal case shown in Table 1, it still demonstrates effectiveness, considering that the events are extracted solely by aggregating consecutive and similar frames, without training for the localization task. The events extracted through contrastive convolution are closer to the ground-truth moments than those from Kmeans and window, owing to its better adaptation to the number and length of events.
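The evaluation protocol above, which uses the best-matching event directly as the predicted moment, can be sketched as follows. The sketch assumes that events are stored as inclusive frame-index spans, that the window strategy simply cuts the video into consecutive groups of `w` frames, and that each frame spans 1.5 seconds; the function names are hypothetical.

```python
def window_events(num_frames, w=5):
    """Assumed window strategy: consecutive, non-overlapping groups of w frames.

    Returns inclusive (first_frame, last_frame) spans; the last span may be shorter.
    """
    return [(s, min(s + w, num_frames) - 1) for s in range(0, num_frames, w)]

def event_to_moment(event_spans, event_sims, seconds_per_frame=1.5):
    """Use the event most similar to the query as the predicted moment.

    event_sims[i] is the query similarity of event_spans[i].
    Returns (start_seconds, end_seconds).
    """
    best = max(range(len(event_spans)), key=lambda i: event_sims[i])
    first, last = event_spans[best]
    return first * seconds_per_frame, (last + 1) * seconds_per_frame

spans = window_events(12, w=5)
moment = event_to_moment(spans, [0.1, 0.7, 0.3])
```

Because the predicted boundaries are tied to event boundaries, the quality of the extracted events directly bounds the achievable IoU under this protocol.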

Table 8. Efficiency and memory on the TVR validation set. Efficiency is measured as the average latency (ms) to retrieve the top-10 videos. Memory (MB) is the storage required for the frame or event vectors saved in advance by a two-tower model. ’Number’ is the total number of frames or events in the corpus.

Model	Params	VR Latency (ms)	VR Memory (MB)	VR Number	SVMR Latency (ms)
CONQUER (Hou et al., 2021)	47M	9932	-	-	156
ReLoCLNet (Zhang et al., 2021a)	8M	88	161	109924	29
HERO (Li et al., 2020)	121M	212	322	109924	97
EventFormer ( $\delta=0.3$ )	9M+18M	51	47	31975	103
EventFormer ( $k=10$ )	9M+18M	43	31	21786	103
EventFormer ( $w=5$ )	9M+18M	43	33	22755	103

4.5. Retrieval Efficiency and Memory Usage

We further analyze the retrieval efficiency and memory usage of EventFormer and other models. We select a two-tower model, ReLoCLNet, a pre-trained two-tower model, HERO, and a one-tower model. Given that the one-tower models HAMMER and SQuiDNet have no published code, we choose CONQUER due to its attempt at the VR task introduced in (Hou et al., 2021). The results are reported in Table 8. In VR, CONQUER is the slowest because it cannot decompose similarity and therefore has to compute the relevance between the query and every video online. HERO is less efficient than XML and our model, owing to its large number of parameters and to representations with twice the dimensionality of those of XML and our model. Our model is the most efficient and consumes the least memory because the number of stored events is much smaller than the number of frames. In SVMR, although our model is not as efficient as the two-tower models, its latency remains acceptable since only 10 videos interact with the query at a fine-grained level.
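The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes, as a guess that is not stated in this section, that each stored vector has 384 float32 components and that memory is counted in units of 2^20 bytes; under these assumptions the estimates land close to the values reported in Table 8.

```python
def index_size_mib(num_vectors, dim, bytes_per_value=4):
    """Storage of a dense index in units of 2**20 bytes."""
    return num_vectors * dim * bytes_per_value / 2**20

# Counts from the 'Number' column of Table 8; dim=384 is an assumption.
frame_index = index_size_mib(109924, 384)  # frame-level index
event_index = index_size_mib(31975, 384)   # event-level index (delta = 0.3)
ratio = 31975 / 109924                     # fraction of vectors that remain
```

With these assumptions the frame-level index comes to roughly 161 and the event-level index to roughly 47, i.e. the event index stores under a third of the vectors, which also reduces the number of similarity computations per query.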

Table 9. PRVR results on TVR (without subtitle) validation set. All models use the same features, ResNet+I3D (Carreira and Zisserman, 2017) for video and RoBERTa for the query.

VCMR models w/o moment localization:
Model	R@1	R@5	R@10	R@100	SumR
XML (Lei et al., 2020) (ECCV’20)	10.0	26.5	37.3	81.3	155.1
ReLoCLNet (Zhang et al., 2021a) (SIGIR’21)	10.7	28.1	38.1	80.3	157.1
CONQUER (Hou et al., 2021) (MM’21)	11.0	28.9	39.6	81.3	160.8
PRVR models:
MS-SL (Dong et al., 2022) (MM’22)	13.5	32.1	43.4	83.4	172.4
PEAN (Jiang et al., 2023) (ICME’23)	13.5	32.8	44.1	83.9	174.2
GMMFormer (Wang et al., 2023) (AAAI’24)	13.9	33.3	44.5	84.9	176.6
DL-DKD (Dong et al., 2023) (ICCV’23)	14.4	34.9	45.8	84.9	179.9
EventFormer ( $\delta=0.3$ )	14.2	34.6	46.0	84.8	179.6

4.6. Partially Relevant Video Retrieval

Beyond VCMR, we evaluate EventFormer on the PRVR task, which can be seen as a weakly supervised version of VR because it does not provide the ground-truth moment for the query. This poses a challenge for our model because the positive frames and events for two-branch contrastive learning are sampled based on the moment relevant to the query. We adapt the sampling strategy to PRVR by selecting the two frames or events in the correct video with the highest similarity to the query as the positive and weak positive samples. The negative sampling from other videos is the same as in supervised EventFormer. The results are presented in Table 9. Our model achieves higher retrieval accuracy than the other models, except for DL-DKD, which distills knowledge from a large-scale vision-language model. Notably, our model achieves this performance while being trained solely on the dataset’s training set. Our model is designed for the supervised VCMR task, leaving room for improvement in weakly supervised VR in future work.
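The adapted sampling for PRVR amounts to a top-2 selection, sketched below. The function name is hypothetical, and assigning the most similar unit to the positive role and the runner-up to the weak positive role is an assumption; the text only states that the two most similar frames or events are used.

```python
def pick_positives(similarities):
    """Select the positive and weak positive sample for one query.

    similarities: query similarity of each frame (or event) in the correct video.
    Returns (positive_index, weak_positive_index), i.e. the two most similar units.
    """
    if len(similarities) < 2:
        raise ValueError("need at least two frames or events")
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    return order[0], order[1]

positive, weak_positive = pick_positives([0.1, 0.8, 0.3, 0.6])
```

Since the selection depends on the current model's similarities, the chosen samples can change as training progresses.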

4.7. Case Study

We present three cases in Figure 5. In the first case, the extracted event overlaps perfectly with the ground truth, because the boundaries of the moment fall exactly where the visual content suddenly changes. In the second case, the change occurs in the middle of the moment, resulting in the event capturing the front half of the content. However, this part remains pertinent to the query. The last case is a failure example, where multiple changes occur within the moment, making the event reasoning struggle to capture consecutive and similar frames.

5. Conclusion

This paper proposes an event-aware retrieval model, EventFormer, for the VCMR task, motivated by human perception of visual information. To extract event representations of video for retrieval, EventFormer leverages event reasoning and two-level hierarchical event encoding. Anchor multi-head self-attention is introduced for Transformer to enhance close dependencies in the untrimmed video. We adopt two-branch contrastive learning and dual optimization for the training of the two sub-tasks in VCMR. Extensive experiments show the effectiveness and efficiency of EventFormer on VCMR. The ablation study and case study further verify the efficacy and rationale of each module in our model. The effectiveness of the model is also validated on the PRVR task. Our approach has limitations, particularly in robustness across videos from different datasets. Additionally, our event reasoning relies mainly on visual frame similarity, making it sensitive to changes in visual appearance. Future work can address these problems by introducing more semantic associations.

References

Anne Hendricks et al. (2017) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision. 5803–5812.
Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015.
Bain et al. (2021) Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1728–1738.
Caba Heilbron et al. (2015) Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition. 961–970.
Cao et al. (2021) Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, and Yuexian Zou. 2021. On Pursuit of Designing Multi-modal Transformer for Video Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 9810–9823.
Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European conference on computer vision. Springer, 213–229.
Carreira and Zisserman (2017) Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
Chen and Dolan (2011) David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. 190–200.
Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1870–1879.
Chen et al. (2018) Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018. Temporally grounding natural sentence in video. In Proceedings of the 2018 conference on empirical methods in natural language processing. 162–171.
Chen et al. (2019) Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. 2019. Localizing natural language in videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8175–8182.
Chen et al. (2020b) Long Chen, Chujie Lu, Siliang Tang, Jun Xiao, Dong Zhang, Chilie Tan, and Xiaolin Li. 2020b. Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 10551–10558.
Chen and Jiang (2019) Shaoxiang Chen and Yu-Gang Jiang. 2019. Semantic proposal for activity localization in videos via sentence query. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8199–8206.
Chen et al. (2020c) Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020c. Fine-grained video-text retrieval with hierarchical graph reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10638–10647.
Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020a. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
Chen et al. (2023) Tongbao Chen, Wenmin Wang, Zhe Jiang, Ruochen Li, and Bingshu Wang. 2023. Cross-Modality Knowledge Calibration Network for Video Corpus Moment Retrieval. IEEE Transactions on Multimedia (2023).
Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 845–855.
Dong et al. (2022) Jianfeng Dong, Xianke Chen, Minsong Zhang, Xun Yang, Shujie Chen, Xirong Li, and Xun Wang. 2022. Partially Relevant Video Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia. 246–257.
Dong et al. (2023) Jianfeng Dong, Minsong Zhang, Zheng Zhang, Xianke Chen, Daizong Liu, Xiaoye Qu, Xun Wang, and Baolong Liu. 2023. Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11302–11312.
Escorcia et al. (2019) Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan C. Russell. 2019. Temporal Localization of Moments in Video Collections with Natural Language. (2019). arXiv:1907.12763
Faghri et al. (2017) Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017).
Feichtenhofer et al. (2019) Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision. 6202–6211.
Fu et al. (2021) Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. 2021. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681 (2021).
Gabeur et al. (2020) Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. Springer, 214–229.
Ge et al. (2022) Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie, and Ping Luo. 2022. Bridging video-text retrieval with multiple choice questions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16167–16176.
Ghosh et al. (2019) Soham Ghosh, Anuva Agarwal, Zarana Parekh, and Alexander G Hauptmann. 2019. ExCL: Extractive Clip Localization Using Natural Language Descriptions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 1984–1990.
Ging et al. (2020) Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. Coot: Cooperative hierarchical transformer for video-text representation learning. Advances in neural information processing systems 33 (2020), 22605–22618.
Girshick et al. (2014) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 580–587.
Han et al. (2021) Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. 2021. Fine-grained cross-modal alignment network for text-video retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 3826–3834.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
Hou et al. (2021) Zhijian Hou, Chong-Wah Ngo, and Wing Kwong Chan. 2021. CONQUER: Contextual query-aware ranking for video corpus moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 3900–3908.
Jiang et al. (2023) Xun Jiang, Zhiguo Chen, Xing Xu, Fumin Shen, Zuo Cao, and Xunliang Cai. 2023. Progressive Event Alignment Network for Partial Relevant Video Retrieval. In 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1973–1978.
Kang et al. (2022) Hyolim Kang, Jinwoo Kim, Taehyun Kim, and Seon Joo Kim. 2022. Uboco: Unsupervised boundary contrastive learning for generic event boundary detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20073–20082.
Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781.
Larochelle and Lauly (2012) Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. Advances in Neural Information Processing Systems 25 (2012).
Lei et al. (2021a) Jie Lei, Tamara L Berg, and Mohit Bansal. 2021a. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems 34 (2021), 11846–11858.
Lei et al. (2022) Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L Berg, and Licheng Yu. 2022. Loopitr: Combining dual and cross encoder architectures for image-text retrieval. arXiv preprint arXiv:2203.05465 (2022).
Lei et al. (2021b) Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. 2021b. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7331–7341.
Lei et al. (2020) Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. Tvr: A large-scale dataset for video-subtitle moment retrieval. In European Conference on Computer Vision. 447–463.
Li et al. (2021) Kun Li, Dan Guo, and Meng Wang. 2021. Proposal-free video grounding with contextual pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 1902–1910.
Li et al. (2020) Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical Encoder for Video+ Language Omni-representation Pre-training. In EMNLP.
Liu et al. (2021a) Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, and Yulai Xie. 2021a. Context-aware biaffine localizing network for temporal sentence grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11235–11244.
Liu et al. (2021b) Haoliang Liu, Tan Yu, and Ping Li. 2021b. Inflate and shrink: Enriching and reducing interactions for fast text-image retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 9796–9809.
Liu et al. (2018) Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. 2018. Cross-modal moment localization in videos. In Proceedings of the 26th ACM international conference on Multimedia. 843–851.
Liu et al. (2019a) Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019a. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487 (2019).
Liu et al. (2022) Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. 2022. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3042–3051.
Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Miech et al. (2021) Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2021. Thinking fast and slow: Efficient text-to-visual retrieval with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9826–9836.
Miech et al. (2020) Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2020. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9879–9889.
Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision. 2630–2640.
Moon et al. (2023) WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. 2023. Query-dependent video representation for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23023–23033.
Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
Rouditchenko et al. (2020) Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, et al. 2020. Avlnet: Learning audio-visual language representations from instructional videos. arXiv preprint arXiv:2006.09199 (2020).
Shou et al. (2021) Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, and Matt Feiszli. 2021. Generic event boundary detection: A benchmark for event segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8075–8084.
Sun et al. (2019) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision. 7464–7473.
Thomee et al. (2016) Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Commun. ACM 59, 2 (2016), 64–73.
Tversky and Zacks (2013) Barbara Tversky and Jeffrey M Zacks. 2013. Event perception. Oxford handbook of cognitive psychology 1, 2 (2013), 3.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Wang et al. (2021) Xiaohan Wang, Linchao Zhu, and Yi Yang. 2021. T2vlad: global-local sequence alignment for text-video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5079–5088.
Wang et al. (2023) Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, and Shu-Tao Xia. 2023. GMMFormer: Gaussian-Mixture-Model based Transformer for Efficient Partially Relevant Video Retrieval. arXiv preprint arXiv:2310.05195 (2023).
Xiao et al. (2021) Shaoning Xiao, Long Chen, Jian Shao, Yueting Zhuang, and Jun Xiao. 2021. Natural Language Video Localization with Learnable Moment Proposals. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 4008–4017.
Xu et al. (2021a) Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, and Luke Zettlemoyer. 2021a. Vlm: Task-agnostic video-language model pre-training for video understanding. arXiv preprint arXiv:2105.09996 (2021).
Xu et al. (2021b) Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021b. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021).
Xu et al. (2019) Huijuan Xu, Kun He, Bryan A Plummer, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. Multilevel language and vision integration for text-to-clip retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9062–9069.
Yoon et al. (2022) Sunjae Yoon, Ji Woo Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, and Chang D Yoo. 2022. Selective Query-Guided Debiasing for Video Corpus Moment Retrieval. In European Conference on Computer Vision. Springer, 185–200.
Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In International Conference on Learning Representations.
Yu et al. (2022) Tan Yu, Hongliang Fei, and Ping Li. 2022. Cross-probe bert for fast cross-modal search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2178–2183.
Yuan et al. (2019) Yitian Yuan, Tao Mei, and Wenwu Zhu. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9159–9166.
Zeng et al. (2020) Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. 2020. Dense regression network for video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10287–10296.
Zhang et al. (2020a) Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie, and Fei Sha. 2020a. A hierarchical multi-modal encoder for moment localization in video corpus. arXiv preprint arXiv:2011.09046 (2020).
Zhang et al. (2019) Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. 2019. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1247–1257.
Zhang et al. (2021a) Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. 2021a. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 685–695.
Zhang et al. (2020b) Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. 2020b. Span-based Localizing Network for Natural Language Video Localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6543–6554.
Zhang et al. (2021b) Mingxing Zhang, Yang Yang, Xinghan Chen, Yanli Ji, Xing Xu, Jingjing Li, and Heng Tao Shen. 2021b. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12669–12678.
Zhang et al. (2023) Xuemei Zhang, Peng Zhao, Jinsheng Ji, Xiankai Lu, and Yilong Yin. 2023. Video Corpus Moment Retrieval via Deformable Multigranularity Feature Fusion and Adversarial Training. IEEE Transactions on Circuits and Systems for Video Technology (2023).