
License: arXiv.org perpetual non-exclusive license
arXiv:2403.11189v1 [cs.CV] 17 Mar 2024


1 Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; 2 Wormpex AI Research; 3 University of Illinois Chicago

Boosting Semi-Supervised Temporal Action Localization by Learning from Non-Target Classes

Kun Xia 1, Le Wang 1, San** Zhou 1, Gang Hua 2, Wei Tang 3
Abstract

The crux of semi-supervised temporal action localization (SS-TAL) lies in excavating valuable information from abundant unlabeled videos. However, current approaches predominantly focus on building models that are robust to the error-prone target class (i.e., the predicted class with the highest confidence) while ignoring informative semantics within non-target classes. This paper approaches SS-TAL from a novel perspective by advocating for learning from non-target classes, transcending the conventional focus solely on the target class. The proposed approach involves partitioning the label space of the predicted class distribution into distinct subspaces: target class, positive classes, negative classes, and ambiguous classes, aiming to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes. To this end, we first devise innovative strategies to adaptively select high-quality positive and negative classes from the label space, by modeling both the confidence and rank of a class in relation to those of the target class. Then, we introduce novel positive and negative losses designed to guide the learning process, pushing predictions closer to positive classes and away from negative classes. Finally, the positive and negative processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos. Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the superiority of the proposed method over prior state-of-the-art approaches.

Keywords: Temporal Action Localization · Semi-Supervised Learning

1 Introduction

Temporal Action Localization (TAL) attempts to temporally locate and recognize action instances of interest in untrimmed videos. It is a fundamental yet challenging task in computer vision, with a wide range of applications, such as security surveillance [7, 36] and human behavior analysis [8, 26]. Traditional TAL approaches rely heavily on large-scale, well-annotated datasets, whose annotation is both tedious and time-consuming in practice. In response to these challenges, recent efforts have been directed toward Semi-Supervised Temporal Action Localization (SS-TAL), aiming to train models using only a limited number of labeled samples and a substantial amount of unlabeled data.

Figure 1: Illustration of unreliable predictions on an unlabeled video snippet. A common practice is to treat the action class with the highest confidence, i.e., “Putting on Shoes”, as its target class for model optimization, while the ground truth label, i.e., “Sailing”, is buried in the non-target classes.

Recent advances in SS-TAL [11, 31, 21, 32] have demonstrated notable success, leveraging two well-known semi-supervised learning paradigms: consistency regularization and self-training. Consistency regularization approaches [11, 31] aim to generate reliable predictions through a teacher model to guide the learning process of the student model. However, learning a decent teacher model with limited labeled data is as challenging as the SS-TAL task itself. More recently, self-training approaches [21, 32] tailored for SS-TAL have dominated this area, attaining state-of-the-art performance. These approaches iteratively use the current model to assign pseudo labels to unlabeled videos and train a new model on both the labeled videos and the pseudo-labeled videos.

Despite achieving promising results, existing approaches simply utilize the target class (i.e., the predicted class with the highest confidence) as the pseudo label, which has two significant drawbacks. First, the target class tends to be highly noisy, given that the model is trained on a limited amount of labeled data. Second, the non-target classes are entirely disregarded, even though they often contain valuable cues about the action. An illustrative example is depicted in Figure 1. A video snippet of "Sailing" is mistakenly assigned the target class "Putting on Shoes" for self-training, leading to noisy pseudo labels, while the semantics of the ground truth label are buried among the ignored non-target classes.

In this paper, we approach Semi-Supervised Temporal Action Localization from a novel perspective by learning informative semantics from non-target classes, moving beyond the traditional focus on the target class. Given a predicted class probability distribution on unlabeled data, we often observe two phenomena. First, when the ground truth label does not align with the target class, it frequently falls within other top-ranked classes in the prediction. Second, it is highly unlikely that the low-confidence or bottom-ranked classes contain the ground truth label.

Building upon this observation, we partition the label space of the predicted class probability distribution into four subspaces: target class, positive classes, negative classes, and ambiguous classes. As mentioned earlier, the target class is defined as the highest-confidence class. Positive classes encompass non-target classes with high confidences, often covering the ground truth class. Negative classes comprise non-target classes with low confidences, making them unlikely to contain the ground truth class. The remaining non-target classes form the ambiguous classes.

While the idea of learning from non-target classes is intriguing, two key challenges need to be addressed: How should the non-target classes, especially the positive and negative classes, be identified from the predicted class distribution? How can the model effectively learn from these non-target classes? In response to the first challenge, we devise innovative strategies to adaptively select high-quality positive and negative classes from the label space. This involves modeling both the confidence and rank of a class in relation to those of the target class. To tackle the second challenge, we introduce novel positive and negative losses designed to push the prediction closer to the positive classes and push it away from the negative classes. Consequently, positive learning empowers the model to extract richer semantics relevant to the true class but absent in the target class, while negative learning reinforces the model’s belief of which classes are incorrect. Given the high uncertainty and noise associated with ambiguous classes, we exclude them from the training process. Finally, we integrate the positive and negative learning processes into a hybrid positive-negative learning framework to leverage the non-target classes across both labeled and unlabeled videos.

The main contributions of this paper are summarized as follows:

  • This paper introduces a novel paradigm for SS-TAL by emphasizing learning from non-target classes, transcending the conventional focus solely on the target class. The approach involves partitioning the label space of the predicted class distribution into different subspaces, aiming to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes.

  • Key aspects of this novel paradigm include identifying the positive and negative classes and learning from these non-target classes. The paper introduces innovative strategies for adaptively selecting high-quality positive and negative classes from the label space. Additionally, new positive and negative losses are proposed to guide the non-target learning effectively. These processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos.

  • We evaluate the proposed approach on THUMOS14 and ActivityNet v1.3 under a wide range of training settings. Extensive experiments demonstrate that our approach surpasses the previous state-of-the-art methods.

2 Related Work

Fully-Supervised Temporal Action Localization has witnessed significant advancements in recent years by using plentiful well-annotated videos. Concretely, early anchor-based methods [5, 35, 30] typically employ multi-scale anchors and attach a classification head and a boundary regression head to refine these pre-defined anchors. Anchor-free methods [18, 40, 17, 33, 22] directly regress the boundary locations or perform frame-level action classification to reduce the complexity. Currently prevailing Transformer-based methods [39, 20, 24, 14] tackle temporal action localization in a Transformer encoder-decoder framework, which models action instances as a set of learnable action queries.

Semi-Supervised Temporal Action Localization leverages valuable information from the unlabeled data at a lower annotation cost. Existing methods [11, 31, 9, 21, 32] benefit from the development of general semi-supervised learning [25, 27] and follow two frameworks, i.e., consistency regularization and self-training. Ji et al. [11] design two essential types of sequential perturbations to make consistent action proposal predictions for both teacher and student models. Nag et al. [21] develop a proposal-free temporal masking model to solve the localization error propagation problem. Xia et al. [32] tackle the label noise problem and present a noise-tolerant framework to update the model with reliable pseudo labels that are strictly screened.

Figure 2: An overview of our proposed Non-target Classes Learning framework. It follows the self-training paradigm, which iteratively uses the current model to assign pseudo labels to unlabeled videos and trains a new model on both the labeled videos and the pseudo-labeled videos. Given an unlabeled video snippet, the current model predicts a probability distribution of all classes. Our method adaptively partitions the label space $\Omega$ into a target class $\Omega^{tgt}$, positive classes $\Omega^{pos}$, negative classes $\Omega^{neg}$, and ambiguous classes $\Omega^{amb}$, by modeling both the confidence and rank of a class in relation to those of the target class. Based on the label space partition, we design the new positive learning loss $\ell_{pos}$ and negative learning loss $\ell_{neg}$ to mine positive and negative semantics that are absent in the target class, while excluding ambiguous classes.

Learning on Pseudo Labels is a key technique in semi-supervised learning. However, most approaches [6, 41, 16, 13, 23] are limited to learning directly from the target class, so it is inevitable that the model will be misled by noisy pseudo labels. Chen et al. [6] present a proposal self-assignment for pseudo label assignment, which injects the proposals from the student into the teacher and generates accurate pseudo labels to match each proposal in the student model accordingly. Apart from the above methods, the complementary label has been used to specify a class that a sample does not belong to [10]. Yu et al. [37] theoretically analyze the problem of biased complementary labels and propose to estimate transition probabilities with no bias. Kim et al. [15] aim at learning clean data with ground truth labels while training noisy data with a randomly selected label as a complementary label.

Differing from existing methods, we introduce a novel negative learning approach that adaptively selects richer negative classes based on the confidence of the target class. These negative classes are more informative, reducing the risk of selecting the true label. Additionally, our new positive learning method extracts additional semantics relevant to the true class that may be absent in the target class.

3 Method

3.1 Preliminaries

Problem Setting. Given a smaller set of $N^{l}$ labeled videos $\{X_{i}^{l},Y_{i}^{l}\}_{i=1}^{N^{l}}$ and a larger set of $N^{u}$ unlabeled videos $\{X_{i}^{u}\}_{i=1}^{N^{u}}$, semi-supervised temporal action localization (SS-TAL) aims to improve action detection by effectively learning from both labeled and unlabeled data. The annotation $Y_{i}^{l}$ of each labeled video contains the start time, end time, and action category of each action instance.

Feature Embedding. For a video $X$, following conventions [28, 22], we extract its snippet-level features $\{\boldsymbol{x}_{i}\}_{i=1}^{N^{v}}$ from consecutive frames by a fine-tuned two-stream network, where $N^{v}$ is the number of video snippets.

Baseline Model. Recent works [21, 32] formulate SS-TAL as a snippet-level classification task. Our method also adopts the proposal-free framework with self-training for SS-TAL, which locates action instances by a classification head and a mask head, as in prior work [21, 32]. The learning objective is to minimize the loss function below:

$$\mathcal{L}=\mathcal{L}^{s}+\alpha\mathcal{L}^{u}, \tag{1}$$

where $\mathcal{L}^{s}$ and $\mathcal{L}^{u}$ denote the supervised loss and the unsupervised loss applied on labeled videos and unlabeled videos, respectively, and $\alpha$ is a hyper-parameter. The main purpose of the action detection model is to learn the parameters $\theta$ of a model $\mathbb{F}\left(\cdot;\theta\right)$ by optimizing a cross-entropy (CE) loss function on both labeled and unlabeled data:

$$\mathcal{L}^{s}=\frac{1}{N^{v}}\sum_{i=1}^{N^{v}}\ell^{ce}\left(\mathbb{F}\left(\boldsymbol{x}_{i}^{l};\theta\right),\boldsymbol{y}_{i}^{l}\right), \tag{2}$$

$$\mathcal{L}^{u}=\frac{1}{N^{v}}\sum_{i=1}^{N^{v}}\ell^{ce}\left(\mathbb{F}\left(\boldsymbol{x}_{i}^{u};\theta\right),\boldsymbol{y}_{i}^{u}\right), \tag{3}$$

where $\boldsymbol{x}_{i}^{l}$ and $\boldsymbol{x}_{i}^{u}$ are respectively the $i$-th snippet feature vectors of a labeled video and an unlabeled video. $\boldsymbol{y}_{i}^{l}\in\mathbb{R}^{C+1}$ and $\boldsymbol{y}_{i}^{u}\in\mathbb{R}^{C+1}$ are the one-hot vectors of their ground truth label and pseudo label, respectively, including $C$ action classes and a background class.

3.2 Motivation

Existing approaches simply utilize the target class (i.e., the predicted class with the highest confidence) as the pseudo label. The target class tends to be highly noisy, given that the model is trained on a limited amount of labeled data, thereby significantly degrading the self-training. This paper moves beyond the traditional focus on the target class and addresses SS-TAL from a novel perspective, by learning informative semantics from non-target classes. The motivation for our approach stems from two key observations regarding a predicted class probability distribution on unlabeled data. First, when the ground truth label does not align with the target class, it frequently falls within other top-ranked classes in the prediction. Second, it is highly unlikely that the low-confidence or bottom-ranked classes contain the ground truth label.

Building upon these observations, we divide the label space of the predicted class probability distribution on an unlabeled video snippet into four subspaces:

$$\Omega=\left\{1,\ldots,C+1\right\}=\Omega^{tgt}\cup\Omega^{pos}\cup\Omega^{neg}\cup\Omega^{amb}, \tag{4}$$

where $\Omega^{tgt}$ only holds the target class while $\Omega^{pos}$, $\Omega^{neg}$ and $\Omega^{amb}$ are the positive classes, negative classes, and ambiguous classes, respectively. Positive classes encompass non-target classes with high confidences, often covering the ground truth class. Negative classes comprise non-target classes with low confidences, making them unlikely to contain the ground truth class. The remaining non-target classes form the ambiguous classes. Complementary to traditional target class-based learning (Sec. 3.3), negative learning (Sec. 3.4) reinforces the model’s belief of which classes are incorrect, while positive learning (Sec. 3.5) empowers the model to extract richer semantics relevant to the true class but absent in the target class. Given the high uncertainty and noise associated with ambiguous classes, we exclude them from self-training.

3.3 Learning from Target Class

Existing approaches first obtain the probability distribution $\boldsymbol{p}=\mathbb{F}\left(\boldsymbol{x}^{u};\theta\right)$ from an unlabeled snippet $\boldsymbol{x}^{u}$, and then use $\arg\max_{c}(p_{c})$ as its target class to construct the one-hot pseudo label vector $\boldsymbol{y}^{u}$. The learning objective is formulated as the cross-entropy loss between the model prediction and the target class:

$$\ell^{tgt}=-\sum_{c=1}^{C+1}y^{u}_{c}\log p_{c}, \tag{5}$$

where $C$ is the number of action classes and $y^{u}_{c}\in\{0,1\}$ indicates whether class $c$ is the target class. The model is trained by maximizing the log-likelihood of the target class.
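As a minimal sketch of Eqs. (1) to (3) and (5), assuming the model outputs are already available as NumPy arrays of per-snippet class probabilities, the baseline objective can be written as follows. The function names, the toy probabilities, and the value of `alpha` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def one_hot_argmax(probs):
    """Build one-hot pseudo labels from the highest-confidence class of each row."""
    labels = np.zeros_like(probs)
    labels[np.arange(probs.shape[0]), probs.argmax(axis=1)] = 1.0
    return labels

def cross_entropy(probs, one_hot, eps=1e-12):
    """Per-snippet cross-entropy between predictions and one-hot labels, as in Eq. (5)."""
    return -np.sum(one_hot * np.log(probs + eps), axis=1)

def baseline_loss(probs_l, labels_l, probs_u, alpha=1.0):
    """Supervised plus weighted unsupervised loss; alpha is a hypothetical setting."""
    loss_s = cross_entropy(probs_l, labels_l).mean()                 # Eq. (2)
    loss_u = cross_entropy(probs_u, one_hot_argmax(probs_u)).mean()  # Eq. (3)
    return loss_s + alpha * loss_u                                   # Eq. (1)

# Toy example with C + 1 = 3 classes (made-up probabilities).
probs_l = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
labels_l = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs_u = np.array([[0.5, 0.3, 0.2]])
loss = baseline_loss(probs_l, labels_l, probs_u, alpha=0.5)
```

In an actual training loop the pseudo labels would be treated as constants, so that no gradient flows through the argmax. If the argmax class is wrong, this loss reinforces the error, which is the failure mode the non-target learning is meant to mitigate.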

3.4 Learning from Negative Classes

As previously discussed in Sec. 3.2, the model may exhibit uncertainty regarding whether a video snippet belongs to the noisy target class but can be fairly certain that it does not belong to negative classes. To effectively learn negative information, a negative class is chosen from non-target classes, and the model is then trained using a negative learning loss given by:

$$\tilde{\ell}^{neg}=-\log\left(1-p_{c^{neg}}\right), \tag{6}$$

which aims to minimize the log-likelihood on the negative class. However, selecting suitable negative classes is challenging. On the one hand, only selecting one negative class is insufficient to learn valuable negative information. On the other hand, regarding all non-target classes as negative classes would carry the risk of negatively learning the ground truth semantics buried in non-target classes. Therefore, we design an adaptive negative learning strategy to tackle this challenge.

Specifically, let $\boldsymbol{p}=\left[p_{1},\ldots,p_{C+1}\right]$ denote the class probability distribution predicted on an unlabeled video snippet. Then, we sort it in ascending order of the confidence:

$$\hat{\boldsymbol{p}}=\mathrm{sorted}(\boldsymbol{p})=\left[\min(\boldsymbol{p}),\ldots,\max(\boldsymbol{p})\right], \tag{7}$$

where $\max(\boldsymbol{p})$ corresponds to the confidence of the target class. The higher $\max(\boldsymbol{p})$ is, the more certain the model is that the target class aligns with the ground truth class. It also means that we can treat more non-target classes as negative classes for learning negative information. This line of reasoning motivates us to design an adaptive negative learning strategy by taking the confidence of the target class as reference. Concretely, we first compute the cumulative probability of the bottom-$k$ classes of $\hat{\boldsymbol{p}}$. If the cumulative probability does not exceed $\max(\boldsymbol{p})$, these $k$ classes will be treated as negative classes that contribute equally to negative learning, which can be formulated as:

$$\Omega^{neg}=\left\{k:\sum_{c=1}^{k}\hat{p}_{c}\leqslant\max(\boldsymbol{p})\right\}, \tag{8}$$

where $\Omega^{neg}$ holds the $k$ negative classes that meet the above criterion. When $\max(\boldsymbol{p})$ is very high, it suggests that low-confidence classes carry a lower risk of containing ground truth semantics; therefore, the model will include more low-confidence classes in $\Omega^{neg}$ for negative learning. When $\max(\boldsymbol{p})$ is very low, the ground truth semantics are more likely to be buried in low-confidence classes; therefore, the model will only select a few negative classes, since the cumulative probability of the bottom-$k$ classes must stay below a small $\max(\boldsymbol{p})$. Based on the negative classes, we reformulate the negative learning loss as:

$$\ell^{neg}=-\sum_{c\in\Omega^{neg}}\log\left(1-p_{c}\right). \tag{9}$$

Our proposed adaptive negative learning will enable the model to effectively learn underlying negative information from as many negative classes as possible.
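The selection rule of Eqs. (7) and (8) and the loss of Eq. (9) can be sketched in NumPy as below. This is an illustrative reading rather than the authors' implementation: the function names and the toy distributions are assumptions, and the explicit removal of the target class is an added safeguard for degenerate inputs.

```python
import numpy as np

def select_negative_classes(p):
    """Indices of the bottom-k classes whose cumulative probability
    does not exceed max(p), in the spirit of Eqs. (7) and (8)."""
    order = np.argsort(p, kind="stable")      # ascending confidence
    cumulative = np.cumsum(p[order])          # cumulative probability of bottom-k classes
    k = int(np.sum(cumulative <= p.max()))    # largest k meeting the criterion
    neg = order[:k]
    # Assumption: the target class is never treated as a negative class.
    return neg[neg != int(np.argmax(p))]

def negative_loss(p, neg_idx, eps=1e-12):
    """Negative learning loss of Eq. (9) over the selected classes."""
    return float(-np.sum(np.log(1.0 - p[neg_idx] + eps)))

# Made-up distributions over five classes.
p_confident = np.array([0.60, 0.20, 0.10, 0.06, 0.04])
p_uncertain = np.array([0.30, 0.28, 0.22, 0.12, 0.08])
neg_confident = select_negative_classes(p_confident)  # all four non-target classes
neg_uncertain = select_negative_classes(p_uncertain)  # only the two lowest classes
loss_uncertain = negative_loss(p_uncertain, neg_uncertain)
```

With a confident prediction all four non-target classes are selected, whereas with a flat prediction only the two least likely classes are, which mirrors the adaptive behavior described above.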

3.5 Learning from Positive Classes

Learning from the remaining non-target classes (excluding negative classes) is intriguing as the ground truth semantics are buried among them. However, learning positive information from all remaining non-target classes is suboptimal since ambiguous classes would confuse the model. Therefore, we leverage the confidence of the target class as an informative indicator to select positive classes:

$$\Omega^{pos}=\left\{k:\hat{p}_{k}\geqslant\lambda\cdot\max(\boldsymbol{p})\right\}, \tag{10}$$

where $\Omega^{pos}$ holds the positive classes that meet the above criterion and $\lambda$ is a hyper-parameter. In this way, the model will only select the classes whose confidences are close to that of the target class, since they are likely to share similar information related to the ground truth class. Based on the positive classes, we formulate the positive learning loss as:

$$\ell^{pos}=-\sum_{c\in\Omega^{pos}}\log p_{c}. \tag{11}$$

The positive learning empowers the model to extract richer semantics relevant to the true class but absent in the target class.
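A minimal sketch of Eqs. (10) and (11) follows, again in NumPy. The function names, the toy distributions, and the default `lam=0.8` are hypothetical choices for illustration. The target class is excluded explicitly because the text defines positive classes as non-target classes.

```python
import numpy as np

def select_positive_classes(p, lam=0.8):
    """Non-target classes whose confidence is at least lam * max(p), per Eq. (10)."""
    mask = p >= lam * p.max()
    # Assumption: the target class stays in its own subspace.
    mask[int(np.argmax(p))] = False
    return np.flatnonzero(mask)

def positive_loss(p, pos_idx, eps=1e-12):
    """Positive learning loss of Eq. (11); zero when no class is selected."""
    return float(-np.sum(np.log(p[pos_idx] + eps)))

# Made-up distributions over five classes.
p_uncertain = np.array([0.30, 0.28, 0.22, 0.12, 0.08])
p_confident = np.array([0.60, 0.20, 0.10, 0.06, 0.04])
pos_uncertain = select_positive_classes(p_uncertain)  # the runner-up class only
pos_confident = select_positive_classes(p_confident)  # no positive class
loss_uncertain = positive_loss(p_uncertain, pos_uncertain)
```

When the prediction is confident, no class comes close to the target confidence and the positive term vanishes. When the prediction is flat, the runner-up class is pulled up alongside the target class.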

3.6 Hybrid Positive-Negative Learning

Finally, we integrate the proposed negative learning and positive learning into our semi-supervised TAL framework. In training, for all labeled data, the ground truth labels are treated as the target classes with no doubt. The remaining classes, i.e., all non-target classes, will act as negative classes for negative learning, since they are completely unrelated to the ground truth label. Thus, we apply the cross-entropy loss and negative loss for all labeled data. For unlabeled data, we apply the cross-entropy loss for target classes, the positive and negative losses for positive and negative classes as mentioned above, respectively. The overall loss function is shown below:

=¯s+¯u+m+ref+rec,superscript¯𝑠superscript¯𝑢superscript𝑚superscript𝑟𝑒𝑓superscript𝑟𝑒𝑐\mathcal{L}=\bar{\mathcal{L}}^{s}+\bar{\mathcal{L}}^{u}+\mathcal{L}^{m}+% \mathcal{L}^{ref}+\mathcal{L}^{rec},caligraphic_L = over¯ start_ARG caligraphic_L end_ARG start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT + over¯ start_ARG caligraphic_L end_ARG start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT + caligraphic_L start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT + caligraphic_L start_POSTSUPERSCRIPT italic_r italic_e italic_f end_POSTSUPERSCRIPT + caligraphic_L start_POSTSUPERSCRIPT italic_r italic_e italic_c end_POSTSUPERSCRIPT , (12)

where the supervised loss $\bar{\mathcal{L}}^{s}$ contains the cross-entropy loss $\ell^{tgt}$ and the negative learning loss $\ell^{neg}$. The unsupervised loss $\bar{\mathcal{L}}^{u}$ contains the positive learning loss $\ell^{pos}$ and the negative learning loss $\ell^{neg}$ as well as $\ell^{tgt}$. In addition, the SS-TAL model is mainly composed of a classification head and a mask head, which is further optimized by the mask learning loss $\mathcal{L}^{m}$, the refinement loss $\mathcal{L}^{ref}$, and the feature reconstruction loss $\mathcal{L}^{rec}$, as in [21, 32].
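The classification part of this objective can be sketched per snippet as follows. This is only an illustration under stated assumptions: the negative loss is written in the common negative-learning form $-\sum_{c}\log(1-p_c)$, the three terms are summed without weights, and all function names are hypothetical. The mask, refinement, and reconstruction losses are omitted.

```python
import math

EPS = 1e-12  # numerical clamp; an implementation assumption

def target_loss(probs, target):
    # Cross-entropy on the target class.
    return -math.log(max(probs[target], EPS))

def positive_loss(probs, positive_classes):
    # Eq. (11): pull predictions toward the positive classes.
    return -sum(math.log(max(probs[c], EPS)) for c in positive_classes)

def negative_loss(probs, negative_classes):
    # ASSUMPTION: common negative-learning form, -sum log(1 - p_c).
    return -sum(math.log(max(1.0 - probs[c], EPS)) for c in negative_classes)

def labeled_loss(probs, gt):
    # Labeled data: the ground truth is the target, all other classes are negatives.
    negatives = [c for c in range(len(probs)) if c != gt]
    return target_loss(probs, gt) + negative_loss(probs, negatives)

def unlabeled_loss(probs, target, positive_classes, negative_classes):
    # Unlabeled data: target, positive, and negative terms (unweighted sum assumed).
    return (target_loss(probs, target)
            + positive_loss(probs, positive_classes)
            + negative_loss(probs, negative_classes))

sup = labeled_loss([0.7, 0.2, 0.1], gt=0)
unsup = unlabeled_loss([0.50, 0.40, 0.06, 0.04], target=0,
                       positive_classes=[1], negative_classes=[3])
```

In the unlabeled example, class 2 belongs to neither set, mirroring how ambiguous classes contribute no loss term.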

In inference, the model generates action instance predictions for each testing video by the classification and mask predictions, as in SPOT [21]. More specifically, we can obtain candidate snippets by using a classification threshold and a localization threshold on the classification and mask heads, respectively. Therefore, only the video snippets with high class probabilities and mask scores are selected as top scoring snippets. We use a set of thresholds to produce sufficient candidates. For each candidate, we compute its confidence score by multiplying the classification probability and mask score. In post-processing, Soft-NMS [2] is finally applied to obtain top scoring results.
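The scoring and post-processing steps described above can be sketched as follows. The candidate format `(start, end, score)`, the function names, and the use of the Gaussian variant of Soft-NMS with `sigma=0.5` are assumptions made for illustration; the thresholds and variant used in the actual experiments may differ.

```python
import math

def fuse_score(cls_prob, mask_score):
    # Candidate confidence: classification probability times mask score.
    return cls_prob * mask_score

def tiou(a, b):
    # Temporal IoU between two segments given as (start, end, ...).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(candidates, sigma=0.5, min_score=1e-3):
    """Gaussian Soft-NMS on 1D segments; returns segments in selection order."""
    remaining = [list(c) for c in candidates]
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        current = remaining.pop(best)
        kept.append(tuple(current))
        # Decay the scores of overlapping segments instead of discarding them.
        for c in remaining:
            c[2] *= math.exp(-(tiou(current, c) ** 2) / sigma)
        remaining = [c for c in remaining if c[2] >= min_score]
    return kept

# Toy candidates: (start, end, classification probability, mask score).
raw = [(0.0, 10.0, 0.9, 0.8), (1.0, 11.0, 0.8, 0.7), (20.0, 30.0, 0.6, 0.5)]
candidates = [(s, e, fuse_score(p, m)) for s, e, p, m in raw]
results = soft_nms(candidates)
```

In this toy run the second segment heavily overlaps the first, so its score is decayed and it is ranked below the non-overlapping third segment.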

4 Experiments

Table 1: Main results on THUMOS14 and ActivityNet v1.3 with different percentages of labeled videos, where Baseline refers to the baseline model without positive and negative learning losses. Notably, SSP and SSTAP employ UntrimmedNet [28] trained with 100% class labels for proposal classification.
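| Label | Method | Backbone | THUMOS14 (%) 0.3 | THUMOS14 (%) 0.4 | THUMOS14 (%) 0.5 | THUMOS14 (%) 0.6 | THUMOS14 (%) 0.7 | THUMOS14 (%) Avg. | ActivityNet v1.3 (%) 0.5 | ActivityNet v1.3 (%) 0.75 | ActivityNet v1.3 (%) 0.95 | ActivityNet v1.3 (%) Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10% | SSP [11] | TSN | 44.2 | 34.1 | 24.6 | 16.9 | 9.3 | 25.8 | 38.9 | 28.7 | 8.4 | 27.6 |
| 10% | SSTAP [31] | TSN | 45.6 | 35.2 | 26.3 | 17.5 | 10.7 | 27.0 | 40.7 | 29.6 | 9.0 | 28.2 |
| 10% | SPOT [21] | TSN | 49.4 | 40.4 | 31.5 | 22.9 | 12.4 | 31.3 | 49.9 | 31.1 | 8.3 | 32.1 |
| 10% | NPL [32] | TSN | 50.0 | 41.7 | 33.5 | 23.6 | 13.4 | 32.4 | 50.9 | 32.0 | 7.9 | 32.6 |
| 10% | Baseline | TSN | 49.2 | 40.2 | 32.8 | 22.4 | 12.5 | 31.4 | 51.2 | 31.8 | 7.3 | 32.1 |
| 10% | Ours | TSN | 50.9 | 42.3 | 34.9 | 24.7 | 14.6 | 33.5 | 53.0 | 34.4 | 9.2 | 34.5 |
| 20% | SPOT [21] | TSN | 52.6 | 43.9 | 34.1 | 25.2 | 16.2 | 34.4 | 51.7 | 32.0 | 6.9 | 32.3 |
| 20% | NPL [32] | TSN | 53.9 | 45.6 | 36.2 | 26.9 | 16.5 | 35.8 | 52.1 | 32.9 | 7.9 | 32.9 |
| 20% | Baseline | TSN | 52.8 | 44.0 | 34.2 | 25.4 | 16.0 | 34.5 | 51.8 | 32.2 | 7.0 | 32.4 |
| 20% | Ours | TSN | 54.6 | 46.4 | 37.0 | 27.2 | 17.1 | 36.5 | 53.5 | 34.7 | 9.4 | 34.8 |
| 40% | SPOT [21] | TSN | 54.4 | 45.8 | 37.2 | 29.7 | 19.4 | 37.3 | 53.3 | 33.0 | 6.6 | 33.2 |
| 40% | NPL [32] | TSN | 56.2 | 46.7 | 38.8 | 30.3 | 19.5 | 38.3 | 53.4 | 33.9 | 8.1 | 33.8 |
| 40% | Baseline | TSN | 54.8 | 45.9 | 37.3 | 29.9 | 19.1 | 37.4 | 53.5 | 33.2 | 6.9 | 33.4 |
| 40% | Ours | TSN | 57.5 | 48.0 | 39.6 | 31.5 | 21.4 | 39.6 | 54.1 | 35.6 | 9.4 | 35.4 |
| 60% | SSP [11] | TSN | 53.2 | 46.8 | 39.3 | 29.7 | 19.8 | 37.8 | 49.8 | 34.5 | 7.0 | 33.5 |
| 60% | SSTAP [31] | TSN | 56.4 | 49.5 | 41.0 | 30.9 | 21.6 | 39.9 | 50.1 | 34.9 | 7.4 | 34.0 |
| 60% | SPOT [21] | TSN | 58.9 | 50.1 | 42.3 | 33.5 | 22.9 | 41.5 | 52.8 | 35.0 | 8.1 | 35.2 |
| 60% | NPL [32] | TSN | 59.0 | 51.4 | 42.9 | 34.3 | 23.3 | 42.2 | 53.9 | 35.8 | 8.5 | 35.7 |
| 60% | Baseline | TSN | 58.7 | 50.0 | 42.6 | 33.7 | 23.0 | 41.6 | 52.9 | 34.9 | 7.9 | 35.0 |
| 60% | Ours | TSN | 59.9 | 52.6 | 43.9 | 35.7 | 24.0 | 43.2 | 54.4 | 35.8 | 9.5 | 35.9 |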

4.1 Datasets and Metrics

Evaluation Datasets. Following conventions [39, 22], we evaluate our proposed method on two challenging TAL benchmarks, i.e., THUMOS14 [12] and ActivityNet v1.3 [3]. THUMOS14 [12] contains 200 validation videos and 213 testing videos, covering 20 action categories. It is very challenging since each video contains more than 15 action instances on average. Following the common setting [38], we use the validation set for training and evaluate on the testing set. ActivityNet v1.3 [3] is a large-scale benchmark for video-based action localization. It contains 10k training videos and 5k validation videos corresponding to 200 different actions. Following the standard practice [19], we train our method on the training set and test it on the validation set.

Evaluation Metrics. We use the mean Average Precision (mAP) as the evaluation metric. The tIoU thresholds are $[0.3:0.1:0.7]$ for THUMOS14 and $[0.5:0.05:0.95]$ for ActivityNet v1.3. On each dataset, we also report the average mAP over its tIoU thresholds, i.e., from 0.3 to 0.7 with a step of 0.1 on THUMOS14 and from 0.5 to 0.95 with a step of 0.05 on ActivityNet v1.3.
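As a small worked example of the averaging, the threshold grids and the average mAP can be sketched as below. The helper names are hypothetical; the five per-threshold values are the THUMOS14 results of our method with 10% labels from Table 1, whose mean matches the reported average of 33.5 after rounding.

```python
def tiou_thresholds(start, stop, step):
    # Inclusive grid of tIoU thresholds, rounded to avoid float drift.
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]

def average_map(map_per_threshold):
    # Average mAP is the plain mean of the per-threshold mAP values.
    return sum(map_per_threshold) / len(map_per_threshold)

thumos_thresholds = tiou_thresholds(0.3, 0.7, 0.1)
anet_thresholds = tiou_thresholds(0.5, 0.95, 0.05)

# THUMOS14, 10% labels, "Ours" row of Table 1 (mAP in % at tIoU 0.3 to 0.7).
ours_thumos_10 = [50.9, 42.3, 34.9, 24.7, 14.6]
avg = average_map(ours_thumos_10)
```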

4.2 Implementation Details

Following the conventional setting [31, 14, 21], we extract a feature for each video snippet, i.e., a fixed number of consecutive frames, using TSN [29] pre-trained on Kinetics [34]. The temporal dimension is fixed at 100 and 256 for ActivityNet v1.3 and THUMOS14, respectively. Our action localization framework adopts the popular proposal-free approach SPOT [21], which is mainly composed of a classification head and a mask head. Our main contributions focus on the classification head, which originally adopts the cross-entropy loss for target classes. We also employ another anchor-free approach, ActionFormer [39], with the I3D backbone [4] for fair comparisons.

For the semi-supervised setting, we first pre-train our model on the training set for 12 epochs and then fine-tune the pre-trained model for 15 epochs with a learning rate of $10^{-4}$ for ActivityNet v1.3 and $10^{-5}$ for THUMOS14, using a cosine learning rate decay. Following SPOT [21], we adopt the same label sharpening operator and the same threshold set for the mask. Soft-NMS [2] is performed on ActivityNet v1.3 and THUMOS14 with thresholds of 0.6 and 0.4, respectively. We set $\alpha = 1$. For the labeling ratios, we introduce four SS-TAL settings with different label sizes. Following NPL [32], we randomly select 10%, 20%, 40%, and 60% of the training videos as the labeled set and use the remaining videos as the unlabeled set. Both labeled and unlabeled sets are accessible for SS-TAL model training.
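The random labeled/unlabeled split can be sketched as follows. The function name, the fixed seed, and the synthetic video identifiers are assumptions for illustration; the actual splits follow NPL [32] and are not reproduced here.

```python
import random

def split_labeled_unlabeled(video_ids, labeled_ratio, seed=0):
    """Randomly pick a fraction of training videos as labeled; the rest are unlabeled."""
    rng = random.Random(seed)  # seeded for reproducibility (an assumption)
    ids = list(video_ids)
    rng.shuffle(ids)
    n_labeled = round(len(ids) * labeled_ratio)
    return ids[:n_labeled], ids[n_labeled:]

# THUMOS14 uses its 200 validation videos for training; the IDs here are synthetic.
train_ids = [f"video_{i:03d}" for i in range(200)]
labeled, unlabeled = split_labeled_unlabeled(train_ids, labeled_ratio=0.10)
```

With 200 training videos, the 10% setting yields 20 labeled and 180 unlabeled videos.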

4.3 Comparison with State-of-the-art Methods

The main results are reported in Table 1, where we report mAP at different tIoU thresholds and average mAP. Our method achieves consistent performance improvements over previous works across all data splits on both datasets. We also present the performance of our baseline model in Table 1. Compared with this baseline, the proposed positive and negative learning brings clear gains; for example, with 10% labeled videos, the average mAP increases from 31.4% to 33.5% on THUMOS14 and from 32.1% to 34.5% on ActivityNet v1.3.

Specifically, THUMOS14 is a challenging TAL benchmark due to its dense action instances and ambiguous semantics. Our method still outperforms all other comparable methods at all labeled ratios, indicating that the performance gains come from our positive and negative learning strategies. In particular, our method obtains remarkable performance when the amount of labeled data is very limited (with only 10% or 20% labeled videos). This demonstrates that our method can learn valuable underlying information from non-target classes.

The superiority of the proposed method is more pronounced on ActivityNet v1.3, a larger-scale video dataset that provides a larger label space for effective hybrid positive-negative learning. As shown in Table 1, our method shows a distinct improvement over all other methods. Additionally, the improvements suggest that indirectly learning from positive and negative classes further benefits SS-TAL.

4.4 Ablation Study

Table 2: Ablation study of different losses on THUMOS14 and ActivityNet v1.3 with 10% labels, where $\ell^{tgt}$ is the vanilla cross-entropy loss for snippet-level action classification, and $\ell^{neg}$ and $\ell^{pos}$ are the proposed negative and positive learning losses for excavating complementary information.
| $\ell^{tgt}$ | $\ell^{neg}$ | $\ell^{pos}$ | THUMOS14 (%) 0.3 | THUMOS14 (%) 0.5 | THUMOS14 (%) 0.7 | THUMOS14 (%) Avg. | ActivityNet v1.3 (%) 0.5 | ActivityNet v1.3 (%) 0.75 | ActivityNet v1.3 (%) 0.95 | ActivityNet v1.3 (%) Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 49.2 | 32.8 | 12.5 | 31.4 | 51.2 | 31.8 | 7.3 | 32.1 |
| ✓ | ✓ | | 50.1 | 34.0 | 13.9 | 32.7 | 52.3 | 32.7 | 8.3 | 33.4 |
| ✓ | ✓ | ✓ | 50.9 | 34.9 | 14.6 | 33.5 | 53.0 | 34.4 | 9.2 | 34.5 |

Effectiveness of loss terms. To verify our core insight, i.e., learning underlying informative semantics from non-target classes, we conduct experiments in Table 2 to ablate each loss step by step. First, we use the model trained with $\ell^{tgt}$ only as our baseline, which achieves an average mAP of 31.4% and 32.1% on THUMOS14 and ActivityNet v1.3, respectively. Applying the proposed $\ell^{neg}$ consistently improves the baseline by a large margin on both benchmarks, arguably because the potential negative information improves the snippet-level semantic discrimination. In addition, the proposed $\ell^{pos}$ also significantly improves the performance of the model by excavating the semantics related to the ground truth label from the positive classes.

Table 3: Ablation study of SS-TAL results on THUMOS14 using I3D features and ActionFormer [39], where the label ratio is 10% and $\star$ represents only using labeled videos.
| Method | Bkb | THUMOS14 (%) 0.3 | THUMOS14 (%) 0.5 | THUMOS14 (%) 0.7 | THUMOS14 (%) Avg. |
|---|---|---|---|---|---|
| ActF$^{\star}$ [39] | I3D | 28.5 | 14.1 | 4.1 | 15.6 |
| NPL (ActF) [32] | I3D | 32.8 | 20.1 | 7.2 | 20.3 |
| Ours (ActF) | I3D | 34.1 | 21.3 | 7.9 | 21.4 |

Table 4: Empirical study of hyper-parameter $\lambda$ on THUMOS14 with 10% labels, where $\lambda$ affects the number of the positive classes for positive learning.
| $\lambda$ value | THUMOS14 (%) 0.3 | THUMOS14 (%) 0.5 | THUMOS14 (%) 0.7 | THUMOS14 (%) Avg. |
|---|---|---|---|---|
| 0.90 | 50.6 | 34.8 | 14.3 | 33.2 |
| 0.85 | 50.9 | 34.9 | 14.6 | 33.5 |
| 0.80 | 50.2 | 34.3 | 14.1 | 33.0 |
| 0.75 | 50.0 | 34.1 | 14.0 | 32.9 |

Different backbone and detector. The proposed method features a hybrid positive-negative learning loss to learn complementary information from the label space of each video snippet, which is feature-agnostic and model-agnostic. To verify this point, we combine our method with another popular backbone in TAL, I3D [4], and the powerful detector ActionFormer [39]; the comparison results are shown in Table 3. The performance gains support our point, indicating that the benefit of our method does not depend on a specific feature or model.

Empirical study of hyper-parameter $\lambda$. Ground truth classes are often hidden in positive classes, which is ignored by target-class-based learning methods. In contrast, we introduce a hyper-parameter $\lambda$ that adaptively selects the number of positive classes based on the confidence of the sample. We then conduct an ablation study that varies the value of $\lambda$ to examine its impact on the performance. From Table 4, it can be observed that a higher $\lambda$, which selects fewer positive classes, may make it difficult to fully learn the informative semantics via positive learning, while a lower $\lambda$, which selects more positive classes, carries the risk of involving unreliable ambiguous classes. The best average mAP (33.5%) is obtained with $\lambda = 0.85$.
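To illustrate how $\lambda$ controls the size of the positive set, the sketch below uses a simplified stand-in rule: a non-target class is treated as positive when its confidence is at least $\lambda$ times the target confidence. This rule, the function name, and the toy probabilities are assumptions made for illustration only; the actual selection criteria also take the class rank into account and are defined earlier in the paper.

```python
def select_positive_classes(probs, lam):
    """Hypothetical confidence-ratio rule: keep non-target classes with p_c >= lam * p_tgt."""
    target = max(range(len(probs)), key=lambda c: probs[c])
    threshold = lam * probs[target]
    return [c for c in range(len(probs)) if c != target and probs[c] >= threshold]

probs = [0.40, 0.35, 0.15, 0.10]  # toy snippet-level class probabilities
strict = select_positive_classes(probs, lam=0.90)
default = select_positive_classes(probs, lam=0.85)
loose = select_positive_classes(probs, lam=0.35)
```

Under this stand-in rule, raising `lam` shrinks the positive set and lowering it admits lower-confidence classes, which matches the trade-off discussed above.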

Figure 3: Effect of our method on foreground-background subtask. We present the visualization of foreground feature and background feature on an unlabeled THUMOS14 video.
Figure 4: Effect of our method on foreground-instance subtask. We present the visualization of features of four challenging classes on THUMOS14.

Qualitative analysis of the positive and negative learning. Learning complementary information from non-target classes contributes to improving the class-level representation. To verify this point, we present the visualizations of foreground-background features and foreground-instance features in Figure 3 and Figure 4, respectively. On the one hand, from Figure 3, the proposed hybrid positive-negative learning separates foreground and background features more clearly by excavating the ground truth semantics hidden in positive classes. On the other hand, from Figure 4, we can observe that the model produces a much clearer boundary of each class with the hybrid positive-negative learning. It shows that our method could improve the generalization ability of the model.

Figure 5: The average probability that the ground truth label is located in different label subspace, i.e. target class $\Omega^{tgt}$, positive classes $\Omega^{pos}$, ambiguous classes $\Omega^{amb}$ and negative classes $\Omega^{neg}$. The experiment is performed on 90% unlabeled THUMOS14.
Table 5: Comparison with the soft pseudo-label method [1] and complementary label [10]. The comparison results verify the superiority of our method over previous semi-supervised technologies.
| Method | THUMOS14 (%) 0.3 | THUMOS14 (%) 0.5 | THUMOS14 (%) 0.7 | THUMOS14 (%) Avg. |
|---|---|---|---|---|
| soft pseudo label | 49.5 | 33.2 | 12.7 | 31.7 |
| complementary label | 50.0 | 33.5 | 13.1 | 32.1 |
| Ours | 50.9 | 34.9 | 14.6 | 33.5 |

Quantitative evaluation of label space. The proposed method could adaptively divide the entire label space into target class, positive classes, ambiguous classes, and negative classes, and then perform positive and negative learning to excavate ground truth semantics and underlying negative information. The key to performance improvement is whether positive classes contain the ground truth label while negative classes run a low risk of containing the ground truth label. Thus, we calculate the average probability that the ground truth label is located in these four class subspaces, as shown in Figure 5. It can be observed that our method could effectively use positive classes to mine ground truth semantics and exploit negative classes to improve the model as much as possible.
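One way such a statistic could be computed is sketched below, reading "average probability" as the fraction of snippets whose ground truth label falls in each subspace. This reading, the data layout, and the toy values are assumptions for illustration and do not reproduce the numbers in Figure 5.

```python
from collections import Counter

SUBSPACES = ("tgt", "pos", "amb", "neg")

def subspace_hit_rates(assignments, gt_labels):
    """Fraction of snippets whose ground truth class lies in each label subspace.

    assignments: one dict per snippet mapping class index -> subspace name
                 (hypothetical format produced by the label-space partition).
    gt_labels: ground truth class index per snippet.
    """
    counts = Counter(a[gt] for a, gt in zip(assignments, gt_labels))
    total = len(gt_labels)
    return {name: counts.get(name, 0) / total for name in SUBSPACES}

# Toy partition of a 4-class label space for four snippets (made-up values).
assignments = [
    {0: "tgt", 1: "pos", 2: "amb", 3: "neg"},
    {0: "pos", 1: "tgt", 2: "neg", 3: "amb"},
    {0: "tgt", 1: "amb", 2: "pos", 3: "neg"},
    {0: "neg", 1: "tgt", 2: "pos", 3: "amb"},
]
gt_labels = [0, 0, 2, 1]
rates = subspace_hit_rates(assignments, gt_labels)
```

A useful partition is one where the rate for the negative subspace stays close to zero while the target and positive subspaces together cover most ground truth labels.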

Comparison with other semi-supervised approaches. To validate the superiority of our method over previous semi-supervised technologies, we explore the soft pseudo-label method [1] and the complementary label method [10] (learning from a random non-target class). We incorporate their main ideas into our work. The comparison results in Table 5 indicate that the soft pseudo-label produced by the model is quite noisy and includes limited extra knowledge beyond the target label. In contrast to the complementary label, our hybrid positive-negative learning can adaptively extract richer, more informative action semantics from unlabeled videos while reducing the risk of choosing the true label.

4.5 Visualization results

As shown in Figure 6, we provide some qualitative results by previous work SPOT [21] and our approach, where the model is trained with 10% and 40% labeled data on both THUMOS14 and ActivityNet v1.3. Benefiting from using non-target classes, our method can locate and recognize the target actions more accurately, demonstrating the superiority of our method.

Figure 6: Qualitative SS-TAL result comparison of our proposed method with SPOT [21] on two untrimmed videos from (a) THUMOS14 and (b) ActivityNet v1.3, respectively.

5 Limitation

This paper proposes to learn informative semantics from non-target classes instead of only the target class, benefiting from the abundant information hidden in the label space. As a consequence, it is difficult for the proposed method to achieve significant performance gains when the training set and the label space are small. In addition, this work completely excludes all ambiguous classes in training, which may waste some of the indicative information they carry. Therefore, improving the model by using ambiguous classes will be part of our future work.

6 Conclusion

In this paper, we introduce a novel paradigm for SS-TAL by emphasizing learning from non-target classes, transcending the conventional focus solely on the target class. The approach first partitions the entire label space of the predicted class distribution into different subspaces, aiming to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes. Then, we develop innovative strategies for adaptively selecting high-quality positive and negative classes from the label space. Additionally, new positive and negative losses are proposed to guide the non-target learning effectively. Extensive experiments on two popular benchmarks show consistent performance gains, demonstrating the effectiveness of our method.

References

  • [1] Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: IJCNN. pp. 1–8 (2020)
  • [2] Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms–improving object detection with one line of code. In: ICCV. pp. 5561–5569 (2017)
  • [3] Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: A large-scale video benchmark for human activity understanding. In: CVPR. pp. 961–970 (2015)
  • [4] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR. pp. 6299–6308 (2017)
  • [5] Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster R-CNN architecture for temporal action localization. In: CVPR. pp. 1130–1139 (2018)
  • [6] Chen, B., Chen, W., Yang, S., Xuan, Y., Song, J., Xie, D., Pu, S., Song, M., Zhuang, Y.: Label matching semi-supervised object detection. In: CVPR. pp. 14381–14390 (2022)
  • [7] Crasto, N., Weinzaepfel, P., Alahari, K., Schmid, C.: Mars: Motion-augmented rgb stream for action recognition. In: CVPR. pp. 7882–7891 (2019)
  • [8] Diba, A., Fayyaz, M., Sharma, V., Mahdi Arzani, M., Yousefzadeh, R., Gall, J., Van Gool, L.: Spatio-temporal channel correlation networks for action classification. In: ECCV. pp. 284–299 (2018)
  • [9] Ding, X., Wang, N., Gao, X., Li, J., Wang, X., Liu, T.: KFC: An efficient framework for semi-supervised temporal action localization. IEEE T-IP 30, 6869–6878 (2021)
  • [10] Ishida, T., Niu, G., Hu, W., Sugiyama, M.: Learning from complementary labels. In: NeurIPS (2017)
  • [11] Ji, J., Cao, K., Niebles, J.C.: Learning temporal action proposals with fewer labels. In: ICCV. pp. 7073–7082 (2019)
  • [12] Jiang, Y.G., Liu, J., Zamir, A.R., Toderici, G., Laptev, I., Shah, M., Sukthankar, R.: THUMOS challenge: Action recognition with a large number of classes (2014)
  • [13] Jin, Y., Wang, J., Lin, D.: Semi-supervised semantic segmentation via gentle teaching assistant. In: NeurIPS. pp. 2803–2816 (2022)
  • [14] Kim, J., Lee, M., Heo, J.P.: Self-feedback detr for temporal action detection. ICCV (2023)
  • [15] Kim, Y., Yim, J., Yun, J., Kim, J.: Nlnl: Negative learning for noisy labels. In: ICCV. pp. 101–110 (2019)
  • [16] Li, J., Wong, Y., Zhao, Q., Kankanhalli, M.S.: Learning to learn from noisy labeled data. In: CVPR. pp. 5051–5059 (2019)
  • [17] Lin, C., Xu, C., Luo, D., Wang, Y., Tai, Y., Wang, C., Li, J., Huang, F., Fu, Y.: Learning salient boundary feature for anchor-free temporal action localization. In: CVPR. pp. 3320–3329 (2021)
  • [18] Lin, T., Zhao, X., Su, H., Wang, C., Yang, M.: BSN: Boundary sensitive network for temporal action proposal generation. In: ECCV. pp. 3–19 (2018)
  • [19] Liu, X., Hu, Y., Bai, S., Ding, F., Bai, X., Torr, P.H.: Multi-shot temporal event localization: a benchmark. In: CVPR. pp. 12596–12606 (2021)
  • [20] Liu, X., Wang, Q., Hu, Y., Tang, X., Zhang, S., Bai, S., Bai, X.: End-to-end temporal action detection with transformer. IEEE TIP 31, 5427–5441 (2022)
  • [21] Nag, S., Zhu, X., Song, Y.Z., Xiang, T.: Semi-supervised temporal action detection with proposal-free masking. In: ECCV (2022)
  • [22] Nag, S., Zhu, X., Song, Y.Z., Xiang, T.: Temporal action detection with global segmentation mask learning. In: ECCV (2022)
  • [23] Qiao, P., Wei, Z., Wang, Y., Wang, Z., Song, G., Xu, F., Ji, X., Liu, C., Chen, J.: Fuzzy positive learning for semi-supervised semantic segmentation. In: CVPR. pp. 15465–15474 (2023)
  • [24] Shi, D., Zhong, Y., Cao, Q., Zhang, J., Ma, L., Li, J., Tao, D.: React: Temporal action detection with relational queries. In: ECCV. pp. 105–121. Springer (2022)
  • [25] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: FixMatch: Simplifying semi-supervised learning with consistency and confidence. In: NeurIPS. pp. 596–608 (2020)
  • [26] Song, L., Zhang, S., Yu, G., Sun, H.: Tacnet: Transition-aware context network for spatio-temporal action detection. In: CVPR. pp. 11987–11995 (2019)
  • [27] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NeurIPS. pp. 1196–1205 (2017)
  • [28] Wang, L., Xiong, Y., Lin, D., Van Gool, L.: Untrimmednets for weakly supervised action recognition and detection. In: CVPR. pp. 4325–4334 (2017)
  • [29] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV. pp. 20–36 (2016)
  • [30] Wang, Q., Zhang, Y., Zheng, Y., Pan, P.: RCL: Recurrent continuous localization for temporal action detection. In: CVPR. pp. 13566–13575 (2022)
  • [31] Wang, X., Zhang, S., Qing, Z., Shao, Y., Gao, C., Sang, N.: Self-supervised learning for semi-supervised temporal action proposal. In: CVPR. pp. 1905–1914 (2021)
  • [32] Xia, K., Wang, L., Zhou, S., Hua, G., Tang, W.: Learning from noisy pseudo labels for semi-supervised temporal action localization. In: ICCV. pp. 10160–10169 (2023)
  • [33] Xia, K., Wang, L., Zhou, S., Zheng, N., Tang, W.: Learning to refactor action and co-occurrence features for temporal action localization. In: CVPR. pp. 13874–13883 (2022)
  • [34] Xiong, Y., Wang, L., Wang, Z., Zhang, B., Song, H., Li, W., Lin, D., Qiao, Y., Van Gool, L., Tang, X.: Cuhk & ethz & siat submission to activitynet challenge 2016. arXiv preprint arXiv:1608.00797 (2016)
  • [35] Xu, M., Zhao, C., Rojas, D.S., Thabet, A., Ghanem, B.: G-TAD: Sub-graph localization for temporal action detection. In: CVPR. pp. 10156–10165 (2020)
  • [36] Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: CVPR. pp. 591–600 (2020)
  • [37] Yu, X., Liu, T., Gong, M., Tao, D.: Learning with biased complementary labels. In: ECCV. pp. 68–83 (2018)
  • [38] Zeng, R., Huang, W., Tan, M., Rong, Y., Zhao, P., Huang, J., Gan, C.: Graph convolutional module for temporal action localization in videos. IEEE TPAMI 44(10), 6209–6223 (2022)
  • [39] Zhang, C., Wu, J., Li, Y.: ActionFormer: Localizing moments of actions with transformers. In: ECCV. pp. 492–510 (2022)
  • [40] Zhao, P., Xie, L., Ju, C., Zhang, Y., Wang, Y., Tian, Q.: Bottom-up temporal action localization with mutual regularization. In: ECCV. pp. 539–555 (2020)
  • [41] Zhou, Q., Yu, C., Wang, Z., Qian, Q., Li, H.: Instant-teaching: An end-to-end semi-supervised object detection framework. In: CVPR. pp. 4081–4090 (2021)