Continual Few-shot Event Detection via Hierarchical Augmentation Networks

Abstract

Traditional continual event detection relies on abundant labeled data for training, which is often impractical to obtain in real-world applications. In this paper, we introduce continual few-shot event detection (CFED), a more commonly encountered scenario in which a substantial number of labeled samples is not accessible. The CFED task is challenging as it involves memorizing previous event types and learning new event types with few-shot samples. To mitigate these challenges, we propose a memory-based framework: Hierarchical Augmentation Networks (HANet). To memorize previous event types with limited memory, we incorporate prototypical augmentation into the memory set. For the issue of learning new event types in few-shot scenarios, we propose a contrastive augmentation module for token representations. Besides comparing with previous state-of-the-art methods, we also conduct comparisons with ChatGPT. Experimental results demonstrate that our method significantly outperforms all of these methods in multiple continual few-shot event detection tasks.

Keywords: Information Extraction, Continual Learning, Few-shot Learning


Chenlong Zhang1,2,*, Pengfei Cao1,2,*, Yubo Chen1,2,†, Kang Liu1,2,3,
Zhiqiang Zhang4, Mengshu Sun4, Jun Zhao1,2
1 The Laboratory of Cognition and Decision Intelligence for Complex Systems,
Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
3 Shanghai Artificial Intelligence Laboratory, Shanghai, China
4 Ant Group, Hangzhou, China
* These authors contributed equally to this work. † Corresponding author.
[email protected]
{pengfei.cao, yubo.chen, kliu, jzhao}@nlpr.ia.ac.cn


1.   Introduction

Event Detection (ED) involves detecting event triggers and classifying the corresponding event types (Ahn, 2006). For example, in Figure 1, the words “married” and “left” trigger the events “Marry” and “Transport”, respectively. It is an essential information extraction task that can be applied in various natural language processing applications. Conventional methods (Chen et al., 2015; Nguyen and Grishman, 2015) commonly model ED as a supervised task trained on fixed data with pre-defined event types. However, in real-world applications, new event types emerge continually.

Thus, Continual Event Detection (CED) has been proposed (Cao et al., 2020; Yu et al., 2021). The CED task assumes that multiple ED tasks emerge continually, which requires ED models to learn new types while maintaining the capability of detecting previous types. The CED task is challenging due to the catastrophic forgetting problem (McCloskey and Cohen, 1989), where the model’s performance on previous tasks declines significantly when learning new tasks. To mitigate this problem, previous works have shown that memory-based methods (see Figure 1) are the most effective for the CED task (Cao et al., 2020; Yu et al., 2021; Liu et al., 2022). These methods preserve prototypical samples as a memory set to replay previous knowledge. Abundant representative features can effectively remind the model of previous types, achieving state-of-the-art performance.

Figure 1: Memory-based framework for continual few-shot event detection. It preserves previous knowledge by maintaining a memory set “M.” and transferring knowledge from previous models.

Even though these methods achieve remarkable performance, they all assume that the training samples in incremental tasks are sufficient. Actually, in practical applications, new events emerge successively, making it infeasible to obtain a sufficient number of high-quality samples for each emerging new event type. It is more commonplace to encounter incremental tasks with only a handful of annotated samples (e.g., 10, 5, or even 1) for each new type. Nonetheless, this circumstance has been overlooked by previous works.

To this end, we propose a new task: Continual Few-shot Event Detection (CFED), which aims to continually learn new event detection tasks with few-shot samples. For example, as shown in Figure 1, the first task (base task) denotes the regular ED task with abundant training samples (e.g., 100 samples are available for event type “Life: Marry”). Then, only a few samples are available for the emerging incremental tasks (e.g., there are only 5 labeled samples accessible for the new type “Movement: Transport”).

Obviously, CFED introduces a more challenging yet realistic scenario as it requires memorizing previous event types and learning new event types with few-shot samples. We present the two challenges specifically as follows:

Memorizing previous event types with few-shot samples: In the CED task, memory-based methods use a multitude of exemplars (e.g., 50) in the memory set to effectively characterize the prototypical feature space, thus alleviating catastrophic forgetting. However, in the CFED task, only 10, 5, or 1 sample is available for training. In extreme scenarios, there is only one sample per type available to be stored in the memory set for further replay. Therefore, how to utilize the few stored samples to mitigate catastrophic forgetting remains challenging.

Learning new event types with few-shot samples: Supervised methods usually require a large number of annotated samples (Lai et al., 2020; Deng et al., 2020; Zhang et al., 2022a). When trained with limited samples, these methods often struggle to generalize well and suffer from overfitting. Current large language models (LLMs) (Brown et al., 2020; Touvron et al., 2023) have demonstrated a promising capability to learn from few-shot samples through in-context learning. However, these models are constrained by limited knowledge (e.g., ChatGPT’s knowledge of the world and of events after 2021 is limited). Although in-context learning can temporarily provide them with new event knowledge, it fails to truly inject this knowledge into the model (Moiseev et al., 2022). Therefore, we consider using a fine-tuned language model to solve the CFED task. How to effectively mitigate overfitting with few-shot samples when learning new event types is still a formidable challenge.

To address these problems, we propose a memory-based approach: Hierarchical Augmentation Network (HANet). When memorizing previous types, we devise prototypical augmentation to augment the prototypical feature space of exemplars in the memory, thus alleviating catastrophic forgetting. To address overfitting in learning new types, we design a contrastive augmentation module to acquire valuable information from few-shot samples. Experimental results show that our method surpasses previous baselines significantly.

Our contributions can be summarized as follows:

(1) To the best of our knowledge, we are the first to propose continual few-shot event detection and construct benchmarks based on ACE and MAVEN.

(2) We propose a Hierarchical Augmentation Network (HANet), which leverages prototypical augmentation and contrastive augmentation to memorize previous event types and to learn new event types with few-shot samples.

(3) Experimental results demonstrate that our method significantly outperforms previous state-of-the-art methods in all CFED settings. Impressively, our method achieves 7.27% and 8.44% improvements on micro F1 in 4-way 5-shot MAVEN and 2-way 5-shot ACE settings. Moreover, experiments with ChatGPT show that our method achieves superior results. Our code and dataset are publicly available at https://github.com/chenlong-clock/CFED-HANet.

Figure 2: Our system consists of a general event detector, prototypical augmentation, and contrastive augmentation. When learning new tasks with an event detector, the model replays prior knowledge from the augmented feature. Then, contrastive augmentation maximizes the acquisition of knowledge from few-shot samples.

2.   Problem Definition

Continual few-shot event detection (CFED) aims to detect emerging events with few-shot samples. As shown in Figure 1, given tasks $\mathbb{T}=\{T_{1},T_{2},...,T_{n}\}$, each task has an individual training/validation/testing set $T_{i}=\{D_{i}^{train},D_{i}^{dev},D_{i}^{test}\}$. $D_{i}=\{(\mathbf{X}^{j}_{i},\mathbf{Y}^{j}_{i})\}^{m}_{j=1}$, where $\mathbf{X}$ and $\mathbf{Y}$ are samples and their corresponding labels, and $m$ is the number of event types in each task.
The first sub-task $T_{1}$ is the base task $T_{base}$ that contains abundant training samples. The remaining sub-tasks are defined as few-shot incremental tasks $T_{inc}=\{T_{2},T_{3},...,T_{n}\}$, with only a few samples (e.g., 5 or 10) for each new event type. For any two tasks $T_{i}$ and $T_{j}$, their types are non-overlapping: $T_{i}\cap T_{j}=\emptyset$.
At time step $t$, for CFED task $C_{t}$, the training set is formulated as $C_{t}^{train}=D_{t}^{train}$ and the validation/testing set is $C_{t}^{test}=D_{t}^{test}\bigcup C_{t-1}^{test}$, indicating the CFED system is supposed to keep stable performance on all observed labels $L_{t}=\bigcup_{i=1}^{t}\{\mathbf{Y}_{i}^{j}\}_{j=1}^{m}$ with the currently available training samples in task $T_{t}$.
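
The bookkeeping in this definition can be illustrated with a minimal sketch. It assumes each task is a plain dictionary with "train" and "test" splits and that a sample is a (sentence, event type) pair; the function name `build_cfed_stream` and the toy sentences are hypothetical and only serve to show how the cumulative test set and the observed label set grow over time.

```python
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, str]  # (sentence, event type): a simplified stand-in for (X, Y)


def build_cfed_stream(tasks: List[Dict[str, List[Sample]]]):
    """Return a list of (t, C_t^train, C_t^test, L_t) for t = 1..n."""
    seen_labels: Set[str] = set()
    cumulative_test: List[Sample] = []
    stream = []
    for t, task in enumerate(tasks, start=1):
        new_labels = {label for _, label in task["train"]}
        # Event types of different tasks must not overlap.
        assert not (new_labels & seen_labels), "task types must be disjoint"
        seen_labels |= new_labels
        # C_t^test is the union of D_t^test and C_{t-1}^test.
        cumulative_test = cumulative_test + task["test"]
        # C_t^train contains only the training samples of the current task.
        stream.append((t, list(task["train"]), list(cumulative_test), set(seen_labels)))
    return stream


# Toy data: an abundant base task followed by a 1-shot incremental task.
toy_tasks = [
    {"train": [("They married in 1999.", "Marry")] * 3,
     "test": [("She married him.", "Marry")]},
    {"train": [("He left the city.", "Transport")],
     "test": [("They left town.", "Transport")]},
]
stream = build_cfed_stream(toy_tasks)
```

At the second step the model trains on a single new sample but is evaluated on the test data of both tasks, which is what makes forgetting visible.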

3.   Methodology

The framework of our method is illustrated in Figure 2. It comprises a general event detector, a memory enhanced by prototypical augmentation, and a contrastive augmentation module. For input sentences, the event detector performs trigger extraction. Then, the exemplars are augmented by prototypical augmentation to replay previous knowledge. Additionally, contrastive augmentation exploits information from each sample by applying an auxiliary contrastive loss. We provide a detailed introduction as follows.

3.1.   Event Detector

The event detector is composed of a trigger extractor and a classifier. Following previous works (Cao et al., 2020; Liu et al., 2022), we implement a pre-trained 12-layer BERT (Devlin et al., 2019) model to encode sentences. Specifically, given a sentence $\mathbf{S}=\{\mathbf{x}_{1},\mathbf{x}_{2},...,[\mathbf{e}_{s},...,\mathbf{e}_{e}],...,\mathbf{x}_{n}\}$ containing event triggers $\mathbf{E}=[\mathbf{e}_{s},...,\mathbf{e}_{e}]$, the hidden representation is $\mathbf{H}\in\mathbb{R}^{n\times d}$. We get the hidden states of a trigger $\mathbf{H}_{e}$ by concatenating its start and end representations. Then, $p(\mathbf{y}_{i}|\mathbf{h}_{e})$ for event type $\mathbf{y}_{i}\in L_{t}$ at stage $t$ is obtained by the following equation:

$$p(\mathbf{y}_{i}|\mathbf{h}_{e})=\frac{\exp(\mathbf{W}_{i}^{T}\mathbf{h}_{e}+\mathbf{b}_{i})}{\sum_{j=1}^{|L_{t}|}\exp(\mathbf{W}_{j}^{T}\mathbf{h}_{e}+\mathbf{b}_{j})} \tag{1}$$

where $\mathbf{W}_{i}\in\mathbb{R}^{d\times|L_{t}|}$ is a linear projection for classification. The possible types are $L_{t}$. Then, we train the model with Cross Entropy Loss:

$$\mathcal{L}_{ce}=-\sum_{(X,Y)\in T_{t}}\mathbf{y}\log\mathbf{p} \tag{2}$$

where $\mathbf{y}$ is the ground-truth label for trigger $\mathbf{h}_{e}$, and $\mathbf{p}$ is the label distribution calculated by Equation (1).
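
As an illustration of Equations (1) and (2), the following standard-library sketch classifies one trigger. It is not the authors' implementation: the hidden states, weights, and biases are made-up numbers standing in for BERT outputs and learned parameters, and each weight vector is assumed to match the size of the concatenated start/end representation.

```python
import math
from typing import List

Vector = List[float]


def trigger_representation(hidden: List[Vector], start: int, end: int) -> Vector:
    """Concatenate the start-token and end-token states of a trigger span."""
    return hidden[start] + hidden[end]


def softmax_probs(h_e: Vector, weights: List[Vector], biases: List[float]) -> List[float]:
    """Equation (1): weights[i] and biases[i] belong to event type i."""
    logits = [sum(w * x for w, x in zip(w_i, h_e)) + b_i
              for w_i, b_i in zip(weights, biases)]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], gold: int) -> float:
    """Equation (2) for a single trigger with a one-hot gold label."""
    return -math.log(probs[gold])


# Toy example: 3 tokens with 2-d hidden states, trigger spanning token 1 only.
hidden = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4]]
h_e = trigger_representation(hidden, start=1, end=1)
weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]  # two event types
biases = [0.0, 0.1]
probs = softmax_probs(h_e, weights, biases)
loss = cross_entropy(probs, gold=0)
```

When a new task arrives, the label set $L_t$ grows, which in this sketch would correspond to appending new rows to `weights` and `biases`.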

3.2.   Prototypical Augmentation

We construct a memory set by selecting the most representative examples. Accordingly, we adopt a distance-based algorithm. Finally, prototypical augmentation is applied in the feature space.

3.2.1.   Memory Construction

After task $T_{t}$, we combine a memory set $M_{t}$ comprising exemplars of current types with previous memory $M_{t-1}$. Since only few samples are available for training in incremental tasks, the most extreme condition should be taken into account so that our method can be compatible with any real-world applications. Thus, we only select one exemplar $(\mathbf{x}_{e,t}^{j},\mathbf{y}_{e,t}^{j})$ for every category in $T_{t}$:

$$M_{t}=\begin{cases}\left\{(\mathbf{x}_{e,t}^{j},\mathbf{y}_{e,t}^{j})\right\}_{j=1}^{m},&\text{if }t=1\\ \left\{(\mathbf{x}_{e,t}^{j},\mathbf{y}_{e,t}^{j})\right\}_{j=1}^{m}\bigcup M_{t-1},&\text{if }t>1\end{cases} \tag{3}$$

The combined $M_{t}$ is then treated as a part of the training set in the next task $T_{t+1}=T_{t+1}\bigcup M_{t}$. To select the most representative samples, we first create a prototype for each event type by averaging the encoded representations. Then we choose the closest sample measured by distance (e.g., $L_{2}$ Distance or Cosine Distance) as the exemplar.
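
The selection rule can be sketched as follows, assuming $L_2$ distance and toy 2-d features in place of BERT representations. The helper names (`prototype`, `select_exemplar`, `update_memory`) are hypothetical, and for brevity the memory maps an event type to the identifier of its single exemplar.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def prototype(features: List[Vector]) -> Vector:
    """Average the encoded representations of one event type."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_exemplar(features: List[Vector]) -> int:
    """Index of the sample whose representation is closest to the prototype."""
    center = prototype(features)
    return min(range(len(features)), key=lambda i: l2_distance(features[i], center))


def update_memory(
    previous: Dict[str, str],
    current: Dict[str, List[Tuple[str, Vector]]],
) -> Dict[str, str]:
    """Equation (3): one exemplar per new type, merged with the previous memory."""
    memory = dict(previous)
    for event_type, samples in current.items():
        best = select_exemplar([vec for _, vec in samples])
        memory[event_type] = samples[best][0]
    return memory


# Toy usage with made-up sample identifiers and features.
memory_1 = update_memory(
    {}, {"Marry": [("s1", [0.0, 0.0]), ("s2", [1.0, 1.0]), ("s3", [4.0, 4.0])]}
)
memory_2 = update_memory(memory_1, {"Transport": [("s4", [2.0, 0.0])]})
```

In the 1-shot case the only available sample is trivially its own prototype, so it is always the one stored.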

3.2.2.   Prototypical Augmentation

Since conventional memory preserves plenty of representative samples, these samples characterize the feature space of their types. However, in our settings, the memory is limited to 1 for each type. The exemplar can only be represented as a point in the feature space (see Figure 2 (b)). To tackle this, we reconstruct the feature space of the exemplar by prototypical augmentation.

We get the exemplar’s representation $\mathbf{h}_{e}^{j}$ that belongs to class $j$. We assume the pseudo feature space follows a Gaussian distribution. Since exemplars are normally considered the most representative samples, their representation is regarded as the mean. The variance of the distribution is calculated in the exemplar selection process, where we calculate the mean squared deviation of all samples that belong to the same category:

$$\sigma^{2}_{j}=\frac{1}{|\mathbf{H}_{e}^{j}|}\sum_{\mathbf{h}_{i}^{j}\in\mathbf{H}_{e}^{j}}(\mathbf{h}_{i}^{j}-\mu_{j})^{2} \tag{4}$$

where $\mathbf{H}^{j}_{e}$ are BERT representations that belong to event type $\mathbf{Y}_{t}^{j}$. According to Equation (3), the memory set $M_{t-1}$ is reformulated as $M_{t-1}=\bigcup_{k=1}^{t-1}\left\{(\mathbf{x}_{e,k}^{j},\mathbf{y}_{e,k}^{j},{\sigma^{2}}_{k}^{j})\right\}_{j=1}^{m}$. For every exemplar, we use this mean squared deviation as the variance of the Gaussian distribution.
When replaying exemplars, given the representation of exemplar $\mathbf{h}_{e}^{j}$, we have $\mu_{j}=\mathbf{h}_{e}^{j}$. Then, we sample from the distribution to construct synthetic features multiple times:

$$\hat{\mathbf{H}}^{j}_{e}=\{\hat{\mathbf{h}}^{j}_{e,1},\dots,\hat{\mathbf{h}}^{j}_{e,n}\}\sim\mathcal{N}(\mu_{j},\sigma^{2}_{j}) \tag{5}$$

These synthetic features can represent the feature space of their category (i.e., prototypical space). Then we replay the memory:

$$\mathcal{L}_{re}=-\sum^{\hat{\mathbf{H}}^{j}_{e}}\mathbf{y}_{j}\log\hat{\mathbf{p}}_{j} \tag{6}$$

where $\hat{\mathbf{p}}_{j}$ is obtained from $\hat{\mathbf{H}}^{j}_{e}$ by Equation (1).
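
A minimal sketch of Equations (4) to (6) follows, under stated assumptions: the variance is computed per dimension, the `center` passed to `class_variance` is taken to be the class mean available at exemplar selection time, and `predict` is a hypothetical callable that stands in for the classifier of Equation (1). None of these names come from the paper.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def class_variance(features: List[Vector], center: Vector) -> Vector:
    """Equation (4): per-dimension mean squared deviation from `center`."""
    n = len(features)
    return [sum((h[d] - center[d]) ** 2 for h in features) / n
            for d in range(len(center))]


def sample_synthetic(exemplar: Vector, variance: Vector,
                     n_samples: int, rng: random.Random) -> List[Vector]:
    """Equation (5): draw features from N(exemplar, variance), dimension-wise."""
    return [
        [rng.gauss(mu, math.sqrt(var)) for mu, var in zip(exemplar, variance)]
        for _ in range(n_samples)
    ]


def replay_loss(synthetic: List[Vector], gold: int,
                predict: Callable[[Vector], List[float]]) -> float:
    """Equation (6): cross entropy summed over the synthetic features."""
    return -sum(math.log(predict(h)[gold]) for h in synthetic)


# Toy usage: the variance is stored with the exemplar, then used for replay.
features = [[0.0, 0.0], [2.0, 4.0]]
variance = class_variance(features, center=[1.0, 2.0])
rng = random.Random(0)
synthetic = sample_synthetic([2.0, 4.0], variance, n_samples=5, rng=rng)
uniform = lambda h: [0.5, 0.5]  # placeholder classifier over two event types
loss = replay_loss(synthetic, gold=0, predict=uniform)
```

Only the exemplar and one variance vector per type need to be stored, so the memory cost stays constant per event type regardless of how many synthetic features are drawn.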

3.3.   Contrastive Augmentation

Overfitting is likely to appear in $T_{inc}$ when learning new event types from few-shot samples. As shown in Figure 2(c), we propose contrastive augmentation (CA) to uncover the implicit inter-information at the token level. Following Zhang et al. (2022b), we use multiple data augmentations (e.g., Dropout, Random Token Shuffle, and Random Token Replacement) to generate augmented tokens. These tokens are used to construct positive and negative pairs. Finally, we propose two contrastive losses to aggregate the information.

Algorithm 1 Training procedure
Require: Base task $T_{1}$, incremental tasks $\{T_{2},...,T_{n}\}$, and the model’s parameter $\theta$
1: initialize $\theta_{1}$ for base task $T_{1}$
2: update parameter $\theta_{1}$ in task $T_{1}$ using loss functions $\mathcal{L}_{ce}$ and $\mathcal{L}_{cls}$
3: get memory set $M_{1}$ from $T_{1}$ and $\theta_{1}$
4: for $i=2$ to $n$ do
5:     get a copy of the previous model’s parameter $\theta_{i-1}$
6:     freeze parameter $\theta_{i-1}$
7:     get combined training set $T_{i}=T_{i}\cup M_{i-1}$
8:     update parameter $\theta_{i}$ in task $T_{i}$ using loss functions $\mathcal{L}_{ce}$, $\mathcal{L}_{fd}$, $\mathcal{L}_{pd}$, $\mathcal{L}_{re}$, $\mathcal{L}_{cls}$ and $\mathcal{L}_{trig}$
9:     get memory set $M_{i}$ from $T_{i}$ and $\theta_{i}$
10:     update memory set $M_{i}=M_{i}\cup M_{i-1}$
11: end for
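
The control flow of Algorithm 1 can be sketched in Python as below. This is only a skeleton under assumptions: `train` and `select_exemplars` are hypothetical callables supplied by the caller, the loss functions are hidden inside `train`, and exemplars are selected from the new samples of each task before being merged with the previous memory.

```python
import copy
from typing import Any, Callable, List, Tuple


def continual_training(
    tasks: List[list],
    init_params: Any,
    train: Callable[[Any, list, Any], Any],
    select_exemplars: Callable[[list, Any], list],
) -> Tuple[Any, list]:
    """Skeleton of Algorithm 1 with caller-supplied training and selection."""
    params = train(init_params, tasks[0], None)            # steps 1-2: base task
    memory = select_exemplars(tasks[0], params)            # step 3
    for task in tasks[1:]:                                 # step 4
        frozen = copy.deepcopy(params)                     # steps 5-6: frozen previous model
        combined = task + memory                           # step 7: new samples plus memory
        params = train(params, combined, frozen)           # step 8
        memory = select_exemplars(task, params) + memory   # steps 9-10
    return params, memory


# Toy stand-ins: "training" counts the samples seen, selection keeps one sample.
toy_tasks = [["a1", "a2", "a3"], ["b1"], ["c1"]]
final_params, final_memory = continual_training(
    toy_tasks,
    init_params=0,
    train=lambda params, data, frozen: params + len(data),
    select_exemplars=lambda data, params: data[:1],
)
```

The frozen copy is passed to `train` because the incremental step in the listing also uses losses that depend on the previous model.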

3.3.1.   Contrastive Pairs Construction

We first construct positive pairs and negative pairs from batched data. Specifically, given a mini-batch ={(𝐱i,𝐲i)}i=1nsuperscriptsubscriptsubscript𝐱𝑖subscript𝐲𝑖𝑖1𝑛\mathcal{B}=\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{n}caligraphic_B = { ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, the original sentences are (𝐱i1,𝐲i1)superscriptsubscript𝐱𝑖1superscriptsubscript𝐲𝑖1(\mathbf{x}_{i}^{1},\mathbf{y}_{i}^{1})( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) and the augmented sentences are {(𝐱ik,𝐲ik)}k=2m+1superscriptsubscriptsuperscriptsubscript𝐱𝑖𝑘superscriptsubscript𝐲𝑖𝑘𝑘2𝑚1\{(\mathbf{x}_{i}^{k},\mathbf{y}_{i}^{k})\}_{k=2}^{m+1}{ ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_k = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m + 1 end_POSTSUPERSCRIPT, where m𝑚mitalic_m is a hyperparameter, denoting the augmentation times. Thus, sentences that have the same origin can be described as 𝒪={(𝐱ik,𝐲ik)}k=1m+1𝒪superscriptsubscriptsuperscriptsubscript𝐱𝑖𝑘superscriptsubscript𝐲𝑖𝑘𝑘1𝑚1\mathcal{O}=\{(\mathbf{x}_{i}^{k},\mathbf{y}_{i}^{k})\}_{k=1}^{m+1}caligraphic_O = { ( bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m + 1 end_POSTSUPERSCRIPT. 
Based on these pairs, we perform contrastive learning in sentence representation and trigger representation.
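The construction of $\mathcal{O}$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `shuffle_view` and `build_views` are hypothetical, and token shuffling is used only as a stand-in augmentation. Tokens and their labels are shuffled jointly so that each label stays attached to its token.

```python
import random

def shuffle_view(tokens, labels, rng):
    """Stand-in augmentation (assumption): shuffle tokens and labels together."""
    paired = list(zip(tokens, labels))
    rng.shuffle(paired)
    new_tokens = [t for t, _ in paired]
    new_labels = [y for _, y in paired]
    return new_tokens, new_labels

def build_views(batch, m, seed=0):
    """Map each sentence index i to its m+1 views [(x_i^1, y_i^1), ..., (x_i^{m+1}, y_i^{m+1})]."""
    rng = random.Random(seed)
    views = {}
    for i, (tokens, labels) in enumerate(batch, start=1):
        group = [(list(tokens), list(labels))]   # k = 1: the original sentence
        for _ in range(m):                       # k = 2..m+1: augmented copies
            group.append(shuffle_view(tokens, labels, rng))
        views[i] = group
    return views

batch = [
    (["they", "married", "in", "June"], ["O", "Marry", "O", "O"]),
    (["he", "left", "the", "city"], ["O", "Transport", "O", "O"]),
]
views = build_views(batch, m=2)
```

Views stored under the same key share an origin and therefore form positive pairs at the sentence level; views under different keys act as negatives.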

3.3.2.   Contrastive Sentence Representation Learning

As in BERT, the special [CLS] token generally conveys the sentence representation. Similar to Mou et al. (2022), we utilize contrastive sentence representation learning for $\mathbf{h}_{cls}$. Representations originating from the same sentence are regarded as positive pairs, and those that originate from different sentences are regarded as negative pairs. We leverage the InfoNCE loss (Oord et al., 2018):

$$
\mathcal{L}_{cls} = \frac{1}{n-1}\sum_{i}^{|\mathcal{B}|} -\frac{1}{m}\sum_{j\neq k}^{|\mathcal{O}|} \log\frac{\exp\left(S({\mathbf{h}_{cls}}_{i}^{j},{\mathbf{h}_{cls}}_{i}^{k})/\tau\right)}{\sum_{p\neq i}^{|\mathcal{B}|}\sum_{q}^{|\mathcal{O}|}\exp\left(S({\mathbf{h}_{cls}}_{i}^{j},{\mathbf{h}_{cls}}_{p}^{q})/\tau\right)} \tag{7}
$$

where $S(\cdot)$ is the similarity function, and $\tau$ is a temperature parameter that smooths the distribution and controls the similarity range by scaling the output.
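To make the indexing of Equation 7 concrete, the following pure-Python sketch evaluates the loss on toy vectors. It assumes cosine similarity for $S(\cdot)$ and a hypothetical default $\tau = 0.1$; the function names are illustrative and it requires $n \geq 2$ and $m \geq 1$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sentence_contrastive_loss(H, tau=0.1):
    """H[i][j] is the [CLS] vector of view j of sentence i (n sentences, m+1 views each)."""
    n = len(H)
    m = len(H[0]) - 1
    total = 0.0
    for i in range(n):
        inner = 0.0
        for j in range(m + 1):
            # Denominator: all views of every *other* sentence in the batch.
            denom = sum(
                math.exp(cosine(H[i][j], H[p][q]) / tau)
                for p in range(n) if p != i
                for q in range(m + 1)
            )
            for k in range(m + 1):
                if k == j:
                    continue
                # Numerator: another view of the same sentence.
                pos = math.exp(cosine(H[i][j], H[i][k]) / tau)
                inner += math.log(pos / denom)
        total += -inner / m
    return total / (n - 1)

# Two sentences, one augmentation each (m = 1).
aligned = [[[1.0, 0.0], [1.0, 0.01]], [[0.0, 1.0], [0.01, 1.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.01], [0.01, 1.0]]]
loss_aligned = sentence_contrastive_loss(aligned)
loss_mixed = sentence_contrastive_loss(mixed)
```

Because the denominator contains only negatives, the loss can be negative; what matters is that views of the same sentence that lie close together yield a lower value than views scattered across sentences.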

3.3.3.   Contrastive Trigger Representation Learning

For trigger representations, we construct positive pairs from triggers within $\mathcal{B}$ that belong to the same event type, and negative pairs from triggers that belong to different types. The contrastive loss on trigger representations is:

$$
\mathcal{L}_{trig} = \frac{1}{n-1}\sum_{i\neq l}^{|\mathcal{B}|} -\frac{1}{m}\sum_{j\neq k}^{|\mathcal{O}|} [\mathbf{y}_{i}^{j}=\mathbf{y}_{l}^{k}] \log\frac{\exp\left(S({\mathbf{h}_{e}}_{i}^{j},{\mathbf{h}_{e}}_{l}^{k})/\tau\right)}{\sum_{p\neq i}^{|\mathcal{B}|}\sum_{q}^{|\mathcal{O}|}[\mathbf{y}_{i}^{j}\neq\mathbf{y}_{p}^{q}]\exp\left(S({\mathbf{h}_{e}}_{i}^{j},{\mathbf{h}_{e}}_{p}^{q})/\tau\right)} \tag{8}
$$
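The role of the two indicator terms in Equation 8 can be illustrated with a small sketch. It makes several simplifying assumptions that are not stated in the paper: one trigger per view, cosine similarity for $S(\cdot)$, a hypothetical $\tau = 0.1$, and anchors with no negatives in the batch are skipped.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def trigger_contrastive_loss(H, Y, tau=0.1):
    """H[i][j]: trigger vector of view j of sentence i; Y[i][j]: its event type."""
    n = len(H)
    m = len(H[0]) - 1
    total = 0.0
    for i in range(n):
        for j in range(m + 1):
            # Negatives: triggers of other sentences with a different type.
            denom = sum(
                math.exp(cosine(H[i][j], H[p][q]) / tau)
                for p in range(n) if p != i
                for q in range(m + 1) if Y[p][q] != Y[i][j]
            )
            if denom == 0.0:
                continue  # assumption: skip anchors without any negative
            for l in range(n):
                if l == i:
                    continue
                for k in range(m + 1):
                    # Positives: same type, other sentence, different view index.
                    if k == j or Y[l][k] != Y[i][j]:
                        continue
                    pos = math.exp(cosine(H[i][j], H[l][k]) / tau)
                    total += -math.log(pos / denom) / m
    return total / (n - 1)

labels = [["A", "A"], ["A", "A"], ["B", "B"]]
clustered = [[[1.0, 0.0], [1.0, 0.1]], [[1.0, 0.05], [1.0, 0.0]], [[0.0, 1.0], [0.1, 1.0]]]
confused = [[[1.0, 0.0], [1.0, 0.1]], [[0.0, 1.0], [0.05, 1.0]], [[1.0, 0.0], [1.0, 0.1]]]
loss_clustered = trigger_contrastive_loss(clustered, labels)
loss_confused = trigger_contrastive_loss(confused, labels)
```

Unlike the sentence-level loss, positives here come from different sentences, so the objective pulls together triggers of the same event type regardless of their origin.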

3.4.   Knowledge Distillation

Similar to Cao et al. (2020), we use knowledge distillation at both the feature level and the prediction level. At task $T_t$, we distill knowledge from $T_{t-1}$.

Feature-level Distillation. We obtain the previous and current normalized representations $\tilde{\mathbf{h}}$ and $\mathbf{h}$ from the last layer’s hidden states. We measure their similarity with the function $S(\cdot)$ (cosine similarity). The feature-level distillation loss is:

$$
\mathcal{L}_{fd} = \sum_{(X,Y)\in T_{t}} \left(1 - S(\tilde{\mathbf{h}},\mathbf{h})\right) \tag{9}
$$
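As a rough sketch of Equation 9, the loss sums one minus the cosine similarity over paired representations from the previous and current models. The function name and the list-of-vectors input format are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def feature_distillation_loss(prev_feats, curr_feats):
    """Sum of (1 - cosine) over paired features from the previous and current models."""
    return sum(1.0 - cosine(h_prev, h) for h_prev, h in zip(prev_feats, curr_feats))

unchanged = feature_distillation_loss([[1.0, 2.0]], [[2.0, 4.0]])    # same direction
drifted = feature_distillation_loss([[1.0, 0.0]], [[0.0, 1.0]])      # orthogonal
```

The loss depends only on the direction of the features, so rescaling a representation is not penalized while rotating it away from the previous model's output is.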

Predict-level Distillation. Following Hinton et al. (2015), given trigger representations $\mathbf{h}_{e}$, we obtain the probability distribution:

$$
p(\mathbf{y}_{i}|\mathbf{h}_{e}) = \frac{\exp\left((\mathbf{W}_{i}^{T}\mathbf{h}_{e}+\mathbf{b}_{i})/\tau_{d}\right)}{\sum_{j\in L_{t-1}}\exp\left((\mathbf{W}_{j}^{T}\mathbf{h}_{e}+\mathbf{b}_{j})/\tau_{d}\right)} \tag{10}
$$

where $\tau_{d}$ is the temperature that controls the smoothness of the target distribution. We compute the previous and current probability distributions $\tilde{\mathbf{p}}$ and $\mathbf{p}$ on the previous label set $L_{t-1}$. The training objective is:

$$
\mathcal{L}_{pd} = -\sum_{(X,Y)\in T_{t}} \tilde{\mathbf{p}}\log\mathbf{p} \tag{11}
$$
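The prediction-level objective can be sketched for a single trigger as below. The sketch assumes the temperature divides the logits before exponentiation, as in Hinton et al. (2015); the function names and the default $\tau_d = 2.0$ are hypothetical.

```python
import math

def softmax_with_temperature(logits, tau_d):
    """Softmax over logits scaled by 1 / tau_d (larger tau_d gives a flatter distribution)."""
    scaled = [z / tau_d for z in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def prediction_distillation_loss(prev_logits, curr_logits, tau_d=2.0):
    """Cross-entropy between previous-model and current-model distributions on old labels."""
    p_prev = softmax_with_temperature(prev_logits, tau_d)
    p_curr = softmax_with_temperature(curr_logits, tau_d)
    return -sum(pt * math.log(pc) for pt, pc in zip(p_prev, p_curr))

prev_logits = [2.0, 0.5, -1.0]
same = prediction_distillation_loss(prev_logits, prev_logits)
shifted = prediction_distillation_loss(prev_logits, [-1.0, 0.5, 2.0])
```

The cross-entropy is smallest when the current model reproduces the previous model's distribution over $L_{t-1}$, which is the behavior the distillation term rewards.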

3.5.   Training

We present the detailed training procedure in Algorithm 1. Since $\mathcal{L}_{ce}$ is the primary training objective and $\mathcal{L}_{cls}$ plays an auxiliary role in exploiting sentence information, we enable $\mathcal{L}_{ce}$ and $\mathcal{L}_{cls}$ in $T_{base}$. In $T_{inc}$, we incorporate the distillation losses ($\mathcal{L}_{fd}$ and $\mathcal{L}_{pd}$) and the exemplar replay loss ($\mathcal{L}_{re}$), as they rely on previous knowledge for training. We enable $\mathcal{L}_{trig}$ exclusively in $T_{inc}$ due to its superior effectiveness in few-shot learning. Each loss function is weighted by a factor $\lambda_{i}$, where $i\in\{ce,re,cls,trig,fd,pd\}$.
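The stage-dependent objective can be summarized in a few lines. This sketch assumes that $\mathcal{L}_{ce}$ and $\mathcal{L}_{cls}$ remain active in the incremental tasks (the text says the other losses are incorporated, which we read as additive); the names `BASE_LOSSES`, `INC_LOSSES`, and `total_loss` are hypothetical.

```python
# Assumption: incremental tasks keep the base losses and add the remaining four.
BASE_LOSSES = ("ce", "cls")
INC_LOSSES = ("ce", "cls", "trig", "re", "fd", "pd")

def total_loss(losses, weights, stage):
    """Weighted sum of the losses that are enabled in the given stage ('base' or 'inc')."""
    if stage not in ("base", "inc"):
        raise ValueError("stage must be 'base' or 'inc'")
    active = BASE_LOSSES if stage == "base" else INC_LOSSES
    return sum(weights[name] * losses[name] for name in active)

losses = {name: 2.0 for name in INC_LOSSES}     # toy loss values
weights = {name: 0.5 for name in INC_LOSSES}    # toy lambda_i values
base_objective = total_loss(losses, weights, "base")
inc_objective = total_loss(losses, weights, "inc")
```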

4.   Experiments

4.1.   Continual Few-shot Event Detection Benchmarks

We construct our benchmarks based on two publicly available datasets:

MAVEN (Wang et al., 2020): MAVEN is a massive general-domain event detection dataset that originally contains 168 event types. For the training/validation/testing split, similar to Yu et al. (2021), the test set is built from the original development set, and we randomly select samples from the original training set to form a new development set. For the incremental task split, we select the most frequent types to construct CFED tasks. Accordingly, we randomly sample 100 instances for each type in the base task, and 5 or 10 instances for each type in the incremental tasks.

ACE 2005 (Walker et al., 2006): The ACE 2005 dataset consists of 33 event types. The training/validation/testing split follows previous works (Yang and Mitchell, 2016; Nguyen et al., 2016). We apply the same incremental task split as for the MAVEN dataset to construct CFED tasks.

Our experiments contain 5 sub-tasks. We define a task with $m$ event types per sub-task and $k$ training samples per type as an $m$-way $k$-shot CFED task. We select the 10 and 20 most frequent types to build the 2-way 5-shot, 2-way 10-shot, 4-way 5-shot, and 4-way 10-shot tasks. We randomly sample 100 instances for each type in $T_{base}$, and 5 or 10 instances for each type in $T_{inc}$.
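The benchmark construction can be sketched as follows. The function `build_cfed_tasks` and its input format are hypothetical, and assigning types to sub-tasks in frequency order is an assumption; the paper only states that the most frequent types are selected.

```python
import random
from collections import Counter

def build_cfed_tasks(instances, m, k, num_tasks=5, base_shots=100, seed=0):
    """Split (sentence, event_type) pairs into one base task and few-shot incremental tasks."""
    rng = random.Random(seed)
    counts = Counter(event_type for _, event_type in instances)
    top_types = [t for t, _ in counts.most_common(m * num_tasks)]
    by_type = {t: [x for x in instances if x[1] == t] for t in top_types}
    tasks = []
    for idx in range(num_tasks):
        types = top_types[idx * m:(idx + 1) * m]
        shots = base_shots if idx == 0 else k   # base task: 100 per type; others: k per type
        train = []
        for t in types:
            pool = by_type[t]
            train.extend(rng.sample(pool, min(shots, len(pool))))
        tasks.append({"types": types, "train": train})
    return tasks

# Synthetic corpus: 12 types with strictly decreasing frequency.
corpus = [(f"sentence {t}-{n}", f"type{t}") for t in range(12) for n in range(200 - t)]
tasks = build_cfed_tasks(corpus, m=2, k=5)
```

With `m=2` and five sub-tasks, the ten most frequent types are used, which matches the 2-way setting described above; `m=4` would use twenty.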

4.2.   Evaluation Metrics

Following Cao et al. (2020), we use the micro F1 score to evaluate the performance at each stage. For stage $C_{i}^{test}$, we calculate $F1_{i}$ on all observed event types, as defined in Section 2. The micro F1 score enables a comprehensive evaluation of the prediction results over all categories. We define $\bar{F1}_{micro}=\frac{1}{n}\sum_{i=1}^{n}F1_{i}$ as the metric for overall performance on CFED.
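A short worked example may help here. The sketch computes micro F1 from pooled counts and then averages the per-stage scores; the averaged values reported in the tables are consistent with a mean over the five stages, which is checked below on the Retrain row of Table 1 (4-way 5-shot). The function names are illustrative.

```python
def micro_f1(tp, fp, fn):
    """Micro F1 from true positives, false positives, and false negatives pooled over types."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f1(stage_scores):
    """Mean of the per-stage micro F1 scores."""
    return sum(stage_scores) / len(stage_scores)

# Per-stage means of the Retrain baseline, 4-way 5-shot MAVEN (Table 1).
retrain_stages = [40.43, 42.1, 39.61, 43.03, 47.43]
retrain_avg = average_f1(retrain_stages)   # reported value in Table 1: 42.52
```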

4.3.   Baseline Systems

Fine-tune. We fine-tune BERT continually on every sub-task. Typically, this option is the lower bound in continual learning.

Combined Retrain. We retrain the model by combining all training samples of currently known types every time a new task arrives. It is usually regarded as the upper bound.

EWC (Kirkpatrick et al., 2017), which is a regularization-based method. It applies a regularization term to restrict updates to parameters that are important for previous tasks.

LwF (Li and Hoiem, 2017), which contains a distillation module that matches the output probabilities of previous models to maintain previous knowledge.

ICaRL (Rebuffi et al., 2017), which is a memory-based method that additionally utilizes a representation learning method.

KCN (Cao et al., 2020), which is a popular continual event detection method following the memory replay-knowledge distillation paradigm.

KT (Yu et al., 2021). It generally follows the memory-based paradigm with a novel initialization method to transfer knowledge.

EMP (Liu et al., 2022). Besides memory replay, it introduces prompt learning of each event type to load previous types’ knowledge.

4.4.   Implementation Details

All baselines are implemented under the same settings, as follows. The BERT model is the open-source 110M bert-base-uncased from HuggingFace (https://huggingface.co/bert-base-uncased). The number of training iterations is 30, the batch size is 4, AdamW (Loshchilov and Hutter, 2019) is used as the optimizer, the learning rate is set to 2e-5, and the weight decay is set to 1e-4. The memory capacity is 1 for each type. All computations are performed on the NVIDIA GeForce RTX 3090 (24GB) platform with 5 different random seeds. More implementation details can be found in the open-source code repository.
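For reference, the settings above can be collected in a single configuration object. This is only a convenience sketch: the class and field names are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in the text; field names are illustrative."""
    encoder_name: str = "bert-base-uncased"
    num_iterations: int = 30
    batch_size: int = 4
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    weight_decay: float = 1e-4
    memory_per_type: int = 1
    num_seeds: int = 5

config = TrainConfig()
```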

| Method | 5-shot 1 | 5-shot 2 | 5-shot 3 | 5-shot 4 | 5-shot 5 | 5-shot $\bar{F1}_{micro}$ | 10-shot 1 | 10-shot 2 | 10-shot 3 | 10-shot 4 | 10-shot 5 | 10-shot $\bar{F1}_{micro}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fine-tune | 40.43±2.34 | 33.17±3.55 | 17.5±2.07 | 19.72±0.92 | 21.01±0.87 | 26.36±1.3 | 40.43±2.34 | 38.18±2.83 | 20.46±1.11 | 20.35±2.19 | 23.57±1.01 | 28.6±0.92 |
| Retrain | 40.43±2.34 | 42.1±1.13 | 39.61±1.12 | 43.03±1.56 | 47.43±0.67 | 42.52±0.7 | 40.43±2.34 | 44.27±1.36 | 44.76±1.37 | 48.28±1.43 | 53.66±0.97 | 46.28±0.95 |
| EWC | 40.43±2.34 | 34.29±1.41 | 17.4±1.5 | 18.61±2.52 | 20.43±1.67 | 26.23±1.39 | 40.43±2.34 | 36.42±3.34 | 19.69±0.93 | 20.02±1.14 | 23.72±1.19 | 28.06±1.01 |
| LwF | 40.43±2.34 | 37.27±4.9 | 26.69±4.07 | 24.7±1.47 | 30.54±1.43 | 31.93±2.05 | 40.43±2.34 | 41.09±2.8 | 31.89±0.57 | 30.57±1.09 | 34.43±2.08 | 35.68±0.69 |
| ICaRL | 35.82±4.76 | 37.16±4.85 | 33.74±2.85 | 35.54±2.37 | 35.98±2.48 | 35.65±2.93 | 35.82±4.76 | 42.43±4.48 | 37.45±1.58 | 40.11±0.9 | 41.04±1.17 | 39.37±2.05 |
| KCN | 40.43±2.35 | 48.38±1.66 | 41.99±2.01 | 41.32±1.53 | 40.29±1.51 | 42.48±1.49 | 40.43±2.35 | 51.15±1.19 | 45.22±1.22 | 44.31±0.69 | 44.47±1.51 | 45.12±1.09 |
| KT | 41.04±1.59 | 40.19±2.17 | 35.21±1.34 | 32.69±0.78 | 33.77±0.58 | 36.58±1.06 | 41.04±1.59 | 44.39±0.91 | 40±1.3 | 39.42±0.33 | 37.87±0.95 | 40.54±0.58 |
| EMP | 40.17±1.34 | 30.95±0.75 | 31.21±1.32 | 22.9±2.09 | 22.25±1.43 | 29.5±0.76 | 40.17±1.34 | 32.33±0.69 | 32.95±1.11 | 26.68±1.5 | 28.16±1.89 | 32.06±0.8 |
| HANet(Ours) | 41.91±3.76 | 51.39±1.55 | 43.21±3.19 | 43.53±4.21 | 43.89±5.65 | 44.79±2.33 | 41.91±3.76 | 53.17±1.27 | 46.71±2.51 | 46.36±3.64 | 48.12±5.49 | 47.25±2.23 |

Table 1: $F1_{micro}$ of every sub-task and $\bar{F1}_{micro}$ across all sub-tasks on 4-way MAVEN benchmark.

| Method | 5-shot 1 | 5-shot 2 | 5-shot 3 | 5-shot 4 | 5-shot 5 | 5-shot $\bar{F1}_{micro}$ | 10-shot 1 | 10-shot 2 | 10-shot 3 | 10-shot 4 | 10-shot 5 | 10-shot $\bar{F1}_{micro}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fine-tune | 60.86±2.96 | 52.09±9.59 | 46.37±10 | 26.64±6.98 | 23.15±4.66 | 41.82±3.56 | 60.86±2.96 | 48.17±9.8 | 49.55±2.91 | 23.29±8.2 | 24.66±3.23 | 41.31±3.31 |
| Retrain | 60.86±2.96 | 62.45±4.27 | 52.21±7.83 | 52.2±4.68 | 58.36±6.09 | 57.22±4.48 | 60.86±2.96 | 63.39±2.87 | 63.75±2.67 | 61.23±2.08 | 64.25±3.13 | 62.7±1.3 |
| EWC | 60.86±2.96 | 49.3±8.93 | 45.41±10.43 | 27.14±11.24 | 22.36±3.9 | 41.02±4.85 | 60.86±2.96 | 47.58±10.11 | 51.15±3.05 | 23.82±7.67 | 21.79±3.1 | 41.04±2.78 |
| LwF | 60.86±2.96 | 47.31±10.4 | 38.91±12.89 | 23.31±13.46 | 28.4±2.83 | 39.76±6.85 | 60.86±2.96 | 46.98±8.32 | 50.77±3.35 | 33.48±2.7 | 29.69±2.91 | 44.36±2.2 |
| ICaRL | 50.85±6.51 | 52.21±2.72 | 37.39±6.78 | 31.33±6.31 | 28.85±5.04 | 40.13±4.1 | 50.85±6.51 | 52.06±2.66 | 42.45±6.48 | 32.89±4.96 | 34.7±3.93 | 42.59±2.8 |
| KCN | 60.86±2.96 | 56.38±5.03 | 47.56±10.41 | 38.62±9.47 | 37.05±7.11 | 48.09±6.41 | 60.86±2.96 | 59.41±6.74 | 57.39±6.19 | 46.48±6.1 | 44.3±5.43 | 53.69±4.42 |
| KT | 53.16±2.25 | 42.55±2.33 | 33.93±2.97 | 38.48±8.66 | 31.27±9.34 | 39.88±3.84 | 53.16±2.25 | 59.12±1.78 | 50.02±5.13 | 49.02±5.34 | 28.54±2.95 | 47.97±2.67 |
| EMP | 54.78±1.49 | 40.49±1.9 | 24.32±3.37 | 27.15±8.46 | 22.53±6.02 | 33.85±2.96 | 54.78±1.49 | 37.28±7.37 | 19.6±4.96 | 34.69±4.76 | 24.19±6.62 | 34.11±3.48 |
| HANet(Ours) | 61.16±2.29 | 63.07±3.09 | 57.5±5.98 | 53.21±4.64 | 54.31±3.21 | 57.85±2.91 | 61.16±2.29 | 66.84±2.88 | 64.68±3.77 | 58.02±6.58 | 54.37±5.94 | 61.02±3.46 |

Table 2: $F1_{micro}$ of every sub-task and $\bar{F1}_{micro}$ across all sub-tasks on 2-way ACE benchmark.

4.5.   Main Results

We conduct each experiment 5 times and report the mean ± std. on the MAVEN and ACE benchmarks in comparison with the previously mentioned baselines. We report results in Table 1, Table 2, and Figure 3. From the results, we can observe that:

(1) Compared with previous baselines, our approach significantly outperforms them across all sub-tasks. On 4-way 5-shot MAVEN and 2-way 5-shot ACE, our model obtains improvements of $7.27\%$ and $8.44\%$ on $\bar{F1}_{micro}$ over previous state-of-the-art methods. Our approach even exceeds the strong retrain baseline, with improvements of $5.94\%$ and $5.56\%$ on $\bar{F1}_{micro}$, which supports the effectiveness of our approach.

Figure 3: $F1_{micro}$ performance of every sub-task on 2-way MAVEN and 4-way ACE.

(2) KCN and KT achieve relatively good performance. However, because we limit the memory capacity to only one sample per type for replay, they can learn little from memory replay, which demonstrates the importance of characterizing the prototypical feature space.

| Method | 5-shot 1 | 5-shot 2 | 5-shot 3 | 5-shot 4 | 5-shot 5 | 5-shot $\bar{F1}_{micro}$ | 10-shot 1 | 10-shot 2 | 10-shot 3 | 10-shot 4 | 10-shot 5 | 10-shot $\bar{F1}_{micro}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HANet(Ours) | 67.16 | 56.01 | 54.80 | 54.89 | 55.22 | 57.62 | 67.16 | 54.22 | 58.31 | 56.90 | 58.09 | 58.94 |
| w/o Replay* | 67.16 | 51.02 | 44.15 | 38.76 | 36.78 | 47.57 | 67.16 | 48.13 | 48.14 | 41.07 | 40.01 | 48.90 |
| w/o Distill | 67.16 | 46.83 | 42.77 | 37.17 | 42.90 | 47.37 | 67.16 | 45.45 | 44.07 | 44.90 | 47.77 | 49.87 |
| w/o PA | 67.16 | 54.28 | 53.01 | 50.98 | 52.21 | 55.53 | 67.16 | 52.94 | 57.47 | 53.91 | 55.38 | 57.37 |
| w/o CA | 59.67 | 54.45 | 49.14 | 50.08 | 49.57 | 52.58 | 59.67 | 53.31 | 53.75 | 53.16 | 53.46 | 54.67 |
| w/o PA and CA | 59.67 | 51.43 | 43.32 | 44.32 | 42.04 | 48.16 | 59.67 | 45.03 | 45.90 | 47.14 | 48.35 | 49.22 |

Table 3: Ablation studies on 2-way MAVEN, comparing $F1_{micro}$ when removing one component at a time.

| Benchmark | Method | 1-shot 1 | 1-shot 2 | 1-shot 3 | 1-shot 4 | 1-shot 5 | 1-shot $\bar{F1}_{micro}$ | 2-shot 1 | 2-shot 2 | 2-shot 3 | 2-shot 4 | 2-shot 5 | 2-shot $\bar{F1}_{micro}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAVEN | HANet(Ours) | 67.16 | 45.54 | 38.28 | 42.39 | 40.40 | 46.75 | 67.16 | 55.87 | 50.35 | 51.63 | 51.39 | 55.28 |
| MAVEN | gpt-3.5-turbo | 54.22 | 55.25 | 41.60 | 37.88 | 33.31 | 44.45 | 57.00 | 58.51 | 43.64 | 40.39 | 36.56 | 47.22 |
| ACE | HANet(Ours) | 60.99 | 51.93 | 41.67 | 41.54 | 35.84 | 46.40 | 60.99 | 58.38 | 39.48 | 41.76 | 44.60 | 49.04 |
| ACE | gpt-3.5-turbo | 42.20 | 50.29 | 40.51 | 43.46 | 35.21 | 42.33 | 56.36 | 49.72 | 45.16 | 44.44 | 42.96 | 47.73 |

Table 4: Comparison with gpt-3.5-turbo on the 2-way MAVEN and ACE benchmarks.

| Way-num | Method | MAVEN 5-shot | MAVEN 10-shot | ACE 5-shot | ACE 10-shot |
|---|---|---|---|---|---|
| 2-way | w/o CA | 52.58 | 54.67 | 48.27 | 60.45 |
| 2-way | Dropout | 54.68 | 56.32 | 53.06 | 61.87 |
| 2-way | Shuffle | 57.62 | 58.94 | 55.10 | 63.98 |
| 2-way | RTR | 54.60 | 56.57 | 55.53 | 63.27 |
| 2-way | Retrain | 51.78 | 54.93 | 49.54 | 60.69 |
| 4-way | w/o CA | 45.68 | 48.95 | 64.66 | 68.70 |
| 4-way | Dropout | 44.36 | 47.45 | 67.41 | 68.58 |
| 4-way | Shuffle | 48.47 | 49.91 | 70.31 | 69.90 |
| 4-way | RTR | 46.18 | 47.96 | 67.93 | 68.11 |
| 4-way | Retrain | 42.53 | 46.59 | 65.21 | 68.65 |

Table 5: $\bar{F1}_{micro}$ of different augmentation methods on MAVEN and ACE benchmarks. We also list the “w/o CA” and Retrain methods for comparison.

(3) Compared with methods optimized for continual event detection, the traditional methods EWC, LwF, and ICaRL perform poorly. The large gap between the lower bound and HANet illustrates that CFED is a challenging task.

4.6.   Ablation Study

We conduct an ablation study to validate the effectiveness of each component. We choose 2-way MAVEN for the ablation study in Table 3. “w/o Replay*” denotes removing memory replay. As prototypical augmentation is based on the memory set, $\mathcal{L}_{re}$ is also set to $0$ in “w/o Replay*”. The distillation losses $\mathcal{L}_{fd}$ and $\mathcal{L}_{pd}$ are removed in “w/o Distill”. $\mathcal{L}_{re}$ is removed in “w/o PA”, and $\mathcal{L}_{cls}$ and $\mathcal{L}_{trig}$ are removed in “w/o CA”. Here are the conclusions:

(1) Effectiveness of Prototypical Augmentation. Compared with the variant without prototypical augmentation (PA), PA boosts $\bar{F1}_{micro}$ by $2.09\%$ and $1.57\%$ in the 5-shot and 10-shot settings, respectively. Meanwhile, as the tasks proceed, the model gains larger improvements, demonstrating that PA plays an increasingly vital role in alleviating catastrophic forgetting. We also plot a t-SNE visualization in Figure 4 to show how PA contributes to memorizing previous event types.

Figure 4: Embedding space visualization via t-SNE of the original and prototypical augmented features in task $T_{2}$. Points of the same color indicate identical event types. After prototypical augmentation, the intra-class distances become smaller for each type. Besides, some hard samples (marked by the squared region) that are initially close to the centers of other classes in the original space become easier to classify after prototypical augmentation, showcasing the effectiveness of prototypical augmentation.

(2) Effectiveness of Contrastive Augmentation. In comparison with the variant without contrastive augmentation, our approach delivers improvements of $5.04\%$ and $4.27\%$ on $\bar{F1}_{micro}$, which indicates that contrastive augmentation helps mitigate overfitting in few-shot incremental tasks. Although we focus more on $T_{inc}$, the model also benefits greatly from the auxiliary objectives in $T_{base}$ (the first-task score rises from 59.67 to 67.16 in Table 3).

(3) Effectiveness of Prototypical Augmentation and Contrastive Augmentation. When both prototypical augmentation and contrastive augmentation are removed, $\bar{F1}_{micro}$ declines sharply by $9.46\%$ and $9.72\%$, implying a synergistic effect of the two modules in addressing the CFED problem.
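The gains quoted in the three conclusions above can be reproduced directly from the $\bar{F1}_{micro}$ columns of Table 3. The snippet below is only an arithmetic check; the dictionary layout and function name are illustrative.

```python
# (5-shot, 10-shot) averaged micro F1 from Table 3, 2-way MAVEN.
avg_f1 = {
    "HANet": (57.62, 58.94),
    "w/o PA": (55.53, 57.37),
    "w/o CA": (52.58, 54.67),
    "w/o PA and CA": (48.16, 49.22),
}

def drop_from_full(variant):
    """Difference between the full model and an ablated variant, rounded to two decimals."""
    full = avg_f1["HANet"]
    return tuple(round(f - v, 2) for f, v in zip(full, avg_f1[variant]))

pa_gain = drop_from_full("w/o PA")
ca_gain = drop_from_full("w/o CA")
both_gain = drop_from_full("w/o PA and CA")
```

Removing both modules costs more than the sum of the individual drops for PA and CA, which is the numerical basis for the synergy claim.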

4.7.   Effect of Augmentation Method in Contrastive Augmentation

Different augmentation methods affect contrastive augmentation. We evaluate “Dropout”, “Shuffle”, and “Random Token Replacement” (“RTR”). As mentioned in Gao et al. (2021), “Dropout” means making a forward pass with dropout modules. “Shuffle” randomly shuffles the tokens of the sentence. “RTR” refers to randomly replacing non-trigger tokens with other tokens. From Table 5, we can draw the following conclusions: in most cases, “Shuffle” is the most effective method. “Dropout” generally performs worse than the other two methods; however, it still outperforms “w/o CA” in all 2-way settings.
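The two token-level augmentations can be sketched as below (“Dropout” is omitted because it requires a forward pass through the encoder). The function names, the replacement `rate`, and the toy vocabulary are hypothetical choices made for illustration.

```python
import random

def shuffle_augment(tokens, rng):
    """'Shuffle': return a randomly permuted copy of the token list."""
    out = list(tokens)
    rng.shuffle(out)
    return out

def random_token_replacement(tokens, trigger_positions, vocab, rng, rate=0.15):
    """'RTR': replace non-trigger tokens with random vocabulary items at the given rate."""
    out = list(tokens)
    for pos in range(len(out)):
        if pos in trigger_positions:
            continue  # never replace a trigger token
        if rng.random() < rate:
            out[pos] = rng.choice(vocab)
    return out

rng = random.Random(0)
sentence = ["he", "left", "the", "city"]
shuffled = shuffle_augment(sentence, rng)
replaced = random_token_replacement(sentence, {1}, ["a", "town", "she"], rng, rate=1.0)
```

Note that shuffling moves the trigger to a new position, so token-level labels have to be permuted together with the tokens, whereas RTR keeps every position and label in place.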

Figure 5: $F1_{micro}$ performance of each sub-task in Larger MAVEN benchmark.

| Method | 1-shot 1 | 1-shot 2 | 1-shot 3 | 1-shot 4 | 1-shot 5 | 1-shot $\bar{F1}_{micro}$ | 2-shot 1 | 2-shot 2 | 2-shot 3 | 2-shot 4 | 2-shot 5 | 2-shot $\bar{F1}_{micro}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fine-tune | 59.67 | 26.81 | 28.34 | 22.96 | 18.79 | 31.31 | 59.67 | 56.17 | 41.67 | 33.13 | 22.81 | 42.69 |
| Retrain | 59.67 | 42.34 | 33.33 | 29.04 | 28.25 | 38.53 | 59.67 | 44.68 | 37.73 | 38.70 | 40.98 | 44.35 |
| EWC | 59.67 | 35.95 | 28.22 | 15.79 | 16.17 | 31.16 | 59.67 | 55.68 | 47.96 | 36.10 | 26.92 | 45.27 |
| LwF | 59.67 | 5.28 | 24.63 | 27.11 | 30.82 | 29.50 | 59.67 | 36.72 | 34.07 | 28.94 | 28.71 | 37.62 |
| ICaRL | 52.29 | 36.71 | 34.18 | 31.06 | 25.77 | 36.00 | 52.29 | 41.38 | 34.44 | 33.47 | 29.19 | 38.15 |
| KCN | 59.67 | 39.10 | 43.19 | 41.97 | 38.18 | 44.42 | 59.67 | 54.40 | 50.67 | 49.98 | 47.58 | 52.46 |
| KT | 54.32 | 5.94 | 5.78 | 3.70 | 3.61 | 14.67 | 54.32 | 35.22 | 32.71 | 27.47 | 28.23 | 35.59 |
| EMP | 57.21 | 4.95 | 5.53 | 5.42 | 5.29 | 15.68 | 57.21 | 18.28 | 6.84 | 7.06 | 8.43 | 19.56 |
| HANet(Ours) | 67.16 | 45.54 | 38.28 | 42.39 | 40.40 | 46.75 | 67.16 | 55.87 | 50.35 | 51.63 | 51.39 | 55.28 |

Table 6: 2-way Continual Fewer-shot Event Detection Task in MAVEN benchmark.

| Method | 1-shot 1 | 1-shot 2 | 1-shot 3 | 1-shot 4 | 1-shot 5 | 1-shot $\bar{F1}_{micro}$ | 2-shot 1 | 2-shot 2 | 2-shot 3 | 2-shot 4 | 2-shot 5 | 2-shot $\bar{F1}_{micro}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fine-tune | 57.75 | 52.97 | 26.47 | 15.87 | 3.50 | 31.31 | 57.75 | 49.38 | 26.01 | 22.52 | 29.71 | 37.07 |
| Retrain | 57.75 | 43.16 | 30.16 | 31.69 | 28.36 | 38.22 | 57.75 | 48.91 | 33.86 | 36.97 | 35.01 | 42.50 |
| EWC | 57.75 | 45.09 | 25.37 | 16.18 | 6.51 | 30.18 | 57.75 | 50.60 | 23.87 | 13.90 | 25.46 | 34.32 |
| LwF | 57.75 | 37.50 | 18.31 | 7.97 | 6.37 | 25.58 | 57.75 | 44.00 | 16.72 | 16.29 | 29.45 | 32.84 |
| ICaRL | 54.68 | 45.96 | 27.08 | 25.29 | 22.34 | 35.07 | 54.68 | 43.81 | 32.89 | 33.12 | 28.49 | 38.60 |
| KCN | 57.75 | 54.13 | 40.71 | 43.97 | 26.52 | 44.61 | 57.75 | 51.37 | 36.83 | 34.66 | 40.40 | 44.20 |
| KT | 51.90 | 1.47 | 1.36 | 1.14 | 1.51 | 11.48 | 51.90 | 40.19 | 24.03 | 24.20 | 20.81 | 32.22 |
| EMP | 56.10 | 1.77 | 3.59 | 3.59 | 3.70 | 13.75 | 56.10 | 34.11 | 16.57 | 3.62 | 15.09 | 25.10 |
| HANet(Ours) | 60.99 | 51.93 | 41.67 | 41.54 | 35.84 | 46.40 | 60.99 | 58.38 | 39.48 | 41.76 | 44.60 | 49.04 |

Table 7: 2-way Continual Fewer-shot Event Detection Task in ACE benchmark.

4.8.   Evaluation in Extreme Scenarios

To validate the effectiveness of our method in various CFED applications, we conduct experiments under extreme conditions with more incremental tasks and fewer shots.

Larger CFED Task. We use the MAVEN benchmark and select the 100 most frequent types to build a 10-task 10-way setting. From the results in Figure 5, we conclude that existing methods cannot generalize well to larger CFED tasks, whereas HANet still maintains the best performance, showcasing strong continual learning ability in more practical situations.

Continual Fewer-shot Event Detection Task. To explore the minimum number of samples from which models can learn while maintaining good performance, we evaluate 2-way 1-shot and 2-way 2-shot settings. According to Tables 6 and 7, our method outperforms the other baselines on $\bar{F1}_{micro}$, indicating a better ability to utilize few-shot samples under severe conditions in CFED tasks.

4.9.   Capability of LLM in Solving Continual Few-shot Event Detection

Recently, there have been growing discussions (Chen et al., 2023; Wang et al., 2023) about the capabilities of Large Language Models (LLMs) on IE tasks. Though these LLMs demonstrate promising abilities to learn from few-shot samples, their performance on continual few-shot event detection remains to be discussed. In this section, we aim to evaluate the capability of ChatGPT in CFED settings. We conduct comparisons with gpt-3.5-turbo (https://api.openai.com/v1/chat/completions).

Following the Event Extraction Trigger instructions of Wang et al. (2023), we perform in-context learning with gpt-3.5-turbo (Ouyang et al., 2022), using few-shot samples selected from the training set as demonstrations. The original training set of $T_1$ contains 100 samples, from which we randomly select 1 or 2 every time a new test sample arrives. Specifically, at stage $C_t$, we conduct evaluations on $C_t^{test}$ by providing few-shot samples of each type in $C_t^{train}$. Detailed instructions and cases of gpt-3.5-turbo are shown in Appendix A.
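The per-sample demonstration selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function `sample_demonstrations`, the dictionary layout of the training data, and the toy sentences are all hypothetical.

```python
import random

def sample_demonstrations(train_by_type, shots, rng):
    """Return `shots` randomly chosen (sentence, trigger, type) triples per event type."""
    demos = []
    for event_type in sorted(train_by_type):
        pool = train_by_type[event_type]
        k = min(shots, len(pool))  # guard against types with fewer samples than shots
        for sentence, trigger in rng.sample(pool, k):
            demos.append((sentence, trigger, event_type))
    return demos

# Hypothetical toy training data for two event types.
train_by_type = {
    "Marry": [("They married in 2003.", "married"), ("The couple wed last year.", "wed")],
    "Transport": [("He left the city.", "left"), ("She flew to Paris.", "flew")],
}
rng = random.Random(0)
for test_sentence in ["He arrived home.", "They got married."]:
    # A fresh random draw is made for every incoming test sample.
    demos = sample_demonstrations(train_by_type, shots=1, rng=rng)
    print(test_sentence, demos)
```

Setting `shots` to 1 or 2 corresponds to the 1-shot and 2-shot demonstrations mentioned in the text.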

From the results in Table 4, we can observe that gpt-3.5-turbo fails to perform well on continual few-shot event detection tasks, and that our method outperforms it significantly.

5.   Related Work

5.1.   Traditional Event Detection

Impressive progress has been made in research related to traditional event detection by neural network-based methods (Chen et al., 2015; Nguyen and Grishman, 2015; Liu et al., 2017; Chen et al., 2018; Lu et al., 2019). These approaches greatly improved the performance on the ideal ED task. Nevertheless, they face considerable catastrophic forgetting and few-shot overfitting when handling continual event types with few samples, which seriously restricts their real-world applications.

5.2.   Continual Event Detection

The major challenge of Continual ED is to learn emerging tasks while avoiding forgetting previous tasks (McCloskey and Cohen, 1989; Ring, 1994; Thrun and Mitchell, 1995; Thrun, 1998). Cao et al. (2020) construct a replay-distillation method to preserve knowledge from the memory set and previous models. Besides replay and distillation, Yu et al. (2021) utilize an initialization method to transfer knowledge. Liu et al. (2022) adopt prompt learning to preserve previous knowledge. Although these works perform well on Continual ED, their abilities are limited when only few-shot samples are available.

5.3.   Few-shot Event Detection

Few-shot event detection aims to learn good representations from insufficient samples. Lai et al. (2020) propose two matching losses to provide cluster signals for few-shot learning. Deng et al. (2020) introduce a prototypical network with dynamic memory. Zhang et al. (2022a) design a hybrid contrastive learning approach. Zhao et al. (2022) align event types to FrameNet to obtain more instances for prototype calculation. Since these methods concentrate only on few-shot tasks with fixed types, they overlook the continual setting.

6.   Conclusions

In this paper, we focus on a more realistic yet challenging scenario of continual few-shot event detection, where the system is required to detect and classify events of continually emerging new types with limited labeled data. We propose Hierarchical Augmentation Networks (HANet). To alleviate catastrophic forgetting when memorizing previous event types, we incorporate prototypical augmentation to preserve previous knowledge with limited exemplars. We also devise a contrastive augmentation module to tackle overfitting when learning new event types. This module leverages valuable token information from the limited samples in incremental tasks. We conduct a series of experiments showing that our model performs well on continual few-shot event detection tasks, achieving state-of-the-art performance compared with previous baselines and ChatGPT.

7.   Limitations

Though our method performs well on the CFED task, some limitations remain: (1) Our method assumes that a fixed number of event types emerges in each task and that the shot number of each few-shot task is unchanged, which is still an idealized setting compared with real-world scenarios. (2) Though we propose space augmentation for prototypes in memory, the approach still requires extra storage space, which limits its application in some extreme scenarios. (3) Since our method performs well on event detection, it has the potential to be extended to other IE applications (e.g., Relation Extraction and Named Entity Recognition). We leave this as future work.

8.   Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2022ZD0160503) and the National Natural Science Foundation of China (No. 62176257). This work is also supported by the Youth Innovation Promotion Association CAS and the Yunnan Provincial Major Science and Technology Special Plan Projects (No. 202202AD080004).

9.   Bibliographical References

  • Ahn (2006) David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8.
  • Aljundi et al. (2018) Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) to forget. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III, pages 144–161, Berlin, Heidelberg. Springer-Verlag.
  • Black et al. (2022) Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. Gpt-neox-20b: An open-source autoregressive language model.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Cao et al. (2020) Pengfei Cao, Yubo Chen, Jun Zhao, and Taifeng Wang. 2020. Incremental event detection via knowledge consolidation networks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 707–717, Online. Association for Computational Linguistics.
  • Chen et al. (2016) Tianqi Chen, Ian J. Goodfellow, and Jonathon Shlens. 2016. Net2net: Accelerating learning via knowledge transfer. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Chen et al. (2023) Xuanting Chen, Junjie Ye, Can Zu, Nuo Xu, Rui Zheng, Minlong Peng, Jie Zhou, Tao Gui, Qi Zhang, and Xuanjing Huang. 2023. How robust is gpt-3.5 to predecessors? a comprehensive study on language understanding tasks. arXiv preprint arXiv:2303.00293.
  • Chen et al. (2015) Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 167–176, Beijing, China. Association for Computational Linguistics.
  • Chen et al. (2018) Yubo Chen, Hang Yang, Kang Liu, Jun Zhao, and Yantao Jia. 2018. Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1267–1276, Brussels, Belgium. Association for Computational Linguistics.
  • Deng et al. (2020) Shumin Deng, Ningyu Zhang, Jiaojian Kang, Yichi Zhang, Wei Zhang, and Huajun Chen. 2020. Meta-learning with dynamic-memory-based prototypical network for few-shot event detection. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 151–159.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Doddington et al. (2004) George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie M Strassel, and Ralph M Weischedel. 2004. The automatic content extraction (ace) program-tasks, data, and evaluation. In Lrec, volume 2, pages 837–840. Lisbon.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Jerfel et al. (2019) Ghassen Jerfel, Erin Grant, Tom Griffiths, and Katherine A Heller. 2019. Reconciling meta-learning and continual learning with online mixtures of tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Ke et al. (2021) Zixuan Ke, Bing Liu, Hu Xu, and Lei Shu. 2021. CLASSIC: Continual and contrastive learning of aspect sentiment classification tasks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6871–6883, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.
  • Lai et al. (2020) Viet Dac Lai, Thien Huu Nguyen, and Franck Dernoncourt. 2020. Extensively matching for few-shot learning event detection. In Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events, pages 38–45, Online. Association for Computational Linguistics.
  • Li and Hoiem (2017) Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947.
  • Liu et al. (2022) Minqian Liu, Shiyu Chang, and Lifu Huang. 2022. Incremental prompting: Episodic memory prompt for lifelong event detection. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2157–2165, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  • Liu et al. (2017) Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1789–1798, Vancouver, Canada. Association for Computational Linguistics.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  • Lu et al. (2019) Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2019. Distilling discrimination and generalization knowledge for event detection via delta-representation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4366–4376, Florence, Italy. Association for Computational Linguistics.
  • McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
  • Moiseev et al. (2022) Fedor Moiseev, Zhe Dong, Enrique Alfonseca, and Martin Jaggi. 2022. SKILL: Structured knowledge infusion for large language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1581–1588, Seattle, United States. Association for Computational Linguistics.
  • Mou et al. (2022) Yutao Mou, Keqing He, Yanan Wu, Zhiyuan Zeng, Hong Xu, Huixing Jiang, Wei Wu, and Weiran Xu. 2022. Disentangled knowledge transfer for OOD intent discovery with unified contrastive learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 46–53, Dublin, Ireland. Association for Computational Linguistics.
  • Nguyen et al. (2016) Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309, San Diego, California. Association for Computational Linguistics.
  • Nguyen and Grishman (2015) Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 365–371, Beijing, China. Association for Computational Linguistics.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010.
  • Ring (1994) Mark Bishop Ring. 1994. Continual Learning in Reinforcement Environments. Ph.D. thesis, University of Texas at Austin, USA. UMI Order No. GAX95-06083.
  • Ritter et al. (2018) Hippolyt Ritter, Aleksandar Botev, and David Barber. 2018. Online structured laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
  • Thrun (1998) Sebastian Thrun. 1998. Lifelong learning algorithms. Learning to learn, 8:181–209.
  • Thrun and Mitchell (1995) Sebastian Thrun and Tom M Mitchell. 1995. Lifelong robot learning. Robotics and autonomous systems, 15(1-2):25–46.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al. (2023) Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Junjie Ye, Qi Zhang, Tao Gui, et al. 2023. Instructuie: Multi-task instruction tuning for unified information extraction. arXiv preprint arXiv:2304.08085.
  • Yang and Mitchell (2016) Bishan Yang and Tom M. Mitchell. 2016. Joint extraction of events and entities within a document context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 289–299, San Diego, California. Association for Computational Linguistics.
  • Yu et al. (2021) Pengfei Yu, Heng Ji, and Prem Natarajan. 2021. Lifelong event detection with knowledge transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5278–5290, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Zhang et al. (2022a) Ruihan Zhang, Wei Wei, Xian-Ling Mao, Rui Fang, and Dangyang Chen. 2022a. HCL-TAT: A hybrid contrastive learning method for few-shot event detection with task-adaptive threshold. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1808–1819, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Zhang et al. (2022b) Yuwei Zhang, Haode Zhang, Li-Ming Zhan, Xiao-Ming Wu, and Albert Lam. 2022b. New intent discovery with pre-training and contrastive learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 256–269, Dublin, Ireland. Association for Computational Linguistics.
  • Zhao et al. (2022) Kailin Zhao, Xiaolong Jin, Long Bai, Jiafeng Guo, and Xueqi Cheng. 2022. Knowledge-enhanced self-supervised prototypical network for few-shot event detection. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6266–6275, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

10.   Language Resource References

  • Walker et al. (2006) Walker, Christopher and Strassel, Stephanie and Medero, Julie and Maeda, Kazuaki. 2006. ACE 2005 Multilingual Training Corpus . Linguistic Data Consortium (LDC), ISLRN 458-031-085-383-4. PID https://catalog.ldc.upenn.edu/LDC2006T06.
  • Wang et al. (2020) Wang, Xiaozhi and Wang, Ziqi and Han, Xu and Jiang, Wangyi and Han, Rong and Liu, Zhiyuan and Li, Juanzi and Li, Peng and Lin, Yankai and Zhou, Jie. 2020. MAVEN: A Massive General Domain Event Detection Dataset. Association for Computational Linguistics. PID https://aclanthology.org/2020.emnlp-main.129.

Appendix A. Instructions for large language models

In this section, we show gpt-3.5-turbo’s instructions and cases for the continual few-shot event detection task in Figure 6 and Figure 7. When learning new event types, we simply append new options and examples for these types as in-context learning prompts.
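The idea of appending new options and examples at each stage can be sketched as below. The class `PromptState` and the prompt template are hypothetical; the actual wording of the instructions is the one shown in Figure 6, which this sketch does not reproduce.

```python
class PromptState:
    """Accumulates candidate event types and demonstrations across stages."""

    def __init__(self):
        self.options = []   # event types seen so far, in order of arrival
        self.examples = []  # (sentence, trigger, event_type) demonstrations

    def add_stage(self, new_types, new_examples):
        """Append the types and demonstrations introduced by a new stage."""
        for event_type in new_types:
            if event_type not in self.options:
                self.options.append(event_type)
        self.examples.extend(new_examples)

    def render(self, test_sentence):
        """Build an in-context prompt (placeholder template, not the paper's)."""
        lines = ["Options: " + ", ".join(self.options)]
        for sentence, trigger, event_type in self.examples:
            lines.append(f"Text: {sentence}\nAnswer: {trigger} -> {event_type}")
        lines.append(f"Text: {test_sentence}\nAnswer:")
        return "\n".join(lines)

state = PromptState()
state.add_stage(["Marry"], [("They married in 2003.", "married", "Marry")])
state.add_stage(["Transport"], [("He left the city.", "left", "Transport")])
print(state.render("She flew to Paris."))
```

Because nothing is retrained, the prompt simply grows with each stage, so the model sees all previously introduced types as candidate options at test time.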

Figure 6: Instructions of gpt-3.5-turbo for CFED

Figure 7: Cases of gpt-3.5-turbo for CFED