MetaIE: Distilling a Meta Model from LLM for All Kinds of Information Extraction Tasks

Letian Peng, Zilong Wang, Feng Yao, Zihan Wang, **gbo Shang
University of California, San Diego
{lepeng, zlwang, fengyao, ziw224, jshang}@ucsd.edu
   Corresponding authors.
Abstract

Information extraction (IE) is a fundamental area in natural language processing where prompting large language models (LLMs), even with in-context examples, cannot defeat small LMs tuned on very small IE datasets. We observe that IE tasks, such as named entity recognition and relation extraction, all focus on extracting important information, which can be formalized as a label-to-span matching. In this paper, we propose a novel framework MetaIE to build a small LM as meta-model by learning to extract “important information”, i.e., the meta-understanding of IE, so that this meta-model can be adapted to all kind of IE tasks effectively and efficiently. Specifically, MetaIE obtains the small LM via a symbolic distillation from an LLM following the label-to-span scheme. We construct the distillation dataset via sampling sentences from language model pre-training datasets (e.g., OpenWebText in our implementation) and prompting an LLM to identify the typed spans of “important information”. We evaluate the meta-model under the few-shot adaptation setting. Extensive results on 13 datasets from 6 IE tasks confirm that MetaIE can offer a better starting point for few-shot tuning on IE datasets and outperform other meta-models from (1) vanilla language model pre-training, (2) multi-IE-task pre-training with human annotations, and (3) single-IE-task symbolic distillation from LLM. Moreover, we provide comprehensive analyses of MetaIE, such as the size of the distillation dataset, the meta-model architecture, and the size of the meta-model.111Code, datasets, and model checkpoints: https://github.com/KomeijiForce/MetaIE.

1 Introduction

Large language models (LLMs), such as ChatGPT (OpenAI, 2023), benefit from vast amount of training data and have demonstrated exceptional performance across various areas through in-context learning (ICL) (Dong et al., 2023). However, when it comes to information extraction (IE), LLMs, even with ICL examples, struggle to compete with smaller LMs (e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019)) fine-tuned on very small training sets (Peng et al., 2023; Wadhwa et al., 2023; Gao et al., 2024). This is usually regarded as a limitation of LLMs in following a specific extraction scheme (Xu et al., 2023). Meanwhile, it is worth mentioning that conducting auto-regressive inference with LLMs is expensive and time-consuming, hindering their application in conducting IE over large corpora.

We observe that IE tasks, such as named entity recognition (NER) and relation extraction (RE), all focus on extracting important information, which can be formalized as label-to-span instructions. Specifically, all IE tasks can be decomposed as several instructions such as “given an IE label (l𝑙litalic_l), extract a span from the input text” (Figure 1), where l𝑙litalic_l can be (1) Person, Location, Organization in NER to recognize entities or (2) Tom births at in RE to verify if there is a certain relation between two entities by checking the other entity can be recognized or not. Following these label-to-span instructions, LLMs can handle all kinds of IE tasks and return imperfect yet semantically reasonable answers. To this end, we argue that LLMs can be distilled into meta-models for IE which can quickly fine-tuned on few-shot training sets for better task-specific performance.

Refer to caption
Figure 1: An overview of different transfer learning schemes involved in the experiments.

In this paper, we propose a novel framework MetaIE to build a small LM as a meta-model by learning to extract “important information”, i.e., the meta-understanding of IE, and we show that this meta-model can be adapted to all kind of IE tasks effectively and efficiently. Some prior work have built meta-models for a specific IE tasks, e.g., UniversalNER (Zhou et al., 2023) explores the potential of building a meta-model for NER tasks. Our work is more ambitious at a larger scope for all IE tasks.

MetaIE obtains the small LM via a symbolic distillation (West et al., 2022) from an LLM following the label-to-span scheme. We construct the distillation dataset via sampling sentences from language model pre-training datasets and prompting an LLM to identify the typed spans of “important information”. In particular, we implement this idea with 100,000100000100,000100 , 000 sentences from the OpenWebText corpus (Gokaslan & Cohen, 2019), which contains various webpage texts and is also a subset of the popular language model pre-training dataset. We feed these sentences to GPT-3.5-turbo for identifying “important information”, which is then used to distill small LMs. It is worth mentioning that MetaIE is applicable to all types of small LMs and one only needs to convert the label-span pairs following the corresponding labeling scheme (e.g., BIO sequence labeling for encoders like RoBERTa, seq2seq labeling for encoder-decoders like BART).

Our evaluation focuses on the few-shot learning ability of the meta-model for different IE tasks. We mainly compare MetaIE with meta-models from (1) vanilla language model pre-training, (2) multi-IE-task pre-training with human annotations, and (3) single-IE-task symbolic distillation from LLM. Large-scale datasets for NER, RE, and event extraction (EE) tasks are used in single-IE-task and multi-IE-task pre-training, therefore, these datasets shall be considered as in-task-distributional for these two methods. For a more comprehensive evaluation, we further include out-of-task-distributional datasets from (1) semantic role labeling (SRL) (Carreras & Màrquez, 2005), (2) aspect-based sentiment analysis (ABSA) (Pontiki et al., 2014), and (3) aspect-sentiment triplet extraction (ASTE) (Xu et al., 2020), totaling 13 datasets across 6 IE tasks. In our experiments, MetaIE generally achieves the best performance, only very occasionally losing to task-specific distillation on some in-task-distributional datasets. This demonstrates that MetaIE is a strong and efficient method to distill the meta-understanding of IE from LLMs into small LMs. Remarkably, distilling from the LLM-produced dataset following the traditional human annotation schemes performs poorly. Therefore, the success of MetaIE, rather than from purely using LLMs, shall also come from our label-to-span scheme.

We have conducted comprehensive analyses of MetaIE. We study the scaling-up rules to investigate the model and dataset size boundaries in obtaining the meta-understanding of IE. We showcase the diversity of the types of important information in the MetaIE distillation dataset. We show that the RoBERTa with sequence labeling framework is the best meta-model architecture compared with sequence-to-sequence and decoder-only models, at a similar scale.

Our contributions are three-fold:

  • We are the first to build a small LM as a meta-model for all kinds of IE tasks.

  • We propose a novel label-to-span scheme that unifies all IE tasks and applies symbolic distillation to distill the meta-understanding from an LLM to a small LM.

  • We have a rigorous experiment design, which covers various IE tasks and meta-model methods. Comprehensive experiment results support the intuitive expectation and advantage of our MetaIE.

2 Related Works

2.1 Information Extraction

Information extraction (IE) is one of the most popular and vital domains in natural language processing. Early IE systems are generally developed for a single IE dataset like NER (dos Santos & Guimarães, 2015), RE (Katiyar & Cardie, 2016), or EE (Chen et al., 2015). Due to the gap between the label sets and annotation styles of different IE datasets, few-shot IE frameworks (Ding et al., 2021; Han et al., 2018; Ma et al., 2023) are proposed to quickly learn models on new datasets. The IE models are pre-trained on a large scale of IE labels and then transferred to the target domain by fine-tuning on few examples. With the emergence of LLMs, researchers have started to train LMs on multiple IE tasks with unified formats (Lu et al., 2022; Paolini et al., 2021). LLMs fine-tuned for general purpose (OpenAI, 2023; Touvron et al., 2023) have also shown strong potential to understand new IE tasks with their instruction-following ability. However, these LLMs still lag behind supervised models (Xu et al., 2023), potentially due to the difficulty of specifying the required pattern for extraction in different datasets. Moreover, the cost of LLMs limits their application to IE on a large corpus. This paper aims to transfer the meta-understanding of IE from LLMs to lighter-weight models, which produce a flexible model with high adaptability to any target IE task.

2.2 Model Distillation

Model distillation (Hinton et al., 2015; Gou et al., 2021) is the process of transferring knowledge from large models (teacher models) to small ones (student models). Traditional distillation optimizes the similarity between logits produced by the teacher and student models (Hinton et al., 2015; Kim et al., 2019; Mirzadeh et al., 2020). Symbolic distillation (West et al., 2022; Li et al., 2023; West et al., 2023) for language models learns a student model on texts generated by the teacher model. In comparison with traditional distillation, symbolic distillation allows the student model to focus on one aspect of the teacher model (West et al., 2022), which can be some high-level ability, such as chain-of-thought reasoning (Li et al., 2023), with much smaller model size. For IE, symbolic model distillation has been successfully applied for an IE subtask, NER (Zhou et al., 2023), which distills an NER model that can extract entities in a broad domain. This paper aims to distill the cross-IE task ability of LLMs, i.e., meta-understanding of IE and proposes a meta-model that can effectively learn IE tasks with few examples.

2.3 Meta Learning

Meta-learning (Finn et al., 2017b) enables the models to learn new tasks better, i.e., stronger transfer learning ability. MAML (Finn et al., 2017a) proposes a framework to learn a better starting point for few-shot learning by utilizing multiple datasets for loss updating. Reptile (Nichol et al., 2018), similar to MAML, simplifies the meta-learning algorithm by performing stochastic gradient descent not only within each task but also across tasks, making it more efficient and easier to implement. The Prototypical Networks method (Snell et al., 2017) employs a distance-based classification approach, where it learns a metric space in which classification can be performed by computing distances to prototype representations of each class. While most meta-learning methods are experimented on classification tasks, pre-training on multiple datasets (Ding et al., 2021) and prototypical networks (Ji et al., 2022) have been applied for IE. While these methods focus on specific IE tasks like NER, we aim to optimize a starting point for general IE tasks by distilling from LLMs.

3 Our MetaIE Framework

3.1 Label-to-span Scheme

We formalize the IE task as given an IE label l𝑙litalic_l (e.g., Person in NER), extracting a span s𝑠sitalic_s from a sentence X=[x1,,xn]𝑋subscript𝑥1subscript𝑥𝑛X=[x_{1},\cdots,x_{n}]italic_X = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ]. The span s𝑠sitalic_s can be represented as xi:jsubscript𝑥:𝑖𝑗x_{i:j}italic_x start_POSTSUBSCRIPT italic_i : italic_j end_POSTSUBSCRIPT including the words from i𝑖iitalic_i-th to j𝑗jitalic_j-th. Denoting the IE process as a map** fIE()subscript𝑓𝐼𝐸f_{IE}(\cdot)italic_f start_POSTSUBSCRIPT italic_I italic_E end_POSTSUBSCRIPT ( ⋅ ), it can be represented as s=fIE(X|l)𝑠subscript𝑓𝐼𝐸conditional𝑋𝑙s=f_{IE}(X|l)italic_s = italic_f start_POSTSUBSCRIPT italic_I italic_E end_POSTSUBSCRIPT ( italic_X | italic_l ). Machine learning-based methods aim to learn the map** by optimizing a model Mθsubscript𝑀𝜃M_{\theta}italic_M start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT with parameter θ𝜃\thetaitalic_θ. For a specific IE task (e.g., NER), the IE label set (Task)superscript𝑇𝑎𝑠𝑘\mathcal{L}^{(Task)}caligraphic_L start_POSTSUPERSCRIPT ( italic_T italic_a italic_s italic_k ) end_POSTSUPERSCRIPT will contain l𝑙litalic_l falling inside the task label, i.e., (l(Task))𝑙superscript𝑇𝑎𝑠𝑘(l\in\mathcal{L}^{(Task)})( italic_l ∈ caligraphic_L start_POSTSUPERSCRIPT ( italic_T italic_a italic_s italic_k ) end_POSTSUPERSCRIPT ). Based on the general definition of IE, the general IE label set (IE)superscript𝐼𝐸\mathcal{L}^{(IE)}caligraphic_L start_POSTSUPERSCRIPT ( italic_I italic_E ) end_POSTSUPERSCRIPT can be any textual description, thus Task,(Task)(IE)for-allTasksuperscript𝑇𝑎𝑠𝑘superscript𝐼𝐸\forall\text{Task},\mathcal{L}^{(Task)}\subset\mathcal{L}^{(IE)}∀ Task , caligraphic_L start_POSTSUPERSCRIPT ( italic_T italic_a italic_s italic_k ) end_POSTSUPERSCRIPT ⊂ caligraphic_L start_POSTSUPERSCRIPT ( italic_I italic_E ) end_POSTSUPERSCRIPT.

In this paper, we aim to learn a meta-model that can be easily adapted to different IE tasks. In the current practice of IE, the “meta-model” is generally pre-trained in a single IE task with a large number of labels ((pt)(Task)superscript𝑝𝑡superscript𝑇𝑎𝑠𝑘\mathcal{L}^{(pt)}\subset\mathcal{L}^{(Task)}caligraphic_L start_POSTSUPERSCRIPT ( italic_p italic_t ) end_POSTSUPERSCRIPT ⊂ caligraphic_L start_POSTSUPERSCRIPT ( italic_T italic_a italic_s italic_k ) end_POSTSUPERSCRIPT). Then, the meta-model can be fine-tuned on few-shot examples to quickly adapt to different downstream IE datasets in the same task, such that (ft)(Task)superscript𝑓𝑡superscript𝑇𝑎𝑠𝑘\mathcal{L}^{(ft)}\subset\mathcal{L}^{(Task)}caligraphic_L start_POSTSUPERSCRIPT ( italic_f italic_t ) end_POSTSUPERSCRIPT ⊂ caligraphic_L start_POSTSUPERSCRIPT ( italic_T italic_a italic_s italic_k ) end_POSTSUPERSCRIPT. We expand this learning scheme to a general meta-model that works for all existing and potentially new IE tasks. To achieve this goal, our intuition is to pre-train the model to learn the label-to-span map** with the label set approximating the general IE label distribution (pt)(IE)similar-tosuperscript𝑝𝑡superscript𝐼𝐸\mathcal{L}^{(pt)}\sim\mathcal{L}^{(IE)}caligraphic_L start_POSTSUPERSCRIPT ( italic_p italic_t ) end_POSTSUPERSCRIPT ∼ caligraphic_L start_POSTSUPERSCRIPT ( italic_I italic_E ) end_POSTSUPERSCRIPT. As the label sets of all IE tasks are subsets of (IE)superscript𝐼𝐸\mathcal{L}^{(IE)}caligraphic_L start_POSTSUPERSCRIPT ( italic_I italic_E ) end_POSTSUPERSCRIPT, our meta-model will enjoy an efficient transfer to all IE tasks.

3.2 Distillation Dataset Construction

Refer to caption
Figure 2: The prompt used in our experiments to build the dataset for symbolic distillation.

To apply a symbolic distillation of the meta-understanding of IE from LLMs, we prompt LLMs to create data for distillation by querying them to extract “important information” from texts as shown in Figure 2. Our expectation for the dataset is to cover as many l𝑙litalic_l as possible to approximate the broad (IE)superscript𝐼𝐸\mathcal{L}^{(IE)}caligraphic_L start_POSTSUPERSCRIPT ( italic_I italic_E ) end_POSTSUPERSCRIPT set to better distill the meta-model for all kinds of IE tasks. We query LLMs to annotate some raw corpora 𝒳𝒳\mathcal{X}caligraphic_X to build the MetaIE dataset. Given each X𝒳𝑋𝒳X\in\mathcal{X}italic_X ∈ caligraphic_X, the LLM is instructed to generate a series of (l,s)𝑙𝑠(l,s)( italic_l , italic_s ) pairs. We do not set any limitation to l𝑙litalic_l to better approximate the broad (IE)superscript𝐼𝐸\mathcal{L}^{(IE)}caligraphic_L start_POSTSUPERSCRIPT ( italic_I italic_E ) end_POSTSUPERSCRIPT set.

Implementation

We select the paragraphs from OpenWebText (Gokaslan & Cohen, 2019), Since OpenWebText it is a popular dataset used in language model pre-training, we are not introducing new texts. We split the paragraphs by sentences and only use the first sentence of each paragraph for a higher diversity and to avoid the ambiguity caused by coreference. The LLM is instructed to formalize all (l,s)𝑙𝑠(l,s)( italic_l , italic_s ) pairs in the prompting output as “- Place (l)𝑙(l)( italic_l ): New York (s)𝑠(s)( italic_s )”, which are extracted by regular expression matching. Considering there might be multiple spans returned for l𝑙litalic_l, we split the span by conjunctions like comma.

n𝑛nitalic_n-gram (Count) Example IE Labels (Relative Frequency)
1111-gram (270k) Location (7.73%), Event (4.67%), Action (4.24%), Topic (3.57%), Subject (3.25%), Person (2.71%), Date (2.70%), Source (2.44%)
2222-gram (44.5k) Target audience (1.27%), Time period (0.998%), Individuals involved (0.992%), Action taken (0.877%), Political affiliation (0.762%), Parties involved (0.758%), Release date (0.697%), TV show (0.686%)
3333-gram (16.9k) Source of information (2.02%), Cause of death (1.17%), Call to action (1.02%), Date of birth (0.739%), Date and time (0.727%), Date of death (0.562%), Type of content (0.337%), Reason for arrest (0.325%)
4444-gram (7.39k) Purpose of the bill (0.325%), Location of the incident (0.271%), Name of the person (0.271%), Number of people killed (0.203%), Number of people affected (0.189%), Content of the bill (0.162%), Number of people arrested (0.149%), Source of the information (0.149%)
5absent5\geq 5≥ 5-gram (5.37k) Dates of birth and death (0.13%), Age at the time of death (0.112%), Total number of votes cast (0.0931%), Feature: Auschwitz through the Lens of the SS (0.0931%), Number of people on board (0.0745%), Name of the person involved (0.0745%), Date and time of publication (0.0745%), Action taken by President Obama (0.0745%)
Table 1: Example IE Labels, Counts, and Relative Frequency in our constructed symbolic distillation dataset, grouped by the number of tokens.

Table 1 shows some statistics and example results of the labels returned by the LLM, illustrating a broad spectrum of IE domains, ranging from simple entities and events to complex relationships and contexts. The diversity in the n𝑛nitalic_n-gram categories showcases the model’s ability to capture a wide array of query types. This variety underscores the comprehensive coverage and nuanced understanding that LLMs bring to the task of generating queries across different facets of the IE domain.

3.3 Distillation Framework

We illustrate the distillation with a sequence labeling model (dos Santos & Guimarães, 2015) that suits well for encoder-based language models (e.g., RoBERTa (Liu et al., 2019)). Given a sequence of words X=[x1,,xn]𝑋subscript𝑥1subscript𝑥𝑛X=[x_{1},\cdots,x_{n}]italic_X = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ], the sequence labeling model will tag each word by outputting Y=[y1,,yn]𝑌subscript𝑦1subscript𝑦𝑛Y=[y_{1},\cdots,y_{n}]italic_Y = [ italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ]. Following the traditional BIO labeling scheme, yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT will be B𝐵Bitalic_B (begin), I𝐼Iitalic_I (inner), and O𝑂Oitalic_O (none). The model is trained on word tagging and the tags are decoded into spans by searching sequences that begin with B𝐵Bitalic_B and continue by I𝐼Iitalic_I. In traditional sequence labeling models, the B𝐵Bitalic_B and I𝐼Iitalic_I tags generally consist of label information such as B𝐵Bitalic_B-place or I𝐼Iitalic_I-person. In our case, we formalize the tagging in a query-dependent way since the model needs to handle arbitrary queries. We attach the label information as a prefix like “place: ” to the beginning of the input text. The input text is then labeled by the BIO scheme, where the span label is indicated in the prefix. Finally, the BIO sequences are used to fine-tune the sequence labeling models. This distillation process can also be adapted to Seq2Seq encoder-decoder models and Causal LM-based decoder-only models. We use sequence labeling models for the main experiment based on their empirical advantage in IE tasks, which we also empirically find support in the analysis in Section 5.2.

4 Experiments

4.1 IE Tasks and Datasets

To deeply delve into the differences between different model distillation or meta-learning methods, we include a wide variety of tasks:

  1. 1.

    Named Entity Recognition (NER) extracts named entities with their labels from texts. We include 6666 NER datasets that was studied in Ushio & Camacho-Collados (2021), i.e., (1) CoNLL2003, (2) BioNLP2004, (3) WNUT2017, (4) MIT-Movie, (5) MIT-Restaurant, (6) BC5CDR, which covers various domains: news, medical, social media, and reviews.

  2. 2.

    Relation Extraction (RE) extracts named entities, and in addition, identifies the relationships between them. We include 2222 popular datasets, (1) ADE (Gurulingappa et al., 2012) and (2) CoNLL2004 (Carreras & Màrquez, 2004) representing RE on medical and news domain. We evaluate the performance of RE models on both relation detection and the detection of entities involved in the relations.

  3. 3.

    Event Extraction (EE) extracts event triggers and their arguments. We use the standard ACE2005 dataset (Walker et al., 2006) for EE evaluation. We compare the model performance on both event trigger detection (T) evaluation task and trigger-augment pair detection (A) evaluation task.

  4. 4.

    Semantic Role Labeling (SRL) extracts predicates (verbs) and their arguments. We select the CoNLL2005 (Carreras & Màrquez, 2005) dataset for SRL. We follow previous works to learn backbone LMs on samples from the Brown training dataset and then test them on Brown and WSJ test datasets.

  5. 5.

    Aspect-based Sentiment Analysis (ABSA) extracts aspect terms and the sentiment polarity towards them. We select SemEval2014 (Pontiki et al., 2014) as the dataset for ABSA, with its two subsets: 14res and 14lap including reviews about restaurants and laptops.

  6. 6.

    Aspect Sentiment Triplet Extraction (ASTE) extracts aspect terms and the corresponding opinion terms that contain the sentiment polarity towards them. We use the same SemEval2014 dataset as for ABSA, on which aspect-sentiment triplets are further annotated by Xu et al. (2020).

For a fair comparison, we formalize all those tasks as s=fIE(X|l)𝑠subscript𝑓𝐼𝐸conditional𝑋𝑙s=f_{IE}(X|l)italic_s = italic_f start_POSTSUBSCRIPT italic_I italic_E end_POSTSUBSCRIPT ( italic_X | italic_l ), which can be found in the Appendix A. For each task, we query each possible label to extract (l,s)𝑙𝑠(l,s)( italic_l , italic_s ) pairs. For spans conflicting with each other, as we run label-wise extractions, we only keep the one with a higher BI sequence probability. For tasks that extractions are dependent on each other (e.g., RE, EE, SRL, ASTE), we follow Paolini et al. (2021) to run multi-stage extractions for these tasks. As ACE2005 involves too many labels, we report the unlabeled performance on detecting the triggers and arguments for all methods for comparison.

4.2 Evaluation Metric: Few-shot Fine-tuning Performance

We use the few-shot fine-tuning performance on all IE tasks to evaluate the meta-model’s quality. Specifically, all methods in our evaluation will provide us a backbone LM. We then conduct few-shot fine-tuning from the training dataset for fine-tuning with sample details in Appendix B. Finally, we evaluate them on the test dataset using the micro F1 score as the evaluation metric. For multi-task pre-training baselines, tasks without large-scale annotations (SRL, ABSA, ASTE) are out-of-distribution tasks.

The default backbone LM we used for fine-tuning is RoBERTa-Large (Liu et al., 2019), which is a traditional bidirectional encoder used for learning IE tasks formalized as sequence tagging. The learning rate is set to 2×1052superscript1052\times 10^{-5}2 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT with AdamW (Loshchilov & Hutter, 2019) as the optimizer and a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017). We fine-tune the backbone LM with batch size 64646464 for a single epoch to avoid overfitting.

4.3 Compared Methods

We first include a comparison with the teacher model GPT-3.5-turbo via LLM Prompting with in-context learning (ICL). For ICL, we provide 5555 examples in the prompt of our query. Based on previous discoveries on LLM-based IE (Peng et al., 2023; Wadhwa et al., 2023; Gao et al., 2024), we shall expect that fine-tuned small LMs work better than the LLM.

We compare our MetaIE with a variety of methods from the following three categories

  1. 1.

    Vanilla LM fine-tuning (FT), i.e., directly using the vanilla pre-trained LM as the backbone LM in fine-tuning.

  2. 2.

    Task-level Meta-learning (ML)+FT. It is expected to have a strong performance to other datasets in the same IE task but poor generalization to other IE tasks.

    • Transfer (Human) is a baseline that trains the backbone LM on large-scale human annotations of a specific IE task. Specifically, we use FewNerd (Ding et al., 2021) for NER, FewRels (Han et al., 2018) for RE, and FewEvents (Ma et al., 2023) for EE.

    • Transfer (LLM) uses the same datasets in Transfer (Human) but queries the LLM to annotate them following the human workflow. This baseline aims to compare the quality of annotation from humans and LLMs following the conventional annotation schema.

    • Task Distillation distills from LLMs by querying answers for specific IE tasks. We implement this by providing in-context task-specific examples to control the LLM-produced data similar to the label IE task. The input texts are set to be the same as MetaIE to avoid bias.

    • NER Distillation applies the model distilled following Task Distillation but tests them on non-NER tasks to evaluate its cross-task transferability.

  3. 3.

    IE-level Meta-learning (ML)+FT aims to learn an IE model with strong transferability to all IE tasks. Our MetaIE also falls into this category.

    • MultiIE merges the multiple human-annotated IE datasets (FewNerd, FewRels, FewEvents) to train a backbone LM, which represents a multi-task baseline with human annotations.

    • MAML (Finn et al., 2017a) is a traditional meta-learning baseline that merges gradients on different datasets to build a model that can be quickly transferred to these datasets. We use the datasets in MultiIE for MAML in the experiment.

For all baselines, the data number for meta-learning is controlled to the same as MetaIE by sampling towards a fair comparison.

4.4 Result

Category Method NER
ConLL2003200320032003 BioNLP2004200420042004 WNUT2017201720172017 MIT-Movie MIT-Restaurant BC5CDR
LLM Prompting ICL 59.6859.6859.6859.68 48.0848.0848.0848.08 36.5136.5136.5136.51 46.0846.0846.0846.08 60.6260.6260.6260.62 59.8259.8259.8259.82
FT Vanilla 32.5832.5832.5832.58 36.0636.0636.0636.06 33.8733.8733.8733.87 57.6557.6557.6557.65 63.4063.4063.4063.40 18.1518.1518.1518.15
Task-level ML+FT Transfer
    Human 71.6171.6171.6171.61 54.5854.5854.5854.58 43.1543.1543.1543.15 64.80 69.1769.1769.1769.17 72.0272.0272.0272.02
    LLM 67.7467.7467.7467.74 45.6245.6245.6245.62 45.3645.3645.3645.36 59.5959.5959.5959.59 69.1969.1969.1969.19 73.1473.1473.1473.14
Task Distillation 74.86 56.18 50.09 65.70 71.48 71.0171.0171.0171.01
IE-level ML+FT MultiIE 63.9463.9463.9463.94 52.4752.4752.4752.47 44.2944.2944.2944.29 58.4358.4358.4358.43 69.3869.3869.3869.38 71.2071.2071.2071.20
MAML 66.9766.9766.9766.97 53.0953.0953.0953.09 46.1446.1446.1446.14 60.5760.5760.5760.57 68.8668.8668.8668.86 72.5872.5872.5872.58
MetaIE 71.4971.4971.4971.49 55.76 44.3344.3344.3344.33 65.64 71.33 75.21
Category Method RE (NER) RE EE
ADE CoNLL2004200420042004 ADE CoNLL2004200420042004 ACE2005200520052005 (T) ACE2005200520052005 (A)
LLM Prompting ICL 63.5563.5563.5563.55 58.4758.4758.4758.47 39.0239.0239.0239.02 31.3431.3431.3431.34 60.4760.4760.4760.47 28.7928.7928.7928.79
FT Vanilla 25.9725.9725.9725.97 62.1362.1362.1362.13 15.6715.6715.6715.67 33.5233.5233.5233.52 67.4667.4667.4667.46 32.8632.8632.8632.86
Task-level ML+FT Transfer
    Human 41.5641.5641.5641.56 69.27 20.5320.5320.5320.53 37.5137.5137.5137.51 72.79 35.7735.7735.7735.77
    LLM 35.4335.4335.4335.43 66.9366.9366.9366.93 14.3514.3514.3514.35 35.0735.0735.0735.07 65.1765.1765.1765.17 34.8634.8634.8634.86
Task Distillation 66.9966.9966.9966.99 68.6668.6668.6668.66 41.92 41.5841.5841.5841.58 67.3467.3467.3467.34 34.5634.5634.5634.56
NER Distillation 67.3567.3567.3567.35 69.88 32.7332.7332.7332.73 35.6835.6835.6835.68 66.1766.1766.1766.17 32.8632.8632.8632.86
IE-level ML+FT MultiIE 53.2653.2653.2653.26 69.14 18.2318.2318.2318.23 39.6539.6539.6539.65 71.1671.1671.1671.16 35.2335.2335.2335.23
MAML 56.9556.9556.9556.95 69.28 38.6538.6538.6538.65 42.0742.0742.0742.07 68.2268.2268.2268.22 35.8435.8435.8435.84
MetaIE 69.29 69.47 40.43 43.50 69.8569.8569.8569.85 36.83
Category Method SRL ABSA ASTE
Brown WSJ 14RES 14LAP 14RES 14LAP
LLM Prompting ICL 28.7928.7928.7928.79 31.5631.5631.5631.56 53.0453.0453.0453.04 35.6235.6235.6235.62 58.9458.9458.9458.94 44.8744.8744.8744.87
FT Vanilla 52.5952.5952.5952.59 56.4756.4756.4756.47 24.4624.4624.4624.46 10.3210.3210.3210.32 39.1739.1739.1739.17 41.5041.5041.5041.50
Task-level ML+FT NER Distillation 43.6543.6543.6543.65 51.2951.2951.2951.29 10.7710.7710.7710.77 11.2111.2111.2111.21 40.0640.0640.0640.06 38.4038.4038.4038.40
IE-level ML+FT MultiIE 52.2652.2652.2652.26 56.6356.6356.6356.63 38.2238.2238.2238.22 35.2835.2835.2835.28 24.9124.9124.9124.91 40.4940.4940.4940.49
MAML 52.6952.6952.6952.69 56.2356.2356.2356.23 40.2240.2240.2240.22 34.4534.4534.4534.45 30.8330.8330.8330.83 40.9540.9540.9540.95
MetaIE 54.50 58.49 50.96 39.71 43.30 43.10
Table 2: Few-shot transferring performance (F1 score) of different meta-learning sources on IE tasks. Bold: Performance of the small LM that is not significantly different from the best one. (p<0.05)𝑝0.05(p<0.05)( italic_p < 0.05 )

The result from our experiments is presented in Table 2. The vanilla model is poorly transferred by fine-tuning to all kinds of IE tasks. The model with meta-learning on a single IE task, NER, is only well-transferred to other NER datasets but poorly-transferred to other IE tasks. Among IE-level meta-learning methods, the MultiIE model can be transferred to in-domain IE tasks with outstanding performance but still fails to be transferred to out-of-domain IE tasks, either with regular pre-training or meta-learning frameworks like MAML. In contrast to all these baselines, our MetaIE shows a strong transferability to all IE tasks, especially on out-of-domain tasks for MultiIE. Thus, the experiment results are highly consistent with our claim in IE task transferability that wider pre-training label set (IE)superscript𝐼𝐸\mathcal{L}^{(IE)}caligraphic_L start_POSTSUPERSCRIPT ( italic_I italic_E ) end_POSTSUPERSCRIPT will enable macro transferability of the model to all IE tasks.

Besides the main discovery, we can also observe that LLM-based meta-learning outperforms the pre-training on human annotation. Take NER as an instance, while both label sets satisfy (NER)superscript𝑁𝐸𝑅\mathcal{L}\subset\mathcal{L}^{(NER)}caligraphic_L ⊂ caligraphic_L start_POSTSUPERSCRIPT ( italic_N italic_E italic_R ) end_POSTSUPERSCRIPT, the \mathcal{L}caligraphic_L proposed by LLMs is much more diverse than the fixed set in human annotated datasets, which again verifies the importance of the label distribution, even in task-specific distillation.

The comparison with the teacher model also shows the student model generally outperforming the teacher model under few-shot supervision. Thus, we conclude fine-tuning a distilled student IE model to perform better than inference by the teacher LLMs with few-shot in-context examples. This further verifies the advantage of model distillation in meta-learning which enables more efficient and effective transfer.

5 Further Analysis

5.1 Size Analysis

We explore how the scale of the student model or the data number affects the distillation quality. For the model scale, we compare among RoBERTa-Small, RoBERTa-Base, and RoBERTa-Large. For the data scale, we increase the sampling size to 640K640𝐾640K640 italic_K and pre-train the student model with different amounts of data.

Refer to caption
Figure 3: The size analysis of the student model scale on different IE tasks and domains.
Refer to caption
Figure 4: The size analysis of the distillation data scale on different IE tasks and domains.

The analysis of model size is presented in Figure 3, we can observe the performance of a student model can be scaled up by more parameters. Also, for simple tasks (like NER) with a general domain (like CoNLL2004), a tiny student model is competent for the distillation. However, for specific domains or complex tasks, the student model needs more parameters for generalization.

The analysis of data size is presented in Figure 4, we observe the existence of a threshold between 80K160Ksimilar-to80𝐾160𝐾80K\sim 160K80 italic_K ∼ 160 italic_K to endow the student model with the meta-understanding of IE. Also, a small amount of meta data like 10K10𝐾10K10 italic_K can significantly benefit the transferring.

5.2 Distillation Framework Comparison

We compare student models following different distillation frameworks (because of their architectures) to investigate how this factor affects the distillation effectiveness.

Seq2Seq implements the distillation by learning to extract a group of spans based on the IE label as in the distillation dataset. We include two Seq2Seq models: BART-Large (Lewis et al., 2020) and T5-Base (Raffel et al., 2020), which contain the same scale of parameters as in the RoBERTa-Large in our previous experiments.

CausalLM is similar to Seq2Seq but only uses the decoder model instead of the encoder-decoder as in Seq2Seq. We also include two CausalLM-based models with similar parameter scales: GPT2-Medium (Brown et al., 2020) and OPT-350M (Zhang et al., 2022).

We also include another sequence labeling model BERT-Large-Cased (Devlin et al., 2019) as a baseline to explore the influence of the backbone model quality on the learning performance. For all models, we pre-train them using our MetaIE dataset with the same hyperparameters.

Framework Model ConLL2003200320032003 BioNLP2004200420042004 WNUT2017201720172017 MIT-Movie MIT-Restaurant BC5CDR
Seq-Labeling BERT 63.0163.0163.0163.01 52.3952.3952.3952.39 32.7132.7132.7132.71 61.7561.7561.7561.75 62.5062.5062.5062.50 66.2466.2466.2466.24
RoBERTa 71.49 54.88 44.3344.3344.3344.33 65.64 71.33 75.21
Seq2Seq BART 71.39 47.1847.1847.1847.18 46.74 62.7662.7662.7662.76 67.9867.9867.9867.98 65.9065.9065.9065.90
T5 64.0164.0164.0164.01 42.3542.3542.3542.35 40.7440.7440.7440.74 55.0555.0555.0555.05 53.6053.6053.6053.60 38.6738.6738.6738.67
CausalLM GPT 57.2057.2057.2057.20 37.2937.2937.2937.29 36.8936.8936.8936.89 52.1452.1452.1452.14 60.4660.4660.4660.46 61.0361.0361.0361.03
OPT 52.3952.3952.3952.39 37.6437.6437.6437.64 34.4834.4834.4834.48 53.0753.0753.0753.07 53.5953.5953.5953.59 52.8652.8652.8652.86
Table 3: Comparison between different frameworks on MetaIE distillation.

We compare the performance of different distillation frameworks on NER as an example and the result is demonstrated in Table 3. Sequence labeling models perform the best in few-shot transfer learning, which indicates their advantage in the distillation of meta-understanding of IE. This can be attributed to the consistency of sequence labeling with the extraction nature. We thus conclude distilling IE knowledge to a traditional sequence labeling model is better than those popular generative models. Between sequence labeling models, RoBERTa outperforms BERT, showing a better student model also benefits the distillation procedure.

6 Limitation Discussion

Efficiency The efficiency of the unified label-to-span will be O(|(Task)|)𝑂superscript𝑇𝑎𝑠𝑘O(|\mathcal{L}^{(Task)}|)italic_O ( | caligraphic_L start_POSTSUPERSCRIPT ( italic_T italic_a italic_s italic_k ) end_POSTSUPERSCRIPT | ), which is lower than the traditional O(1)𝑂1O(1)italic_O ( 1 ) (number of LM forwarding) BIO sequence labeler with label information in the labeling result. This will limit the application of our model to cases where |(Task)|superscript𝑇𝑎𝑠𝑘|\mathcal{L}^{(Task)}|| caligraphic_L start_POSTSUPERSCRIPT ( italic_T italic_a italic_s italic_k ) end_POSTSUPERSCRIPT | is large. This efficiency is a trade-off for the ability to process any IE label, which enables the fast transfer of the BIO model to different IE tasks.

Bias in LLM-proposed labels As pointed out in previous works (Gallegos et al., 2023; Fang et al., 2023), LLMs have biases in their responses. This can also be observed in the statistics of our distillation dataset. Thus, the small meta-model might also inherit the bias and have better transferability to labels that LLMs prefer than others.

7 Conclusions and Future Work

This paper presents a novel approach for distilling the meta-understanding of IE from LLMs into more efficient, smaller language models through a synthesized dataset, MetaIE. Our findings indicate that this method not only enhances the adaptability and efficiency of smaller models but also outperforms existing single-task and multi-task distillation methods in various IE tasks. The success of MetaIE underscores the potential of leveraging LLM’s meta-understanding to improve the performance and versatility of smaller models in complex tasks, offering a promising direction for future research in model distillation and IE. Future work will explore a better way for meta-learning by distilling from LLMs and other meta-tasks can be trained based on distillation.

References

  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
  • Carreras & Màrquez (2004) Xavier Carreras and Lluís Màrquez. Introduction to the conll-2004 shared task: Semantic role labeling. In Hwee Tou Ng and Ellen Riloff (eds.), Proceedings of the Eighth Conference on Computational Natural Language Learning, CoNLL 2004, Held in cooperation with HLT-NAACL 2004, Boston, Massachusetts, USA, May 6-7, 2004, pp.  89–97. ACL, 2004. URL https://aclanthology.org/W04-2412/.
  • Carreras & Màrquez (2005) Xavier Carreras and Lluís Màrquez. Introduction to the conll-2005 shared task: Semantic role labeling. In Ido Dagan and Daniel Gildea (eds.), Proceedings of the Ninth Conference on Computational Natural Language Learning, CoNLL 2005, Ann Arbor, Michigan, USA, June 29-30, 2005, pp.  152–164. ACL, 2005. URL https://aclanthology.org/W05-0620/.
  • Chen et al. (2015) Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Bei**g, China, Volume 1: Long Papers, pp.  167–176. The Association for Computer Linguistics, 2015. doi: 10.3115/V1/P15-1017. URL https://doi.org/10.3115/v1/p15-1017.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp.  4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.
  • Ding et al. (2021) Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. Few-nerd: A few-shot named entity recognition dataset. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp.  3198–3213. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.ACL-LONG.248. URL https://doi.org/10.18653/v1/2021.acl-long.248.
  • Dong et al. (2023) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, **g**g Xu, Lei Li, and Zhifang Sui. A survey on in-context learning, 2023.
  • dos Santos & Guimarães (2015) Cícero Nogueira dos Santos and Victor Guimarães. Boosting named entity recognition with neural character embeddings. In Xiangyu Duan, Rafael E. Banchs, Min Zhang, Haizhou Li, and A. Kumaran (eds.), Proceedings of the Fifth Named Entity Workshop, NEWS@ACL 2015, Bei**g, China, July 31, 2015, pp.  25–33. Association for Computational Linguistics, 2015. doi: 10.18653/V1/W15-3904. URL https://doi.org/10.18653/v1/W15-3904.
  • Fang et al. (2023) Xiao Fang, Shangkun Che, Minjia Mao, Hongzhe Zhang, Ming Zhao, and Xiaohang Zhao. Bias of ai-generated content: An examination of news produced by large language models. CoRR, abs/2309.09825, 2023. doi: 10.48550/ARXIV.2309.09825. URL https://doi.org/10.48550/arXiv.2309.09825.
  • Finn et al. (2017a) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp.  1126–1135. PMLR, 2017a. URL http://proceedings.mlr.press/v70/finn17a.html.
  • Finn et al. (2017b) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp.  1126–1135. PMLR, 2017b. URL http://proceedings.mlr.press/v70/finn17a.html.
  • Gallegos et al. (2023) Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md. Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: A survey. CoRR, abs/2309.00770, 2023. doi: 10.48550/ARXIV.2309.00770. URL https://doi.org/10.48550/arXiv.2309.00770.
  • Gao et al. (2024) Jun Gao, Huan Zhao, Wei Wang, Changlong Yu, and Ruifeng Xu. Eventrl: Enhancing event extraction with outcome supervision for large language models. CoRR, abs/2402.11430, 2024. doi: 10.48550/ARXIV.2402.11430. URL https://doi.org/10.48550/arXiv.2402.11430.
  • Gokaslan & Cohen (2019) Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  • Gou et al. (2021) Jian** Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey. Int. J. Comput. Vis., 129(6):1789–1819, 2021. doi: 10.1007/S11263-021-01453-Z. URL https://doi.org/10.1007/s11263-021-01453-z.
  • Gurulingappa et al. (2012) Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Informatics, 45(5):885–892, 2012. doi: 10.1016/J.JBI.2012.04.008. URL https://doi.org/10.1016/j.jbi.2012.04.008.
  • Han et al. (2018) Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp.  4803–4809. Association for Computational Linguistics, 2018. doi: 10.18653/V1/D18-1514. URL https://doi.org/10.18653/v1/d18-1514.
  • Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015. URL http://arxiv.longhoe.net/abs/1503.02531.
  • Ji et al. (2022) Bin Ji, Shasha Li, Shaoduo Gan, Jie Yu, Jun Ma, Huijun Liu, and **g Yang. Few-shot named entity recognition with entity-level prototypical network enhanced by dispersedly distributed prototypes. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na (eds.), Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pp.  1842–1854. International Committee on Computational Linguistics, 2022. URL https://aclanthology.org/2022.coling-1.159.
  • Katiyar & Cardie (2016) Arzoo Katiyar and Claire Cardie. Investigating lstms for joint extraction of opinion entities and relations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016. doi: 10.18653/V1/P16-1087. URL https://doi.org/10.18653/v1/p16-1087.
  • Kim et al. (2019) Jangho Kim, Yash Bhalgat, **won Lee, Chirag Patel, and Nojun Kwak. QKD: quantization-aware knowledge distillation. CoRR, abs/1911.12491, 2019. URL http://arxiv.longhoe.net/abs/1911.12491.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp.  7871–7880. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.ACL-MAIN.703. URL https://doi.org/10.18653/v1/2020.acl-main.703.
  • Li et al. (2023) Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Ye** Choi. Symbolic chain-of-thought distillation: Small models can also ”think” step-by-step. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  2665–2679. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.150. URL https://doi.org/10.18653/v1/2023.acl-long.150.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, **gfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.longhoe.net/abs/1907.11692.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Skq89Scxx.
  • Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Lu et al. (2022) Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, and Hua Wu. Unified structure generation for universal information extraction. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  5755–5772. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.ACL-LONG.395. URL https://doi.org/10.18653/v1/2022.acl-long.395.
  • Ma et al. (2023) Yubo Ma, Zehao Wang, Yixin Cao, and Aixin Sun. Few-shot event detection: An empirical study and a unified view. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  11211–11236. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.628. URL https://doi.org/10.18653/v1/2023.acl-long.628.
  • Mirzadeh et al. (2020) Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp.  5191–5198. AAAI Press, 2020. doi: 10.1609/AAAI.V34I04.5963. URL https://doi.org/10.1609/aaai.v34i04.5963.
  • Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018. URL http://arxiv.longhoe.net/abs/1803.02999.
  • OpenAI (2023) OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  • Paolini et al. (2021) Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cícero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. Structured prediction as translation between augmented natural languages. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=US-TP-xnXI.
  • Peng et al. (2023) Letian Peng, Zihan Wang, and **gbo Shang. Less than one-shot: Named entity recognition via extremely weak supervision. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp.  13603–13616. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-emnlp.908.
  • Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. Semeval-2014 task 4: Aspect based sentiment analysis. In Preslav Nakov and Torsten Zesch (eds.), Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, August 23-24, 2014, pp.  27–35. The Association for Computer Linguistics, 2014. doi: 10.3115/v1/s14-2004. URL https://doi.org/10.3115/v1/s14-2004.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  4077–4087, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. doi: 10.48550/ARXIV.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
  • Ushio & Camacho-Collados (2021) Asahi Ushio and Jose Camacho-Collados. T-NER: An all-round python library for transformer-based named entity recognition. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp.  53–62, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-demos.7. URL https://aclanthology.org/2021.eacl-demos.7.
  • Wadhwa et al. (2023) Somin Wadhwa, Silvio Amir, and Byron C. Wallace. Revisiting relation extraction in the era of large language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  15566–15589. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.868. URL https://doi.org/10.18653/v1/2023.acl-long.868.
  • Walker et al. (2006) Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. ACE 2005 Multilingual Training Corpus. Web Download, 2006. URL https://catalog.ldc.upenn.edu/LDC2006T06. LDC Catalog No. LDC2006T06.
  • West et al. (2022) Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Ye** Choi. Symbolic knowledge distillation: from general language models to commonsense models. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pp.  4602–4625. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.NAACL-MAIN.341. URL https://doi.org/10.18653/v1/2022.naacl-main.341.
  • West et al. (2023) Peter West, Ronan Le Bras, Taylor Sorensen, Bill Yuchen Lin, Liwei Jiang, Ximing Lu, Khyathi Chandu, Jack Hessel, Ashutosh Baheti, Chandra Bhagavatula, and Ye** Choi. Novacomet: Open commonsense foundation models with symbolic knowledge distillation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp.  1127–1149. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-emnlp.80.
  • Xu et al. (2023) Derong Xu, Wei Chen, Wenjun Peng, Chao Zhang, Tong Xu, Xiangyu Zhao, Xian Wu, Yefeng Zheng, and Enhong Chen. Large language models for generative information extraction: A survey. CoRR, abs/2312.17617, 2023. doi: 10.48550/ARXIV.2312.17617. URL https://doi.org/10.48550/arXiv.2312.17617.
  • Xu et al. (2020) Lu Xu, Hao Li, Wei Lu, and Lidong Bing. Position-aware tagging for aspect sentiment triplet extraction. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp.  2339–2349. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.EMNLP-MAIN.183. URL https://doi.org/10.18653/v1/2020.emnlp-main.183.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068, 2022. doi: 10.48550/ARXIV.2205.01068. URL https://doi.org/10.48550/arXiv.2205.01068.
  • Zhou et al. (2023) Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, and Hoifung Poon. Universalner: Targeted distillation from large language models for open named entity recognition. CoRR, abs/2308.03279, 2023. doi: 10.48550/ARXIV.2308.03279. URL https://doi.org/10.48550/arXiv.2308.03279.

Appendix A Label-to-Span Formalization

NER

Person: John/B Smith/I loves/O his/O hometown/O ,/O Los/O Angeles/O

RE

Person: John/B Smith/I loves/O his/O hometown/O ,/O Los/O Angeles/O

John Smith births in: John/O Smith/O loves/O his/O hometown/O ,/O Los/B Angeles/I

EE

Trigger: John/O Smith/O loves/B his/O hometown/O ,/O Los/O Angeles/O

Argument for Trigger “loves”: John/O Smith/O loves/O his/O hometown/O ,/O Los/B Angeles/I

SRL

Verb: John/O Smith/O loves/B his/O hometown/O ,/O Los/O Angeles/O

A1 Argument for Verb “loves”: John/O Smith/O loves/O his/O hometown/B ,/O Los/O Angeles/O

ABSA

Positive Term: John/O Smith/O loves/O his/O hometown/O ,/O Los/B Angeles/I

ASTE

Positive Opinion: John/O Smith/O loves/B his/O hometown/O ,/O Los/O Angeles/O

Aspect for Opinion “loves”: John/O Smith/O loves/O his/O hometown/O ,/O Los/B Angeles/I

Appendix B Few-shot Details

NER

samples 5555-shot examples that contain a certain type of entity for each entity type.

RE

samples 5555-shot examples that contain a certain type of relation for each relation type.

EE

samples 5%percent55\%5 % examples from the original training dataset.

SRL

samples 50505050-shot examples from the original training dataset.

ABSA

samples 5555-shot examples that contain terms with a certain sentiment polarity for each sentiment polarity type.

ASTE

samples 5555-shot examples that contain aspect-opinion triplet with a certain sentiment polarity for each sentiment polarity type.