License: arXiv.org perpetual non-exclusive license
arXiv:2312.03987v1 [cs.CL] 07 Dec 2023

Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration

Meihao Fan2, Xiaoyue Han2, Ju Fan2, Chengliang Chai3, Nan Tang4, Guoliang Li1, Xiaoyong Du2 2Renmin University of China, 3Beijing Institute of Technology, 4HKUST (GZ), 1Tsinghua University
{fmh1art, cloverhxy, fanj, duyong}@ruc.edu.cn, [email protected], [email protected], [email protected]
Abstract

Entity resolution (ER) is an important data integration task with a wide spectrum of applications. The state-of-the-art solutions to ER rely on pre-trained language models (PLMs), which require fine-tuning on a large number of labeled matching/non-matching entity pairs. Recently, large language models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters. This ability is known as in-context learning (ICL), which facilitates effective learning from a few labeled demonstrations in the input context. However, existing ICL approaches to ER typically require a task description and a set of demonstrations for each entity pair, and thus incur a high monetary cost when calling LLMs. To address this problem, we provide a comprehensive study of how to develop a cost-effective batch prompting approach to ER. We introduce a framework, BatchER, consisting of demonstration selection and question batching, and explore different design choices that support batch prompting for ER. We also devise a covering-based demonstration selection strategy that achieves an effective balance between matching accuracy and monetary cost. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Through extensive experiments, we find that batch prompting is very cost-effective for ER, compared with not only PLM-based methods fine-tuned with extensive labeled data but also LLM-based methods with manually designed prompting. We also provide guidance for selecting appropriate design choices for batch prompting.

I Introduction

Entity resolution (ER), which finds entities that refer to the same real-world object, is a crucial task for data cleaning and data integration. Its applications span across various domains, with particular significance in healthcare, finance, customer relationship management, law enforcement, and many others.

The state-of-the-art (SOTA) results in ER are achieved through the application of deep learning methodologies. These methods [1, 2, 3, 4, 5] involve the utilization of Transformer-based models, which are trained on extensive datasets comprising numerous (e.g., hundreds or thousands) labeled entity pairs.


Standard Prompting and Batch Prompting. Meanwhile, large-scale pre-trained language models (LLMs), such as GPT models [6], have enabled an emerging learning paradigm called in-context learning (ICL), which does not require updating the model parameters of LLMs [7, 8, 9, 10]. Instead, it facilitates effective learning from a small set of labeled examples provided in the input context, referred to as demonstrations.

Next, we use an example to illustrate the typical way of in-context learning, referred to as standard prompting.

Example 1

[Standard Prompting] Figure 1(a) shows standard prompting for ER. The user needs to provide a task description, several demonstrations (i.e., the ER pairs with known matching or non-matching labels), and one question (i.e., the ER pair whose label is unknown). An LLM (e.g., GPT-4) can then answer whether the two entities in the question match or not. □

Figure 1: Standard Prompting and Batch Prompting

Recent studies have shown that standard prompting for ER is effective in terms of matching accuracy [11, 12]. However, a key limitation of this approach is the monetary cost of calling LLM APIs, as it requires providing a task description and a set of demonstrations for each question. For instance, consider a table with 1,000 records that requires about 500,000 predictions for ER. Suppose that each pair has $\sim$60 words or $\sim$90 tokens. Then, querying GPT-4 with standard prompting consisting of 3 demonstrations and 1 question will cost $500{,}000 \times (90 \times (3+1)) \times (0.01/1000) = \$1{,}800$, where the pricing of GPT-4 API services is \$0.01 per 1K tokens (https://openai.com/pricing).
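The cost estimate above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `standard_prompting_cost` and its parameter names are hypothetical, and, as in the estimate itself, the tokens of the task description are ignored.

```python
def standard_prompting_cost(num_questions, tokens_per_pair, num_demos,
                            price_per_1k_tokens):
    """Estimate the input-token cost of standard prompting.

    Each prompt carries `num_demos` demonstrations plus one question,
    and every entity pair is assumed to take `tokens_per_pair` tokens.
    The task description is ignored, as in the estimate in the text.
    """
    tokens_per_prompt = tokens_per_pair * (num_demos + 1)
    total_tokens = num_questions * tokens_per_prompt
    return total_tokens * price_per_1k_tokens / 1000


# Numbers taken from the running example in the text.
cost = standard_prompting_cost(
    num_questions=500_000,
    tokens_per_pair=90,
    num_demos=3,
    price_per_1k_tokens=0.01,
)
print(f"${cost:,.0f}")
```

Because the cost grows linearly with the number of demonstrations repeated per question, sharing demonstrations across several questions is the natural lever for reducing it.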

To be cost-effective, a natural alternative is to use a set, or a batch of questions when prompting the LLMs, which is known as batch prompting.

Example 2

[Batch Prompting] As shown in Figure 1(b), the user needs to provide a task description, a set of demonstrations, and a set of questions. Subsequently, the underlying LLM can answer whether each question (i.e., entity pair) in this batch matches or not. □

However, despite some very recent attempts at batch prompting for general natural language tasks [13, 14, 15, 16], to the best of our knowledge, the effectiveness of batch prompting for ER under different design choices has not been explored. To bridge this gap, we provide a comprehensive study of how to develop a cost-effective batch prompting approach to ER. To achieve this, we introduce a batch prompting framework called BatchER that consists of two main modules, demonstration selection and question batching. Based on the framework, we conduct extensive experiments on well-known ER benchmarks to systematically investigate the following two key questions.


A Design Space Exploration on Both Accuracy and Cost. Given the importance of ER and the growing capability of in-context learning, it is highly desirable to systematically study batch prompting for ER, under different design choices, in terms of both matching accuracy and monetary cost. To this end, we categorize different choices in question batching and demonstration selection. For question batching, we categorize existing methods as similarity-based, diversity-based, and random. For demonstration selection, we classify existing methods as fixed, top$k$-$\mathsf{batch}$, and top$k$-$\mathsf{question}$.


A Novel Covering-based Selection Strategy. While empirically exploring the above design space, we find that existing solutions only consider selecting top-$k$ demonstrations after a batch of questions is determined, without considering whether the selected demonstrations can cover all questions in the batch well. Thus, we further study the problem: "how to select a batch of questions and a set of demonstrations collectively, such that the demonstrations cover all questions well and can best guide LLMs to provide answers?" We model the problem as a set cover problem, which is known to be NP-hard. We solve the problem by devising a covering-based selection strategy, which selects demonstrations by considering both relevance and coverage. The covering-based strategy aims to generate a labeled demonstration set by selecting the minimum number of demonstrations that cover all questions and then labeling them, and thus can effectively balance the trade-off between matching accuracy and monetary cost.


A Summary of Experiments. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Our experimental findings reveal insights into the accuracy and cost of different batch prompting strategies. (1) Batch prompting yields 4x to 7x cost savings and achieves higher and more stable accuracy than standard prompting. (2) The design choice that combines diversity-based question batching with our proposed covering-based demonstration selection is the most favorable, i.e., it achieves the highest accuracy while incurring the lowest cost. (3) Our BatchER framework is the most cost-effective, compared with not only PLM-based methods [1, 2, 3] fine-tuned with extensive labeled data, but also LLM-based methods with manually designed prompting [11].


Contributions. We make the following notable contributions.

  1. We investigate the design space of batch prompting for ER by introducing a framework, BatchER, and systematically categorizing existing methods for question batching and demonstration selection in Section II.

  2. We introduce specific question batching strategies (Section III) and demonstration selection methods for ER (Section IV). We devise a novel covering-based selection strategy that connects the processes of question batching and demonstration selection in Section V.

  3. We empirically evaluate our batch prompting framework BatchER (Section VI). We make all code and datasets used in our experiments public on GitHub (https://github.com/fmh1art/BatchER). Based on the evaluation, we provide insights into the strengths and limitations of various strategies, which can guide the design of cost-effective ICL approaches to ER.

II Batch Prompting for Entity Resolution: A Design Space Exploration

II-A Entity Resolution

Let $T_A$ and $T_B$ be relational tables with $m$ attributes. Each tuple refers to an entity consisting of $m$ properties, i.e., for a tuple $a \in T_A$, $a = \{\mathtt{attr}_i, \mathtt{val}_i\}_{i=1}^{m}$, where $\mathtt{attr}_i$ and $\mathtt{val}_i$ denote the $i$-th attribute name and value, respectively. The problem of entity resolution (ER) is to identify all the entity pairs $(a, b) \in T_A \times T_B$ that refer to the same object in the real world based on the corresponding attributes.

An end-to-end ER system consists of a $\mathsf{blocker}$ and a $\mathsf{matcher}$. The $\mathsf{blocker}$'s goal is to identify a subset of $T_A \times T_B$ containing candidate pairs with a high probability of being matched [1, 17, 18], while the $\mathsf{matcher}$'s objective is to determine whether each entity pair $(a, b)$ in the above candidate set refers to the same real-world entity (i.e., $\mathsf{matching}$) or not (i.e., $\mathsf{non}$-$\mathsf{matching}$). While the design of an effective blocking strategy is beyond the scope of this paper, we employ a widely accepted blocking method [1, 18, 19] to produce the aforementioned pairwise candidate set.

II-B In-Context Learning

In-context learning (ICL). ICL refers to the capability of LLMs to learn from a few demonstrations in the input context without any parameter updates [6].

ICL for ER. Given any entity pair $(a_i, b_i)$, we utilize a serialization function to serialize it into a text by concatenating all attribute names and values within the entity pair:

$$\begin{aligned}
\mathcal{S}((a_i, b_i)) &= \mathcal{S}(a_i)\ \texttt{[SEP]}\ \mathcal{S}(b_i) \\
\mathcal{S}(e) &= \mathtt{attr}_1 : \mathtt{val}_1 \ \ldots\ \mathtt{attr}_m : \mathtt{val}_m
\end{aligned} \tag{1}$$

where [SEP] is used to separate the entities of a pair and $\mathcal{S}(\cdot)$ denotes the serialization function applied to each data entity $e$.
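A minimal sketch of the serialization in Eq. (1) is given below. The function names, the single-space delimiter between attributes, and the example records are assumptions made for illustration; the text only fixes the `attr: val` pattern and the [SEP] separator.

```python
def serialize_entity(entity):
    """S(e): concatenate 'attr: val' for every attribute, in order."""
    return " ".join(f"{attr}: {val}" for attr, val in entity.items())


def serialize_pair(a, b, sep="[SEP]"):
    """S((a, b)) = S(a) [SEP] S(b), following Eq. (1)."""
    return f"{serialize_entity(a)} {sep} {serialize_entity(b)}"


# Hypothetical product records used only for illustration.
a = {"title": "iPhone 13 128GB", "brand": "Apple"}
b = {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"}
print(serialize_pair(a, b))
```

Entities are represented as dictionaries here because Python dictionaries preserve insertion order, so the attributes of both entities are emitted in the same schema order.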

Then, we construct a prompt consisting of a task description $\mathtt{Desc}$, several serialized pairs with golden labels $\mathtt{Demos}$ (denoted as demonstrations in this paper), and a serialized pair $\mathtt{Question}$ to be queried (denoted as question). By feeding them to an LLM $G$, we generate the target $y$ with next-token prediction, which can be regarded as a conditional text generation problem:

$$y = \arg\max_{y \in Y} P_G\big(y \mid \overbrace{\mathtt{Desc} \oplus \mathtt{Demos}}^{\text{supervision of ER task}} \oplus\ \mathtt{Question}\big) \tag{2}$$

where $Y = \{\mathsf{matching}, \mathsf{non}\text{-}\mathsf{matching}\}$ is the label space.

As Eq. 2 shows, $G$ receives the task's supervision only from a pre-defined task description ($\mathtt{Desc}$) and the concatenated demonstrations ($\mathtt{Demos}$). In-context learning is usually highly sensitive to the provided demonstrations, and different question selection strategies can cause large fluctuations in performance [20, 21]. Thus, the selection of beneficial demonstrations deserves a comprehensive exploration and a careful design.
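To make the concatenation $\mathtt{Desc} \oplus \mathtt{Demos} \oplus \mathtt{Question}$ concrete, the following sketch assembles a standard prompt as plain text. The function name `build_prompt` and the "Pair:" / "Answer:" wording are hypothetical choices for illustration and are not the prompt template used in the paper.

```python
def build_prompt(desc, demos, question):
    """Concatenate Desc, Demos and Question into one prompt string.

    `demos` is a list of (serialized_pair, label) tuples, where label is
    "matching" or "non-matching"; `question` is a serialized pair whose
    label the LLM is asked to generate.
    """
    lines = [desc, ""]
    for text, label in demos:
        lines.append(f"Pair: {text}")
        lines.append(f"Answer: {label}")
    # The question is left unanswered so the model completes the label.
    lines.append(f"Pair: {question}")
    lines.append("Answer:")
    return "\n".join(lines)
```

Every call to such a builder repeats the description and all demonstrations, which is why the choice and number of demonstrations affect both accuracy and cost.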

II-C The BatchER Framework and Design Space

Figure 2: Our proposed BatchER framework, which consists of (a) question batching and (b) demonstration selection.

Despite the remarkable accuracy of ICL [22, 16, 13], its monetary cost can be very high, since most LLM providers, such as OpenAI, charge users based on token consumption.

To reduce the cost of calling LLMs while maintaining high accuracy, batch prompting has been proposed, which queries a batch of questions together with several demonstrations and asks the LLM to make multiple predictions in a single call [23].

Example 3

Figure 1 shows the difference between Standard Prompting and Batch Prompting. Although both select two demonstrations for LLMs to learn in context, Batch Prompting asks LLMs to answer 2 questions in a single call, which saves approximately the tokens of 2 demonstrations and 1 task description. Naturally, the more questions we put in a batch, the more the cost of calling LLMs is reduced. □
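The saving described in Example 3 can be checked with a simple token count. The sketch below is an approximation under stated assumptions: every entity pair takes the same number of tokens, output tokens are ignored, and the concrete values of `pair` and `desc` are made up for illustration.

```python
import math


def prompt_tokens(num_questions, batch_size, num_demos,
                  tokens_per_pair, desc_tokens):
    """Total input tokens when questions are sent `batch_size` at a time.

    batch_size=1 corresponds to standard prompting. Every prompt repeats
    the task description and `num_demos` demonstrations.
    """
    num_prompts = math.ceil(num_questions / batch_size)
    shared = desc_tokens + num_demos * tokens_per_pair
    return num_prompts * shared + num_questions * tokens_per_pair


# Setting of Example 3: 2 questions, 2 demonstrations per prompt.
pair, desc = 90, 40  # illustrative token counts, not measured values
standard = prompt_tokens(2, 1, 2, pair, desc)
batched = prompt_tokens(2, 2, 2, pair, desc)
saved = standard - batched  # tokens of 2 demonstrations + 1 description
```

In this model the shared part (description plus demonstrations) is paid once per prompt, so the saving grows with the batch size, while the tokens of the questions themselves are unaffected.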


The BatchER Framework. We can observe that two critical components in the prompt of Batch Prompting are in-context demonstrations and questions. Thus, to design effective Batch Prompting, we introduce a framework called BatchER that consists of the modules of in-context demonstration selection and question batching, as shown in Figure 2. The BatchER framework takes a set of questions, i.e., entity pairs $\{q\}$, as input, and aims to produce a set of batch prompts, which are then fed into an LLM. As a prompt needs in-context demonstrations, BatchER also considers a set of entity pairs without $\mathsf{matching}$/$\mathsf{non}$-$\mathsf{matching}$ results as an Unlabeled Demonstration Pool. In this section, we first formally define the above two modules and then systematically explore the design space of Batch Prompting for ER by categorizing each individual module in the BatchER framework.

  • Question Batching. Given a Question Set $M$ of questions to be queried, Question Batching iteratively selects $b$ questions and groups them into one batch $B_i = \{q_j\}_{j=1}^{b}$. To ensure that every question is queried at least once, the union of all batches should equal the original question set, i.e., $\bigcup B_i = M$.

  • Demonstration Selection. Given a large pool of unlabeled demonstrations $D_u$, Demonstration Selection iteratively selects several data points $\{d_j\}$ for each batch $B_i$. We assume that the selected data are manually annotated to produce labeled demonstrations $D_i = \{(d_j, y)\}$, which are used to guide LLMs to make predictions for the batched questions.

To put the above together, the BatchER Framework takes a Question Set $M$ and an Unlabeled Demonstration Pool $D_u$ as input and outputs a set of question batches $B = \{B_i\}$ along with a set of corresponding demonstrations $D = \{D_i\}$, satisfying $\bigcup B_i = M$ and $\bigcup D_i \subseteq D_u$.
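The two output constraints can be stated as a small check. This is a sketch for illustration only: the function name `check_batcher_output` is hypothetical, questions and demonstrations are assumed to be hashable (e.g., serialized strings), and the subset constraint is applied to the demonstrations before their labels are attached.

```python
def check_batcher_output(question_set, demo_pool, batches, demos_per_batch):
    """Check the two BatchER output constraints stated above.

    (1) The union of all question batches equals the question set M.
    (2) Every selected demonstration comes from the unlabeled pool D_u.
    `demos_per_batch` holds, per batch, a list of (demonstration, label).
    """
    covered = set().union(*batches) if batches else set()
    selected = {d for demos in demos_per_batch for d, _label in demos}
    return covered == set(question_set) and selected <= set(demo_pool)
```

Any combination of batching and selection strategies from the design space should pass this check, since the constraints do not depend on how batches or demonstrations are chosen.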

TABLE I: A Design Space Exploration
Module | Categorization
Question Batching | (1) Random; (2) Similarity-based; (3) Diversity-based
Demonstration Selection | (1) Fixed; (2) Top$k$-$\mathsf{batch}$; (3) Top$k$-$\mathsf{question}$; (4) Covering-based (Our proposal)


A Design Space Exploration. To utilize in-context learning for ER, several challenges should be addressed. First, question batching and demonstration selection require a feature extractor to map questions and demonstrations into a vector space, which facilitates the measurement of their relevance. However, the widely used semantics-based feature extractor may fail to select beneficial demonstrations due to the lack of task-specific signals [22]. Second, although in-context learning shows stable and remarkable performance in Standard Prompting with relevant demonstration selection [24, 25, 16], demonstration selection strategies still lack a comprehensive investigation of the trade-off between accuracy and cost. Finally, the choice of batching strategy has a significant impact on downstream performance, which deserves in-depth investigation.

To address the challenges, we propose a categorization of design choices for each module in BatchER, which forms a design space as shown in Table I. We first explore strategies for the question batching module and discuss different feature extractors used for measuring relevance among questions (Section III). Subsequently, we investigate methods for selecting demonstrations for a batch (Section IV). We note that BatchER is extensible, i.e., it is possible to incorporate new modules, new categories, or new methods or variants of existing methods. Moreover, it is possible to define the search space from a different angle; that is, we contend that our proposal is rational, but may not be unique.

III Question Batching

Figure 3: Question Batching Framework

This section explores the question batching strategies, as shown in Table I. To this end, we first describe a general framework of question batching, as illustrated in Figure 3. Specifically, given a Question Set $M$ of entity pairs, the framework produces batches of questions in three steps.

  • Feature Extraction. We first use a Feature Extractor to cast the questions into feature vectors.

  • Question Clustering. We then adopt an unsupervised clustering algorithm such as DBSCAN or K-Means to group the questions into clusters.

  • Question Batching. We finally group questions into batches based on the clusters using various strategies.

In the remainder of this section, we introduce three representative batching strategies, namely random question batching, similarity-based question batching, and diversity-based question batching, which have been adopted by previous studies [23, 26] (Section III-A). Next, as feature extraction and distance measurement (for clustering) are involved in the batching process, we discuss two feature extraction methods in Section III-B. Note that, for question clustering, we adopt DBSCAN [27], as the algorithm achieves the best performance. Due to the space limit, this section does not discuss various clustering algorithms, which are not the focus of this paper.

III-A Batching Strategies

Given clustered questions, BatchER generates batches based on the following three representative strategies, random question batching, similarity-based question batching, and diversity-based question batching, which have been adopted by previous studies [23, 26].


Similarity-based Question Batching. The intuition of this strategy is to group similar questions from the same cluster into the same batch. To this end, we iteratively select $b$ (i.e., the batch size) questions from the same cluster to form a batch, ensuring that questions in the same batch have similar feature vectors. During the final stage of batch generation, some clusters may contain fewer questions than the required batch size $b$. In such a case, we select the largest remaining cluster, denoted as $C_{\text{max}}$. We then seek to pair it with another cluster whose size exactly matches $b - |C_{\text{max}}|$ to form a complete batch. If no such cluster exists, we opt for the next largest cluster and randomly select $b - |C_{\text{max}}|$ elements from it to form a batch in conjunction with $C_{\text{max}}$.
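The two stages of similarity-based batching can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function name and the `seed` parameter are hypothetical, clusters are assumed to be given as lists of questions, and the behavior when the next largest cluster is still too small (topping up from further clusters) is an assumption, since the text does not specify it.

```python
import random


def similarity_batches(clusters, b, seed=0):
    """Similarity-based question batching (a sketch).

    `clusters` is a list of lists of questions; `b` is the batch size.
    """
    rng = random.Random(seed)
    remaining = [list(c) for c in clusters]
    batches = []

    # Stage 1: cut full batches of b questions out of single clusters.
    for cluster in remaining:
        while len(cluster) >= b:
            batches.append(cluster[:b])
            del cluster[:b]

    # Stage 2: every leftover cluster now has fewer than b questions.
    remaining = [c for c in remaining if c]
    while remaining:
        remaining.sort(key=len, reverse=True)  # stable sort keeps tie order
        batch = remaining.pop(0)               # largest cluster, C_max
        while len(batch) < b and remaining:
            need = b - len(batch)
            # Prefer a cluster whose size is exactly the missing amount.
            exact = next((c for c in remaining if len(c) == need), None)
            donor = exact if exact is not None else remaining[0]
            picked = rng.sample(donor, min(need, len(donor)))
            for q in picked:
                donor.remove(q)
            batch.extend(picked)
            remaining = [c for c in remaining if c]
        batches.append(batch)
    return batches
```

The last batch may contain fewer than $b$ questions when the total number of questions is not a multiple of $b$.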


Diversity-based Question Batching. The intuition of this strategy is to group questions from diverse clusters into a batch. In this batching strategy, batches are also generated in two stages. First, we ensure batch diversity by selecting one question from each of $b$ different clusters, such that the questions within a batch have clearly different feature vectors from each other. Then, when the batching process nears completion, we may encounter scenarios where the number of available clusters is less than $b$. In such a case, we preserve as much diversity as possible within the limited number of clusters by selecting questions from the remaining clusters in a round-robin manner.
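A possible sketch of diversity-based batching is shown below. It is one interpretation of the description above rather than the authors' code: the function name is hypothetical, and visiting larger clusters first (both when choosing the $b$ clusters and when starting the round-robin) is an assumption of this sketch.

```python
def diversity_batches(clusters, b):
    """Diversity-based question batching (a sketch).

    `clusters` is a list of lists of questions; `b` is the batch size.
    Each batch takes one question per cluster in turn; when fewer than
    b clusters remain, the round-robin simply wraps around.
    """
    remaining = [list(c) for c in clusters if c]
    batches = []
    while remaining:
        # Visit larger clusters first (an assumption of this sketch).
        remaining.sort(key=len, reverse=True)
        batch = []
        while len(batch) < b and any(remaining):
            for cluster in remaining:
                if cluster and len(batch) < b:
                    batch.append(cluster.pop(0))
        batches.append(batch)
        remaining = [c for c in remaining if c]
    return batches
```

With at least $b$ non-empty clusters, a single pass over the clusters fills the batch with questions from $b$ distinct clusters; the wrap-around only takes effect in the final stage.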

Example 4

[Question Batching] Consider the questions in Figure 3. We denote the three clusters as $C_a = \{q^a_i\}_{i=1}^{2}$, $C_b = \{q^b_i\}_{i=1}^{3}$, and $C_c = \{q^c_i\}_{i=1}^{4}$, respectively.

(1) For similarity-based question batching, we sequentially select $C_b$ and $C_c$, forming batches $B_1 = \{q^b_1, q^b_2, q^b_3\}$ and $B_2 = \{q^c_1, q^c_2, q^c_3\}$. Subsequently, from the remaining clusters $C_a = \{q^a_1, q^a_2\}$ and $C_c = \{q^c_4\}$, we choose the larger cluster $C_a$ and combine it with $C_c$ to create $B_3 = \{q^a_1, q^a_2, q^c_4\}$.

(2) For diversity-based question batching, we can generate diverse batches $B_1 = \{q^a_1, q^b_1, q^c_1\}$ and $B_2 = \{q^a_2, q^b_2, q^c_2\}$ in the initial stage by iteratively selecting one question from each of $C_a$, $C_b$, and $C_c$. Then, with the remaining clusters $C_b = \{q^b_3\}$ and $C_c = \{q^c_3, q^c_4\}$, we sequentially select questions from $C_c$, $C_b$, and $C_c$ to generate the final batch $B_3 = \{q^c_3, q^b_3, q^c_4\}$. □


Random Question Batching. We also consider a straightforward random question batching strategy, which is commonly adopted in the existing works [23, 26]. In this approach, each batch is formed by randomly selecting questions from the remaining question set. Due to this randomness, the generated batches may contain a mix of both similar and dissimilar questions. This implies that a random batch, to some extent, represents a middle ground between a similar batch and a diverse batch.
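For completeness, random question batching can be sketched in a few lines. The function name and the `seed` parameter are hypothetical; shuffling once and slicing is equivalent to repeatedly drawing $b$ questions without replacement from the remaining question set.

```python
import random


def random_batches(questions, b, seed=0):
    """Random question batching: shuffle, then slice into batches of b."""
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return [shuffled[i:i + b] for i in range(0, len(shuffled), b)]
```

Unlike the two cluster-based strategies, this baseline needs neither a feature extractor nor a clustering step.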

III-B Feature Extractor

The process of batching questions in the previous section relies on a feature extractor to convert questions into corresponding feature vectors. Subsequently, these feature vectors are used to calculate distances between questions, which serve as the basis for the clustering procedure. Formally, given a set of questions $M$, we need to define a feature extractor $f$ and a distance function $\mathtt{dist}$, so that the distance between any two questions $q_i$ and $q_j$ can be calculated via $\mathtt{dist}(\mathbf{v}_i, \mathbf{v}_j)$ between the two feature vectors $\mathbf{v}_i$ and $\mathbf{v}_j$. We note that the distance function can be defined in a variety of ways, such as Euclidean distance or cosine distance. In our experiments, we define the distance function based on the Euclidean distance, which achieves the best performance.
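The two distance functions mentioned above can be written directly from their standard definitions. In this sketch, feature vectors are plain Python sequences of equal length, and `cosine_distance` is defined as one minus the cosine similarity and assumes non-zero vectors; the function names are illustrative.

```python
import math


def euclidean_distance(u, v):
    """dist(u, v) = sqrt(sum_k (u_k - v_k)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_distance(u, v):
    """dist(u, v) = 1 - (u . v) / (|u| |v|); u and v must be non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Euclidean distance is sensitive to vector magnitude while cosine distance only compares direction, which is one reason the choice can affect the resulting clusters.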

Next, we introduce two types of feature extractors, one based on semantics and the other being structure-aware.


Semantics-based Feature Extractor. The semantics-based feature extractor utilizes a pre-trained language model (PLM) to encode each serialized question. For the ER task, as all questions are structural pairs, i.e., with multiple attributes, we first use the serialization function defined in Eq. (1) to generate serialized questions and then pass them to a PLM, such as SBERT [28] or RoBERTa [29], to generate embedding-based representations. Formally, given a question $q$, the feature vector $\mathbf{v}$ is generated as follows:

$$\mathbf{v}=\texttt{Encoder}(\mathcal{S}(q)) \tag{3}$$

where Encoder denotes the encoding function of a PLM. Although the above feature extractor formulates the relevance as semantic distance, it may have the limitation of ignoring the structural information. This inspires us to introduce another feature extraction method, which can capture structural similarity to model relevance.


Structure-aware Feature Extractor. The structure-aware feature extractor employs a string similarity function to map the attribute-matching signals of the two entities of a question into a low-dimensional space, which enables the generated feature vectors to capture structural information and task-related knowledge. Formally, given a structural pair $(a,b)$, we derive the feature vector by calculating the similarities of attributes between $a$ and $b$. Since attribute values typically take a string format, we can compute the similarity $s_i$ on attribute $\mathtt{attr}_i$ with a string similarity function, e.g., the Levenshtein ratio or the Jaccard similarity.

Using the Jaccard similarity, we tokenize $\mathtt{val}^a_i$ and $\mathtt{val}^b_i$ into sets and compute the similarity as:

$$s_i=\mathtt{JAC}(\mathtt{val}^a_i,\mathtt{val}^b_i)=\frac{|\mathtt{val}^a_i\cap\mathtt{val}^b_i|}{|\mathtt{val}^a_i\cup\mathtt{val}^b_i|} \tag{4}$$

where $\mathtt{val}^a_i$ represents the tokenized set of attribute value $\mathtt{val}_i$ of entity $a$ and $|\mathtt{val}^a_i|$ represents the corresponding token-set size.
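A minimal sketch of the Jaccard similarity in Eq. (4) is given below. The tokenizer (lowercasing and splitting on whitespace) and the handling of two empty values are assumptions made for illustration; the text does not specify them.

```python
def jaccard(val_a, val_b):
    """Jaccard similarity between the token sets of two attribute values."""
    tokens_a = set(val_a.lower().split())  # assumed tokenizer: lowercase + whitespace split
    tokens_b = set(val_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # assumption: two empty values are treated as identical
    return len(tokens_a & tokens_b) / len(union)

# Four shared tokens out of five distinct tokens in total.
s_album = jaccard("Here Comes the Fuzz", "Here Comes The Fuzz [Explicit]")
```

Under this tokenizer, the two album strings share four of five distinct tokens, so the similarity is 0.8.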

The Levenshtein ratio (LR) is derived from the Levenshtein edit distance (LED) [30], which is the minimum number of edits needed to transform one string into another:

$$s_i=\mathtt{LR}(\mathtt{val}^a_i,\mathtt{val}^b_i)=1-\frac{\mathtt{LED}(\mathtt{val}^a_i,\mathtt{val}^b_i)}{s} \tag{5}$$

where $\mathtt{LED}$ is the Levenshtein edit distance function and $s$ represents the sum of the string lengths of $\mathtt{val}^a_i$ and $\mathtt{val}^b_i$.

Thus, given a question $q$ with entity pair $(a,b)$, the feature vector $\mathbf{v}$ is generated by concatenating the similarities of all $m$ attributes, i.e., $\mathbf{v}=\{s_i\}_{i=1}^{m}$.

Figure 4: An example instance of Entity Resolution.
Example 5

[Feature Extraction] Figure 4 shows an example instance of entity resolution.

(1) For the semantics-based feature extractor, we first serialize $q_1$ with Eq. (1) and obtain $\mathcal{S}(q_1)=$ “title:Rashi, album:Here…, genre:Dance… [SEP] title:Rashi, album:Here…, genre:Music”. Then we utilize a pre-trained language model such as SBERT to encode it into an embedding, which serves as the feature vector $\mathbf{v}_1$.

(2) For the structure-aware feature extractor, to generate $\mathbf{v}_1$ for $q_1$, we first compute the string similarities of “Rashi” and “Rashi”, “Here Comes the Fuzz” and “Here Comes The Fuzz [Explicit]”, and “Dance,Music,Hip-Hop” and “Music”. Suppose we use the $\mathtt{LR}$ function; the similarities of title, album, and genre are then 1, 0.73, and 0.42, respectively. Second, the similarities are concatenated to make up the feature vector $\mathbf{v}_1=[1,0.73,0.42]$. Similarly, the feature vector of $q_2$ can be computed as $\mathbf{v}_2=[0.33,0,0.46]$. $\Box$
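The structure-aware computation for $q_1$ can be sketched as below, following Eq. (5). The `sub_cost` parameter is an assumption introduced for illustration: some Levenshtein-ratio implementations weight a substitution as 2 (one deletion plus one insertion), and with that weighting the sketch matches the rounded values 1, 0.73, and 0.42 quoted above, whereas a unit substitution cost gives a slightly higher album similarity. The dictionary layout of the entities is also hypothetical.

```python
def edit_distance(a, b, sub_cost=1):
    """Levenshtein edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def levenshtein_ratio(a, b, sub_cost=1):
    """LR = 1 - LED / s, where s is the sum of the two string lengths (Eq. 5)."""
    s = len(a) + len(b)
    if s == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - edit_distance(a, b, sub_cost) / s

def structure_aware_features(entity_a, entity_b, attrs, sub_cost=1):
    """Concatenate per-attribute similarities into a feature vector."""
    return [levenshtein_ratio(entity_a[k], entity_b[k], sub_cost) for k in attrs]

# The pair q1 from Example 5, stored as hypothetical attribute dictionaries.
a = {"title": "Rashi", "album": "Here Comes the Fuzz", "genre": "Dance,Music,Hip-Hop"}
b = {"title": "Rashi", "album": "Here Comes The Fuzz [Explicit]", "genre": "Music"}
v1 = [round(x, 2) for x in
      structure_aware_features(a, b, ["title", "album", "genre"], sub_cost=2)]
```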

IV Demonstration Selection

Figure 5: Demonstration Selection Framework, where blue circles and yellow circles represent demonstrations and questions respectively, and values on edges represent distances.

Figure 5 illustrates the framework of demonstration selection and describes four demonstration selection methods. Given an unlabeled demonstration pool $D_u$ and a set of generated question batches $B$, demonstration selection aims to select beneficial in-context demonstrations $D_i$ for each batch $B_i\in B$, which are then manually labeled. To make the four demonstration selection methods concrete, we give an illustration of each. For simplicity, we only consider the two closest demonstrations for each question.

IV-A Fixed Demonstration Selection

A basic idea is to sample a fixed set of $K$ demonstrations and then allocate them to every batch. In Figure 5, we generate two fixed demonstrations by randomly sampling from the unlabeled demonstration pool and allocate these two demonstrations to each batch. This method incurs a fixed annotation cost. However, existing studies show that random demonstrations may lead to unstable ICL performance [20, 21].

IV-B Top$k$-$\mathsf{batch}$ Demonstration Selection

Similar to the strategy in Standard Prompting of recommending the top $k$ most relevant demonstrations [31], this strategy selects the $k$ most relevant demonstrations for each batch. Since a batch $B_i$ is a set of questions whereas $d$ is a single demonstration, the two cannot be compared directly with $\mathtt{dist}$. We therefore first define the relevance between $B_i$ and $d$ based on the distance function $\mathtt{dist}$ defined in Section III-B:

$$\mathtt{dist}^{*}(B_i,d)=\min_{q_j\in B_i}\mathtt{dist}(q_j,d) \tag{6}$$

That is, we define the relevance between $B_i$ and $d$ as the minimum distance between $d$ and any question in the batch. Based on this, we can use the $k\mathtt{NN}$ algorithm to generate $k$ in-context demonstrations for $B_i$ by $D_i=k\mathtt{NN}(B_i,D_u)$. In Figure 5, we set $k$ to the batch size $|B_i|$, and thus Top$k$-$\mathsf{batch}$ sequentially selects $k$ demonstrations (bold blue circles) based on the $k$ shortest edges (red dotted lines).

However, this method may not be able to assign relevant demonstrations for some particular questions in a batch. Thus, for such questions, the LLM may fail in finding relevant demonstrations for reference to provide the correct answers.
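A minimal sketch of this strategy, using the batch-to-demonstration distance of Eq. (6), is shown below. For illustration only, questions and demonstrations are represented as points on a line and `dist` is the absolute difference; these are assumptions, as are all names in the code.

```python
def topk_batch(batch, pool, dist, k):
    """Select the k demonstrations closest to the batch under Eq. (6)."""
    def batch_dist(d):
        # dist*(B_i, d): minimum distance from d to any question in the batch
        return min(dist(q, d) for q in batch)
    return sorted(pool, key=batch_dist)[:k]

dist = lambda q, d: abs(q - d)      # toy distance on 1-D points
batch = [0.0, 10.0]                 # two hypothetical questions
pool = [0.5, 1.0, 9.0, 20.0]        # hypothetical unlabeled demonstrations
selected = topk_batch(batch, pool, dist, k=2)
```

In this toy setting the two selected demonstrations (0.5 and 1.0) both lie next to the first question, so the question at 10.0 receives no nearby demonstration, which is the limitation described above.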

IV-C Top$k$-$\mathsf{question}$ Demonstration Selection

To address the above issue, we investigate a demonstration selection method that selects the $k$ most relevant demonstrations for each question in the batch. This is based on the assumption that, since relevant demonstrations are beneficial when querying an individual question, the union of these relevant demonstrations will also be beneficial when querying the whole batch. Formally, considering a batch $B_i=\{q_j\}_{j=1}^{b}$, the in-context demonstration set $D_i$ can be generated as $D_i=\bigcup_{q_j\in B_i}k\mathtt{NN}(q_j,D_u)$. Figure 5 illustrates the basic idea of the Top$k$-$\mathsf{question}$ method, where we set $k=1$ and select the most relevant demonstration for each question in the batch.

Although this method is likely to improve the accuracy of ICL, it may incur a large monetary cost. It may also generate long prompts, which could lead to long-text comprehension issues and exceed the input length limit.

IV-D Covering-based Demonstration Selection

A key limitation of Top$k$-$\mathsf{question}$ and Top$k$-$\mathsf{batch}$ is that they may incur substantial labeling cost, which is caused by labeling the selected demonstrations. To mitigate this, we introduce a new approach based on the idea of using demonstrations to “cover” all questions in the batch $B_i$, where “cover” means that the distance between a question $q$ and a demonstration $d$ is smaller than a threshold $t$. This is based on the assumption that any demonstration within distance $t$ of a question is relevant to it and thus beneficial when answering it. In Figure 5, we assume that demonstrations at a distance shorter than 5 can be regarded as a beneficial reference when answering the question. Thus, we first select the top demonstration to cover the left two questions. Then, to cover the last question, the rightmost demonstration is selected.

It is important to recognize that, for the given batches, multiple selection choices may fulfill the covering criterion above. Thus, in Section V, we formally define the covering-based problem and propose an efficient algorithm to solve it.

V Covering-based Demonstration Selection

The covering-based method aims to address two main problems. First, we need to select a minimal subset of demonstrations from an unlabeled demonstration pool to cover all the questions of all batches. Then, for each batch, we need to further select some demonstrations from this subset, ensuring the covering of each question in the batch while minimizing the total number of tokens. Below, we name these two problems as the Demonstration Set Generation and Batch Covering problems and provide their detailed definitions.

V-A Demonstration Set Generation


Definition. Given a question set $M$ containing all questions to be queried, an unlabeled demonstration pool $D_u$, and a non-negative distance threshold $t$, we need to select a subset of demonstrations $D_s\subset D_u$ such that for every $q\in M$ there exists at least one $d\in D_s$ with $\mathtt{dist}(q,d)<t$. The goal is to minimize the size $|D_s|$ of the selected demonstration set.


NP-hard Proof Sketch. We can prove the Demonstration Set Generation Problem to be NP-hard by a reduction from the Set Cover Problem, which is proven to be NP-hard [32].

An instance of the Set Cover Problem (SCP) consists of a universe of items $U$ and a collection $V=\{S_1,S_2,S_3,\ldots,S_m\}$ of subsets of $U$. We need to find a sub-collection $V^{*}\subset V$ such that each element in $U$ is covered by at least one subset in $V^{*}$. The goal is to minimize the number of selected subsets $|V^{*}|$.

We reduce SCP to our problem. We show that for any instance $(U,V)$ of SCP, we can create a corresponding instance of our problem in polynomial time. First, we translate the set $U$ of universal items into the set $M$ of questions, so that each item $u_j\in U$ corresponds to a question $q_j\in M$. Then, for each subset $S_i\in V$, we add a demonstration $d_i$ to the unlabeled demonstration pool $D_u$; for each item $u_j\in S_i$, we set the distance between $d_i$ and $q_j$ to $0$ (less than $t$ for any $t>0$), and for each $u_j\notin S_i$, we set it to a value no smaller than $t$. Finally, given the above reduction, the objective of finding the minimum number of subsets in $V$ that cover all items in $U$ in SCP is equivalent to the objective of our problem, which is to find the minimum number of demonstrations in $D_u$ that cover all questions in $M$.
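The construction behind the proof sketch can be illustrated with a small script. It assumes one demonstration per subset of the SCP instance and a strictly positive threshold $t$, with distance $0$ for items inside the subset and $t$ otherwise; the function names and the toy instance are hypothetical.

```python
def scp_to_demo_instance(universe, subsets, t=1.0):
    """Build questions, demonstrations, and a distance table from an SCP instance."""
    questions = list(universe)          # one question per item of U
    demos = list(range(len(subsets)))   # demonstration i stands for subset S_i
    dist = {(q, i): (0.0 if q in s else t)
            for i, s in enumerate(subsets) for q in questions}
    return questions, demos, dist

def covered_by(demo, questions, dist, t):
    """Questions whose distance to the given demonstration is below t."""
    return {q for q in questions if dist[(q, demo)] < t}

U = [1, 2, 3, 4]
V = [{1, 2}, {2, 3}, {3, 4}]
questions, demos, dist = scp_to_demo_instance(U, V, t=1.0)
coverage = [covered_by(i, questions, dist, 1.0) for i in demos]
```

Each demonstration covers exactly the questions of its subset, so a minimum set of covering demonstrations corresponds to a minimum set cover.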

Algorithm 1 Demonstration Set Generation/Batch Covering
Input: Set of questions $Q$, set of demonstrations $D$, nondecreasing value function $f$, weight function $w$.
Output: Set of selected demonstrations $D_s$.
1:  $D_s\leftarrow\varnothing$
2:  while $f_Q(D_s)\neq f_Q(D)$ do
3:     $d\leftarrow\arg\max_{d\in D}\frac{f_Q(D_s\cup\{d\})-f_Q(D_s)}{w(d)}$
4:     $D_s\leftarrow D_s\cup\{d\}$
5:  end while


Greedy Algorithm. To efficiently address the Demonstration Set Generation problem, we propose a greedy algorithm. To start with, we define a non-decreasing value function $f_Q(D_s)=\sum_{i=1}^{|Q|}z_i$ to measure the value of an intermediate demonstration set $D_s$, where for $q_i\in Q$, $z_i=1$ if $\min_{d_j\in D_s}\mathtt{dist}(q_i,d_j)<t$, and $z_i=0$ otherwise. In other words, the value function counts the number of questions covered by $D_s$. Then, taking the value function $f$, the set of questions $M$ (as $Q$), and the unlabeled demonstration pool $D_u$ (as $D$) as input, we iteratively select the most efficient demonstration. Efficiency is defined as the ratio of the incremental value that a demonstration contributes to the intermediate demonstration set $D_s$ to its weight. For the Demonstration Set Generation problem, we set the weights of all demonstrations to 1, since selecting any demonstration incurs the same cost. The pseudo-code is shown in Algorithm 1.

We first initialize the demonstration set $D_s$ to an empty set (line 1). Then we check whether the value of the intermediate set $D_s$ has reached the value of the full unlabeled demonstration pool $D_u$ (line 2), which is likely to equal $|M|$ when the pool is large enough. If not, we select the most efficient demonstration and add it to the intermediate demonstration set (lines 3-4). Otherwise, the algorithm ends and outputs the selected demonstration set $D_s$ (line 5).
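The greedy procedure of Algorithm 1 can be sketched in Python as follows. This is a naive, unoptimized version that recomputes coverage in every iteration; the function signature, the 1-D toy data, and the default unit weight are assumptions made for illustration.

```python
def greedy_cover(questions, demos, dist, t, weight=lambda d: 1.0):
    """Greedy selection following Algorithm 1 (unit weights by default)."""
    def covered(selected):
        # Questions within distance t of at least one selected demonstration.
        return {q for q in questions if any(dist(q, d) < t for d in selected)}

    target = len(covered(demos))  # f_Q(D): best coverage the pool can reach
    selected = []                 # line 1: D_s starts empty
    while len(covered(selected)) != target:  # line 2
        current = len(covered(selected))
        # line 3: demonstration with the largest marginal gain per unit weight
        best = max(
            (d for d in demos if d not in selected),
            key=lambda d: (len(covered(selected + [d])) - current) / weight(d),
        )
        selected.append(best)     # line 4
    return selected

dist = lambda q, d: abs(q - d)    # toy distance on 1-D points
questions = [0, 1, 2, 10]
demos = [1, 0, 10, 50]
chosen = greedy_cover(questions, demos, dist, t=2)
```

With threshold 2, demonstration 1 covers three questions and is picked first; demonstration 10 is then needed for the remaining question, while 0 and 50 add nothing.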

Assuming that the optimal objective value of the Demonstration Set Generation problem is $OPT$ and the value obtained by our greedy algorithm is $ans^{*}$, we have $ans^{*}\leq H_k\cdot OPT$, where $H_k=\sum_{i=1}^{k}\frac{1}{i}$ and $k=\max_{d_i\in D_s}f_Q(\{d_i\})$. A complete proof can be found in [33].

For the Demonstration Set Generation problem, by defining a value function and designing a greedy algorithm to optimize it, we obtain an effective solution that selects a small number of demonstrations to cover all the questions to be queried, thereby greatly reducing the labeling cost.

TABLE II: Statistics of Datasets.

| Dataset | Domain | # Attr. | # Pairs | # Matches |
|---|---|---|---|---|
| Walmart-Amazon (WA) | Electronics | 5 | 10,242 | 962 |
| Abt-Buy (AB) | Product | 3 | 9,575 | 1,028 |
| Amazon-Google (AG) | Software | 3 | 11,460 | 1,167 |
| DBLP-Scholar (DS) | Citation | 4 | 28,707 | 5,347 |
| DBLP-ACM (DA) | Citation | 4 | 12,363 | 2,220 |
| Fodors-Zagats (FZ) | Restaurant | 6 | 946 | 110 |
| iTunes-Amazon (IA) | Music | 8 | 532 | 132 |
| Beer | Beer | 4 | 450 | 68 |

V-B Batch Covering

Next, based on the generated demonstration set, we allocate relevant demonstrations to each batch so as to cover all the questions in the batch. At this stage, we ask a question: is there further room for optimization when allocating demonstrations? To answer this question, we consider an example with a question set $M=\{q_1,q_2,q_3,q_4\}$ and a labeled demonstration set $\{d_1,d_2\}$, where $d_1$ covers $q_1,q_2,q_3$ and $d_2$ covers $q_2,q_3,q_4$. Given a batch $B_i=\{q_2,q_3\}$, we need to allocate demonstrations to cover all questions in $B_i$. In this case, allocating either $d_1$ or $d_2$ covers all questions in the batch. Therefore, although we only require each question to be covered once when generating the demonstration set, there is still room for choice when allocating demonstrations to each batch.


Definition. Given a batch of questions $B_i=\{q\}$, a generated demonstration set $D_s$, and a non-negative distance threshold $t$, we need to select a set of demonstrations $D_i\subset D_s$ such that for every $q\in B_i$ there exists at least one $d\in D_i$ with $\mathtt{dist}(q,d)<t$. The goal is to minimize the total weight of the selected demonstrations, $\sum_{d\in D_i}w(d)$.

We define the weight of a demonstration as its number of tokens, and the goal of our problem is to find a demonstration set that covers the batch with minimum token consumption.


NP-hard Proof Sketch. When the weights of all demonstrations are set to 1, the Batch Covering problem takes the same form as the set cover problem. Following the proof in Section V-A, we can create a corresponding instance of the Batch Covering problem from any instance of SCP. Moreover, since all weights are 1, the objective of our problem becomes $\sum_{d\in D_i}1=|D_i|$, which is equivalent to that of SCP. Thus, the Batch Covering problem is NP-hard by a reduction from the NP-hard set cover problem.


Greedy Algorithm. We again use Algorithm 1 to address the Batch Covering problem. We use the same value function defined in Section V-A and define the weights of demonstrations as their token counts. Taking the value function $f$, the batch of questions $B_i$, the generated demonstration set $D_s$, and the weight function $w$ as input, the algorithm outputs the allocated demonstration set $D_i$ for batch $B_i$.

This greedy algorithm yields an approximation ratio of $\ln|B_i|-\ln\ln|B_i|+\Omega(1)$. A complete proof can be found in [33].

For the Batch Covering problem, by defining the weights of demonstrations as token counts and formulating it as a weighted set cover problem, we can generate an effective solution that keeps the total number of tokens in batch prompts small, thereby reducing the API cost.
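The weighted variant can be sketched on the small example from the beginning of this subsection, where $d_1$ covers $q_1,q_2,q_3$, $d_2$ covers $q_2,q_3,q_4$, and the batch is $\{q_2,q_3\}$. The coverage sets are given directly instead of being derived from distances, and the token counts are hypothetical values chosen only to show how the weight breaks the tie.

```python
def batch_cover(batch, covers, tokens):
    """Weighted greedy cover: covers[d] is the set of questions within
    distance t of d, and tokens[d] is the weight w(d) of demonstration d."""
    reachable = set(batch) & set().union(*covers.values())
    covered, chosen = set(), []
    while covered != reachable:
        # Pick the demonstration covering the most new questions per token.
        best = max(
            (d for d in covers if d not in chosen),
            key=lambda d: len((covers[d] & reachable) - covered) / tokens[d],
        )
        chosen.append(best)
        covered |= covers[best] & reachable
    return chosen

covers = {"d1": {"q1", "q2", "q3"}, "d2": {"q2", "q3", "q4"}}
tokens = {"d1": 120, "d2": 80}  # hypothetical token counts
allocated = batch_cover(["q2", "q3"], covers, tokens)
```

Both demonstrations cover the whole batch, so the greedy rule allocates the shorter one, `d2`.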

VI Experiments

This section evaluates our batch prompting framework BatchER investigated in this paper. Specifically, we first present the experimental setup in Section VI-A, and then conduct experiments to answer the following key questions:


Exp-1: How does Batch Prompting compare with Standard Prompting? (Section VI-B)


Exp-2: What are effective strategies in our design space of question batching and demonstration selection? (Section VI-C)


Exp-3: How does our proposed BatchER framework compare with PLM-based approaches to ER? (Section VI-D)


Exp-4: How does our proposed BatchER framework compare with LLM-based approaches to ER? (Section VI-E)


Exp-5: What is the performance of our BatchER framework given various underlying LLMs? (Section VI-F)


Exp-6: What is the performance of our BatchER framework given different feature extractors? (Section VI-G)

VI-A Experimental Setup


Datasets. We evaluate our proposed batch prompting framework BatchER using widely adopted benchmark datasets from the Magellan benchmark [34], which span a variety of domains, such as product, software, and citation. Table II provides detailed statistics of the datasets. Specifically, each dataset contains entities from two relational tables with multiple attributes, and a set of labeled matching/non-matching entity pairs. Take the Amazon-Google (AG) dataset as an example: it contains software products from Amazon and Google with three attributes (title, manufacturer, price), and has 11,460 entity pairs, of which 1,167 are matches. For a fair comparison, the set of labeled entity pairs is split into train, validation, and test sets with a ratio of 3:1:1, which is consistent with existing ER studies [5, 1, 35].


Evaluation Metrics. In this paper, we evaluate the performance of ER approaches on both Accuracy and Cost.

(1) Matching Accuracy. Following existing ER studies [35, 1, 3, 2], we use F1 score to measure the matching accuracy of an ER approach. Specifically, let $\mathtt{TP}$, $\mathtt{FP}$ and $\mathtt{FN}$ denote the numbers of true positives (i.e., matching pairs correctly identified), false positives (i.e., non-matching pairs incorrectly identified as matches) and false negatives (i.e., matching pairs incorrectly omitted), respectively. Then, we compute Precision and Recall as $\mathtt{P} = \mathtt{TP} / (\mathtt{TP} + \mathtt{FP})$ and $\mathtt{R} = \mathtt{TP} / (\mathtt{TP} + \mathtt{FN})$, and derive the F1 score as the harmonic mean of Precision and Recall, i.e., $\mathtt{F1} = 2 \cdot \mathtt{P} \cdot \mathtt{R} / (\mathtt{P} + \mathtt{R})$.

(2) Monetary Cost. We evaluate an approach by considering its incurred monetary cost, which consists of two parts.

  • API Cost measures how much an approach pays for calling the API of a proprietary LLM (e.g., GPT-3.5 or GPT-4). In particular, the API is priced per token. For example, according to the pricing of the GPT API services (https://openai.com/pricing), GPT-4 incurs $0.01 per 1K tokens for input text.

  • Labeling Cost measures how much an approach pays for labeling entity pairs to prepare demonstrations. To calculate the cost, we refer to the latest rates for text data labeling on the crowdsourcing platform Amazon Mechanical Turk (AMT, https://www.mturk.com/), which is $0.08 per labeling task. Following the existing crowdsourcing approach to ER [36], we group ten entity pairs into one labeling task and ask the crowd to label them in batch. Based on this, we estimate the cost of labeling one entity pair as $0.008.
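The two evaluation metrics above can be sketched in a few lines of Python. The function names and the convention of returning 0.0 when a denominator is empty are assumptions made for illustration. The prices are the ones quoted in the text and are passed as parameters, since API and crowdsourcing rates change over time.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from the counts of true positives, false positives, false negatives."""
    # Returning 0.0 on an empty denominator is a convention assumed here.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def api_cost(num_tokens, usd_per_1k_tokens):
    """API cost in USD for a given number of tokens at a per-1K-token price."""
    return num_tokens / 1000 * usd_per_1k_tokens


def labeling_cost(num_pairs, usd_per_task=0.08, pairs_per_task=10):
    """Labeling cost in USD: $0.08 per task of ten pairs, i.e. $0.008 per pair."""
    return num_pairs * usd_per_task / pairs_per_task
```

For instance, under these rates labeling 8 demonstrations costs about $0.064, and 100K input tokens at $0.01 per 1K tokens cost about $1.00.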


Baselines. We consider two types of baselines. The first type is the SOTA PLM-based approaches to ER, including Ditto [1], JointBert [2] and RobEM [3]. The other type is the LLM-based approaches [11] to ER via in-context learning, equipped with manually designed prompts. We briefly describe the methods.

(1) Ditto [1] is a well-recognized PLM-based approach to ER, which utilizes the pre-trained language model RoBERTa [29] and employs labeled entity pairs for fine-tuning. We use the code and default settings of Ditto from its original paper [1].

(2) JointBert [2] is a dual-objective training method for BERT that combines binary matching and multi-class classification for entity matching. We use the code provided in [37]. We select the uncased base version of BERT for JointBert and keep all hyper-parameters at the defaults of the original paper.

(3) RobEM [3] is a recent work that investigates the robustness of PLM-based ER methods under varying data distributions and identifies data imbalance as a critical issue. To solve this, it proposes simple yet effective modifications to enhance PLMs and achieves superior performance on ER. We run its original code from [38] and keep all the settings at their defaults.

(4) ManualPrompt [11] is a pioneering initiative that uses LLMs (GPT-3) for ER as well as other data wrangling tasks. Similar to our work, it also employs in-context learning to answer ER questions. However, the key difference is that ManualPrompt utilizes standard prompting (i.e., asking questions one by one) and manually designed demonstrations. We reproduce the results of ManualPrompt by using the prompts published by its original paper [11].


Implementation Details. We briefly present the implementation details of our proposed framework as follows.

(1) Batch Prompting. We implement the design choices in Table I for question batching and demonstration selection, and compare them on both matching accuracy and monetary cost. For question batching, we set the batch size to 8, which ensures that none of the design choices exceeds the maximum token limit of the LLMs’ text input, and employ the DBSCAN algorithm [27] for question clustering. For fair comparison of demonstration selection strategies (i.e., fixed, Top$k$-$\mathsf{batch}$ and Top$k$-$\mathsf{question}$), we choose 8 demonstrations for each batch. For our covering-based strategy, we calculate the threshold $t$ by first computing the distances between all questions and then taking the 8-th percentile as $t$, since the 8-th percentile achieves a good balance between cost and accuracy: with a smaller $t$, the labeling cost becomes larger, while a larger $t$ degrades the matching accuracy.

(2) Large Language Models. In our experiments, we use GPT-3.5-turbo-0301, or GPT-3.5-03 for short, as the default LLM, where 0301 means that the model version was finalized on March 1st. In particular, according to the guideline of OpenAI (https://platform.openai.com/docs/api-reference/completions), we set the temperature parameter of GPT-3.5-03 to 0.01. Moreover, we also investigate other proprietary LLMs, GPT-3.5-turbo-0613 (or GPT-3.5-06 for short) and GPT-4-1106-preview (or GPT-4 for short), as well as a very recent open-source LLM, Llama2-chat-70B [39].
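The percentile-based choice of the covering threshold $t$ described in item (1) can be sketched as follows. This is a minimal illustration under stated assumptions: questions are represented as numeric feature vectors, the distance is Euclidean, and the percentile is computed by linear interpolation over all pairwise distances. The function name `distance_threshold` is hypothetical, and the actual distance function and percentile convention used in BatchER may differ.

```python
import math
import statistics
from itertools import combinations


def distance_threshold(vectors, percentile=8):
    """Return the given percentile (1..99) of all pairwise distances between vectors."""
    # Euclidean distance is an assumption made for this sketch.
    dists = [math.dist(u, v) for u, v in combinations(vectors, 2)]
    # quantiles(..., n=100) returns the 99 cut points between percentiles;
    # "inclusive" interpolates linearly between the observed values.
    cuts = statistics.quantiles(dists, n=100, method="inclusive")
    return cuts[percentile - 1]
```

A smaller percentile yields a tighter threshold, so each demonstration covers fewer questions and more demonstrations must be labeled, which matches the trade-off described above.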

VI-B Comparing Batch Prompting with Standard Prompting

Exp-1: How does Batch Prompting compare with Standard Prompting? We conduct experiments to compare batch prompting with standard prompting on matching accuracy and monetary cost. For fair comparison, we use the same 8 fixed demonstrations, which are selected randomly, for both approaches. In this case, we only need to consider the API cost, as the labeling costs of both approaches are the same. Moreover, we run the experiments three times, and compute the mean and standard deviation of the obtained F1 scores.

The experimental results are reported in Table III. We can see that batch prompting significantly outperforms standard prompting on both accuracy and cost. First, batch prompting improves the F1 score by 1.3%-30.6% on all datasets except Beer. The reason that batch prompting performs worse than standard prompting on the Beer dataset is that the dataset is very small (with only 91 pairs for testing), and the two methods actually output very similar matching results. Moreover, we can also observe that batch prompting is more stable than standard prompting, i.e., it achieves a much smaller standard deviation. Second, compared with standard prompting, batch prompting achieves 4x-7x cost savings on API calls.
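As a quick arithmetic check of the ranges quoted above, the snippet below recomputes the F1 improvement and the API cost saving per dataset from the mean values in Table III. It assumes that the reported improvement is relative (i.e., the F1 difference divided by the standard-prompting F1). The names `relative_f1_gain` and `cost_saving` are introduced only for this illustration.

```python
# (dataset, standard F1, batch F1, standard API cost in $, batch API cost in $),
# copied from the mean values reported in Table III.
TABLE_III = [
    ("WA", 67.54, 78.92, 1.43, 0.33),
    ("AB", 65.70, 85.79, 1.10, 0.24),
    ("AG", 53.72, 61.07, 1.29, 0.29),
    ("DS", 75.08, 80.79, 5.31, 1.22),
    ("DA", 85.96, 92.10, 2.93, 0.63),
    ("FZ", 89.95, 94.13, 0.19, 0.04),
    ("IA", 90.59, 91.75, 0.06, 0.01),
    ("Beer", 91.11, 88.31, 0.07, 0.01),
]


def relative_f1_gain(std_f1, batch_f1):
    """Relative F1 improvement of batch over standard prompting, in percent."""
    return (batch_f1 - std_f1) / std_f1 * 100


def cost_saving(std_cost, batch_cost):
    """How many times cheaper batch prompting is than standard prompting."""
    return std_cost / batch_cost


gains = {name: relative_f1_gain(s, b) for name, s, b, _, _ in TABLE_III}
savings = {name: cost_saving(sc, bc) for name, _, _, sc, bc in TABLE_III}
```

Under this reading, the smallest positive gain is on IA (about 1.3%) and the largest on AB (about 30.6%), while the cost savings range from roughly 4.3x (WA) to 7x (Beer).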

TABLE III: Comparing Batch Prompting with Standard Prompting on Matching Accuracy and API Cost (The best results are bolded).
Dataset  Metric  Standard Prompting  Batch Prompting
WA  F1  $67.54_{\pm 8.08}$  $\mathbf{78.92}_{\pm 0.32}$
WA  API ($)  1.43  $\mathbf{0.33}$
AB  F1  $65.70_{\pm 10.81}$  $\mathbf{85.79}_{\pm 1.01}$
AB  API ($)  1.10  $\mathbf{0.24}$
AG  F1  $53.72_{\pm 3.88}$  $\mathbf{61.07}_{\pm 0.83}$
AG  API ($)  1.29  $\mathbf{0.29}$
DS  F1  $75.08_{\pm 6.03}$  $\mathbf{80.79}_{\pm 1.72}$
DS  API ($)  5.31  $\mathbf{1.22}$
DA  F1  $85.96_{\pm 4.45}$  $\mathbf{92.10}_{\pm 0.88}$
DA  API ($)  2.93  $\mathbf{0.63}$
FZ  F1  $89.95_{\pm 3.67}$  $\mathbf{94.13}_{\pm 1.11}$
FZ  API ($)  0.19  $\mathbf{0.04}$
IA  F1  $90.59_{\pm 0.94}$  $\mathbf{91.75}_{\pm 0.84}$
IA  API ($)  0.06  $\mathbf{0.01}$
Beer  F1  $\mathbf{91.11}_{\pm 2.22}$  $88.31_{\pm 2.60}$
Beer  API ($)  0.07  $\mathbf{0.01}$
Figure 6: Comparing Batch Prompting and Standard Prompting on Recall, Precision and F1, where the two methods are denoted as “Standard” and “Batch” respectively. Panels: (a) the WA dataset; (b) the AB dataset.

While it is intuitive that batch prompting can save cost, it is somewhat surprising that it can also significantly improve accuracy. Thus, we conduct a detailed analysis and report Precision and Recall on the WA and AB datasets, as shown in Figure 6. We can see that batch prompting achieves much higher Precision than standard prompting, while their Recall scores are comparable. We mainly attribute this to the batching mechanism, where the LLM can refer to not only the provided demonstrations, but also the answers generated for previous questions within the same batch. This may help the LLM to identify key characteristics that are useful for differentiating entities. For example, on the WA dataset, batch prompting can help the LLM focus on a critical attribute, “modelno”, and understand that entity pairs with different “modelno” values tend to be non-matching.
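To illustrate the batching mechanism discussed above, the following sketch assembles a single prompt that contains the task description, the shared demonstrations, and all questions of a batch, so that the LLM answers them in one response. The prompt wording, the serialization format and the function names are hypothetical. They are not the prompts used in our experiments.

```python
def serialize(pair):
    """Render an entity pair (two attribute strings) as one line of text."""
    left, right = pair
    return f"Entity A: {left} | Entity B: {right}"


def build_batch_prompt(task_description, demonstrations, questions):
    """Assemble one prompt: task description, labeled demonstrations, then all questions."""
    lines = [task_description, ""]
    for i, (pair, is_match) in enumerate(demonstrations, 1):
        lines.append(f"Demonstration {i}: {serialize(pair)}")
        lines.append(f"Answer: {'Yes' if is_match else 'No'}")
    lines.append("")
    for i, pair in enumerate(questions, 1):
        lines.append(f"Question {i}: {serialize(pair)}")
    lines.append("")
    lines.append(
        f"Answer each of the {len(questions)} questions with Yes or No, one per line."
    )
    return "\n".join(lines)
```

Because the task description and the demonstrations appear once per batch rather than once per question, the number of input tokens per question shrinks as the batch grows, which is where the API cost saving comes from.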

Finding 1: Batch prompting can not only bring 4x-7x cost saving, but also achieve higher and more stable matching accuracy than standard prompting.

VI-C Exploring Design Space of Batch Prompting for ER

TABLE IV: Exploring the Design Space of Three Question Batching Methods and Four Demonstration Selection Methods (The best results are bolded and the second best results are underlined).
Dataset Metric Random Question Batching Similarity-based Question Batching Diversity-based Question Batching
Fix  Top$k$-$\mathsf{batch}$  Top$k$-$\mathsf{question}$  Cover  Fix  Top$k$-$\mathsf{batch}$  Top$k$-$\mathsf{question}$  Cover  Fix  Top$k$-$\mathsf{batch}$  Top$k$-$\mathsf{question}$  Cover
WA F1 78.92 79.15 79.06 78.64 73.50 77.43 78.30 76.43 79.24 78.87 80.18 80.66
API ($) 0.33 0.34 0.35 0.30 0.34 0.34 0.35 0.24 0.35 0.34 0.34 0.28
Label ($) 0.06 11.53 12.63 0.34 0.06 14.15 12.63 0.34 0.06 13.30 12.63 0.34
AB F1 85.79 86.24 86.79 85.71 85.19 85.65 87.02 87.16 85.03 86.38 87.91 88.38
API ($) 0.24 0.23 0.24 0.21 0.24 0.23 0.24 0.20 0.24 0.23 0.24 0.20
Label ($) 0.06 10.86 6.07 0.28 0.06 10.86 6.07 0.28 0.06 11.21 6.07 0.28
AG F1 61.07 61.82 61.90 60.69 58.90 60.74 60.96 60.62 60.24 57.85 64.57 62.16
API ($) 0.29 0.30 0.30 0.25 0.30 0.30 0.30 0.25 0.29 0.30 0.30 0.25
Label ($) 0.06 14.20 9.70 0.23 0.06 14.09 9.70 0.23 0.06 13.84 9.69 0.23
DS F1 80.79 82.49 83.55 82.36 76.44 73.78 77.09 75.59 79.07 79.80 83.46 83.70
API ($) 1.22 1.27 1.28 1.13 1.31 1.27 1.29 1.04 1.27 1.15 1.28 1.12
Label ($) 0.06 35.38 27.94 0.31 0.06 35.92 28.24 0.31 0.06 35.96 28.24 0.31
DA F1 92.10 93.00 93.62 92.32 91.59 92.42 92.44 92.06 92.27 94.21 94.28 94.96
API ($) 0.63 0.62 0.63 0.54 0.62 0.62 0.63 0.50 0.62 0.62 0.63 0.53
Label ($) 0.06 15.50 14.61 0.32 0.06 15.50 14.61 0.32 0.06 15.09 14.61 0.32
FZ F1 94.13 93.33 95.24 93.33 95.24 90.48 93.02 92.68 93.02 88.37 95.24 100.00
API ($) 0.04 0.04 0.03 0.03 0.04 0.04 0.04 0.03 0.04 0.04 0.04 0.03
Label ($) 0.06 1.18 1.27 0.30 0.06 1.25 1.32 0.30 0.06 1.18 1.27 0.30
IA F1 91.75 94.74 94.55 92.59 92.59 94.34 96.30 92.86 88.00 94.55 98.17 96.43
API ($) 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
Label ($) 0.06 0.60 0.56 0.16 0.06 0.69 0.56 0.16 0.06 0.42 0.56 0.16
Beer F1 88.31 76.92 81.48 89.66 85.71 84.62 81.48 88.89 92.86 89.66 89.66 96.55
API ($) 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
Label ($) 0.06 0.65 0.66 0.14 0.06 0.68 0.66 0.14 0.06 0.64 0.62 0.14

Exp-2: What are effective strategies in our design space of question batching and demonstration selection? We explore the design space shown in Table I by comparing the 12 combinations of three question batching methods and four demonstration selection methods. From the experimental results reported in Table IV, we have the following observations.


Evaluation on question batching. As reported in Table IV, diversity-based question batching achieves the highest overall F1 scores. Moreover, it is interesting to see that similarity-based question batching performs the worst on matching accuracy, even achieving lower F1 scores than random question batching. This is because the questions within a batch are very similar, which makes it difficult for the LLM to differentiate entities by comparing different questions. Consequently, the LLM tends to produce identical answers for different questions, leading to a degradation of matching accuracy. On the other hand, we can see that, for a given demonstration selection method, the different question batching strategies have similar API and labeling costs. The reason is straightforward: the prompts of different question batching strategies contain similar numbers of tokens.


Evaluation on demonstration selection. Observing Table IV again, we can see that Top$k$-$\mathsf{question}$ and our covering-based strategy (denoted as Cover) outperform the other strategies on accuracy, while the F1 scores of these two strategies are comparable. For example, under diversity-based batching, Top$k$-$\mathsf{question}$ yields the highest F1 score on 2 datasets, while Cover is the best on the remaining 6 datasets. This is because both Top$k$-$\mathsf{question}$ and Cover aim to select relevant demonstrations for each individual question within a batch, which helps the LLM to understand varying cases of ER.

On the other hand, Cover is much more cost-effective than Top$k$-$\mathsf{question}$ on demonstration labeling, e.g., it brings 10x-100x labeling cost savings on the first five (large) datasets and about 5x savings on the last three (small) datasets. The results validate the effectiveness of our covering-based mechanism: by selecting a minimal set of demonstrations that cover all questions in a batch, we can significantly reduce the number of required demonstrations, and thus save labeling cost.

Finding 2: The design choice that combines Diversity-based Question Batching and our Covering-based Demonstration Selection is the most favorable, i.e., achieving the highest accuracy while incurring the lowest cost.

VI-D Comparing with PLM-based Approaches to ER

Figure 7: Comparing our Batch Prompting framework BatchER with existing PLM-based approaches to ER. Panels: (a) WA, (b) AB, (c) AG, (d) DS, (e) DA, (f) FZ, (g) IA, (h) Beer.

Exp-3: How does our BatchER framework compare with PLM-based approaches to ER? We compare our framework with the PLM-based approaches mentioned in Section VI-A, by varying the size of training set for these approaches. Note that we use the best design choices shown in Table IV, i.e., Diversity-based Question Batching and Covering-based Demonstration Selection, as the default setting.

Figure 7 shows the experimental results on the eight datasets, where the results of our framework are represented as red solid lines. Not surprisingly, our framework is much more cost-effective than Ditto [1], JointBert [2] and RobEM [3]. For example, on the WA, AB and AG datasets, the three PLM-based methods require at least 2000 training samples to achieve an F1 score similar to that of our framework. In contrast, our framework requires no more than 50 labeled samples on all the datasets. According to our cost calculation method in Section VI-A, the monetary cost incurred by these PLM-based approaches is about 300x-400x larger than our overall cost (i.e., API cost plus labeling cost). Furthermore, we also observe that once models like RobEM catch up with the F1 score of our framework, additional training samples do not substantially increase their performance; on some datasets (e.g., FZ, IA and Beer), even the entire training set is insufficient for the baselines to reach the F1 score of our framework.

Finding 3: With much less labeled data, our batch prompting framework achieves performance competitive with PLM-based methods trained with hundreds or even thousands of labeled matching/non-matching entity pairs.

VI-E Comparing with Manual Prompting for ER

TABLE V: Comparing Batch Prompting with Manual Prompting (The best results are bolded).
Dataset  Metric  Manual Prompting  Batch Prompting
WA F1 82.63 80.66
API ($) 1.40 0.28
AG F1 65.40 62.16
API ($) 1.65 0.25
DS F1 70.44 83.70
API ($) 5.87 1.12
DA F1 94.90 94.96
API ($) 2.65 0.53
FZ F1 97.67 100
API ($) 0.14 0.03
IA F1 98.11 96.43
API ($) 0.05 0.01
Beer F1 92.23 96.55
API ($) 0.05 0.01

Exp-4: How does our BatchER framework compare with LLM-based approaches to ER? We compare our framework with the existing LLM-based approach [11], which is equipped with manually designed prompts, including hand-picked demonstrations. The results are reported in Table V. The Abt-Buy dataset is absent from Table V because the ManualPrompt approach [11] was not tested on this dataset. We can see that, with only about 20% of the API cost, our batch prompting framework achieves F1 scores comparable to those of ManualPrompt. In particular, on four datasets (DS, DA, FZ, Beer), our framework even outperforms ManualPrompt. The results imply that batch prompting, despite incurring the cost of labeling the selected demonstrations, may still be more practical than ManualPrompt, which requires domain experts for prompt design.

Finding 4: Our automatic batch prompting framework achieves F1 scores comparable to or even better than those of manual prompting methods for LLMs, with much less API cost.

VI-F Evaluation on Different Underlying LLMs

TABLE VI: Evaluating Different Underlying LLMs on Matching Accuracy and API Cost (The best results are bolded and the second best results are underlined).
Dataset  Metric  GPT-3.5-03  GPT-3.5-06  GPT-4
WA F1 80.66 80.32 81.22
API ($) 0.28 0.28 2.81
AB F1 88.38 69.08 85.22
API ($) 0.20 0.20 2.02
AG F1 62.16 52.40 64.06
API ($) 0.25 0.25 2.52
DS F1 83.70 65.94 89.48
API ($) 1.12 1.12 11.24
DA F1 94.96 91.29 96.04
API ($) 0.53 0.53 5.27
FZ F1 100.00 92.68 100.00
API ($) 0.03 0.03 0.32
IA F1 96.43 92.31 94.34
API ($) 0.01 0.01 0.09
Beer F1 96.55 92.31 96.30
API ($) 0.01 0.01 0.11

Exp-5: What is the performance of our approach given various underlying LLMs? We evaluate the performance of BatchER on various underlying LLMs, including two versions of GPT-3.5 as well as GPT-4, which are mentioned in Section VI-A. Note that we also evaluate the well-known open-source LLM, Llama2 [40]. However, we find that Llama2 is not suitable for batch prompting: when prompted to answer multiple questions, Llama2 fails to produce any output in most cases. Thus, we omit the results of Llama2.

The experimental results are shown in Table VI. First, considering matching accuracy, GPT-4 achieves the best results on five datasets, demonstrating its superior capability in text comprehension and task solving. Moreover, we also find that GPT-3.5-03 is comparable to GPT-4. Specifically, GPT-3.5-03 achieves the second highest F1 overall, and its largest F1 difference from GPT-4 is less than 6.4%. Second, as per the latest pricing, the token price of GPT-4 is 10x higher than that of GPT-3.5, leading to considerably higher API costs. To summarize, the results show that GPT-3.5-03 achieves the best trade-off between matching accuracy and monetary cost, making it a more favorable choice for practical applications.

Finding 5: As the underlying LLM of BatchER, GPT-3.5-03 achieves the best trade-off between matching accuracy and monetary cost.

VI-G Evaluation on Different Feature Extractors

TABLE VII: Evaluating Different Feature Extractors on Matching Accuracy (The best results are bolded).
Dataset Structure-aware Semantics-based
BatchER-LR BatchER-JAC BatchER-SEM
WA 80.66 78.05 78.66
AB 88.38 84.23 87.06
AG 62.16 59.90 59.20
DS 83.70 81.27 80.91
DA 94.96 92.70 90.36
FZ 100.00 93.62 95.24
IA 96.43 90.57 90.91
Beer 96.55 89.66 91.67

Exp-6: What is the performance of our approach given different feature extractors? We examine the performance of BatchER using the different feature extractors described in Section III-B, namely BatchER-LR, BatchER-JAC, and BatchER-SEM. The first two variants use the Structure-aware Feature Extractor, based on Levenshtein Ratio (LR) and Jaccard Similarity (JAC), respectively. The last one uses the Semantics-based Feature Extractor, based on SBERT embeddings. Since their monetary costs are close, we only compare these three variants on F1 scores on the eight datasets.

As shown in Table VII, BatchER-LR achieves the best performance on all the datasets, while BatchER-JAC and BatchER-SEM achieve comparable results. This result validates that the structure-aware feature extractor can better capture the relevance between entity pairs in the ER scenario. Moreover, compared with BatchER-JAC, BatchER-LR is more sensitive to string order and more precise in quantifying the similarity between two strings. For instance, considering the two strings “listen” and “silent”, the similarity score calculated using LR is 0.5, whereas with JAC it is 0.89. This shows that LR better reflects the difference between the two strings, and is thus more effective for generating feature vectors for entity pairs.
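The order sensitivity of LR can be reproduced with a short sketch. It assumes the common definition of the Levenshtein ratio in which insertions and deletions cost 1 and a substitution costs 2, so that the ratio equals (|a| + |b| - distance) / (|a| + |b|). For contrast, it also computes a Jaccard similarity over character sets. This character-set tokenization is an assumption of the sketch: the tokenization behind the 0.89 figure is not specified in the text, so the sketch does not attempt to reproduce that value. All function names are hypothetical.

```python
def indel_distance(a, b):
    """Edit distance where insert/delete cost 1 and a substitution costs 2."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            substitute = prev[j - 1] + (0 if ca == cb else 2)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitute))
        prev = cur
    return prev[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1]; sensitive to the order of characters."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else (total - indel_distance(a, b)) / total


def char_jaccard(a, b):
    """Jaccard similarity over character sets (assumed tokenization); ignores order."""
    sa, sb = set(a), set(b)
    return 1.0 if not (sa | sb) else len(sa & sb) / len(sa | sb)
```

With these definitions, `levenshtein_ratio("listen", "silent")` evaluates to 0.5, matching the LR value above, while the character-set Jaccard of the two anagrams is 1.0 because a set-based measure cannot see character order.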

Finding 6: The structure-aware feature extractor is preferred for measuring distances among entity pairs in ER.

VII Related Work

PLM-based Methods for Entity Resolution. Entity resolution is a popular data integration task that has been widely studied for decades. With the rise of deep learning, some approaches [41] leverage pre-trained word embeddings to improve ER performance. However, these methods mainly use non-contextual embeddings without considering the downstream tasks. Therefore, recent studies [1, 2, 4, 5] have focused on using Transformer-based PLMs to produce contextualized embeddings by fine-tuning on downstream tasks. To be specific, Ditto [1] regards ER as a sequence-pair classification problem solved with a Transformer, where domain knowledge is injected to further improve the performance. JointBERT [2] adopts a dual-objective training paradigm for BERT. Specifically, besides predicting matching/non-matching pairs, JointBERT also incorporates a multi-class classification task to predict the entity identifier for each entity description of a pair. DADER [5] focuses on leveraging domain adaptation: given a labeled source dataset, it trains an ER model for another target dataset by aligning the features of both datasets based on PLMs. Also based on PLMs, Unicorn [4] focuses on building a unified framework for data matching tasks, including ER. Unicorn uses a unified encoder for any pair of data to be predicted, and a mixture-of-experts module to align the semantics of multiple tasks. Although the above PLM-based approaches can achieve relatively good performance, they need plenty of labeled pairs for supervision, which are often expensive to acquire.


LLM-based Methods for Entity Resolution. As the size of pre-training data and the number of model parameters scale up, large language models (LLMs) have gained an emergent capability called In-Context Learning (ICL), i.e., learning from a few demonstrations without explicit model updates [6, 42]. Recent studies [11, 12, 26] have focused on utilizing LLMs to tackle ER with fewer labeled pairs for supervision. Narayan et al. [11] are among the first to explore the capability of GPT-3 [6] for ER with manually designed demonstrations, which achieves remarkable performance compared with PLM-based methods. Since manual demonstrations require professional prompt engineering knowledge, Peeters et al. [12] propose to select relevant demonstrations based on a KNN retrieval algorithm, where Jaccard similarity is utilized to measure the relevance. Moreover, Zhang et al. [26] consider batch prompting for ER, which employs a straightforward random batching strategy with manually designed demonstrations. Although question batching and demonstration selection have been considered in existing studies, these studies mainly rely on domain experts or develop heuristics for these two problems, and have not explored the combination of different demonstration selection and batching strategies. Compared to them, we utilize the power of ICL and propose a comprehensive framework BatchER. We explore a design space to evaluate the performance of different design choices, and propose a covering-based demonstration selection strategy that effectively balances the trade-off between accuracy and cost.


In-Context Learning for Data Management. LLMs are capable of capturing rich linguistic patterns and generating coherent text [43, 6, 39], and have shown great success in a wide range of NLP tasks [22, 16, 15]. ICL is an emergent capability of LLMs that enables the model to learn from a few demonstrations without explicit gradient updates [42]. Recently, researchers have studied how to leverage ICL to solve data management tasks, such as data discovery [44], data cleaning and integration [11], and data labeling [45], as well as how to batch questions and select demonstrations. BatchPrompt [23] proposes to group multiple questions into one batch and query the LLM to answer the whole batch in a single inference. In addition, both relevance-based [31, 46] and diversity-based [47, 48] strategies are proposed for demonstration selection. Compared with these studies, as far as we know, we are the first to develop a batch prompting technique tailored to the ER task, and to design new methods, such as covering-based demonstration selection and structure-aware feature extraction, which are shown to be effective for ER.

VIII Conclusion

In this paper, we have introduced a cost-effective batch prompting framework BatchER for entity resolution, and explored the effectiveness of BatchER under different design choices. We also devised a covering-based demonstration selection strategy that achieves an effective balance between accuracy and cost. We have conducted extensive experiments to evaluate different combinations of the choices in the design space, and summarized the resulting empirical insights as the six findings in Section VI. These findings imply that our BatchER framework is very cost-effective for ER, compared with not only PLM-based methods fine-tuned with extensive labeled data, but also LLM-based methods with manually designed prompting. We also provided guidance for selecting appropriate design choices for batch prompting.

References

  • [1] Y. Li, J. Li, Y. Suhara, A. Doan, and W. Tan, “Deep entity matching with pre-trained language models,” Proc. VLDB Endow., vol. 14, no. 1, pp. 50–60, 2020.
  • [2] R. Peeters and C. Bizer, “Dual-objective fine-tuning of bert for entity matching,” Proc. VLDB Endow., vol. 14, no. 10, pp. 1913–1921, 2021.
  • [3] M. Akbarian Rastaghi, E. Kamalloo, and D. Rafiei, “Probing the robustness of pre-trained language models for entity matching,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 3786–3790.
  • [4] J. Tu, J. Fan, N. Tang, P. Wang, G. Li, X. Du, X. Jia, and S. Gao, “Unicorn: A unified multi-tasking model for supporting matching tasks in data integration,” Proceedings of the ACM on Management of Data, vol. 1, no. 1, pp. 1–26, 2023.
  • [5] J. Tu, J. Fan, N. Tang, P. Wang, C. Chai, G. Li, R. Fan, and X. Du, “Domain adaptation for deep entity resolution,” in Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 443–457.
  • [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in NeurIPS 2020, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
  • [7] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?” arXiv preprint arXiv:2202.12837, 2022.
  • [8] J. Chen, L. Chen, and T. Zhou, “It takes one to tango but more make trouble? in-context training with different number of demonstrations,” arXiv preprint arXiv:2303.08119, 2023.
  • [9] L. Gao, A. Chaudhary, K. Srinivasan, K. Hashimoto, K. Raman, and M. Bendersky, “Ambiguity-aware in-context learning with large language models,” arXiv preprint arXiv:2309.07900, 2023.
  • [10] X. Wang, Y. Wang, C. Xu, X. Geng, B. Zhang, C. Tao, F. Rudzicz, R. E. Mercer, and D. Jiang, “Investigating the learning behaviour of in-context learning: A comparison with supervised learning,” arXiv preprint arXiv:2307.15411, 2023.
  • [11] A. Narayan, I. Chami, L. J. Orr, and C. Ré, “Can foundation models wrangle your data?” Proc. VLDB Endow., vol. 16, no. 4, pp. 738–746, 2022. [Online]. Available: https://www.vldb.org/pvldb/vol16/p738-narayan.pdf
  • [12] R. Peeters and C. Bizer, “Entity matching using large language models,” CoRR, vol. abs/2310.11244, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.11244
  • [13] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022. Association for Computational Linguistics, 2022, pp. 2655–2671. [Online]. Available: https://doi.org/10.18653/v1/2022.naacl-main.191
  • [14] Y. Zhang, K. Zhou, and Z. Liu, “What makes good examples for visual in-context learning?” arXiv preprint arXiv:2301.13670, 2023.
  • [15] X. Li, K. Lv, H. Yan, T. Lin, W. Zhu, Y. Ni, G. Xie, X. Wang, and X. Qiu, “Unified demonstration retriever for in-context learning,” arXiv preprint arXiv:2305.04320, 2023.
  • [16] S. Agrawal, C. Zhou, M. Lewis, L. Zettlemoyer, and M. Ghazvininejad, “In-context examples selection for machine translation,” in ACL, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational Linguistics, 2023, pp. 8857–8873. [Online]. Available: https://doi.org/10.18653/v1/2023.findings-acl.564
  • [17] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas, “Blocking and filtering techniques for entity resolution: A survey,” ACM Computing Surveys (CSUR), vol. 53, no. 2, pp. 1–42, 2020.
  • [18] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, and A. Doan, “Deep learning for blocking in entity matching: a design space exploration,” Proceedings of the VLDB Endowment, vol. 14, no. 11, pp. 2459–2472, 2021.
  • [19] C. Ge, P. Wang, L. Chen, X. Liu, B. Zheng, and Y. Gao, “Collaborem: a self-supervised entity matching framework using multi-features collaboration,” IEEE Transactions on Knowledge and Data Engineering, 2021.
  • [20] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity,” arXiv preprint arXiv:2104.08786, 2021.
  • [21] Y. Chen, C. Zhao, Z. Yu, K. McKeown, and H. He, “On the relation between sensitivity and accuracy in in-context learning,” arXiv preprint arXiv:2209.07661, 2022.
  • [22] Z. Wan, F. Cheng, Z. Mao, Q. Liu, H. Song, J. Li, and S. Kurohashi, “Gpt-re: In-context learning for relation extraction using large language models,” arXiv preprint arXiv:2305.02105, 2023.
  • [23] Z. Cheng, J. Kasai, and T. Yu, “Batch prompting: Efficient inference with large language model apis,” arXiv preprint arXiv:2301.08721, 2023.
  • [24] M. Luo, X. Xu, Z. Dai, P. Pasupat, M. Kazemi, C. Baral, V. Imbrasaite, and V. Y. Zhao, “Dr. icl: Demonstration-retrieved in-context learning,” arXiv preprint arXiv:2305.14128, 2023.
  • [25] K. Margatina, T. Schick, N. Aletras, and J. Dwivedi-Yu, “Active learning principles for in-context learning with large language models,” arXiv preprint arXiv:2305.14264, 2023.
  • [26] H. Zhang, Y. Dong, C. Xiao, and M. Oyamada, “Large language models as data preprocessors,” arXiv preprint arXiv:2308.16361, 2023.
  • [27] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, E. Simoudis, J. Han, and U. M. Fayyad, Eds., 1996, pp. 226–231.
  • [28] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019.
  • [29] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [30] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet Physics Doklady, vol. 10, no. 8.   Soviet Union, 1966, pp. 707–710.
  • [31] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, “What makes good in-context examples for gpt-3?” in Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, DeeLIO@ACL 2022, Dublin, Ireland and Online, May 27, 2022.   Association for Computational Linguistics, 2022, pp. 100–114. [Online]. Available: https://doi.org/10.18653/v1/2022.deelio-1.10
  • [32] B. Korte and J. Vygen, “Combinatorial optimization: Theory and algorithms,” Springer, Third Edition, 2005, 2008.
  • [33] P. Slavík, “A tight analysis of the greedy algorithm for set cover,” in Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 1996, pp. 435–441.
  • [34] A. Doan, P. Konda, P. S. G. C., Y. Govind, D. Paulsen, K. Chandrasekhar, P. Martinkus, and M. Christie, “Magellan: toward building ecosystems of entity matching solutions,” Commun. ACM, vol. 63, no. 8, pp. 83–91, 2020. [Online]. Available: https://doi.org/10.1145/3405476
  • [35] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra, “Deep learning for entity matching: A design space exploration,” in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 19–34.
  • [36] J. Wang, T. Kraska, M. J. Franklin, and J. Feng, “Crowder: Crowdsourcing entity resolution,” Proc. VLDB Endow., vol. 5, no. 11, pp. 1483–1494, 2012. [Online]. Available: http://vldb.org/pvldb/vol5/p1483_jiannanwang_vldb2012.pdf
  • [37] (2021) Code of jointbert. [Online]. Available: https://github.com/wbsg-uni-mannheim/jointbert
  • [38] (2022) Code of robem. [Online]. Available: https://github.com/makbn/robem
  • [39] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [40] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned chat models,” CoRR, vol. abs/2307.09288, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2307.09288
  • [41] M. Ebraheem, S. Thirumuruganathan, S. R. Joty, M. Ouzzani, and N. Tang, “Distributed representations of tuples for entity resolution,” Proc. VLDB Endow., vol. 11, no. 11, pp. 1454–1467, 2018. [Online]. Available: http://www.vldb.org/pvldb/vol11/p1454-ebraheem.pdf
  • [42] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
  • [43] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag, “Large language models are few-shot clinical information extractors,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 1998–2022.
  • [44] M. Kayali, A. Lykov, I. Fountalis, N. Vasiloglou, D. Olteanu, and D. Suciu, “CHORUS: foundation models for unified data discovery and exploration,” CoRR, vol. abs/2306.09610, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.09610
  • [45] N. Guan, K. Chen, and N. Koudas, “Can large language models design accurate label functions?” CoRR, vol. abs/2311.00739, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.00739
  • [46] Y. Lee, C. Lim, and H. Choi, “Does GPT-3 generate empathetic dialogues? A novel in-context example selection method and automatic evaluation metric for empathetic dialogue generation,” in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022.   International Committee on Computational Linguistics, 2022, pp. 669–683. [Online]. Available: https://aclanthology.org/2022.coling-1.56
  • [47] I. Levy, B. Bogin, and J. Berant, “Diverse demonstrations improve in-context compositional generalization,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023.   Association for Computational Linguistics, 2023, pp. 1401–1422. [Online]. Available: https://doi.org/10.18653/v1/2023.acl-long.78
  • [48] H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin, R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith, and T. Yu, “Selective annotation makes language models better few-shot learners,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.   OpenReview.net, 2023. [Online]. Available: https://openreview.net/pdf?id=qY1hlv7gwg