
SetCSE: Set Operations using Contrastive
Learning of Sentence Embeddings

Kang Liu
Independent Researcher
[email protected]
Abstract

Taking inspiration from Set Theory, we introduce SetCSE, an innovative information retrieval framework. SetCSE employs sets to represent complex semantics and incorporates well-defined operations for structured information querying under the provided context. Within this framework, we introduce an inter-set contrastive learning objective to enhance comprehension of sentence embedding models concerning the given semantics. Furthermore, we present a suite of operations, including SetCSE intersection, difference, and operation series, that leverage sentence embeddings of the enhanced model for complex sentence retrieval tasks. Throughout this paper, we demonstrate that SetCSE adheres to the conventions of human language expressions regarding compounded semantics, provides a significant enhancement in the discriminatory capability of underlying sentence embedding models, and enables numerous information retrieval tasks involving convoluted and intricate prompts which cannot be achieved using existing querying methods.

1 Introduction

Recent advancements in universal sentence embedding models (Lin et al., 2017; Chen et al., 2018; Reimers & Gurevych, 2019; Feng et al., 2020; Wang & Kuo, 2020; Gao et al., 2021; Chuang et al., 2022; Zhang et al., 2022; Muennighoff, 2022; Jiang et al., 2022) have greatly improved natural language information retrieval tasks like semantic search, fuzzy querying, and question answering (Yang et al., 2019; Shao et al., 2019; Bonial et al., 2020; Esteva et al., 2020; Sen et al., 2020). Notably, these models and solutions have been primarily designed and evaluated on the basis of single-sentence queries, prompts, or instructions. However, both within the domain of linguistic studies and in everyday communication, the expression and definition of complex or intricate semantics frequently entail the use of multiple examples and sentences “collectively” (Kreidler, 1998; Harel & Rumpe, 2004; Riemer, 2010). In order to express these semantics in a natural and comprehensive way, and to search for information in a straightforward manner based on the provided context, we propose Set Operations using Contrastive Learning of Sentence Embeddings (SetCSE), a novel query framework inspired by Set Theory (Cantor, 1874; Johnson-Laird, 2004). Within this framework, each set of sentences is presented to represent a semantic. The proposed inter-set contrastive learning empowers language models to better differentiate the provided semantics. Furthermore, the well-defined SetCSE operations provide simple syntax to query information structurally based on those sets of sentences.

Figure 1: The illustration of inter-set contrastive learning and SetCSE query framework.

As illustrated in Figure 1, the SetCSE framework contains two major steps: the first is to fine-tune sentence embedding models based on the inter-set contrastive learning objective, and the second is to retrieve sentences using SetCSE operations. Inter-set contrastive learning aims to reinforce the underlying models to learn contextual information and differentiate between the semantics conveyed by different sets. An in-depth introduction of this novel learning objective can be found in Section 3. The SetCSE operations comprise SetCSE intersection, SetCSE difference, and SetCSE operation series (as shown in Figure 1, Step 2), where the first two enable the “selection” and “deselection” of sentences based on a single criterion, and the serial operations allow for extracting sentences following complex queries. The definitions and properties of SetCSE operations can be found in Section 4.

Besides the illustration of this framework, Figure 1 also provides an example showing how SetCSE can be leveraged to analyze S&P 500 companies' stance on Environmental, Social, and Governance (ESG) issues through their public earnings calls, which can play an important role in company growth forecasting (Utz, 2019; Hong et al., 2022). The concepts of ESG are hard to convey in single sentences, which creates difficulties for extracting related information using existing sentence retrieval methods. However, utilizing the SetCSE framework, one can easily express those concepts in sets of sentences, and find information related to “using technology to solve social issues, while neglecting its potential negative impact” in simple syntax. More details on this example can be found in Section 6.

This paper presents SetCSE in detail. Particularly, we highlight the major contributions as follows:

  1. The employment of sets to represent complex semantics is in alignment with the intuition and conventions of human language expressions regarding compounded semantics.

  2. Extensive evaluations reveal that SetCSE enhances language model semantic comprehension by approximately 30% on average.

  3. Numerous real-world applications illustrate that the well-defined SetCSE framework enables complex information retrieval tasks that cannot be achieved using existing search methods.

2 Related Work

Set theory for word representations. In Computational Semantics, set theory is used to model lexical semantics of words and phrases (Blackburn & Bos, 2003; Fox, 2010). An example of this is the WordNet Synset (Fellbaum, 1998; Bird et al., 2009), where the word dog is a component of the synset {dog, domestic dog, Canis familiaris}. Formal Semantics (Cann, 1993; Partee et al., 2012) employs sets to systematically represent linguistic expressions. Furthermore, researchers have explored the use of set-theoretic operations on word embeddings to interpret the relationships between words and enhance embedding qualities (Zhelezniak et al., 2019; Bhat et al., 2020; Dasgupta et al., 2021). More details on the aforementioned work and a comparison with ours are included in Appendix A.

Sentence embedding models and contrastive learning. The sentence embedding problem is extensively studied in the area of natural language processing (Kiros et al., 2015; Hill et al., 2016; Conneau et al., 2017; Logeswaran & Lee, 2018; Cer et al., 2018; Reimers & Gurevych, 2019). Recent work has shown that fine-tuning pre-trained language models with contrastive learning objectives achieves state-of-the-art results without even using labeled data (Srivastava et al., 2014; Giorgi et al., 2020; Yan et al., 2021; Gao et al., 2021; Chuang et al., 2022; Zhang et al., 2022; Mai et al., 2022), where contrastive learning aims to learn meaningful representations by pulling semantically close embeddings together and pushing apart non-close ones (Hadsell et al., 2006; Chen et al., 2020).

3 Inter-Set Contrastive Learning

The learning objective within SetCSE aims to distinguish sentences from different semantics. Thus, we adopt the contrastive learning framework as in Chen et al. (2020), and consider the sentences from different sets as negative pairs. Let $\text{h}_m$ and $\text{h}_n$ denote the embeddings of sentences $m$ and $n$, respectively, and $\text{sim}(\text{h}_m, \text{h}_n)$ denote the cosine similarity $\frac{\text{h}_m^{\top}\text{h}_n}{\lVert\text{h}_m\rVert\cdot\lVert\text{h}_n\rVert}$. For $N$ sets $S_i$, $i = 1, \dots, N$, where each $S_i$ represents a semantic, the inter-set loss $\mathcal{L}_{\text{inter-set}}$ is defined as:

$$\mathcal{L}_{\text{inter-set}} = \sum_{i=1}^{N} \ell_i, \quad \text{where } \; \ell_i = \sum_{m \in S_i} \log\left(\sum_{n \notin S_i} e^{\text{sim}(\text{h}_m, \text{h}_n)/\tau}\right). \tag{1}$$

Specifically, $\ell_i$ denotes the inter-set loss of $S_i$ with respect to the other sets, and $\tau$ is the temperature hyperparameter.
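As an illustration of Equation 1, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes each set is given as a list of already-computed embedding vectors, and that "n not in S_i" means every embedding belonging to a set with a different index. The names `cosine` and `inter_set_loss` and the default `tau=0.05` are hypothetical choices made for this example. In actual fine-tuning the same quantity would be computed on differentiable tensors so that it can be minimized.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def inter_set_loss(sets, tau=0.05):
    """Evaluate Equation 1 for a list of sets of embeddings.

    sets[i] holds the embeddings of the sentences in S_i.
    tau is the temperature (0.05 is an assumed default, not from the paper).
    """
    total = 0.0
    for i, s_i in enumerate(sets):
        # Negatives for S_i: every embedding that belongs to another set.
        negatives = [h for j, s_j in enumerate(sets) if j != i for h in s_j]
        for h_m in s_i:
            # Inner term of l_i: log of the summed exponentiated similarities.
            total += math.log(
                sum(math.exp(cosine(h_m, h_n) / tau) for h_n in negatives)
            )
    return total
```

In this sketch the loss is smaller when embeddings from different sets are less similar to each other, which is the behavior the objective is meant to encourage.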

As one can see, strictly following Equation 1, the number of negative pairs will grow quadratically with $N$. In practice, we can randomly pick a subset of the combination pairs of a certain size to avoid this problem.
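The subsampling idea can be sketched as follows. The text does not specify the sampling scheme, so uniform sampling without replacement over all cross-set pairs is an assumption here, and the function name `sample_negative_pairs`, its `max_pairs` argument, and the fixed `seed` are hypothetical.

```python
import itertools
import random


def sample_negative_pairs(sets, max_pairs, seed=0):
    """Return at most max_pairs negative pairs drawn uniformly at random.

    Each sentence is identified by (set index, position in set); a negative
    pair is an ordered pair of sentences that come from different sets.
    """
    labelled = [(i, k) for i, s in enumerate(sets) for k in range(len(s))]
    # Enumerate every ordered pair (m, n) with m and n in different sets.
    pairs = [
        (m, n)
        for m, n in itertools.permutations(labelled, 2)
        if m[0] != n[0]
    ]
    if len(pairs) <= max_pairs:
        return pairs
    return random.Random(seed).sample(pairs, max_pairs)
```

Enumerating all pairs before sampling is itself quadratic, so this only illustrates which pairs are eligible; a large-scale implementation would sample indices directly.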

Our evaluations find that the above learning objective can effectively fine-tune sentence embedding models to distinguish different semantics. More details on the evaluation can be found in Section 5.

4 SetCSE Operations

In order to define the SetCSE operations, we first quantify “semantic closeness”, i.e., semantic similarity, of a sentence to a set of sentences. This closeness is measured by the similarity between the embedding of the sentence and the embeddings of the sentences in the set.

Definition 1.

The semantic similarity, $\text{SIM}(x, S)$, between sentence $x$ and set of sentences $S$ is defined as:

$$\text{SIM}(x, S) \coloneqq \frac{1}{|S|}\sum_{k \in S} \text{sim}(\text{h}_x, \text{h}_k), \tag{2}$$

where sentence $k$ represents sentences in $S$, and $\text{h}$ denotes the sentence embedding.
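Definition 1 amounts to averaging the cosine similarity between one embedding and every embedding in the set. A minimal sketch, assuming embeddings are plain lists of floats and using the hypothetical names `cosine` and `set_similarity`:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(h_x, set_embeddings):
    """SIM(x, S): mean cosine similarity between h_x and each h_k in S."""
    return sum(cosine(h_x, h_k) for h_k in set_embeddings) / len(set_embeddings)


# Toy example: x matches one element of S exactly and is orthogonal to the other.
print(set_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # prints 0.5
```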

4.1 Operation Definitions

For the sake of readability, we first define the calculation of a series of SetCSE intersection and difference operations, and then derive the simpler case where only a single SetCSE intersection or difference operation is involved.

Definition 2.

For a given series of SetCSE operations $A \cap B_1 \cap \dotsb \cap B_N \setminus C_1 \setminus \dotsb \setminus C_M$, the result is an ordered set on $A$, denoted as $(A, \preceq)$, where the order relationship $\preceq$ is defined as

$$x \preceq y \; \text{if and only if} \; \sum^{N}_{i=1} \text{SIM}(x, B_i) - \sum^{M}_{j=1} \text{SIM}(x, C_j) \leq \sum^{N}_{i=1} \text{SIM}(y, B_i) - \sum^{M}_{j=1} \text{SIM}(y, C_j), \tag{3}$$

for all $x$ and $y$ in $A$.
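The ordering in Definition 2 can be sketched by scoring each sentence in $A$ with the left-hand side of Equation 3 and sorting by that score. This is an illustrative sketch under assumptions: embeddings are taken as precomputed lists of floats, $A$ is passed as (sentence, embedding) pairs, and the names `setcse_series_rank` and `top_k` are hypothetical. The result is returned in descending order, so the first entries are those most similar to the $B_i$ and least similar to the $C_j$.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(h_x, set_embeddings):
    """SIM(x, S) from Equation 2."""
    return sum(cosine(h_x, h_k) for h_k in set_embeddings) / len(set_embeddings)


def setcse_series_rank(A, B_sets, C_sets, top_k=None):
    """Order the sentences of A according to Equation 3, largest score first.

    A:      list of (sentence, embedding) pairs.
    B_sets: list of sets (lists of embeddings) to intersect with.
    C_sets: list of sets (lists of embeddings) to take the difference with.
    """
    def score(h_x):
        return (sum(set_similarity(h_x, B) for B in B_sets)
                - sum(set_similarity(h_x, C) for C in C_sets))

    ranked = sorted(A, key=lambda item: score(item[1]), reverse=True)
    return ranked if top_k is None else ranked[:top_k]
```

Because the score is a plain sum over the $B_i$ minus a plain sum over the $C_j$, the order in which the sets are listed does not change the ranking in this sketch.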

Remark. As one can see, the SetCSE operations $A \cap B_1 \cap \dotsb \cap B_N \setminus C_1 \setminus \dotsb \setminus C_M$ rank-order the elements in $A$ by the similarity with sets $B_i$ and dissimilarity with $C_j$. In practice, when using SetCSE as a querying framework, one can rank the sentences in descending order and select the top ones which are semantically close to $B_i$ and different from $C_j$.
Another observation is that a series of SetCSE operations is invariant to the order of the operations; in other words, we have $A \cap B \setminus C = A \setminus C \cap B$.
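
The invariance can be read off the score that Algorithm 1 computes for each sentence (the sum of similarities to the $B_i$ minus the sum of similarities to the $C_j$). A minimal sketch for one intersection and one difference, where the symbol $s$ is notation introduced here for illustration only:

```latex
\begin{align*}
s_{A \cap B \setminus C}(x) &= \text{SIM}(x, B) - \text{SIM}(x, C) \\
                            &= -\,\text{SIM}(x, C) + \text{SIM}(x, B) = s_{A \setminus C \cap B}(x).
\end{align*}
```

Since both expressions assign the same score to every $x \in A$, they induce the same ordering on $A$.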

Following Definition 2, SetCSE intersection and difference are given by Lemmas 1 and 2, respectively.

Lemma 1.

The SetCSE intersection $A \cap B$ equals $(A, \preceq)$, where for all $x, y \in A$,

$$x \preceq y \quad \text{if and only if} \quad \text{SIM}(x, B) \leq \text{SIM}(y, B). \tag{4}$$
Lemma 2.

The SetCSE difference $A \setminus C$ equals $(A, \preceq)$, where for all $x, y \in A$,

$$x \preceq y \quad \text{if and only if} \quad \text{SIM}(x, C) \geq \text{SIM}(y, C). \tag{5}$$
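
As a small worked illustration of Lemmas 1 and 2, take $A = \{x_1, x_2, x_3\}$ with hypothetical similarity values (chosen for this example, not taken from the experiments), and let the same values serve once for a set $B$ and once for a set $C$:

```latex
\begin{align*}
\text{SIM}(x_1, B) = 0.9,\; \text{SIM}(x_2, B) = 0.2,\; \text{SIM}(x_3, B) = 0.5
  &\;\Longrightarrow\; x_2 \preceq x_3 \preceq x_1 \quad \text{in } A \cap B, \\
\text{SIM}(x_1, C) = 0.9,\; \text{SIM}(x_2, C) = 0.2,\; \text{SIM}(x_3, C) = 0.5
  &\;\Longrightarrow\; x_1 \preceq x_3 \preceq x_2 \quad \text{in } A \setminus C.
\end{align*}
```

Reading each chain from its greatest element downward, the intersection retrieves $x_1$ first, whereas the difference retrieves $x_2$ first.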

Remark. SetCSE intersection and difference do not satisfy the commutative law; in other words, $A \cap B \neq B \cap A$ and $A \setminus C \neq C \setminus A$. The advantages and limitations of the properties mentioned in the Remarks are discussed in Appendix B.

4.2 Algorithm

Combining Section 3 with the definitions above, we present the complete algorithm for SetCSE operations in Algorithm 1. The algorithm consists of two main steps: the first fine-tunes the sentence embedding model by minimizing the inter-set loss $\mathcal{L}_{\text{inter-set}}$, and the second ranks sentences using the order relationship in Definition 2.

Algorithm 1 SetCSE Operation $A \cap B_1 \cap \dotsb \cap B_N \setminus C_1 \setminus \dotsb \setminus C_M$
1: Input: Sets of sentences $A$, $B_1, \dots, B_N$, $C_1, \dots, C_M$, sentence embedding model $\phi$
2: Fine-tune model $\phi$ by minimizing $\mathcal{L}_{\text{inter-set}}$ w.r.t. $B_1, \dots, B_N$, $C_1, \dots, C_M$, denote it as $\phi^{*}$
3: for sentence $x$ in $A$ do
4:     Compute $\sum_{i=1}^{N} \text{SIM}(x, B_i) - \sum_{j=1}^{M} \text{SIM}(x, C_j)$, where all embeddings are induced by $\phi^{*}$
5: end for
6: Form $(A, \preceq)$ and rank sentences in $A$ in descending order
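
The ranking part of Algorithm 1 (lines 3 to 6) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the reference implementation: the fine-tuning step is omitted, embeddings are assumed to be already produced by $\phi^{*}$, SIM(x, S) is assumed to be the mean cosine similarity between x and the members of S, and the names `cosine`, `set_sim`, and `setcse_rank` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_sim(x, S):
    """SIM(x, S); ASSUMPTION: mean cosine similarity of x to the members of S."""
    return sum(cosine(x, s) for s in S) / len(S)


def setcse_rank(A, B_sets, C_sets):
    """Score every embedding in A and return (indices by descending score, scores)."""
    scores = [
        sum(set_sim(x, B) for B in B_sets) - sum(set_sim(x, C) for C in C_sets)
        for x in A
    ]
    order = sorted(range(len(A)), key=lambda k: scores[k], reverse=True)
    return order, scores


# Toy 2-D embeddings (hypothetical values standing in for phi* outputs).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0]]  # example set to be close to
C = [[0.0, 1.0]]  # example set to be far from
order, scores = setcse_rank(A, [B], [C])
print(order)
```

Passing an empty list for `C_sets` gives a pure intersection, and an empty list for `B_sets` gives a pure difference.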

5 Evaluation

In this section, we present the performance evaluation of SetCSE intersection and difference. The evaluation of series of SetCSE operations is presented in detail in Appendix C.4.

To cover a diverse range of semantics, we employ the following datasets in this section: AG News Title and Description (AGT and AGD) (Zhang et al., 2015), Financial PhraseBank (FPB) (Malo et al., 2014), Banking77 (Casanueva et al., 2020), and Facebook Multilingual Task Oriented Dataset (FMTOD) (Schuster et al., 2018).

We consider an extensive list of models for generating sentence embeddings: the encoder-only Transformer models BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019b); their fine-tuned versions for sentence embedding problems, namely SimCSE (Gao et al., 2021), DiffCSE (Chuang et al., 2022), and MCSE (Zhang et al., 2022); Contriever (Izacard et al., 2021), which targets information retrieval; and the decoder-only SGPT-125M model (Muennighoff, 2022). In addition, conventional techniques such as TFIDF, BM25, and DPR (Karpukhin et al., 2020) are considered as well.

5.1 SetCSE Intersection

Suppose a labeled dataset $S$ has $N$ distinct semantics, and $S_i$ denotes the set of sentences with the $i$-th semantic. For SetCSE intersection performance evaluation, the experiment is set up as follows:

  1. In each $S_i$, randomly select $n_{\text{sample}}$ sentences, denoted as $Q_i$, and concatenate the remaining sentences in all $S_i$, denoted as $U$. Regard the $Q_i$'s as example sets and $U$ as the evaluation set.

  2. For each semantic $i$, conduct $U \cap Q_i$ following Algorithm 1, and select the top $|S_i| - n_{\text{sample}}$ sentences. View the $i$-th semantic as the prediction for the selected sentences and evaluate accuracy and F1 against the ground truth.

  3. As a control group, repeat Step 2 while omitting the model fine-tuning in Algorithm 1.
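
The protocol above can be sketched as follows. This is a hedged illustration, not the evaluation code used for the reported numbers: the helper names are hypothetical, the scores are assumed to come from Algorithm 1, and accuracy and F1 are assumed to be computed as a binary "semantic $i$" versus "not $i$" problem, which the text does not spell out.

```python
import random


def split_sets(sentences_by_label, n_sample, seed=0):
    """Step 1: draw n_sample example sentences per semantic; pool the rest into U."""
    rng = random.Random(seed)
    Q, U = {}, []
    for label, sentences in sentences_by_label.items():
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        Q[label] = shuffled[:n_sample]
        U.extend((s, label) for s in shuffled[n_sample:])
    return Q, U


def top_k(scores, k):
    """Indices of the k highest-scoring sentences in U."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def accuracy_and_f1(selected, truth, label):
    """ASSUMPTION: selected sentences are predicted `label`, all others 'not label'."""
    tp = sum(1 for j, t in enumerate(truth) if j in selected and t == label)
    fp = sum(1 for j, t in enumerate(truth) if j in selected and t != label)
    fn = sum(1 for j, t in enumerate(truth) if j not in selected and t == label)
    tn = len(truth) - tp - fp - fn
    accuracy = (tp + tn) / len(truth)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Toy data; the scores are hypothetical stand-ins for SIM(x, Q_i) from Algorithm 1.
data = {"sports": ["s1", "s2", "s3"], "tech": ["t1", "t2", "t3"]}
Q, U = split_sets(data, n_sample=1)
truth = [label for _, label in U]
scores = [0.9 if label == "sports" else 0.1 for label in truth]
k = len(data["sports"]) - 1  # |S_i| - n_sample
accuracy, f1 = accuracy_and_f1(top_k(scores, k), truth, "sports")
```

For the control group in Step 3, the same functions would be applied to scores computed with the original model instead of the fine-tuned one.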

Throughout this paper, each experiment is repeated 5 times to minimize the effects of randomness. The hyperparameters are set to $n_{\text{sample}} = 20$, $\tau = 0.05$, and 60 training epochs, based on the fine-tuning results presented in Section 7 and Appendix C.

| Group | Model | AG News-T Acc | AG News-T F1 | AG News-D Acc | AG News-D F1 | FPB Acc | FPB F1 | Banking77 Acc | Banking77 F1 | FMTOD Acc | FMTOD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Existing Model Set Intersect | BM25 | 24.90 | 24.90 | 25.02 | 25.02 | 33.40 | 38.91 | 41.25 | 41.32 | 37.59 | 41.19 |
| | DPR | 25.00 | 25.00 | 25.00 | 25.00 | 33.33 | 38.81 | 41.30 | 41.35 | 38.33 | 42.87 |
| | TFIDF | 42.02 | 42.02 | 52.36 | 52.43 | 56.39 | 53.40 | 83.37 | 83.32 | 89.98 | 89.75 |
| | BERT | 43.83 | 43.74 | 52.37 | 52.05 | 55.78 | 54.21 | 53.35 | 53.34 | 80.64 | 79.59 |
| | RoBERTa | 40.75 | 40.78 | 54.58 | 54.40 | 54.69 | 53.07 | 75.10 | 64.49 | 72.60 | 70.84 |
| | Contriever | 48.95 | 48.63 | 58.85 | 58.08 | 54.41 | 51.67 | 59.63 | 59.81 | 75.10 | 75.09 |
| | SGPT | 34.88 | 34.90 | 34.97 | 35.02 | 52.83 | 52.89 | 37.35 | 37.44 | 79.80 | 78.86 |
| | SimCSE-BERT | 55.28 | 55.09 | 68.07 | 67.38 | 56.13 | 53.79 | 82.69 | 82.60 | 90.01 | 89.93 |
| | SimCSE-RoBERTa | 49.68 | 49.72 | 60.76 | 60.64 | 66.11 | 64.87 | 84.90 | 84.81 | 93.43 | 93.47 |
| | DiffCSE-BERT | 49.94 | 49.95 | 61.64 | 61.31 | 50.88 | 47.78 | 83.02 | 82.87 | 91.61 | 91.42 |
| | DiffCSE-RoBERTa | 46.29 | 46.46 | 46.61 | 46.65 | 61.71 | 60.05 | 87.31 | 87.22 | 83.06 | 82.95 |
| | MCSE-BERT | 49.98 | 49.91 | 68.79 | 68.14 | 54.01 | 50.89 | 77.35 | 77.21 | 93.56 | 92.49 |
| | MCSE-RoBERTa | 46.32 | 46.29 | 57.10 | 56.88 | 55.96 | 53.32 | 85.80 | 85.69 | 94.39 | 94.30 |
| SetCSE Intersect | BERT | 70.47 | 70.32 | 87.24 | 87.19 | 71.65 | 71.01 | 95.06 | 95.06 | 98.04 | 98.04 |
| | RoBERTa | 75.87 | 75.71 | 88.30 | 88.26 | 73.76 | 73.09 | 83.59 | 83.46 | 99.39 | 99.39 |
| | Contriever | 72.88 | 72.77 | 83.97 | 83.99 | 67.83 | 67.59 | 94.20 | 94.22 | 97.03 | 97.05 |
| | SGPT | 36.64 | 36.63 | 36.01 | 36.02 | 54.13 | 54.73 | 41.88 | 41.93 | 86.94 | 86.65 |
| | SimCSE-BERT | 77.24 | 77.22 | 89.48 | 89.46 | 83.59 | 83.44 | 97.84 | 97.84 | 99.63 | 99.63 |
| | SimCSE-RoBERTa | 79.56 | 79.57 | 89.97 | 89.97 | 85.48 | 85.25 | 98.33 | 98.33 | 99.44 | 99.44 |
| | DiffCSE-BERT | 76.43 | 76.45 | 78.31 | 78.30 | 80.93 | 80.84 | 98.24 | 98.25 | 99.79 | 99.79 |
| | DiffCSE-RoBERTa | 78.02 | 78.04 | 88.63 | 88.62 | 83.89 | 83.75 | 98.49 | 98.49 | 97.89 | 97.87 |
| | MCSE-BERT | 75.04 | 63.96 | 88.77 | 88.75 | 84.03 | 83.97 | 97.76 | 97.76 | 99.53 | 99.53 |
| | MCSE-RoBERTa | 78.18 | 78.21 | 89.23 | 89.22 | 86.21 | 86.08 | 98.65 | 98.65 | 98.66 | 98.65 |
| Ave. Improvement | | 56% | 57% | 46% | 47% | 43% | 50% | 27% | 28% | 12% | 12% |
 
Table 1: Evaluation results for SetCSE intersection. As illustrated, the average improvements on accuracy and F1 are 39% and 37%, respectively.
Figure 2: The t-SNE plots of sentence embeddings induced by existing language models and the SetCSE fine-tuned ones for the AGT dataset. As illustrated, the models' awareness of different semantics is significantly improved.
Figure 3: The t-SNE plots of sentence embeddings induced by existing language models and the SetCSE fine-tuned ones for the FPB dataset. As illustrated, the models' awareness of different semantics is significantly improved.

The detailed experiment results can be found in Table 1, where “SetCSE Intersect” and “Existing Model Set Intersect” represent the results of Steps 2 and 3, respectively. To illustrate the performance more intuitively, we include t-SNE (Van der Maaten & Hinton, 2008) plots of the sentence embeddings in Figures 2 and 3 (refer to Appendix C for t-SNE plots of the AGD, Banking77, and FMTOD datasets). On average, the SetCSE framework improves intersection performance by 38%, indicating a significant increase in semantic awareness. Moreover, the encoder-based models perform better than the decoder-based SGPT. This phenomenon and potential future work are discussed in detail in Appendix C.3.

5.2 SetCSE Difference

Suppose a labeled dataset $S$ has $N$ distinct semantics, and $S_i$ denotes the set of sentences with the $i$-th semantic. Similar to Section 5.1, the evaluation of SetCSE difference is set up as follows:

  1. In each $S_i$, randomly select $n_{\text{sample}}$ sentences, denoted by $Q_i$, and concatenate the remaining sentences in all $S_i$, denoted as $U$.

  2. For each semantic $i$, conduct $U \setminus Q_i$ following Algorithm 1, and select the top $\sum_{j \neq i} (|S_j| - n_{\text{sample}})$ sentences, which are supposed to have semantics different from $i$. Label the selected sentences as “not $i$”, and relabel ground truth semantics other than $i$ as “not $i$” as well. Evaluate prediction accuracy and F1 against the relabeled ground truth.

  3.

    As a control group, repeat Step 2 while omitting the model fine-tuning in Algorithm 1.
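The relabeling and scoring in Steps 2 and 3 can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function name `evaluate_difference`, the toy scores, and the choice of “not $i$” as the positive class for F1 are assumptions, and the scores simply stand in for whatever ranking Algorithm 1 produces (with or without fine-tuning).

```python
def evaluate_difference(difference_scores, true_labels, target, top_k):
    """Score a difference-style selection against relabeled ground truth.

    difference_scores: one score per sentence in U; higher is assumed to mean
        "less like semantic `target`" (hypothetical output of a ranking step).
    true_labels: ground-truth semantic of each sentence.
    top_k: number of sentences to select, e.g. sum_{j != i} (|S_j| - n_sample).
    """
    # Select the top_k sentences judged most different from the target semantic.
    order = sorted(range(len(difference_scores)),
                   key=lambda idx: difference_scores[idx], reverse=True)
    selected = set(order[:top_k])

    # Prediction: selected sentences are "not i".
    predicted_not_i = [idx in selected for idx in range(len(true_labels))]
    # Relabeled ground truth: every semantic other than the target is "not i".
    actual_not_i = [label != target for label in true_labels]

    pairs = list(zip(predicted_not_i, actual_not_i))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(a and not p for p, a in pairs)
    correct = sum(p == a for p, a in pairs)

    accuracy = correct / len(true_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Toy example: 6 sentences, 3 semantics, target semantic "sports".
labels = ["sports", "sports", "world", "world", "tech", "tech"]
scores = [0.1, 0.7, 0.9, 0.8, 0.6, 0.2]  # hypothetical difference scores
accuracy, f1 = evaluate_difference(scores, labels, target="sports", top_k=4)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Running the same function on scores obtained with and without fine-tuning would give the two groups compared in the evaluation.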

The results of this evaluation can be found in Table 2. As in Section 5.1, we observe significant improvements in accuracy and F1. Specifically, the average improvement across all experiments is 18%.

| Method | Model | AG News-T Acc | AG News-T F1 | AG News-D Acc | AG News-D F1 | FPB Acc | FPB F1 | Banking77 Acc | Banking77 F1 | FMTOD Acc | FMTOD F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Existing Model Set Difference | BM25 | 57.34 | 57.47 | 59.95 | 57.57 | 69.22 | 60.77 | 68.76 | 68.14 | 73.97 | 71.74 |
| | DPR | 56.50 | 54.45 | 56.98 | 54.76 | 66.67 | 58.06 | 68.70 | 68.06 | 71.67 | 68.43 |
| | TFIDF | 23.76 | 32.31 | 26.30 | 36.49 | 40.13 | 51.33 | 32.37 | 47.87 | 38.58 | 54.48 |
| | BERT | 71.39 | 59.49 | 75.63 | 65.18 | 79.05 | 70.69 | 77.84 | 68.15 | 89.99 | 85.42 |
| | RoBERTa | 70.35 | 58.11 | 77.15 | 67.22 | 77.90 | 69.46 | 75.10 | 64.49 | 86.41 | 80.44 |
| | Contriever | 75.24 | 64.62 | 79.93 | 71.06 | 77.07 | 68.86 | 79.53 | 70.52 | 87.75 | 82.52 |
| | SGPT | 67.85 | 54.87 | 67.85 | 54.86 | 76.71 | 67.34 | 69.35 | 56.84 | 89.95 | 85.38 |
| | SimCSE-BERT | 77.64 | 67.87 | 84.04 | 76.77 | 78.06 | 71.14 | 91.92 | 88.07 | 90.30 | 86.21 |
| | SimCSE-RoBERTa | 74.84 | 64.09 | 77.40 | 67.68 | 83.05 | 76.57 | 92.59 | 89.05 | 96.71 | 95.11 |
| | DiffCSE-BERT | 74.97 | 64.27 | 80.82 | 72.30 | 75.44 | 68.34 | 91.84 | 87.95 | 95.80 | 93.79 |
| | DiffCSE-RoBERTa | 71.64 | 59.87 | 78.40 | 68.96 | 80.86 | 73.68 | 93.43 | 90.27 | 96.59 | 94.92 |
| | MCSE-BERT | 74.72 | 63.96 | 83.24 | 75.68 | 78.46 | 71.23 | 88.43 | 83.03 | 96.66 | 95.03 |
| | MCSE-RoBERTa | 73.07 | 61.70 | 77.75 | 68.02 | 78.87 | 71.20 | 92.65 | 89.13 | 97.10 | 95.68 |
| SetCSE Difference | BERT | 87.39 | 81.54 | 92.92 | 89.55 | 84.91 | 78.14 | 97.22 | 95.87 | 99.42 | 99.14 |
| | RoBERTa | 89.35 | 84.36 | 85.12 | 78.50 | 86.86 | 80.85 | 94.31 | 91.73 | 99.70 | 99.55 |
| | Contriever | 85.93 | 79.56 | 92.95 | 89.61 | 83.50 | 76.27 | 95.12 | 92.78 | 98.94 | 98.42 |
| | SGPT | 67.99 | 55.04 | 68.36 | 55.53 | 77.73 | 68.63 | 70.59 | 58.44 | 93.39 | 90.21 |
| | SimCSE-BERT | 88.62 | 83.30 | 94.74 | 92.20 | 91.80 | 87.99 | 99.04 | 98.56 | 99.81 | 99.72 |
| | SimCSE-RoBERTa | 89.78 | 84.98 | 94.99 | 92.57 | 92.74 | 89.37 | 99.29 | 98.93 | 99.72 | 99.58 |
| | DiffCSE-BERT | 88.22 | 82.72 | 94.69 | 92.13 | 90.47 | 86.07 | 99.04 | 98.56 | 99.90 | 99.85 |
| | DiffCSE-RoBERTa | 89.01 | 83.87 | 94.31 | 91.58 | 91.94 | 88.22 | 99.14 | 98.72 | 98.95 | 98.43 |
| | MCSE-BERT | 88.33 | 82.89 | 93.84 | 90.89 | 91.96 | 88.21 | 98.94 | 98.41 | 99.76 | 99.64 |
| | MCSE-RoBERTa | 89.86 | 85.09 | 94.36 | 91.65 | 93.19 | 90.00 | 99.14 | 98.72 | 99.63 | 99.45 |
| Ave. Improvement | | 19% | 31% | 18% | 28% | 15% | 21% | 12% | 19% | 6% | 8% |
 
Table 2: Evaluation results for SetCSE difference. As illustrated, the average improvements on accuracy and F1 are 14% and 21%, respectively.
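As a quick consistency check, the summary figures quoted in the text and in the caption follow from the “Ave. Improvement” row of Table 2. The snippet below only averages that row; the variable names are illustrative.

```python
from statistics import mean

# "Ave. Improvement" row of Table 2, in percent, ordered as
# AG News-T, AG News-D, FPB, Banking77, FMTOD.
acc_improvement = [19, 18, 15, 12, 6]
f1_improvement = [31, 28, 21, 19, 8]

avg_acc = mean(acc_improvement)                   # average over accuracy columns
avg_f1 = mean(f1_improvement)                     # average over F1 columns
avg_all = mean(acc_improvement + f1_improvement)  # average over all ten columns

print(f"accuracy: {avg_acc:.1f}%, F1: {avg_f1:.1f}%, overall: {avg_all:.1f}%")
```

Rounded to whole percentages, these give the 14% and 21% stated in the caption and the 18% overall figure stated in the text.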

As mentioned, the evaluation of the SetCSE series of operations can be found in Appendix C.4. The extensive evaluations indicate that SetCSE significantly enhances model discriminatory capabilities, and yields positive results in SetCSE intersection and SetCSE difference operations.

6 Application

As mentioned, SetCSE offers two significant advantages in information retrieval. One is its ability to effectively represent involved and sophisticated semantics, while the other is its capability to extract information associated with these semantics following complicated prompts. The former is achieved by expressing semantics with sets of sentences or phrases, and the latter is enabled by series of SetCSE intersection and difference operations. For instance, the operation $A \cap B_1 \cap \dotsb \cap B_N \setminus C_1 \setminus \dotsb \setminus C_M$ essentially means “to distinguish the difference between $B_1, \dotsb, B_N, C_1, \dotsb, C_M$, and to find sentences in $A$ that contain the semantics in the $B_i$’s while being different from the semantics in the $C_j$’s.”
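The reading of an operation series given above can be sketched in code. The example is illustrative only: it assumes sentences are already embedded (after the inter-set fine-tuning), that similarity to a set is the mean cosine similarity to its members, and that sentences in $A$ are ranked by their summed similarity to the $B_i$ sets minus their summed similarity to the $C_j$ sets. The helper names and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(x, example_set):
    """Assumed set similarity: mean cosine similarity between x and the members."""
    return sum(cosine(x, s) for s in example_set) / len(example_set)


def rank_series(A, B_sets, C_sets):
    """Rank the items of A for a series: intersect with each B, subtract each C."""
    def score(x):
        return (sum(set_similarity(x, B) for B in B_sets)
                - sum(set_similarity(x, C) for C in C_sets))
    return sorted(A, key=lambda name: score(A[name]), reverse=True)


# Toy embeddings: axis 0 stands for semantic B, axis 1 for semantic C.
A = {
    "mostly B": [1.0, 0.1],
    "B and C": [1.0, 1.0],
    "mostly C": [0.1, 1.0],
}
B = [[1.0, 0.0], [0.9, 0.1]]
C = [[0.0, 1.0], [0.1, 0.9]]

print(rank_series(A, [B], [C]))
```

Under this assumed score, the item that carries semantic B without semantic C ranks first, which matches the intended reading of the operation.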

In this section, we showcase in detail these advantages through three important natural language processing tasks, namely, complex and intricate semantic search, data annotation through active learning, and new topic discovery. The datasets considered cover various domains, including financial analysis, legal service, and social media analysis. For more use cases and examples that are enabled by SetCSE, one can refer to Appendix D.

6.1 Complex and Intricate Semantic Search

In many real-world information retrieval tasks, there is a need to search for sentences with or without specific semantics that are hard to convey in single phrases or sentences. In these cases, existing querying methods based on single-sentence prompts are of limited use. By employing SetCSE, one can readily represent those semantics. Furthermore, SetCSE also supports expressing convoluted prompts via its operations and simple syntax. These advantages are illustrated through the following financial analysis example.

In recent years, there has been an increasing interest in leveraging a company’s Environmental, Social, and Governance (ESG) stance to forecast its growth and sustainability (Utz, 2019; Hong et al., 2022). Brokerage firms and mutual fund companies have even begun offering financial products such as Exchange-Traded Funds (ETFs) that adhere to companies’ ESG investment strategies (Kanuri, 2020; Rompotis, 2022) (more details on ESG are included in Appendix D.3). Notably, there is no definitive taxonomy for the term (CFA Institute, 2023), and lists of key topics are often used to illustrate these concepts (refer to Table 3(a)).

The intricate nature of ESG concepts makes it challenging to analyze ESG information through publicly available textual data, e.g., S&P 500 earnings calls (Qin & Yang, 2019). However, within the SetCSE framework, these concepts can be readily represented by their example topics. Combined with several other semantics, one can effortlessly extract company earnings calls related to convoluted concepts such as “using technology to solve Social issues, while neglecting its potential negative impact,” and “investing in Environmental development projects,” through simple operations. Table 3 provides a detailed presentation of the corresponding SetCSE operations and results.

  Set $A$ - Environmental {climate change, carbon emission reduction, water pollution, air pollution, renewable energy}
Set $B$ - Social {diversity inclusion, community relations, customer satisfaction, fair wages, data security}
Set $C$ - Governance {ethical practices, transparent accounting, business integrity, risk management, compliance}
Set $D$ - New Tech {machine learning, artificial intelligence, robotics, generative model, neural networks}
Set $E$ - Danger {personal privacy breach, wrongful disclosure, pose threat, misinformation, unemployment}
Set $F$ - Invest {strategic investment, growth investment, strategic plan, invest, investment}
 
(a) Example topics for defining ESG (Investopedia, 2023) and example phrases for other semantics.
Operation: $X \cap B \cap D \setminus E$ /*find sentences about “use tech to influence social issues positively”*/
Results:
We’re now using machine learning in most of our integrity work to keep our community safe.
But we know we also have a responsibility to deliver these fundamental technical and advances to fulfill the promise of bringing people closer together.
Our data and technology combined with specialized consulting experience help organization transition to a digital future while ensuring their workforce thrives.
And you’ll see us integrating advances in machine learning so that customers can get better satisfaction

(b) Search for sentences related to “using technology to solve Social issues, while neglecting its potential negative impact,” utilizing a series of three SetCSE operations.
Operation: $X \cap A \cap F$ /*find sentences about “invest in environmental development”*/
Results:
We also continue to make progress on the $1.5 billion of undefined renewable projects, which are included in our capital forecast.
To that end, our growth initiatives beyond the projects under construction have been focused on investments in natural gas and renewable projects with long term.
I would note that the $9.7 billion plan includes the natural gas storage as well as the UP generation investment that I just discussed.

(c) Search for sentences related to “investing in Environmental development projects,” via simple SetCSE syntax.
Table 3: Demonstration of complex and intricate semantic search using SetCSE serial operations, through the example of analyzing the ESG stance of S&P 500 companies based on earnings call transcripts.
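To make the structure of the two queries in Table 3 explicit, the sketch below splits an operation string into the corpus, the sets to intersect with, and the sets to subtract. The textual syntax accepted here and the function `parse_operation` are hypothetical conveniences for illustration, not part of SetCSE itself; the result could then be handed to whatever routine performs the ranking.

```python
import re


def parse_operation(expr):
    """Split an operation string into (corpus, intersect_sets, difference_sets).

    Accepts the intersection sign and either a backslash or the Unicode
    set-minus sign for difference (an assumed, illustrative syntax).
    """
    tokens = re.findall(r"[∩∖\\]|\w+", expr)
    corpus, include, exclude = tokens[0], [], []
    # Tokens after the corpus come in (operator, set name) pairs.
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "∩":
            include.append(name)
        else:  # backslash or set-minus sign: SetCSE difference
            exclude.append(name)
    return corpus, include, exclude


# The two operations used in Table 3(b) and Table 3(c).
print(parse_operation(r"X ∩ B ∩ D \ E"))
print(parse_operation("X ∩ A ∩ F"))
```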

6.2 Data Annotation and Active Learning

Suppose one is building a classification model from scratch and only an unlabeled dataset is available. Denote the unlabeled dataset as $X$, and each class as $i$, $i = 1, \dotsb, N$. One quick solution is to use SetCSE as a filter to extract sentences that are semantically close to the example set $S_i$ for each $i$, where the filtering is conducted using $X \cap S_i$, and then conduct a thorough human annotation. More interestingly, SetCSE supports uncertainty labeling in the active learning framework (Settles, 2009; Gui et al., 2020), where the unlabeled items near the decision boundary between two classes $i$ and $j$ can be found using $X \cap S_i \cap S_j$.

We use the Law Stack Exchange (LSE) dataset (Li et al., 2022) to validate the above data annotation strategy. Categories in this dataset include “copyright”, “criminal law”, “contract law”, etc. Table 4(a) presents the sentences selected based on similarity with the example sets, whereas Table 4(b) shows the sentences that are on the decision boundary between “copyright” and “criminal law”. The latter are indeed difficult to categorize at first glance, hence labeling those items following the active learning framework would increase efficiency in data annotation.

Operation: $X \cap S_1$ /*find sentences related to “copyright”*/
Results:
Who owns a copyright on a scanned work?
How does copyright on Recipes work?
Is OCRed text automatically copyright?
Operation: $X \cap S_2$ /*find sentences related to “criminal law”*/
Results:
Giving Someone Money Because of a Criminal Act?
What are techniques used in law to robustly incentivize people to tell the truth?
Canada - how long can a person be under investigation?

(a) Extract sentences close to “copyright” or “criminal law” categories for further human annotation.
Operation: $X \cap S_1 \cap S_2$ /*find sentences on decision boundary of “copyright” and “criminal law”*/
Results:
Is there any country or state where the intellectual author of a homicide has twice or more the penalty than the physical author?
Would police in the US have any alternative for handling a confiscated computer with a hidden partition?
Is there a criminal database for my city Calgary
Audio fingerprinting legal issues
Is reading obscene written material online illegal in the UK?

(b) Following uncertainty labeling strategy in active learning framework, find sentences on the decision boundary between “copyright” and “criminal law” categories with the help of SetCSE serial operations.
Table 4: Demonstration of LSE dataset annotation and active learning utilizing SetCSE.
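The two annotation queries of this subsection, filtering with $X \cap S_i$ and boundary search with $X \cap S_i \cap S_j$, can be sketched as below. This is a toy illustration under stated assumptions: items are already embedded, set similarity is taken to be the mean cosine similarity to the set members, and an intersection is scored by summing the similarities to each set. The function names and 2-D embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(x, example_set):
    """Assumed set similarity: mean cosine similarity between x and the members."""
    return sum(cosine(x, s) for s in example_set) / len(example_set)


def rank_intersection(X, example_sets, top_k):
    """Return the top_k item names of X by summed similarity to all example sets."""
    def score(name):
        return sum(set_similarity(X[name], S) for S in example_sets)
    return sorted(X, key=score, reverse=True)[:top_k]


# Toy 2-D embeddings: axis 0 stands for "copyright", axis 1 for "criminal law".
X = {
    "clearly copyright": [1.0, 0.0],
    "clearly criminal": [0.0, 1.0],
    "borderline": [0.7, 0.7],
}
S1 = [[1.0, 0.1]]  # example set for class 1
S2 = [[0.1, 1.0]]  # example set for class 2

print(rank_intersection(X, [S1], top_k=1))      # candidates for class 1
print(rank_intersection(X, [S1, S2], top_k=1))  # candidates near the boundary
```

With a single example set the query surfaces the item closest to that class, while the two-set query surfaces the item that is similar to both classes, which is the kind of item uncertainty labeling would send to a human annotator first.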

6.3 New Topic Discovery

The task of new topic discovery (Blei & Lafferty, 2006; AlSumait et al., 2008; Chen et al., 2019) emerges when a dataset of interest is evolving over time. This can include tasks such as monitoring customer product reviews, collecting feedback for a currently airing TV series, or identifying trending public perception of specific stocks, among others. Suppose we have an unlabeled dataset $X$ and $N$ identified topics; the SetCSE operation for new topic extraction would then be $X \setminus T_1 \setminus \dots \setminus T_N$
{{}{}}{}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@setlinewidth{0.6pt}% \pgfsys@invoke{ }\pgfsys@roundcap\pgfsys@invoke{ }{}\pgfsys@moveto{3.0pt}{0.0% pt}\pgfsys@lineto{0.0pt}{6.0pt}\pgfsys@stroke\pgfsys@invoke{ } \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope} \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope{}{}{}\hss}% \pgfsys@discardpath\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope\hss}}% \lxSVG@closescope\endpgfpicture}}}}{\hbox{ \leavevmode\hbox to3.6pt{\vbox to% 6.6pt{\pgfpicture\makeatletter\hbox{\hskip 0.3pt\lower-0.3pt\hbox to0.0pt{% \pgfsys@beginscope\pgfsys@invoke{ }\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill{0}{0}% {0}\pgfsys@invoke{ }\pgfsys@setlinewidth{0.4pt}\pgfsys@invoke{ }\nullfont\hbox to% 0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }{{{}{}}{{}}{} {{}{}}{}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@setlinewidth{0.6pt}% \pgfsys@invoke{ }\pgfsys@roundcap\pgfsys@invoke{ }{}\pgfsys@moveto{3.0pt}{0.0% pt}\pgfsys@lineto{0.0pt}{6.0pt}\pgfsys@stroke\pgfsys@invoke{ } \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope} \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope{}{}{}\hss}% \pgfsys@discardpath\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope\hss}}% \lxSVG@closescope\endpgfpicture}}}}{\hbox{ \leavevmode\hbox to2.45pt{\vbox to% 4.45pt{\pgfpicture\makeatletter\hbox{\hskip 0.22499pt\lower-0.22499pt\hbox to0% .0pt{\pgfsys@beginscope\pgfsys@invoke{ }\definecolor{pgfstrokecolor}{rgb}{% 0,0,0}\pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill% {0}{0}{0}\pgfsys@invoke{ }\pgfsys@setlinewidth{0.4pt}\pgfsys@invoke{ }% \nullfont\hbox to0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }{{{}{}}{{}}{} {{}{}}{}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@setlinewidth{0.45pt}% \pgfsys@invoke{ }\pgfsys@roundcap\pgfsys@invoke{ }{}\pgfsys@moveto{2.0pt}{0.0% pt}\pgfsys@lineto{0.0pt}{4.0pt}\pgfsys@stroke\pgfsys@invoke{ } \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope} 
\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope{}{}{}\hss}% \pgfsys@discardpath\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope\hss}}% \lxSVG@closescope\endpgfpicture}}}}{\hbox{ \leavevmode\hbox to1.9pt{\vbox to% 3.4pt{\pgfpicture\makeatletter\hbox{\hskip 0.2pt\lower-0.2pt\hbox to0.0pt{% \pgfsys@beginscope\pgfsys@invoke{ }\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill{0}{0}% {0}\pgfsys@invoke{ }\pgfsys@setlinewidth{0.4pt}\pgfsys@invoke{ }\nullfont\hbox to% 0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }{{{}{}}{{}}{} {{}{}}{}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@setlinewidth{0.4pt}% \pgfsys@invoke{ }\pgfsys@roundcap\pgfsys@invoke{ }{}\pgfsys@moveto{1.5pt}{0.0% pt}\pgfsys@lineto{0.0pt}{3.0pt}\pgfsys@stroke\pgfsys@invoke{ } \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope} \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope{}{}{}\hss}% \pgfsys@discardpath\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope\hss}}% \lxSVG@closescope\endpgfpicture}}}}}T_{N}italic_X italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT … italic_T start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT, where Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT stands for the set of example sentences for topic i𝑖iitalic_i.
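The serial difference above can be illustrated with a minimal sketch. It assumes that a sentence's similarity to a set is the mean cosine similarity to the set's members, and that the serial difference ranks sentences of $X$ by ascending total similarity to the topic sets. The function names `set_similarity` and `serial_difference` and the 2-D toy embeddings are hypothetical; in practice the embeddings would come from the fine-tuned sentence embedding model.

```python
import numpy as np

def set_similarity(x, S):
    """Mean cosine similarity between each row of x and the rows of S (assumed definition)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    return (x @ S.T).mean(axis=1)

def serial_difference(X, topic_sets):
    """Rank rows of X so that sentences least similar to all topic sets come first."""
    total = sum(set_similarity(X, T) for T in topic_sets)
    return np.argsort(total, kind="stable")

# Toy 2-D embeddings standing in for real sentence embeddings (illustrative only).
T1 = np.array([[1.0, 0.0], [0.9, 0.1]])   # example sentences of topic 1
T2 = np.array([[0.0, 1.0], [0.1, 0.9]])   # example sentences of topic 2
X = np.array([
    [1.0, 0.05],    # close to topic 1
    [0.05, 1.0],    # close to topic 2
    [-1.0, -1.0],   # unlike either topic: the "new topic" candidate
])

ranking = serial_difference(X, [T1, T2])
print(ranking[0])  # index of the sentence least related to the known topics
```

Summing the per-topic similarities is one simple way to combine several differences into a single ranking; other aggregations (for example, the maximum over topics) would penalize closeness to any single topic more strongly.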

We use the Twitter Stance Evaluation datasets (Barbieri et al., 2020; Mohammad et al., 2018; Barbieri et al., 2018; Van Hee et al., 2018; Basile et al., 2019; Zampieri et al., 2019; Rosenthal et al., 2017; Mohammad et al., 2016) to illustrate the new topic discovery application. Specifically, we select “abortion”, “atheism”, and “feminist” as the existing topics, denoted as $T_1$, $T_2$, and $T_3$, and use $X \setminus T_1 \setminus T_2 \setminus T_3$ to find sentences with new topics. In the created evaluation dataset, the only other topic is “climate”. As shown in Table 5, the top sentences extracted are indeed all related to this topic.

Operation: $X \setminus T_1 \setminus T_2 \setminus T_3$ /* find sentences not related to “abortion”, “atheism”, or “feminist” */

Results:
  • @user Weather patterns evolving very differently over the last few years.. It’s so cold and windy here in Sydney
  • On a scale of 1 to 10 the air quality in Whistler is a 35. #wildfires #BCwildfire #SemST
  • Look out for the hashtag #UKClimate2015 for news today on how the UK is doing in both reducing emissions and adapting to #SemST
  • Second heatwave hits NA NW pop** up everywhere

Table 5: Demonstration of new topic discovery on Twitter leveraging SetCSE serial operations.

7 Discussion

In this section, we provide quantitative justification for using sets to represent semantics, and a comparison between SetCSE intersection and supervised learning. In addition, the performance of the embedding models after the context-specific inter-set contrastive learning is evaluated and presented in Appendix E.3.

Although the idea of expressing sophisticated semantics using sets instead of single sentences aligns with our intuition, it needs quantitative justification. We conduct the experiments of Section 5 with $n_{\text{sample}}$ ranging from 1 to 30, where $n_{\text{sample}} = 1$ corresponds to querying by single sentences. The accuracy and F1 of those experiments can be found in Figures 4 and 8. As one can see, using sets ($n_{\text{sample}} > 1$) significantly improves querying performance, while $n_{\text{sample}} = 20$ is sufficient to provide positive results in most cases.
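For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how accuracy and F1 could be computed from a ranked retrieval result. The protocol is an assumption, not the paper's evaluation code: the top $k$ ranked sentences are treated as predicted positives, and the helper names `top_k_predictions` and `accuracy_and_f1`, the labels, and the ranking are hypothetical toy values.

```python
def top_k_predictions(ranking, k, n):
    """Mark the k highest-ranked sentence indices as predicted positives (assumed protocol)."""
    top = set(ranking[:k])
    return [i in top for i in range(n)]

def accuracy_and_f1(y_true, y_pred):
    """Standard binary accuracy and F1 from boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Hypothetical ground truth: which of six sentences carry the queried semantic.
y_true = [True, True, False, False, True, False]
# Hypothetical ranking returned by a SetCSE operation (most relevant first).
ranking = [0, 4, 3, 1, 2, 5]

y_pred = top_k_predictions(ranking, k=3, n=len(y_true))
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

Repeating such a computation for rankings obtained with different values of $n_{\text{sample}}$ would give one curve per metric, in the spirit of Figure 4.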

Figure 4: SetCSE operation performances on AGT dataset for different values of $n_{\text{sample}}$. (a) SetCSE intersection performance. (b) SetCSE difference performance.

We also compare the SetCSE intersection with supervised classification, where the latter treats the sample sentences $Q_i$ in Section 5.1 as training data and predicts the semantics of $U$. Results of this evaluation can be found in Appendix E.2 and Table 13. As one can see, the performances are on par, but supervised learning results cannot be used in complex sentence querying tasks.

8 Conclusion and Future Work

Taking inspiration from Set Theory, we introduce a novel querying framework named SetCSE, which employs sets to represent complex semantics and leverages its defined operations for structurally retrieving information. Within this framework, an inter-set contrastive learning objective is introduced. The efficacy of this learning objective in improving the discriminatory capability of the underlying sentence embedding models is demonstrated through extensive evaluations. The proposed SetCSE operations exhibit significant adaptability and utility in advancing information retrieval tasks, including complex semantic search, active learning, new topic discovery, and more.

Although we present comprehensive results in the evaluation and application sections, several avenues remain unexplored: testing SetCSE performance on various benchmark information retrieval tasks, applying the framework to larger embedding models for further performance improvement, and potentially incorporating LoRA into the framework (Hu et al., 2021). Additionally, we aim to create a SetCSE application interface that enables quick sentence extraction through its straightforward syntax.

Acknowledgement

We express gratitude to Di Xu, Cong Liu, Yu-Ching Shih, and Hsi-Wei Hsieh for their enlightening discussions. Additionally, we extend our thanks to the anonymous area chair and reviewers for their constructive comments and suggestions.

References

  • Abiri et al. (2017) Ahmad Abiri, Omeed Paydar, Anna Tao, Megan LaRocca, Kang Liu, Bradley Genovese, Robert Candler, Warren S Grundfest, and Erik P Dutson. Tensile strength and failure load of sutures for robotic surgery. Surgical endoscopy, 31:3258–3270, 2017.
  • Agirre & Edmonds (2007) Eneko Agirre and Philip Edmonds. Word sense disambiguation: Algorithms and applications, volume 33. Springer Science & Business Media, 2007.
  • Agirre et al. (2012) Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. Semeval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp.  385–393, 2012.
  • Agirre et al. (2013) Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic textual similarity. In Second joint conference on lexical and computational semantics (*SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity, pp.  32–43, 2013.
  • Agirre et al. (2014) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pp.  81–91, 2014.
  • Agirre et al. (2015) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp.  252–263, 2015.
  • Agirre et al. (2016) Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez Agirre, Rada Mihalcea, German Rigau Claramunt, and Janyce Wiebe. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp.  497–511, San Diego, CA, 2016. Association for Computational Linguistics.
  • Alavian et al. (2018) Pooya Alavian, Yongsoon Eun, Kang Liu, Semyon M Meerkov, and Liang Zhang. The ($\alpha$, $\beta$)-precise estimates of mtbf and mttr: Definitions, calculations, and effect on machine efficiency and throughput evaluation in serial production lines. URL: http://web.eecs.umich.edu/~smm/publications/mtbf_mttr_estimates.pdf, 2018.
  • Alavian et al. (2019) Pooya Alavian, Yongsoon Eun, Kang Liu, Semyon M Meerkov, and Liang Zhang. The ($\alpha$, $\beta$)-precise estimates of mtbf and mttr: Definitions, calculations, and induced effect on machine efficiency evaluation. IFAC-PapersOnLine, 52(13):1004–1009, 2019.
  • Alavian et al. (2020) Pooya Alavian, Yongsoon Eun, Kang Liu, Semyon M Meerkov, and Liang Zhang. The ($\alpha$, $\beta$)-precise estimates of mtbf and mttr: Definition, calculation, and observation time. IEEE Transactions on Automation Science and Engineering, 18(3):1469–1477, 2020.
  • Alavian et al. (2022) Pooya Alavian, Yongsoon Eun, Kang Liu, Semyon M Meerkov, and Liang Zhang. The ($\alpha_X$, $\beta_X$)-precise estimates of production systems performance metrics. International Journal of Production Research, 60(7):2230–2253, 2022.
  • AlSumait et al. (2008) Loulwah AlSumait, Daniel Barbará, and Carlotta Domeniconi. On-line lda: Adaptive topic models for mining text streams with applications to topic detection and tracking. In 2008 eighth IEEE international conference on data mining, pp.  3–12. IEEE, 2008.
  • Arvidsson & Dumay (2022) Susanne Arvidsson and John Dumay. Corporate ESG reporting quantity, quality and performance: Where to now for environmental policy and practice? Business Strategy and the Environment, 31(3):1091–1110, 2022.
  • Barbieri et al. (2018) Francesco Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis Espinosa-Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and Horacio Saggion. Semeval 2018 task 2: Multilingual emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pp.  24–33, 2018.
  • Barbieri et al. (2020) Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke, and Leonardo Neves. TweetEval:Unified Benchmark and Comparative Evaluation for Tweet Classification. In Proceedings of Findings of EMNLP, 2020.
  • Basile et al. (2019) Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp.  54–63, Minneapolis, Minnesota, USA, 2019. Association for Computational Linguistics. doi: 10.18653/v1/S19-2007. URL https://www.aclweb.org/anthology/S19-2007.
  • Bevilacqua et al. (2021) Michele Bevilacqua, Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. Recent trends in word sense disambiguation: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. International Joint Conference on Artificial Intelligence, Inc, 2021.
  • Bhat et al. (2020) Siddharth Bhat, Alok Debnath, Souvik Banerjee, and Manish Shrivastava. Word embeddings as tuples of feature probabilities. In Proceedings of the 5th Workshop on Representation Learning for NLP, pp.  24–33, 2020.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.
  • Blackburn & Bos (2003) Patrick Blackburn and Johan Bos. Computational semantics. Theoria: An International Journal for Theory, History and Foundations of Science, pp.  27–45, 2003.
  • Blei & Lafferty (2006) David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pp.  113–120, 2006.
  • Bonial et al. (2020) Claire Bonial, Stephanie Lukin, David Doughty, Steven Hill, and Clare Voss. Infoforager: Leveraging semantic search with amr for covid-19 research. In Proceedings of the Second International Workshop on Designing Meaning Representations, pp.  67–77, 2020.
  • Cai et al. (2020) Xingyu Cai, Jiaji Huang, Yuchen Bian, and Kenneth Church. Isotropy in the contextual embedding space: Clusters and manifolds. In International Conference on Learning Representations, 2020.
  • Cann (1993) Ronnie Cann. Formal semantics: an introduction. Cambridge University Press, 1993.
  • Cantor (1874) Georg Cantor. Ueber eine eigenschaft des inbegriffs aller reellen algebraischen zahlen. Journal für die reine und angewandte Mathematik, 77:258–262, 1874.
  • Casanueva et al. (2020) Iñigo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulic. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020, mar 2020. URL https://arxiv.org/abs/2003.04807. Data available at https://github.com/PolyAI-LDN/task-specific-datasets.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
  • Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder for english. In Proceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations, pp.  169–174, 2018.
  • CFA Institute (2023) CFA Institute. ESG Investing and Analysis — cfainstitute.org. https://www.cfainstitute.org/en/rpc-overview/esg-investing, 2023.
  • Chen et al. (2019) Junyang Chen, Zhiguo Gong, and Weiwen Liu. A nonparametric model for online topic discovery with word embeddings. Information Sciences, 504:32–47, 2019.
  • Chen et al. (2018) Qian Chen, Zhen-Hua Ling, and Xiaodan Zhu. Enhancing sentence embedding with generalized pooling. arXiv preprint arXiv:1806.09828, 2018.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  1597–1607. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/chen20j.html.
  • Chierchia & McConnell-Ginet (1990) Gennaro Chierchia and Sally McConnell-Ginet. Meaning and grammar: An introduction to semantics. 1990.
  • Chuang et al. (2022) Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Wen-tau Yih, Yoon Kim, and James Glass. Diffcse: Difference-based contrastive learning for sentence embeddings. arXiv preprint arXiv:2204.10298, 2022.
  • Chui et al. (2023) Michael Chui, Eric Hazan, Roger Roberts, Alex Singla, and Kate Smaje. The economic potential of generative ai. McKinsey & Company, 2023.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364, 2017.
  • Dasgupta et al. (2021) Shib Sankar Dasgupta, Michael Boratko, Siddhartha Mishra, Shriya Atmakuri, Dhruvesh Patel, Xiang Lorraine Li, and Andrew McCallum. Word2box: Capturing set-theoretic semantics of words using box embeddings. arXiv preprint arXiv:2106.14361, 2021.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Eltaief (2022) Abir Eltaief. Abirate/english quotes Datasets at Hugging Face -huggingface.co. https://huggingface.co/datasets/Abirate/english_quotes, 2022.
  • Esteva et al. (2020) Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. Co-search: Covid-19 information retrieval with semantic search, question answering, and abstractive summarization. arXiv preprint arXiv:2006.09595, 2020.
  • Ethayarajh (2019) Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.
  • Eun et al. (2022) Yongsoon Eun, Kang Liu, and Semyon M Meerkov. Production systems with cycle overrun: modelling, analysis, improvability and bottlenecks. International Journal of Production Research, 60(2):534–548, 2022.
  • Fellbaum (1998) Christiane Fellbaum. WordNet: An electronic lexical database. MIT press, 1998.
  • Feng et al. (2020) Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic bert sentence embedding. arXiv preprint arXiv:2007.01852, 2020.
  • Fox (2010) Chris Fox. Computational semantics. The Handbook of Computational Linguistics and Natural Language Processing, pp.  394–428, 2010.
  • Friede et al. (2015) Gunnar Friede, Timo Busch, and Alexander Bassen. ESG and financial performance: aggregated evidence from more than 2000 empirical studies. Journal of sustainable finance & investment, 5(4):210–233, 2015.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
  • Giorgi et al. (2020) John Giorgi, Osvald Nitski, Bo Wang, and Gary Bader. Declutr: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659, 2020.
  • Gui et al. (2020) Tao Gui, Jiacheng Ye, Qi Zhang, Zhengyan Li, Zichu Fei, Yeyun Gong, and Xuanjing Huang. Uncertainty-aware label refinement for sequence labeling. arXiv preprint arXiv:2012.10608, 2020.
  • Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp.  1735–1742. IEEE, 2006.
  • Harel & Rumpe (2004) David Harel and Bernhard Rumpe. Meaningful modeling: what’s the semantics of "semantics"? Computer, 37(10):64–72, 2004.
  • Henisz et al. (2019) Witold Henisz, Tim Koller, and Robin Nuttall. Five ways that ESG creates value, Nov 2019.
  • Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483, 2016.
  • Hong et al. (2022) Xiangjun Hong, Xian Lin, Laitan Fang, Yuchen Gao, and Ruipeng Li. Application of machine learning models for predictions on cross-border merger and acquisition decisions with ESG characteristics from an ecosystem and sustainable development perspective. Sustainability, 14(5):2838, 2022.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Ide & Véronis (1998) Nancy Ide and Jean Véronis. Introduction to the special issue on word sense disambiguation: the state of the art. Computational linguistics, 24(1):1–40, 1998.
  • Investopedia (2023) Investopedia. What Is Environmental, Social, and Governance (ESG) Investing? - investopedia.com. https://www.investopedia.com/terms/e/environmental-social-and-governance-esg-criteria.asp, 2023.
  • Ismael (2022) Rami Ismael. Rami/multi-label-class-github-issues-text-classification Datasets at Hugging Face - huggingface.co. https://huggingface.co/datasets/Rami/multi-label-class-github-issues-text-classification, 2022.
  • Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021.
  • Jain et al. (2023) Nihal Jain, Dejiao Zhang, Wasi Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, et al. Contraclm: Contrastive learning for causal language model. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  6436–6459, 2023.
  • Jian et al. (2022) Yiren Jian, Chongyang Gao, and Soroush Vosoughi. Contrastive learning for prompt-based few-shot language learners. arXiv preprint arXiv:2205.01308, 2022.
  • Jiang et al. (2022) Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. Promptbert: Improving bert sentence embeddings with prompts. arXiv preprint arXiv:2201.04337, 2022.
  • Johnson-Laird (2004) Philip N Johnson-Laird. The history of mental models. In Psychology of reasoning, pp.  189–222. Psychology Press, 2004.
  • Kanuri (2020) Srinidhi Kanuri. Risk and return characteristics of environmental, social, and governance (ESG) equity ETFs. The Journal of Beta Investment Strategies, 11(2):66–75, 2020.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
  • Kazekami (2020) Sachiko Kazekami. Mechanisms to improve labor productivity by performing telework. Telecommunications Policy, 44(2):101868, 2020.
  • Khan (2022) Muhammad Arif Khan. ESG disclosure and firm performance: A bibliometric and meta analysis. Research in International Business and Finance, 61:101668, 2022.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. Advances in neural information processing systems, 28, 2015.
  • Kreidler (1998) Charles W Kreidler. Introducing english semantics. Psychology Press, 1998.
  • Lewis (1997) David Lewis. Reuters-21578 Text Categorization Collection. UCI Machine Learning Repository, 1997. DOI: https://doi.org/10.24432/C52G6M.
  • Li et al. (2022) Jonathan Li, Rohan Bhambhoria, and Xiaodan Zhu. Parameter-efficient legal domain adaptation. In Proceedings of the Natural Legal Language Processing Workshop 2022, pp.  119–129, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.nllp-1.10.
  • Li et al. (2021) Ting-Ting Li, Kai Wang, Toshiyuki Sueyoshi, and Derek D Wang. ESG: Research progress and future prospects. Sustainability, 13(21):11663, 2021.
  • Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
  • Liu (2021) K Liu. The ($\alpha$, $\beta$)-precision theory for production system monitoring and improvement. PhD thesis, The University of Michigan, 2021.
  • Liu et al. (2019a) Kang Liu, Nan Li, Ilya Kolmanovsky, and Anouck Girard. A vehicle routing problem with dynamic demands and restricted failures solved using stochastic predictive control. In 2019 American Control Conference (ACC), pp.  1885–1890. IEEE, 2019a.
  • Liu et al. (2020) Qi Liu, Matt J Kusner, and Phil Blunsom. A survey on contextual embeddings. arXiv preprint arXiv:2003.07278, 2020.
  • Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019b.
  • Logeswaran & Lee (2018) Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893, 2018.
  • Mai et al. (2022) Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Transactions on Affective Computing, 2022.
  • Malo et al. (2014) Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4):782–796, 2014.
  • McCarthy (2009) Diana McCarthy. Word sense disambiguation: An overview. Language and Linguistics compass, 3(2):537–558, 2009.
  • Mohammad et al. (2016) Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. Semeval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp.  31–41, 2016.
  • Mohammad et al. (2018) Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. Semeval-2018 task 1: Affect in tweets. In Proceedings of the 12th international workshop on semantic evaluation, pp.  1–17, 2018.
  • Muennighoff (2022) Niklas Muennighoff. SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904, 2022.
  • Navigli (2009) Roberto Navigli. Word sense disambiguation: A survey. ACM computing surveys (CSUR), 41(2):1–69, 2009.
  • Partee et al. (2012) Barbara BH Partee, Alice G ter Meulen, and Robert Wall. Mathematical methods in linguistics, volume 30. Springer Science & Business Media, 2012.
  • Partee (2005) Barbara H Partee. Formal semantics. In Lectures at a workshop in Moscow. http://people.umass.edu/partee/RGGU_2005/RGGU05_formal_semantics.htm, 2005.
  • Portner & Partee (2008) Paul H Portner and Barbara H Partee. Formal semantics: The essential readings. John Wiley & Sons, 2008.
  • Qin & Yang (2019) Yu Qin and Yi Yang. What you say and how you say it matters: Predicting stock volatility using verbal and vocal cues. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  390–401, Florence, Italy, July 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P19-1038.
  • Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
  • Reiser & Tucker (2019) Dana Brakman Reiser and Anne Tucker. Buyer beware: variation and opacity in ESG and ESG index funds. Cardozo L. Rev., 41:1921, 2019.
  • Riemer (2010) Nick Riemer. Introducing semantics. Cambridge University Press, 2010.
  • Rompotis (2022) Gerasimos G Rompotis. The ESG ETFs in the UK. Journal of Asset Management, 23(2):114–129, 2022.
  • Rosenthal et al. (2017) Sara Rosenthal, Noura Farra, and Preslav Nakov. Semeval-2017 task 4: Sentiment analysis in twitter. In Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017), pp.  502–518, 2017.
  • Schuster et al. (2018) Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. Cross-lingual transfer learning for multilingual task oriented dialog. arXiv preprint arXiv:1810.13327, 2018.
  • Sen et al. (2020) Jaydeep Sen, Chuan Lei, Abdul Quamar, Fatma Özcan, Vasilis Efthymiou, Ayushi Dalmia, Greg Stager, Ashish Mittal, Diptikalyan Saha, and Karthik Sankaranarayanan. Athena++ natural language querying for complex nested sql queries. Proceedings of the VLDB Endowment, 13(12):2747–2759, 2020.
  • Settles (2009) Burr Settles. Active learning literature survey. 2009.
  • Shao et al. (2019) Taihua Shao, Yupu Guo, Honghui Chen, and Zepeng Hao. Transformer-based neural network for answer selection in question answering. IEEE Access, 7:26146–26156, 2019.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Stevenson & Wilks (2003) Mark Stevenson and Yorick Wilks. Word sense disambiguation. The Oxford handbook of computational linguistics, 249:249, 2003.
  • Utz (2019) Sebastian Utz. Corporate scandals and the reliability of ESG assessments: Evidence from an international sample. Review of Managerial Science, 13:483–511, 2019.
  • Van der Maaten & Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Van Hee et al. (2018) Cynthia Van Hee, Els Lefever, and Véronique Hoste. Semeval-2018 task 3: Irony detection in english tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pp.  39–50, 2018.
  • Wang & Kuo (2020) Bin Wang and C-C Jay Kuo. Sbert-wk: A sentence embedding method by dissecting bert-based word models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2146–2157, 2020.
  • Woo & Tan (2022) Leonard Woo and Daniel Tan. Considering ESG in business valuation, Jun 2022.
  • Yan et al. (2021) Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. Consert: A contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741, 2021.
  • Yang et al. (2019) Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307, 2019.
  • Zampieri et al. (2019) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. Semeval-2019 task 6: Identifying and categorizing offensive language in social media (offenseval). In Proceedings of the 13th International Workshop on Semantic Evaluation, pp.  75–86, 2019.
  • Zhang et al. (2022) Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A Hedderich, and Dietrich Klakow. Mcse: Multimodal contrastive learning of sentence embeddings. arXiv preprint arXiv:2204.10931, 2022.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015.
  • Zhelezniak et al. (2019) Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Francesco Moramarco, Jack Flann, and Nils Y Hammerla. Don’t settle for average, go for the max: fuzzy sets and max-pooled word vectors. arXiv preprint arXiv:1904.13264, 2019.

APPENDIX


Appendix A Related Work

A.1 Set Theory in Formal Semantics

At its core, Formal Semantics aims to create precise, rule-based systems that capture the meaning of language constructs, from words and phrases to complex sentences and discourse (Chierchia & McConnell-Ginet, 1990; Cann, 1993; Partee, 2005; Portner & Partee, 2008; Partee et al., 2012). Set theory plays a pivotal role in achieving this goal. In particular, we highlight the following contributions of Set Theory to Formal Semantics mentioned in Portner & Partee (2008):

Semantic Representation. Set theory is used to represent the meanings of words and phrases. Individual elements of sets can represent various semantic entities, such as objects, actions, or properties. For example, the set dog might represent the concept of a dog, while the set run represents the action of running.

Compositionality. One of the fundamental principles in Formal Semantics is compositionality, which states that the meaning of a complex expression is determined by the meanings of its parts and how they are combined.

Predicate Logic. Set theory often integrates with predicate logic to represent relationships and quantification in natural language. Predicate logic allows for the representation of propositions, and set theory complements this by representing the sets of entities that satisfy these propositions.
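The correspondence between predicates and sets described above can be illustrated with a small sketch. It is a toy example, not part of SetCSE: the universe, the entity names, and the helper `extension` are all hypothetical and chosen only for illustration. It shows that the set of entities satisfying a conjunction of predicates is the intersection of their sets, and that conjunction with a negated predicate gives the set difference.

```python
# Toy universe of entities; every name here is an illustrative assumption.
universe = {"rex", "fido", "tom", "tweety"}

# Each one-place predicate is represented by the set of entities satisfying it.
dog = {"rex", "fido"}
runs = {"rex", "tom"}


def extension(predicate, domain):
    """Return the set of entities in `domain` for which `predicate` holds."""
    return {x for x in domain if predicate(x)}


# "x is a dog AND x runs" corresponds to set intersection.
dogs_that_run = extension(lambda x: x in dog and x in runs, universe)

# "x is a dog AND NOT (x runs)" corresponds to set difference.
dogs_that_do_not_run = extension(lambda x: x in dog and x not in runs, universe)

# Quantified statements reduce to relations between sets.
some_dog_runs = bool(dog & runs)  # existential: the intersection is non-empty
every_dog_runs = dog <= runs      # universal: dog is a subset of runs

print(dogs_that_run == dog & runs)         # True
print(dogs_that_do_not_run == dog - runs)  # True
```

The same intersection and difference pattern is what the SetCSE operations borrow as query syntax, although SetCSE applies it to sets of sentences and embedding similarity rather than to exact set membership.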

A.2 Set Operations for Word Interpretation and Embedding Improvement

Section 2 highlights several works that have employed set operations on word embeddings to interpret relationships between words, leading to quantitative and qualitative improvements in word embedding qualities (Zhelezniak et al., 2019; Bhat et al., 2020; Dasgupta et al., 2021).

The novelty of the SetCSE framework, which utilizes embeddings and set-theoretic operations, is specifically addressed in relation to the aforementioned works below:

  1. SetCSE employs sentence embeddings for semantic representation and information retrieval, diverging from the prior focus of the mentioned works on using and improving word embeddings.

  2. SetCSE utilizes sets of sentences and its learning mechanism to recognize and represent complex and intricate semantics for information querying. This approach differs from previous works, which did not consider the collective use of words to represent complex semantics.

  3. SetCSE integrates set-theoretic operations for expressing complex queries in practical sentence retrieval tasks, distinguishing it from previous works that used set operations to uncover word relationships.

Appendix B SetCSE Operations

B.1 Properties of SetCSE Operations

Note that, as per Definition 2, the output of SetCSE serial operations, $A \cap B_1 \cap \dots \cap B_N \setminus D \setminus \dots \setminus D_M$
to3.6pt{\vbox to% 6.6pt{\pgfpicture\makeatletter\hbox{\hskip 0.3pt\lower-0.3pt\hbox to0.0pt{% \pgfsys@beginscope\pgfsys@invoke{ }\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill{0}{0}% {0}\pgfsys@invoke{ }\pgfsys@setlinewidth{0.4pt}\pgfsys@invoke{ }\nullfont\hbox to% 0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }{{{}{}}{{}}{} {{}{}}{}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@setlinewidth{0.6pt}% \pgfsys@invoke{ }\pgfsys@roundcap\pgfsys@invoke{ }{}\pgfsys@moveto{3.0pt}{0.0% pt}\pgfsys@lineto{0.0pt}{6.0pt}\pgfsys@stroke\pgfsys@invoke{ } \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope} \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope{}{}{}\hss}% \pgfsys@discardpath\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope\hss}}% \lxSVG@closescope\endpgfpicture}}}}{\hbox{ \leavevmode\hbox to2.45pt{\vbox to% 4.45pt{\pgfpicture\makeatletter\hbox{\hskip 0.22499pt\lower-0.22499pt\hbox to0% .0pt{\pgfsys@beginscope\pgfsys@invoke{ }\definecolor{pgfstrokecolor}{rgb}{% 0,0,0}\pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill% {0}{0}{0}\pgfsys@invoke{ }\pgfsys@setlinewidth{0.4pt}\pgfsys@invoke{ }% \nullfont\hbox to0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }{{{}{}}{{}}{} {{}{}}{}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@setlinewidth{0.45pt}% \pgfsys@invoke{ }\pgfsys@roundcap\pgfsys@invoke{ }{}\pgfsys@moveto{2.0pt}{0.0% pt}\pgfsys@lineto{0.0pt}{4.0pt}\pgfsys@stroke\pgfsys@invoke{ } \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope} \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope{}{}{}\hss}% \pgfsys@discardpath\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope\hss}}% \lxSVG@closescope\endpgfpicture}}}}{\hbox{ \leavevmode\hbox to1.9pt{\vbox to% 3.4pt{\pgfpicture\makeatletter\hbox{\hskip 0.2pt\lower-0.2pt\hbox to0.0pt{% \pgfsys@beginscope\pgfsys@invoke{ }\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}\pgfsys@invoke{ }\pgfsys@color@rgb@fill{0}{0}% {0}\pgfsys@invoke{ 
}\pgfsys@setlinewidth{0.4pt}\pgfsys@invoke{ }\nullfont\hbox to% 0.0pt{\pgfsys@beginscope\pgfsys@invoke{ }{{{}{}}{{}}{} {{}{}}{}\pgfsys@beginscope\pgfsys@invoke{ }\pgfsys@setlinewidth{0.4pt}% \pgfsys@invoke{ }\pgfsys@roundcap\pgfsys@invoke{ }{}\pgfsys@moveto{1.5pt}{0.0% pt}\pgfsys@lineto{0.0pt}{3.0pt}\pgfsys@stroke\pgfsys@invoke{ } \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope} \pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope{}{}{}\hss}% \pgfsys@discardpath\pgfsys@invoke{\lxSVG@closescope }\pgfsys@endscope\hss}}% \lxSVG@closescope\endpgfpicture}}}}}D_{M}italic_A ∩ italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∩ ⋯ ∩ italic_B start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT italic_D … italic_D start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT, forms an ordered set of elements in A𝐴Aitalic_A. Hence, these operations aren’t strictly equivalent to the operations defined in Set Theory (Cantor, 1874; Johnson-Laird, 2004), lacking certain properties of the latter, such as the commutative law. Despite this asymmetry, the definitions within the SetCSE framework offer several advantages:

  • It is intuitive to borrow the concepts of intersection and difference operations to describe the “selection” and “deselection” of sentences with certain semantics.

  • Serving as a querying framework, SetCSE is designed to retrieve information from a set of sentences following certain queries, and the proposed SetCSE operation syntax aligns well with this purpose. For instance, the SetCSE serial operation $A \cap B \setminus C$ means “finding sentences in the set $A$ that contain the semantics $B$ but not $C$”.
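To make the asymmetry concrete, the following is a minimal Python sketch of a serial operation such as $A \cap B \setminus C$. Since Definition 2 is not reproduced in this appendix, the sketch assumes that sentences in $A$ are ranked by the sum of their mean cosine similarities to the intersected sets minus the sum of their mean cosine similarities to the differenced sets. The helper names `mean_cosine` and `serial_operation` and the toy two-dimensional embeddings are hypothetical; they stand in for the embeddings of a fine-tuned model.

```python
import numpy as np


def mean_cosine(x, ref):
    """Mean cosine similarity between each row of x and all rows of ref."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    return (x @ ref.T).mean(axis=1)


def serial_operation(a, intersect=(), difference=()):
    """Return indices of the rows of `a`, best match first.

    Assumed score (not taken verbatim from the paper): similarity to every
    intersected set is added, similarity to every differenced set is subtracted.
    """
    score = np.zeros(len(a))
    for b in intersect:
        score += mean_cosine(a, b)
    for d in difference:
        score -= mean_cosine(a, d)
    return np.argsort(-score, kind="stable")


# Toy embeddings (hypothetical): three sentences in A, one example each in B and C.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = np.array([[1.0, 0.1]])
C = np.array([[0.1, 1.0]])

order = serial_operation(A, intersect=[B], difference=[C])
print(order)  # ranking of the sentences in A for "A ∩ B \ C"
```

Because the output is always an ordering of the first operand, `serial_operation(A, intersect=[B])` ranks the three sentences of `A`, whereas `serial_operation(B, intersect=[A])` ranks the single sentence of `B`. This is one way to see why the commutative law does not carry over.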

Appendix C Evaluation

C.1 Hyperparameter Optimization

The effects of the temperature parameter $\tau$ and the number of training epochs can be found in Tables 6 and 7, respectively. The effect of using different $n_{\text{sample}}$ to represent semantics is discussed in Section 7. In particular, when optimizing for $\tau$ and the number of training epochs, we consider the SimCSE-BERT model, the AGT dataset, and the SetCSE intersection operation.

  $\tau$  0.001  0.01   0.05   0.1    1
  Acc     77.14  77.31  78.29  77.42  77.54
  F1      77.11  77.29  78.27  77.40  77.52

Table 6: Effects of different temperature $\tau$ for SetCSE intersection on the AGT dataset.

  Epoch  20     30     40     50     60     70     80     90
  Acc    71.40  74.63  75.76  75.82  78.27  79.25  79.74  80.47
  F1     71.42  74.54  75.66  75.70  78.22  79.16  79.72  80.41

Table 7: Effects of different training epochs for SetCSE intersection on the AGT dataset.
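As a small worked reading of Tables 6 and 7, the sketch below picks the temperature with the highest accuracy and computes the accuracy gain for each additional block of ten epochs. The numbers are copied from the two tables; the variable names are illustrative, and selecting on accuracy alone is an assumption rather than the procedure stated in the paper.

```python
# Values copied from Table 6: tau -> (accuracy, F1).
tau_results = {
    0.001: (77.14, 77.11),
    0.01: (77.31, 77.29),
    0.05: (78.29, 78.27),
    0.1: (77.42, 77.40),
    1.0: (77.54, 77.52),
}

# Temperature with the highest accuracy (assumed selection criterion).
best_tau = max(tau_results, key=lambda t: tau_results[t][0])

# Values copied from Table 7: epoch -> accuracy.
epoch_acc = {20: 71.40, 30: 74.63, 40: 75.76, 50: 75.82,
             60: 78.27, 70: 79.25, 80: 79.74, 90: 80.47}

# Accuracy gained by each additional block of ten epochs.
epochs = sorted(epoch_acc)
gains = {e2: round(epoch_acc[e2] - epoch_acc[e1], 2)
         for e1, e2 in zip(epochs, epochs[1:])}

print(best_tau)
print(gains)
```

On these numbers the accuracy differences across temperatures are within about one point, while accuracy keeps increasing with the number of epochs over the range reported.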

C.2 The t-SNE Plots of Sentence Embeddings

As previously mentioned, to illustrate the performance of the SetCSE framework in a more intuitive manner, we include the t-SNE (Van der Maaten & Hinton, 2008) plots of the sentence embeddings for all datasets considered. As one can see, the improvements on the AGT, AGD and FPB datasets are significant, while the improvements on FMTOD are smaller, since for the latter the underlying semantics are already distinctive.

Figure 5: The t-SNE plots of sentence embeddings induced by existing language models and the SetCSE fine-tuned ones for the AGD dataset. As illustrated, the model awareness of different semantics is significantly improved.

Figure 6: The t-SNE plots of sentence embeddings induced by existing language models and the SetCSE fine-tuned ones for the Banking77 dataset, where “Intent 1”, “Intent 2” and “Intent 3” are “card payment fee charged”, “direct debit payment not recognised” and “balance not updated after cheque or cash deposit”, respectively. As illustrated, improvements in model awareness of different semantics can be observed.

Figure 7: The t-SNE plots of sentence embeddings induced by existing language models and the SetCSE fine-tuned ones for the FMTOD dataset, where “Intent 1”, “Intent 2” and “Intent 3” are “find weather”, “set alarm” and “set reminder”, respectively. As illustrated, the improvements in model awareness of different semantics are not as prominent as those for the other datasets, which aligns with the results in Table 1.
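Plots of this kind can be produced by projecting sentence embeddings to two dimensions. The sketch below uses scikit-learn's `TSNE` on synthetic vectors. The three Gaussian clusters, the embedding dimension, and the perplexity value are illustrative assumptions, not the settings used for the figures above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for sentence embeddings: three semantics, 20 sentences each.
centers = rng.normal(size=(3, 32)) * 5.0
embeddings = np.vstack([c + rng.normal(size=(20, 32)) for c in centers])
labels = np.repeat(np.arange(3), 20)

# Project to 2-D; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10.0, random_state=0)
coords = tsne.fit_transform(embeddings)

print(coords.shape)
```

The resulting `coords` array can be passed to any scatter-plot routine and colored by `labels`, once for the embeddings of the original model and once for those of the fine-tuned model.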

C.3 Discussion on SGPT Performance

In Section 5, our evaluations demonstrate that the decoder-only SGPT-125M performs less effectively than encoder-based models of similar sizes, both before and after the inter-set contrastive learning stage. This observation aligns with findings from other studies that compare embeddings produced by BERT-based models and GPT in benchmark word embedding tasks (Ethayarajh, 2019; Liu et al., 2020; Cai et al., 2020).

Since our evaluation indicates that SGPT benefits less from inter-set fine-tuning, future studies may consider other contrastive learning methods (Jian et al., 2022; Jain et al., 2023) to enhance the context awareness and discriminatory capabilities of decoder-based models.

C.4 Evaluation for SetCSE Serial Operations

In this section, we evaluate the performance of SetCSE serial operations. Specifically, we consider the following three serial operations:

  • Series of two SetCSE intersection operations.

  • Series of two SetCSE difference operations.

  • Series of SetCSE intersection and difference operations.

We utilize multi-label datasets to conduct the SetCSE serial operations experiment. To encompass diverse contexts, we consider the following multi-label datasets and their semantics:

  • GitHub Issue (GitHub) (Ismael, 2022) — “help wanted” (H), “docs” (D).

  • English Quotes (Quotes) (Eltaief, 2022) — “inspirational” (I), “love” (L), “life” (F).

  • Reuters-21578 (Reuters) (Lewis, 1997) — “ship” (S), “grain” (G), “crude” (C).

C.4.1 Evaluation of SetCSE Intersection Series

Suppose a multi-label dataset $S$ has $N$ semantics, where $S_i$ denotes the set of sentences with the $i$-th semantic, and each sentence in $S$ contains several semantics in the set $\{1, \dots, N\}$. For evaluating two serial SetCSE intersections, the experiment is set up as follows:

  1. For each $S_i$, randomly select $n_{\text{sample}}$ sentences, denoted as $Q_i$, and concatenate the remaining sentences in all $S_i$, denoted as $U$. Regard the $Q_i$’s as example sets and $U$ as the evaluation set.

  2. Select two sample sets $Q_i$ and $Q_j$, $i, j \in \{1, \dots, N\}$, and conduct $U \cap Q_i \cap Q_j$ following Algorithm 1. Select the top $|U_{i,j}|$ sentences from the results of the serial operations, where $U_{i,j} \subseteq U$ denotes the set of sentences containing semantics $i$ and $j$. Predict the selected sentences as containing semantics $i$ and $j$, and compare against the ground truth to compute accuracy and F1.

  3. As a control group, repeat Step 2 while omitting the model fine-tuning in Algorithm 1.

  4. To compare with the performance of a single SetCSE operation, conduct the experiment in Subsection 5.2 for semantics $i$ and $j$ on $U$.
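The scoring part of Step 2 can be sketched as follows. The function name `top_k_metrics`, the toy scores, and the use of binary F1 on the positive class are assumptions for illustration; the F1 variant is not specified in this section.

```python
def top_k_metrics(scores, is_positive):
    """Select as many top-scored items as there are true positives,
    predict them as positive, and return (accuracy, F1)."""
    n = len(scores)
    k = sum(is_positive)  # |U_{i,j}|: number of ground-truth positives
    ranked = sorted(range(n), key=lambda idx: scores[idx], reverse=True)
    selected = set(ranked[:k])
    predicted = [idx in selected for idx in range(n)]

    tp = sum(p and t for p, t in zip(predicted, is_positive))
    fp = sum(p and not t for p, t in zip(predicted, is_positive))
    fn = sum(t and not p for p, t in zip(predicted, is_positive))
    tn = n - tp - fp - fn

    accuracy = (tp + tn) / n
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Toy example (hypothetical scores from a serial operation on six sentences).
scores = [0.9, 0.8, 0.4, 0.7, 0.1, 0.3]
truth = [True, False, True, True, False, False]
accuracy, f1 = top_k_metrics(scores, truth)
print(accuracy, f1)
```

Under this reading, exactly as many sentences are selected as there are ground-truth positives, so the numbers of false positives and false negatives are always equal.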

The parameters used in the experiments in this section remain consistent with those employed in Section 5.1. Detailed results for the above experiment can be found in Table 8. For instance, the “GitHub-HD” column presents results on the GitHub dataset, with semantics $i$ and $j$ designated as “help wanted” and “docs”, respectively. Notably, the SetCSE framework yields an average improvement of about 26% in the performance of serial intersections. Additionally, the accuracy and F1 scores of two consecutive SetCSE intersections closely approximate the product of the accuracy and F1 scores from two separate SetCSE intersections, respectively.
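The improvement figures can be checked with simple arithmetic on the values reported in Table 8. The paper does not state how the “Ave. Improvement” row is computed; the sketch below assumes it is the relative change of the across-model mean, and uses the GitHub-HD accuracy column for serial intersections as an example. The function name `relative_improvement` is illustrative.

```python
def relative_improvement(before, after):
    """Relative change (in percent) of the mean of `after` over the mean of `before`."""
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return 100.0 * (mean_after - mean_before) / mean_before


# GitHub-HD accuracy, serial intersections, copied from Table 8
# (BERT, RoBERTa, Contriever, SimCSE-BERT, DiffCSE-BERT, MCSE-BERT).
existing = [33.54, 38.10, 47.78, 48.33, 49.71, 48.67]
setcse = [58.33, 53.66, 71.74, 76.93, 75.26, 76.83]
github_hd_acc = relative_improvement(existing, setcse)

# "Ave. Improvement" row of Table 8, per dataset column.
acc_improvements = [55, 22, 8, 38, 13]
f1_improvements = [51, 22, 7, 34, 11]
mean_acc = sum(acc_improvements) / len(acc_improvements)
mean_f1 = sum(f1_improvements) / len(f1_improvements)

print(round(github_hd_acc, 1), mean_acc, mean_f1)
```

Under this assumption the GitHub-HD accuracy improvement comes out at roughly 55%, and averaging the per-column improvements gives about 27% for accuracy and 25% for F1, in line with the table caption.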

                  GitHub-HD      Quotes-FL      Quotes-FI      Reuters-SG     Reuters-SC
Model             Acc    F1      Acc    F1      Acc    F1      Acc    F1      Acc    F1
Existing Model, Single Intersection
  BERT            71.93  75.53   83.02  86.00   91.12  92.00   86.94  89.60   91.77  92.57
  RoBERTa         71.68  75.30   83.85  86.90   90.71  91.68   87.50  90.01   90.26  91.37
  Contriever      75.81  78.67   84.85  87.86   92.15  92.83   79.30  81.85   92.81  93.42
  SimCSE-BERT     74.20  74.62   89.80  91.87   90.63  91.60   92.27  93.40   90.36  91.49
  DiffCSE-BERT    74.21  77.47   88.52  93.76   90.53  91.23   92.90  93.08   90.75  91.50
  MCSE-BERT       74.56  77.75   90.08  91.93   90.51  91.49   91.32  92.27   91.07  92.02
SetCSE, Single Intersection
  BERT            87.27  86.44   92.70  93.84   93.16  93.70   96.38  96.68   95.92  96.12
  RoBERTa         87.77  89.16   90.93  92.54   94.08  94.54   95.18  95.68   95.88  96.12
  Contriever      92.34  93.38   92.76  93.85   94.92  95.22   95.47  95.96   93.42  94.01
  SimCSE-BERT     93.77  94.48   92.58  93.74   92.09  92.82   96.72  96.95   95.89  96.11
  DiffCSE-BERT    92.11  94.10   92.57  93.01   91.23  92.67   95.84  96.21   94.66  94.97
  MCSE-BERT       91.67  92.91   92.15  93.42   92.46  93.13   96.93  97.19   95.78  96.10
Existing Model, Serial Intersections
  BERT            33.54  36.73   54.37  57.00   78.21  79.94   57.85  59.14   79.75  81.78
  RoBERTa         38.10  41.45   49.17  52.42   79.72  81.37   57.73  62.60   78.09  80.44
  Contriever      47.78  53.37   51.35  54.48   83.05  84.47   57.78  59.36   82.77  84.23
  SimCSE-BERT     48.33  51.02   65.47  68.57   78.17  80.52   72.82  74.14   79.05  81.19
  DiffCSE-BERT    49.71  51.66   65.79  68.33   79.33  82.80   69.33  73.43   80.02  82.43
  MCSE-BERT       48.67  51.01   66.20  67.84   79.69  81.70   69.54  73.94   81.15  82.92
SetCSE, Serial Intersections
  BERT            58.33  62.28   69.15  73.88   83.90  85.23   85.53  86.74   92.09  92.41
  RoBERTa         53.66  58.67   64.32  67.32   85.95  87.07   84.66  86.00   92.38  92.62
  Contriever      71.74  75.52   71.32  74.24   89.55  90.11   82.22  84.07   84.25  85.66
  SimCSE-BERT     76.93  79.53   72.22  74.54   88.22  89.68   90.12  90.63   92.55  92.87
  DiffCSE-BERT    75.26  76.16   70.31  73.62   84.08  85.75   88.85  90.77   90.59  91.97
  MCSE-BERT       76.83  77.43   71.48  72.50   82.99  84.49   88.39  89.30   91.05  91.77
Ave. Improvement  55%    51%     22%    22%     8%     7%      38%    34%     13%    11%

Table 8: Evaluation results for series of two SetCSE intersection operations. As illustrated, the average improvements on accuracy and F1 are 27% and 25%, respectively.

C.4.2 Evaluation of SetCSE Difference Series

Suppose a multi-label dataset $S$ has $N$ semantics, where $S_i$ denotes the set of sentences with the $i$-th semantic, and each sentence in $S$ contains several semantics in the set $\{1, \dots, N\}$. For evaluating two serial SetCSE difference operations, the experiment is set up as follows:

  1. For each $S_i$, randomly select $n_{\text{sample}}$ sentences, denoted as $Q_i$, and concatenate the remaining sentences in all $S_i$, denoted as $U$. Regard the $Q_i$’s as example sets and $U$ as the evaluation set.

  2. Select two sample sets $Q_i$ and $Q_j$, $i, j \in \{1, \dots, N\}$, and conduct $U \setminus Q_i \setminus Q_j$ following Algorithm 1. Select the top $|U_{\bar{i},\bar{j}}|$ sentences from the results of the serial operations, where $U_{\bar{i},\bar{j}} \subseteq U$ denotes the set of sentences that contain neither semantic $i$ nor semantic $j$. The selected sentences are predicted as containing neither semantic $i$ nor semantic $j$, and accuracy and F1 are calculated against the ground truth.

  3. As a control group, repeat Step 2 while omitting the model fine-tuning in Algorithm 1.

  4. To compare with the performance of a single SetCSE operation, conduct the experiment in Subsection 5.1 for semantics $i$ and $j$ on $U$.
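The ground-truth sets used in these protocols follow directly from the multi-label annotations. The sketch below derives the set of sentences carrying both semantics, neither semantic, or the first semantic only. The function name `ground_truth_sets` and the toy annotations are hypothetical; only the label names are borrowed from the Reuters dataset.

```python
def ground_truth_sets(labels, i, j):
    """Split sentence indices by which of the semantics i and j they carry.

    `labels` is a list of label sets, one per sentence in the evaluation set U.
    Returns (both, neither, only_i) as sets of indices.
    """
    both = {idx for idx, tags in enumerate(labels) if i in tags and j in tags}
    neither = {idx for idx, tags in enumerate(labels)
               if i not in tags and j not in tags}
    only_i = {idx for idx, tags in enumerate(labels)
              if i in tags and j not in tags}
    return both, neither, only_i


# Toy multi-label annotations (invented for illustration).
labels = [
    {"ship", "grain"},
    {"crude"},
    {"ship"},
    {"grain", "crude"},
    {"ship", "crude"},
]

both, neither, only_i = ground_truth_sets(labels, "ship", "grain")
print(both, neither, only_i)
```

Here `neither` plays the role of $U_{\bar{i},\bar{j}}$ for the difference series, while `both` corresponds to $U_{i,j}$ for the intersection series.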

The detailed experiment results can be found in Table 9. As one can see, the SetCSE framework improves performance of serial difference operations by 37%. Similarly to Section C.4.1, it is observed that the accuracy and F1 scores of two consecutive SetCSE difference operations closely approximate the product of the accuracy and F1 scores from two separate SetCSE difference operations, respectively.

                  GitHub-HD      Quotes-FL      Quotes-FI      Reuters-SG     Reuters-SC
Model             Acc    F1      Acc    F1      Acc    F1      Acc    F1      Acc    F1
Existing Model, Single Difference
  BERT            71.38  72.43   65.77  71.19   65.30  70.86   80.36  82.35   77.88  80.49
  RoBERTa         70.90  71.95   66.34  71.59   64.49  70.60   81.74  83.45   79.23  81.54
  Contriever      71.83  72.92   66.89  71.97   66.00  71.34   73.93  77.52   73.56  77.25
  SimCSE-BERT     71.45  72.73   72.78  76.32   68.61  71.80   83.79  85.10   81.27  83.07
  DiffCSE-BERT    71.49  72.08   73.78  75.71   68.84  71.23   85.94  86.09   83.06  84.51
  MCSE-BERT       71.34  72.56   72.22  75.91   68.37  70.91   87.40  88.21   84.74  86.01
SetCSE, Single Difference
  BERT            81.88  82.64   75.69  78.65   71.15  73.64   96.96  97.01   96.15  96.23
  RoBERTa         83.16  84.28   71.87  75.91   69.66  72.61   96.03  96.13   95.25  95.38
  Contriever      85.27  86.11   79.63  81.85   73.91  75.66   95.58  95.69   94.09  94.27
  SimCSE-BERT     87.39  88.25   80.02  82.13   72.03  75.77   97.13  97.18   96.44  96.51
  DiffCSE-BERT    86.23  87.19   81.51  81.87   73.30  75.76   98.20  98.37   96.50  96.78
  MCSE-BERT       86.06  87.13   80.06  82.20   73.99  75.72   97.37  97.41   96.41  96.48
Existing Model, Serial Difference
  BERT            41.20  42.36   42.31  45.69   48.67  51.32   57.58  58.96   58.74  61.24
  RoBERTa         42.06  45.07   42.93  46.27   49.02  51.61   61.66  62.61   60.10  62.91
  Contriever      42.61  45.07   38.60  42.06   51.70  53.77   56.28  58.01   56.39  57.36
  SimCSE-BERT     42.16  44.22   50.04  52.42   50.34  52.67   64.73  66.17   66.49  68.11
  DiffCSE-BERT    42.11  43.91   50.37  52.20   47.58  50.09   66.17  66.83   67.30  71.26
  MCSE-BERT       42.41  44.71   51.27  53.41   46.71  49.66   66.03  68.50   70.32  73.01
SetCSE, Serial Difference
  BERT            51.13  52.23   48.48  50.93   63.47  65.26   91.84  91.99   91.10  91.18
  RoBERTa         61.15  62.26   45.60  48.26   56.31  56.55   90.34  90.58   91.71  91.85
  Contriever      65.04  69.46   53.22  54.85   69.58  71.23   89.10  89.43   90.24  90.48
  SimCSE-BERT     65.36  70.01   55.54  56.67   67.03  69.91   93.05  93.14   92.67  92.77
  DiffCSE-BERT    66.90  67.29   53.42  56.01   69.11  71.45   92.64  93.19   91.45  92.22
  MCSE-BERT       67.17  73.20   53.54  55.15   69.67  71.80   93.10  93.19   92.91  93.01
Ave. Improvement  47%    45%     15%    12%     32%    29%     50%    47%     49%    44%

Table 9: Evaluation results for series of two SetCSE difference operations. As illustrated, the average improvements on accuracy and F1 are 38% and 35%, respectively.

C.4.3 Evaluation of SetCSE Intersection and Difference Series

Suppose a multi-label dataset $S$ has $N$ semantics, where $S_i$ denotes the set of sentences with the $i$-th semantic, and each sentence in $S$ contains several semantics from the set $\{1, \dots, N\}$. To evaluate a series consisting of one SetCSE intersection followed by one SetCSE difference, the experiment is set up as follows:

  1. For each $S_i$, randomly select $n_{\text{sample}}$ sentences, denoted as $Q_i$, and concatenate the remaining sentences from all $S_i$ into a set denoted as $U$. Regard the $Q_i$'s as example sets and $U$ as the evaluation set.

  2. Select two sample sets $Q_i$ and $Q_j$, $i, j \in \{1, \dots, N\}$, and conduct $U \cap Q_i \setminus Q_j$ following Algorithm 1. Select the top $|U_{i,\bar{j}}|$ sentences from the results of the serial operations, where $U_{i,\bar{j}} \subseteq U$ denotes the set of sentences containing semantic $i$ but not $j$. The selected sentences are predicted as the ones containing semantic $i$ but not $j$, and the accuracy and F1 are calculated against the ground truth.

  3. As a control group, repeat Step 2 while omitting the model fine-tuning in Algorithm 1.

  4. To compare with the performance of a single SetCSE operation, conduct the experiments in Subsections 5.1 and 5.2 for semantics $i$ and $j$ on $U$, respectively.
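The scoring and top-$|U_{i,\bar{j}}|$ evaluation in Step 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes sentence embeddings are already computed, and it assumes the series $U \cap Q_i \setminus Q_j$ is ranked by the mean cosine similarity to $Q_i$ minus the mean cosine similarity to $Q_j$. The function names (`mean_cosine`, `series_scores`, `evaluate_top_k`) and the toy 2-D embeddings are hypothetical, and the fine-tuning step of Algorithm 1 is not reproduced.

```python
import numpy as np

def mean_cosine(x, q):
    """Mean cosine similarity between each row of x and all rows of q."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    return (x @ q.T).mean(axis=1)

def series_scores(u_emb, qi_emb, qj_emb):
    # Assumed score for U ∩ Q_i \ Q_j: reward closeness to Q_i,
    # penalize closeness to Q_j.
    return mean_cosine(u_emb, qi_emb) - mean_cosine(u_emb, qj_emb)

def evaluate_top_k(scores, truth):
    """Predict the top-k sentences as positive, with k = number of true
    positives in the ground truth; return (accuracy, F1)."""
    truth = np.asarray(truth, dtype=bool)
    k = int(truth.sum())
    pred = np.zeros_like(truth)
    pred[np.argsort(-scores)[:k]] = True
    tp = int((pred & truth).sum())
    accuracy = float((pred == truth).mean())
    f1 = float(2 * tp / (pred.sum() + truth.sum())) if k else 0.0
    return accuracy, f1

# Toy 2-D embeddings (hypothetical): axis 0 ~ semantic i, axis 1 ~ semantic j.
q_i = np.array([[1.0, 0.0]])
q_j = np.array([[0.0, 1.0]])
u = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 1.0], [0.9, -0.2]])
truth = [True, False, False, True]  # sentences with semantic i but not j

scores = series_scores(u, q_i, q_j)
acc, f1 = evaluate_top_k(scores, truth)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

For the control group in Step 3, the same functions would be applied to embeddings from the model without fine-tuning.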

The detailed experiment results can be found in Table 10. As one can see, the SetCSE framework improves the accuracy and F1 of serial intersection and difference operations by 37% and 33% on average, respectively. Similarly to Sections C.4.1 and C.4.2, we observe that the accuracy and F1 scores of the serial operations closely approximate the product of the accuracy and F1 scores from the two corresponding single SetCSE operations, respectively.

  GitHub-HD Quotes-FL Quotes-FI Reuters-SG Reuters-SC   
Acc F1 Acc F1 Acc F1 Acc F1 Acc F1   
  Existing Model, Single Operation
BERT 85.22 87.80 77.74 79.40 77.52 80.02 80.17 81.27 78.41 80.62   
RoBERTa 83.63 86.70 78.22 79.75 76.58 79.16 80.15 81.26 77.66 80.10   
Contriever 88.06 89.84 78.47 79.93 79.00 81.27 74.77 77.25 80.55 82.13   
SimCSE-BERT 86.82 88.94 92.50 93.14 86.54 88.87 94.45 94.81 87.07 89.59   
DiffCSE-BERT 86.18 88.34 92.33 93.37 86.04 88.21 94.61 95.42 88.43 90.76   
MCSE-BERT 86.63 88.80 91.83 92.59 86.64 89.18 93.77 94.22 89.08 91.08   
  SetCSE, Single Operation
BERT 91.27 92.31 94.25 94.68 89.47 91.50 97.80 97.86 97.62 97.74   
RoBERTa 88.99 90.53 93.56 94.13 92.04 93.43 97.47 97.58 97.31 97.48   
Contriever 90.35 91.60 93.85 94.32 92.30 93.70 97.84 97.91 92.78 93.97   
SimCSE-BERT 94.13 94.59 95.45 95.70 92.16 93.61 98.51 98.54 98.37 98.43   
DiffCSE-BERT 93.55 96.32 94.42 95.39 93.17 93.77 97.56 98.18 97.30 97.66   
MCSE-BERT 93.38 93.99 94.70 95.04 91.88 93.40 97.49 97.58 97.21 97.40   
  Existing Model, Serial Operations
BERT 59.40 66.64 62.85 66.36 59.60 61.76 52.36 54.26 63.62 65.40   
RoBERTa 56.84 62.84 63.46 66.82 58.89 60.80 58.18 58.68 59.94 61.36   
Contriever 59.43 63.81 63.91 67.17 57.57 59.15 58.96 61.53 62.90 69.13   
SimCSE-BERT 63.79 69.74 80.76 82.59 64.43 66.65 66.59 68.64 68.41 70.32   
DiffCSE-BERT 63.96 65.64 82.89 83.08 64.48 65.93 67.70 68.82 66.74 68.19   
MCSE-BERT 64.24 70.06 81.99 83.59 63.26 65.98 68.46 70.71 67.97 69.20   
  SetCSE, Serial Operations
BERT 76.43 79.23 86.11 87.16 82.33 83.44 91.58 91.73 90.05 90.55   
RoBERTa 70.38 74.52 84.49 85.86 86.43 88.02 90.80 91.08 88.50 89.25   
Contriever 74.00 77.35 85.18 86.32 87.73 88.46 91.67 91.85 90.47 92.31   
SimCSE-BERT 84.17 85.42 89.02 89.64 87.27 88.12 93.36 93.43 93.02 93.32   
DiffCSE-BERT 82.17 84.82 87.46 88.92 86.53 88.20 91.65 92.68 90.28 90.28   
MCSE-BERT 82.07 83.73 87.21 88.02 86.01 88.24 90.85 91.06 88.26 89.04   
  Ave. Improvement 27% 22% 24% 21% 41% 39% 52% 49% 41% 37%   
 
Table 10: Evaluation results for series of SetCSE intersection and difference operations. As illustrated, the average improvements on accuracy and F1 are 37% and 33%, respectively.

The experiments presented in this section indicate that, on average, SetCSE improves the performance of serial operations by 33%. Combined with the evaluation results of Section 5, we conclude that SetCSE significantly enhances model discriminatory capabilities, and yields positive results in SetCSE intersection, difference, and series of operations.

Appendix D Application

D.1 Error Analysis for Application Case Studies

To further illustrate the performance and stability of SetCSE serial operations in practical applications, we conduct an error analysis on the showcased examples in Section 6.1. Specifically, Tables 11(a) and 11(b) present the less preferable query results compared to Tables 3(b) and 3(c), respectively. It is important to note that the presented results are not ranked among the top sentences in the corresponding SetCSE querying outputs. Instead, they are listed between the 30th and 50th positions within the sentences of the S&P 500 earnings call dataset (Qin & Yang, 2019).

Operation: $\bm{X \cap B \cap D \setminus E}$ /*find sentence about “use tech to influence social issues positively”*/
Error Analysis:
It is fairly incremental in terms of adding things like customer support, field application engineering, software support, given that we’re familiarizing people with our architecture.
We had good bio-security to begin with, but we amped it up to an all-time high level of discipline and scrutiny, frankly, and it hasn’t stopped.
We are in a unique position to combine the state-of-the-art online experience with the exceptional customer service our associates are known for.
Finally, there is one commonality our customers have, it’s that they live in a hybrid IT world.

(a) Error analysis for complex semantic search with the query “using technology to solve Social issues, while neglecting its potential negative impact.”
Operation: $\bm{X \cap A \cap F}$ /*find sentence about “invest in environmental development”*/
Error Analysis:
But in terms of percentage growth, most of it’s going to come from gas we would expect.
Although energy storage has significant potential for growth, at this point, we have not assumed any material contributions in our outlook.
We expect the pace of reduction in loan balances to slow up as energy prices have stabilized and the rig count has increased.

(b) Error analysis for complex semantic search with the query “investing in environmental development projects.”
Table 11: Error analysis of complex and intricate semantics search using SetCSE serial operations, through the example of analyzing S&P 500 company ESG stance leveraging earnings call transcripts.

Based on the findings in Table 11, we identify the following issues that may lead to less preferred results:

  • Certain words in the query results closely match sample sentences in terms of represented semantics, leading to their respective rankings. However, the entire sentences show less relevance to those specific semantics. For instance, the word “energy” may align with the sample phrase “renewable energy”, yet the entire sentence might be less associated with “Environmental issues”.

  • Some sentences are chosen due to their high relevance to a single presented semantic, while an intersection of multiple semantics is anticipated. For example, the sentence “We had good bio-security to begin with, but we amped it up to an all-time high level of discipline and scrutiny, frankly, and it hasn’t stopped.” aligns closely with “Social issue” but displays less relevance to “new technology”.

It’s essential to note that the mentioned examples are not ranked among the top sentences in the query outputs, thereby naturally leading to errors. We aim to improve the assessment of the closeness between a sentence and sets of semantics to mitigate some of these aforementioned errors.

D.2 Additional Use Cases and Examples using SetCSE

Data Pre-labeling. Besides Section 6.2, one additional example of leveraging SetCSE to preprocess unlabeled data is included in Table 12, where the Banking77 dataset is used.

Operation: $X \cap I_1$ /*find sentences related to “card payment fee charged”*/
Results:
Why was I charged an extra fee when using the card?
How come I was charged an extra fee when paying with the card?
I paid with my card and got charged an extra fee, what’s up with that
Is it normal to be charged an extra fee when paying with my card?

Operation: $X \cap I_2$ /*find sentences related to “balance not updated after cheque or cash deposit”*/
Results:
How long should a cheque deposit take to show? My account hasn’t updated and I want to make sure everything is okay.
My balance is not right. It has not been updated for the cash or cheque deposit.
I made a cash deposit a few days ago and it’s still not reflected in my account. Do you know what might have happened?
I attempted to deposit a cheque yesterday but the balance isn’t showing today. Is it still pending?

Table 12: Demonstration of Banking77 dataset pre-labeling utilizing SetCSE. Specifically, $I_1$ and $I_2$ denote the sample sets for categories “card payment fee charged” and “balance not updated after cheque or cash deposit”.

Word Sense Disambiguation. Word Sense Disambiguation (WSD) is a fundamental task in natural language processing and computational linguistics. It refers to the process of determining the correct sense or meaning of a word when that word has multiple possible meanings or senses in a particular context (Agirre & Edmonds, 2007; Navigli, 2009; Bevilacqua et al., 2021). Using a single prompt that contains such polysemous words for information retrieval often yields unsatisfactory results, whereas with SetCSE one can represent the exact meaning through multiple phrases or sentences and then conduct information extraction.

D.3 Introduction to ESG

ESG stands for Environmental, Social, and Governance, and it is a framework used to evaluate and measure the sustainability and ethical practices of a company or organization. ESG criteria are used by investors, analysts, and stakeholders to assess how a company manages its impact on the environment, its relationships with society, and the quality of its corporate governance (Ide & Véronis, 1998; Stevenson & Wilks, 2003; McCarthy, 2009; Friede et al., 2015; Reiser & Tucker, 2019; Li et al., 2021; Khan, 2022; Arvidsson & Dumay, 2022). From a corporate standpoint, enhancing the ESG footprint is as advantageous as investing directly in productivity and automation (Abiri et al., 2017; Alavian et al., 2018; 2019; Liu et al., 2019a; Kazekami, 2020; Alavian et al., 2020; Liu, 2021; Eun et al., 2022; Alavian et al., 2022; Chui et al., 2023), as it aids in talent attraction and fosters long-term sustainability (Henisz et al., 2019; Woo & Tan, 2022).

According to Investopedia (2023) and CFA Institute (2023), the term ESG includes, but is not limited to, the following topics.

Environmental. This aspect focuses on a company’s environmental impact and its efforts to address sustainability challenges. Its key topics include: Climate policies, Energy use, Waste, Pollution, Natural resource conservation, Treatment of animals.

Social. The social component of ESG centers on how a company manages its relationships with people and communities. Its key factors include: Customer satisfaction, Data protection and privacy, Gender and diversity, Employee engagement, Community relations, Human rights, Labor standards.

Governance. This aspect focuses on the internal governance and management practices of a company. Its key factors include: Board composition, Audit committee structure, Bribery and corruption, Executive compensation, Lobbying, Political contributions, Whistle-blower schemes.

Appendix E Discussion

E.1 Justification of Leveraging Sets to Represent Semantics

As previously mentioned, we conduct experiments in Section 5 with $n_{\text{sample}}$ ranging from $1$ to $30$, where $n_{\text{sample}} = 1$ corresponds to querying by single sentences. The complete experiment results can be found in Figure 8.

(a) SetCSE intersection performance on AGD dataset.
(b) SetCSE difference performance on AGD dataset.
(c) SetCSE intersection performance on Banking77 dataset.
(d) SetCSE difference performance on Banking77 dataset.
(e) SetCSE intersection performance on FPB dataset.
(f) SetCSE difference performance on FPB dataset.
(g) SetCSE intersection performance on FMTOD dataset.
(h) SetCSE difference performance on FMTOD dataset.
Figure 8: SetCSE operation performances on AGD, Banking77, FPB, and FMTOD datasets for different values of $n_{\text{sample}}$.

E.2 Comparison with Supervised Classification

We also compare the SetCSE intersection performance with supervised classification, which treats the sample sentences as training sets and predicts the class of each queried sentence. To compare the two mechanisms, we also use the same number of training epochs for both.

The detailed results for the same evaluation datasets considered in Section 5 are listed in Table 13. As one can see, the results are on par with those for SetCSE intersection. However, supervised classification cannot be used for querying semantically different sentences or for conducting subsequent querying tasks.

  AG News-T AG News-D FPB Banking77 FMTOD   
Acc F1 Acc F1 Acc F1 Acc F1 Acc F1   
    BERT 70.37 68.43 86.05 85.89 71.04 71.56 93.92 93.74 98.49 98.49   
RoBERTa 75.54 75.56 88.60 88.64 75.34 75.58 83.29 83.29 99.05 99.05   
Contriever 76.94 76.94 82.26 82.26 67.98 68.42 92.35 92.05 97.04 97.04   
SGPT 37.55 37.54 38.21 38.22 54.99 55.86 41.59 41.62 83.77 84.15   
SimCSE-BERT 77.77 77.72 86.14 86.17 82.52 82.53 98.63 98.54 99.31 99.31   
SimCSE-RoBERTa 78.01 78.01 87.82 87.73 85.19 85.16 98.41 98.41 97.63 97.63   
DiffCSE-BERT 74.11 74.05 88.49 88.48 82.61 82.66 98.45 98.45 99.66 99.66   
DiffCSE-RoBERTa 78.37 78.33 88.29 88.27 84.36 84.27 98.53 98.53 99.14 99.14   
MCSE-BERT 72.99 72.38 85.34 85.19 80.22 80.41 97.29 97.16 99.35 99.35   
MCSE-RoBERTa 75.94 76.02 88.14 88.13 86.76 86.78 98.26 98.29 99.16 99.16   
 
Table 13: Evaluation results for supervised classification ($n_{\text{sample}} = 20$).

E.3 Benchmark NLU Task Performances Post Inter-Set Contrastive Learning

As mentioned in Section 5 and Appendix C, inter-set contrastive learning significantly enhances the embedding models’ awareness of presented semantics. However, our interest also lies in evaluating the model’s performance on general Natural Language Understanding (NLU) tasks after this context-specific fine-tuning. To this end, we conduct evaluations on seven standard semantic textual similarity (STS) tasks (Agirre et al., 2012; 2013; 2014; 2015; 2016; Cer et al., 2017).

We first perform inter-set contrastive learning as described in the Section 5.1 experiment using the five considered datasets. Subsequently, we evaluate the model’s performance on STS tasks. The model used is SimCSE-BERT, and the training hyper-parameters are the same as the ones in Section 5.1. The Spearman’s correlation results for the STS tasks are presented in Table 14.

 
STS12 STS13 STS14 STS15 STS16 STS-B
 
SimCSE-BERT 69.03 77.48 79.21 83.22 81.74 83.09
 
SimCSE-BERT (AGT) 66.38 75.29 75.76 78.93 78.64 80.84
SimCSE-BERT (AGD) 64.57 73.18 74.57 75.58 74.01 79.14
SimCSE-BERT (FPB) 57.02 65.37 67.17 65.08 68.52 77.26
SimCSE-BERT (Banking77) 66.22 75.67 77.14 80.88 79.41 80.19
SimCSE-BERT (FMTOD) 66.39 72.84 76.82 80.31 78.95 80.12
 
Ave. Change -7% -7% -6% -8% -7% -4%
 
Table 14: Model performance on STS tasks post inter-set contrastive learning.

Notably, the application of inter-set contrastive learning exhibits no noteworthy adverse effect on the model’s performance in benchmark STS tasks. On average, the utilization of inter-set contrastive learning only minimally diminishes the model’s performance across STS tasks by 7% in Spearman’s correlation.
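As a worked check of the “Ave. Change” row, the following sketch recomputes the average relative change from the values in Table 14. It assumes the row is the mean relative change with respect to the base SimCSE-BERT score, averaged over the five fine-tuning datasets; the exact aggregation and rounding used for the table are not stated, so individual entries may differ by a percentage point. The function name `average_change` is hypothetical.

```python
# Spearman's correlation (x100) copied from Table 14.
TASKS = ["STS12", "STS13", "STS14", "STS15", "STS16", "STS-B"]
BASE = [69.03, 77.48, 79.21, 83.22, 81.74, 83.09]  # SimCSE-BERT, no fine-tuning
FINE_TUNED = {
    "AGT": [66.38, 75.29, 75.76, 78.93, 78.64, 80.84],
    "AGD": [64.57, 73.18, 74.57, 75.58, 74.01, 79.14],
    "FPB": [57.02, 65.37, 67.17, 65.08, 68.52, 77.26],
    "Banking77": [66.22, 75.67, 77.14, 80.88, 79.41, 80.19],
    "FMTOD": [66.39, 72.84, 76.82, 80.31, 78.95, 80.12],
}

def average_change(base, rows):
    """Mean relative change (in %) per task, averaged over the given rows."""
    changes = []
    for col, b in enumerate(base):
        rel = [(row[col] - b) / b * 100.0 for row in rows]
        changes.append(sum(rel) / len(rel))
    return changes

per_task = average_change(BASE, list(FINE_TUNED.values()))
overall = sum(per_task) / len(per_task)  # assumed: unweighted mean over tasks

for task, change in zip(TASKS, per_task):
    print(f"{task}: {change:+.1f}%")
print(f"overall: {overall:+.1f}%")
```

Under these assumptions, every task shows a decrease, and the overall figure rounds to the 7% quoted in the text.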