Efficient Document Ranking with Learnable Late Interactions

Ziwei Ji    Himanshu Jain    Andreas Veit    Sashank J. Reddi    Sadeep Jayasumana    Ankit Singh Rawat    Aditya Krishna Menon    Felix Yu    Sanjiv Kumar

Google
{ziweiji,himj,aveit,sashank,sadeep,ankitsrawat,adityakmenon,felixyu,sanjivk}@google.com
Abstract

Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for predicting query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query-document embeddings; usually, the former has higher quality while the latter has lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality trade-offs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden over DE models. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires $0.25\times$ storage compared to ColBERT.

1 Introduction

[Figure 1 panels: (a) Cross Encoder (CE); (b) Dual Encoder (DE); (c) LITE]
Figure 1: Illustration of different query-document relevance models. (a) CE models compute a joint query-document embedding by passing the concatenated query/document tokens through a single Transformer. (b) In DE models, query and document embeddings are computed separately with their respective Transformers and the relevance score is the dot product of these embeddings. (c) In the proposed LITE method, query and document token embeddings are computed similarly to DE, but instead of a dot product, we first compute the similarity matrix between each pair of query and document tokens, and pass this matrix through an MLP to produce the final relevance score.

Transformers (Vaswani et al., 2017) have emerged as a successful model for information retrieval problems, where the goal is to retrieve and rank relevant documents for a given query (Nogueira and Cho, 2019). Two families of Transformer-based models are popular: cross-encoder (CE) and dual-encoder (DE) models. Given a (query, document) pair, CE models operate akin to a BERT-style encoder (Devlin et al., 2019): the query and document are concatenated, and sent to a Transformer encoder which outputs a relevance score (cf. Figure 1(a)). CE models can learn complex query-document relationships, as they allow for cross-interaction between query and document tokens.

By contrast, DE models apply two separate Transformer encoders to the query and document, respectively, producing separate query and document embedding vectors (Reimers and Gurevych, 2019). The dot product of these two vectors is used as the final relevance score (cf. Figure 1(b)). Compared to CE models, DE models are usually less accurate (Hofstätter et al., 2020), since the only interaction between the query and document occurs in the final dot product. However, DE models have much lower latency, since all the document embedding vectors can be pre-computed offline.

Recently, late-interaction models have provided alternatives with a more favorable latency-quality trade-off compared to CE and DE models. Similarly to DE models, late-interaction models also use a two-Transformer structure, but they store more information and employ additional nonlinear operations to calculate the final score. In particular, let $\mathbf{Q}\in\mathbb{R}^{P\times L_{1}}$ and $\mathbf{D}\in\mathbb{R}^{P\times L_{2}}$ denote the query and document token embeddings output by the two Transformers, i.e., there are $L_{1}$ query token embedding vectors and $L_{2}$ document token embedding vectors of dimension $P$. DE models simply pool $\mathbf{Q}$ and $\mathbf{D}$ into two vectors, and take the dot product. By contrast, ColBERT (Khattab and Zaharia, 2020) calculates the (token-wise) similarity matrix $\mathbf{Q}^{\top}\mathbf{D}$ and computes the final score via a sum-max reduction $\sum_{i}\max_{j}(\mathbf{Q}^{\top}\mathbf{D})_{i,j}$.

While the sum-max score reduction lets ColBERT achieve better accuracy than DE, it is unclear whether this hand-crafted reduction can capture arbitrary complex query-document interactions. Moreover, ColBERT can have higher latency than DE: calculating the similarity matrix $\mathbf{Q}^{\top}\mathbf{D}$ requires $L_{1}\cdot L_{2}$ dot products, while the DE model only requires one dot product. Additionally, to reduce online latency, ColBERT needs to pre-compute and store the Transformer embedding matrix $\mathbf{D}$ for each document (Hofstätter et al., 2020; Santhanam et al., 2022). This can entail significant storage space if we decide to store a large number of document tokens, since there can be billions of documents in industry-scale information retrieval systems (Zhang and Rui, 2013; Overwijk et al., 2022). (See Section 2.3 for a detailed discussion.)

To reduce latency and storage cost, one may seek to store fewer document tokens, and/or reduce the dimension of each token embedding vector. However, it is unclear how these influence performance. In fact, such reduction can significantly hurt the accuracy of ColBERT, as we show in Section 4.4.

Contributions. In this work, we propose lightweight scoring with token einsum (LITE), which addresses the aforementioned shortcomings of existing late-interaction models. LITE applies a lightweight and learnable non-linear transformation on top of Transformer encoders, which corresponds to processing the (token-wise) similarity matrix $\mathbf{S}=\mathbf{Q}^{\top}\mathbf{D}$ via shallow multi-layer perceptron (MLP) layers (cf. Figure 1(c) and Section 3). In particular, we focus on a separable LITE scorer which applies two shared MLPs to the rows and the columns of $\mathbf{S}$ (in that order), and then projects the resulting matrix to a single scalar.

Theoretically, we rigorously establish the expressive power of LITE: we show that LITE is a universal approximator of continuous scoring functions in $\ell_{2}$ distance, even under tight storage constraints (cf. Theorem 3.1). To our knowledge, this is the first formal result about the approximation power of late-interaction methods. Further, we also construct a scoring function that cannot be approximated by a DE model with restricted embedding dimension (cf. Theorem 3.2).

Empirically, we show that LITE can systematically improve upon existing late-interaction methods like ColBERT on both in-domain benchmarks such as MS MARCO and Natural Questions (cf. Table 1), and out-of-domain benchmarks such as BEIR (cf. Table 2). Moreover, LITE can be much more accurate than ColBERT while having lower latency and storage cost (cf. Table 3).

2 Background

Given a query $q\in\mathscr{Q}$, the goal of information retrieval (Mitra and Craswell, 2018) is to identify the set of relevant documents from some corpus $\mathscr{D}$. Typically, $|\mathscr{D}|$ is large (e.g., $\mathcal{O}(10^{9})$), while the number of relevant documents is small (e.g., $\mathcal{O}(10)$). A classical strategy employs a two-phase approach: in the retrieval phase, for moderate $K$ (e.g., $\mathcal{O}(10^{3})$), one retrieves the top-$K$ documents based on a scoring function $s_{\rm ret}\colon\mathscr{Q}\times\mathscr{D}\to\mathbb{R}$. These retrieved documents may include some irrelevant documents. In the re-ranking phase, one applies $s_{\rm rr}\colon\mathscr{Q}\times\mathscr{D}\to\mathbb{R}$ to re-score the $K$ documents, and keeps the top-scoring ones.
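The two-phase approach can be sketched in a few lines. This is only an illustration under stated assumptions: the function `retrieve_then_rerank` and the toy scorers `overlap` and `normalized_overlap` are hypothetical stand-ins for $s_{\rm ret}$ and $s_{\rm rr}$, and sorting the whole corpus replaces the index structures a real retrieval system would use.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(query: str, corpus: List[str],
                         s_ret: Scorer, s_rr: Scorer,
                         k: int, top_n: int) -> List[str]:
    # Phase 1 (retrieval): keep the K documents with the highest cheap score.
    # A real system would use an inverted index or approximate nearest
    # neighbor search here instead of sorting the whole corpus.
    candidates = sorted(corpus, key=lambda d: s_ret(query, d), reverse=True)[:k]
    # Phase 2 (re-ranking): re-score only those K candidates with the
    # more expensive scorer and keep the top-scoring ones.
    return sorted(candidates, key=lambda d: s_rr(query, d), reverse=True)[:top_n]


# Toy scorers, assumptions for illustration only.
def overlap(query: str, doc: str) -> float:
    """Number of distinct query terms that appear in the document."""
    return float(len(set(query.split()) & set(doc.split())))


def normalized_overlap(query: str, doc: str) -> float:
    """Term overlap divided by document length."""
    return overlap(query, doc) / len(doc.split())
```

The expensive scorer is evaluated only $K$ times per query, which is why re-ranking can afford a costlier model than retrieval.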

While $s_{\rm ret}$ and $s_{\rm rr}$ both score query-document relevance, they are often implemented via fundamentally different techniques. Efficiency is more important for $s_{\rm ret}$ since we need to evaluate it over all documents; models such as TF-IDF and BM25 (Robertson and Zaragoza, 2009) and approximate nearest neighbor search (Guo et al., 2016b; Johnson et al., 2019; Guo et al., 2020) are used for this purpose. On the other hand, in the second phase we usually only need to re-score a few ($K\ll|\mathscr{D}|$) documents, and thus we can usually get higher accuracy by using more expensive models for $s_{\rm rr}$. In this work, we focus on re-ranking.

2.1 Cross- and Dual-Encoders

Transformers (Vaswani et al., 2017) have been explored for both retrieval and re-ranking. Given a finite set $\mathscr{X}$, a Transformer is a function $T\colon\mathscr{X}^{L}\to\mathbb{R}^{P\times L}$, where $L$ is the sequence length and $P$ is the embedding size of each token in the sequence. A simplified Transformer network is introduced in Section 3.1 and used in our universal approximation results; for more details, we refer the readers to (Vaswani et al., 2017; Devlin et al., 2019).

To estimate query-document relevance via Transformers, one first tokenizes the query and document (e.g., using a SentencePiece tokeniser (Kudo and Richardson, 2018)) into $q=(q_{1},\ldots,q_{L_{1}})$ and $d=(d_{1},\ldots,d_{L_{2}})$. There are then two basic strategies. In cross-encoder (CE) models (Nogueira and Cho, 2019), we apply a single Transformer to the concatenation of $q$ and $d$, and estimate relevance with learned weights $\mathbf{w}$:

$$s(q,d)=\mathbf{w}^{\top}{\sf pool}(T({\sf concat}(q,d))), \tag{1}$$

where ${\sf pool}$ denotes a pooling strategy by which we reduce a sequence of Transformer token embeddings into a single vector. CE models can often achieve high accuracy since they can take into account interactions between the query and document tokens in every Transformer layer. However, they can also be expensive at inference time: we need to compute (1) for all retrieved documents, each of which involves an expensive Transformer inference (see Section 4.4 for concrete evaluations).

By contrast, in dual-encoder (DE) models (Karpukhin et al., 2020), we apply separate Transformers $T_{1},T_{2}$ to the query and document, and then compute

$$s(q,d)={\sf pool}(T_{1}(q))^{\top}{\sf pool}(T_{2}(d)). \tag{2}$$

In practice, DE is usually less accurate than CE for re-ranking (Hofstätter et al., 2020), since the only interaction between the query and document is the final dot product. Using stronger $T_{1}$ and $T_{2}$ can increase the accuracy of DE (Ni et al., 2021; Ma et al., 2023), but it is also more expensive. On the other hand, since all document embeddings ${\sf pool}(T_{2}(d))$ can be pre-computed offline, DE has much lower latency than CE with the same embedding backbone.
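As a rough sketch of why DE scoring is cheap online, the snippet below splits equation (2) into an offline indexing step and an online scoring step. It assumes mean pooling for ${\sf pool}$ (the text does not fix a pooling strategy), and it takes token embedding matrices of shape $P\times L$ as given rather than running a Transformer; the function names are hypothetical.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Reduce a (P, L) matrix of token embeddings to one (P,) vector.

    Mean pooling is an assumption made for this sketch.
    """
    return token_embeddings.mean(axis=1)


def de_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Equation (2): dot product of pooled query and document embeddings."""
    return float(mean_pool(Q) @ mean_pool(D))


def build_index(doc_token_embeddings: list) -> np.ndarray:
    """Offline step: one pooled vector per document, stacked to shape (N, P)."""
    return np.stack([mean_pool(D) for D in doc_token_embeddings])


def score_against_index(Q: np.ndarray, index: np.ndarray) -> np.ndarray:
    """Online step: a single dot product per indexed document."""
    return index @ mean_pool(Q)
```

Only `score_against_index` runs at query time, so the document-side Transformer cost is paid once, offline.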

Another idea is to apply an MLP to the concatenation of ${\sf pool}(T_{1}(q))$ and ${\sf pool}(T_{2}(d))$ (He et al., 2017). However, Rendle et al. (2020) claim that it may not be better than dot-product DE, partly because it is non-trivial to learn the dot-product operation with an MLP given the concatenated query-document embedding as the input.

2.2 Late-interaction scorers

Recently, there has been interest in late-interaction models. Similarly to DE models, such models also embed queries and documents separately into $T_{1}(q)$ and $T_{2}(d)$; however, they do not use pooling operations, but instead calculate dot products between all pairs of query and document token embeddings, and perform a non-linear score reduction. Formally, let us define query and document Transformer embeddings $\mathbf{Q}=(\mathbf{q}_{1},\ldots,\mathbf{q}_{L_{1}}):=T_{1}(q)\in\mathbb{R}^{P\times L_{1}}$ and $\mathbf{D}=(\mathbf{d}_{1},\ldots,\mathbf{d}_{L_{2}}):=T_{2}(d)\in\mathbb{R}^{P\times L_{2}}$, and let $\mathbf{S}:=\mathbf{Q}^{\top}\mathbf{D}$ denote the similarity matrix. ColBERT (Khattab and Zaharia, 2020) then performs a non-linear sum-max reduction of $\mathbf{S}$:

$$s(q,d)=\sum\nolimits_{i\in[L_{1}]}\max\nolimits_{j\in[L_{2}]}\mathbf{q}_{i}^{\top}\mathbf{d}_{j}.$$

This non-linearity allows ColBERT to achieve better accuracy than DE. See (Luan et al., 2021) for a related model. Another similar approach is CEDR (MacAvaney et al., 2019), which uses multiple query-document similarity matrices (one for each layer) from pre-trained Transformers. For each query token, instead of only using the most aligned document token, Qian et al. (2022) suggest considering the top-$k$ aligned document tokens.
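The sum-max reduction used by ColBERT can be written directly on the similarity matrix. The snippet below is a minimal sketch of the formula only (not ColBERT's actual implementation); the function name `sum_max_score` and the tiny example matrices are made up for illustration.

```python
import numpy as np


def sum_max_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Sum over query tokens of the max similarity over document tokens.

    Q has shape (P, L1) and D has shape (P, L2); columns are token embeddings.
    """
    S = Q.T @ D                        # (L1, L2) token-wise similarity matrix
    return float(S.max(axis=1).sum())  # max over j (document), sum over i (query)


# Toy example with P = 2, L1 = 2, L2 = 3.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
D = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
print(sum_max_score(Q, D))  # 5.0
```

Here $\mathbf{S}=\mathbf{D}$ because $\mathbf{Q}$ is the identity, so the row maxima are 2 and 3 and the score is 5.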

Instead of using similarities between all pairs of query and document token embeddings, COIL (Gao et al., 2021) only considers pairs of query and document tokens that have the same token ID, while CITADEL (Li et al., 2022) further implements a dynamic lexical routing. Li et al. (2023) use sparse token representations that can achieve competitive accuracy compared to ColBERT while being much faster. Mysore et al. (2021) suggest using co-citations as supervision for training.

Late-interaction models have precedent in the classical IR literature. For example, DRMM (Guo et al., 2016a) scores (query, document) relevance using a feedforward network on top of count histogram features. On top of the query-document token similarity matrix based on Word2Vec, MatchPyramid (Pang et al., 2016) applies a convolutional network, while KNRM (Xiong et al., 2017) performs kernel-based pooling. ConvKNRM (Dai et al., 2018) further uses a convolutional network on top of learned token embeddings to produce contextual embeddings. There are also relevant models from the collaborative filtering literature, such as Dziugaite and Roy (2015).

2.3 Limitations of existing late-interaction scorers

Late-interaction scorers such as ColBERT may be used in both the retrieval and re-ranking phases. In this paper, we focus on the latter, which has been considered in several previous works, e.g., (Hofstätter et al., 2020; Santhanam et al., 2022; Ren et al., 2021). While ColBERT can yield a more favourable latency versus quality trade-off compared to DE and CE models, there are two important limitations for its use in re-ranking.

Limited expressivity of hand-crafted reductions. Although prior late-interaction models include more non-linearity compared with DE, they rely on hand-crafted score reductions, such as sum-max in ColBERT. It is unclear if these operations can capture arbitrary complex interactions among query and document tokens that define the true relevance.

Latency and storage overhead. Compared with CE, both DE and late-interaction models reduce latency by relying on pre-computed document (token) embeddings. For DE, this requires storing a single document embedding vector (after proper pooling, cf. (2)), and during online inference, we need to take one dot product. Unfortunately, for late-interaction models, the latency and storage cost can be much higher: suppose we use $L_{1}$ query embedding vectors and $L_{2}$ document embedding vectors to calculate the similarity matrix, then the storage cost is $L_{2}$ times larger than that of DE models (see footnote 1), and we need to take $L_{1}L_{2}$ dot products to obtain the similarity matrix. It is unclear how various ways to reduce the latency and storage cost affect the model performance.

Footnote 1: The document (token) index can be stored on disk, or in RAM. Storing in RAM significantly reduces latency, as we do not need to pay the cost of transferring embeddings from disk. Even if one were to store the index on disk, it is still of interest to reduce the total embedding size to reduce the storage and transfer cost/latency (which would scale linearly with embedding size).
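To make the overhead concrete, the following back-of-the-envelope calculation counts stored values and dot products per (query, document) pair. The corpus size, embedding dimension, and token counts below are hypothetical values chosen only to illustrate the ratios; they are not taken from the experiments.

```python
def index_size_floats(num_docs: int, dim: int, vectors_per_doc: int = 1) -> int:
    """Number of stored floating-point values in the document index."""
    return num_docs * vectors_per_doc * dim


def dot_products_per_pair(num_query_vectors: int = 1, num_doc_vectors: int = 1) -> int:
    """Dot products needed to score one (query, document) pair."""
    return num_query_vectors * num_doc_vectors


# Hypothetical sizes (assumptions for illustration only).
N, P, L1, L2 = 1_000_000, 128, 32, 180

de_floats = index_size_floats(N, P)                      # one vector per document
li_floats = index_size_floats(N, P, vectors_per_doc=L2)  # L2 vectors per document

print(li_floats // de_floats)         # storage ratio, equal to L2
print(dot_products_per_pair(L1, L2))  # L1 * L2 dot products versus 1 for DE
```

Under these assumed sizes, storing fewer document tokens or using a smaller embedding dimension shrinks the index linearly, which is the trade-off studied later in the paper.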

In the next section, we present LITE, a novel late-interaction scorer that addresses both aforementioned shortcomings: (1) LITE can provably approximate a broad class of ground truth scoring functions (cf. Theorem 3.1); and (2) it is more accurate than prior late-interaction methods on both in-domain and zero-shot tasks, and is amenable to latency and storage reduction with graceful degradation in model performance (cf. Section 4).

3 LITE scorers

We now introduce LITE scorers. Let $\mathbf{S}:=\mathbf{Q}^{\top}\mathbf{D}\in\mathbb{R}^{L_{1}\times L_{2}}$ denote the similarity matrix which consists of the dot products of all query-document Transformer token embedding pairs. LITE models apply MLPs to reduce $\mathbf{S}$ to a scalar score. A natural option is to flatten $\mathbf{S}$ and then apply an MLP; we call this flattened LITE. On the other hand, in this paper we focus on another MLP model which we call separable LITE, motivated by separable convolution (Chollet, 2017) and MLP-Mixer (Tolstikhin et al., 2021): we first apply row-wise updates to $\mathbf{S}$, then column-wise updates, and then a linear projection to get a scalar score. Formally, we first calculate $\mathbf{S}^{\prime},\mathbf{S}^{\prime\prime}\in\mathbb{R}^{L_{1}\times L_{2}}$ as follows: for all $1\leq i\leq L_{1}$ and $1\leq j\leq L_{2}$, let

$$\mathbf{S}^{\prime}_{i,:}={\sf LN}(\sigma(\mathbf{W}_{2}{\sf LN}(\sigma(\mathbf{W}_{1}\mathbf{S}_{i,:}+\mathbf{b}_{1}))+\mathbf{b}_{2})), \tag{3}$$
$$\mathbf{S}^{\prime\prime}_{:,j}={\sf LN}(\sigma(\mathbf{W}_{4}{\sf LN}(\sigma(\mathbf{W}_{3}\mathbf{S}^{\prime}_{:,j}+\mathbf{b}_{3}))+\mathbf{b}_{4})), \tag{4}$$

where ${\sf LN}$, $\sigma$ respectively denote layer-norm and ReLU. The final score is given by $\mathbf{w}^{\top}{\sf vec}(\mathbf{S}^{\prime\prime})$.
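A forward pass of the separable LITE scorer in (3) and (4) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the hidden width `hidden`, the random initialization in `init_params`, and the use of layer-norm without learnable scale and shift are all assumptions made here.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each row of x (no learnable scale/shift: an assumption)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def shared_mlp(x, W_a, b_a, W_b, b_b):
    """Apply LN(relu(W_b LN(relu(W_a v + b_a)) + b_b)) to every row v of x."""
    h = layer_norm(np.maximum(x @ W_a.T + b_a, 0.0))
    return layer_norm(np.maximum(h @ W_b.T + b_b, 0.0))


def separable_lite_score(S: np.ndarray, p: dict) -> float:
    """S has shape (L1, L2); returns the scalar w^T vec(S'')."""
    S1 = shared_mlp(S, p["W1"], p["b1"], p["W2"], p["b2"])        # eq. (3), rows
    S2 = shared_mlp(S1.T, p["W3"], p["b3"], p["W4"], p["b4"]).T   # eq. (4), columns
    # The flattening order is immaterial as long as w is learned consistently.
    return float(p["w"] @ S2.reshape(-1))


def init_params(L1: int, L2: int, hidden: int, rng: np.random.Generator) -> dict:
    """Random parameters with the shapes implied by (3) and (4)."""
    def dense(n_out, n_in):
        return rng.normal(scale=n_in ** -0.5, size=(n_out, n_in)), np.zeros(n_out)
    W1, b1 = dense(hidden, L2)
    W2, b2 = dense(L2, hidden)
    W3, b3 = dense(hidden, L1)
    W4, b4 = dense(L1, hidden)
    w = rng.normal(scale=(L1 * L2) ** -0.5, size=L1 * L2)
    return dict(W1=W1, b1=b1, W2=W2, b2=b2, W3=W3, b3=b3, W4=W4, b4=b4, w=w)
```

Because the same row MLP is shared across all query tokens and the same column MLP across all document tokens, the parameter count depends on $L_{1}$, $L_{2}$, and the hidden width, not on the embedding dimension $P$.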

Given the above definitions, it is natural to consider the expressivity of LITE. In particular, there are two fundamental questions: (1) Can we always approximate (continuous) scoring functions using LITE, even though LITE only has the similarity matrix as inputs and the original Transformer embeddings are lost? (2) Are LITE models more expressive than simpler models such as DE?

We answer these questions in the following: we show that LITE models are universal approximators of continuous scoring functions (cf. Theorem 3.1), while there exists a scoring function which cannot be approximated by a simple dot-product DE (cf. Theorem 3.2).

3.1 Universal approximation with LITE

We consider the Transformer architecture described by Yun et al. (2020): it includes multiple encoding layers, each of which can be parameterized as ${\sf A}(\mathbf{X})+{\sf FF}({\sf A}(\mathbf{X}))$, where $\mathbf{X}\in\mathbb{R}^{P\times L}$ denotes the input, ${\sf FF}$ denotes a feedforward network, and ${\sf A}(\mathbf{X})$ denotes an attention block:

$$\mathbf{X}+\sum_{i=1}^{H}\mathbf{W}^{i}_{\rm o}\mathbf{W}^{i}_{\rm v}\mathbf{X}\,{\sf Softmax}((\mathbf{W}^{i}_{\rm k}\mathbf{X})^{\top}(\mathbf{W}^{i}_{\rm q}\mathbf{X})).$$

Here $\mathbf{W}^{i}_{\rm q},\mathbf{W}^{i}_{\rm k},\mathbf{W}^{i}_{\rm v}\in\mathbb{R}^{C\times P}$ are the query, key, and value projection matrices, $\mathbf{W}^{i}_{\rm o}\in\mathbb{R}^{P\times C}$ are the output projection matrices, and $H$ and $C$ denote the number of heads and the dimension of each head, respectively. The ${\sf Softmax}$ function is applied to each input column.

A Transformer network defined in the above way is permutation-equivariant (Yun et al., 2020, Claim 1): if we permute the input token sequence, then the output token sequence is permuted in the same way. If we want the network to distinguish between different orders of tokens, we can add a positional encoding matrix $\mathbf{E}\in\mathbb{R}^{P\times L}$ to the input $\mathbf{X}$, and apply a Transformer network to $\mathbf{X}+\mathbf{E}$.
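Permutation equivariance of the attention block can be checked numerically. The sketch below implements only the attention block ${\sf A}(\mathbf{X})$ with a column-wise softmax, using random weights and small, arbitrarily chosen sizes; it is a sanity check under these assumptions, not a proof.

```python
import numpy as np


def softmax_columns(M: np.ndarray) -> np.ndarray:
    """Softmax applied independently to each column of M."""
    E = np.exp(M - M.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)


def attention_block(X: np.ndarray, heads: list) -> np.ndarray:
    """A(X) = X + sum_i W_o W_v X Softmax((W_k X)^T (W_q X)), X of shape (P, L)."""
    out = X.copy()
    for W_q, W_k, W_v, W_o in heads:
        scores = (W_k @ X).T @ (W_q @ X)  # (L, L)
        out = out + W_o @ W_v @ X @ softmax_columns(scores)
    return out


# Arbitrary small sizes and random weights (assumptions for illustration).
rng = np.random.default_rng(0)
P, L, C, H = 4, 5, 3, 2
heads = [(rng.normal(size=(C, P)), rng.normal(size=(C, P)),
          rng.normal(size=(C, P)), rng.normal(size=(P, C))) for _ in range(H)]
X = rng.normal(size=(P, L))
perm = np.array([1, 0, 2, 4, 3])  # a fixed non-identity token permutation

# Permuting input columns permutes output columns in the same way.
equivariant = np.allclose(attention_block(X[:, perm], heads),
                          attention_block(X, heads)[:, perm])

# With a positional encoding E added, token order now affects the output.
E = rng.normal(size=(P, L))
order_sensitive = not np.allclose(attention_block(X[:, perm] + E, heads),
                                  attention_block(X + E, heads)[:, perm])
```

The feedforward part ${\sf FF}$ acts on each token independently, so it preserves the same equivariance; only the attention block is exercised here.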

As discussed in previous sections, in the late-interaction setting, we may need to store the whole Transformer output with shape $P\times L$, which can be expensive. One solution is to apply a pooling function to reduce the number of tokens; we empirically study this method in Section 4.4, and in Theorem 3.1, we apply pooling functions to map the Transformer output in $\mathbb{R}^{P\times L}$ to $\mathbb{R}^{P\times 2}$, i.e., a sequence of two token embeddings. We show that two query tokens and two document tokens are enough for universal approximation.

Next, we define the scorers. Let $\mathcal{F}_{\sigma,n}$ denote the set of 2-layer ReLU networks with $n$-dimensional inputs and a scalar output:

$$\mathcal{F}_{\sigma,n}:=\left\{\mathbf{z}\to\mathbf{a}^{\top}\sigma(\mathbf{W}\mathbf{z}+\mathbf{b})\right\},$$

where $\sigma$ denotes the ReLU activation, $\mathbf{z}\in\mathbb{R}^{n}$, $\mathbf{W}\in\mathbb{R}^{m\times n}$, $\mathbf{a},\mathbf{b}\in\mathbb{R}^{m}$, and we allow $m$ to be arbitrarily large. We first consider a class of flattened LITE scorers, including all two-layer ReLU networks on top of $\mathbf{S}$ that output a scalar score:

$$\mathcal{F}_{\rm f}:=\left\{\mathbf{S}\to f({\sf vec}(\mathbf{S}))\,\middle|\,f\in\mathcal{F}_{\sigma,L_{1}\cdot L_{2}}\right\}.$$

For separable LITE, we consider a simplified version of (3) and (4), but without loss of generality, as described below: we first use a 2-layer ReLU network f1:L2:subscript𝑓1superscriptsubscript𝐿2f_{1}:\mathbb{R}^{L_{2}}\to\mathbb{R}italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT : blackboard_R start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT → blackboard_R to reduce every row of 𝐒𝐒\mathbf{S}bold_S to a single scalar, and thus transform 𝐒𝐒\mathbf{S}bold_S into a column vector; and then we apply another 2-layer ReLU network f2subscript𝑓2f_{2}italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT to reduce this column vector into a scalar. Formally,

s:={𝐒f2(f1(𝐒))|f1σ,L2,f2σ,L1},assignsubscriptsconditional-set𝐒subscript𝑓2subscript𝑓1𝐒formulae-sequencesubscript𝑓1subscript𝜎subscript𝐿2subscript𝑓2subscript𝜎subscript𝐿1\displaystyle\mathcal{F}_{\rm s}:=\left\{\mathbf{S}\to f_{2}(f_{1}(\mathbf{S})% )\middle|f_{1}\in\mathcal{F}_{\sigma,L_{2}},f_{2}\in\mathcal{F}_{\sigma,L_{1}}% \right\},caligraphic_F start_POSTSUBSCRIPT roman_s end_POSTSUBSCRIPT := { bold_S → italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_S ) ) | italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ caligraphic_F start_POSTSUBSCRIPT italic_σ , italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ caligraphic_F start_POSTSUBSCRIPT italic_σ , italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT } ,

where we let f1(𝐒)L1subscript𝑓1𝐒superscriptsubscript𝐿1f_{1}(\mathbf{S})\in\mathbb{R}^{L_{1}}italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_S ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT denote the result of applying f1subscript𝑓1f_{1}italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT to every row of 𝐒𝐒\mathbf{S}bold_S. Note that ssubscripts\mathcal{F}_{\rm s}caligraphic_F start_POSTSUBSCRIPT roman_s end_POSTSUBSCRIPT is a subset of the function class defined by (3) and (4) (ignoring layer normalization).
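
To make the two scorer classes concrete, the following is a minimal pure-Python sketch. The function names (`two_layer_relu`, `flattened_lite`, `separable_lite`) and the toy weights are illustrative assumptions rather than the implementation used in the experiments; layer normalization and the exact form of (3) and (4) are omitted, and ${\sf vec}$ is taken in row-major order, which only fixes a permutation of the inputs.

```python
def relu(x):
    return max(0.0, x)

def two_layer_relu(z, W, b, a):
    """Member of F_{sigma,n}: z -> a^T relu(W z + b), with W given as m rows of length n."""
    hidden = [relu(sum(w * z_j for w, z_j in zip(row, z)) + b_i)
              for row, b_i in zip(W, b)]
    return sum(a_i * h_i for a_i, h_i in zip(a, hidden))

def flattened_lite(S, params):
    """F_f: one network applied to vec(S) (row-major flattening here)."""
    z = [s for row in S for s in row]
    return two_layer_relu(z, *params)

def separable_lite(S, params1, params2):
    """F_s: f1 reduces every row of S (length L2) to a scalar,
    then f2 reduces the resulting length-L1 vector to a scalar."""
    column = [two_layer_relu(row, *params1) for row in S]
    return two_layer_relu(column, *params2)

# Toy similarity matrix with L1 = 2 query tokens and L2 = 2 document tokens.
S = [[1.0, 2.0],
     [3.0, -4.0]]
# Hypothetical weights (W, b, a), each with a single hidden unit (m = 1).
f1 = ([[1.0, 1.0]], [0.0], [1.0])
f2 = ([[1.0, -1.0]], [0.5], [2.0])
f_flat = ([[1.0, 1.0, 1.0, 1.0]], [0.0], [1.0])

separable_score = separable_lite(S, f1, f2)
flattened_score = flattened_lite(S, f_flat)
```

The separable form shares one row network $f_1$ across all query tokens, so its parameter count grows with $L_1 + L_2$ inputs rather than the $L_1\cdot L_2$ inputs of the flattened form.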

Here is our universal approximation result.

Theorem 3.1 (Universal approximation with LITE).

Let $s:\mathbb{R}^{(P\times L_1)\times(P\times L_2)}\to\mathbb{R}$ denote a continuous scoring function with a compact support $\Omega$ and $L_1,L_2\geq 2$. For any $\mathcal{F}\in\{\mathcal{F}_{\rm f},\mathcal{F}_{\rm s}\}$ and any $\epsilon>0$, there exist a scorer $f\in\mathcal{F}$, and $T_1:\mathbb{R}^{P\times L_1}\to\mathbb{R}^{P\times 2}$ and $T_2:\mathbb{R}^{P\times L_2}\to\mathbb{R}^{P\times 2}$, both of which consist of positional encodings, a Transformer and a pooling function, such that

$$\int_{\Omega}\left(f\left(T_1(\mathbf{X})^{\top}T_2(\mathbf{Y})\right)-s(\mathbf{X},\mathbf{Y})\right)^{2}{\rm d}(\mathbf{X},\mathbf{Y})\leq\epsilon.$$

The proof is given in Appendix B, and is based on the “contextual mapping” techniques from (Yun et al., 2020). This result is non-trivial, since the input to LITE scorers is the similarity matrix based on only two query tokens and two document tokens; this means LITE models are universal approximators even under strong constraints on the total embedding size. In contrast, as we show in Theorem 3.2, if the total embedding size is less than $P\cdot L$, then a dot-product DE can have a large approximation error.
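
As a rough illustration of the embedding sizes being compared, the block below works through the arithmetic. The concrete values $P=768$ and $L=200$ are borrowed from the experimental setup in Section 4.1 and are an assumption made only for this example; they are not part of either theorem.

```latex
% Pooled LITE (Theorem 3.1): each input keeps two P-dimensional token embeddings,
% so the scorer sees only a 2 x 2 similarity matrix.
T_1(\mathbf{X})^{\top} T_2(\mathbf{Y}) \in \mathbb{R}^{2\times 2},
\qquad
\text{embedding size per input} = 2P .

% The dot-product DE regime with constant error (Theorem 3.2) is O \le P \cdot L - 1, and
2P \;\le\; P\cdot L - 1 \iff P\,(L-2) \ge 1 \iff L \ge 3 \quad (P \ge 1).

% Example with P = 768, L = 200:
2P = 1536, \qquad P\cdot L - 1 = 153599 .
```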

3.2 Non-universality of existing scorers

In addition to Theorem 3.1, even without positional encodings, in Theorem B.1 we show that LITE scorers are still universal approximators of arbitrary continuous scoring functions if we do not apply pooling. By contrast, without positional encodings, ColBERT can only represent permutation-equivariant ground-truth scoring functions, because the summation and maximum operations do not consider the order of input tokens. It is an open question if ColBERT is a universal approximator with positional encodings.
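
The order-insensitivity of the sum-max scorer can be checked directly on a small similarity matrix. This is a minimal sketch: `colbert_score` is a hypothetical helper that takes a precomputed similarity matrix (rows are query tokens, columns are document tokens), and the numbers are made up.

```python
def colbert_score(S):
    """Sum over query tokens of the maximum similarity over document tokens."""
    return sum(max(row) for row in S)

S = [[0.1, 0.9, 0.3],
     [0.7, 0.2, 0.4]]

# Reorder the document tokens (columns) and the query tokens (rows).
S_permuted = [[row[j] for j in (2, 0, 1)] for row in reversed(S)]

original = colbert_score(S)
permuted = colbert_score(S_permuted)
```

Both calls return the same value, since neither the sum nor the maximum depends on token order. A learned MLP scorer applied to the same two matrices has no such built-in symmetry.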

If we ask whether a dot-product DE can approximate arbitrary continuous functions, then we give a negative result.

Theorem 3.2 (Limitation of DE with restricted embedding dimension).

Suppose each query and document both have $L\geq 2$ tokens. There exists a continuous ground-truth scoring function $s$ supported on $\Omega:=[0,1]^{P\times L}\times[0,1]^{P\times L}$, such that if $O\leq P\cdot L-1$, then for any mappings $h_1,h_2:\mathbb{R}^{P\times L}\to\mathbb{R}^{O}$ that map queries and documents to $O$-dimensional vectors respectively,

$$\int_{\Omega}\left(h_1(\mathbf{X})^{\top}h_2(\mathbf{Y})-s(\mathbf{X},\mathbf{Y})\right)^{2}{\rm d}(\mathbf{X},\mathbf{Y})\geq\frac{1}{20}.$$

Previously Menon et al. (2022) showed that if there is no constraint on the embedding dimension, then dot-product DE is a universal approximator of continuous functions. By contrast, here we show if the DE embedding dimension is less than $P\cdot L$, there could be a constant approximation error.

4 Experiments

We now evaluate the proposed LITE scorer on a few standard information retrieval benchmarks, where we confirm that LITE significantly improves accuracy over existing DE and late-interaction methods on both in-domain and out-of-domain tasks. Moreover, we show that LITE remains competitive as we reduce the latency and storage cost, and in particular, LITE can achieve higher accuracy than ColBERT with less latency and $0.25\times$ storage cost.

4.1 Experimental setup

Datasets.

We evaluate scorers on both in-domain re-ranking on the MS MARCO (Nguyen et al., 2016) and Natural Questions (NQ; Kwiatkowski et al., 2019) datasets, and zero-shot re-ranking on the BEIR (Thakur et al., 2021) benchmark.

Training.

For training on MS MARCO, we use the official training set of triplets $(q,d_{+},d_{-})$, where document $d_{+}$ is relevant to query $q$ while $d_{-}$ is irrelevant. State-of-the-art methods on MS MARCO also use hard-negative mining (Qu et al., 2021; Santhanam et al., 2022); however, in this paper our focus is on comparing different late-interaction scorers, and thus we simply use the original triplet training data.

We use labels from a CE teacher model during training, as it has been observed that distillation can significantly improve performance (Santhanam et al., 2022; Menon et al., 2022). For MS MARCO, we use the scores from the T2 teacher released by Hofstätter et al. (2020). For the NQ dataset, we use a teacher model trained with 19 hard-negatives mined with BM25, following (Menon et al., 2022). For loss functions, we try the KL loss and the margin MSE loss (see Section A.2 for definitions of loss functions and more details of training).
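
For orientation, here is a minimal sketch of the two distillation losses in their commonly used forms. The definitions actually used in the experiments are those of Section A.2, which is not reproduced here, so the formulas below (KL divergence between softmax-normalized teacher and student scores, and the margin MSE of Hofstätter et al. (2020)) are assumptions, as are the function names.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over the candidate documents of one query."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def margin_mse_loss(teacher_pos, teacher_neg, student_pos, student_neg):
    """Squared error between the student and teacher score margins
    on one triplet (q, d+, d-)."""
    return ((student_pos - student_neg) - (teacher_pos - teacher_neg)) ** 2
```

The margin MSE loss only constrains score differences, so a student whose scores are shifted by a constant relative to the teacher incurs no penalty.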

Evaluation.

For MS MARCO, we use the standard Dev set and the TREC DL 19 and 20 test sets (Craswell et al., 2020, 2021). For NQ, we utilize the version of this dataset used in (Karpukhin et al., 2020), which consists of questions, positive passages containing the correct answer, and a collection of Wikipedia passages. Re-ranking metrics are reported on the Dev query set with 200 passages containing positives, 100 BM25 hard-negatives and up to 100 random negatives, following (Menon et al., 2022). We report MRR@10 (Radev et al., 2002) and nDCG@10 (Järvelin and Kekäläinen, 2002) scores.
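
The two reported metrics can be sketched per query as follows. This is a minimal illustration under stated assumptions: relevance labels are given in ranked order, nDCG uses linear gain, and the ideal ranking is computed from the same candidate list; official evaluation tools may differ in these details. Scores are then averaged over queries.

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevance, k=10):
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))

def ndcg_at_k(ranked_relevance, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    return dcg_at_k(ranked_relevance, k) / ideal if ideal > 0 else 0.0
```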

For BEIR, following (Thakur et al., 2021), we take the scorers trained on MS MARCO and evaluate zero-shot transfer performance. Specifically, we report evaluation results on the 14 public datasets. Thakur et al. (2021) evaluate the CE model by first retrieving 100 documents using BM25, and then calculating the nDCG@10 score for CE re-ranking; we use the same procedure.
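
The retrieve-then-re-rank procedure described above can be sketched as below. The first-stage BM25 retrieval is assumed to have already produced the candidate list; `rerank` and the toy `overlap_score` are hypothetical stand-ins, where `overlap_score` takes the place of a trained scorer such as separable LITE.

```python
def rerank(query, candidates, score_fn, k=10):
    """Re-order first-stage candidates by the scorer and keep the top k."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:k]

def overlap_score(query, doc):
    """Toy scorer (word overlap), used only to make the example runnable."""
    return len(set(query.split()) & set(doc.split()))

# In the BEIR setup the candidate list would hold the 100 documents returned by BM25.
top = rerank("late interaction", ["dense retrieval", "late interaction models", "late fusion"],
             overlap_score, k=2)
```

The nDCG@10 score is then computed on the re-ranked list.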

Models.

For the Transformer encoder, we start from a pretrained BERT model (Turc et al., 2019) which has 6 layers and 768 token dimension. For DE and late-interaction models, we let the query encoder and document encoder share weights. We use a query sequence length of 30 and a document sequence length of 200 with the Transformer. If we use all 200 document tokens to calculate the similarity matrix $\mathbf{S}$, then $\mathbf{S}\in\mathbb{R}^{30\times 200}$. In some experiments the document sequence length is reduced in the end to save latency and storage cost; we will specify the details later. More hyperparameter details are given in Appendix A.1.
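
As a small illustration of how the shape of the similarity matrix follows from the sequence lengths, here is a pure-Python sketch. The helper name and the toy sizes are assumptions; in the setup above the same computation would give 30 rows and 200 columns from 768-dimensional token embeddings.

```python
def similarity_matrix(query_tokens, doc_tokens):
    """S[i][j] is the dot product of query token i and document token j."""
    return [[sum(q * d for q, d in zip(q_tok, d_tok)) for d_tok in doc_tokens]
            for q_tok in query_tokens]

# Toy sizes: 3 query tokens and 4 document tokens of dimension 2.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_tokens = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.5]]

S = similarity_matrix(query_tokens, doc_tokens)
shape = (len(S), len(S[0]))
```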

4.2 In-domain re-ranking on MS MARCO and NQ

In Table 1, we report MRR@10 and nDCG@10 scores for different scorers on all datasets. When calculating the similarity matrix for ColBERT and LITE, we use the original sequence length (200) and token embedding dimension (768) of the Transformer encoder. We try both the KL loss and margin MSE loss and report the better results; more details can be found in Section A.3.

Table 1: MRR@10 and nDCG@10 scores. Separable LITE achieves the best in-domain results across all benchmarks.

| Scorer | MS MARCO MRR | MS MARCO nDCG | DL 2019 MRR | DL 2019 nDCG | DL 2020 MRR | DL 2020 nDCG | NQ MRR | NQ nDCG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DE | 0.355 | 0.413 | 0.861 | 0.744 | 0.842 | 0.723 | 0.699 | 0.611 |
| ColBERT | 0.383 | 0.442 | 0.878 | 0.753 | 0.860 | 0.731 | 0.756 | 0.689 |
| Sep LITE | 0.393 | 0.452 | 0.898 | 0.765 | 0.873 | 0.756 | 0.769 | 0.693 |

On MS MARCO, the T2 teacher (Hofstätter et al., 2020) has Dev MRR@10 of 0.399. A DE student can only achieve MRR@10 of 0.355. Both ColBERT and separable LITE can significantly reduce this gap, but separable LITE is much better than ColBERT (0.393 vs. 0.383). We also train a 6-layer, 768-dimensional CE student using distillation from the T2 teacher; it has MRR@10 of 0.395, which is only slightly better than separable LITE. Moreover, on TREC DL 19 and 20 datasets, separable LITE also achieves better MRR@10 and nDCG@10 scores than ColBERT.

These observations generalize to the NQ dataset as well: we find that late-interaction models are much better than DE, and separable LITE is much better than ColBERT.

We also try a few ablations, including using top-$k$ aligned document tokens instead of top-$1$ in ColBERT, and freezing the backbone and only fine-tuning the scorers. Separable LITE achieves better accuracy than ColBERT in all cases. See Section A.4 for details.

4.3 Zero-shot re-ranking on BEIR

Table 2 presents zero-shot transfer results with ColBERT and separable LITE (from Table 1) on 14 public datasets from BEIR (Thakur et al., 2021). We also include results for the 6-layer CE model mentioned above, which is trained in the same way as the late-interaction models. We can see that separable LITE achieves better zero-shot transfer than ColBERT on 11 out of 14 datasets. CE still gives better zero-shot transfer than separable LITE on 12 out of 14 datasets (all except Touché-2020 and Quora), but as we show below, CE has much higher latency (cf. Table 3).

Table 2: BEIR nDCG@10. Separable LITE is better than ColBERT on 11 out of 14 datasets.

| Dataset | ColBERT | Sep LITE | CE |
| --- | --- | --- | --- |
| T-COVID | 0.761 | 0.763 | 0.771 |
| NFCorpus | 0.356 | 0.358 | 0.361 |
| NQ | 0.525 | 0.540 | 0.552 |
| HotpotQA | 0.685 | 0.681 | 0.728 |
| FiQA-2018 | 0.330 | 0.336 | 0.346 |
| ArguAna | 0.433 | 0.424 | 0.519 |
| Touché-2020 | 0.274 | 0.305 | 0.300 |
| CQAD | 0.363 | 0.374 | 0.378 |
| Quora | 0.767 | 0.839 | 0.832 |
| DBPedia | 0.410 | 0.434 | 0.438 |
| SCIDOCS | 0.155 | 0.164 | 0.167 |
| FEVER | 0.782 | 0.788 | 0.804 |
| C-FEVER | 0.190 | 0.213 | 0.232 |
| SciFact | 0.667 | 0.633 | 0.695 |

4.4 Results on MS MARCO with reduced latency and storage

As discussed previously, late-interaction methods may have higher latency and storage cost than DE. Suppose the Transformer encoders use $L_1$ query tokens and $L_2$ document tokens of dimension $P$. Then DE only needs to take one dot product, while calculating the similarity matrix for late-interaction methods requires $L_1 L_2$ dot products. Moreover, to save online latency, we need to pre-compute and store one $P$-dimensional document embedding vector for DE, while for late-interaction methods we might need to store a $P\times L_2$ embedding matrix. This increase in storage cost is significant in industry-scale information retrieval systems, since there can be billions of documents (Zhang and Rui, 2013; Overwijk et al., 2022).

One solution is to reduce $P$ and $L_2$ to some smaller $P'$ and $L_2'$ (by projection, pooling, etc.), and then store a $P'\times L_2'$ embedding matrix for each document. Correspondingly, for each query we use $L_1$ embedding vectors of dimension $P'$, and to calculate the similarity matrix, we need $L_1 L_2'$ dot products between $P'$-dimensional vectors. This can reduce both latency and storage; below we analyze how performance drops with such reduction, and show that separable LITE remains competitive compared to ColBERT.
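
The cost accounting above can be made concrete with a few lines of arithmetic. This is a sketch with a hypothetical helper name; the sizes 30, 200 and 768 come from Section 4.1, and the reduced setting of 50 document tokens matches the small separable LITE configuration discussed later in this section.

```python
def late_interaction_cost(num_query_tokens, num_doc_tokens, token_dim):
    """Dot products needed to build the similarity matrix for one
    (query, document) pair, and values stored per pre-computed document."""
    return {
        "dot_products": num_query_tokens * num_doc_tokens,
        "stored_values_per_doc": token_dim * num_doc_tokens,
    }

full = late_interaction_cost(30, 200, 768)     # all 200 document tokens
reduced = late_interaction_cost(30, 50, 768)   # 50 document tokens
dual_encoder = late_interaction_cost(1, 1, 768)  # one vector per side

storage_ratio = reduced["stored_values_per_doc"] / full["stored_values_per_doc"]
```

With these sizes the full setting needs 6000 dot products and stores 153600 values per document, while the reduced setting needs 1500 dot products and a quarter of the storage.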

Reducing the number of output document tokens.

Figure 2: MS MARCO MRR with fewer document tokens.

Here, we keep the token dimension at 768 and reduce the number of output document tokens. The Transformer encoder outputs an embedding matrix $\mathbf{D}\in\mathbb{R}^{768\times 200}$ of 200 token embeddings, and we try to reduce the number of tokens either by directly taking average of adjacent columns (average pooling), or by applying a trainable linear projection to every row of $\mathbf{D}$. We try both methods and find that separable LITE prefers learnable projection while ColBERT prefers average pooling. The results are shown in Figure 2, and we can see separable LITE is more accurate than ColBERT with reduced document sequence lengths.
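
The two token-reduction methods can be sketched on a toy matrix as follows. The helper names are hypothetical, the matrix is stored as a list of rows (one row per embedding dimension, one column per token), and the projection matrix is fixed by hand here, whereas in the experiments it would be trained.

```python
def average_pool_tokens(D, group):
    """Average every `group` adjacent columns of D.
    Assumes the number of columns is divisible by `group`."""
    return [[sum(row[j:j + group]) / group for j in range(0, len(row), group)]
            for row in D]

def project_tokens(D, A):
    """Apply a linear map to every row of D: D' = D A, with A of shape (L2, L2')."""
    return [[sum(row[j] * A[j][k] for j in range(len(row)))
             for k in range(len(A[0]))]
            for row in D]

# Toy D with P = 2 dimensions and L2 = 4 tokens.
D = [[1.0, 3.0, 5.0, 7.0],
     [2.0, 4.0, 6.0, 8.0]]

pooled = average_pool_tokens(D, group=2)

# Average pooling is the special case of the projection with this fixed A.
A = [[0.5, 0.0],
     [0.5, 0.0],
     [0.0, 0.5],
     [0.0, 0.5]]
projected = project_tokens(D, A)
```

Seen this way, the learnable projection strictly generalizes average pooling, at the cost of $L_2 \cdot L_2'$ extra parameters.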

Reducing token dimension.

Figure 3: MS MARCO MRR with reduced token dimension.

Next, we fix the number of document tokens at 200, and reduce the dimension of each output token via learnable linear projections. The results are given in Figure 3. With different token dimension, separable LITE is always more accurate than ColBERT.

Achieving lower latency/storage than ColBERT using LITE.

If the size of pre-computed document embedding matrix is fixed, then LITE has higher latency than ColBERT since its MLP scorer is slower than sum-max. However, since LITE is robust to embedding size reduction, it can remain more accurate than ColBERT while being more time and space efficient by using fewer document tokens. The result is shown in Table 3, together with latency of other scorers studied before.

In Table 3, we evaluate the latency of scoring relevance between 1 query and 100 documents. For CE, we use the 6-layer distilled student and evaluate the total time to calculate the joint embeddings between the query and every document. For DE, ColBERT and separable LITE, we use models from Table 1; we pre-compute the document embeddings, and evaluate the query embedding generation and scoring time. For the “small separable LITE” model, we only store 50 tokens for each document, and we also use a small MLP (we let $\mathbf{W}_1$ in (3) have shape $(768, 50)$). In Table 3, small separable LITE only uses $0.25\times$ storage space compared with ColBERT, which stores 200 document token embeddings, and it also achieves lower latency while still being much more accurate than ColBERT (0.391 vs. 0.383). In Table 9, we show that small separable LITE is better than ColBERT on 8 out of 14 datasets. We can also see that the CE latency is roughly $100\times$ the separable LITE latency (10990 ms vs. 111 ms), since CE cannot use offline pre-computation.

Table 3: Latency of different scorers.

| Scorer | Latency (ms) | Storage | MS MARCO MRR@10 |
| --- | --- | --- | --- |
| CE (student) | 10990 | 0× | 0.395 |
| DE | 42 | 1× | 0.355 |
| ColBERT | 62 | 200× | 0.383 |
| Separable LITE | 111 | 200× | 0.393 |
| Small sep LITE | 56 | 50× | 0.391 |
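
The latency ratios implied by Table 3 can be recomputed directly from the reported numbers; the dictionary and helper below are only a convenience for this arithmetic.

```python
# Latency in ms for scoring 1 query against 100 documents, copied from Table 3.
latency_ms = {
    "CE (student)": 10990,
    "DE": 42,
    "ColBERT": 62,
    "Separable LITE": 111,
    "Small sep LITE": 56,
}

def latency_ratio(scorer, baseline):
    return latency_ms[scorer] / latency_ms[baseline]

ce_vs_lite = latency_ratio("CE (student)", "Separable LITE")
small_lite_vs_colbert = latency_ratio("Small sep LITE", "ColBERT")
```

Here `ce_vs_lite` is about 99 and `small_lite_vs_colbert` is about 0.9, consistent with the statements that CE is roughly two orders of magnitude slower and that small separable LITE is faster than ColBERT.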

4.5 Comparison with KNRM

KNRM (Xiong et al., 2017) is one popular pre-Transformer scorer; it calculates the similarity matrix using Word2Vec embeddings, and then applies kernel pooling. It has been applied to MS MARCO in a few recent works (Khattab and Zaharia, 2020; Hofstätter et al., 2020); however, KNRM only achieves low accuracy, likely because the underlying encoders are non-pretrained shallow Transformers. In this work, we try to apply KNRM with the same pretrained BERT encoder as other scorers. We find that KNRM can achieve similar accuracy to ColBERT overall, but separable LITE is still better than KNRM over all benchmarks; see Section A.5 for details.
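
For readers unfamiliar with kernel pooling, here is a minimal sketch following the formulation of Xiong et al. (2017): a bank of RBF kernels is applied to each row of the similarity matrix, the kernel responses are summed over document tokens, and their logarithms are summed over query tokens before a final linear layer with a tanh. The kernel means and widths, the weights, and the clamping constant are illustrative assumptions, not the settings of Section A.5.

```python
import math

def kernel_pooling(S, mus, sigmas):
    """One soft-match feature per kernel: sum over query tokens of the log of
    the kernel response summed over document tokens."""
    features = []
    for mu, sigma in zip(mus, sigmas):
        total = 0.0
        for row in S:  # one row per query token
            k = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            total += math.log(max(k, 1e-10))  # clamp to avoid log(0)
        features.append(total)
    return features

def knrm_score(S, mus, sigmas, weights, bias=0.0):
    """Linear layer plus tanh on top of the kernel-pooled features."""
    feats = kernel_pooling(S, mus, sigmas)
    return math.tanh(sum(w * f for w, f in zip(weights, feats)) + bias)

# Toy similarity matrix with cosine-like entries; two hypothetical kernels.
S = [[1.0, 0.2, -0.3],
     [0.1, 0.9, 0.4]]
score = knrm_score(S, mus=[1.0, 0.0], sigmas=[0.1, 0.5], weights=[0.5, 0.1])
```

Unlike the MLP scorers of LITE, the only learned parameters of this scorer are the final linear weights.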

5 Conclusion

In this work, we propose LITE models that can provably approximate any continuous scoring function. We also show that LITE outperforms prior late-interaction models in both in-domain and zero-shot re-ranking. In particular, LITE can achieve higher accuracy with less latency and storage cost.

Limitations

In our MS MARCO experiments, we only train our models using triplet data; by contrast, state-of-the-art models such as ColBERTv2 (Santhanam et al., 2022) use additional techniques such as hard-negative mining. One next step is to evaluate LITE with these techniques. Additionally, our proposed LITE model is suitable for the re-ranking phase of information retrieval. However, given that it is built on top of a factorized dual-encoder, can one also adapt it for use in the retrieval phase? For instance, one possibility could be to jointly train retrieval embeddings and the LITE model such that both the models share the same encoders. Such an analysis is also important and needed in future work.

Ethics Statement

LITE is a general technique that can improve relevance scoring accuracy compared with simple operations such as dot products, and we do not see potential risks. In particular, LITE is only a scoring module and does not generate harmful information. We do need to train the LITE scorer and fine-tune the underlying Transformer encoder, which could have some environmental effect.

References

  • Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • Craswell et al. [2020] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. Overview of the trec 2019 deep learning track. arXiv preprint arXiv:2003.07820, 2020.
  • Craswell et al. [2021] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. Overview of the trec 2020 deep learning track. arXiv preprint arXiv:2102.07662, 2021.
  • Cybenko [1989] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Dai et al. [2018] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, page 126–134, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355810. doi: 10.1145/3159652.3159659. URL https://doi.org/10.1145/3159652.3159659.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
  • Dziugaite and Roy [2015] Gintare Karolina Dziugaite and Daniel M. Roy. Neural network matrix factorization. CoRR, abs/1511.06443, 2015. URL http://arxiv.longhoe.net/abs/1511.06443.
  • Funahashi [1989] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
  • Gao et al. [2021] Luyu Gao, Zhuyun Dai, and Jamie Callan. Coil: Revisit exact lexical match in information retrieval with contextualized inverted list. arXiv preprint arXiv:2104.07186, 2021.
  • Guo et al. [2016a] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, page 55–64, New York, NY, USA, 2016a. Association for Computing Machinery. ISBN 9781450340731.
  • Guo et al. [2016b] Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. Quantization based fast inner product search. In Artificial intelligence and statistics, pages 482–490. PMLR, 2016b.
  • Guo et al. [2020] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 3887–3896. PMLR, 2020.
  • He et al. [2017] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pages 173–182, 2017.
  • Hofstätter et al. [2020] Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. CoRR, abs/2010.02666, 2020. URL https://arxiv.longhoe.net/abs/2010.02666.
  • Hornik et al. [1989] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Järvelin and Kekäläinen [2002] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
  • Johnson et al. [2019] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
  • Karpukhin et al. [2020] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Linguistics.
  • Khattab and Zaharia [2020] Omar Khattab and Matei Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, page 39–48. Association for Computing Machinery, New York, NY, USA, 2020. ISBN 9781450380164.
  • Kudo and Richardson [2018] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.
  • Kwiatkowski et al. [2019] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.
  • Li et al. [2022] Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, and Xilun Chen. Citadel: Conditional token interaction via dynamic lexical routing for efficient and effective multi-vector retrieval. arXiv preprint arXiv:2211.10411, 2022.
  • Li et al. [2023] Minghan Li, Sheng-Chieh Lin, Xueguang Ma, and Jimmy Lin. Slim: Sparsified late interaction for multi-vector retrieval with inverted indexes. arXiv preprint arXiv:2302.06587, 2023.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Luan et al. [2021] Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics, 9:329–345, 2021.
  • Ma et al. [2023] Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. Fine-tuning llama for multi-stage text retrieval. arXiv preprint arXiv:2310.08319, 2023.
  • MacAvaney et al. [2019] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. CEDR: Contextualized embeddings for document ranking. In SIGIR, 2019.
  • Menon et al. [2022] Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Reddi, and Sanjiv Kumar. In defense of dual-encoders for neural ranking. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 15376–15400. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/menon22a.html.
  • Mitra and Craswell [2018] Bhaskar Mitra and Nick Craswell. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval, 13(1):1–126, 2018. ISSN 1554-0669. doi: 10.1561/1500000061.
  • Mysore et al. [2021] Sheshera Mysore, Arman Cohan, and Tom Hope. Multi-vector models with textual guidance for fine-grained scientific document similarity. arXiv preprint arXiv:2111.08366, 2021.
  • Nguyen et al. [2016] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In Tarek Richard Besold, Antoine Bordes, Artur S. d’Avila Garcez, and Greg Wayne, editors, Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016, volume 1773 of CEUR Workshop Proceedings. CEUR-WS.org, 2016.
  • Ni et al. [2021] Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable retrievers. CoRR, abs/2112.07899, 2021. URL https://arxiv.longhoe.net/abs/2112.07899.
  • Nogueira and Cho [2019] Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. CoRR, abs/1901.04085, 2019. URL http://arxiv.longhoe.net/abs/1901.04085.
  • Overwijk et al. [2022] Arnold Overwijk, Chenyan Xiong, and Jamie Callan. Clueweb22: 10 billion web documents with rich information. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3360–3362, 2022.
  • Pang et al. [2016] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text matching as image recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • Qian et al. [2022] Yujie Qian, Jinhyuk Lee, Sai Meher Karthik Duddu, Zhuyun Dai, Siddhartha Brahma, Iftekhar Naim, Tao Lei, and Vincent Y Zhao. Multi-vector retrieval as sparse alignment. arXiv preprint arXiv:2211.01267, 2022.
  • Qu et al. [2021] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 5835–5847. Association for Computational Linguistics, 2021.
  • Radev et al. [2002] Dragomir R. Radev, Hong Qi, Harris Wu, and Weiguo Fan. Evaluating web-based question answering systems. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Canary Islands - Spain, May 2002. European Language Resources Association (ELRA).
  • Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908.10084.
  • Ren et al. [2021] Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. RocketQAv2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2825–2835, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  • Rendle et al. [2020] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. Neural collaborative filtering vs. matrix factorization revisited. In Proceedings of the 14th ACM Conference on Recommender Systems, pages 240–248, 2020.
  • Robertson and Zaragoza [2009] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, April 2009. ISSN 1554-0669.
  • Santhanam et al. [2022] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.272. URL https://aclanthology.org/2022.naacl-main.272.
  • Thakur et al. [2021] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663, 2021.
  • Tolstikhin et al. [2021] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
  • Turc et al. [2019] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. CoRR, abs/1908.08962, 2019. URL http://arxiv.org/abs/1908.08962.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
  • Xiong et al. [2017] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, page 55–64, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450350228.
  • Yun et al. [2020] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
  • Zhang and Rui [2013] Lei Zhang and Yong Rui. Image search—from thousands to billions in 20 years. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 9(1s):1–20, 2013.
  • Zhu et al. [2023] Xiaofeng Zhu, Thomas Lin, Vishal Anand, Matthew Calderwood, Eric Clausen-Brown, Gord Lueck, Wen-wai Yim, and Cheng Wu. Explicit and implicit semantic ranking framework. In Companion Proceedings of the ACM Web Conference 2023, pages 326–330, 2023.

Appendix A Experimental details

A.1 Hyper-parameters

The main hyperparameters for LITE are the MLP widths. For Separable LITE (cf. (3) and (4)), if the input dot-product matrix has shape $L_1 \times L_2$, then $\mathbf{W}_1$ has shape $(m_2, L_2)$, $\mathbf{W}_2$ has shape $(L_2, m_2)$, $\mathbf{W}_3$ has shape $(m_1, L_1)$, and $\mathbf{W}_4$ has shape $(L_1, m_1)$. In this work, we let $m_1 = 360$ and $m_2 = 2400$ in most experiments for simplicity, but we also note that much smaller widths can already give a high accuracy while also reducing the latency (cf. Table 3).
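To make the shapes above concrete, the following sketch tabulates the four weight matrices and counts their entries. The sequence lengths `L1 = 32` and `L2 = 256` are illustrative assumptions rather than values stated in this appendix, the function names are hypothetical, and biases are left out because their shapes are not given here.

```python
def separable_lite_weight_shapes(L1, L2, m1=360, m2=2400):
    """Shapes of W1..W4 as listed in A.1 (biases omitted)."""
    return {
        "W1": (m2, L2),
        "W2": (L2, m2),
        "W3": (m1, L1),
        "W4": (L1, m1),
    }


def num_weight_entries(shapes):
    """Total number of scalar entries across all weight matrices."""
    return sum(rows * cols for rows, cols in shapes.values())


# Hypothetical sequence lengths, chosen only for illustration.
shapes = separable_lite_weight_shapes(L1=32, L2=256)
total = num_weight_entries(shapes)  # equals 2 * m2 * L2 + 2 * m1 * L1
```

The count is linear in $m_1$ and $m_2$, so shrinking the widths shrinks the scorer proportionally.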

A.2 Training details

Here we first define the loss functions used in our experiments.

For simplicity, let us first consider the triplet setting, where we are given a query $q$, a positive document $d_+$, and a negative document $d_-$. Suppose the teacher score is given by $\mathbf{t} = (t_+, t_-)$, and the student score is $\mathbf{s} = (s_+, s_-)$. The margin MSE loss is defined as $\left((t_+ - t_-) - (s_+ - s_-)\right)^2$, i.e., it calculates the teacher score margin and student score margin, and applies a squared loss. The KL loss first calculates the teacher and student probability distributions as below

$$\begin{aligned}
\mathbf{p}^{(t)} &= \left(\frac{\exp(t_+)}{\exp(t_+) + \exp(t_-)}, \frac{\exp(t_-)}{\exp(t_+) + \exp(t_-)}\right), \\
\mathbf{p}^{(s)} &= \left(\frac{\exp(s_+)}{\exp(s_+) + \exp(s_-)}, \frac{\exp(s_-)}{\exp(s_+) + \exp(s_-)}\right),
\end{aligned}$$

and then calculates the KL divergence ${\rm KL}(\mathbf{p}^{(t)} || \mathbf{p}^{(s)})$.
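The two triplet losses can be sketched in a few lines of plain Python. The function names are hypothetical, and the log-softmax formulation is a numerical-stability choice made for this sketch; it is mathematically equivalent to the definitions above.

```python
import math


def margin_mse(t, s):
    """Squared difference between teacher and student score margins.

    t = (t_pos, t_neg) and s = (s_pos, s_neg).
    """
    return ((t[0] - t[1]) - (s[0] - s[1])) ** 2


def log_softmax(scores):
    """Log of the softmax probabilities, shifted by the max for stability."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(x - m) for x in scores))
    return [x - log_z for x in scores]


def kl_loss(t, s):
    """KL(p_teacher || p_student) over the (positive, negative) pair."""
    log_pt, log_ps = log_softmax(t), log_softmax(s)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_pt, log_ps))
```

Both losses depend only on score margins: adding a constant to both teacher scores (or both student scores) leaves either loss unchanged.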

In our NQ experiments, we use one positive document and multiple negative documents. In this case the KL loss is defined similarly, while for the margin MSE loss we consider the margins between the positive document and every negative document. Formally, suppose there are $N$ documents, the first one is positive while the remaining ones are negative, and let $t_i$ and $s_i$ denote the teacher and student scores for the $i$-th document, then we consider

$$\sum_{i=2}^{N} \left((t_1 - t_i) - (s_1 - s_i)\right)^2.$$
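A minimal sketch of the multi-negative margin MSE follows, assuming the scores arrive as plain lists with the positive document first. The function name is hypothetical.

```python
def multi_negative_margin_mse(t, s):
    """Sum over negatives i >= 2 of ((t_1 - t_i) - (s_1 - s_i))^2.

    t and s are equal-length score lists whose first entry belongs to the
    positive document; the remaining entries belong to negatives.
    """
    if len(t) != len(s) or len(t) < 2:
        raise ValueError("need matching score lists with at least one negative")
    return sum(((t[0] - ti) - (s[0] - si)) ** 2 for ti, si in zip(t[1:], s[1:]))
```

With $N = 2$ this reduces to the triplet margin MSE loss.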

It is also an interesting open direction to try other training frameworks, such as sRank [Zhu et al., 2023].

For optimization, we use AdamW [Loshchilov and Hutter, 2019] with batch size 128, peak learning rate $2.8 \times 10^{-5}$, weight decay 0.01, and 1.5 million steps. We use a linear learning rate warm-up of 30000 steps, followed by a linear learning rate decay.
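The learning rate schedule can be written as a small function of the step index. Two details are assumptions of this sketch, since the text does not state them: the warm-up starts from zero, and the decay reaches zero exactly at the final step.

```python
def learning_rate(step, peak_lr=2.8e-5, warmup_steps=30_000, total_steps=1_500_000):
    """Linear warm-up to peak_lr, then linear decay.

    Assumptions (not stated in the text): warm-up starts from 0 and the
    decay reaches 0 exactly at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```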

A.3 Results with different loss functions

Here we present results on different scorers and loss functions.

First, Table 4 includes results on MS MARCO.

Table 4: MS MARCO Dev MRR@10. Separable LITE achieves the best results among factorized (non-CE) models.
| Scorer | KL | Margin MSE |
| --- | --- | --- |
| CE student | 0.394 | 0.395 |
| DE | 0.355 | 0.350 |
| ColBERT | 0.383 | 0.378 |
| Separable LITE | 0.388 | 0.393 |

For context, the T2 teacher [Hofstätter et al., 2020] achieves a Dev MRR@10 of 0.399. Even a CE student (with 6 layers and token dimension 768) cannot match this teacher performance: the best MRR@10 we get is 0.395.
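For readers unfamiliar with the metric reported throughout this appendix, MRR@10 averages, over queries, the reciprocal rank of the first relevant document among the top 10 results (contributing zero if none appears). The sketch below uses a hypothetical input format of one boolean list per query.

```python
def mrr_at_k(rankings, k=10):
    """Mean reciprocal rank with cutoff k.

    rankings: one list per query, holding booleans in ranked order
    (True marks a relevant document).
    """
    total = 0.0
    for ranked in rankings:
        for position, is_relevant in enumerate(ranked[:k], start=1):
            if is_relevant:
                total += 1.0 / position
                break
    return total / len(rankings)
```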

We also note that separable LITE gets good results for both the KL loss and the margin MSE loss, while the other scorers seem to prefer only one loss. Understanding the effects of the loss function is an interesting open question.

Table 5: Natural Questions Dev MRR@10. Separable LITE achieves the best results both in direct training and distillation settings.
| Scorer | Cross Entropy (one-hot labels) | KL (distillation) | Margin MSE |
| --- | --- | --- | --- |
| DE | 0.678 | 0.699 | 0.699 |
| ColBERT | 0.690 | 0.754 | 0.756 |
| Separable LITE | 0.710 | 0.741 | 0.769 |

Table 5 includes results on NQ. Here we report results in two settings: direct training with one-hot labels and the cross-entropy loss, and distillation training with the KL loss and the margin MSE loss. Separable LITE achieves the best results for both the cross-entropy loss and the margin MSE loss; although ColBERT performs better than separable LITE with the KL loss, the KL loss does not outperform the margin MSE loss for any scorer.

A.4 Model ablations

Using top-$k$ aligned document tokens in ColBERT.

Given query Transformer embedding vectors $\mathbf{q}_1, \ldots, \mathbf{q}_{L_1}$ and document Transformer embedding vectors $\mathbf{d}_1, \ldots, \mathbf{d}_{L_2}$, recall that ColBERT performs a sum-max reduction:

$$\sum_{i \in [L_1]} \max_{j \in [L_2]} \mathbf{q}_i^\top \mathbf{d}_j.$$

In other words, for each query token $\mathbf{q}_i$, ColBERT finds the most-aligned document embedding vector and includes their dot-product in the score. Qian et al. [2022] suggest using top-$k$ aligned document tokens for each query token; here we try $k = 2, 4, 8$ on MS MARCO, but do not notice significant improvement compared with $k = 1$.

| $k$ | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- |
| MRR@10 | 0.383 | 0.378 | 0.380 | 0.382 |

Table 6: Dev MRR@10 on MS MARCO with different values of $k$. We find that $k = 1$ (i.e., the original ColBERT) is better than the other options we try ($k = 2, 4, 8$).
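The sum-max reduction and a top-$k$ variant can be sketched as follows. How the $k$ best alignments are aggregated is an assumption of this sketch (they are simply summed); the function names are hypothetical, and embeddings are plain lists of floats.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sum_max_score(query_vecs, doc_vecs):
    """ColBERT-style reduction: sum over query tokens of the max dot product."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def sum_topk_score(query_vecs, doc_vecs, k=1):
    """Hypothetical top-k variant: sum the k largest dot products per query token.

    The aggregation over the k values is an assumption; k=1 recovers sum-max.
    """
    total = 0.0
    for q in query_vecs:
        sims = sorted((dot(q, d) for d in doc_vecs), reverse=True)
        total += sum(sims[:k])
    return total
```

Neither reduction has trainable weights, which is why these scorers need no additional fine-tuning of their own.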

Freezing query and document encoders.

Recall that we use pretrained BERT models for query and document encoding, and moreover in all experiments above we also fine-tune the pretrained Transformers on MS MARCO and NQ. Here we explore performance of different scorers when the query and document Transformer encoders are frozen (i.e., pre-trained but not fine-tuned on MS MARCO).

When the query and document encoders are frozen, ColBERT does not require any additional fine-tuning since the sum-max function does not include any weights. In this case, ColBERT can achieve Dev MRR@10 score 0.112 on MS MARCO.

For separable LITE, if we freeze the query and document Transformer encoders and only fine-tune the separable LITE scorer (i.e., $\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2, \mathbf{W}_3, \mathbf{b}_3, \mathbf{W}_4, \mathbf{b}_4$ in (3) and (4)), then it can achieve Dev MRR@10 score 0.188 on MS MARCO, which is much better than ColBERT.

A.5 KNRM results

For KNRM, following Xiong et al. [2017], we use $K = 11$ kernels, where $\mu_1 = 0.9$, $\mu_2 = 0.7$, $\ldots$, $\mu_{10} = -0.9$ with $\sigma_1 = \cdots = \sigma_{10} = 0.1$, and $\mu_{11} = 1.0$ with $\sigma_{11} = 10^{-3}$. We hold $\mu_k$ and $\sigma_k$ fixed and only train $\mathbf{w}$.
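The kernel configuration above can be sketched with Gaussian kernel pooling over a query-document similarity matrix. This is a simplified illustration, not the exact scorer used in the experiments: the clamp `eps`, the omission of any bias or output nonlinearity, and the function names are assumptions of this sketch.

```python
import math

# Kernel means 0.9, 0.7, ..., -0.9 (sigma 0.1) plus an exact-match kernel at 1.0.
MUS = [round(0.9 - 0.2 * i, 1) for i in range(10)] + [1.0]
SIGMAS = [0.1] * 10 + [1e-3]


def knrm_features(sim_matrix, eps=1e-10):
    """Kernel-pooled features, one per kernel.

    sim_matrix[i][j] is the similarity between query token i and document
    token j. The clamp eps (an assumption) avoids log(0).
    """
    features = []
    for mu, sigma in zip(MUS, SIGMAS):
        phi = 0.0
        for row in sim_matrix:
            soft_count = sum(math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for x in row)
            phi += math.log(max(soft_count, eps))
        features.append(phi)
    return features


def knrm_score(sim_matrix, w):
    """Linear combination of kernel features; only w would be trained."""
    return sum(wk * fk for wk, fk in zip(w, knrm_features(sim_matrix)))
```

Since the kernel means and widths are fixed, the only learnable part of this sketch is the 11-dimensional weight vector `w`.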

We report MRR@10 and nDCG@10 scores on in-domain tasks in Table 7. KNRM achieves similar scores to ColBERT overall, while separable LITE is more accurate than KNRM on all benchmarks.

Table 7: MRR@10 and nDCG@10 scores for in-domain tasks. KNRM is similar to ColBERT overall, while worse than separable LITE on all tasks.
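| Scorer | MS MARCO MRR | MS MARCO nDCG | DL 2019 MRR | DL 2019 nDCG | DL 2020 MRR | DL 2020 nDCG | NQ MRR | NQ nDCG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ColBERT | 0.383 | 0.442 | 0.878 | 0.753 | 0.860 | 0.731 | 0.756 | 0.689 |
| KNRM | 0.390 | 0.448 | 0.859 | 0.744 | 0.858 | 0.730 | 0.759 | 0.682 |
| Sep LITE | 0.393 | 0.452 | 0.898 | 0.765 | 0.873 | 0.756 | 0.769 | 0.693 |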

Moreover, separable LITE is much better than KNRM on zero-shot transfer: it is better than KNRM on 12 out of 14 datasets, as shown in Table 8.

Table 8: BEIR nDCG@10. Separable LITE is better than KNRM on 12 out of 14 datasets.
| Dataset | KNRM | Separable LITE |
| --- | --- | --- |
| T-COVID | 0.741 | 0.763 |
| NFCorpus | 0.353 | 0.358 |
| NQ | 0.526 | 0.540 |
| HotpotQA | 0.678 | 0.681 |
| FiQA-2018 | 0.328 | 0.336 |
| ArguAna | 0.446 | 0.424 |
| Touché-2020 | 0.301 | 0.305 |
| CQAD | 0.367 | 0.374 |
| Quora | 0.239 | 0.839 |
| DBPedia | 0.420 | 0.434 |
| SCIDOCS | 0.159 | 0.164 |
| FEVER | 0.715 | 0.788 |
| C-FEVER | 0.199 | 0.213 |
| SciFact | 0.645 | 0.633 |

A.6 BEIR results of small separable LITE

Table 9 shows BEIR results for the small separable LITE introduced in Table 3. It is better than ColBERT on 8 out of 14 datasets.

Table 9: BEIR nDCG@10. Small separable LITE is better than ColBERT on 8 out of 14 datasets.
| Dataset | ColBERT | Small sep LITE |
| --- | --- | --- |
| T-COVID | 0.761 | 0.767 |
| NFCorpus | 0.356 | 0.353 |
| NQ | 0.525 | 0.538 |
| HotpotQA | 0.685 | 0.680 |
| FiQA-2018 | 0.330 | 0.329 |
| ArguAna | 0.433 | 0.433 |
| Touché-2020 | 0.274 | 0.298 |
| CQAD | 0.363 | 0.374 |
| Quora | 0.767 | 0.836 |
| DBPedia | 0.410 | 0.436 |
| SCIDOCS | 0.155 | 0.163 |
| FEVER | 0.782 | 0.772 |
| C-FEVER | 0.190 | 0.214 |
| SciFact | 0.667 | 0.622 |

Appendix B Proof of Theorem 3.1

Here we prove Theorem 3.1. We first restate it here and also include a universal approximation result without positional encodings.

Theorem B.1 (Universal approximation with LITE).

Let $s : \mathbb{R}^{(P \times L_1) \times (P \times L_2)} \to \mathbb{R}$ denote a continuous scoring function with a compact support $\Omega$ and $L_1, L_2 \geq 2$. For any $\mathcal{F} \in \{\mathcal{F}_{\rm f}, \mathcal{F}_{\rm s}\}$ and any $\epsilon > 0$, there exists a query Transformer $T_1 : \mathbb{R}^{P \times L_1} \to \mathbb{R}^{P \times L_1}$, a document Transformer $T_2 : \mathbb{R}^{P \times L_2} \to \mathbb{R}^{P \times L_2}$, and a scorer $f \in \mathcal{F}$, such that

$$\int_{\Omega} \left( f\left( T_1(\mathbf{X})^\top T_2(\mathbf{Y}) \right) - s(\mathbf{X}, \mathbf{Y}) \right)^2 {\rm d}(\mathbf{X}, \mathbf{Y}) \leq \epsilon.$$

Under the same conditions, there also exist positional encoding matrices $\mathbf{E} \in \mathbb{R}^{P \times L_1}$ and $\mathbf{F} \in \mathbb{R}^{P \times L_2}$, a query Transformer $T_1 : \mathbb{R}^{P \times L_1} \to \mathbb{R}^{P \times L_1}$ and a pooling function ${\sf pool}_1 : \mathbb{R}^{P \times L_1} \to \mathbb{R}^{P \times 2}$, a document Transformer $T_2 : \mathbb{R}^{P \times L_2} \to \mathbb{R}^{P \times L_2}$ and a pooling function ${\sf pool}_2 : \mathbb{R}^{P \times L_2} \to \mathbb{R}^{P \times 2}$, and a scorer $f \in \mathcal{F}$, such that

$$\int_{\Omega} \left( f\left( {\sf pool}_1(T_1(\mathbf{X} + \mathbf{E}))^\top {\sf pool}_2(T_2(\mathbf{Y} + \mathbf{F})) \right) - s(\mathbf{X}, \mathbf{Y}) \right)^2 {\rm d}(\mathbf{X}, \mathbf{Y}) \leq \epsilon.$$

Our proof is based on the analysis of [Yun et al., 2020]: they showed that Transformer networks are universal approximators of continuous and compactly-supported sequence-to-sequence functions. In our case, we need to show universal approximation with the dot-product matrix; to this end, we actually need a few technical lemmas from [Yun et al., 2020], as detailed below.

Without loss of generality, we assume the support of the ground-truth scoring function is contained in $[0,1)^{P \times L_1} \times [0,1)^{P \times L_2}$. The first step is to replace the ground-truth scoring function $s$ with a piece-wise constant function: let $\delta > 0$ be small enough, and let

$$s_\delta(\mathbf{X}, \mathbf{Y}) := \sum_{\mathbf{X}' \in \mathbb{G}_\delta, \mathbf{Y}' \in \mathbb{H}_\delta} s(\mathbf{X}', \mathbf{Y}') \mathds{1}\left[ \mathbf{X} \in \mathbb{C}_{\mathbf{X}'} \textrm{ and } \mathbf{Y} \in \mathbb{C}_{\mathbf{Y}'} \right], \tag{5}$$

where $\mathbf{X} \in [0,1)^{P \times L_1}$, and $\mathbf{Y} \in [0,1)^{P \times L_2}$, and $\mathbb{G}_\delta := \{0, \delta, \ldots, 1 - \delta\}^{P \times L_1}$, and $\mathbb{H}_\delta := \{0, \delta, \ldots, 1 - \delta\}^{P \times L_2}$, and $\mathbb{C}_{\mathbf{X}'} := \prod_{j=1}^{P} \prod_{k=1}^{L_1} [X'_{j,k}, X'_{j,k} + \delta)$, and $\mathbb{C}_{\mathbf{Y}'} := \prod_{j=1}^{P} \prod_{k=1}^{L_2} [Y'_{j,k}, Y'_{j,k} + \delta)$. Since $s$ is continuous, if $\delta$ is small enough, it holds that $s_\delta$ is a good approximation of $s$.
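The piece-wise constant surrogate amounts to evaluating $s$ at the lower corner of the grid cell containing each input entry. The sketch below illustrates this on a toy example; `toy_score` is a hypothetical continuous function chosen only for the demonstration, and $\delta$ is taken to be a power of two so that the floating-point division is exact.

```python
import math


def quantize(x, delta):
    """Map x in [0, 1) to the lower corner k*delta of its grid cell."""
    return math.floor(x / delta) * delta


def s_delta(s, X, Y, delta):
    """Piece-wise constant surrogate: evaluate s at the cell corners of X and Y.

    X and Y are matrices given as lists of rows with entries in [0, 1).
    """
    Xq = [[quantize(x, delta) for x in row] for row in X]
    Yq = [[quantize(y, delta) for y in row] for row in Y]
    return s(Xq, Yq)


def toy_score(X, Y):
    """Hypothetical continuous scoring function used only for this demo."""
    return sum(map(sum, X)) * sum(map(sum, Y))


X, Y = [[0.3, 0.6]], [[0.45]]
exact = toy_score(X, Y)
coarse_error = abs(s_delta(toy_score, X, Y, 0.25) - exact)
fine_error = abs(s_delta(toy_score, X, Y, 0.0625) - exact)
```

In this example the finer grid gives the smaller error, matching the claim that $s_\delta$ approximates $s$ well once $\delta$ is small.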

Next we follow [Yun et al., 2020] and try to approximate $s_\delta$ using LITE models based on modified Transformers. Recall that a standard Transformer uses softmax in attention layers and ReLU activation in MLPs; by contrast, in a modified Transformer, we use hardmax in attention layers, and in MLPs we are allowed to use activation functions from $\Phi$, which consists of piece-wise linear functions with at most three pieces where at least one piece is a constant. Such a modified Transformer can then be approximated by a standard Transformer [Yun et al., 2020, Lemma 9].

Here are two key lemmas from [Yun et al., 2020]. For simplicity, we state them for the query Transformer, but they will also be applied to the document Transformer.

The following lemma ensures that there exists a modified Transformer that can quantize the input domain, and thus we can just work with $\mathbb{G}_\delta$. Similarly, on the document side, we can focus on $\mathbb{H}_\delta$.

Lemma B.2 ([Yun et al., 2020] Lemma 5).

There exists a feedforward network $g_{\mathrm{q}} : [0,1)^{P \times L_1} \to \mathbb{G}_\delta$ with activations from $\Phi$, such that for any entry $1 \leq i \leq P$ and any $1 \leq j \leq L_1$, it holds that $g_{\mathrm{q}}(\mathbf{X})_{i,j} = k\delta$ if $X_{i,j} \in [k\delta, (k+1)\delta)$, $k = 0, \ldots, 1/\delta - 1$.

The following lemma ensures the existence of a modified Transformer that can implement a “contextual mapping”: roughly speaking, it means each token of the Transformer output is a unique hash encoding of the whole input token sequence. Below is a formal statement.

Lemma B.3 ([Yun et al., 2020] Lemma 6).

Consider the following subset of $\mathbb{G}_\delta$:

$$\widetilde{\mathbb{G}}_\delta := \left\{ \mathbf{X} \in \mathbb{G}_\delta \,\middle|\, \mathbf{X}_{:,i} \neq \mathbf{X}_{:,j} \textrm{ for all } i \neq j \right\}.$$

If $L_1 \geq 2$ and $\delta \leq 1/2$, then there exists an attention network $g_{\mathrm{c}} : \mathbb{R}^{P \times L_1} \to \mathbb{R}^{P \times L_1}$ with the hardmax operator, a vector $\mathbf{u} \in \mathbb{R}^{P}$, constants $t_l, t_r$ with $0 < t_l < t_r$, such that $\alpha(\mathbf{X}) := \mathbf{u}^\top g_{\mathrm{c}}(\mathbf{X})$ satisfies the following conditions:

  1. For any $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$, all entries of $\alpha(\mathbf{X})$ are different.

  2. For any $\mathbf{X},\mathbf{X}'\in\widetilde{\mathbb{G}}_\delta$ such that $\mathbf{X}'$ is not a permutation of $\mathbf{X}$, all entries of $\alpha(\mathbf{X})$, $\alpha(\mathbf{X}')$ are different.

  3. For any $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$, all entries of $\alpha(\mathbf{X})$ are in $[t_l,t_r]$.

  4. For any $\mathbf{X}\in\mathbb{G}_\delta\setminus\widetilde{\mathbb{G}}_\delta$, all entries of $\alpha(\mathbf{X})$ are outside $[t_l,t_r]$.

For the document side, consider

\[
\widetilde{\mathbb{H}}_\delta := \left\{ \mathbf{Y}\in\mathbb{H}_\delta \,\middle|\, \mathbf{Y}_{:,i}\neq\mathbf{Y}_{:,j} \textrm{ for all } i\neq j \right\}.
\]

Lemma B.3 also ensures the existence of an attention network $h_{\mathrm{c}}:\mathbb{R}^{P\times L_2}\to\mathbb{R}^{P\times L_2}$ with the hardmax operator, a vector $\mathbf{v}\in\mathbb{R}^{P}$, and constants $s_l,s_r$ with $0<s_l<s_r$, such that $\beta(\mathbf{Y}):=\mathbf{v}^\top h_{\mathrm{c}}(\mathbf{Y})$ satisfies similar conditions.
Also note that for small enough $\delta$, we can neglect $\mathbb{G}_\delta\setminus\widetilde{\mathbb{G}}_\delta$ and $\mathbb{H}_\delta\setminus\widetilde{\mathbb{H}}_\delta$, since $|\mathbb{G}_\delta\setminus\widetilde{\mathbb{G}}_\delta|=O\left(\delta^{P}|\mathbb{G}_\delta|\right)$ and $|\mathbb{H}_\delta\setminus\widetilde{\mathbb{H}}_\delta|=O\left(\delta^{P}|\mathbb{H}_\delta|\right)$.
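The counting claim above can be checked numerically on small grids. The sketch below computes the exact fraction of grid matrices in $\mathbb{G}_\delta$ that have at least one repeated column and compares it with the union bound $\binom{L_1}{2}\delta^{P}$. The helper names and the choice $P=2$, $L_1=3$ are assumptions for illustration only.

```python
from fractions import Fraction
from math import comb, perm

def duplicate_fraction(P, L, inv_delta):
    """Exact fraction of P x L grid matrices with at least one repeated column."""
    n_cols = inv_delta ** P        # number of distinct grid columns, (1/delta)^P
    total = n_cols ** L            # |G_delta|
    distinct = perm(n_cols, L)     # matrices whose columns are pairwise distinct
    return Fraction(total - distinct, total)

def union_bound(P, L, inv_delta):
    """Union bound over column pairs: C(L, 2) * delta^P."""
    return Fraction(comb(L, 2), inv_delta ** P)

for inv_delta in (4, 8, 16):
    frac = duplicate_fraction(2, 3, inv_delta)
    bound = union_bound(2, 3, inv_delta)
    print(inv_delta, float(frac), float(bound))
```

For fixed $P$ and $L_1$, the fraction stays below the bound and shrinks as $\delta$ decreases, which is why the excluded sets can be neglected.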

Now we are ready to prove Theorem B.1. We first consider the case without positional encodings.

Analysis without positional encodings.

Note that for $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$ and $\mathbf{Y}\in\widetilde{\mathbb{H}}_\delta$, it holds that $\alpha(\mathbf{X})$ and $\beta(\mathbf{Y})$ already include enough information to determine the score. However, in LITE models, the final score is calculated only based on dot products between query embedding vectors and document embedding vectors. As a result, we need to first insert $\mathbf{u}$ and $\mathbf{v}$ into the Transformer embeddings. The following lemma handles this issue: there exists a feedforward network such that for each $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$, it replaces one token in $g_{\mathrm{c}}(\mathbf{X})$ with $\mathbf{v}$ while keeping other tokens unchanged.

Lemma B.4.

Consider the activation function $\varphi$ with $\varphi(z)=1$ if $0\leq z\leq 1$, and $\varphi(z)=0$ if $z<0$ or $z>1$. There exists a feedforward network $g_{\mathrm{v}}:\mathbb{R}^{P}\to\mathbb{R}^{P}$ with activation $\varphi$ such that for any $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$, let $i:=\operatorname{argmin}_j\alpha(\mathbf{X})_j$, then $g_{\mathrm{v}}(g_{\mathrm{c}}(\mathbf{X})_{:,i})=\mathbf{v}$, while for $j\neq i$, it holds that $g_{\mathrm{v}}(g_{\mathrm{c}}(\mathbf{X})_{:,j})=g_{\mathrm{c}}(\mathbf{X})_{:,j}$.

Proof.

For any $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$ and any $i$, $1\leq i\leq L_1$, Lemma B.3 ensures that there exist constants $l(\mathbf{X},i)$ and $r(\mathbf{X},i)$ such that $0<l(\mathbf{X},i)<\alpha(\mathbf{X})_i<r(\mathbf{X},i)$, and that $[l(\mathbf{X},i),r(\mathbf{X},i)]$ does not contain other entries in $\alpha(\mathbf{X})$, and moreover $[l(\mathbf{X},i),r(\mathbf{X},i)]$ does not contain entries from $\alpha(\mathbf{X}')$ for $\mathbf{X}'\in\widetilde{\mathbb{G}}_\delta$ which is not a permutation of $\mathbf{X}$. For this $(\mathbf{X},i)$ pair, if $i=\operatorname{argmin}_j\alpha(\mathbf{X})_j$, we construct the following neuron

\[
\psi_{\mathbf{X},i}(\mathbf{z}) := \varphi\left(\frac{1}{r(\mathbf{X},i)-l(\mathbf{X},i)}\left(\mathbf{u}^\top\mathbf{z}-l(\mathbf{X},i)\right)\right)\mathbf{v},
\]

otherwise let

\[
\psi_{\mathbf{X},i}(\mathbf{z}) := \varphi\left(\frac{1}{r(\mathbf{X},i)-l(\mathbf{X},i)}\left(\mathbf{u}^\top\mathbf{z}-l(\mathbf{X},i)\right)\right)g_{\mathrm{c}}(\mathbf{X})_{:,i}.
\]

The full network is the sum of all such neurons

\[
g_{\mathrm{v}}(\mathbf{z}) := \sum_{\mathbf{X}\in\widetilde{\mathbb{G}}_\delta,\,1\leq i\leq L_1}\psi_{\mathbf{X},i}(\mathbf{z}),
\]

which satisfies the requirement of Lemma B.4. ∎
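The construction in this proof can be sketched on a toy example. The code below builds one window neuron per (input, token) pair, using the activation $\varphi$ from Lemma B.4, and sums them to obtain a map that swaps the token with the smallest $\alpha$ value for $\mathbf{v}$ and leaves the other tokens unchanged. The token values, the fixed window half-width, and the names `build_g_v`, `tokens_a`, `tokens_b` are assumptions chosen so that the windows do not overlap; they are not outputs of an actual attention network $g_{\mathrm{c}}$.

```python
def phi(z):
    """Window activation from Lemma B.4: 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= z <= 1.0 else 0.0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_g_v(contextual_tokens, u, v, half_width):
    """Sum of window neurons, one per (input, token) pair.

    contextual_tokens: one list of token vectors per input, standing in for
    the columns of g_c(X). The windows [alpha - half_width, alpha + half_width]
    are assumed not to overlap.
    """
    neurons = []
    for tokens in contextual_tokens:
        alphas = [dot(u, t) for t in tokens]
        i_min = alphas.index(min(alphas))
        for i, (t, a) in enumerate(zip(tokens, alphas)):
            out = v if i == i_min else t   # argmin token is replaced by v
            neurons.append((a - half_width, a + half_width, out))

    def g_v(z):
        s = dot(u, z)
        result = [0.0] * len(v)
        for l, r, out in neurons:
            gate = phi((s - l) / (r - l))
            result = [acc + gate * o for acc, o in zip(result, out)]
        return result

    return g_v

# Toy setup with P = 2: alpha values are 1, 2 for one input and 3, 4 for the other.
u, v = [1.0, 0.0], [0.0, -1.0]
tokens_a = [[1.0, 0.3], [2.0, 0.7]]
tokens_b = [[3.0, 0.2], [4.0, 0.9]]
g_v = build_g_v([tokens_a, tokens_b], u, v, half_width=0.25)

print(g_v(tokens_a[0]))  # argmin token of the first input: replaced by v
print(g_v(tokens_a[1]))  # other token: returned unchanged
```

Exactly one window is active for each contextual token, so the sum reduces to a single neuron's output.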

Lemma B.4 is stated for the query side; on the document side, it also follows that there exists a feedforward network $h_{\mathrm{u}}$ that can replace one token in the embeddings given by $h_{\mathrm{c}}$ by $\mathbf{u}$. Then we are ready to prove Theorem B.1 without positional encodings.

Proof of Theorem B.1, no positional encodings.

In this proof, we will focus on $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$ and $\mathbf{Y}\in\widetilde{\mathbb{H}}_\delta$ as ensured by Lemmas B.2 and B.3. We also use notation introduced in Lemmas B.3 and B.4.

First consider $\mathbf{u}$ and $\mathbf{v}$ given by Lemma B.3. Without loss of generality, we can assume $\mathbf{u}^\top\mathbf{v}\leq 0$; if $\mathbf{u}^\top\mathbf{v}>0$, we will replace $\mathbf{v}$ with $-\mathbf{v}$ and replace $h_{\rm c}(\mathbf{Y})$ with $-h_{\rm c}(\mathbf{Y})$, which ensures $\mathbf{u}^\top\mathbf{v}\leq 0$, and moreover the conclusions of Lemma B.3 still hold. In detail, in the construction of $g_{\rm v}$, we use $-\mathbf{v}$ instead of $\mathbf{v}$, while in the construction of $h_{\rm u}$, we use $-h_{\rm c}(\mathbf{Y})$ instead of $h_{\rm c}(\mathbf{Y})$. As a result, in the following we assume $\mathbf{u}^\top\mathbf{v}\leq 0$.

Recall that for $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$, the range of $\mathbf{u}^\top g_{\rm c}(\mathbf{X})$ is denoted by $[t_l,t_r]$ with $0<t_l<t_r$, while for $\mathbf{Y}\in\widetilde{\mathbb{H}}_\delta$, the range of $\mathbf{v}^\top h_{\rm c}(\mathbf{Y})$ is denoted by $[s_l,s_r]$ with $0<s_l<s_r$. Define

\[
M := \max_{\mathbf{X}\in\widetilde{\mathbb{G}}_\delta}\max_{\mathbf{Y}\in\widetilde{\mathbb{H}}_\delta}\max_{i,j}\left|g_{\rm c}(\mathbf{X})_{:,i}^\top h_{\rm c}(\mathbf{Y})_{:,j}\right|.
\]

In the following, we will assume $t_l>M$ and $s_l>t_r$ without loss of generality; if these conditions do not hold, we can let $\lambda_1,\lambda_2$ be large enough such that $\lambda_1 t_l>M$ and $\lambda_2 s_l>\lambda_1 t_r$, and scale $\mathbf{u}$ to $\lambda_1\mathbf{u}$, and scale $\mathbf{v}$ to $\lambda_2\mathbf{v}$.

Given $\mathbf{X}\in\widetilde{\mathbb{G}}_\delta$ and $\mathbf{Y}\in\widetilde{\mathbb{H}}_\delta$, we consider $\mathbf{Q}=g_{\rm v}(g_{\rm c}(\mathbf{X}))\in\mathbb{R}^{P\times L_1}$, and $\mathbf{D}=h_{\rm u}(h_{\rm c}(\mathbf{Y}))\in\mathbb{R}^{P\times L_2}$, and the dot-product matrix $\mathbf{S}:=\mathbf{Q}^\top\mathbf{D}\in\mathbb{R}^{L_1\times L_2}$. Lemma B.4 ensures that $\mathbf{Q}$ has one column equal to $\mathbf{v}$, while $\mathbf{D}$ has one column equal to $\mathbf{u}$.

Let $\mathbf{q}$ denote an arbitrary column of $\mathbf{Q}$ other than $\mathbf{v}$, and let $\mathbf{d}$ denote an arbitrary column of $\mathbf{D}$ other than $\mathbf{u}$. Due to previous discussion, we have $\mathbf{v}^\top\mathbf{d}\geq s_l>t_r\geq\mathbf{q}^\top\mathbf{u}$, and therefore we can distinguish them. Additionally $\mathbf{q}^\top\mathbf{u}\geq t_l>M$, and thus we can distinguish it from other entries of $\mathbf{S}$, including $\mathbf{v}^\top\mathbf{u}\leq 0$.

Now let us examine $\mathbf{S}$ in detail. Suppose $\mathbf{Q}_{:,i}=\mathbf{v}$ and $\mathbf{D}_{:,j}=\mathbf{u}$ for some $1\leq i\leq L_1$ and $1\leq j\leq L_2$. Then

\[
\mathbf{S}_{i,:}=(\mathbf{Q}^\top\mathbf{D})_{i,:}=[\mathbf{v}^\top\mathbf{d}_1,\cdots,\mathbf{v}^\top\mathbf{u},\cdots,\mathbf{v}^\top\mathbf{d}_{L_2}],
\]

and

\[
\mathbf{S}_{:,j}=[\mathbf{q}_1^\top\mathbf{u},\cdots,\mathbf{v}^\top\mathbf{u},\cdots,\mathbf{q}_{L_1}^\top\mathbf{u}]^\top.
\]

The previous scaling allows us to find $\mathbf{S}_{i,:}$ and $\mathbf{S}_{:,j}$. Lemma B.3 ensures that every element of $\mathbf{S}_{i,:}$ other than $\mathbf{v}^\top\mathbf{u}$ can uniquely determine the set of columns of the document input $\mathbf{Y}$, but not the order of columns since Transformers without positional encodings are permutation-equivariant [Yun et al., 2020, Claim 1]. However, all elements of $\mathbf{S}_{i,:}$ together are able to determine the exact order of columns of $\mathbf{Y}$. Similarly, $\mathbf{S}_{:,j}$ as a whole can determine the exact query input $\mathbf{X}$, including the order of columns. Consequently, $\mathbf{S}$ can uniquely determine the input pair $(\mathbf{X},\mathbf{Y})$, and also the ground-truth score $s_\delta(\mathbf{X},\mathbf{Y})$.
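To make the role of the scaling concrete, the following sketch forms a small dot-product matrix $\mathbf{S}=\mathbf{Q}^\top\mathbf{D}$ and recovers the special row $i$ and column $j$ using only the thresholds $M<t_l\leq t_r<s_l$. All vectors and threshold values here are hypothetical numbers chosen to satisfy the inequalities in the proof; they do not come from a trained or constructed Transformer, and the helper names are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity(Q_cols, D_cols):
    """S[a][b] = <Q[:, a], D[:, b]>, with Q and D given as lists of columns."""
    return [[dot(q, d) for d in D_cols] for q in Q_cols]

def locate(S, M, t_r):
    """Find the row holding v and the column holding u from magnitudes alone.

    Entries above t_r have the form v^T d, so they mark row i.
    Entries in (M, t_r] have the form q^T u, so they mark column j.
    """
    i = j = None
    for a, row in enumerate(S):
        for b, s in enumerate(row):
            if s > t_r:
                i = a
            elif s > M:
                j = b
    return i, j

# Hypothetical values with P = 2, u^T v = 0, M = 1, [t_l, t_r] = [5, 6], s_l = 10.
u, v = [1.0, 0.0], [0.0, 1.0]
Q_cols = [[5.0, 0.01], v, [6.0, -0.02]]                # v sits at index 1
D_cols = [[0.05, 10.0], [-0.1, 12.0], u, [0.02, 11.0]]  # u sits at index 2

S = similarity(Q_cols, D_cols)
print(locate(S, M=1.0, t_r=6.0))  # (1, 2) with 0-based indices
```

Once the row and column are located, their entries carry the contextual values $\beta(\mathbf{Y})$ and $\alpha(\mathbf{X})$ that the rest of the argument relies on.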

For flattened LITE, note that $\widetilde{\mathbb{G}}_\delta$ and $\widetilde{\mathbb{H}}_\delta$ are both finite, and thus the set of possible dot-product matrices

\[
\left\{\mathbf{Q}^\top\mathbf{D}\,\middle|\,\mathbf{Q}=g_{\rm v}(g_{\rm c}(\mathbf{X})),\ \mathbf{D}=h_{\rm u}(h_{\rm c}(\mathbf{Y})),\ \mathbf{X}\in\widetilde{\mathbb{G}}_\delta,\ \mathbf{Y}\in\widetilde{\mathbb{H}}_\delta\right\}
\]

is also finite. Moreover, each dot-product matrix uniquely determines the ground-truth score, as discussed above. Therefore there exists a 2-layer ReLU network that uniformly approximates an interpolation of these scores [Cybenko, 1989, Funahashi, 1989, Hornik et al., 1989], which finishes the proof.

For separable LITE, recall that we first apply a shared MLP $f_1$ to reduce every row of $\mathbf{S}$ to a scalar, and thus get a column vector; then we apply another MLP $f_2$ to reduce this column vector to a final score. Now let $\psi$ denote an injection from $\widetilde{\mathbb{H}}_\delta$ to $[t_r+1,t_r+2]$, i.e., for any distinct $\mathbf{Y},\mathbf{Y}'\in\widetilde{\mathbb{H}}_\delta$, we have $\psi(\mathbf{Y}),\psi(\mathbf{Y}')\in[t_r+1,t_r+2]$, and $\psi(\mathbf{Y})\neq\psi(\mathbf{Y}')$. There exists such a $\psi$ since $\widetilde{\mathbb{H}}_\delta$ is finite.

Now if the $i$-th column of $\mathbf{Q}$ is $\mathbf{v}$, then we let $f_1$ map $\mathbf{S}_{i,:}$ to $\psi(\mathbf{Y})$; this is well-defined since $\mathbf{S}_{i,:}$ uniquely determines $\mathbf{Y}$, as discussed above. For any $i'\neq i$, we let $f_1$ map $\mathbf{S}_{i',:}$ to $\mathbf{q}_{i'}^\top\mathbf{u}\in[t_l,t_r]$. Note that by our construction, $f_1(\mathbf{S}_{i,:})\geq t_r+1>t_r\geq f_1(\mathbf{S}_{i',:})$.
As a result, $f_1(\mathbf{S})$ uniquely determines $(\mathbf{X},\mathbf{Y})$, and thus there exists another MLP $f_2$ which can approximate the ground-truth score $s_\delta$. ∎

Analysis with positional encodings.

Here we consider the case with positional encodings. Following [Yun et al., 2020], we will use fixed positional encodings: let $\mathbf{1}$ denote the $P$-dimensional all-ones vector, let $\mathbf{E} \in \mathbb{R}^{P \times L_1}$ denote the matrix whose $j$-th column is given by $(j-1)\mathbf{1}$, and similarly let $\mathbf{F} \in \mathbb{R}^{P \times L_2}$ denote the matrix whose $j$-th column is given by $(j-1)\mathbf{1}$. Given input $\mathbf{X} \in [0,1)^{P \times L_1}$ and $\mathbf{Y} \in [0,1)^{P \times L_2}$, we transform them to $(\mathbf{X} + \mathbf{E})/L_1$ and $(\mathbf{Y} + \mathbf{F})/L_2$.
Note that after the transformation, it holds that $(\mathbf{X} + \mathbf{E})/L_1 \in \prod_{i=1}^{P} \prod_{j=1}^{L_1} [(j-1)/L_1, j/L_1)$; in other words, different columns of $(\mathbf{X} + \mathbf{E})/L_1$ have different ranges.
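As an illustration of the transformation above, the following NumPy sketch applies the fixed positional encoding to a random input and checks that column $j$ lands in $[(j-1)/L_1, j/L_1)$. The function name `add_positional_encoding` and the sizes `P`, `L` are hypothetical and chosen only for this example.

```python
import numpy as np

def add_positional_encoding(X):
    """Map X in [0,1)^{P x L} to (X + E) / L, where the j-th column of E
    (1-indexed) is (j - 1) times the all-ones vector."""
    P, L = X.shape
    E = np.tile(np.arange(L, dtype=float), (P, 1))  # 0-indexed column j holds j
    return (X + E) / L

rng = np.random.default_rng(0)
P, L = 3, 4                      # illustrative sizes (assumption)
X = rng.random((P, L))           # entries drawn from [0, 1)
Z = add_positional_encoding(X)

# Column j (0-indexed) should lie in [j / L, (j + 1) / L).
lower = np.arange(L) / L
upper = (np.arange(L) + 1) / L
in_range = bool(np.all((Z >= lower) & (Z < upper)))
```

Because the column ranges are disjoint, no two columns of the transformed input can coincide, which is the property the subsequent argument relies on.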

We can now invoke our earlier analysis. Let $\delta = 1/(n L_1 L_2)$ for some large enough integer $n$ such that the approximation error in (5) is small enough. Then Lemma B.2 implies there exist feedforward networks $g_{\mathrm{q}}$ and $h_{\mathrm{q}}$ that can quantize the input domains to $\mathbb{G}_{\delta} = \{0, \delta, \ldots, 1-\delta\}^{P \times L_1}$ and $\mathbb{H}_{\delta} = \{0, \delta, \ldots, 1-\delta\}^{P \times L_2}$. Combined with the positional encodings, we only need to consider the following input domains:

$$\begin{aligned}
\mathbb{G}_{\delta,\mathrm{pe}} &:= \left\{ g_{\mathrm{q}}((\mathbf{X} + \mathbf{E})/L_1) \,\middle|\, \mathbf{X} \in [0,1)^{P \times L_1} \right\}, \\
\mathbb{H}_{\delta,\mathrm{pe}} &:= \left\{ h_{\mathrm{q}}((\mathbf{Y} + \mathbf{F})/L_2) \,\middle|\, \mathbf{Y} \in [0,1)^{P \times L_2} \right\}.
\end{aligned}$$

Note that for any $\mathbf{X} \in \mathbb{G}_{\delta,\mathrm{pe}}$, all of its columns are different, and for any two different $\mathbf{X}, \mathbf{X}' \in \mathbb{G}_{\delta,\mathrm{pe}}$, it holds that the columns of $\mathbf{X}$ are not a permutation of the columns of $\mathbf{X}'$.

Then we can invoke Lemma B.3, which shows the existence of an attention network $g_{\mathrm{c}}$ and a vector $\mathbf{u}$ such that for any $\mathbf{X} \in \mathbb{G}_{\delta,\mathrm{pe}}$, it holds that any entry of $\mathbf{u}^\top g_{\mathrm{c}}(\mathbf{X})$ uniquely determines $\mathbf{X}$. Similarly, there exist $h_{\mathrm{c}}$ and $\mathbf{v}$ which implement contextual mapping for documents. Now we just need the following pooling functions: for the query side, the pooling function outputs $\mathbf{v}$ and $g_{\mathrm{c}}(\mathbf{X})_{:,1}$; for the document side, the pooling function outputs $\mathbf{u}$ and $h_{\mathrm{c}}(\mathbf{Y})_{:,1}$. The similarity matrix is then given by

$$\begin{bmatrix}
\mathbf{u}^\top \mathbf{v} & \mathbf{v}^\top h_{\mathrm{c}}(\mathbf{Y})_{:,1} \\
\mathbf{u}^\top g_{\mathrm{c}}(\mathbf{X})_{:,1} & g_{\mathrm{c}}(\mathbf{X})_{:,1}^\top h_{\mathrm{c}}(\mathbf{Y})_{:,1}
\end{bmatrix}$$

In particular, the off-diagonal entries of the similarity matrix are enough to determine the query-document pair. Therefore we can further use MLP scorers to approximate the ground-truth scoring function.
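The assembly of this $2 \times 2$ similarity matrix from the pooled outputs can be sketched as follows. Random vectors stand in for $\mathbf{u}$, $\mathbf{v}$, $g_{\mathrm{c}}(\mathbf{X})_{:,1}$ and $h_{\mathrm{c}}(\mathbf{Y})_{:,1}$; the variable names and the dimension `dim` are assumptions made for illustration, not part of the construction in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # hypothetical embedding dimension

# Placeholders (assumptions): in the proof these come from Lemma B.3.
u = rng.normal(size=dim)
v = rng.normal(size=dim)
g = rng.normal(size=dim)  # stands in for the first column of g_c(X)
h = rng.normal(size=dim)  # stands in for the first column of h_c(Y)

Q = np.stack([v, g], axis=1)  # query-side pooled outputs, shape (dim, 2)
D = np.stack([u, h], axis=1)  # document-side pooled outputs, shape (dim, 2)
S = Q.T @ D                   # 2 x 2 similarity matrix

# Off-diagonal entries: v^T h depends only on the document, u^T g only on the query.
doc_entry, query_entry = S[0, 1], S[1, 0]
```

In this layout the entry `S[1, 0]` carries the query-only quantity and `S[0, 1]` the document-only quantity, which is why the off-diagonal entries suffice to identify the pair.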

Appendix C Proof of Theorem 3.2

To prove Theorem 3.2, we first construct an empirical dataset on which we show, via a rank argument, that a simple dot-product dual encoder has a large approximation error. This empirical dataset can then be extended to a distribution on $[0,1]^{P \times L}$.

Here we let $L_1 = L_2 = L$, i.e., all queries and documents have the same number of tokens. The set of queries is simply $\mathcal{Q} := \{0,1\}^{P \times L}$, i.e., there are $2^{PL}$ queries, each of dimension $P \times L$, with every coordinate equal to either $0$ or $1$. The set of documents is also given by $\mathcal{D} := \{0,1\}^{P \times L}$. Given a query $\mathbf{X} \in \mathcal{Q}$ and a document $\mathbf{Y} \in \mathcal{D}$, define the ground-truth score as

$$K^*(\mathbf{X}, \mathbf{Y}) := \mathrm{tr}(\mathbf{X}^\top \mathbf{Y}). \tag{6}$$

Let $\mathbf{K}^* \in \mathbb{R}^{2^{PL} \times 2^{PL}}$ denote the matrix of ground-truth scores between all query-document pairs. We will show the following result.

Lemma C.1.

Let $T_1: \mathbb{R}^{P \times L} \to \mathbb{R}^{O}$ denote an arbitrary function that maps a query $\mathbf{X} \in \mathcal{Q}$ to an $O$-dimensional vector, and let $T_2: \mathbb{R}^{P \times L} \to \mathbb{R}^{O}$ denote an arbitrary function that maps a document $\mathbf{Y} \in \mathcal{D}$ to an $O$-dimensional vector. Given $\mathbf{X} \in \mathcal{Q}$ and $\mathbf{Y} \in \mathcal{D}$, define the dot-product DE score as $K^{\rm de}(\mathbf{X}, \mathbf{Y}) = T_1(\mathbf{X})^\top T_2(\mathbf{Y})$, and let $\mathbf{K}^{\rm de} \in \mathbb{R}^{2^{PL} \times 2^{PL}}$ denote the matrix of DE scores for all query-document pairs.
If $O \leq PL - 1$, then the mean square error between $\mathbf{K}^*$ and $\mathbf{K}^{\rm de}$ is at least $1/16$:

$$\frac{1}{2^{2PL}} \|\mathbf{K}^* - \mathbf{K}^{\rm de}\|_F^2 \geq \frac{1}{16}.$$

To prove Lemma C.1, we first show the following linear algebra fact.

Proposition C.2.

Let $I_n$ denote the $n$-by-$n$ identity matrix, and let $J_n$ denote the $n$-by-$n$ matrix whose entries are all $1$. For $\lambda > 0$, the matrix $\lambda I_n + J_n$ has rank $n$; its top eigenvalue is $\lambda + n$, while the remaining $n-1$ eigenvalues are $\lambda$.

Proof.

First consider the matrix $J_n$. Let $\mathbf{1}_n$ denote the $n$-dimensional vector whose entries are all $1$; it is an eigenvector of $J_n$ with eigenvalue $n$. Moreover, $J_n$ also has eigenvalue $0$; the corresponding eigenspace is given by $\{\mathbf{z} \in \mathbb{R}^n \mid \sum_i z_i = 0\}$, which has dimension $n-1$. As a result, the eigenvalue $0$ has multiplicity $n-1$.

Moreover, note that for any $n$-by-$n$ matrix $\mathbf{A}$ with eigenvalue $\mu$, the matrix $\lambda I_n + \mathbf{A}$ has an eigenvalue $\lambda + \mu$ with the same eigenvectors. Consequently, the matrix $\lambda I_n + J_n$ has eigenvalue $\lambda + n$ with multiplicity $1$, and eigenvalue $\lambda$ with multiplicity $n-1$. Since $\lambda > 0$, all eigenvalues are nonzero and the rank is $n$. ∎
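Proposition C.2 can be checked numerically on a small instance. The sketch below uses illustrative values `n = 5` and `lam = 0.5` (both assumptions) and compares the spectrum of $\lambda I_n + J_n$ against the claimed eigenvalues.

```python
import numpy as np

n, lam = 5, 0.5  # illustrative values (assumption)
M = lam * np.eye(n) + np.ones((n, n))  # lambda * I_n + J_n

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(M)

top_ok = bool(np.isclose(eigvals[-1], lam + n))   # top eigenvalue is lambda + n
rest_ok = bool(np.allclose(eigvals[:-1], lam))    # remaining n - 1 equal lambda
rank = int(np.linalg.matrix_rank(M))              # full rank since lambda > 0
```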

Next we prove the following properties of 𝐊superscript𝐊\mathbf{K}^{*}bold_K start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT using Proposition C.2.

Lemma C.3.

It holds that $\mathbf{K}^*$ has rank $PL$; its top eigenvalue is $2^{PL-2}(PL+1)$, while the remaining $PL-1$ nonzero eigenvalues are $2^{PL-2}$.

Proof.

Let $\mathbf{U} \in \mathbb{R}^{2^{PL} \times PL}$ denote the matrix whose rows are obtained by flattening elements of $\{0,1\}^{P \times L}$ (i.e., the query set $\mathcal{Q}$ and document set $\mathcal{D}$). It then holds that $\mathbf{K}^* = \mathbf{U}\mathbf{U}^\top$. We will analyze the spectrum of $\mathbf{K}^*$ by considering $\mathbf{U}^\top \mathbf{U}$, since it has the same nonzero eigenvalues as $\mathbf{U}\mathbf{U}^\top$.

We claim that $\mathbf{U}^\top \mathbf{U} = 2^{PL-2}(I_{PL} + J_{PL})$. First consider diagonal entries of $\mathbf{U}^\top \mathbf{U}$. For any $1 \leq i \leq PL$, half of the entries of $\mathbf{U}_{:,i}$ are equal to $0$, and the other half are equal to $1$. As a result, $(\mathbf{U}^\top \mathbf{U})_{i,i} = 2^{PL-1}$. Next we consider off-diagonal entries of $\mathbf{U}^\top \mathbf{U}$. For any $1 \leq i, j \leq PL$ with $i \neq j$, it holds that $U_{k,i} = U_{k,j} = 1$ for $1/4$ of all positions $k$; therefore $(\mathbf{U}^\top \mathbf{U})_{i,j} = 2^{PL-2}$. This proves our claim.

The claim of Lemma C.3 then follows from Proposition C.2. ∎
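For a small instance, Lemma C.3 can be verified by enumerating all binary inputs. The sketch below assumes $P = L = 2$ (so $PL = 4$); the variable names are illustrative only.

```python
import itertools
import numpy as np

P, L = 2, 2          # illustrative sizes (assumption)
d = P * L

# Rows of U are all flattened elements of {0,1}^{P x L}; shape (2^d, d).
U = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)

# K*[a, b] = tr(X_a^T Y_b), which equals the inner product of the flattened matrices.
K_star = U @ U.T

# Claim in the proof: U^T U = 2^{PL-2} (I + J).
gram = U.T @ U
claim_ok = bool(np.allclose(gram, 2 ** (d - 2) * (np.eye(d) + np.ones((d, d)))))

eig = np.linalg.eigvalsh(K_star)[::-1]   # eigenvalues in descending order
rank = int(np.linalg.matrix_rank(K_star))
```

With these sizes the spectrum consists of one eigenvalue $2^{2} \cdot 5 = 20$, three eigenvalues equal to $4$, and twelve zeros.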

Now we can prove Lemma C.1.

Proof of Lemma C.1.

Let $T_1: \mathbb{R}^{P \times L} \to \mathbb{R}^{O}$ denote an arbitrary mapping; in particular, it could represent a Transformer with positional encodings which maps a query $\mathbf{X} \in \mathcal{Q}$ to an $O$-dimensional embedding vector. Furthermore, let $T_1(\mathcal{Q}) \in \mathbb{R}^{2^{PL} \times O}$ denote the embeddings of all queries given by $T_1$. Similarly, let $T_2: \mathbb{R}^{P \times L} \to \mathbb{R}^{O}$ denote an arbitrary mapping which represents the document encoder, and let $T_2(\mathcal{D}) \in \mathbb{R}^{2^{PL} \times O}$ denote embeddings of all documents given by $T_2$.
The matrix of dot-product DE scores is then given by $\mathbf{K}^{\rm de} := T_1(\mathcal{Q}) T_2(\mathcal{D})^\top$.

By definition, $\mathbf{K}^{\rm de}$ has rank at most $O$. If $O \leq PL - 1$, then by the Eckart–Young theorem the squared Frobenius error of any rank-$O$ approximation of $\mathbf{K}^*$ is at least the sum of squares of its discarded singular values, and Lemma C.3 shows that at least one discarded singular value equals $2^{PL-2}$. Therefore

$$\frac{1}{2^{2PL}} \|\mathbf{K}^* - \mathbf{K}^{\rm de}\|_F^2 \geq \frac{1}{2^{2PL}} (2^{PL-2})^2 \geq \frac{1}{16}.$$
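The rank argument can be illustrated numerically. The sketch below, again assuming $P = L = 2$, computes the smallest mean square error achievable by any matrix of rank at most $O$ from the singular values of $\mathbf{K}^*$, and also evaluates one randomly drawn dual encoder. The helper name `best_rank_mse` and the random embeddings are hypothetical.

```python
import itertools
import numpy as np

def best_rank_mse(K, rank):
    """Smallest mean square error of any approximation of K with the given
    rank: sum of squared discarded singular values divided by the entry count."""
    s = np.linalg.svd(K, compute_uv=False)  # singular values, descending
    return float(np.sum(s[rank:] ** 2) / K.size)

P, L = 2, 2          # illustrative sizes (assumption)
d = P * L
U = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
K_star = U @ U.T

# Lower bound on the MSE for every embedding dimension O = 0, ..., PL.
mse_by_rank = {O: best_rank_mse(K_star, O) for O in range(d + 1)}

# A random dot-product dual encoder with O = PL - 1 (stand-in for T_1, T_2).
rng = np.random.default_rng(0)
O = d - 1
T1 = rng.normal(size=(2 ** d, O))
T2 = rng.normal(size=(2 ** d, O))
random_mse = float(np.mean((T1 @ T2.T - K_star) ** 2))
```

In this instance the bound is tight at $O = PL - 1$, where the best achievable error is exactly $1/16$, and it drops to zero once $O = PL$.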

Then we extend Lemma C.1 to Theorem 3.2.

Proof of Theorem 3.2.

Recall that the domain of the ground-truth score $K^*$ defined in (6) is $\{0,1\}^{P \times L} \times \{0,1\}^{P \times L}$. We first extend its domain to $[0,1]^{P \times L} \times [0,1]^{P \times L}$ by quantizing the inputs: given $\mathbf{X} \in [0,1]^{P \times L}$, its quantized version $\widehat{\mathbf{X}} \in \{0,1\}^{P \times L}$ is obtained by mapping all entries less than $1/2$ to $0$ and other entries to $1$. Similarly, given $\mathbf{Y} \in [0,1]^{P \times L}$, we can define its quantized version $\widehat{\mathbf{Y}} \in \{0,1\}^{P \times L}$.
We then let $K^*(\mathbf{X}, \mathbf{Y}) = K^*(\widehat{\mathbf{X}}, \widehat{\mathbf{Y}}) = \mathrm{tr}(\widehat{\mathbf{X}}^\top \widehat{\mathbf{Y}})$. Note that $K^*$ defined in this way is not yet continuous; later we will replace it with a continuous ground-truth function, but we will first use $K^*$ below since it simplifies the analysis.

Let $T_1: \mathbb{R}^{P \times L} \to \mathbb{R}^{O}$ and $T_2: \mathbb{R}^{P \times L} \to \mathbb{R}^{O}$ denote arbitrary mappings. Let

$$\mathbb{M} := \left\{ \mathbf{Z} \in \mathbb{R}^{P \times L} \,\middle|\, Z_{i,j} = 0 \textrm{ or } 1/2,\ 1 \leq i \leq P,\ 1 \leq j \leq L \right\}.$$

For $\mathbf{Z} \in \mathbb{M}$, let $\mathbb{C}_{\mathbf{Z}} := \prod_{i=1}^{P} \prod_{j=1}^{L} [Z_{i,j}, Z_{i,j} + 1/2]$.

Now we want to find a lower bound on

$$\begin{aligned}
& \int_{\mathbf{X} \in [0,1]^{P \times L},\, \mathbf{Y} \in [0,1]^{P \times L}} \left( T_1(\mathbf{X})^\top T_2(\mathbf{Y}) - K^*(\mathbf{X}, \mathbf{Y}) \right)^2 \mathrm{d}\mathbf{X}\, \mathrm{d}\mathbf{Y} \\
={}& \sum_{\mathbf{Z}, \mathbf{Z}' \in \mathbb{M}} \int_{\mathbf{X} \in \mathbb{C}_{\mathbf{Z}},\, \mathbf{Y} \in \mathbb{C}_{\mathbf{Z}'}} \left( T_1(\mathbf{X})^\top T_2(\mathbf{Y}) - K^*(\mathbf{X}, \mathbf{Y}) \right)^2 \mathrm{d}\mathbf{X}\, \mathrm{d}\mathbf{Y} \\
={}& \int_{\mathbf{X} \in \mathbb{C}_{\mathbf{0}},\, \mathbf{Y} \in \mathbb{C}_{\mathbf{0}}} \sum_{\mathbf{Z}, \mathbf{Z}' \in \mathbb{M}} \left( T_1(\mathbf{X} + \mathbf{Z})^\top T_2(\mathbf{Y} + \mathbf{Z}') - K^*(\mathbf{X} + \mathbf{Z}, \mathbf{Y} + \mathbf{Z}') \right)^2 \mathrm{d}\mathbf{X}\, \mathrm{d}\mathbf{Y},
\end{aligned} \tag{7}$$

where $\mathbf{0}$ denotes the $P$-by-$L$ matrix whose entries are all $0$. Note that in (7), for any $\mathbf{X}, \mathbf{Y} \in \mathbb{C}_{\mathbf{0}}$, the inner sum over $\mathbf{Z}, \mathbf{Z}'$ can be lower bounded by $2^{2PL}/16$ using the proof of Lemma C.1. Therefore we have

$$\begin{aligned}
(7) &\geq \int_{\mathbf{X} \in \mathbb{C}_{\mathbf{0}},\, \mathbf{Y} \in \mathbb{C}_{\mathbf{0}}} \frac{2^{2PL}}{16}\, \mathrm{d}\mathbf{X}\, \mathrm{d}\mathbf{Y} \\
&= \frac{2^{2PL}}{16} \cdot \mathsf{vol}(\mathbb{C}_{\mathbf{0}})^2 \\
&= \frac{1}{16},
\end{aligned}$$

where the last equality uses $\mathsf{vol}(\mathbb{C}_{\mathbf{0}}) = 2^{-PL}$.

As mentioned above, $K^*$ is not continuous, and the final step of the proof is to replace it with a continuous ground-truth function. Previously, we quantized the input by transforming entries less than $1/2$ to $0$ and other entries to $1$. Now we use the following transformation function $\phi_\tau$: $\phi_\tau(z) = 0$ if $z \leq \frac{1}{2} - \tau$, $\phi_\tau(z) = 1$ if $z \geq \frac{1}{2} + \tau$, and otherwise $\phi_\tau(z) = \frac{1}{2} + \frac{1}{2\tau}\left(z - \frac{1}{2}\right)$.
Given $\mathbf{X}, \mathbf{Y} \in [0,1]^{P \times L}$, we apply $\phi_\tau$ to every entry of $\mathbf{X}, \mathbf{Y}$ to get $\phi_\tau(\mathbf{X})$ and $\phi_\tau(\mathbf{Y})$, and define $K_\tau^*(\mathbf{X}, \mathbf{Y})$ as

$$K_\tau^*(\mathbf{X}, \mathbf{Y}) := \mathrm{tr}(\phi_\tau(\mathbf{X})^\top \phi_\tau(\mathbf{Y})).$$

Note that $K^*_\tau$ is continuous for any $\tau$, and as $\tau$ goes to $0$, it holds that $K^*_\tau$ becomes arbitrarily close to $K^*$ in $\ell_2$ distance. Therefore there exists a small enough $\tau$ such that $K^*_\tau$ satisfies the requirements of Theorem 3.2. ∎
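The behaviour of the smoothed quantizer can be sketched numerically. The code below implements $\phi_\tau$ with a clipped linear ramp, which matches the three-case definition, and estimates on a uniform grid the $\ell_2$ gap between $\phi_\tau$ and the hard threshold at $1/2$ for a few values of $\tau$. The function names, the grid size, and the chosen values of $\tau$ are assumptions made for illustration.

```python
import numpy as np

def phi_tau(z, tau):
    """Piecewise-linear quantizer: 0 below 1/2 - tau, 1 above 1/2 + tau,
    and a linear ramp with slope 1 / (2 * tau) in between."""
    return np.clip(0.5 + (z - 0.5) / (2.0 * tau), 0.0, 1.0)

def hard_quantize(z):
    """Hard threshold used for K*: entries below 1/2 map to 0, others to 1."""
    return (z >= 0.5).astype(float)

# Uniform grid on [0, 1]; the mean approximates the integral over the interval.
z = np.linspace(0.0, 1.0, 100001)
l2_gap = {
    tau: float(np.sqrt(np.mean((phi_tau(z, tau) - hard_quantize(z)) ** 2)))
    for tau in (0.1, 0.01, 0.001)
}
```

For a single coordinate the squared gap integrates to $\tau/6$, so the $\ell_2$ distance shrinks like $\sqrt{\tau/6}$, consistent with the limiting argument above.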