License: CC BY 4.0
arXiv:2212.09744v3 [cs.CL] 08 Dec 2023

DSI++: Updating Transformer Memory with New Documents

Sanket Vaibhav Mehta$^{1,2}$   Jai Gupta$^{2}$   Yi Tay$^{3}$   Mostafa Dehghani$^{4}$   Vinh Q. Tran$^{2}$
**feng Rao$^{5}$  Marc Najork$^{4}$  Emma Strubell$^{1}$  Donald Metzler$^{2}$

$^{1}$Carnegie Mellon University
$^{2}$Google Research  $^{3}$Google Brain  $^{4}$Google DeepMind  $^{5}$Google

  Work done during an internship at Google Research  Work completed while at Google
Abstract

Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by +21.1% over competitive baselines for NQ and requires 6 times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.

1 Introduction

Differentiable Search Indices (DSIs; Tay et al. (2022)) represent a new modeling paradigm for information retrieval tasks using sequence-to-sequence learning. Specifically, DSIs leverage Transformer memory (Vaswani et al., 2017) to encode all of the information in a corpus of documents and then use that memory to answer user queries directly, thereby simplifying the retrieval process. DSIs achieve this functionality by jointly optimizing for indexing (or memorization) and retrieval tasks. The indexing task requires learning a mapping from document content to its identifier, typically represented by integers or short strings (document identifiers, abbreviated docids). Then, the retrieval task necessitates mapping user queries to relevant docids. Besides its simplicity and end-to-end differentiable nature, DSI significantly outperforms state-of-the-art "retrieve-and-rank" methods based on dual-encoders (Ni et al., 2022).

Despite the remarkable performance of DSI models, there remain open questions about their applicability in the practical setting of dynamic corpora. Consider the realistic scenario wherein new documents are continually added to the indexed corpus. Updating the index in dual-encoder-based methods requires computing embeddings for new documents, followed by re-indexing all document embeddings (Karpukhin et al., 2020). In contrast, index construction using a DSI involves training a Transformer model. Therefore, the model must be re-trained from scratch every time the underlying corpus is updated, thus incurring prohibitively high computational costs compared to dual-encoders. In this work, we aim to address this issue by devising methods for effective incremental indexing using Transformer memory without re-training the DSI model from scratch.

Lifelong (or continual) learning (Thrun, 1995; Parisi et al., 2019) is a biologically-inspired machine learning paradigm that deals with continuous learning of new tasks by preserving past knowledge and using it to learn new concepts efficiently. Based on this paradigm, we propose DSI++ (DSI + new documents), a continual learning challenge for DSI to incrementally index new documents while maintaining the ability to answer user queries related to both previously and newly indexed documents. To enable DSI++, we introduce novel benchmarks constructed from existing Natural Questions (Kwiatkowski et al., 2019) and MS MARCO (Nguyen et al., 2016) datasets, simulating the continual addition of documents to the system. To our knowledge, there is no prior work studying incremental learning for DSI.

Figure 1: Indexing accuracy of $D_0$, $D_1$, and $D_2$ document corpora visualized as we continuously index new documents (averaged over 3 runs). We observe that continual indexing of new documents leads to severe forgetting of the previously memorized documents.

A naive solution for DSI++ is to continuously fine-tune the model with an indexing objective over new documents. However, Figure 1 shows that continual indexing of new documents leads to catastrophic forgetting of the previously memorized documents (more details in §2.1), a common phenomenon in neural networks wherein learning new concepts interferes with previously acquired knowledge (McCloskey and Cohen, 1989). Furthermore, when we investigate the learning dynamics of the DSI model during memorization (Figure 3), we observe that a significant number of documents (approx. 88%) experience forgetting events after they have been memorized. Concretely, a forgetting event (Toneva et al., 2019) occurs when the prediction for an individual document goes from the correct docid to an incorrect one during learning. Therefore, implicit forgetting during memorization and explicit forgetting from continual indexing of new documents are two key challenges to overcome for successfully implementing a DSI++ system.

To reduce forgetting during memorization, we propose explicitly optimizing for flatter loss basins using Sharpness-Aware Minimization (SAM; Foret et al. (2021)). Recent work has shown that the geometry of the minima plays a vital role in forgetting: models in flatter loss basins tend to undergo less forgetting during lifelong learning from task sequences (Mehta et al., 2023). Next, we introduce a generative memory to sample pseudo-queries for already indexed documents and use them to alleviate forgetting of the retrieval task during incremental indexing of the new documents. The generative memory also enables continual semi-supervised learning of the retrieval task by generating pseudo-queries for an incoming batch of new documents. Our main contributions can be summarized as follows:

  • We introduce DSI++, a continual learning challenge for the recently proposed Differentiable Search Indices (DSI) paradigm. To enable DSI++ evaluations, we create two benchmarks based on existing Natural Questions and MS MARCO datasets. To understand the severity of the forgetting phenomenon across multiple scenarios, we analyze a suite of pre-trained models (T5-Base, T5-Large, T5-XL) and different document identifier representations (unstructured atomic, naively structured, and semantically structured).

  • We hypothesize and verify that the DSI model experiences forgetting events throughout memorization. To alleviate these, we propose modifying training dynamics to promote flatter minima using SAM and show that the model stably memorizes 12% more documents.

  • We propose a generative memory-based experience rehearsal approach to alleviate explicit forgetting during continual indexing and improve the average Hits@1 by +25.0% and Hits@10 by +21.1% over competitive baselines for MS MARCO and NQ, respectively.

2 DSI++: Continual learning challenge for DSI

2.1 Problem setup

We focus on a setup where we receive an initial corpus of documents, $D_0=\{d_1,\cdots,d_n\}$, and user queries corresponding to a subset of them, $R_0=\{\langle q_j,j\rangle,\forall j\in\mathcal{Y}_D\}$, where $D\subset D_0$. The DSI paradigm involves two tasks: (i) a memorization task where the goal is to learn an indexer $f_\theta:\mathcal{X}\rightarrow\mathcal{Y}$, a text-to-text model parameterized by $\theta\in\mathbb{R}^P$, that takes document tokens ($x\in\mathcal{X}$) as input and maps them to a document identifier (docid) $j\in\mathcal{Y}$, and (ii) a retrieval task where the goal is to use the same indexer $f_\theta$ to directly map a user query $q$ to a relevant docid $j\in\mathcal{Y}$. Two different prompts are used to differentiate between these tasks. Tay et al. (2022) discuss several variants for representing docids – unstructured atomic and structured string docids, where each document is assigned a unique token and tokenized string, respectively. Under the unified text-to-text format, both of the above tasks are cast as generation tasks, i.e., decoding one unique token (unstructured atomic) or decoding a tokenized string sequentially, one token at a time (naively/ semantically structured).

In the dynamic corpus scenario, we simulate the arrival of new documents by updating the initial corpus $D_0$ with a sequence of batches $D_1\rightarrow\cdots\rightarrow D_t$. In DSI++, we have access to the new batch of documents $D_i$, but we do not have any queries related to these documents.

Goal:

Learn a DSI++ system that incrementally indexes $D_1,D_2,\cdots$ in $f_\theta$ while being able to answer queries related to previously as well as additionally indexed documents.

2.2 Benchmarks for DSI++

To enable research on DSI++, we introduce two benchmarks constructed from the Natural Questions (NQ; Kwiatkowski et al. (2019)) and MS MARCO (Nguyen et al., 2016) datasets. The NQ dataset consists of Wikipedia articles and corresponding natural language questions. Similar to Tay et al. (2022), we consider Wikipedia articles for memorization and the retrieval task as identifying the Wikipedia article that answers the given question. We use the original NQ train split to construct train (80%)/ validation (20%) splits and use NQ validation as a test split. We randomly sample 50K unique articles to constitute the initial $D_0$ corpus. Next, we construct five corpora ($D_1,\cdots,D_5$), each containing 10K unique articles, to add them to the DSI model sequentially. Corresponding to articles in each of these corpora, we filter queries from original NQ train/ validation splits to construct $R_i^{train},R_i^{val},R_i^{test}$ ($\forall i\in\{0,\cdots,5\}$) splits. We use $R_0$ to train the DSI model for the retrieval task and use $R_i^{test}$ to evaluate previously and newly indexed articles. The full MS MARCO dataset has approx. 500K passage-query training pairs and 6,980 validation pairs. Like the benchmark created from the MS MARCO dataset (Pradeep et al., 2023), we randomly sample 50K unique passages to constitute the initial $D_0$ corpus and five more corpora, each with 10K passages. See Table 2 (in the Appendix) for exact dataset statistics for NQ and MS MARCO.
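
The corpus construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `build_corpora` and `filter_queries`, the seed, and the representation of documents as plain ids are all assumptions made for the example.

```python
import random


def build_corpora(doc_ids, initial_size, batch_size, num_batches, seed=0):
    """Randomly partition document ids into D_0 and D_1..D_t without overlap.

    Hypothetical helper: returns a list [D_0, D_1, ..., D_t] of id lists.
    """
    needed = initial_size + batch_size * num_batches
    pool = list(doc_ids)
    if len(pool) < needed:
        raise ValueError("not enough documents for the requested split")
    rng = random.Random(seed)
    # Sampling without replacement guarantees that the corpora are disjoint.
    sampled = rng.sample(pool, needed)
    corpora = [sampled[:initial_size]]
    for i in range(num_batches):
        start = initial_size + i * batch_size
        corpora.append(sampled[start:start + batch_size])
    return corpora


def filter_queries(query_pairs, corpus):
    """Keep the (query, docid) pairs whose docid belongs to the given corpus."""
    members = set(corpus)
    return [(query, docid) for query, docid in query_pairs if docid in members]
```

With `initial_size=50_000`, `batch_size=10_000`, and `num_batches=5`, this mirrors the sizes stated above for $D_0$ and $D_1,\cdots,D_5$.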

2.3 Evaluation Metrics

For DSI evaluation, we report indexing accuracy for the memorization task and the Hits@k ($k\in\{1,10\}$) metric for the retrieval task. Indexing accuracy and Hits@k are the proportion of correctly memorized documents and correct documents ranked in the top k predictions, respectively. We formally define metrics to summarize the model performance as we incrementally index new documents. Let $P_{n,o}$ denote the performance (e.g., indexing accuracy) on corpus $D_o$ after training on corpus $D_n$. Following prior work (Mehta et al., 2023), we compute the average performance ($A_n$), forgetting ($F_n$) and learning performance ($LA_n$) metrics after indexing the corpus $D_n$.
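
The two base metrics, indexing accuracy and Hits@k, can be sketched as below. The function names and the input layout (one ranked list of predicted docids per query) are assumptions chosen for illustration.

```python
def hits_at_k(ranked_predictions, gold_docids, k):
    """Fraction of queries whose gold docid appears among the top-k predictions."""
    if not gold_docids:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_docids)
        if gold in preds[:k]
    )
    return hits / len(gold_docids)


def indexing_accuracy(predicted_docids, gold_docids):
    """Fraction of documents mapped to their correct docid (top-1 prediction)."""
    if not gold_docids:
        raise ValueError("need at least one document")
    correct = sum(1 for p, g in zip(predicted_docids, gold_docids) if p == g)
    return correct / len(gold_docids)
```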

The term $F_n$ (aka backward transfer) refers to the effect of indexing the corpus $D_n$ on the performance of all previously indexed documents $D_o$, where $0\leq o<n$. $LA_n$ (or forward transfer) measures the model's ability to learn when presented with a new corpus $D_n$ and is defined as the average performance over the new corpora $D_1,\cdots,D_n$. When the $D_n^{\text{th}}$ corpus is incrementally indexed, $A_n$, $F_n$, and $LA_n$ are defined as follows:

$$
\begin{aligned}
A_n &= \frac{1}{n+1}\sum_{o=0}^{n}P_{n,o};\quad LA_n=\frac{1}{n}\sum_{o=1}^{n}P_{o,o};\\
F_n &= \frac{1}{n}\sum_{o=0}^{n-1}\max_{o'\in\{0,\cdots,n-1\}}(P_{o',o}-P_{n,o});
\end{aligned}
\tag{1}
$$

Figure 2: Systematic study about forgetting and forward transfer when incrementally indexing new corpus of documents across different model sizes (T5-Base, T5-Large, T5-XL) and docid representations. We use atomic docids by default and denote (N)/(S) for naively/ semantically structured docids. $\uparrow$ indicates higher is better, $\downarrow$ indicates lower is better. All results are averaged over 3 runs. We observe that the average $A_n$ and learning $LA_n$ performance improves by increasing the model scale. However, forgetting $F_n$ is severe across all model scales. Next, we observe that naively structured docids, T5-Base(N), underperform unstructured atomic docids, T5-Base, across all metrics - indexing accuracy, Hits@1 (see Figure 6 in Appendix for Hits@10 results). Imbuing the docid space with a semantic (S) structure alleviates the forgetting compared to an arbitrary/ naive (N) structure.
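
As a worked illustration of Equation (1), the sketch below computes $A_n$, $LA_n$, and $F_n$ from a score matrix. It assumes a hypothetical layout in which `P[n][o]` holds the performance on corpus $D_o$ after training on $D_n$, and that entries are available for every pair of indices used by the maximum in $F_n$ (including scores measured on a corpus before it was indexed).

```python
def average_performance(P, n):
    """A_n: mean performance over corpora D_0..D_n after indexing D_n."""
    return sum(P[n][o] for o in range(n + 1)) / (n + 1)


def learning_performance(P, n):
    """LA_n: mean performance on each new corpus right after it is indexed."""
    if n < 1:
        raise ValueError("LA_n is defined for n >= 1")
    return sum(P[o][o] for o in range(1, n + 1)) / n


def forgetting(P, n):
    """F_n: mean drop from the best earlier score on each previous corpus."""
    if n < 1:
        raise ValueError("F_n is defined for n >= 1")
    total = 0.0
    for o in range(n):
        # Best score on D_o over all earlier training stages o' in {0..n-1}.
        best_earlier = max(P[o_prime][o] for o_prime in range(n))
        total += best_earlier - P[n][o]
    return total / n
```

For example, with `P = [[80, 0, 0], [60, 70, 0], [40, 50, 90]]` and `n = 2`, the best earlier scores on $D_0$ and $D_1$ are 80 and 70, so the drops are 40 and 20 and $F_2$ is their mean.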

2.4 Case study: Forgetting and Forward Transfer

After introducing the DSI++ problem setup, benchmark, and evaluation metrics, we study the behavior of the DSI model as new documents are continuously added to the system. Concretely, we are interested in investigating the following for continual training of the DSI model with indexing objective on new documents – (Q1) How severe is the forgetting for the initially indexed documents? (Q2) How does continual updating of the DSI model over a sequence of corpora affect the forgetting? (Q3) How does the updated DSI model perform on newly indexed documents, especially the retrieval task? (Q4) How do different docid representation strategies affect forgetting? (Q5) How does the DSI model scale affect forgetting? Figure 2 visualizes results on the validation split of DSI++ and helps us convincingly answer these questions.

Forgetting.

From Figure 2, we see that the T5-Base model with atomic docid representation (blue line plots) undergoes significant forgetting. This trend holds across all DSI evaluation metrics - indexing accuracy, Hits@1, and Hits@10 (see Figure 6 in Appendix). For the originally indexed $D_0$ corpus, indexing accuracy and Hits@1 drop by approx. 25 and 20 points, respectively. Further, as we continue indexing the sequence of corpora, we see that forgetting becomes even more severe. For example, after continually indexing the $D_5$ corpus, $F_5$ (forgetting) for indexing accuracy increases to 75. These results provide evidence to answer (Q1) & (Q2) that the DSI model undergoes severe forgetting under continual indexing of new documents.

Forward transfer.

To answer (Q3), we visualize the learning performance ($LA_n$) for all DSI metrics for sequential indexing. From Figure 2, we see $LA_n$ increases in indexing accuracy, suggesting that the DSI model is plastic enough to index new documents. However, from Figure 2, we see a declining trend for Hits@1. Due to the continuous indexing updates, the underlying DSI model drifts and becomes less effective for the retrieval task. These findings hint at an approach that replays indexing and retrieval tasks during continual learning (hence our proposed method in §4).

Docid representations.

For studying (Q4), we consider unstructured atomic, naively (N) structured, and semantically (S) structured docid representations. From Figure 2, we see that T5-Base(N) underperforms T5-Base by a significant margin. For example, the average performance $A_0$ for the Hits@1 metric is approx. 30 and 39 for naive and atomic docids, respectively. Further, as the naively structured approach treats unstructured docids as tokenizable strings as opposed to dedicated unique tokens in the case of atomic docids, they are relatively more prone to interference from new docids (see $F_n$ subplot for indexing accuracy). Imbuing the naive docid space with semantic structure helps to reduce forgetting but still underperforms unstructured docids.

Model scale.

As atomic docids are superior to naive docids, we only consider atomic docids for answering (Q5). From Figure 2, we observe that larger models outperform their smaller counterparts in terms of the average performance $A_n$ and the learning performance $LA_n$ (T5-XL $>$ T5-Large $>$ T5-Base). However, empirically we report that forgetting $F_n$ is severe across all model scales, without any clear best performer, and therefore, we focus on T5-Base for the rest of our experiments.

3 Implicit Forgetting: SAM

Memorization (or indexing) is a primary task in the DSI paradigm where the goal is to learn a neural corpus indexer that takes document content as input and maps it to a document identifier (docid). Under the unstructured atomic docid representation strategy, each docid is assigned a unique token/class label. Given a large number of documents in the corpus (even more than a million), memorization constitutes a challenging instance of the extreme classification setting (Bengio et al., 2019). Furthermore, for every class, we have only one labeled example (i.e., document and its identifier), making this task setup rare. Motivated by this largely unexplored setup, we investigate the learning dynamics for the memorization task throughout training.

Forgetting events.

In Figure 5, we visualize the indexing accuracy for the T5-Base model, optimized with Adafactor (Shazeer and Stern, 2018). We note that the model performance fluctuates throughout training, suggesting unstable memorization. We hypothesize that the model continuously undergoes the forgetting phenomenon wherein subsequent mini-batch updates interfere with the previously memorized documents. To differentiate this phenomenon from forgetting due to adding new documents, we refer to the former as implicit forgetting and the latter as explicit forgetting. To quantify instability during memorization, we compute forgetting event (Toneva et al., 2019) statistics. A forgetting event occurs when an individual document goes from being classified correctly (mapped to the correct docid) to incorrectly during memorization. In Figure 3, we plot the cumulative histogram of forgetting events where almost 88% of the documents undergo forgetting at least once, validating our hypothesis about implicit forgetting.
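
Forgetting-event statistics of this kind can be sketched as follows. The example assumes a hypothetical per-document history: a list of booleans recording whether the document was mapped to its correct docid at each evaluation point during training.

```python
def count_forgetting_events(correct_history):
    """Count transitions from correct (True) to incorrect (False) over training."""
    return sum(
        1
        for prev, cur in zip(correct_history, correct_history[1:])
        if prev and not cur
    )


def fraction_forgotten_at_least_once(histories):
    """Fraction of documents with one or more forgetting events."""
    if not histories:
        raise ValueError("need at least one document history")
    forgotten = sum(1 for h in histories if count_forgetting_events(h) > 0)
    return forgotten / len(histories)
```

A document that is never memorized (all `False`) has zero forgetting events under this definition, since it never transitions from correct to incorrect.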

Figure 3: Investigating the effectiveness of SAM for alleviating implicit forgetting in the T5-Base model by visualizing cumulative histogram of forgetting events. A forgetting event (Toneva et al., 2019) is defined when an individual document goes from being classified correctly to incorrectly over the course of memorization. SAM increases the percentage of examples experiencing zero forgetting events by absolute 12% over Adafactor.

Flatness and forgetting.

Mirzadeh et al. (2020) show that during sequential learning of tasks, flatter minima lead to less forgetting. Further, Mehta et al. (2023) show that pre-trained initialization implicitly alleviates forgetting, as pre-trained models prefer flatter minima, and that explicitly optimizing for flatness using Sharpness-Aware Minimization (SAM; Foret et al. (2021)) further lessens forgetting. Based on these observations, we hypothesize that modifying the training dynamics of the memorization task using SAM should alleviate implicit forgetting.

Sharpness-Aware Minimization.

For the loss function $f$, SAM seeks to find the parameters $w$ that lie in a neighborhood with uniformly low loss by optimizing the following minimax objective: $\min_{w}\max_{||\epsilon||_2\leq\rho}f(w+\epsilon)$, where the maximization region is defined to be an $\ell^p$ ball with radius $\rho$ for $p=2$. Foret et al. (2021) estimate the gradient of the inner maximization by employing a first-order approximation as follows: $\nabla_w\max_{||\epsilon||_2\leq\rho}f(w+\epsilon)\approx\nabla_w f(w)\big\rvert_{w+\hat{\epsilon}(w)}$, where $\hat{\epsilon}(w)=\rho\nabla_w f(w)/||\nabla_w f(w)||_2$. For a given mini-batch $B$, SAM approximately computes a point $w'=w+\hat{\epsilon}(w)$ where loss is maximum and then updates the current model weights $w$ using the gradient at $w'$. We refer readers to Foret et al. (2021) for complete details about this derivation.
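
The two-step update described above can be sketched on a toy problem. This is a minimal illustration only: it assumes plain gradient descent as the base update (not the optimizer used in the experiments), parameters stored as a list of floats, and a hypothetical `grad_fn` that returns the mini-batch gradient at a given point.

```python
import math


def sam_step(w, grad_fn, lr, rho):
    """One SAM update: ascend to w' = w + eps_hat(w), then descend using grad at w'."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: the perturbation direction is undefined, so skip the step.
        return list(w)
    # eps_hat(w) = rho * grad f(w) / ||grad f(w)||_2
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # First-order approximation: use the gradient evaluated at the perturbed point.
    g_adv = grad_fn(w_adv)
    # Base update (assumed here to be vanilla gradient descent) applied to w.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

For the quadratic loss $f(w)=\sum_i w_i^2$ with `grad_fn = lambda w: [2 * x for x in w]`, starting at `[3.0, 4.0]` with `rho=0.5` gives the perturbed point `[3.3, 4.4]`, and the update uses the gradient there rather than at the original weights. Note that each step costs two gradient evaluations.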

SAM alleviates implicit forgetting.

We investigate the applicability of SAM for alleviating the implicit forgetting phenomenon. We use a pre-trained T5-Base model to memorize the $D_0$ corpus containing 50K unique documents. We compare the performance of SAM with the Adafactor optimizer. In Figure 5, we see that SAM outperforms Adafactor in terms of the overall indexing accuracy. We also note that SAM undergoes less severe fluctuations during training, thus hinting at less forgetting. To bolster this claim, in Figure 3, we see that SAM has a significantly higher percentage of documents corresponding to a lower cumulative number of forgetting events, i.e., SAM stably (with zero forgetting events) memorizes 12% more documents than Adafactor. We also note that SAM ($35.9\pm 2.2$) outperforms Adafactor ($32.5\pm 6.4$) when evaluated on the retrieval task (Hits@1) corresponding to $D_0$. Therefore, we set SAM to be our default optimizer for the rest of the experiments.

Discussion.

Mehta et al. (2023) show that explicitly optimizing for flatness using SAM leads to less forgetting, especially in task-incremental learning settings where data undergoes a clear distributional shift. We extend this work to the new DSI paradigm and convincingly demonstrate that SAM helps with the stable memorization of documents. Our results generalize the earlier findings even to the settings where data does not undergo a clear distributional shift (i.e., memorization task). Although SAM helps stably memorize documents, there is still room for improvement, and our work invites more future work in this direction.

4 Explicit Forgetting: Generative Memory

The DSI paradigm consists of two tasks – memorization and retrieval. The previous section showcases that SAM alleviates implicit forgetting by stably memorizing documents. In this section, we focus on the forgetting phenomenon that arises from the continual indexing of new documents, specifically in the context of the retrieval task. Through our systematic study (in §2.4), we show that irrespective of the model scale and docid representations, DSI models undergo severe forgetting. Moreover, we observe that the learning performance $LA_n$ keeps declining for the retrieval task (see Figures 2 and 6 for Hits@1 and Hits@10, respectively). This observation suggests that as we continuously update the DSI model with the indexing objective, the model forgets the retrieval task. In DSI, both memorization and retrieval tasks return docid for input. By setup, we can assume access to previous documents and continue indexing old and new documents to reduce forgetting of the retrieval task. However, in Figure 4, we see that the model still undergoes forgetting (more in §5.2).

Episodic memory.

According to the Complementary Learning Systems theory (McClelland et al., 1995), humans use episodic memory to store and revisit past experiences for retaining learned knowledge. Based on this motivation, memory-based approaches (Sodhani et al., 2022) for continual learning, like Experience Replay (ER; Chaudhry et al. (2019)), use a subset of previous task data to regularize future task learning while minimizing forgetting. Following this idea, one approach for DSI++ is to retain ground-truth queries for the retrieval task in episodic memory and use them to co-train with incremental indexing tasks. However, in DSI++, we cannot access ground-truth queries for an incoming batch of new documents. Even if one retains queries for the initial $D_0$ corpus, we show in Table 1 that such a method suffers from poor forward transfer to newly indexed documents.

Generative memory.

Recent years have seen significant progress in the capabilities of generative language models (Raffel et al., 2020; Brown et al., 2020). Motivated by the success of these models and the inapplicability of episodic memory to DSI++, we pose a question: instead of retaining the ground-truth queries, can we learn a parametric model to generate such queries given a document? Concretely, we propose to train a query generator model to sample queries for previously seen documents and supplement them during incremental indexing. Since we use the generator model to sample queries for sparse experience replay, we call our proposed method generative memory. Moreover, generative memory is also used to generate pseudo-queries for the incoming batch of new documents, thus enabling continual semi-supervised learning of the retrieval task.
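The idea can be sketched as a small data-construction routine: pseudo-queries are sampled for both old and new documents and stored as (query, docid) pairs for replay on the retrieval task. This is only an illustration under stated assumptions. `build_generative_memory` and `toy_generator` are hypothetical names, and `toy_generator` is a trivial stand-in for a trained document-to-query model, not a real generator.

```python
def build_generative_memory(documents, generate_queries, queries_per_doc=1):
    """Return (pseudo_query, docid) pairs for replay on the retrieval task.

    `documents` maps docid -> document text; `generate_queries` is any
    callable (text, n) -> list of n query strings.
    """
    memory = []
    for docid, text in documents.items():
        for query in generate_queries(text, queries_per_doc):
            memory.append((query, docid))
    return memory


def toy_generator(text, n):
    # Stand-in for a trained query generator (assumption): it simply
    # turns short word windows of the document into "queries".
    words = text.split()
    return [" ".join(words[i:i + 3]) + "?" for i in range(n)]


# Hypothetical documents from a previously indexed and a new corpus.
old_docs = {"d0_17": "the eiffel tower is in paris france"}
new_docs = {"d1_03": "photosynthesis converts light into chemical energy"}

# Pseudo-queries for old documents support replay; pseudo-queries for new
# documents provide semi-supervised retrieval examples.
memory = build_generative_memory({**old_docs, **new_docs}, toy_generator)
```

Because the generator is parametric, no ground-truth queries need to be stored, and the same routine applies to every incoming corpus.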

5 Experimentation

Columns under eval corpus $D_0$ measure catastrophic forgetting; columns under eval corpus $D_1$ measure forward transfer.

| Added corpus | Method | $D_0$ Index acc. | $D_0$ Hits@1 | $D_0$ Hits@10 | $D_1$ Index acc. | $D_1$ Hits@1 | $D_1$ Hits@10 |
|---|---|---|---|---|---|---|---|
| $D_0$ | - | $81.8_{1.2}$ | $35.9_{2.2}$ | $66.9_{0.9}$ | - | - | - |
| $D_1$ | cl($D_1$) | $52.4_{3.5}$ | $19.2_{3.9}$ | $43.6_{5.7}$ | $96.5_{0.0}$ | $31.7_{6.4}$ | $55.6_{4.9}$ |
| | cl($U_1 = D_0 \cup D_1$) | $78.2_{0.5}$ | $28.9_{8.9}$ | $59.0_{7.9}$ | $91.8_{0.4}$ | $34.0_{2.4}$ | $60.2_{1.9}$ |
| | cl($U_1$)+epsmem($D_0$) | $77.8_{0.5}$ | $22.9_{1.5}$ | $51.4_{0.5}$ | $93.1_{0.0}$ | $13.1_{2.1}$ | $39.6_{3.1}$ |
| | cl($U_1$)+genmem($D_0$) | $77.8_{0.3}$ | $26.0_{6.9}$ | $54.9_{8.3}$ | $93.0_{0.5}$ | $8.6_{4.8}$ | $31.6_{11.8}$ |
| | cl($U_1$)+epsmem($D_1$) | $53.2_{3.1}$ | $7.7_{2.1}$ | $26.0_{2.0}$ | $96.5_{0.0}$ | $48.3_{2.3}$ | $70.7_{1.9}$ |
| | cl($U_1$)+genmem($D_1$) | $50.1_{0.8}$ | $7.0_{1.2}$ | $23.1_{2.2}$ | $96.5_{0.0}$ | $57.7_{1.5}$ | $76.7_{0.9}$ |
| | cl($U_1$)+genmem($U_1$) | $78.2_{0.3}$ | $18.4_{2.8}$ | $47.5_{3.9}$ | $92.1_{0.3}$ | $48.5_{6.1}$ | $73.8_{2.9}$ |
| | cl($U_1$, docid parameters only) | $78.9_{0.1}$ | $32.7_{5.1}$ | $64.8_{4.2}$ | $94.6_{0.1}$ | $10.8_{3.8}$ | $35.0_{7.3}$ |
| | train from scratch | $78.7_{0.6}$ | $35.9_{1.4}$ | $66.4_{0.0}$ | $79.2_{0.3}$ | $32.9_{1.8}$ | $63.9_{1.2}$ |
Table 1: Comparing performance on incremental indexing of the $D_1$ corpus across different methods. cl($D_1$): continue fine-tuning with the indexing task on $D_1$; cl($U_1$): continue fine-tuning on the updated corpus $U_1$; cl($U_1$)+epsmem($D$): continual indexing of $U_1$ along with ER of queries for $D$; cl($U_1$)+genmem($D$): continual indexing of $U_1$ along with ER of pseudo-queries for $D$. We observe that continual indexing on the updated corpus, cl($U_1$), reduces forgetting compared to indexing only the new corpus, cl($D_1$), in the Natural Questions (NQ) dataset ($|D_0| = 50$K, $|D_1| = 10$K). Next, ER with queries for only $D_0$ or only $D_1$ hurts forward transfer or forgetting, respectively. Our proposed approach of augmenting pseudo-queries for all documents along with continual indexing, cl($U_1$)+genmem($U_1$), alleviates forgetting of the $D_0$ corpus and improves forward transfer to the $D_1$ corpus.

In this section, the models are initialized with the pre-trained T5-Base model, while the additional parameters for atomic docid tokens are randomly initialized. See §A.1 for implementation details.

5.1 Methods

We compare our proposed generative memory-based approach with the following methods:

Continual indexing, cl($\mathbf{D_n}$).

The DSI model is sequentially fine-tuned with the indexing objective on the incoming corpus of documents $D_n$.

Continual indexing with all seen documents, cl($\mathbf{U_n}$).

The DSI model is continuously fine-tuned with the indexing objective on the updated corpus $U_n$ ($\bigcup_{i=0}^{n} D_i$) with the same replay frequency for the old ($\bigcup_{i=0}^{n-1} D_i$) and new ($D_n$) corpora in the tasks mixture.

Continual experience replay using generative memory, genmem($\mathbf{D_n}$).

In this method, the proposed generative memory model is used to sample pseudo-queries corresponding to the corpus $D_n$. Next, these pseudo-queries are used for (sparse) experience replay of the retrieval task samples.

Continual experience replay using episodic memory, epsmem($\mathbf{D_n}$).

In this method, ground-truth queries corresponding to the corpus $D_n$ are used for experience replay of the retrieval task.

cl($U_n$, docid parameters only).

In this method, we only update the parameters corresponding to atomic docid tokens using the updated $U_n$ corpus. This method is, in spirit, a dual-encoder baseline.

Train from scratch, (no cl).

The DSI model is trained from scratch every time a new corpus is added. This method corresponds to a non-continual learning setup and is computationally expensive.

5.2 Results

In this section, we revisit some of the questions (Q1)-(Q3) raised in our case study (see §2.4) to investigate the effectiveness of our proposed generative memory-based approach. To answer these questions, in Tables 1 (NQ) and 3 (MS MARCO), we report the performance of the DSI model on the $D_0$ corpus (to study the forgetting phenomenon) and the $D_1$ corpus (to answer the forward transfer question) after continual indexing on $D_1$. In Figures 4 and 7 (NQ) and Figure 8 (MS MARCO), we report overall performance across DSI metrics as we continuously update the model with the sequence of five corpora ($D_1 \rightarrow \cdots \rightarrow D_5$).
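As a reminder of how the retrieval numbers in these tables are read, Hits@k can be sketched as the percentage of queries whose gold docid appears among the model's top-k predicted docids. The function name and the toy rankings below are illustrative assumptions, not the paper's evaluation code.

```python
def hits_at_k(ranked_docids, gold_docids, k):
    """Percentage of queries whose gold docid is in the top-k predictions.

    `ranked_docids[i]` is the ranked list of docids predicted for query i;
    `gold_docids[i]` is the correct docid for query i.
    """
    hits = sum(
        1 for ranked, gold in zip(ranked_docids, gold_docids) if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_docids)


# Toy predictions for three queries (hypothetical data).
ranked = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = ["d1", "d2", "d0"]

hits1 = hits_at_k(ranked, gold, k=1)  # only the second query is hit at rank 1
hits2 = hits_at_k(ranked, gold, k=2)  # the first two queries are hit within rank 2
```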

Does generative memory alleviate forgetting of old documents?

In Table 1, for the NQ dataset, we report Hits@1 of 35.9 for the model after training on $D_0$. We see that continually indexing both the $D_0$ and $D_1$ corpora (cl($U_1$): 28.9) significantly reduces forgetting of the retrieval task (Hits@1) compared to indexing only the new corpus $D_1$ (cl($D_1$): 19.2). Next, we look at the performance of the ER approaches when combined with continual indexing of all documents. We see that both episodic memory (cl($U_1$)+epsmem($D_0$): 22.9) and generative memory (cl($U_1$)+genmem($D_0$): 26.0) reduce forgetting compared to cl($D_1$) when we replay (pseudo-)queries corresponding to the $D_0$ corpus. Moreover, generative memory outperforms episodic memory without retaining the original queries. Although Table 1 shows that generative memory, cl($U_1$)+genmem($U_1$), underperforms cl($U_1$), Figures 4 and 7 show that generative memory, cl($U_5$)+genmem($U_5$), outperforms cl($U_5$) in terms of both average performance $A_n$ and forgetting $F_n$ over five sequential updates. These results show that ER with generative memory significantly alleviates forgetting of the retrieval task compared to the considered baselines.
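The size of the forgetting after a single update can be read off directly from the Table 1 means. The short worked example below computes the drop in Hits@1 on $D_0$ relative to the 35.9 obtained after the initial training; the dictionary layout is an illustrative choice, and the computation ignores the reported standard deviations.

```python
# Mean Hits@1 on D_0, copied from Table 1 (NQ).
baseline_hits1 = 35.9  # after initial training on D_0
after_update = {
    "cl(D1)": 19.2,
    "cl(U1)": 28.9,
    "cl(U1)+epsmem(D0)": 22.9,
    "cl(U1)+genmem(D0)": 26.0,
    "cl(U1)+genmem(U1)": 18.4,
}

# Drop in Hits@1 (in points) after incrementally indexing D_1.
drops = {method: round(baseline_hits1 - hits1, 1) for method, hits1 in after_update.items()}
```

After this single update, cl($U_1$) has the smallest drop among these methods, which is why the comparison over five sequential updates in Figures 4 and 7 matters for the overall conclusion.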

Figure 4: Investigating the effectiveness of generative memory in mitigating forgetting when continuously indexing new corpus $D_n$ (T5-Base model and atomic docid representation) for the NQ dataset. $\uparrow$ indicates higher is better, $\downarrow$ indicates lower is better. We observe that continual indexing of old and new documents, cl($U_n$), helps to alleviate forgetting of older documents when evaluated on retrieval tasks. However, average Hits@10 ($A_n$) still drops by 23 points after sequential updates ($D_0 \rightarrow D_1 \rightarrow \cdots \rightarrow D_5$). Generative memory enables sparse replaying of pseudo-queries for old documents and continual semi-supervised learning with new documents. We observe that augmenting generative memory during continual indexing not only reduces forgetting ($F_n$) but also improves average Hits@10 ($A_n$) by +21.1% over considered baselines (see Figure 7 for Hits@1 results and Figure 8 for MS MARCO results in the Appendix).

Does generative memory enable forward transfer to new documents?

One of the goals of DSI++ is to enable answering queries related to newly indexed documents. Towards this goal, in Table 1, for the NQ dataset, we look at the retrieval task performance (Hits@1) for $D_1$ after incrementally indexing $D_1$. To compare different methods, we consider a supervised baseline in the form of ER with ground-truth queries for $D_1$ (cl($U_1$)+epsmem($D_1$): 48.3). We see that without any fine-tuning on the retrieval task for $D_1$, incremental learning with the indexing objective shows impressive forward transfer (or zero-shot gains; cl($D_1$): 31.7 and cl($U_1$): 34.0). Moreover, ER with generative memory outperforms the supervised baseline (cl($U_1$)+genmem($D_1$): 57.7). However, we notice that replaying queries corresponding to only $D_0$ hurts forward transfer to $D_1$ (cl($U_1$)+genmem($D_0$): 8.6), while replaying queries corresponding to only $D_1$ amplifies forgetting of $D_0$ (cl($U_1$)+genmem($D_1$): 7.0). These results suggest that the memory module should include (pseudo-)queries corresponding to both old and new documents. From Figure 4, we see that the continual indexing method cl($U_n$) has a downward trend for $LA_n$ (Hits@10) and therefore eventually forgets the retrieval task. On the other hand, $LA_n$ for ER with generative memory stays relatively constant, providing evidence against forgetting. In summary, ER with generative memory enhances retrieval task performance by reducing forgetting of indexed documents and enabling forward transfer to newly indexed documents.

Does generative memory generalize to different datasets?

In Table 3, for the MS MARCO dataset, we report Hits@1 of 78.2 after training on $D_0$ passages. We see that continually indexing both the $D_0$ and $D_1$ corpora (cl($U_1$): 76.5 and cl($U_1$)+genmem($U_1$): 73.7) significantly reduces forgetting of the retrieval task (Hits@1) compared to indexing only the new corpus $D_1$ (cl($D_1$): 68.0). Next, we look at the retrieval task performance (Hits@1) for $D_1$ after incrementally indexing $D_1$. We see that without any fine-tuning on the retrieval task for $D_1$, incremental learning with the indexing objective shows impressive forward transfer (cl($D_1$): 36.1 and cl($U_1$): 35.3). Moreover, ER with generative memory (cl($U_1$)+genmem($U_1$): 80.6) performs far better than the incremental indexing objective alone. Similar to the results with the NQ dataset, we show that ER with generative memory, cl($U_n$)+genmem($U_n$), improves the overall performance for the retrieval task, reducing forgetting of previously indexed documents and enabling forward transfer to new documents compared to continual indexing of all documents, cl($U_n$). Our results hold across two datasets, showcasing the generalizability of our approach.

Investigating the effectiveness of the generative memory with the scale of a corpus.

We conduct experiments with the full MS MARCO dataset ($\approx 8.9$M passages). We construct two corpora: $D_0 = 8$M and $D_1 = 841{,}823$ passages. We train the DSI model using $D_0$ passages and incrementally add $D_1$ passages. In Table 3, we report results for MS MARCO. We see that continual fine-tuning with the indexing task on $D_1$, cl($D_1$), completely forgets the retrieval task for $D_0$ passages (Hits@1 goes to 0.1 from 16.3). However, the generative memory-based approach significantly reduces forgetting (Hits@1 of 7.3). Moreover, generative memory enables continual semi-supervised learning by augmenting pseudo-queries for $D_1$ passages, thereby improving forward transfer (Hits@1 of 31.6 vs. 18.2 for cl($D_1$)). Our proposed solution reduces forgetting in large corpus settings.

Investigating sparsity of experience replay (ER) on forgetting.

ER with generative memory co-trains the indexing and pseudo-labeled retrieval tasks. Tay et al. (2022) introduce a mixing ratio $r$ to define the ratio of indexing to retrieval samples. The mixing ratio is inversely related to the frequency of ER updates, i.e., a higher $r$ (more indexing samples) corresponds to sparser updates from pseudo-labeled retrieval samples. Following Tay et al. (2022), we consider $r \in \{2, 32\}$ for our analysis. From Figure 4, we see that $r = 32$ (sparse replay) slightly outperforms $r = 2$ in terms of average performance, forgetting, and learning accuracy. These results suggest that even sparse regularization updates from ER positively influence backward and forward transfer in DSI++.
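To make the role of the mixing ratio concrete, the sketch below interleaves $r$ indexing samples with one pseudo-labeled retrieval sample. It is a simplified, deterministic illustration: an actual training mixture would more likely sample tasks at rates proportional to $r : 1$, and the name `mix_tasks` and the toy samples are assumptions.

```python
import itertools


def mix_tasks(indexing_samples, retrieval_samples, r):
    """Yield r indexing samples for every one retrieval (replay) sample."""
    indexing = itertools.cycle(indexing_samples)
    retrieval = itertools.cycle(retrieval_samples)
    while True:
        for _ in range(r):
            yield ("index", next(indexing))
        yield ("retrieve", next(retrieval))


# Hypothetical toy samples.
stream = mix_tasks(["doc text 1", "doc text 2"], ["pseudo query 1"], r=2)
first_six = list(itertools.islice(stream, 6))

# Fraction of training samples that come from replay, for both settings.
replay_fraction = {r: 1 / (r + 1) for r in (2, 32)}
```

Under this scheme, $r = 2$ means one in three samples is a replayed pseudo-query, while $r = 32$ means roughly one in thirty-three, which is the sense in which the larger ratio is the sparser replay.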

Analyzing index construction time for DSI++.

DSI involves training a Transformer model for index construction. DSI++ allows incremental updating of the indexer. In Figures 4, 7, and 8, we demonstrate that our incremental indexer updating method surpasses the “train from scratch” baseline in terms of $A_n$. Note that the “train from scratch” baseline can serve as a performance upper bound for continual learning when there is no detrimental interference among tasks, and all tasks are evenly balanced. However, in the case of DSI++, there exists an initial base corpus that is larger than subsequent corpora, leading to an imbalance among tasks. Consequently, “train from scratch” should be regarded as a competitive baseline rather than an inherent upper bound. This is also the reason behind reporting the learning accuracy ($LA_n$) for every metric, which can be seen as an upper bound since it maintains a running average of the best performance across all corpora. Furthermore, one of the key objectives of continual learning is to leverage prior knowledge to enhance the learning of new tasks. Indeed, from Tables 1 and 3, we observe that our proposed method excels in forward transfer compared to the “train from scratch” approach.
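The quantities $A_n$, $F_n$, and $LA_n$ referred to in this section can be computed from a matrix of per-corpus scores. The sketch below assumes the standard continual-learning definitions, with `P[i][j]` denoting the score (for example Hits@10) on corpus $D_j$ after the $i$-th update; the exact definitions are the ones given earlier in the paper, and the function names and toy values here are illustrative.

```python
def average_performance(P, n):
    """A_n: mean score over corpora D_0..D_n after the n-th update."""
    return sum(P[n][j] for j in range(n + 1)) / (n + 1)


def forgetting(P, n):
    """F_n: mean drop from the best earlier score on each older corpus."""
    drops = [max(P[l][j] for l in range(j, n)) - P[n][j] for j in range(n)]
    return sum(drops) / n


def learning_accuracy(P, n):
    """LA_n: mean score on each corpus right after it was indexed."""
    return sum(P[j][j] for j in range(n + 1)) / (n + 1)


# Toy lower-triangular score matrix (hypothetical values).
P = [
    [60.0],
    [50.0, 70.0],
    [40.0, 60.0, 80.0],
]

a2 = average_performance(P, 2)   # (40 + 60 + 80) / 3
f2 = forgetting(P, 2)            # ((60 - 40) + (70 - 60)) / 2
la2 = learning_accuracy(P, 2)    # (60 + 70 + 80) / 3
```

In this toy example $LA_n$ exceeds $A_n$ because every corpus scores best right after it is indexed, which matches the reading of $LA_n$ as an upper bound on $A_n$.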

For the NQ dataset, indexing the initial $D_0$ corpus of 50K documents requires 350K training steps. If we sequentially index additional $D_1$ to $D_5$ corpora (10K each) by re-training the DSI model each time, it would require around 1.75M steps. In contrast, our approach only requires slightly above 300K additional updates to incrementally index all corpora, which is approximately six times fewer updates. Our approach achieves superior overall performance compared to re-training from scratch, while also being more computationally efficient.
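The step counts above can be checked with a few lines of arithmetic. The sketch assumes that each re-training run costs the same 350K steps as indexing $D_0$ (which is what the 1.75M figure implies) and uses 300K as a round stand-in for "slightly above 300K", since the exact count is not stated.

```python
steps_per_full_training = 350_000  # reported for indexing D_0 on NQ
num_incremental_corpora = 5        # D_1 through D_5

# Re-training from scratch once per added corpus.
retrain_total = steps_per_full_training * num_incremental_corpora

# Continual indexing; exact value is "slightly above 300K" (assumed 300K here).
continual_total = 300_000

ratio = retrain_total / continual_total
```

The resulting ratio is a little under six, consistent with the "approximately six times fewer updates" stated above.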

6 Conclusion

DSI++ introduces a new approach to address a crucial requirement of DSI models for practical use in production setups, where continuous addition of new documents to the corpus is necessary. Through experiments, we demonstrate the effectiveness of our proposed solutions: sharpness-aware minimization and generative memory, which significantly reduce catastrophic forgetting. This work establishes a foundation for further research, benefiting both DSI models and the broader community of continual (semi-supervised) learning.

Limitations

In this study, we explore the phenomenon of forgetting in relation to the addition of new and distinct documents into the indexer. It is important to note that when a new document refutes or modifies a previously indexed document, the model’s behavior becomes unpredictable, requiring further analysis. Additionally, we examine the effectiveness of our proposed method on a larger dataset, such as the full MS MARCO dataset. However, it is worth noting that with this larger dataset, the method exhibits significant forgetting. As a result, additional research is necessary to enhance the model’s performance, particularly when dealing with datasets of larger scales.

Ethics Statement

Training large models is expensive and can have a detrimental impact on the environment (Strubell et al., 2019). Continual learning on top of existing models is preferable to re-training from scratch in this regard since it requires many fewer training steps. With DSI++, we aim to reduce the need to re-train DSI models from scratch whenever a new set of documents is added to the corpus thereby making it cheaper and better for the environment. Concretely, in §5.2, we analyze the index construction time for DSI++ and show that our approach is computationally efficient in comparison to re-training the model from scratch. At the same time, we acknowledge that reduced cost can increase overall consumption (Jevons’ paradox).

Acknowledgements

We thank the anonymous reviewers for their valuable feedback and suggestions, which helped improve the paper. We also thank Ronak Pradeep and Kai Hui for help with the MS MARCO setup, Tal Schuster and Raghuram Mandyam Annasamy for reviewing the paper, and William W. Cohen, Aditya Gupta, Dara Bahri, and Fuzhao Xue for sharing insights and intuitions during initial discussions. We would like to thank COMEDY (COhorts of Maarten Sap, Emma Strubell, Daniel Fried, and Yonatan Bisk) lab members for reviewing the paper and providing valuable comments; Jeremiah Milbauer, Clara Na, Jared Fernandez, Nupoor Gandhi, Zhisong Zhang, and Vijay Viswanathan also gave constructive feedback on drafts and tables.

References

A Appendix

A.1 Implementation Details

We utilize the pre-trained T5-Base (Raffel et al., 2020) to initialize all models and randomly initialize the additional parameters for atomic docid tokens. Bahri et al. (2022) demonstrate the successful applicability of SAM for language model generalization, especially in pre-trained T5 models. We mainly follow Bahri et al. (2022) to set our hyper-parameters: $\rho = 0.15$ and a batch size of 32 for the inner maximization step in SAM.
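For readers unfamiliar with SAM, a single update can be sketched as follows: compute the gradient, step a distance $\rho$ along the normalized gradient to a nearby worst-case point, and apply the gradient measured there to the original weights. This is a minimal pure-Python illustration on a toy quadratic loss, assuming plain gradient descent as the base update. It is not the T5X implementation used in the experiments, and `sam_step`, `toy_grad`, and `toy_loss` are hypothetical names.

```python
import math


def sam_step(weights, grad_fn, lr=0.05, rho=0.15):
    """One SAM update with a plain gradient-descent base step (assumption)."""
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid division by zero
    # Inner maximization: move rho along the normalized gradient direction.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Outer minimization: descend using the gradient at the perturbed point.
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]


def toy_loss(w):
    # Toy quadratic with different curvature per coordinate.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def toy_grad(w):
    # Analytic gradient of toy_loss.
    return [w[0], 10.0 * w[1]]


w = [1.0, 1.0]
initial_loss = toy_loss(w)
for _ in range(50):
    w = sam_step(w, toy_grad)
final_loss = toy_loss(w)
```

With a fixed $\rho$ the iterates hover in a small neighborhood of the minimum rather than converging exactly, since the gradient is always evaluated at a perturbed point.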

While indexing the $D_0$ corpus, we train all the models for a maximum of 1M steps with a warmup of 100K steps. During continual indexing of other corpora, we train for a maximum of 100K steps with a warmup of 100 steps. For the rest of the hyper-parameters, we follow Tay et al. (2022): we set the learning rate to 0.001, batch size to 128, and input sequence length to 32. We evaluate models after every 5K steps and retain the checkpoint yielding the best performance. For the initial training with the $D_0$ corpus, we co-train on indexing and retrieval tasks; therefore, we use the average of all DSI metrics (indexing accuracy, Hits@1, and Hits@10) for model selection. For the continual learning experiments, we have access to only indexing accuracy for all involved corpora, so we use it for model selection.
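The model-selection rule described above can be sketched as picking the evaluated step with the highest mean over the available metrics. The function name, the metric keys, and the evaluation values are hypothetical; for the continual-learning runs the metric dictionary would contain indexing accuracy only.

```python
def select_checkpoint(evals):
    """Return the training step whose mean metric value is highest.

    `evals` maps a training step to a dict of metric name -> value.
    """
    return max(evals, key=lambda step: sum(evals[step].values()) / len(evals[step]))


# Hypothetical evaluations taken every 5K steps during initial training.
evals = {
    5_000: {"index_acc": 40.0, "hits@1": 10.0, "hits@10": 30.0},
    10_000: {"index_acc": 80.0, "hits@1": 35.0, "hits@10": 65.0},
    15_000: {"index_acc": 82.0, "hits@1": 30.0, "hits@10": 60.0},
}
best_step = select_checkpoint(evals)
```

Here the 10K-step checkpoint wins even though indexing accuracy is higher at 15K steps, because the retrieval metrics pull the average down.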

To train a parametric model for generative memory, we utilize the retrieval dataset $R_0$, which corresponds to the $D_0$ corpus. We set the maximum sequence length for document contents to 1024, the target length for generated queries to 32, and the batch size to 128; we train for a maximum of 100K steps and use BLEU for model selection. We use beam decoding to generate pseudo-queries. We tune the learning rate amongst $\{0.001, 0.0005\}$ and the linear warmup amongst $\{1\text{K}, 10\text{K}\}$ steps. For all our experiments, we use the T5X (Roberts et al., 2022) framework along with 4-8 TPUv4 chips to train the models.

| Dataset | #D | NQ #Train | NQ #Validation | NQ #Test | MS MARCO #Train | MS MARCO #Validation | MS MARCO #Test |
|---|---|---|---|---|---|---|---|
| $R_0$ | 50K | 53.8K | 13.5K | 3.9K | 2M | 25.0K | 3.6K |
| $R_1$ | 10K | 10.7K | 2.7K | 809 | 400K | 5.1K | 762 |
| $R_2$ | 10K | 10.6K | 2.7K | 787 | 400K | 5.1K | 770 |
| $R_3$ | 10K | 10.7K | 2.7K | 727 | 400K | 4.9K | 734 |
| $R_4$ | 10K | 10.9K | 2.7K | 772 | 400K | 4.9K | 730 |
| $R_5$ | 10K | 10.7K | 2.7K | 847 | 400K | 4.9K | 660 |
Table 2: DSI++ dataset statistics for NQ and MS MARCO: memorization and retrieval tasks.
Figure 5: Investigating the effectiveness of SAM for alleviating implicit forgetting in the T5-Base model by visualizing indexing accuracy during memorization. We observe serious fluctuations in the indexing accuracy in the case of the Adafactor optimizer, thereby suggesting unstable memorization. SAM leads to relatively stable memorization of documents.
MS MARCO: $|D_0| = 50K$, $|D_1| = 10K$

| Added corpus | Method | $D_0$ Index acc. | $D_0$ Hits@1 | $D_0$ Hits@10 | $D_1$ Index acc. | $D_1$ Hits@1 | $D_1$ Hits@10 |
|---|---|---|---|---|---|---|---|
| $D_0$ | - | $99.4_{0.2}$ | $78.2_{0.2}$ | $95.0_{0.1}$ | - | - | - |
| $D_1$ | cl($D_1$) | $46.7_{18.6}$ | $68.0_{2.0}$ | $87.3_{1.3}$ | $99.8_{0.0}$ | $36.1_{9.5}$ | $65.8_{6.9}$ |
| | cl($U_1$) | $99.4_{0.0}$ | $76.5_{0.7}$ | $94.2_{0.3}$ | $99.8_{0.0}$ | $35.3_{4.1}$ | $64.4_{3.3}$ |
| | cl($U_1$)+genmem($U_1$) | $99.3_{0.1}$ | $73.7_{0.2}$ | $93.9_{0.3}$ | $99.8_{0.0}$ | $80.6_{1.0}$ | $95.5_{0.1}$ |
| | train from scratch | $99.5_{0.0}$ | $75.0_{0.2}$ | $93.9_{0.1}$ | $99.6_{0.0}$ | $73.4_{1.3}$ | $93.4_{0.9}$ |

MS MARCO (full): $|D_0| = 8M$, $|D_1| = 842K$

| Added corpus | Method | $D_0$ Index acc. | $D_0$ Hits@1 | $D_0$ Hits@10 | $D_1$ Index acc. | $D_1$ Hits@1 | $D_1$ Hits@10 |
|---|---|---|---|---|---|---|---|
| $D_0$ | - | 99.4 | 16.3 | 46.8 | - | - | - |
| $D_1$ | cl($D_1$) | 0.0 | 0.1 | 0.6 | 97.9 | 18.2 | 40.5 |
| | cl($U_1$)+genmem($U_1$) | 20.4 | 7.3 | 31.3 | 86.6 | 31.6 | 65.8 |

Columns under $D_0$ (evaluation corpus $= D_0$) measure catastrophic forgetting; columns under $D_1$ (evaluation corpus $= D_1$) measure forward transfer.
Table 3: Comparing performance on incremental indexing of the $D_1$ corpus across different methods. cl($D_1$): continue fine-tuning with the indexing task on $D_1$; cl($U_1$): continue fine-tuning on the updated corpus $U_1$; cl($U_1$)+genmem($D$): continual indexing of $U_1$ along with ER of pseudo-queries for $D$. We observe that continual indexing on the updated corpus, cl($U_1$), reduces forgetting compared to just indexing the new corpus, cl($D_1$), in the MS MARCO dataset. Our proposed approach of augmenting pseudo-queries for all documents along with continual indexing, cl($U_1$)+genmem($U_1$), alleviates forgetting of the $D_0$ corpus and improves forward transfer to the $D_1$ corpus. We also show that our proposed solution reduces forgetting of the $D_0$ ($=8M$) passages during incremental indexing in a large corpus setting, MS MARCO (full), containing $8.9M$ passages.
Figure 6: Systematic study of forgetting and forward transfer when incrementally indexing new corpora of documents across different model sizes (T5-Base, T5-Large, T5-XL) and docid representations. We use atomic docids by default and denote (N)/(S) for naively/semantically structured string docids. $\uparrow$ indicates higher is better, $\downarrow$ indicates lower is better. We observe that by increasing the model scale, the average $A_n$ and learning $LA_n$ performance improves. However, forgetting $F_n$ is severe across all model scales. Moreover, we observe that naive string docids (N) underperform atomic docids across the Hits@10 metric. Similar to Figure 2, imbuing the docid space with a semantic (S) structure alleviates the forgetting compared to an arbitrary/naive (N) structure.
Figure 7: Investigating the effectiveness of generative memory in mitigating forgetting when continuously indexing a new corpus $D_n$ (T5-Base model and atomic docid representation) for the NQ dataset. $\uparrow$ indicates higher is better, $\downarrow$ indicates lower is better. We observe that continual indexing of old and new documents, cl($U_n$), helps to alleviate forgetting of older documents when evaluated on retrieval tasks. However, average Hits@1 ($A_n$) still undergoes a 19-point drop after sequential updates ($D_0 \rightarrow D_1 \cdots \rightarrow D_5$). We observe that augmenting generative memory during continual indexing not only reduces the forgetting ($F_n$) but also improves average Hits@1 ($A_n$) by $+17.3\%$ over continual indexing.
Figure 8: Investigating the effectiveness of generative memory in mitigating forgetting when continuously indexing a new corpus $D_n$ (T5-Base model and atomic docid representation) for the MS MARCO dataset. $\uparrow$ indicates higher is better, $\downarrow$ indicates lower is better. We observe that continual indexing of old and new documents, cl($U_n$), helps to alleviate forgetting of older documents when evaluated on retrieval tasks. However, average Hits@10 ($A_n$) still undergoes a 25.0-point drop after sequential updates ($D_0 \rightarrow D_1 \cdots \rightarrow D_5$). Generative memory enables sparse replaying of pseudo-queries for old documents and continual semi-supervised learning with new documents. We observe that augmenting generative memory during continual indexing not only reduces the forgetting ($F_n$) but also improves average Hits@10 ($A_n$) by $+23.0\%$ over the considered baselines.
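The quantities $A_n$, $F_n$, and $LA_n$ reported in Figures 6-8 can be illustrated with a small sketch. The formulas below follow the commonly used continual-learning definitions (average performance, forgetting as the drop from the best earlier score, and learning performance right after each corpus is indexed); they are stated here as an assumption, since this appendix does not restate the exact definitions. The matrix `P` and its values are invented, with `P[j][i]` denoting performance on corpus $D_i$ after indexing corpus $D_j$.

```python
def average_performance(P, n):
    """A_n: mean performance over corpora D_0..D_n after indexing D_n."""
    return sum(P[n][i] for i in range(n + 1)) / (n + 1)


def forgetting(P, n):
    """F_n: mean drop from the best earlier score to the score after indexing D_n."""
    if n == 0:
        return 0.0  # nothing has been indexed before, so nothing can be forgotten
    total = 0.0
    for i in range(n):
        best_earlier = max(P[j][i] for j in range(i, n))
        total += best_earlier - P[n][i]
    return total / n


def learning_performance(P, n):
    """LA_n: mean performance on each corpus measured right after it was indexed."""
    return sum(P[i][i] for i in range(n + 1)) / (n + 1)


# Invented lower-triangular results for three corpora (D_0, D_1, D_2).
P = [
    [80.0],
    [60.0, 70.0],
    [50.0, 55.0, 75.0],
]
a_n = average_performance(P, 2)
f_n = forgetting(P, 2)
la_n = learning_performance(P, 2)
```

Under these assumed definitions, higher $A_n$ and $LA_n$ are better and lower $F_n$ is better, matching the arrows used in the figure captions.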

A.2 Related Work

We review relevant prior work along two dimensions: application setups related to DSI++ and continual learning methods to alleviate forgetting and enable forward transfer.

Language models (LMs) as knowledge bases (KBs).

Petroni et al. (2019) show that pre-trained BERT (Devlin et al., 2019) models capture relational knowledge comparable to that of the KBs constructed using off-the-shelf techniques. Concretely, these models can be used to extract factual knowledge about relations between entities by providing a prompt to predict missing words in a cloze-style template (e.g., “New Delhi is the capital of ____”). Similarly, Roberts et al. (2020) demonstrate that pre-trained T5 (Raffel et al., 2020) models can be employed to answer open-domain questions without access to any external knowledge or context. However, unlike structured KBs, it is non-trivial to update knowledge stored implicitly in the weights of these models. Therefore, Zhu et al. (2020) introduce an experimentation setup where the task is to update facts stored within the pre-trained models and propose a constrained optimization method, similar to Elastic Weight Consolidation (Kirkpatrick et al., 2017), to alleviate catastrophic forgetting. With similar motivation, Dhingra et al. (2022) introduce a diagnostic dataset to probe LMs for facts that change over time. They also suggest jointly modeling text with its timestamp for improved memorization of seen facts. Recent works have been investigating efficient ways to localize and edit facts stored within the LMs (AlKhamissi et al., 2022) using finetuning (Zhu et al., 2020; Dhingra et al., 2022), hyper-networks (De Cao et al., 2021; Mitchell et al., 2022), and direct editing (Meng et al., 2022). Although this is a crucial line of work on updating facts in pre-trained LMs, using prompting as the probing mechanism only provides a lower-bound estimate of the knowledge contained in these models (Jiang et al., 2020). In contrast, we explicitly focus on the memorization task in DSI++. This task helps us to answer questions related to catastrophic forgetting more convincingly, rather than being bounded by the mechanism used to probe these models.

Optimization-based approaches for continual learning encode the necessary inductive biases required to enable continual learning by modifying the training dynamics. Flatter minima are shown to alleviate forgetting (Mirzadeh et al., 2020). Further, Mehta et al. (2023) showed that explicitly optimizing for flatter loss basins using Sharpness-Aware Minimization (SAM; Foret et al., 2021) reduces forgetting. Building on these works, we show that flatter minima induced by SAM reduce implicit forgetting during memorization, thereby leading to more stable memorization (see §3).

Memory-based (aka data-based regularization) approaches for continual learning constrain the parameter updates based on the previous task examples sampled from memory. Sparse experience replay using episodic memory (Chaudhry et al., 2019) is a prominent approach, and in §4, we discuss its limitations for DSI++. Next, Shin et al. (2017) and Sun et al. (2020) learn a parametric model to reconstruct the examples for seen tasks. However, in DSI++, we do not see queries for the new documents. Therefore, we use a parametric memory to generate pseudo-queries for already indexed (older) documents and an incoming batch of new documents, thus enabling us to leverage unlabeled data (in the form of new documents) for continual semi-supervised learning. In contrast, Sun et al. (2020) assume that the incoming data are fully labeled, which is not applicable in DSI++ (we do not get to see queries for the new documents). Furthermore, Sun et al. (2020) show that using a parametric model underperforms episodic memory. In our work, we do not generate example pairs $(x, y)$ but rather generate pseudo-queries ($y$), similar to contemporary works (Zhuang et al., 2022; Bonifacio et al., 2022). We show that our approach outperforms episodic memory. Lastly, in the context of pseudo-query generation, neural models are prone to hallucinating additional content not supported by the input documents. Future works can study methods to filter out noisy pseudo-queries (Mehta et al., 2022) during incremental indexing.
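As a rough illustration of how a generative memory can supplement continual indexing, the sketch below mixes indexing examples (document text to docid) for the updated corpus with retrieval examples built from pseudo-queries for sampled old and new documents. Everything here is hypothetical: `generate_pseudo_query` is a trivial stand-in for a trained query-generation model, and `replay_ratio`, the example tuples, and the toy documents are assumptions made only for this example.

```python
import random


def generate_pseudo_query(doc_text):
    """Hypothetical stand-in for a trained query generator: echoes the first few words."""
    return " ".join(doc_text.split()[:4]) + "?"


def build_training_examples(old_docs, new_docs, replay_ratio=0.5, seed=0):
    """Mix indexing examples for the updated corpus with pseudo-query retrieval examples."""
    rng = random.Random(seed)
    updated = {**old_docs, **new_docs}  # updated corpus: old plus newly added documents
    # Indexing task: map document text to its docid.
    examples = [("index", text, docid) for docid, text in updated.items()]
    # Retrieval task: map a generated pseudo-query to the docid of its source document.
    num_replay = int(replay_ratio * len(examples))
    for docid in rng.choices(sorted(updated), k=num_replay):
        examples.append(("retrieve", generate_pseudo_query(updated[docid]), docid))
    rng.shuffle(examples)
    return examples


old_docs = {"d0": "the capital of india is new delhi", "d1": "t5 is a text to text model"}
new_docs = {"d2": "sharpness aware minimization seeks flat minima", "d3": "beam search decodes sequences"}
examples = build_training_examples(old_docs, new_docs)
```

Because the pseudo-queries are generated from document text alone, the same routine applies to new documents for which no real queries have been observed.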

Test time adaptation approaches for continual learning use episodic memory at inference time to alter the model weights before making predictions (Rebuffi et al., 2017; Sprechmann et al., 2018; de Masson D’Autume et al., 2019; Wang et al., 2020). Updating the DSI indexer for every user query is computationally expensive, so we focus on continual learning methods during training. Apart from continual learning-focused approaches, the retrieval-augmented generation family of approaches (Guu et al., 2020; Izacard and Grave, 2021; Borgeaud et al., 2022) retrieves auxiliary passages/documents to enhance pre-trained language models. These approaches alter test-time predictions of the generative models by augmenting their input with relevant passages retrieved from an external retrievable memory. Moreover, when using the external retrievable memory, one explicitly disables updates to the employed pre-trained (and retrieval) model. Such approaches therefore do not faithfully assess the fundamental challenge of learning continually, specifically catastrophic forgetting. In contrast, our work focuses on the recently introduced DSI paradigm (Tay et al., 2022), where information in the document corpus is encoded into the model parameters. Therefore, any update to the underlying corpus necessitates updating the model parameters, which in turn causes severe forgetting. Our work thus tackles a more challenging setup for studying the forgetting phenomenon in detail, whereas retrieval-augmented generation-based methods do not analyze the forgetting phenomenon and only look at overall performance metrics. We agree that continual learning is broader than catastrophic forgetting. However, in this work, we decided to study the forgetting phenomenon in detail on one of the most challenging setups, if not the most difficult.

Parameter isolation-based approaches for continual learning assign different dedicated subsets of the model parameters to each task to prevent forgetting (De Lange et al., 2021). While learning a new task, these methods either freeze a subset of the parameters corresponding to older tasks or dynamically add new parameters per new task. At prediction time, these methods typically require the task identity to activate the corresponding subset of parameters for inference. In the DSI paradigm, we are given user queries at inference time, and the goal is to predict relevant document identifiers. During incremental indexing, if we consider every new document corpus as a new task, then a typical parameter isolation-based approach would require the corpus identity for every user query at test time, defeating the whole purpose of the DSI paradigm. Due to this, parameter isolation-based approaches in their current form are rendered less useful for DSI++. Nevertheless, we believe that by masking the weights for the already indexed corpus, one is explicitly disabling the updates to the underlying DSI model; therefore, parameter isolation-based methods would be robust to forgetting, and future works should explore them for DSI++. We believe, however, that adapting these methods for DSI++ is out of scope for this paper, and we would not be able to do both this topic and our current work justice in the limited space available.