CHEW: A Dataset of CHanging Events in Wikipedia

Hsuvas Borkakoty¹, Luis Espinosa-Anke¹,²
¹Cardiff NLP, School of Computer Science and Informatics, Cardiff University, UK
²AMPLYFI, UK
{borkakotyh,espinosaankel}@cardiff.ac.uk

Abstract

We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their “timeline” understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.

1 Introduction

Since language models (LMs) are trained on raw web text, often without any explicit temporal grounding (Zhao et al., 2024), they are prone to temporal misalignment (Luu et al., 2021; Lazaridou et al., 2021; Jang et al., 2022). A significant body of work addresses this issue via, e.g., in-domain pretraining (Gururangan et al., 2020), neologism-focused pretraining (Yu et al., 2021), knowledge editing (De Cao et al., 2021; Zhu et al., 2020; Dai et al., 2021), continual learning (Agarwal and Nenkova, 2021; Del Tredici et al., 2018; Giulianelli et al., 2020; Dhingra et al., 2022; Loureiro et al., 2022a), or model editing (Rosin and Radinsky, 2022). However, there is little understanding of LLMs’ ability to reproduce timelines of entities and events, primarily due to temporal chaos in pretraining (Zhao et al., 2024).

Probing models for temporal knowledge has been the focus of prior work, e.g., in lexical semantics, where semantic change or meaning shift has been used as a proxy to explore internal representations of word meaning, typically via diachronic embeddings (Del Tredici et al., 2019; Schlechtweg et al., 2020; Loureiro et al., 2022b) or by generating temporally grounded definitions (Giulianelli et al., 2023; Luden et al., 2024). Moreover, “time-sliced” perplexity has been studied on corpora ranging from Wikipedia to Twitter (Cheng et al., 2024; Loureiro et al., 2022a). Conversely, temporal question answering (tasks where the correct answers change over time) probes the factual and world knowledge of LLMs given some kind of time context (Liska et al., 2022; Kasai et al., 2024; Zhao et al., 2024; Wallat et al., 2024). Despite this, there are not enough evaluation benchmarks to probe for change modeling. TemporalWiki (Jang et al., 2022) partly addresses this issue by curating a dataset of Wikipedia diffs, although the authors themselves admit that it is not trivial to tell whether a change in Wikipedia or Wikidata content signifies a meaningful world change. In this paper, we take a closer look at the notion of change by proposing CHEW (CHanging Events in Wikipedia), a temporally grounded dataset from Wikipedia that focuses on important changes to events and entities. We start from a collection of Wikipedia events and entities, and their associated changes over time, extracted from Wikipedia lists and originally curated in the TAQA dataset (Zhao et al., 2024). We report generation, classification and downstream results using CHEW, shedding light on LLMs’ capabilities to handle temporal information in various settings, and on their potential for temporal alignment.

Wikipedia Title	Timestamp 1	Text 1	Timestamp 2	Text 2	Label
Andrés Iniesta	05-01-2017	Andrés Iniesta Luján (born 11 May 1984) is a Spanish professional footballer who plays as a central midfielder for FC Barcelona and the Spain national team. He serves as the captain for Barcelona…	27-12-2018	Andrés Iniesta Luján (born 11 May 1984) is a Spanish professional footballer who plays as a central midfielder for Japanese club Vissel Kobe…	change
Sonotone	30-12-2009	The Sonotone 1010 hearing aid, introduced December 1952, was the was the first commercial product to use transistors …	29-12-2010	The Sonotone 1010 hearing aid, introduced on 29 December 1952, was the first commercial product to use transistors …	no change

Table 1: Examples of CHEW which illustrate the significant changes in entities contained in the positive examples, as opposed to stylistic differences in negative examples.

2 Building CHEW

The TAQA dataset (Zhao et al., 2024), denoted as $D_{\text{TAQA}}$, comprises Wikipedia articles with temporal question-answer pairs. Our goal is to derive from it $C=P\cup N$, a set of Wikipedia page pairs representing temporal changes $P$ and their corresponding negative examples $N$. Let $Q$ be the set of all questions and $T$ the set of all timestamps. For a question $q\in Q$ and timestamps $t_{1},t_{2}\in T$ where $t_{1}<t_{2}$, let $A(q,t)$ denote the answer to question $q$ at time $t$. We say that $\Delta A(q,t_{1},t_{2})$ holds when $A(q,t_{1})\neq A(q,t_{2})$, which identifies changes in answers. From $D_{\text{TAQA}}$, we extract pairs $(q,(t_{1},t_{2}))$ such that $\Delta A(q,t_{1},t_{2})$ holds. For each valid pair, we obtain the revisions of the corresponding Wikipedia articles at $t_{1}$ and $t_{2}$. Specifically, consider the ranked list of revisions at $t_{1}$, $\{R_{t_{1}}^{1},R_{t_{1}}^{2},\ldots,R_{t_{1}}^{n}\}$, sorted by their timestamps in ascending order. We select the first revision from this list, denoted as $R_{1}=R_{t_{1}}^{1}$, and similarly, the first revision from the list at $t_{2}$, denoted as $R_{2}=R_{t_{2}}^{1}$. We then compute cosine similarities using SBERT (Reimers and Gurevych, 2019), specifically $S(R_{1},R_{2})$, $S(A(q,t_{1}),A(q,t_{2}))$, and $S(A(q,t_{2}),R_{1})$. Pairs satisfying $S\geq\theta$ for all three similarities (we empirically set $\theta$ to 0.6) are retained, and form $P$. This filtering step ensures that the following criteria are met: (1) $R_{1}$ and $R_{2}$ reflect the change available in $D_{\text{TAQA}}$, which was generated from a manually curated Wikipedia list and is therefore accurate (relying on Wikipedia lists has other advantages, since Wikipedia topics are popular and well structured, and their distribution is less biased than that of open knowledge graphs; Piscopo et al., 2017); (2) $R_{1}$ and $R_{2}$ are sufficiently similar, which ensures to a great extent that no other important changes have occurred between $t_{1}$ and $t_{2}$; and (3) the evidence for the change that originated that positive datapoint is contained in one sentence (as opposed to, e.g., being split across several sentences using anaphoric references).
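The selection of positive pairs described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `toy_embed` stand-in for an SBERT encoder, and the assumption that timestamps are sortable strings (e.g., ISO dates) are ours; only the three similarity checks and the threshold $\theta=0.6$ come from the text.

```python
import math
import zlib
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
THETA = 0.6  # threshold reported in the text


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for SBERT: a hashed bag-of-words vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def changed_pairs(answers: Dict[str, str]) -> List[Tuple[str, str]]:
    """All (t1, t2) with t1 < t2 for which Delta A holds, for one question q.

    `answers` maps a sortable timestamp t to the answer A(q, t).
    """
    times = sorted(answers)
    return [
        (t1, t2)
        for i, t1 in enumerate(times)
        for t2 in times[i + 1:]
        if answers[t1] != answers[t2]
    ]


def keep_positive(
    r1: str, r2: str, a1: str, a2: str,
    embed: Callable[[str], Vector] = toy_embed,
    theta: float = THETA,
) -> bool:
    """Retain a pair only if all three similarities reach the threshold."""
    e_r1, e_r2, e_a1, e_a2 = (embed(x) for x in (r1, r2, a1, a2))
    sims = (
        cosine(e_r1, e_r2),  # S(R1, R2)
        cosine(e_a1, e_a2),  # S(A(q, t1), A(q, t2))
        cosine(e_a2, e_r1),  # S(A(q, t2), R1)
    )
    return all(s >= theta for s in sims)
```

In practice, `embed` would be replaced by a real sentence encoder; the toy embedder only keeps the sketch free of external dependencies.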

For curating $N$ (i.e., no change), we simply sample the first and last snapshots, $R_{t_{1}}$ and $R_{t_{2}}$, of a period in the timeline of a Wikipedia page that was highly edited. We then keep only those pairs where the cosine similarity of their definition sentences, $S(R_{t_{1}}^{\text{def}},R_{t_{2}}^{\text{def}})$, is sufficiently high but not 1 (again, empirically, between 0.8 and 1). After manual inspection of a sample, we found that this step ensures that the negative examples represent pairs of Wikipedia pages that are different, but with no significant change. Statistics on CHEW are provided in Table 2 for several splits: Random, where we randomly assign all pairs to train/validation/test sets; No overlap between entities (NoOv); and Time-forward (TFwd) and Time-reversed (TRvsd), the latter two with no temporal overlap across splits (see Figure 1).
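The similarity band used for negative examples can be written as a small predicate. This is a sketch under assumptions: the function name is hypothetical, the lower bound is treated as inclusive (the text only says "between 0.8 and 1"), and `difflib` is used purely as a dependency-free stand-in for SBERT cosine similarity.

```python
import difflib
from typing import Callable


def char_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the paper uses SBERT cosine similarity."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def is_no_change_pair(
    def_1: str,
    def_2: str,
    similarity: Callable[[str, str], float] = char_similarity,
    low: float = 0.8,
) -> bool:
    """True if two definition sentences are highly similar but not identical."""
    if def_1 == def_2:
        # Identical text corresponds to similarity 1, which is excluded.
        return False
    return low <= similarity(def_1, def_2) < 1.0
```

The explicit equality check guards against floating-point scores that land just below 1 for identical inputs.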

Figure 1: Bar plots of the time-forward and time-reversed data splits.

Data split	Set	No ch.	Change	Total
Random	Train	1,632	2,649	4,281
	Val	470	141	611
	Test	942	282	1,224
NoOv	Train	2,106	2,193	4,299
	Val	307	301	608
	Test	631	578	1,209
TFwd	Train	2,828	2,076	4,904
	Val	90	341	431
	Test	126	655	781
TRvsd	Train	1,498	2,283	3,781
	Val	1,407	299	1,706
	Test	139	490	629

Table 2: CHEW statistics for the four splits we introduce.

3 Experiments

Prompting for timeline knowledge

Given timestamps $t_{1}$ and $t_{2}$ with $t_{1}<t_{2}$, we submit to an LLM a prompt $p$ consisting of a tuple $(i,w_{t1},t_{2})$, where $i$ is the instruction (further details can be found in Appendix A), $w_{t1}$ is the revision of a Wikipedia page $w$ at timestamp $t_{1}$, and $t_{2}$ is the target timestamp. We then estimate the accuracy of the LLM’s response $LLM(i,w_{t1},t_{2})$ by retrieving the maximum SBERT text similarity $S$ between the response and the content in $w_{t2}$ as follows:

$$S=\max\{\text{SBERT}(r,w_{t2})\mid r\in LLM(i,w_{t1},t_{2})\}$$

where $\text{SBERT}(r,w_{t2})$ is the cosine similarity between the response $r$ and the content in $w_{t2}$. The higher the similarity score, the more accurate the LLM’s response. For this and subsequent experiments, we consider the following popular open-source models: Llama2-7B and Llama2-13B (Touvron et al., 2023b), Llama3-8B (Touvron et al., 2023a) and Mistral-7B (Jiang et al., 2023). After collecting the responses and performing some ad-hoc cleanup, we visualize the similarities obtained for all models in Figure 2. The most capable model in this setup is Llama2-13B, whereas Mistral-7B seems to struggle the most to generate appropriate responses. An interesting observation is that Llama3-8B and Llama2-13B are actually capable of reproducing the content of $w_{t2}$ verbatim, as we can see from the large number of comparisons with a cosine similarity of 1. This is somewhat surprising, and suggests that these models are in principle capable of generating a reasonable portion of timestamped Wikipedia articles.
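The score $S$ can be sketched as a maximum over candidate responses. This is an illustration under assumptions: we treat $LLM(i,w_{t1},t_{2})$ as a collection of candidate texts $r$ (e.g., several generations or sentences), and `embed` is a hypothetical encoder argument standing in for SBERT; neither detail is specified in the text.

```python
import math
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def timeline_score(
    responses: Iterable[str],
    reference: str,
    embed: Callable[[str], Vector],
) -> float:
    """S = max over responses r of the cosine similarity between r and w_t2.

    Returns 0.0 when there are no responses (our convention, not the paper's).
    """
    ref_vec = embed(reference)  # encode w_t2 once
    return max((cosine(embed(r), ref_vec) for r in responses), default=0.0)
```

A score of 1 then corresponds to a response whose embedding coincides in direction with that of $w_{t2}$, which is the verbatim-reproduction case discussed above.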

		CHEW-Random			CHEW-NoOv			CHEW-TFwd			CHEW-TRvsd
		P	R	F1	P	R	F1	P	R	F1	P	R	F1
n.i.	llama2-7b	0.82	0.82	0.82	0.65	0.62	0.61	0.53	0.54	0.53	0.61	0.51	0.55
	llama2-13b	0.81	0.81	0.81	0.59	0.55	0.48	0.50	0.49	0.49	0.41	0.50	0.45
	llama3-8b	0.87	0.88	0.87	0.66	0.66	0.66	0.55	0.57	0.56	0.57	0.50	0.53
	mistral-7b	0.66	0.65	0.65	0.59	0.59	0.59	0.52	0.52	0.52	0.63	0.57	0.59
i.	llama2-7b	0.81	0.81	0.81	0.77	0.77	0.78	0.52	0.53	0.52	0.55	0.51	0.53
	llama2-13b	0.93	0.93	0.93	0.77	0.77	0.78	0.46	0.46	0.46	0.64	0.52	0.57
	llama3-8b	0.87	0.87	0.87	0.78	0.79	0.78	0.56	0.58	0.57	0.42	0.50	0.45
	mistral-7b	0.93	0.93	0.93	0.69	0.69	0.69	0.61	0.70	0.65	0.75	0.64	0.69

Table 3: SFT results on the different CHEW splits, without (n.i.) and with (i.) the temporal system prompt instruction. Best F1 scores per split are highlighted in bold; best results within each instruction setting are underlined.

Prompt-based change detection

We now explore a complementary dimension of this probing experiment: binary text classification. Here, we are still in the non-parametric space, and we simply prompt LLMs for a binary label, given the pair $\left(w_{1},w_{2}\right)$. Figure 3 shows the results of this experiment. We report results on the test sets of all splits (simply to have a point of comparison with the supervised approaches we discuss later), and just like in the previous experiment, we report results of “bare” prompting ($p_{\text{zero}}$) as well as prompting with in-context examples ($p_{\text{few}}$) (see Appendix B for details of these prompts and the examples provided). Our results yield two immediate observations. First, all models struggle significantly more when prompted with newer entities. Second, Llama-2 seems to struggle more than the other models across all splits. Interestingly, while struggling to accurately generate new information, Mistral is in fact often the best model in this classification task, which motivates us to explore it further in a downstream task (cf. Section 4).

Fine-tuning experiments

We proceed to fine-tune the considered models using standard LoRA (Hu et al., 2021) and SFT (Ouyang et al., 2022) techniques on the training sets (see Appendix C for implementation details). The results in Table 3 yield a number of interesting insights. Note that we study two different system prompts: one where we simply refer to the task as a binary classification problem (n.i.), and one where the models are given more context about the task (i.). Mistral-7B, which was already showing signs of being capable of handling this task, benefits substantially from the fine-tuning strategy, outperforming the Llama models in three out of four splits. Moreover, we find that Llama3-8B is the best of the Llama models in only 2 out of 4 settings, which highlights the potential of the older Llama-2 models to enhance their temporal knowledge via fine-tuning.

4 Better temporal embeddings with CHEW

Enhancing an LLM’s temporal capabilities is one of the long-standing goals of creating temporal datasets. We use TempoWiC (Loureiro et al., 2022b) as a benchmark to test whether CHEW serves this goal. TempoWiC is a binary classification problem where, given a pair of timestamped tweets and a target word, models must determine whether the meaning of the target word has changed between the two tweets. This idea closely aligns with our proposed approach of detecting changes, and so we test whether LLMs fine-tuned on CHEW improve over base models. The upper bound for the test set in this dataset is around 77% macro-F1 (Lyu et al., 2022), and a random baseline obtains 50%.
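For reference, macro-F1 averages the per-class F1 scores with equal weight per class. The sketch below is a plain-Python illustration of the metric (the function name is ours); it is not the evaluation script used for TempoWiC.

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = set(gold) | set(pred)
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For example, with gold labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 0]`, the per-class F1 scores are 0.8 (class 0) and 2/3 (class 1), giving a macro-F1 of about 0.73.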

Specifically, we generate embeddings with both base and CHEW-finetuned LLMs, and use them in two ways: directly as features for a logistic regression classifier (contextualized embeddings, either concatenated or averaged), and to derive cosine similarities, which are then themselves the features for the logistic regression classifier (following the official baselines of Loureiro et al. (2022b)). We consider different layers (last, last 4, or all layers), and we select Mistral-7B, base and CHEW-finetuned, as our target model, since it benefited the most from the fine-tuning step. The results in Table 4 indicate an improvement in the quality of the embeddings after CHEW fine-tuning, making this model on par with encoder-only baselines based on RoBERTa, which are known to be much better than decoder-only models for text representations (Tang et al., 2019; BehnamGhader et al., 2024).
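The three feature types can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, each input is taken to be the target word's embedding in one of the two tweets, and producing one cosine similarity per selected layer is our reading of the "last / last 4 / all layers" settings rather than a documented detail.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_features(
    layers_1: Sequence[Vector], layers_2: Sequence[Vector]
) -> List[float]:
    """One cosine similarity per selected layer (e.g., last, last 4, all)."""
    return [cosine(a, b) for a, b in zip(layers_1, layers_2)]


def concat_features(v1: Vector, v2: Vector) -> List[float]:
    """Concatenate the two target-word embeddings."""
    return list(v1) + list(v2)


def averaged_features(v1: Vector, v2: Vector) -> List[float]:
    """Element-wise average of the two target-word embeddings."""
    return [(a + b) / 2 for a, b in zip(v1, v2)]
```

Any of these feature vectors can then be passed to a logistic regression classifier trained on the TempoWiC labels.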

Training data	Experiment	Acc	P	R	F1
CHEW+TW	Similarity-last	0.68	0.69	0.68	0.68
	Similarity-last4	0.67	0.7	0.67	0.67
	Similarity-all	0.63	0.41	0.63	0.5
	Concat	0.6	0.4	0.63	0.49
	Averaged	0.6	0.5	0.6	0.54
TW	Similarity-last	0.67	0.58	0.66	0.61
	Similarity-last4	0.68	0.55	0.65	0.59
	Similarity-all	0.58	0.5	0.57	0.5
	Concat	0.56	0.49	0.56	0.49
	Averaged	0.48	0.5	0.49	0.49

Table 4: TempoWiC Results.

5 Conclusions and Future Work

We have introduced CHEW, a dataset of Changing Events in Wikipedia, with which we hope to contribute to research in continual learning, temporal alignment, and other areas involving LLMs and their ability (or lack thereof) to make sense of temporal information at various levels, old and new, seen and not seen during pretraining, etc. We showed that aligning LLMs to temporal change makes them surprisingly competitive in the downstream task of word-in-context temporal classification when compared with their base counterparts.

6 Limitations

Our work does not extensively evaluate a wider range of LLMs. We also have made assumptions in the similarity comparisons between Wikipedia revisions, and while our automatic, manual and downstream tests are all consistent, further extending the comparisons between revisions could lead to a more accurate dataset. Moreover, we have only focused on the English Wikipedia, which is a significant limitation, especially for exploring “tail” entities.

7 Ethics statement

We believe that updating LLMs’ knowledge over time, given their prevalence and the current high interest in Generative AI, is critical for deploying accurate and trustworthy AI models. It should be noted, however, that flagging critically new content in Wikipedia might conflict with Wikipedia quality standards, as well as raise misinformation concerns (if, for example, false information peppers a Wikipedia page in a way that a change detection model is unable to capture). Further research into ensuring highly accurate scans for change in community resources remains more critical than ever, especially given how prevalent GenAI tools are today.

References

Agarwal and Nenkova (2021) Oshin Agarwal and Ani Nenkova. 2021. Temporal effects on pre-trained models for language processing tasks. arXiv preprint arXiv:2111.12790.
BehnamGhader et al. (2024) Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961.
Cheng et al. (2024) Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2024. Dated data: Tracing knowledge cutoffs in large language models. arXiv preprint arXiv:2403.12958.
Dai et al. (2021) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.
De Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164.
Del Tredici et al. (2018) Marco Del Tredici, Raquel Fernández, and Gemma Boleda. 2018. Short-term meaning shift: A distributional exploration. arXiv preprint arXiv:1809.03169.
Del Tredici et al. (2019) Marco Del Tredici, Raquel Fernández, and Gemma Boleda. 2019. Short-term meaning shift: A distributional exploration. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2069–2075.
Dhingra et al. (2022) Bhuwan Dhingra, Jeremy R Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W Cohen. 2022. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273.
Giulianelli et al. (2020) Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. arXiv preprint arXiv:2004.14118.
Giulianelli et al. (2023) Mario Giulianelli, Iris Luden, Raquel Fernandez, and Andrey Kutuzov. 2023. Interpretable word sense representations via definition generation: The case of semantic change analysis. In The 61st Annual Meeting Of The Association For Computational Linguistics.
Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.
Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR).
Jang et al. (2022) Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2022. Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models. arXiv preprint arXiv:2204.14211.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Kasai et al. (2024) Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A Smith, Yejin Choi, Kentaro Inui, et al. 2024. Realtime qa: What’s the answer right now? Advances in Neural Information Processing Systems, 36.
Lazaridou et al. (2021) Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d’Autume, Tomas Kocisky, Sebastian Ruder, et al. 2021. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34.
Liska et al. (2022) Adam Liska, Tomas Kocisky, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, D’Autume Cyprien De Masson, Tim Scholtes, Manzil Zaheer, Susannah Young, et al. 2022. Streamingqa: A benchmark for adaptation to new knowledge over time in question answering models. In International Conference on Machine Learning, pages 13604–13622. PMLR.
Loureiro et al. (2022a) Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. 2022a. Timelms: Diachronic language models from twitter. arXiv preprint arXiv:2202.03829.
Loureiro et al. (2022b) Daniel Loureiro, Aminette D’Souza, Areej Nasser Muhajab, Isabella A White, Gabriel Wong, Luis Espinosa Anke, Leonardo Neves, Francesco Barbieri, and Jose Camacho-Collados. 2022b. Tempowic: An evaluation benchmark for detecting meaning shift in social media. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3353–3359.
Luden et al. (2024) Iris Luden, Mario Giulianelli, and Raquel Fernández. 2024. Beyond perplexity: Examining temporal generalization in large language models via definition generation. Computational Linguistics in the Netherlands Journal, 13:205–232.
Luu et al. (2021) Kelvin Luu, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, and Noah A Smith. 2021. Time waits for no one! analysis and challenges of temporal misalignment. arXiv preprint arXiv:2111.07408.
Lyu et al. (2022) Chenyang Lyu, Yongxin Zhou, and Tianbo Ji. 2022. Mllabs-lig at tempowic 2022: A generative approach for examining temporal meaning shift. In Proceedings of the The First Workshop on Ever Evolving NLP (EvoNLP), pages 1–6.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
Piscopo et al. (2017) Alessandro Piscopo, Pavlos Vougiouklis, Lucie-Aimée Kaffee, Christopher Phethean, Jonathon Hare, and Elena Simperl. 2017. What do wikidata and wikipedia have in common? an analysis of their use of external references. In Proceedings of the 13th International Symposium on Open Collaboration, pages 1–10.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
Rosin and Radinsky (2022) Guy D Rosin and Kira Radinsky. 2022. Temporal attention for language models. arXiv preprint arXiv:2202.02093.
Schlechtweg et al. (2020) Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. Semeval-2020 task 1: Unsupervised lexical semantic change detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation. International Committee for Computational Linguistics.
Tang et al. (2019) Gongbo Tang, Rico Sennrich, and Joakim Nivre. 2019. Understanding neural machine translation by simplification: The case of encoder-free models. In Recent Advances in Natural Language Processing.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wallat et al. (2024) Jonas Wallat, Adam Jatowt, and Avishek Anand. 2024. Temporal blind spots in large language models. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 683–692.
Yu et al. (2021) Wenhao Yu, Chenguang Zhu, Yuwei Fang, Donghan Yu, Shuohang Wang, Yichong Xu, Michael Zeng, and Meng Jiang. 2021. Dict-bert: Enhancing language model pre-training with dictionary. arXiv preprint arXiv:2110.06490.
Zhao et al. (2024) Bowen Zhao, Zander Brumbaugh, Yizhong Wang, Hannaneh Hajishirzi, and Noah A Smith. 2024. Set the clock: Temporal alignment of pretrained language models. arXiv preprint arXiv:2402.16797.
Zhu et al. (2020) Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363.

Appendix A Prompt for Generation of changes

The instruction prompt that we use to generate the change from an LLM is given below:

Appendix B Prompt-based change detection experiment

The instruction prompt that we use to generate the change label from an LLM is as follows.

The in-context examples along with this prompt used for Few-shot learning are as follows.

1. Positive Example:

2. Negative Example:

Appendix C Models and training details

We use the chat/instruct versions of the models from Hugging Face in our experiments, fine-tuning them using LoRA (Hu et al., 2021). Each model is loaded in 4-bit precision for the ‘SEQ_CLS’ (sequence classification) task. We train the models on one NVIDIA A100 GPU and run inference on one NVIDIA RTX 4090, with training taking approximately 2 hours per epoch. The training details for the models are listed below.

• Learning Rate: 2e-6
• Optimizer: paged_adamw_8bit
• Batch size (train/eval): 1

The models and their Hugging Face repository names are listed in Table 5.

Model Name	Huggingface Repository
llama2-7b	meta-llama/Llama-2-7b-chat-hf
llama2-13b	meta-llama/Llama-2-13b-chat-hf
llama3-8b	meta-llama/Meta-Llama-3-8B-Instruct
mistral-7b	mistralai/Mistral-7B-Instruct-v0.3

Table 5: List of models used in our experiments and their Hugging Face repositories.
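The settings listed in this appendix could be expressed with the `transformers` and `peft` libraries roughly as below. This is a configuration sketch, not the authors' training script: the learning rate, optimizer, batch sizes, 4-bit loading and `SEQ_CLS` task type come from the text, whereas the LoRA rank, alpha and dropout values and the output directory are placeholders that the paper does not report.

```python
from peft import LoraConfig, TaskType
from transformers import BitsAndBytesConfig, TrainingArguments

# Reported above: models are loaded in 4-bit precision.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

# Reported above: task type SEQ_CLS (sequence classification).
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,               # assumption: rank is not reported in the paper
    lora_alpha=32,      # assumption: not reported in the paper
    lora_dropout=0.05,  # assumption: not reported in the paper
)

# Reported above: learning rate, optimizer and batch sizes.
training_args = TrainingArguments(
    output_dir="chew-lora",  # hypothetical output path
    learning_rate=2e-6,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
)
```

A model from Table 5 would then typically be loaded with `AutoModelForSequenceClassification.from_pretrained(repo_name, num_labels=2, quantization_config=quant_config)` and wrapped with `peft.get_peft_model(model, lora_config)` before training.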