CHEW: A Dataset of CHanging Events in Wikipedia

Hsuvas Borkakoty1, Luis Espinosa-Anke1,2,
1Cardiff NLP, School of Computer Science and Informatics, Cardiff University, UK
2AMPLYFI, UK
{borkakotyh,espinosaankel}@cardiff.ac.uk
Abstract

We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their “timeline” understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.



1 Introduction

Since language models (LMs) are trained on raw web text, often without any explicit temporal grounding Zhao et al. (2024), they are prone to temporal misalignment Luu et al. (2021); Lazaridou et al. (2021); Jang et al. (2022). While a significant body of work addresses this issue via, e.g., in-domain pretraining Gururangan et al. (2020), neologism-focused pretraining Yu et al. (2021), knowledge editing De Cao et al. (2021); Zhu et al. (2020); Dai et al. (2021), continual learning (Agarwal and Nenkova, 2021; Del Tredici et al., 2018; Giulianelli et al., 2020; Dhingra et al., 2022; Loureiro et al., 2022a), or model editing Rosin and Radinsky (2022), there is still little understanding of LLMs’ ability to reproduce timelines of entities and events, primarily due to the temporal chaos of pretraining data Zhao et al. (2024).

Probing models for temporal knowledge has been the focus of prior work, e.g., in the lexical semantics space, where semantic change or meaning shift has been used as a proxy to explore internal representations of word meaning, typically via diachronic embeddings Del Tredici et al. (2019); Schlechtweg et al. (2020); Loureiro et al. (2022b) or by generating temporally grounded definitions Giulianelli et al. (2023); Luden et al. (2024). Moreover, “time-sliced” perplexity has been measured on data ranging from Wikipedia to Twitter Cheng et al. (2024); Loureiro et al. (2022a). Conversely, temporal question answering (tasks where the correct answers change over time) probes factual and world knowledge in LLMs with some kind of time context Liska et al. (2022); Kasai et al. (2024); Zhao et al. (2024); Wallat et al. (2024). Despite this, there are not enough evaluation benchmarks to probe for change modeling. TemporalWiki Jang et al. (2022) partly addresses this issue by curating a Wikipedia diffs dataset, although the authors themselves admit it is not trivial to tell whether a change in Wikipedia or Wikidata content signifies a meaningful world change. In this paper, we take a closer look at the notion of change by proposing CHEW (CHanging Events in Wikipedia), a temporally grounded dataset from Wikipedia that focuses on important changes to events and entities. We start from a collection of Wikipedia events and entities, and their associated changes over time, extracted from Wikipedia lists and originally curated in the TAQA dataset Zhao et al. (2024). We report generation, classification and downstream results using CHEW, shedding light on LLMs’ capabilities to handle temporal information in various settings, and their potential for temporal alignment.

Wikipedia Title | Timestamp 1 | Text 1 | Timestamp 2 | Text 2 | Label
Andrés Iniesta | 05-01-2017 | Andrés Iniesta Luján (born 11 May 1984) is a Spanish professional footballer who plays as a central midfielder for FC Barcelona and the Spain national team. He serves as the captain for Barcelona… | 27-12-2018 | Andrés Iniesta Luján (born 11 May 1984) is a Spanish professional footballer who plays as a central midfielder for Japanese club Vissel Kobe… | change
Sonotone | 30-12-2009 | The Sonotone 1010 hearing aid, introduced December 1952, was the was the first commercial product to use transistors … | 29-12-2010 | The Sonotone 1010 hearing aid, introduced on 29 December 1952, was the first commercial product to use transistors … | no change
Table 1: Examples of CHEW which illustrate the significant changes in entities contained in the positive examples, as opposed to stylistic differences in negative examples.

2 Building CHEW

The TAQA dataset Zhao et al. (2024), denoted as $D_{\text{TAQA}}$, comprises Wikipedia articles with temporal question-answer pairs. Our goal is to derive from it $C = P \cup N$, a set of Wikipedia page pairs representing temporal changes $P$ and their corresponding negative examples $N$. Let $Q$ be the set of all questions and $T$ the set of all timestamps. For a question $q \in Q$ and timestamps $t_1, t_2 \in T$ where $t_1 < t_2$, let $A(q, t)$ denote the answer to question $q$ at time $t$. We define $\Delta A(q, t_1, t_2) = \big(A(q, t_1) \neq A(q, t_2)\big)$ to identify changes in answers. From $D_{\text{TAQA}}$, we extract pairs $(q, (t_1, t_2))$ such that $\Delta A(q, t_1, t_2)$ holds.
For each valid pair, we obtain the revisions of the corresponding Wikipedia articles at $t_1$ and $t_2$. Specifically, consider the ranked list of revisions at $t_1$, $\{R_{t_1}^{1}, R_{t_1}^{2}, \ldots, R_{t_1}^{n}\}$, sorted by their timestamps in ascending order. We select the first revision from this list, denoted as $R_1 = R_{t_1}^{1}$, and similarly, the first revision from the list at $t_2$, denoted as $R_2 = R_{t_2}^{1}$.
We then compute cosine similarities using SBERT Reimers and Gurevych (2019), specifically $S(R_1, R_2)$, $S(A(q, t_1), A(q, t_2))$, and $S(A(q, t_2), R_1)$. Pairs satisfying $S \geq \theta$ for all three similarities (we empirically set $\theta$ to 0.6) are retained, and form $P$. This filtering step ensures the following criteria are met: (1) $R_1$ and $R_2$ reflect the change available in $D_{\text{TAQA}}$, which was generated from a manually curated Wikipedia list, and therefore is accurate (note that relying on Wikipedia lists has other advantages, since Wikipedia topics are popular and well structured, and their distribution is less biased than open knowledge graphs Piscopo et al. (2017)); (2) $R_1$ and $R_2$ are sufficiently similar, which ensures to a great extent that no other important changes have occurred between $t_1$ and $t_2$; and (3) we can guarantee that the evidence for the change that originated that positive datapoint is contained in one sentence (as opposed to, e.g., split across several sentences using anaphoric references).
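The positive-pair filter described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `cosine` and `is_positive_pair` are hypothetical, and `embed` stands in for any sentence encoder (SBERT in the paper) that maps a text to a vector.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_positive_pair(
    r1: str,
    r2: str,
    a1: str,
    a2: str,
    embed: Callable[[str], Vector],  # assumption: any text encoder, e.g. SBERT
    theta: float = 0.6,              # threshold reported in the text
) -> bool:
    """Keep a pair only if the answer changed and all three similarities reach theta."""
    if a1 == a2:  # Delta A(q, t1, t2) must hold
        return False
    e_r1, e_r2, e_a1, e_a2 = (embed(x) for x in (r1, r2, a1, a2))
    sims = (
        cosine(e_r1, e_r2),  # S(R1, R2)
        cosine(e_a1, e_a2),  # S(A(q, t1), A(q, t2))
        cosine(e_a2, e_r1),  # S(A(q, t2), R1)
    )
    return all(s >= theta for s in sims)
```

Passing the encoder as a parameter keeps the sketch independent of any particular embedding library.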

For curating $N$ (i.e., no change), we simply sample the first and last snapshot $R_{t_1}$ and $R_{t_2}$ of a period in the timeline of a Wikipedia page that was highly edited. We then keep only those pairs where the cosine similarity of their definition sentences, $S(R_{t_1}^{\text{def}}, R_{t_2}^{\text{def}})$, is sufficiently high, but not 1 (again, empirically, between 0.8 and 1). After manual inspection of a sample, we found that this step ensures that the negative examples represent pairs of Wikipedia pages that are different, but with no significant change. Statistics on CHEW are provided in Table 2, for several splits: Random, where we randomly assign all pairs to train/validation/test splits; No overlap between entities (NoOv); and Time-forward (TFwd) and Time-reversed (TRvsd), the latter two with no temporal overlap across splits (see Figure 1).
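The negative-example criterion can be sketched in the same spirit. This is an illustration under stated assumptions: `is_negative_pair` is a hypothetical name, `embed` again stands in for a sentence encoder, and treating the lower bound 0.8 as inclusive is a choice made here, since the text only says "between 0.8 and 1".

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_negative_pair(
    def1: str,
    def2: str,
    embed: Callable[[str], Vector],  # assumption: any text encoder, e.g. SBERT
    low: float = 0.8,                # lower bound reported in the text
    eps: float = 1e-9,               # tolerance for "not 1" under float rounding
) -> bool:
    """Keep pairs whose definition sentences are very similar but not identical."""
    if def1 == def2:  # identical text carries no stylistic difference
        return False
    s = cosine(embed(def1), embed(def2))
    return low <= s < 1.0 - eps
```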

Figure 1: Barplots with the time forward and time reversed data splits. (a) Time Forward Split (TFwd). (b) Time Reversed Split (TRvsd).
Data split  Set    No ch.  Change  Total
Random      Train  1,632   2,649   4,281
Random      Val      470     141     611
Random      Test     942     282   1,224
NoOv        Train  2,106   2,193   4,299
NoOv        Val      307     301     608
NoOv        Test     631     578   1,209
TFwd        Train  2,828   2,076   4,904
TFwd        Val       90     341     431
TFwd        Test     126     655     781
TRvsd       Train  1,498   2,283   3,781
TRvsd       Val    1,407     299   1,706
TRvsd       Test     139     490     629
Table 2: CHEW statistics for the four splits we introduce.

3 Experiments

Prompting for timeline knowledge

Given timestamps $t_1$ and $t_2$, we submit to an LLM $LLM$ a prompt $p$ with a tuple $(i, w_{t_1}, t_2)$, where $i$ is the instruction (further details can be found in Appendix A), $w_{t_1}$ is the revision of $w$ at timestamp $t_1$, and $t_2$, where $t_1 < t_2$. We then estimate the accuracy of the LLMs’ response $LLM(i, w_{t_1}, t_2)$ by retrieving the maximum SBERT text similarity $S$ between the response and the content in $w_{t_2}$ as follows:

$S = \max\{\text{SBERT}(r, w_{t_2}) \mid r \in LLM(i, w_{t_1}, t_2)\}$

where $\text{SBERT}(r, w_{t_2})$ is the cosine similarity between the response $r$ and the content in $w_{t_2}$. The higher the similarity score, the more accurate the LLM’s response was. For this and subsequent experiments, we consider the following popular open source models: Llama2-7B and Llama2-13B Touvron et al. (2023b), Llama3-8B Touvron et al. (2023a) and Mistral-7B Jiang et al. (2023). After collecting the responses and performing some ad-hoc cleanup, we visualize the similarities obtained for all models in Figure 2. The most capable model in this setup is Llama2-13B, whereas Mistral-7B seems to struggle the most to generate appropriate responses. An interesting observation is that Llama3-8B and Llama2-13B are actually capable of reproducing verbatim the content of $w_{t_2}$, as we can see from the large number of comparisons with a cosine similarity of 1. This is somewhat surprising, and suggests that these models are in principle capable of generating a reasonable portion of timestamped Wikipedia articles.
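The scoring rule $S$ can be sketched as follows. This is a minimal illustration, not the authors' code: `timeline_score` is a hypothetical name, `embed` stands in for the SBERT encoder, and returning 0.0 for an empty list of responses is an assumption made here.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def timeline_score(
    responses: Sequence[str],        # the changes r listed in the LLM output
    reference: str,                  # content of the revision at t2
    embed: Callable[[str], Vector],  # assumption: any text encoder, e.g. SBERT
) -> float:
    """Maximum cosine similarity between any generated change and the reference."""
    if not responses:  # assumption: an empty answer scores 0.0
        return 0.0
    ref_vec = embed(reference)
    return max(cosine(embed(r), ref_vec) for r in responses)
```

Taking the maximum rewards a response as soon as one of its listed changes matches the later revision, regardless of how many other items it contains.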

Figure 2: Similarities comparing ground truth and generations after probing for temporal knowledge.
                   CHEW-Random         CHEW-NoOv           CHEW-TFwd           CHEW-TRvsd
Prompt Model       P     R     F1      P     R     F1      P     R     F1      P     R     F1
n.i.   llama2-7b   0.82  0.82  0.82    0.65  0.62  0.61    0.53  0.54  0.53    0.61  0.51  0.55
n.i.   llama2-13b  0.81  0.81  0.81    0.59  0.55  0.48    0.50  0.49  0.49    0.41  0.50  0.45
n.i.   llama3-8b   0.87  0.88  0.87    0.66  0.66  0.66    0.55  0.57  0.56    0.57  0.50  0.53
n.i.   mistral-7b  0.66  0.65  0.65    0.59  0.59  0.59    0.52  0.52  0.52    0.63  0.57  0.59
i.     llama2-7b   0.81  0.81  0.81    0.77  0.77  0.78    0.52  0.53  0.52    0.55  0.51  0.53
i.     llama2-13b  0.93  0.93  0.93    0.77  0.77  0.78    0.46  0.46  0.46    0.64  0.52  0.57
i.     llama3-8b   0.87  0.87  0.87    0.78  0.79  0.78    0.56  0.58  0.57    0.42  0.50  0.45
i.     mistral-7b  0.93  0.93  0.93    0.69  0.69  0.69    0.61  0.70  0.65    0.75  0.64  0.69
Table 3: SFT fine-tuning results on the different CHEW splits, without (n.i.) and with (i.) temporal system prompt instruction. Best F1 scores per split are highlighted in bold, best results within the specific instruction setting are underlined.

Prompt-based change detection

We now explore a complementary dimension of this probing experiment: binary text classification. Here, we are still in the non-parametric space, and we simply prompt LLMs for a binary label, given the pair $(w_1, w_2)$. Figure 3 shows the results of this experiment. We report results on the test sets of all splits (simply to have a point of comparison with the supervised approaches we will discuss later), and just like in the previous experiment, we report results of “bare” prompting ($p_{\text{zero}}$) as well as with in-context examples ($p_{\text{few}}$) (see Appendix B for details of these prompts and the examples provided). Our results yield two immediate observations. First, all models struggle significantly more when prompted with newer entities. Second, Llama-2 seems to struggle more than the other models across all splits. Interestingly, while struggling to accurately generate new information, Mistral is in fact often the best model in this classification task, which motivates us to explore it further in a downstream task (cf. Section 4).
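As an illustration of this setup, the sketch below formats one revision following the `<t> … </t> <y> … </y>` layout given in Appendix B and extracts a 0/1 label from a raw model response. The helper names `format_revision` and `parse_label` are hypothetical, and the regex-based label extraction is an assumption, not the cleanup procedure used in the paper.

```python
import re
from typing import Optional


def format_revision(entity: str, timestamp: str, text: str) -> str:
    """Render one revision as '<t> entity </t> <y> timestamp </y> article text'."""
    return f"<t> {entity} </t> <y> {timestamp} </y> {text}"


def parse_label(response: str) -> Optional[int]:
    """Return the first standalone 0 or 1 in the response, or None if absent."""
    match = re.search(r"\b[01]\b", response)
    return int(match.group()) if match else None
```

For example, `parse_label("Label: 1")` yields `1`, while a response containing no standalone 0 or 1 yields `None` and can be counted as an invalid prediction.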

Fine-tuning experiments

We proceed to fine-tune the considered models using standard LoRA Hu et al. (2021) and SFT Ouyang et al. (2022) techniques on the training sets (see Appendix C for implementation details). The results provided in Table 3 yield a number of interesting insights. Note that we study two different system prompts: one where we simply refer to the task as a binary classification problem (n.i.), and one where the models are given more context about the task (i.). Mistral-7b, which was already showing signs of being capable of handling this task, benefits substantially from the fine-tuning strategy, outperforming the Llama models in three out of four splits. Moreover, we find that Llama3-8b is only the best of the Llama models in 2 out of 4 settings, which highlights the potential of the older Llama-2 models to enhance their temporal knowledge via fine-tuning.

Figure 3: Prompt-based classification change prediction results.

4 Better temporal embeddings with CHEW

Enhancing an LLM’s temporal capabilities is one of the long-standing goals of creating temporal datasets. We find that TempoWiC Loureiro et al. (2022b) is a suitable benchmark to test whether CHEW serves this goal. This is a binary classification problem where, given a pair of timestamped tweets and a target word, models must determine whether the meaning of the target word differs between the two tweets. This idea closely aligns with our proposed approach of detecting changes, and so we test whether LLMs fine-tuned on CHEW improve over base models. The upper bound for the test set in this dataset is around 77% macro-F1 Lyu et al. (2022), while a random baseline reaches 50%.

Specifically, we generate embeddings with both base and CHEW-finetuned LLMs, and use them in a number of ways: directly as features for a logistic regression classifier (contextualized embeddings, either concatenated or averaged), or to derive cosine similarities, which are then themselves the features for the logistic regression classifier (following Loureiro et al. (2022b)’s official baselines). We consider different layers (last, last 4, or all layers), and we select Mistral-7b (base and CHEW-finetuned) as our target model, since it benefited the most from the fine-tuning step. The results (Table 4) indicate an improvement in the quality of the embeddings after CHEW-finetuning, making this model on par with encoder-only baselines based on RoBERTa, which are known to be much better than decoder-only models for text representations Tang et al. (2019); BehnamGhader et al. (2024).
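The three ways of turning a pair of target-word embeddings into classifier features can be sketched as follows. This is an illustration under assumptions: `pair_features` and its `mode` values are hypothetical names, the inputs are taken to be the target word's embedding in each tweet for a single layer, and the logistic regression step itself is omitted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_features(e1: Vector, e2: Vector, mode: str) -> List[float]:
    """Build the feature vector for one (tweet 1, tweet 2) embedding pair."""
    if mode == "concat":      # both embeddings side by side
        return list(e1) + list(e2)
    if mode == "averaged":    # element-wise mean of the two embeddings
        return [(a + b) / 2 for a, b in zip(e1, e2)]
    if mode == "similarity":  # a single cosine-similarity feature
        return [cosine(e1, e2)]
    raise ValueError(f"unknown mode: {mode}")
```

For the multi-layer settings (last 4 or all layers), one plausible reading is to call `pair_features(..., "similarity")` once per layer and concatenate the results; the text does not specify how layers are combined, so this is an assumption.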

Training data  Experiment        Acc   P     R     F1
CHEW+TW        Similarity-last   0.68  0.69  0.68  0.68
CHEW+TW        Similarity-last4  0.67  0.7   0.67  0.67
CHEW+TW        Similarity-all    0.63  0.41  0.63  0.5
CHEW+TW        Concat            0.6   0.4   0.63  0.49
CHEW+TW        Averaged          0.6   0.5   0.6   0.54
TW             Similarity-last   0.67  0.58  0.66  0.61
TW             Similarity-last4  0.68  0.55  0.65  0.59
TW             Similarity-all    0.58  0.5   0.57  0.5
TW             Concat            0.56  0.49  0.56  0.49
TW             Averaged          0.48  0.5   0.49  0.49
Table 4: TempoWiC Results.

5 Conclusions and Future Work

We have introduced CHEW, a dataset of Changing Events in Wikipedia, with which we hope to contribute to research in continual learning, temporal alignment, and other areas involving LLMs and their ability (or lack thereof) to make sense of temporal information at various levels, old and new, seen and not seen during pretraining, etc. We showed that aligning LLMs to temporal change makes them surprisingly competitive in the downstream task of word-in-context temporal classification when compared with their base counterparts.

6 Limitations

Our work does not extensively evaluate a wider range of LLMs. We also have made assumptions in the similarity comparisons between Wikipedia revisions, and while our automatic, manual and downstream tests are all consistent, further extending the comparisons between revisions could lead to a more accurate dataset. Moreover, we have only focused on the English Wikipedia, which is a significant limitation, especially for exploring “tail” entities.

7 Ethics statement

We believe that updating LLMs’ knowledge over time, given their prevalence and how high the interest in Generative AI is today, is critical for deploying accurate and trustworthy AI models. However, it should be noted that flagging critically new content in Wikipedia might conflict with Wikipedia quality standards, as well as raise misinformation concerns (if, for example, false information peppers a Wikipedia page in a way that a change detection model is unable to capture). Further research into ensuring highly accurate scans for change of community resources remains critical today more than ever, again, especially due to how prevalent GenAI tools are today.

References

  • Agarwal and Nenkova (2021) Oshin Agarwal and Ani Nenkova. 2021. Temporal effects on pre-trained models for language processing tasks. arXiv preprint arXiv:2111.12790.
  • BehnamGhader et al. (2024) Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961.
  • Cheng et al. (2024) Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2024. Dated data: Tracing knowledge cutoffs in large language models. arXiv preprint arXiv:2403.12958.
  • Dai et al. (2021) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.
  • De Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164.
  • Del Tredici et al. (2018) Marco Del Tredici, Raquel Fernández, and Gemma Boleda. 2018. Short-term meaning shift: A distributional exploration. arXiv preprint arXiv:1809.03169.
  • Del Tredici et al. (2019) Marco Del Tredici, Raquel Fernández, and Gemma Boleda. 2019. Short-term meaning shift: A distributional exploration. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2069–2075.
  • Dhingra et al. (2022) Bhuwan Dhingra, Jeremy R Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W Cohen. 2022. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273.
  • Giulianelli et al. (2020) Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. arXiv preprint arXiv:2004.14118.
  • Giulianelli et al. (2023) Mario Giulianelli, Iris Luden, Raquel Fernandez, and Andrey Kutuzov. 2023. Interpretable word sense representations via definition generation: The case of semantic change analysis. In The 61st Annual Meeting Of The Association For Computational Linguistics.
  • Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR).
  • Jang et al. (2022) Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2022. Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models. arXiv preprint arXiv:2204.14211.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Kasai et al. (2024) Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A Smith, Yejin Choi, Kentaro Inui, et al. 2024. Realtime qa: What’s the answer right now? Advances in Neural Information Processing Systems, 36.
  • Lazaridou et al. (2021) Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d’Autume, Tomas Kocisky, Sebastian Ruder, et al. 2021. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34.
  • Liska et al. (2022) Adam Liska, Tomas Kocisky, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, D’Autume Cyprien De Masson, Tim Scholtes, Manzil Zaheer, Susannah Young, et al. 2022. Streamingqa: A benchmark for adaptation to new knowledge over time in question answering models. In International Conference on Machine Learning, pages 13604–13622. PMLR.
  • Loureiro et al. (2022a) Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. 2022a. Timelms: Diachronic language models from twitter. arXiv preprint arXiv:2202.03829.
  • Loureiro et al. (2022b) Daniel Loureiro, Aminette D’Souza, Areej Nasser Muhajab, Isabella A White, Gabriel Wong, Luis Espinosa Anke, Leonardo Neves, Francesco Barbieri, and Jose Camacho-Collados. 2022b. Tempowic: An evaluation benchmark for detecting meaning shift in social media. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3353–3359.
  • Luden et al. (2024) Iris Luden, Mario Giulianelli, and Raquel Fernández. 2024. Beyond perplexity: Examining temporal generalization in large language models via definition generation. Computational Linguistics in the Netherlands Journal, 13:205–232.
  • Luu et al. (2021) Kelvin Luu, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, and Noah A Smith. 2021. Time waits for no one! analysis and challenges of temporal misalignment. arXiv preprint arXiv:2111.07408.
  • Lyu et al. (2022) Chenyang Lyu, Yongxin Zhou, and Tianbo Ji. 2022. Mllabs-lig at tempowic 2022: A generative approach for examining temporal meaning shift. In Proceedings of the The First Workshop on Ever Evolving NLP (EvoNLP), pages 1–6.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  • Piscopo et al. (2017) Alessandro Piscopo, Pavlos Vougiouklis, Lucie-Aimée Kaffee, Christopher Phethean, Jonathon Hare, and Elena Simperl. 2017. What do wikidata and wikipedia have in common? an analysis of their use of external references. In Proceedings of the 13th International Symposium on Open Collaboration, pages 1–10.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
  • Rosin and Radinsky (2022) Guy D Rosin and Kira Radinsky. 2022. Temporal attention for language models. arXiv preprint arXiv:2202.02093.
  • Schlechtweg et al. (2020) Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. Semeval-2020 task 1: Unsupervised lexical semantic change detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation. International Committee for Computational Linguistics.
  • Tang et al. (2019) Gongbo Tang, Rico Sennrich, and Joakim Nivre. 2019. Understanding neural machine translation by simplification: The case of encoder-free models. In Recent Advances in Natural Language Processing.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wallat et al. (2024) Jonas Wallat, Adam Jatowt, and Avishek Anand. 2024. Temporal blind spots in large language models. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 683–692.
  • Yu et al. (2021) Wenhao Yu, Chenguang Zhu, Yuwei Fang, Donghan Yu, Shuohang Wang, Yichong Xu, Michael Zeng, and Meng Jiang. 2021. Dict-bert: Enhancing language model pre-training with dictionary. arXiv preprint arXiv:2110.06490.
  • Zhao et al. (2024) Bowen Zhao, Zander Brumbaugh, Yizhong Wang, Hannaneh Hajishirzi, and Noah A Smith. 2024. Set the clock: Temporal alignment of pretrained language models. arXiv preprint arXiv:2402.16797.
  • Zhu et al. (2020) Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363.

Appendix A Prompt for Generation of changes

The instruction prompt that we use to generate the change from an LLM is given below:

You are a helpful knowledge management expert, and you excel at identifying critical and fundamental changes in Wikipedia entities. You will be given an Wikipedia entity: {ENTITYNAME} with text in timestamp1: {TIMESTAMP1}, and another timestamp timestamp2:{TIMESTAMP2} where there has been some critical change(s) to the entity in between {TIMESTAMP1} and {TIMESTAMP2}. Your task is to identify the changes that happened between {TIMESTAMP1} and {TIMESTAMP2} for the {ENTITYNAME}. Only give me the most critical information.
The input contains the following:
entity: The name of the entity
timestamp1: The timestamp of the text, which is provided
text1: The article text of the entity in timestamp1
timestamp2: The timestamp for which you have to identify the changes
Your output must be in the format of a JSON with the {ENTITYNAME} as its key and python list of changes as value, formatted as follows:
OUTPUT:
{
    ENTITYNAME: python list of changes

}
If nothing really meaningful happened to {ENTITYNAME} during that timespan, return an empty JSON. You must return only the JSON as the output. Do not return anything else except the JSON.
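As a rough illustration of how this prompt could be used programmatically, the sketch below fills the three placeholders and parses the returned JSON into a list of changes. The function names, the abbreviated `INSTRUCTION` string, and the way the input fields are serialized after `INPUT:` are assumptions made for illustration; they are not taken from the paper.

```python
import json

# Abbreviated stand-in for the full instruction text (assumption); the
# placeholder names ENTITYNAME, TIMESTAMP1 and TIMESTAMP2 are the ones used
# in the prompt above.
INSTRUCTION = (
    "Identify the changes that happened between {TIMESTAMP1} and "
    "{TIMESTAMP2} for the {ENTITYNAME}."
)


def build_prompt(entity, timestamp1, text1, timestamp2):
    """Fill the placeholders and append the four input fields."""
    instruction = (
        INSTRUCTION.replace("{ENTITYNAME}", entity)
        .replace("{TIMESTAMP1}", timestamp1)
        .replace("{TIMESTAMP2}", timestamp2)
    )
    fields = {
        "entity": entity,
        "timestamp1": timestamp1,
        "text1": text1,
        "timestamp2": timestamp2,
    }
    # Serializing the fields as JSON after "INPUT:" is an assumed layout.
    return instruction + "\nINPUT: " + json.dumps(fields, ensure_ascii=False)


def parse_changes(response, entity):
    """Return the list of changes for `entity`, or [] for empty or invalid JSON."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        payload = json.loads(response[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    changes = payload.get(entity, [])
    return [str(c) for c in changes] if isinstance(changes, list) else []
```

Plain string replacement is used instead of `str.format` because the full prompt contains literal braces in its JSON example, which `str.format` would try to interpret as fields.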

Appendix B Prompt-based change detection experiment

The instruction prompt that we use to generate the change label with an LLM is as follows.

You are a helpful knowledge management expert and a binary classifier, and you excel at identifying critical and fundamental changes in Wikipedia entities. You will be given an Wikipedia entity: {ENTITYNAME} with text in two different versions or in two different timestamps where there has been some critical change(s) to the entity in between the texts. Your task is to identify the changes that happened between two texts of the {ENTITYNAME} and provide a label.
The input contains the following:
entity: The name of the entity
text1: The first revision text of the article in the timestamp given with the text.
text2: The second revision text of the article in the timestamp given with the text.
Both text1 and text2 will be formatted as follows:
<t> entity name </t> <y> timestamp </y> article text.
Your output must be in the format of a JSON with two keys label, formatted as follows:
OUTPUT: 0 or 1
Where
Label: 0 if there is no critical change in the entity between the texts
Label: 1 if there is a critical change.
A syntactic or grammatical change is not a critical change, only information change/update is considered a critical change. You must return only the label, nothing else except the label(0 or 1).
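The revision format and the binary output described in this prompt can be sketched as follows. Both helper names are hypothetical, and the lenient label parsing (taking the first standalone 0 or 1 in the response) is an assumption about how one might handle responses that include extra text; the paper does not describe its parsing.

```python
import re


def format_revision(entity, timestamp, text):
    """Render one revision as '<t> entity </t> <y> timestamp </y> text'."""
    return f"<t> {entity} </t> <y> {timestamp} </y> {text}"


def parse_label(response):
    """Return 0 or 1 from a model response, or None if no standalone label is found."""
    # \b on both sides avoids picking a digit out of a longer number such as "10".
    match = re.search(r"\b([01])\b", response)
    return int(match.group(1)) if match else None
```

Returning `None` rather than defaulting to a class keeps unparseable responses visible, so they can be counted separately instead of silently inflating one label.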

The in-context examples used with this prompt for few-shot learning are as follows.

1. Positive Example:

INPUT: { ’entity’: ’Blake Harrison’,

’text1’: ’<t> Blake Harrison </t> <y>2018-01-13T23:28:45Z </y> Blake Harrison (born 1985) is an English actor, best known for playing Neil Sutherland in the BAFTA-winning E4 comedy The Inbetweeners. Blake starred in three series and two subsequent films of the multi-award-winning comedy The Inbetweeners. Harrison’s other television work includes the BBC Three comedies Way to Go and Him & Her, Comedy Central’s Big Bad World, The Bleak Old Shop of Stuff, and The Bill. Harrison also starred in all three seasons of The Increasingly Poor Decisions of Todd Margaret, created by David Cross.’,
’text2’: ’<t> Blake Harrison </t> <y>2019-12-19T15:57:21Z </y> Blake Harrison (born 22 July 1985) is an English actor and dancer. Harrison starred in three series and two subsequent films of the multi-award-winning comedy The Inbetweeners. Harrison’s other television work includes the BBC Three comedies Way to Go and Him & Her, Comedy Central’s Big Bad World, The Bleak Old Shop of Stuff, and The Bill. Harrison also starred in all three seasons of The Increasingly Poor Decisions of Todd Margaret, created by David Cross. ’
}
OUTPUT:
1

2. Negative Example:

INPUT: { ’entity’: ’Miss Virginia USA’,

’text1’: ’<t> Miss Virginia USA </t> <y>2019-01-23T03:45:45Z </y> The Miss Virginia USA competition is the pageant that selects the representative for the state of Virginia in the Miss USA pageant. Virginia has been only moderately successful in terms of number of semi-finalists. They have had two Miss USAs. They are one of only four states to have had two Miss USAs in succession (the others being Illinois, Texas, and District of Columbia).’,
’text2’: ’<t> Miss Virginia USA </t> <y>2020-12-17T23:48:27Z </y> The Miss Virginia USA competition is the pageant that selects the representative for the state of Virginia in the Miss USA pageant. Virginia has been only moderately successful in terms of number of semi-finalists. They have had two Miss USAs. They are one of only four states to have had two Miss USAs in succession (the others being Illinois, Texas, and District of Columbia). ’
}
OUTPUT:
0
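One plausible way to combine the instruction, the two labeled examples, and a new query into a single few-shot prompt is sketched below. The function names and the exact concatenation layout are assumptions; the paper does not state whether the examples were sent as one string or as separate chat turns.

```python
def render_input(entity, text1, text2):
    """Render one INPUT record in the dict-like layout of the examples."""
    return (
        f"INPUT: {{ 'entity': '{entity}',\n"
        f"'text1': '{text1}',\n"
        f"'text2': '{text2}'\n}}"
    )


def build_few_shot_prompt(instruction, examples, query):
    """Concatenate the instruction, the labeled examples, and the unlabeled query.

    examples: iterable of (entity, text1, text2, label) tuples.
    query:    (entity, text1, text2) tuple for the pair to classify.
    """
    parts = [instruction]
    for entity, text1, text2, label in examples:
        parts.append(render_input(entity, text1, text2) + f"\nOUTPUT:\n{label}")
    # The final record ends with an open OUTPUT: slot for the model to complete.
    parts.append(render_input(*query) + "\nOUTPUT:")
    return "\n\n".join(parts)
```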

Appendix C Models and training details

We use the chat/instruct versions of the models from Hugging Face in our experiments, fine-tuning them using LoRA Hu et al. (2021). Each model is loaded in 4-bit precision and configured for the ‘SEQ_CLS’ (sequence classification) task. We train the models on one NVIDIA A100 GPU and run inference on one NVIDIA RTX 4090; training takes approximately 2 hours per epoch. The training details for the models are listed below.

  • Learning Rate: 2e-6

  • Optimizer: paged_adamw_8bit

  • Batch size (train/eval): 1

The models and their Hugging Face repository names are listed in Table 5.

| Model Name | Hugging Face Repository |
|---|---|
| llama2-7b | meta-llama/Llama-2-7b-chat-hf |
| llama2-13b | meta-llama/Llama-2-13b-chat-hf |
| llama3-8b | meta-llama/Meta-Llama-3-8B-Instruct |
| mistral-7b | mistralai/Mistral-7B-Instruct-v0.3 |
Table 5: List of models used in our experiments and their Hugging Face repositories.
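The settings in this appendix could be expressed with the Hugging Face `transformers` and `peft` libraries roughly as follows. This is a minimal sketch under stated assumptions, not the authors' training script: the function name is hypothetical, `num_labels=2` assumes the binary change/no-change setup, and the LoRA rank, alpha, and dropout are not reported, so the library defaults are used.

```python
# Repository names from Table 5.
MODEL_REPOS = {
    "llama2-7b": "meta-llama/Llama-2-7b-chat-hf",
    "llama2-13b": "meta-llama/Llama-2-13b-chat-hf",
    "llama3-8b": "meta-llama/Meta-Llama-3-8B-Instruct",
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.3",
}

# Reported settings, written as keyword arguments for transformers.TrainingArguments.
TRAINING_KWARGS = {
    "learning_rate": 2e-6,
    "optim": "paged_adamw_8bit",
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
}


def load_lora_classifier(model_name, num_labels=2):
    """Load a model in 4-bit and wrap it with a LoRA adapter for SEQ_CLS."""
    # Imported lazily so the settings above can be inspected without a GPU stack.
    from peft import (
        LoraConfig,
        TaskType,
        get_peft_model,
        prepare_model_for_kbit_training,
    )
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_REPOS[model_name],
        num_labels=num_labels,  # assumption: binary change / no-change labels
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    )
    model = prepare_model_for_kbit_training(model)
    # LoRA rank, alpha and dropout are not reported; library defaults apply here.
    return get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_CLS))
```

A trainer could then be configured with `TrainingArguments(output_dir="out", **TRAINING_KWARGS)`, where `output_dir` is a placeholder path.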