Latent Logic Tree Extraction for Event Sequence Explanation from LLMs

Zitao Song    Chao Yang    Chaojie Wang    Bo An    Shuang Li
Abstract

Modern high-stakes systems, such as healthcare or robotics, often generate vast streaming event sequences. Our goal is to design an efficient, plug-and-play tool to elicit logic tree-based explanations from Large Language Models (LLMs) to provide customized insights into each observed event sequence. Built on the temporal point process model for events, our method employs the likelihood function as a score to evaluate generated logic trees. We propose an amortized Expectation-Maximization (EM) learning framework and treat the logic trees as latent variables. In the E-step, we evaluate the posterior distribution over the latent logic trees using an LLM prior and the likelihood of the observed event sequences. The LLM provides a high-quality prior for the latent logic trees; however, since the posterior is built over a discrete combinatorial space, we cannot obtain a closed-form solution. We propose to generate logic tree samples from the posterior using a learnable GFlowNet, which is a diversity-seeking generator for structured discrete variables. The M-step employs the generated logic rules to approximate marginalization over the posterior, facilitating the learning of model parameters and refining the tunable LLM prior parameters. In the online setting, our locally built, lightweight model iteratively extracts the most relevant rules from LLMs for each sequence using only a few iterations. Empirical demonstrations showcase the promising performance and adaptability of our framework.

Machine Learning, ICML

1 Introduction

Modern systems, such as healthcare, finance, and social media, produce voluminous data that are represented as discrete events with irregular timestamps. Generating concise and human-readable knowledge to explain this intricate event data is of great scientific and practical value. The distilled knowledge can be generalized to other contexts (Ullman et al., 2012; Campero et al., 2018).

For example, in healthcare, electronic health records (EHRs) are often represented as discrete event sequences, containing fine-grained time and type information on doctors’ treatments, patients’ measurements, and symptoms. It is desirable to generate concise medical knowledge such as the disease phenotypes and therapies, to shed light on these messy events. This will facilitate a deeper understanding of each patient’s unique health journey and medical decisions, ultimately leading to more effective and individualized care.

However, the heterogeneity observed in each patient’s data poses a challenge – each event sequence may exhibit diverse natures of medical histories, treatments, and conditions (Henrich & McElreath, 2003; Laland, 2004). Generating the most relevant and accurate knowledge to explain such heterogeneous data requires sophisticated methods.

Figure 1: GPT can help with last-event-type prediction. We found that replacing the semantically meaningful event names with numerical event IDs in the event history degrades event prediction performance.

Recently, Large Language Models (LLMs) have demonstrated promising human-like reasoning abilities as few-shot learners (Brown et al., 2020). When prompted with step-wise explanations of reasoning, these models excel in logical reasoning (Pan et al., 2023), abstract pattern induction (Webb et al., 2023), and social learning (Leng & Yuan, 2023). Despite their success in text-based reasoning tasks, LLMs still face challenges in extending their reasoning capabilities to tabular data (Hegselmann et al., 2023) and discrete event sequences (Shi et al., 2023).

In this paper, we propose to leverage LLMs, trained on general-domain data, as a prior to generate human-readable knowledge. Specifically, we will encourage LLMs to generate logic trees from their prior distribution $p(\mathcal{R})$. Given the observed discrete event data $\mathbf{X}$, the belief on the logic trees will be updated according to Bayes' rule, i.e., $p(\mathcal{R}|\mathbf{X}) \propto p(\mathcal{R})\,p(\mathbf{X}|\mathcal{R})$ (Leng & Yuan, 2023; Acemoglu et al., 2011), where we will use a temporal logic point process (TL-PP) (Li et al., 2020) to model $p(\mathbf{X}|\mathcal{R})$. The inference procedure can be regarded as reweighting each logic tree from the LLM. The goal of our paper is to perform logic tree inference using an LLM prior in a tractable and efficient manner.
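To make the reweighting view concrete, the following minimal sketch normalizes prior times likelihood over a small, explicitly enumerated set of candidate trees. This is only an illustration: the scores are made-up numbers, the function name `reweight_trees` is hypothetical, and exhaustive enumeration is exactly what becomes infeasible in the combinatorial space of real logic trees.

```python
import math

def reweight_trees(log_prior, log_lik):
    """Normalize prior * likelihood over a finite set of candidate trees."""
    log_post = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    shift = max(log_post)  # subtract the max for numerical stability
    unnorm = [math.exp(v - shift) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical scores for three candidate trees (illustrative values only).
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]  # stand-in for log p(R)
log_lik = [-12.0, -10.0, -15.0]                            # stand-in for log p(X | R)
posterior = reweight_trees(log_prior, log_lik)
```

In this toy case the second tree receives the largest posterior weight even though the prior favors the first, because its likelihood is higher.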

Performing inference on logic trees is challenging since the posterior distribution is intractable due to their discrete combinatorial space. Traditional solutions, such as MCMC, approximate intractable posteriors by sampling, yet struggle with multi-modal distributions (Miao et al., 2019; Zhang et al., 2020; Lew et al., 2023). Reinforcement learning (RL) methods like proximal policy optimization (PPO) (Schulman et al., 2017), which treat the sampling process as a policy, may also fail to capture the distribution's full diversity (Zhu et al., 2023). The problem becomes worse when the target distribution is incorrectly specified, leading to an overoptimized policy. In our paper, we will address the inference challenge using the GFlowNet (Bengio et al., 2023), a recently proposed sound diversity-seeking generative model for structured discrete variables. As a deep RL algorithm adept at managing unnormalized rewards, GFlowNet has shown effectiveness in fine-tuning Large Language Models with intractable thought posteriors (Hu et al., 2023a). We wish to extend its success to Tree-of-Thoughts (ToT) reasoning (Yao et al., 2023), such as generating logic trees to explain event sequences.

Our overall learning follows an amortized EM algorithm, where we treat the logic tree as latent variables. In the E-step, we train a GFlowNet to generate logic tree samples from their posterior distribution. The GFlowNet model parameters are shared across all training event sequences, which is the reason why we term it amortized EM. In the M-step, we use the generated logic tree samples to approximate marginalization over the posterior. This process provides an objective function for learning the TL-PP model parameters and refining some tunable LLM parameters (assuming the LLM priors are also learnable). The algorithm iterates between the E-step and M-step until convergence. During the testing stage, given a new event sequence, we employ the trained GFlowNet to efficiently perform inference for explanatory logic trees by sampling from the posterior $p(\mathcal{R}|\mathbf{X}) \propto p(\mathcal{R})\,p(\mathbf{X}|\mathcal{R})$, leveraging well-trained priors and models from the training stage. This enables our method to efficiently and adaptively explain previously unseen event sequences. Our main contributions are:
(i) We introduce LaTee, an amortized EM learning framework that can learn to infer and generate Latent logic Trees to explain observed event sequences, and that leverages LLMs as the prior;
(ii) In the E-step, we use GFlowNets to fine-tune LLMs and enable diverse logic tree generation, which better tackles the heterogeneity exhibited by event sequences;
(iii) Our method achieves a relative margin of 20% over SOTA attention-based temporal point process (TPP) models on future event prediction on real-world behavior datasets. This shows that our interpretable and knowledge-driven TPP model is also flexible.

2 Related Works and Background

2.1 Knowledge Extraction from Event Sequences

Knowledge extraction refers to the process of refining, condensing, or summarizing large volumes of raw data to distill the most relevant and essential information. For noisy event sequences, we will represent our knowledge as a collection of symbolic logic trees, each of which is a hierarchical and structured representation of logical relationships among different elements or propositions (Campero et al., 2018). Our logic tree extraction from events is related to symbolic rule induction and semantic cognition.
Symbolic Rule Induction. Symbolic rule induction refers to the process of automatically discovering logical rules from observed data. Classic symbolic inductive logic programming (ILP) methods (Quinlan, 1990; Cropper & Tourret, 2020) mostly adopt discrete search in the space of logic programs and do very well at generalizing from just a few examples. Neuro-symbolic rule induction methods (Evans & Grefenstette, 2018; Yang et al., 2017; Rocktäschel & Riedel, 2017; Campero et al., 2018) are differentiable ILP methods and are generally more robust to noisy input. In our approach, we take inspiration from a differentiable backward chaining algorithm (Rocktäschel & Riedel, 2017) and represent a symbolic logic tree starting from the target predicates. For instance, consider a Put-into task as our target predicate, in which we need to move element $X$ from box $Y_1$, room $Z_1$ into box $Y_2$, room $Z_2$. We can represent this actionable strategy as a set of ordered logic rules:

\begin{align}
\textit{Put-into}(X,Y) &\leftarrow \textit{Open}(Y) \wedge \textit{Pick-up}(X), \tag{1}\\
\textit{Pick-up}(X) &\leftarrow \textit{Open}(Y), \tag{2}\\
\textit{Open}(Y) &\leftarrow \textit{Move-to}(Z), \tag{3}
\end{align}

which is a logic tree, with $\textit{Put-into}(X,Y)$ being the root and other predicates being its children. Many classic or differentiable ILP methods can automatically learn such rules from data; however, they require carefully hand-crafted rule templates for each ILP task in order to constrain and reduce the search space effectively (Glanois et al., 2022).
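As a small illustration of how rules (1)-(3) form a tree, the sketch below stores each rule as a mapping from a head predicate to its body predicates and enumerates the root-to-leaf paths by backward chaining from the target. The representation is an assumption made here for illustration: variable bindings such as $X$, $Y$, $Z$ are dropped, the rule set is assumed acyclic, and the names `RULES` and `paths_from` are hypothetical.

```python
# Head predicate -> body predicates, mirroring rules (1)-(3) with variables dropped.
RULES = {
    "Put-into": ["Open", "Pick-up"],
    "Pick-up": ["Open"],
    "Open": ["Move-to"],
}

def paths_from(head, rules):
    """Enumerate root-to-leaf paths by expanding each body predicate recursively."""
    children = rules.get(head, [])
    if not children:  # a predicate with no rule is a leaf
        return [[head]]
    return [[head] + tail
            for child in children
            for tail in paths_from(child, rules)]

paths = paths_from("Put-into", RULES)
for p in paths:
    print(" <- ".join(p))
```

Each printed line is one path starting at the root predicate, which is the kind of object the rule-informed features in Section 2.2 are defined over.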
Semantic Cognition. Semantic cognition refers to the development of systems that can comprehend and manipulate meaning in a manner similar to human cognitive processes. It explores how knowledge is organized, represented, and utilized to derive semantic understanding from various forms of data. Previous research has described it as a process similar to reducing logical dimensions (Katz et al., 2008; Ullman et al., 2012) through employing probabilistic generative models. These models are capable of learning both logical rules and fundamental relationships that explain the data observed. Similar to ILP methods, they can perform deductive reasoning using logical rules. However, unlike traditional ILP methods, these models can also induce facts. While these approaches showed potential, they faced significant issues with scalability. The recent advancements in social learning in LLMs (Leng & Yuan, 2023) suggest that it might be beneficial to reexamine these concepts.

2.2 Knowledge-Driven Probabilistic Models for Event Sequences

Temporal point processes (TPPs) provide an elegant probabilistic model for continuous-time event sequences, characterized by an intensity function. The intensity function represents the occurrence rate of events and is usually modeled in parametric, nonparametric, or deep neural network form. Traditional parametric TPP models like the Hawkes process offer interpretability, but their simplicity limits flexibility. On the other hand, neural-based models, such as RMTPP (Du et al., 2016) and Transformer Hawkes (Zuo et al., 2020), provide expressiveness but are often criticized for their black-box nature, which hinders their application in high-stakes scenarios. In this paper, we aim to generate logic trees from a fine-tuned LLM to inform the functional form of the intensity, which strikes a balance between model flexibility and interpretability. The modeling idea takes inspiration from TL-PP (Li et al., 2020), as detailed below.
Rule-informed Event Sequences. We will build a rule-informed conditional intensity function for the event sequences as:

$$\lambda(t; w, \mathcal{R}, \mathbf{X}_t) := \exp\Big\{\sum_{f\in\mathcal{R}} w_f\,\phi_f(\mathbf{X}_t) + b(t)\Big\}, \tag{4}$$

where $f$ is a valid path from the symbolic logic tree $\mathcal{R}$, $\phi_f(\mathbf{X}_t)$ is the logic-informed feature derived from the number of ordered event combinations in event history $\mathbf{X}_t$ satisfying the path $f$ (more details can be found in Li et al., 2020), and $w_f$ is the weight corresponding to rule $f$. Given this probabilistic model, we can use the negative log-likelihood of the temporal point process as a loss function to jointly learn weights $w$ and symbolic structure $\mathcal{R}$. Given an event sequence $\mathbf{X} = \{(t_i, e_i)\}_{i=1}^{L}$ observed over an interval $[0,T]$, the negative log-likelihood of $\mathbf{X}$ is expressed as:

$$\mathcal{L}_{w,\mathcal{R}}(\mathbf{X}) = -\sum_{j=1}^{L} \log\lambda(t_j; w, \mathcal{R}, \mathbf{X}_j) + \int_{0}^{T} \lambda(t; w, \mathcal{R}, \mathbf{X}_t)\,dt, \tag{5}$$

where each $t_j$ is the event trigger time and $\mathbf{X}_j$ refers to the event sequence up to $t_j$. Nevertheless, this learning problem is challenging because it requires learning the parameters $w$ in a continuous space as well as the symbolic structure $\mathcal{R}$ in a discrete space.
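As a rough illustration of Eq. (4) and Eq. (5), the sketch below evaluates the rule-informed intensity and the negative log-likelihood for a single target event type. It rests on simplifying assumptions that are not from the paper: each rule is an ordered list of event types, the feature $\phi_f$ is a plain count of ordered matches in the history, the base term $b(t)$ is a constant, and only events of the target type contribute log-intensity terms while other events act as history. Under these assumptions the intensity is piecewise constant between events, so the integral reduces to a sum. All function and variable names are hypothetical.

```python
import math

def count_ordered_matches(history_types, pattern):
    """Count ordered (not necessarily contiguous) occurrences of pattern in history."""
    dp = [1] + [0] * len(pattern)  # dp[k]: ways to match the first k pattern symbols
    for e in history_types:
        for k in range(len(pattern), 0, -1):
            if pattern[k - 1] == e:
                dp[k] += dp[k - 1]
    return dp[len(pattern)]

def intensity(history_types, rules, weights, base):
    """Eq. (4) with count features and a constant base term (assumptions)."""
    score = sum(w * count_ordered_matches(history_types, f)
                for f, w in zip(rules, weights))
    return math.exp(score + base)

def neg_log_likelihood(events, target, rules, weights, base, horizon):
    """Eq. (5) for time-sorted (time, type) events observed on [0, horizon]."""
    nll, history, prev_t = 0.0, [], 0.0
    for t, e in events:
        lam = intensity(history, rules, weights, base)
        nll += lam * (t - prev_t)      # exact integral: intensity is constant here
        if e == target:
            nll -= math.log(lam)       # log-intensity at a target event
        history.append(e)
        prev_t = t
    nll += intensity(history, rules, weights, base) * (horizon - prev_t)
    return nll

# Toy trajectory matching the Put-into example; all numbers are illustrative.
events = [(1.0, "Move-to"), (2.0, "Open"), (3.0, "Pick-up"), (4.0, "Put-into")]
nll = neg_log_likelihood(events, "Put-into", [["Move-to", "Open"]], [0.5], -1.0, 5.0)
```

With time-varying features or a non-constant $b(t)$, the integral would instead need numerical quadrature or Monte Carlo estimation.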

Figure 2: The Architecture of the Proposed Framework. The presented event history represents a typical human trajectory: beginning with relocating to a new place, opening a box, picking up an object, and culminating in placing the retrieved item in a specific location. In the training phase, we first convert this explicit event history into a textual format. Subsequently, we employ an LLM to execute conditional sampling and perform backward reasoning, starting from the predetermined goal (E-Step). The resultant reasoning pathway is then transformed into a symbolic logic tree, which aids in updating the event probabilities (M-Step). In this context, the campfire icon signifies that the model is being updated, while the snowflake icon indicates that the model is in a ‘frozen’ state. We use the thickness of a path in the symbolic logic tree to represent its posterior probability.

2.3 Human-Like Reasoning in LLMs

Recent developments in LLMs, such as GPT-4 (Achiam et al., 2023) and Llama 2 (Touvron et al., 2023), have extended the capabilities of AI beyond conventional predictive analytics to simulate sophisticated human-like interactions in various systems (Gao et al., 2023). In-context learning (ICL) within LLMs is a notable feature where the model performs tasks based on input-output examples without adjusting any parameters. Importantly, ICL can be understood through a Bayesian inference framework (Xie et al., 2021), where the augmented prompt serves as a semantic prior, guiding latent concepts acquired during pre-training for chain-of-thought reasoning and subsequent output (Wei et al., 2022; Kojima et al., 2022). Despite the transformative nature of ICL in LLMs, which allows them to adapt to new tasks without explicit retraining, challenges remain in explaining extrapolation to unseen tasks and understanding the impact of model architecture and optimization. Conversely, knowledge extraction from locally deployable LLMs, achieved through careful fine-tuning (Schick & Schütze, 2020) using Parameter-Efficient Fine-Tuning (PEFT) techniques (Hu et al., 2021; Dettmers et al., 2023), also provides valuable insights. We will adopt PEFT ideas in this paper.

2.4 Intractable Bayesian Inference in LLMs

The challenge in inferring the latent logical reasoning path from LLMs stems from the intractability of the posterior. Given a question-answer pair $(X,Y)$, the posterior of the latent chain-of-thought $p_{LM}(Z|X,Y) = \frac{p_{LM}(X,Z,Y)}{\sum_{Z'} p_{LM}(X,Z',Y)}$ is intractable due to the discrete combinatorial space for thoughts $Z$ (Hu et al., 2023a). Existing approaches to address this intractable inference problem in language models often resort to tokenwise approximations using techniques like tempered and contrastive sampling (Malkin et al., 2021; Li et al., 2022), along with problem-specific strategies like beam search and local search techniques (Lu et al., 2021; Sha, 2020). In our paper, we will use GFlowNets to guide posterior sampling of logic trees and fine-tune LLMs.

GFlowNets as Posterior Samplers in LLMs. GFlowNets (Bengio et al., 2021, 2023) were originally introduced as a diversity-seeking probabilistic reinforcement learning algorithm for molecular discovery. A recent work (Hu et al., 2023a) connects GFlowNets to Chain-of-Thought (CoT) generation by leveraging the amortized inference ability of GFlowNets. In their work, given an unnormalized density (reward) $R: \mathcal{Z} \to \mathbb{R}_{>0}$, GFlowNets learn policies to sample sequences at the token level, i.e., $\mathbf{Z}_t = z_1 z_2 \cdots z_t \texttt{T} \in \mathcal{Z}$ (where $z_i$ is a language token and $\texttt{T}$ denotes the End of Sentence token), as if they were sampling from a target distribution. The goal of GFlowNets is to fine-tune a token-level language generation policy $q_{GFN}(\mathbf{Z}_t|\mathbf{Z}_{t-1};\theta)$, initialized by an LLM, such that the marginal satisfies $q_{GFN}(\mathbf{Z}_t) \propto R(\mathbf{Z}_t)$, i.e., the marginal likelihood of generating a complete sequence is proportional to its reward. 
The learning objective for GFlowNets is defined by the subtrajectory balance (SubTB) objective, equivalent to the path consistency objective (Nachum et al., 2017; Deleu et al., 2024; Tiapkin et al., 2024) in Max-Entropy RL (Haarnoja et al., 2017).
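For intuition, the sketch below computes a SubTB-style loss for one sampled sequence in the setting where the sampler may terminate after any prefix. The formula follows our reading of the objective used by Hu et al. (2023a) and should be treated as an assumption rather than the exact loss used in this paper; the function name `subtb_loss` and the array layout are hypothetical.

```python
import math

def subtb_loss(log_reward, log_q_step, log_q_stop):
    """
    SubTB-style loss for one sequence of n tokens (illustrative form).
    log_reward[i]: log R of the length-i prefix if terminated there, i = 0..n
    log_q_step[k]: log q(z_{k+1} | z_{1:k}),                        k = 0..n-1
    log_q_stop[i]: log q(T | z_{1:i}),                              i = 0..n
    """
    n = len(log_q_step)
    cum = [0.0]                      # cum[j] - cum[i] = sum of step log-probs from i to j
    for lp in log_q_step:
        cum.append(cum[-1] + lp)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Balance residual between terminating at prefix i and at prefix j.
            delta = (log_reward[i] + (cum[j] - cum[i]) + log_q_stop[j]
                     - log_reward[j] - log_q_stop[i])
            loss += delta * delta
    return loss

# Toy check: one possible token, rewards R(empty) = 1 and R(z1) = 3.
# A sampler that stops immediately with probability 1/4 matches the rewards.
balanced = subtb_loss([0.0, math.log(3.0)], [math.log(0.75)], [math.log(0.25), 0.0])
```

The residual vanishes when the sampler's termination probabilities are proportional to the rewards, which is the property the E-step relies on.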

3 Our Proposed LaTee using LLM prior

Instead of eliciting linear thoughts from LLMs, our focus is to extract and reweight symbolic logic trees generated from LLMs to explain the dynamics of the observed event sequences. We hope the obtained logic trees will not only offer personalized explanations for each event sequence but also enable accurate future event prediction.

3.1 LLM-Symbolic Integration by Latent Variables

We are given an event sequence $X$ and the next event type $Y$, where $X = \{(t_i, e_i)\}_{i=1}^{k}$ records $k$ explicit events, $t_i$ is the $i$-th event time, $e_i$ is the $i$-th event type, and $Y = e_{k+1}$ is the next event type after the $k$-th event. We are interested in finding a collection of latent symbolic logic trees $\mathcal{R}$, which are composed of various event types that trigger subsequent events and best explain the likelihood of the observed event sequence:

$$p(X,Y) = \sum_{\mathcal{R}} p(X,Y|\mathcal{R})\,p(\mathcal{R}). \tag{6}$$

For this mixture latent variable model (LVM), we treat $\mathcal{R}$ as latent variables; $p(\mathcal{R})$ is the prior distribution for the latent logic trees; and the joint likelihood of the event sequences $p(X,Y|\mathcal{R})$ can be derived from the temporal point process framework, as shown in Eq. (5).

We will employ LLMs as the prior $p(\mathcal{R})$ for logic trees. Additionally, if we aim to leverage the powerful reasoning and generation capabilities of LLMs to predict $Y$ (for instance, in the context of symptom-treatment pairs or question-answer pairs for $(X,Y)$), it becomes intriguing to explore the recommendations of $Y$ given $X$ provided by LLMs. Consequently, we further decompose the mixture LVM (as shown in Eq. (6)) as:

\begin{align}
p(X,Y) &= \sum_{\mathcal{R}} p(X,Y|\mathcal{R})\,p_{LM}(\mathcal{R}), \tag{7}\\
&= \sum_{\mathcal{R}} p_{LM}(Y|X,\mathcal{R})\,p_{w}(X|\mathcal{R})\,p_{LM}(\mathcal{R};\phi). \tag{8}
\end{align}

We aim to jointly optimize the event likelihood parameter $w$ and the tunable parameters $\phi$ in the prior language model. The challenge in learning arises from the latent variables $\mathcal{R}$. Fortunately, the EM algorithm provides an effective tool for learning mixture models with latent variables. However, in the E-step, we need to analytically evaluate the current posterior $p(\mathcal{R}|X,Y) \propto p_{LM}(Y|X,\mathcal{R})\,p_{w}(X|\mathcal{R})\,p_{LM}(\mathcal{R})$, which is intractable because the partition function requires a summation over the discrete space of $\mathcal{R}$. To tackle this intractability, the variational EM algorithm (Dempster et al., 1977; Beal, 2003; Koller & Friedman, 2009) can be used to approximate the posterior by optimization. We will address this issue by introducing an amortized EM, where in the E-step we learn GFlowNets to sample from $p(\mathcal{R}|X,Y)$ without the need to calculate the partition function.

3.2 Amortized EM framework for Logic Tree Inference

The derivation of the Evidence Lower Bound (ELBO) for Eq. (7) is presented in Appendix E. It explains the rationale for analytically evaluating the posterior in the E-step to achieve a tight ELBO.
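For orientation, the standard Jensen-inequality argument behind such a bound can be sketched as follows. This is the textbook derivation, not a reproduction of Appendix E, and $q(\mathcal{R})$ is an arbitrary distribution over logic trees introduced here for illustration:

```latex
\begin{align*}
\log p(X,Y)
  &= \log \sum_{\mathcal{R}} q(\mathcal{R})\,
     \frac{p(X,Y|\mathcal{R})\,p_{LM}(\mathcal{R})}{q(\mathcal{R})} \\
  &\ge \sum_{\mathcal{R}} q(\mathcal{R})
     \log \frac{p(X,Y|\mathcal{R})\,p_{LM}(\mathcal{R})}{q(\mathcal{R})} \\
  &= \log p(X,Y)
     - \mathrm{KL}\big(q(\mathcal{R})\,\|\,p(\mathcal{R}|X,Y)\big).
\end{align*}
```

Since the KL term is non-negative and vanishes only when $q(\mathcal{R}) = p(\mathcal{R}|X,Y)$, the bound is tight exactly when the E-step distribution equals the true posterior.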

Specifically, in the E-step, we will draw samples from the posterior over the latent symbolic logic tree, denoted as $p_{LM}(\mathcal{R}|X,Y)$, which comes from an amortized sampler of $\mathcal{R}$ with an LLM as its policy. In the M-step, we maximize the expected log joint probability under the sampled latent variables, $\mathbb{E}_{\mathcal{R}\sim p(\mathcal{R}|X,Y)}[\log p_{LM}(Y|X,\mathcal{R})\,p_{w}(X|\mathcal{R})\,p_{LM}(\mathcal{R})]$, with respect to the parameters $w$. This combination of amortized inference (learning to sample the symbolic logic tree from the language model) and supervised learning (optimizing the likelihood model with the ‘supervision’ involving $\mathcal{R}$ sampled from the amortized posterior) is presented in Fig. 2. We illustrate them in detail in the sections below.

E-Step: Amortized Inference with GFlowNets. For inference in the high-dimensional discrete latent space, we leverage the probabilistic framework of GFlowNets (Bengio et al., 2021, 2023). To build a symbolic logic tree $\mathcal{R}$, we start from the root $\mathcal{R}_0 := \{z_0\}$, in which $z_0$ is the target predicate (which can be composed of multiple language tokens). We follow backward chaining (Rocktäschel & Riedel, 2017) to form a symbolic proof tree in a top-down fashion by prompting LLMs. We grow the logic tree one level deeper at a time based on the previous paths. Concretely, suppose $\mathcal{R}_t$ can be represented by $m$ paths, i.e., $\mathcal{R}_t := \{z_0^{(i)} z_1^{(i)} \cdots z_j^{(i)}\}_{i=1}^{m}$, where $z_j^{(i)} \in \mathcal{Z}$ is the $j$-th predicate presented in the $i$-th path from the predefined predicate space $\mathcal{Z}$. 
If the maximum number of nodes for each path to expand is constrained to $W$, the generative process from a symbolic logic tree $\mathcal{R}_t$ to $\mathcal{R}_{t+1}$ can be represented as:

$$\log q_{GFN}(\mathcal{R}_{t+1}|\mathcal{R}_t) := \sum_{i=1}^{m}\sum_{k=1}^{W+1} \log q_{LM}\big(z_{j+1}^{(i),k} \,\big|\, z_0^{(i)} \cdots z_j^{(i)}\big), \tag{9}$$

where $q_{LM}$ is the autoregressive sequence generation model and $z_{j+1}^{(i),k}$ are the next-level predicates chosen from $\mathcal{Z}_W \cup \{\mathtt{T}\}$, with $\mathcal{Z}_W \subseteq \mathcal{Z}$, $|\mathcal{Z}_W| = W$, and $\mathtt{T}$ denoting a stop symbol. The number of nodes in the symbolic logic tree thus grows as $O(W^n)$, and the tree does not stop expanding until all the paths reach the termination state $\mathtt{T}$, i.e.,

$$\log q_{GFN}(\mathtt{T}|\mathcal{R}_t) := \sum_{i=1}^{m} \log q_{LM}\big(\mathtt{T} \,\big|\, z_0^{(i)} \cdots z_j^{(i)}\big). \tag{10}$$
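Eqs. (9) and (10) are plain sums of per-path log-probabilities. The following minimal sketch shows that bookkeeping; the function names and the toy probabilities are illustrative assumptions, and the per-slot log-probabilities stand in for whatever the language model $q_{LM}$ would return.

```python
import math


def log_q_step(child_logprobs):
    """Eq. (9): sum log q_LM over every path i and every child slot k.

    child_logprobs[i][k] is a (hypothetical) log q_LM value for the k-th
    child chosen for path i, given the predicates already on that path.
    """
    return sum(sum(slots) for slots in child_logprobs)


def log_q_terminate(stop_logprobs):
    """Eq. (10): sum over paths of log q_LM(T | path)."""
    return sum(stop_logprobs)


# Toy numbers, not taken from the paper: m = 2 paths, W + 1 = 3 slots each.
step = log_q_step([
    [math.log(0.5), math.log(0.25), math.log(0.25)],
    [math.log(0.4), math.log(0.4), math.log(0.2)],
])
stop = log_q_terminate([math.log(0.9), math.log(0.8)])
```

Working in log space keeps the products over paths and slots numerically stable as the tree grows.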

The marginal likelihood of sampling a terminal logic tree $\mathcal{R}_t$ is given by

$$q_{GFN}(\mathcal{R}_t \to \mathtt{T}) = \int_{\tau = (\mathcal{R}_0 \leadsto \mathcal{R}_t)} \prod_{i=1}^{t} q_{GFN}(\mathcal{R}_i|\mathcal{R}_{i-1})\, q_{GFN}(\mathtt{T}|\mathcal{R}_t)\, d\tau$$

over trajectories $\tau$ starting at $\mathcal{R}_0$ and ending at $\mathcal{R}_t$. Notably, the goal of GFlowNet training is to fit the parametric policy $q_{GFN}(\cdot|\cdot;\theta)$ such that its terminating probability $q_{GFN}(\mathcal{R}_t \to \mathtt{T})$ is proportional to a predefined reward $r$. In our case, GFlowNet’s reward $r$ is defined as the posterior of logic trees $p(\mathcal{R}|X,Y)$, i.e., $r(\mathcal{R}|X,Y) \propto p_{LM}(Y|X,\mathcal{R})\, p_w(X|\mathcal{R})\, p_{LM}(\mathcal{R})$.
By construction, GFlowNet’s marginal terminating distribution is proportional to its reward function $r(\mathcal{R}|X,Y)$, so the final samples $\mathcal{R}$ drawn from the GFlowNet policy $q_{GFN}(\cdot|\cdot;\theta)$ follow the true posterior $p(\mathcal{R}|X,Y)$ up to normalization. Here, the reward function $r$ decomposes into a product of likelihood terms accumulated over the steps of the sampling sequence. In this case, a forward-looking SubTB loss (Madan et al., 2023) for GFlowNet can help local credit assignment (Hu et al., 2023a, b). The SubTB learning objective for a trajectory $\tau = (\mathcal{R}_0, \mathcal{R}_1, \cdots, \mathcal{R}_t)$ is:

$$\mathcal{L}_{SubTB}(\theta) = \sum_{0 \le i < j \le t} \left[ \log \frac{r(\mathcal{R}_i^{\mathtt{T}}) \prod_{k=i+1}^{j} q_\theta(\mathcal{R}_k|\mathcal{R}_{k-1})\, q_\theta(\mathtt{T}|\mathcal{R}_j)}{r(\mathcal{R}_j^{\mathtt{T}})\, q_\theta(\mathtt{T}|\mathcal{R}_i)} \right]^2, \tag{11}$$

where $q_\theta$ is the conditional GFlowNet policy initialized by a language model $p_{LM}$ conditioned on the prefix $X$ and $Y$. The detailed derivation of the SubTB loss is given in Appendix F. In practice, this loss can be minimized by gradient descent on $\theta$ using trajectories sampled either on-policy or off-policy, just as in reinforcement learning. To predict the event type $Y$ for an unseen event sequence $X$, one can draw samples of $\mathcal{R}$ from $q_\theta$ followed by sampling from $p_{LM}(Y|X,\mathcal{R})$.
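As an illustration of Eq. (11), the sketch below evaluates the SubTB objective for a single trajectory from precomputed log-quantities. It is a plain-Python reading of the formula under stated assumptions, not the authors' implementation: the function name, argument layout, and toy numbers are hypothetical, and in training these values would come from the policy $q_\theta$ and the reward model so that gradients can flow to $\theta$.

```python
import math


def subtb_loss(log_r, log_pf, log_stop):
    """Evaluate Eq. (11) for one trajectory R_0, ..., R_t.

    log_r[i]    = log r(R_i^T)            for i = 0..t
    log_pf[k-1] = log q(R_k | R_{k-1})    for k = 1..t
    log_stop[i] = log q(T | R_i)          for i = 0..t
    """
    t = len(log_r) - 1
    assert len(log_pf) == t and len(log_stop) == t + 1
    # prefix[i] holds the sum of forward log-probabilities for k = 1..i.
    prefix = [0.0]
    for lp in log_pf:
        prefix.append(prefix[-1] + lp)
    loss = 0.0
    for i in range(t):
        for j in range(i + 1, t + 1):
            forward = prefix[j] - prefix[i]
            numerator = log_r[i] + forward + log_stop[j]
            denominator = log_r[j] + log_stop[i]
            loss += (numerator - denominator) ** 2
    return loss


# Toy chain (not from the paper) where the policy matches the reward exactly.
log_pf = [math.log(0.5), math.log(0.5)]
log_stop = [math.log(0.5), math.log(0.5), math.log(1.0)]
log_r = [math.log(0.5), math.log(0.25), math.log(0.25)]
loss_matched = subtb_loss(log_r, log_pf, log_stop)

# Inflating the last reward by one nat breaks both sub-trajectories ending at R_2.
loss_mismatched = subtb_loss(log_r[:2] + [log_r[2] + 1.0], log_pf, log_stop)
```

In the matched case every squared term vanishes; in the mismatched case only the two sub-trajectories that end at the perturbed state contribute, which is the local credit assignment the text refers to.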

Algorithm 1 Bayesian Logic Tree Learning for Events
  Input: data pool $\{\mathcal{X},\mathcal{Y}\}$, rule weights $w$, tunable parameters $\theta$ for LLM as the GFlowNet policy, tunable parameters $\phi$ for LLM as the prior policy, optimization and exploration hyperparameters, threshold $\alpha$
  repeat
     sample batch data pair $(X,Y) \sim \{\mathcal{X},\mathcal{Y}\}$
     sample $\tau \sim q_\theta(\tau|X,Y)$; $\tau = (\mathcal{R}_0, \cdots, \mathcal{R}_T)$
     $r_t \leftarrow p_w(X|\mathcal{R}_t)\, p(Y|X,\mathcal{R}_t)\, p_\phi(\mathcal{R}_t)$, $t = 0, \cdots, T$
     $\mathcal{L}_{SubTB} \leftarrow$ SubTB loss in Eq. (11) along $\tau$ with reward $r_t$
     E-step: GD on $\theta$ with $\nabla_\theta \mathcal{L}_{SubTB}$
     if $\mathcal{L} < \alpha$ then
        sample $\tau \sim q_\theta(\tau|X,Y)$
        M-step: GD on $w$ and $\phi$ with $\nabla_{w,\phi} \mathcal{L}_{llh}$ in Eq. (12)
     end if
  until some convergence criteria
Figure 3: Empirical rule distributions sampled from the Language Model fine-tuned by three different approaches. We use 10,000 samples to depict the frequency distribution over the complete logic-rule search space with support $|\mathcal{R}| = 2500$. The x-axis represents these logic rules in a nominal 1-D format, where each point corresponds to a specific rule. The ordering of these points is not indicative of any inherent sequence.

M-Step: Model Parameter Updating. The marginal terminal distribution of GFlowNet is used as a variational approximation to the intractable posterior $p(\mathcal{R}|X,Y)$ to perform updates to the generative model’s parameters. Thus, for the event demonstrations $X$ and next event type $Y$, we can sample the underlying symbolic logic tree $\mathcal{R}$ from the policy of the conditional GFlowNet $q_{GFN}$ and perform gradient updates, in expectation, on the parameters $w$ of the event likelihood and the tunable parameters $\phi$ of the structure prior:

$$\mathcal{L}_{llh}(w,\phi) = \mathbb{E}_{\mathcal{R} \sim q_{GFN}(\mathcal{R} \to \mathtt{T})}\left[\log p_w(X|\mathcal{R}) + \log p_{LM}(Y|X,\mathcal{R}) + \log p_\phi(\mathcal{R})\right]. \tag{12}$$
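In practice the expectation in Eq. (12) has to be approximated. A minimal sketch, assuming a plain Monte Carlo estimate with $S$ logic trees sampled from the GFlowNet policy (the sample count $S$ and the superscript notation $\mathcal{R}^{(s)}$ are illustrative assumptions, not taken from the text):

```latex
\mathcal{L}_{llh}(w,\phi) \approx \frac{1}{S}\sum_{s=1}^{S}
\Big[\log p_w(X|\mathcal{R}^{(s)}) + \log p_{LM}(Y|X,\mathcal{R}^{(s)}) + \log p_\phi(\mathcal{R}^{(s)})\Big],
\qquad \mathcal{R}^{(s)} \sim q_{GFN}(\mathcal{R}\to\mathtt{T}).
```

With the samples held fixed, and if $p_{LM}(Y|X,\mathcal{R})$ is kept frozen, only the first term depends on $w$ and only the last on $\phi$, so the two gradients separate.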

It should be noted that the evolving nature of the generative models $p_w$, $p_{LM}$, and $p_\phi$ during joint optimization leads to a dynamic reward for the GFlowNet. Training alternates between E-steps and M-steps, and the number of GFlowNet updates between successive M-steps can be either predetermined or chosen adaptively. Following Hu et al. (2023b), we implement adaptive E-steps through loss thresholding: a moving average of the GFlowNet’s training loss serves as a measure of how accurately it approximates the true posterior, and an M-step gradient update is executed after a GFlowNet update only if this moving average drops below a set loss threshold. The overall algorithm is presented in Alg. 1.
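The loss-thresholding rule can be sketched as a small scheduler. This is a hedged illustration only: the text says "moving average" without specifying its form, so the exponential moving average, the `decay` parameter, and the class name below are assumptions.

```python
class LossThresholdScheduler:
    """Decide after each E-step whether an M-step should follow.

    Assumption: the moving average is an exponential moving average with
    a hypothetical `decay` factor; the source only says "moving average".
    """

    def __init__(self, threshold, decay=0.9):
        self.threshold = threshold
        self.decay = decay
        self.average = None

    def update(self, loss):
        """Fold the latest GFlowNet loss into the average; True means run an M-step."""
        if self.average is None:
            self.average = loss
        else:
            self.average = self.decay * self.average + (1.0 - self.decay) * loss
        return self.average < self.threshold


# Toy loss curve (not from the paper): the M-step is only triggered once the
# smoothed loss has fallen below the threshold, not on the first low value.
scheduler = LossThresholdScheduler(threshold=1.0, decay=0.5)
decisions = [scheduler.update(loss) for loss in [4.0, 2.0, 1.0, 0.5, 0.25]]
```

Smoothing the loss delays the first M-step until the GFlowNet has tracked the current reward for several updates, which matches the stated intent of using the loss as a proxy for posterior accuracy.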

3.3 Discussion

Comparison with ILP systems. In our approach, we harness LLMs to generate latent logic trees, replacing traditional symbolic ILP systems. While traditional ILP systems based on discrete space search excel at rule learning from minimal examples, they are sensitive to noisy input, and a single error can lead to malfunction. Neural-symbolic rule induction systems, on the other hand, are more robust to noise but struggle with few-shot learning and risk overfitting. Symbolic reasoning through LLMs integrates pretrained knowledge, addressing the challenge of learning rules from limited data. Additionally, our approach employs a step-by-step process reflecting human cognitive functions (Wei et al., 2022), enhancing semantic understanding from event sequences.

Comparison with GFlowNets-CoT. We share some similarities with GFlowNets-CoT (Hu et al., 2023a) in the Bayesian inference stage, in which the LLM is used as a probabilistic generative model that simultaneously generates logical rules and a set of core relations underlying them. However, we extend GFlowNets fine-tuning to a sentence-level symbolic logic tree $\mathcal{R}$, similar to sentence-level Tree-of-Thoughts (Yao et al., 2023), by directly querying sentence probabilities within a confined ‘sentence space’ for backward learning, as opposed to ToT’s forward-only inference. GFlowNets-CoT’s likelihood model relies only on $p_{LM}(X,Y,\mathcal{R})$, while ours is built upon a modular likelihood model that decomposes $p(X,Y|\mathcal{R})$ into the event likelihood $L_w$ and the language likelihood $p_{LM}$, giving the joint $p_{LM}(\mathcal{R})\, p_w(X|\mathcal{R})\, p_{LM}(Y|X,\mathcal{R})$.

4 Experiments

Table 1: Last event prediction performance on three real-world behavior datasets using both attention-based TPP models and the Language Model Opt-1.3B as predictors. ER stands for Error Rate and MR stands for Mean Rank. The performance is averaged over three different seeds, with the standard deviation reported in parentheses. The best performance is in bold.

| Method | MIMIC-3 ER (%) ↓ | MIMIC-3 MR ↓ | Epic-100 ER (%) ↓ | Epic-100 MR ↓ | StackOverflow ER (%) ↓ | StackOverflow MR ↓ |
|---|---|---|---|---|---|---|
| AttNHP (Yang et al., 2021) | 36.66(13.76) | 1.25(0.00) | 67.53(0.00) | 2.45(0.00) | **33.33(0.00)** | **1.95(0.10)** |
| Pt-AttNHP (Xue et al., 2023b) | 77.50(0.00) | 1.75(0.00) | 68.33(1.44) | 2.27(0.16) | 71.11(32.71) | 3.40(0.81) |
| $k$-shot CoT, $k = 0$ | 100.00(0.00) | 2.24(0.02) | 78.75(1.76) | 4.75(0.00) | 100.00(0.00) | 3.33(0.00) |
| $k$-shot CoT, $k = 1$ | 100.00(0.00) | 2.23(0.00) | 76.25(1.57) | 4.66(0.02) | 100.00(0.00) | 3.33(0.00) |
| $k$-shot CoT, $k = 3$ | 100.00(0.00) | 2.23(0.00) | 76.25(0.02) | 4.63(0.00) | 100.00(0.00) | 3.13(0.00) |
| ToT (depth $= 3$, width $= 3$) | 100.00(0.00) | 2.23(0.00) | 71.25(1.76) | 4.69(0.12) | 96.67(4.71) | 3.07(0.09) |
| SFT fine-tuning | 82.50(2.50) | 2.14(0.03) | 75.83(1.44) | 4.38(0.11) | 93.33(6.67) | 4.44(0.30) |
| PPO fine-tuning | 77.50(0.00) | 2.55(0.00) | 77.50(0.00) | 3.99(0.03) | 73.33(6.67) | 3.29(0.32) |
| GFN fine-tuning | **27.50(8.66)** | **1.14(0.05)** | **55.25(9.01)** | **2.15(0.48)** | 33.45(5.12) | 2.23(0.23) |

4.1 Experimental Setup

Datasets and Evaluation Setup. Our study involves one synthetic and three real-world event sequence datasets, containing both semantic and non-semantic information. We view events in these datasets as predicates that can form a symbolic logic tree. For each sequence, we focus on predicting the final event. For the synthetic dataset, we create sequences of event predicates sampled from a prespecified TL-PP using the thinning algorithm (Ogata, 1981). The functional form of the intensity is informed by the predefined logic rules. Among the real-world datasets, the first is MIMIC-III (Johnson et al., 2016), an electronic health record dataset from intensive care unit patients. We use various lab measurements and treatment approaches as event predicates. The second is EPIC-KITCHENS-100 (EPIC-100) (Damen et al., 2021), which documents everyday kitchen activities from a first-person perspective over several days, with actions labeled. We analyze these labeled actions in sequence to predict the human’s next action based on their past activities. The third is StackOverflow (SO) (Leskovec & Krevl, 2014), which records the history of badges awarded on the question-answering website StackOverflow to promote engagement among its users. Each event in the sequence signifies the receipt of a particular medal. For all datasets, we consider each sequence as a record pertaining to a single individual and partition each dataset into 80%/10%/10% train/dev/test splits by the total population. More details about these datasets can be found in Appendix G.1.
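For readers unfamiliar with thinning, the sketch below simulates a single-type point process with a history-dependent intensity. It is a simplified illustration under stated assumptions: the toy intensity, the constant upper bound, and all names are hypothetical, and it does not reproduce the TL-PP intensity or the multi-type setting used for the synthetic data.

```python
import random


def simulate_by_thinning(intensity, upper_bound, horizon, rng):
    """Thinning-style simulation of event times on [0, horizon].

    `intensity(t, history)` must never exceed `upper_bound`.
    """
    t, events = 0.0, []
    while True:
        # Propose the next candidate from a homogeneous Poisson process.
        t += rng.expovariate(upper_bound)
        if t > horizon:
            return events
        # Keep the candidate with probability intensity / upper_bound.
        if rng.random() * upper_bound <= intensity(t, events):
            events.append(t)


def toy_intensity(t, history):
    """Hypothetical rule-informed intensity: a base rate that is boosted
    whenever some earlier event fell within the last time unit."""
    recent = any(t - s <= 1.0 for s in history)
    return 0.5 + (0.5 if recent else 0.0)


rng = random.Random(0)
events = simulate_by_thinning(toy_intensity, upper_bound=1.0, horizon=50.0, rng=rng)
```

The constant bound keeps the example short; a tighter, time-varying bound would reject fewer candidates but requires knowing more about the intensity.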

Metrics. We follow the common next-event prediction task in TPPs (Du et al., 2016; Mei & Eisner, 2017) and focus on predicting the last event type $k$ from its history $\mathcal{H}$, as output by the language model. We evaluate the prediction $\hat{k}$ by Error Rate (ER) and Mean Rank (MR), where MR measures the average rank of the ground-truth type in the ranked list of predicted types; a smaller MR means the ground truth is ranked higher, and thus a better result.
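A minimal sketch of the two metrics, assuming each prediction comes as one score per event type (how the language model's output is turned into such scores is not specified here, so the input format and function name are assumptions):

```python
def error_rate_and_mean_rank(scores, targets):
    """Return (ER in percent, MR) for a batch of predictions.

    scores[n][k] is the predicted score of event type k for sequence n;
    targets[n] is the index of the ground-truth last event type.
    """
    errors, ranks = 0, []
    for row, target in zip(scores, targets):
        # Event types sorted from highest to lowest score.
        order = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        ranks.append(order.index(target) + 1)  # rank 1 = top prediction
        errors += int(order[0] != target)
    n = len(targets)
    return 100.0 * errors / n, sum(ranks) / n


# Toy scores for four sequences over three event types (not from the paper).
scores = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.20, 0.50, 0.30],
    [0.15, 0.80, 0.05],
]
targets = [0, 1, 2, 1]
er, mr = error_rate_and_mean_rank(scores, targets)
```

Here two of the four top predictions are wrong, and the ground-truth ranks are 1, 2, 2, and 1.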

Base models. In this study, we utilize three distinct sizes of language models from the OPT family (Zhang et al., 2022): Opt-125M (small), Opt-1.3B (medium), and Opt-6.7B (large), as our foundational language model backbones for latent logic tree extraction $p_{LM}(\mathcal{R}|X)$ in the E-steps. These models are fine-tuned for logic tree learning using the LoRA adaptation layer and further optimized through quantization (Dettmers et al., 2023) to minimize GPU memory consumption during both forward and backward passes. We use Zephyr-3B (Tunstall et al., 2023) and Mistral-7B-Instruct (Jiang et al., 2023) as frozen inference models for $p_{LM}(Y|X,\mathcal{R})$ in the M-steps. Details on the fine-tuning of these Large Language Models (LLMs) can be found in Appendix G.4.

Competitors. We categorize competitors into three types. The first category includes prompt-based approaches applied to language models, such as $k$-shot Chain-of-Thought (CoT) (Wei et al., 2022) and Tree-of-Thoughts (ToT) (Yao et al., 2023), which are used to generate reasoning chains and predict the last event. The second category involves fine-tuning methods for language models, notably supervised fine-tuning (SFT) and Proximal Policy Optimization (PPO) (Schulman et al., 2017) fine-tuning. The final category consists of advanced neural Temporal Point Process (TPP) models specifically designed for event prediction. Within this group, we focus on AttNHP (Yang et al., 2021), an attention-based TPP whose performance is either on par with or superior to the Neural Hawkes Process (NHP) (Mei & Eisner, 2017) and other attention-based models (Xue et al., 2023a). Additionally, we consider PromptTPP (Xue et al., 2023b), a prompting model based on AttNHP (abbreviated as Pt-AttNHP), tailored for processing streaming events with a retrieval memory mechanism.

Table 2: Scalability of the proposed model on two real-world behavior datasets. The tree traversal depth $d$ and tree expansion width $w$ are fixed to $3$ unless otherwise specified in the table. We use Opt-1.3B as the base model in the E-step for both Epic-100 and StackOverflow. Error Rate (%) is the evaluation metric for both datasets. The performance is averaged over three different seeds, with the standard deviation reported in parentheses.
| Method | Setting | Epic-100 w/ sc | SO w/o sc |
|---|---|---|---|
| LaTee (Tree Depth) | $d = 2$ | 69.23(9.43) | 73.31(23.29) |
| | $d = 3$ | 69.40(9.21) | 34.50(5.23) |
| | $d = 4$ | 68.21(8.13) | 34.41(5.08) |
| LaTee (Tree Width) | $w = 3$ | 69.40(9.21) | 34.50(5.23) |
| | $w = 5$ | 61.31(9.24) | 33.53(5.10) |
| | $w = 7$ | 55.25(9.01) | 34.37(5.42) |
| LaTee (Model Size) | opt-350M | 69.40(9.21) | 72.87(16.35) |
| | opt-1.3B | 66.40(9.52) | 34.50(5.23) |
| | opt-6.7B | 61.40(8.36) | 33.45(5.12) |
Figure 4: (a) Illustration of Scalability on the number of event types on four synthetic datasets. (b) Illustration of the performance of using semantic and not using semantic information on two real-world datasets.

4.2 Results and Analysis

The primary findings are summarized in Table 1. Notably, using only the local LLM for event prediction $p_{LM}(Y|X)$ (0-shot CoT) yields the least effective results across all three datasets. Intriguingly, incorporating examples and adopting a tree-like reasoning structure (ToT) in the prompt do help enhance performance on the EPIC-100 and StackOverflow datasets to some extent. Furthermore, Supervised Fine-Tuning (SFT) exhibits similarly weak performance, while Proximal Policy Optimization (PPO) fine-tuning on the LLM shows marginal improvement but still lags behind attention-based Temporal Point Process (TPP) models. We hypothesize this underperformance is due to SFT’s limited generalization capabilities and the shifted distribution mapping inherent in PPO’s reward signals (refer to Analysis 2). It is noteworthy that Pt-AttNHP consistently falls short of AttNHP in error rate across all datasets. This may be attributed to Pt-AttNHP’s reliance on a prompt-like retrieval memory for time-horizon generalization, which potentially leads to overlooking individual-level characteristics in unseen event histories. Lastly, the proposed LaTee, fine-tuned with GFN objectives, matches AttNHP’s accuracy on StackOverflow by focusing solely on latent structure learning. Remarkably, it surpasses AttNHP by a relative margin of 25% on MIMIC-3 and 18% on EPIC-100, both of which contain semantic content (as detailed in Analysis 1).
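The two relative margins can be checked against the error rates in Table 1, assuming "relative margin" means the reduction in error rate divided by AttNHP's error rate (this reading is an assumption):

```latex
\frac{36.66 - 27.50}{36.66} = \frac{9.16}{36.66} \approx 0.250 \quad (\text{MIMIC-3}),
\qquad
\frac{67.53 - 55.25}{67.53} = \frac{12.28}{67.53} \approx 0.182 \quad (\text{EPIC-100}).
```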

Table 3: Performance evaluation for alternate EM-loop frequencies on Synthetic@5 (early stopping at the fifth epoch).

| Setting | NLL ↓ | ER (%) ↓ | MR ↓ |
|---|---|---|---|
| E-steps only (with ground-truth likelihood) | 1389.31 | 62.5 | 2.025 |
| EM-loops (Alternate Freq = 1) | 117.58 | 70.0 | 2.075 |
| EM-loops (Alternate Freq = 20) | 108.44 | 67.5 | 2.000 |
| EM-loops (Alternate Freq = 50) | 106.35 | 70.0 | 1.900 |

Table 4: Performance evaluation for using different LMs for inference (E-steps) and generation (M-steps) on Synthetic@5 (early stopping at the fifth epoch).

| E-steps LM (Fine-tuned) | M-steps LM (Frozen) | NLL ↓ | ER (%) ↓ | MR ↓ |
|---|---|---|---|---|
| Opt-1.3b | Opt-1.3b | 128.64 | 97.5 | 2.23 |
| Opt-1.3b | Zephyr-3b | 149.25 | 87.5 | 2.50 |
| Opt-1.3b | Mistral-7B-Instruct | 117.62 | 70.0 | 2.08 |
| Zephyr-3b | Mistral-7B-Instruct | 116.21 | 70.0 | 1.95 |

Analysis 1: The Role of LLM in Enhancing Event Logic Discovery through Semantic Cognition. From the data presented in Fig. 1, it is evident that GPT enhances next-event type prediction when semantically meaningless numerical event IDs are replaced with meaningful event names. This study aims to explore whether the semantic content embedded in event history can bolster structure learning and, consequently, improve event prediction accuracy on a locally deployable LLM. As shown in Fig. 4(b), we observe a noteworthy reduction in error rate (approximately 25%) for both EPIC-100 and MIMIC-3 datasets when employing semantic event names for reasoning and inference. This decrease is significantly more pronounced than the improvement seen when transitioning from attention-based TPP models to LaTee models that do not use semantic information. Moreover, we illustrate two examples of semantic tree structures learned by LaTee in Appendix A.

Analysis 2: The Necessity of GFN Fine-Tuning in LLMs for Logic Tree Discovery and the Role of Prompts in Rule Discovery. As indicated by the baselines in Table 1, approaches such as zero-shot Chain-of-Thought (CoT) prompting, $k$-shot prompting, and Tree-of-Thoughts (ToT) prompting demonstrate limited efficacy in yielding meaningful results. Similarly, Supervised Fine-Tuning (SFT) and Proximal Policy Optimization (PPO) fine-tuning on Large Language Models (LLMs) for next-event prediction are outperformed by attention-based Temporal Point Process (TPP) models. However, GFN fine-tuning, which focuses on teaching models how to reason rather than predict, enables LLMs to match and even exceed the prediction accuracy of attention-based TPP models, particularly when integrating semantic information. To understand this improvement, Fig. 3 offers a visualization of the diverse rule distributions generated by fine-tuned LLMs. We notice that rule distributions in both SFT and PPO fine-tuning are predominantly concentrated in five regions, whereas GFN fine-tuning exhibits a more diverse spread across the entire rule space.

Analysis 3: The Scalability of the Proposed Method and the Impact of LLM and Symbolic Logic Tree Sizes on Performance. This analysis explores the scalability of our proposed method by examining the effect of an increased number of event types across four synthetic datasets without any semantic information. As shown in Fig. 4, LaTee demonstrates comparable scaling abilities in Error Rate and Mean Rank to those of attention-based TPP models. Notably, LaTee consistently achieves a lower Mean Rank, likely due to the additional confidence imparted by the learned structure information in making predictions. Additionally, we analyze the impact of varying tree sizes and LLM sizes. Assuming the predefined predicate space $\mathcal{Z}$ has cardinality $|\mathcal{Z}| = N$ and the maximum allowable depth and width of the logic tree are restricted to $d$ and $w$ ($w \ll N$), respectively, the entire search space can be approximated as $O(N^{w^d})$. In Table 2, we restrict depth $d$ and width $w$ to at most $4$ and $7$, and the empirical findings suggest that increasing the tree width has a more beneficial effect than increasing tree depth or model size on semantic event sequences. This could be attributed to the fact that ground-truth rules often consist of multiple short rules, and a wider tree is better equipped to encompass more semantically similar predicate events at the same level. It is also important to note that for non-semantic event sequences, enlarging the model size tends to be more advantageous than increasing tree sizes.
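To make the growth of the exponent concrete, the following worked arithmetic evaluates $w^d$ for the depth and width settings listed in Table 2. It only illustrates the exponent in the stated $O(N^{w^d})$ approximation; no value of $N$ is assumed.

```latex
\begin{aligned}
(w, d) = (3, 2) &: \quad w^d = 3^2 = 9, \\
(w, d) = (3, 3) &: \quad w^d = 3^3 = 27, \\
(w, d) = (3, 4) &: \quad w^d = 3^4 = 81, \\
(w, d) = (5, 3) &: \quad w^d = 5^3 = 125, \\
(w, d) = (7, 3) &: \quad w^d = 7^3 = 343.
\end{aligned}
```

Within these settings, widening the tree from 3 to 7 at depth 3 enlarges the exponent more than deepening it from 3 to 4 at width 3 (343 versus 81).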

Analysis 4: Ablating E-M Update Steps in LaTee. Unlike traditional EM algorithms, where the E-step typically has a closed-form solution, the E-step in GFlowNet-EM moves progressively closer to the target distribution $p$. This requires enough gradient steps in the ‘approximate E-step’ to closely align the approximate distribution with the target, while also switching regularly to M-steps to update the likelihood functions using the latent variables newly sampled in the E-steps. This non-stationary update thus poses the challenge of scheduling E-M steps for a better convergence rate.

Consequently, we added experiments comparing the convergence speed of both the SubTB loss (E-steps) and the NLL loss (M-steps) under varying frequencies of alternation. We provide the convergence plots for EM in Appendix G.3 (Figs. 8 and 9) and report final performance in Table 3. Interestingly, we observe that more frequent alternation of E-M loops leads to faster convergence of the SubTB loss (E-steps) but slower convergence of the NLL loss (M-steps). Additionally, the frequency of alternation appears to have minimal impact on the overall evaluation performance.

Analysis 5: Ablating LLMs for E-M Steps. To investigate whether the world knowledge in the LM is most useful in the generation model (M-steps LM), the inference model (E-steps LM), or both, we compared the effects of using different sizes/versions of LMs for inference (E-steps) and generation (M-steps). In our experiment, we used Opt-1.3B as the base inference model (which has limited language understanding ability on LM benchmark tasks) and used three different estimation (generation) models to make the event prediction, i.e., Opt-1.3B, Zephyr-3B, and Mistral-7B-Instruct. The results are shown in Table 4.

Our evaluation strategy in Table 4 focused exclusively on altering the model size to guarantee fairness in comparison. We observe that employing larger language models (LMs) for both the inference (E-steps) and generation (M-steps) phases can enhance event prediction performance. Notably, increasing the size of the LM used for generation (M-steps) had a more pronounced positive impact than enlarging the LM used for inference (E-steps). The results suggest that the extensive world knowledge encoded in larger LMs is more beneficial for generation tasks (M-steps). This finding encourages future improvements in reasoning abilities in the M-steps by calling API-based LLMs such as GPT-4 and Claude-3, using a logic tree extracted by a fine-tuned, local, lightweight LLM as the prompt.

5 Conclusion

The incorporation of general knowledge from Large Language Models (LLMs) is key to deciphering complex structures in noisy event sequences. To facilitate this, we present LaTee, an amortized EM-style framework that leverages LLMs’ prior knowledge for latent tree structure learning for event sequence explanation. We simplify the complex posterior with GFlowNets and perform inference based on the learned structure without further gradient updates. Empirical results show that this method notably enhances generalization in event histories with semantic information.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgement

The authors thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions. Shuang Li’s research was in part supported by the National Science and Technology Major Project under grant No. 2022ZD0116004, the NSFC under grant No. 62206236, Shenzhen Science and Technology Program under grant No. JCYJ20210324120011032, Shenzhen Key Lab of Cross-Modal Cognitive Computing under grant No. ZDSYS20230626091302006, and Guangdong Key Lab of Mathematical Foundations for Artificial Intelligence. Zitao Song and Bo An are supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-GC-2023-009).

References

  • Acemoglu et al. (2011) Acemoglu, D., Dahleh, M. A., Lobel, I., and Ozdaglar, A. Bayesian learning in social networks. The Review of Economic Studies, 78(4):1201–1236, 2011.
  • Achiam et al. (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Beal (2003) Beal, M. J. Variational algorithms for approximate Bayesian inference. University of London, University College London (United Kingdom), 2003.
  • Bengio et al. (2021) Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381–27394, 2021.
  • Bengio et al. (2023) Bengio, Y., Lahlou, S., Deleu, T., Hu, E. J., Tiwari, M., and Bengio, E. Gflownet foundations. Journal of Machine Learning Research, 24(210):1–55, 2023.
  • Boyd et al. (2020) Boyd, A., Bamler, R., Mandt, S., and Smyth, P. User-dependent neural sequence models for continuous-time event data. Advances in Neural Information Processing Systems, 33:21488–21499, 2020.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Campero et al. (2018) Campero, A., Pareja, A., Klinger, T., Tenenbaum, J., and Riedel, S. Logical rule induction and theory learning using neural theorem proving. arXiv preprint arXiv:1809.02193, 2018.
  • Chen et al. (2020) Chen, R. T., Amos, B., and Nickel, M. Neural spatio-temporal point processes. arXiv preprint arXiv:2011.04583, 2020.
  • Cropper & Tourret (2020) Cropper, A. and Tourret, S. Logical reduction of metarules. Machine Learning, 109:1323–1369, 2020.
  • Damen et al. (2021) Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., and Wray, M. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(11):4125–4141, 2021. doi: 10.1109/TPAMI.2020.2991965.
  • Deleu et al. (2024) Deleu, T., Nouri, P., Malkin, N., Precup, D., and Bengio, Y. Discrete probabilistic inference as control in multi-path environments. arXiv preprint arXiv:2402.10309, 2024.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1–22, 1977.
  • Dettmers et al. (2023) Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  • Du et al. (2016) Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., and Song, L. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp.  1555–1564, 2016.
  • Evans & Grefenstette (2018) Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
  • Feng et al. (2023) Feng, X., Wan, Z., Wen, M., Wen, Y., Zhang, W., and Wang, J. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
  • Gao et al. (2023) Gao, C., Lan, X., Lu, Z., Mao, J., Piao, J., Wang, H., Jin, D., and Li, Y. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023.
  • Glanois et al. (2022) Glanois, C., Jiang, Z., Feng, X., Weng, P., Zimmer, M., Li, D., Liu, W., and Hao, J. Neuro-symbolic hierarchical rule induction. In International Conference on Machine Learning, pp.  7583–7615. PMLR, 2022.
  • Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pp.  1352–1361. PMLR, 2017.
  • Hao et al. (2023) Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
  • Hegselmann et al. (2023) Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pp.  5549–5581. PMLR, 2023.
  • Henrich & McElreath (2003) Henrich, J. and McElreath, R. The evolution of cultural evolution. Evolutionary Anthropology: Issues, News, and Reviews, 12(3):123–135, 2003.
  • Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Hu et al. (2023a) Hu, E. J., Jain, M., Elmoznino, E., Kaddar, Y., Lajoie, G., Bengio, Y., and Malkin, N. Amortizing intractable inference in large language models. arXiv preprint arXiv:2310.04363, 2023a.
  • Hu et al. (2023b) Hu, E. J., Malkin, N., Jain, M., Everett, K. E., Graikos, A., and Bengio, Y. Gflownet-em for learning compositional latent variable models. In International Conference on Machine Learning, pp.  13528–13549. PMLR, 2023b.
  • Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Johnson et al. (2016) Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
  • Katz et al. (2008) Katz, Y., Goodman, N. D., Kersting, K., Kemp, C., and Tenenbaum, J. B. Modeling semantic cognition as logical dimensionality reduction. In Proceedings of the annual meeting of the cognitive science society, volume 30, 2008.
  • Kojima et al. (2022) Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • Koller & Friedman (2009) Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Laland (2004) Laland, K. N. Social learning strategies. Animal Learning & Behavior, 32:4–14, 2004.
  • Leng & Yuan (2023) Leng, Y. and Yuan, Y. Do llm agents exhibit social behavior? arXiv preprint arXiv:2312.15198, 2023.
  • Leskovec & Krevl (2014) Leskovec, J. and Krevl, A. Snap datasets: Stanford large network dataset collection, 2014.
  • Lew et al. (2023) Lew, A. K., Zhi-Xuan, T., Grand, G., and Mansinghka, V. K. Sequential monte carlo steering of large language models using probabilistic programs. arXiv preprint arXiv:2306.03081, 2023.
  • Li et al. (2020) Li, S., Wang, L., Zhang, R., Chang, X., Liu, X., Xie, Y., Qi, Y., and Song, L. Temporal logic point processes. In International Conference on Machine Learning, pp.  5990–6000. PMLR, 2020.
  • Li et al. (2022) Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., and Lewis, M. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022.
  • Lu et al. (2021) Lu, X., Welleck, S., West, P., Jiang, L., Kasai, J., Khashabi, D., Bras, R. L., Qin, L., Yu, Y., Zellers, R., et al. Neurologic a* esque decoding: Constrained text generation with lookahead heuristics. arXiv preprint arXiv:2112.08726, 2021.
  • Madan et al. (2023) Madan, K., Rector-Brooks, J., Korablyov, M., Bengio, E., Jain, M., Nica, A. C., Bosc, T., Bengio, Y., and Malkin, N. Learning gflownets from partial episodes for improved convergence and stability. In International Conference on Machine Learning, pp.  23467–23483. PMLR, 2023.
  • Malkin et al. (2021) Malkin, N., Wang, Z., and Jojic, N. Coherence boosting: When your pretrained language model is not paying enough attention. arXiv preprint arXiv:2110.08294, 2021.
  • Mei & Eisner (2017) Mei, H. and Eisner, J. M. The neural hawkes process: A neurally self-modulating multivariate point process. Advances in neural information processing systems, 30, 2017.
  • Miao et al. (2019) Miao, N., Zhou, H., Mou, L., Yan, R., and Li, L. Cgmh: Constrained sentence generation by metropolis-hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  6834–6842, 2019.
  • Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. Advances in neural information processing systems, 30, 2017.
  • Ogata (1981) Ogata, Y. On lewis’ simulation method for point processes. IEEE transactions on information theory, 27(1):23–31, 1981.
  • Pan et al. (2023) Pan, L., Albalak, A., Wang, X., and Wang, W. Y. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295, 2023.
  • Quinlan (1990) Quinlan, J. R. Learning logical definitions from relations. Machine learning, 5:239–266, 1990.
  • Rocktäschel & Riedel (2017) Rocktäschel, T. and Riedel, S. End-to-end differentiable proving. Advances in neural information processing systems, 30, 2017.
  • Schick & Schütze (2020) Schick, T. and Schütze, H. It’s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118, 2020.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sha (2020) Sha, L. Gradient-guided unsupervised lexically constrained text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  8692–8703, 2020.
  • Shchur et al. (2019) Shchur, O., Biloš, M., and Günnemann, S. Intensity-free learning of temporal point processes. arXiv preprint arXiv:1909.12127, 2019.
  • Shi et al. (2023) Shi, X., Xue, S., Wang, K., Zhou, F., Zhang, J. Y., Zhou, J., Tan, C., and Mei, H. Language models can improve event prediction by few-shot abductive reasoning. arXiv preprint arXiv:2305.16646, 2023.
  • Tiapkin et al. (2024) Tiapkin, D., Morozov, N., Naumov, A., and Vetrov, D. P. Generative flow networks as entropy-regularized rl. In International Conference on Artificial Intelligence and Statistics, pp.  4213–4221. PMLR, 2024.
  • Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Tunstall et al. (2023) Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023.
  • Ullman et al. (2012) Ullman, T. D., Goodman, N. D., and Tenenbaum, J. B. Theory learning as stochastic search in the language of thought. Cognitive Development, 27(4):455–480, 2012.
  • van Krieken et al. (2023) van Krieken, E., Thanapalasingam, T., Tomczak, J., Van Harmelen, F., and Ten Teije, A. A-nesi: A scalable approximate method for probabilistic neurosymbolic inference. Advances in Neural Information Processing Systems, 36:24586–24609, 2023.
  • Wang et al. (2024) Wang, X., Zhu, W., Saxon, M., Steyvers, M., and Wang, W. Y. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Webb et al. (2023) Webb, T., Holyoak, K. J., and Lu, H. Emergent analogical reasoning in large language models. Nature Human Behaviour, 7(9):1526–1541, 2023.
  • Wei et al. (2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Xiao et al. (2017) Xiao, S., Yan, J., Yang, X., Zha, H., and Chu, S. Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
  • Xie et al. (2021) Xie, S. M., Raghunathan, A., Liang, P., and Ma, T. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
  • Xie et al. (2023) Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633, 2023.
  • Xu (2023) Xu, H. No train still gain. unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function. arXiv preprint arXiv:2309.03224, 2023.
  • Xue et al. (2023a) Xue, S., Shi, X., Chu, Z., Wang, Y., Zhou, F., Hao, H., Jiang, C., Pan, C., Xu, Y., Zhang, J. Y., et al. Easytpp: Towards open benchmarking the temporal point processes. arXiv preprint arXiv:2307.08097, 2023a.
  • Xue et al. (2023b) Xue, S., Wang, Y., Chu, Z., Shi, X., Jiang, C., Hao, H., Jiang, G., Feng, X., Zhang, J. Y., and Zhou, J. Prompt-augmented temporal point process for streaming event sequence. arXiv preprint arXiv:2310.04993, 2023b.
  • Yang et al. (2021) Yang, C., Mei, H., and Eisner, J. Transformer embeddings of irregularly spaced events and their participants. arXiv preprint arXiv:2201.00044, 2021.
  • Yang et al. (2017) Yang, F., Yang, Z., and Cohen, W. W. Differentiable learning of logical rules for knowledge base completion. arXiv preprint arXiv:1702.08367, 2017.
  • Yao et al. (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  • Zhang et al. (2020) Zhang, M., Jiang, N., Li, L., and Xue, Y. Language generation via combinatorial constraint satisfaction: A tree search enhanced monte-carlo approach. arXiv preprint arXiv:2011.12334, 2020.
  • Zhang et al. (2022) Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  • Zhu et al. (2023) Zhu, B., Sharma, H., Frujeri, F. V., Dong, S., Zhu, C., Jordan, M. I., and Jiao, J. Fine-tuning language models with advantage-induced policy alignment. arXiv preprint arXiv:2306.02231, 2023.
  • Zhu et al. (2022) Zhu, X., Wang, J., Zhang, L., Zhang, Y., Gan, R., Zhang, J., and Yang, Y. Solving math word problem via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257, 2022.
  • Zuo et al. (2020) Zuo, S., Jiang, H., Li, Z., Zhao, T., and Zha, H. Transformer hawkes process. In International conference on machine learning, pp.  11692–11702. PMLR, 2020.

Appendix A Learned Logic Tree Examples

Figure 5: Illustration of learned symbolic logic tree structures from event histories on two real-world datasets containing semantic information. Fig. (a)-(d) are learned from EPIC-100 and Fig. (e)-(f) are learned from MIMIC-3. The thickness of an edge represents the posterior probability, and its color represents the sign of the weight of the corresponding rule (black stands for activation and red for inhibition; e.g., low blood pressure facilitates low urine, while normal blood pressure suppresses low urine).

Appendix B Broader Impact

Differentiable Extraction of Non-Linear Structures from LLMs: Our approach extends the application of in-context learning in large language models (LLMs) beyond traditional posterior inference (Xie et al., 2021) and independent demonstrations (Wang et al., 2024). We focus on non-linear prompt structures, enabling the extraction of complex entities like symbolic proof trees and latent positions of political figures. This differentiable method enhances the versatility of LLMs in handling diverse, non-linear structures.

Advancing Neuro-Symbolic Inference with Foundation Models: Foundational models, including Vision-and-Language Models (VLMs) and LLMs, serve as informative belief priors for world modeling and latent concept understanding. Our work augments posterior inference capabilities, moving past models like A-NESI (van Krieken et al., 2023) that rely on uninformative Dirichlet priors. This progression is pivotal for tackling more intricate, multimodal, and scalable neurosymbolic challenges.

Enhancing Data Privacy in Event Sequence Explanation and Prediction: By fine-tuning locally accessible, lightweight LLMs (under 7B parameters) while maintaining data privacy, our model offers wide applications in sensitive areas like healthcare and credit card fraud detection. The logic trees extracted from local LLMs can be integrated with public LLMs for prediction tasks. This aspect also paves the way for exploring improvements in reasoning abilities for API-based LLMs like GPT-4 and Claude-3 using these extracted logic trees.

These facets of our research not only contribute to the evolution of language model applications but also pave the way for new advancements in privacy-sensitive areas and neurosymbolic computing.

Appendix C More Related Works

Temporal Point Processes. In recent decades, a diverse range of Neural Temporal Point Processes (TPPs) have been proposed to model event sequences with various properties. Many of these TPPs are based on a parametric intensity function that evolves through a series of latent states (Du et al., 2016; Xiao et al., 2017; Boyd et al., 2020; Chen et al., 2020). To effectively capture long-range dependencies within these sequences, the attention mechanism has been adapted for TPPs (Zuo et al., 2020; Yang et al., 2021; Mei & Eisner, 2017). Moreover, intensity-free TPP models have also shown promising results, particularly in the EasyTPP framework (Shchur et al., 2019; Xue et al., 2023a). However, the application of Large Language Models (LLMs) in learning event sequences remains largely unexplored. Recent research, such as LAMP (Shi et al., 2023), introduces a GPT-based abductive reasoning approach built upon attention-based TPP models for event prediction. This approach, however, necessitates additional textual data for event description and relies on costly API services. Our focus, instead, is on harnessing the reasoning capabilities of local LLMs for event prediction.

Non-Linear Reasoning in LLMs. Recent research has focused on exploring complex, non-linear reasoning paths such as tree structures within Large Language Models (LLMs) (Zhu et al., 2022; Xu, 2023; Yao et al., 2023; Hao et al., 2023; Xie et al., 2023). Various methods, including beam search (Xie et al., 2023), depth-/breadth-first search (Yao et al., 2023), Monte Carlo Tree Search (Hao et al., 2023), and MCTS with an enhanced value function (Feng et al., 2023), have been implemented to navigate these tree structures effectively using LLMs’ self-assessment capabilities to identify more effective reasoning pathways. Nonetheless, research on differentiable learning for non-linear reasoning within LLMs remains scarce. Recent studies, such as by Hu et al. (2023a), suggest fine-tuning LLMs using GFlowNets objectives to augment the diversity of reasoning chains. In our research, applying LLM-based tree search to discern the inherent structure in event sequences presents challenges due to the limited event data available for fine-tuning LLMs for specific event prediction tasks. Therefore, our focus shifts towards the development of differentiable logic trees to facilitate non-linear reasoning in LLMs, achieved by iteratively expanding and refining the logic tree structure.

Appendix D Limitations

Resource constraints limited our experiments to models with up to 6.7B parameters and event sequences of at most 40 events. The cap on the number of input events stems from the maximum number of input tokens that a language model accepts. However, we anticipate that our findings will apply to larger models and longer sequences. Notably, optimizing larger models with limited data presents challenges, and exploring more complex latent problems remains an ongoing challenge.

Appendix E ELBO Derivation

Given a data pair $(X, Y)$, we can write the joint log-likelihood $\log p(X, Y)$ as

\begin{align}
\log p(X,Y) &= \log \frac{p(X,Y,\mathcal{R})}{p(\mathcal{R}|X,Y)} \tag{13} \\
&= \log \frac{p(X,Y,\mathcal{R})\, q(\mathcal{R}|X,Y)}{p(\mathcal{R}|X,Y)\, q(\mathcal{R}|X,Y)} \tag{14} \\
\sum_{\mathcal{R}} q(\mathcal{R}|X,Y) \log p(X,Y) &= \sum_{\mathcal{R}} q(\mathcal{R}|X,Y) \log \frac{p(X,Y,\mathcal{R})\, q(\mathcal{R}|X,Y)}{p(\mathcal{R}|X,Y)\, q(\mathcal{R}|X,Y)} \tag{15} \\
\log p(X,Y) &= \sum_{\mathcal{R}} q(\mathcal{R}|X,Y) \log \frac{q(\mathcal{R}|X,Y)}{p(\mathcal{R}|X,Y)} + \sum_{\mathcal{R}} q(\mathcal{R}|X,Y) \log \frac{p(X,Y,\mathcal{R})}{q(\mathcal{R}|X,Y)} \tag{16} \\
&= D_{\text{KL}}(q \,\|\, p) + \underbrace{\mathbb{E}_{\mathcal{R} \sim q(\mathcal{R}|X,Y)}\left[\log \frac{p(X,Y|\mathcal{R})\, p(\mathcal{R})}{q(\mathcal{R}|X,Y)}\right]}_{\text{ELBO}} \tag{17} \\
&\geq \mathbb{E}_{\mathcal{R} \sim q(\mathcal{R}|X,Y)}\left[\log \frac{p(X,Y|\mathcal{R})\, p(\mathcal{R})}{q(\mathcal{R}|X,Y)}\right] \tag{18}
\end{align}

Thus, the ELBO $\mathcal{L}$ for the joint likelihood $p(X,Y)$ is $\mathbb{E}_{\mathcal{R} \sim q(\mathcal{R}|X,Y)}\left[\log \frac{p(X,Y|\mathcal{R})\, p(\mathcal{R})}{q(\mathcal{R}|X,Y)}\right]$.
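The decomposition in Eq. (16)-(18) can be checked numerically on a small discrete latent space. The sketch below is illustrative only: the joint probabilities and the variational distribution are made-up toy values for three hypothetical candidate trees, not quantities from our experiments.

```python
import math

# Hypothetical toy values of the joint p(X, Y, R) for one fixed (X, Y)
# and three candidate logic trees R.
joint = [0.10, 0.25, 0.05]
# An arbitrary variational distribution q(R | X, Y); it must sum to 1.
q = [0.5, 0.3, 0.2]

evidence = sum(joint)                      # p(X, Y), marginalizing over R
posterior = [j / evidence for j in joint]  # p(R | X, Y)

# KL(q || p(R | X, Y)), the first term of Eq. (16).
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
# ELBO = E_q[log p(X, Y, R) - log q(R | X, Y)], the second term of Eq. (16).
elbo = sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint))
log_evidence = math.log(evidence)
```

Here `log_evidence` equals `kl + elbo` up to floating-point error, and since the KL term is non-negative, `elbo` never exceeds `log_evidence`; the gap closes exactly when `q` matches the posterior.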

Appendix F GFlowNets Learning Objective

We learn the amortized sampler of the posterior distribution $p(\mathcal{R}|X,Y)$ with the Sub-Trajectory Balance (SubTB) objective (Madan et al., 2023) of GFlowNets. The original SubTB objective is given by:

\begin{align}
\mathcal{L}_{SubTB}(\tau_{m:n}) &= \left(\log \frac{F(s_m;\theta) \prod_{i=m}^{n-1} p_F(s_{i+1}|s_i;\theta)}{F(s_n;\theta) \prod_{i=m}^{n-1} p_B(s_i|s_{i+1};\theta)}\right)^2 \tag{19} \\
\mathcal{L}(\tau) &= \frac{\sum_{0 \leq i < j \leq n} \lambda^{j-i} \mathcal{L}_{SubTB}(\tau_{i:j})}{\sum_{0 \leq i < j \leq n} \lambda^{j-i}} \tag{20}
\end{align}

In our case, we enforce $F(s_n;\theta) = R(s_n)$ if $s_n$ is terminal, so we have $R(s_n^{\texttt{T}}) = F(s_n)\, p_F(\texttt{T}|s_n)$, or equivalently $F(s_n) = R(s_n^{\texttt{T}}) / p_F(\texttt{T}|s_n)$. Since we generate the tree structure level by level, the backward probability is one, i.e., $p_B(s|s') = 1$. Setting $\lambda = 1$ and dropping the constant normalizer in Eq. (20), we have, writing the forward policy $p_F$ as the amortized sampler $q_\theta$ in Eq. (22),

\begin{align}
\mathcal{L}_{SubTB}(\mathcal{R}_{0:n}) &= \sum_{0 \leq i < j \leq n} \left(\log \frac{F(\mathcal{R}_i;\theta) \prod_{k=i+1}^{j} p_F(\mathcal{R}_k|\mathcal{R}_{k-1})}{F(\mathcal{R}_j;\theta) \prod_{k=i+1}^{j} p_B(\mathcal{R}_{k-1}|\mathcal{R}_k)}\right)^2 \tag{21} \\
&= \sum_{0 \leq i < j \leq n} \left(\log \frac{R(\mathcal{R}_i^{\texttt{T}}) \prod_{k=i+1}^{j} q_\theta(\mathcal{R}_k|\mathcal{R}_{k-1})\, q_\theta(\texttt{T}|\mathcal{R}_j)}{R(\mathcal{R}_j^{\texttt{T}})\, q_\theta(\texttt{T}|\mathcal{R}_i)}\right)^2. \tag{22}
\end{align}

We train the GFlowNet with stochastic gradient

$$\mathbb{E}_{\mathcal{R}_{0:n} \sim q_\theta}\big[\nabla_\theta \mathcal{L}_{SubTB}(\mathcal{R}_{0:n})\big] \tag{23}$$
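To make the simplified loss in Eq. (22) concrete, the following is a minimal sketch that evaluates it for a single sampled trajectory from precomputed log-quantities. The function name `subtb_loss` and its three list arguments are illustrative assumptions, not part of the released implementation; a real training loop would compute the same expression on autodiff tensors so that the gradient in Eq. (23) can be estimated from sampled trajectories.

```python
import math


def subtb_loss(log_reward_term, log_q_step, log_q_term):
    """Evaluate Eq. (22) for one trajectory R_0, ..., R_n.

    log_reward_term[i] = log R(R_i^T)              for i = 0..n
    log_q_step[k - 1]  = log q_theta(R_k | R_{k-1}) for k = 1..n
    log_q_term[i]      = log q_theta(T | R_i)       for i = 0..n
    (Argument names are hypothetical; they are not from the paper.)
    """
    n = len(log_reward_term) - 1
    assert len(log_q_step) == n and len(log_q_term) == n + 1

    # Prefix sums: sum_{k=i+1}^{j} log q_step = prefix[j] - prefix[i].
    prefix = [0.0]
    for lq in log_q_step:
        prefix.append(prefix[-1] + lq)

    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            numerator = log_reward_term[i] + (prefix[j] - prefix[i]) + log_q_term[j]
            denominator = log_reward_term[j] + log_q_term[i]
            loss += (numerator - denominator) ** 2
    return loss


# A hand-built trajectory (n = 2) whose flows are exactly balanced.
balanced = subtb_loss(
    log_reward_term=[math.log(0.5), math.log(0.1), math.log(0.4)],
    log_q_step=[math.log(0.5), math.log(0.8)],
    log_q_term=[math.log(0.5), math.log(0.2), math.log(1.0)],
)
```

In the balanced example every sub-trajectory term vanishes, so the loss is zero up to floating-point error; changing any reward or probability makes it strictly positive.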

Appendix G Experimental Details

G.1 Dataset Details

| Dataset | # Target Predicates | # Body Predicates | Average Sequence Length (events) |
|---|---|---|---|
| Synthetic@5 (w/o sc) | 2 | 3 | 30.19 |
| Synthetic@10 (w/o sc) | 5 | 5 | 30.34 |
| Synthetic@20 (w/o sc) | 7 | 13 | 30.29 |
| Synthetic@40 (w/o sc) | 8 | 32 | 30.82 |
| StackOverflow (w/o sc) | 10 | 22 | 40.00 |
| EPIC-KITCHEN-100 (w/ sc) | 7 | 60 | 36.76 |
| MIMIC3 (w/ sc) | 3 | 62 | 20.01 |

Table 5: Event Dataset Statistics

We evaluate our method on one synthetic dataset and three user behavior datasets. We treat each event type present in the event history as a unique predicate and emphasize the model’s ability to predict only the pertinent target predicates. The overall data statistics are presented in Table 5. We detail the preparation and use of each dataset below.

Synthetic Dataset. This dataset comprises four sets of synthetic event history data generated using the Temporal Logic Point Process (Li et al., 2020). Specifically, we employ pre-defined logical rules along with their weights, as outlined in Eq. (4), to construct the intensity function, and then apply thinning algorithms to generate new events. To evaluate the scalability of the proposed model, we have created four distinct groups of synthetic data, with the number of event types varying from five to forty, and an average sequence length of 30 events.
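The thinning step can be sketched as follows. This is an illustrative, Ogata-style sampler under stated assumptions: `rule_intensity` is a hypothetical stand-in for the rule-informed intensity of Eq. (4) (a base rate plus a weight once a body event "A" has occurred), and `lam_max` must upper-bound the intensity over the whole horizon. None of these names come from the paper.

```python
import random


def rule_intensity(t, body_history, base=0.2, weight=1.0):
    """Hypothetical intensity of a target event at time t.

    The rate is `base`, boosted by `weight` once body event "A" has
    occurred strictly before t (a toy rule: Target <- A, A before Target).
    """
    rule_fired = any(name == "A" and s < t for s, name in body_history)
    return base + (weight if rule_fired else 0.0)


def thinning(intensity_fn, horizon, lam_max, rng):
    """Sample event times on [0, horizon] by thinning.

    Candidates come from a homogeneous Poisson process with rate lam_max;
    each is kept with probability intensity_fn(t) / lam_max.
    """
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam_max)  # waiting time to the next candidate
        if t > horizon:
            return events
        if rng.random() * lam_max < intensity_fn(t):  # accept or reject
            events.append(t)


# Body event "A" fires at time 1.0; the intensity never exceeds 0.2 + 1.0.
body = [(1.0, "A")]
sampled = thinning(lambda t: rule_intensity(t, body), 10.0, 1.2, random.Random(0))
```

Because the toy intensity here depends only on the body events, a constant bound suffices; an intensity that also depends on previously sampled events would need the bound and the history updated inside the loop.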

StackOverflow (Leskovec & Krevl, 2014). This dataset encompasses two years of user awards from a question-and-answer website, documenting each user’s sequence of badges. There are 22 distinct types of badges in total. However, since each event type is represented solely by a numerical ID, the dataset lacks semantically meaningful information. We focus on a subset of 142 records, each with an average sequence length of 40 event tokens.

EPIC-KITCHEN-100 (Damen et al., 2021). This dataset originates from a large-scale, first-person (egocentric) vision dataset, featuring multi-faceted, audio-visual, non-scripted recordings in natural settings, specifically the wearers’ homes. It captures daily kitchen activities over multiple days. We have utilized the annotated action sequences, focusing only on text, and extracted them to create a temporal event history of cooking verbs. This was achieved by omitting the entities that the human subjects interacted with. The frequencies of each verb, derived from the Epic-100 dataset, are visualized in Fig. 6. In this dataset, we specifically focus on eight verbs: put-in, rinse, put-on, pour, stir, peel, chop, and slice, as our target predicates. The model is tasked with reasoning about the actions preceding each target verb and learning the underlying structure that culminates in these targets. We concentrated on a subset of 400 event histories, each with an average sequence length of 36.76 events, resulting in 60 distinct event types in total.

MIMIC-3 (Johnson et al., 2016). This dataset comprises electronic health records of patients admitted to the intensive care unit (ICU). We specifically focus on patients diagnosed with sepsis, extracting medications, lab tests, outputs, and diagnoses to form text-based temporal event histories. The frequencies of the various event types related to sepsis are illustrated in Fig. 7. In this dataset, we concentrate on three key event types: survival, urine_output_low, and normal_blood_pressure. Our analysis is based on a subset of 477 event histories, with an average sequence length of 20 event tokens and a total of 62 unique event types.

Figure 6: Distribution of Semantic Event types in EPIC-100
Figure 7: Distribution of Semantic Event types in MIMIC-3

G.2 Parsing LLM Outputs

We detail two distinct methodologies for parsing outputs from large language models (LLMs) and querying corresponding probabilities.

  1. For LLMs that are locally accessible, such as the OPT series, our approach follows Hu et al. (2023a). Here, we directly query the probability of subsequent tokens given a target sentence, avoiding the need for parsing. To illustrate, consider an "action space" defined by $\{A, B, C, D\}$. To get the probability of action $A$, we tokenize it into $k$ tokens $[w^A_1, w^A_2, \ldots, w^A_k]$. Then $p(A)$ is computed as:

    $$p(A) = p(w^A_k \mid w^A_{k-1}, \ldots, w^A_1)\, p(w^A_{k-1} \mid w^A_{k-2}, \ldots, w^A_1) \cdots p(w^A_1).$$

  2. For LLMs that are not locally accessible, like GPT, we employ a different technique. Regular expressions are used to isolate target sentences. Then, we use the "logprobs" parameter of the OpenAI Chat Completions API to obtain the probabilities of the target tokens. For example, a common pattern in our analysis is "#Event NAME#", which allows us to capture the output event by extracting the NAME component. When the NAME cannot be parsed, we adopt the approach from Logic-LM (Pan et al., 2023) and make a random guess across all possible event types.
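Both routes can be sketched in a few lines. This is a minimal illustration under stated assumptions: the per-token log-probabilities are taken as given (in practice they would come from the local model or from the API response), and the helper names `sequence_log_prob`, `EVENT_PATTERN`, and `parse_event` are hypothetical. Treating a name outside the known event types as a parse failure is also an assumption of this sketch.

```python
import math
import random
import re


def sequence_log_prob(token_log_probs):
    """Chain rule in log space: log p(A) = sum_i log p(w_i | w_1..w_{i-1})."""
    return sum(token_log_probs)


# Matches the "#Event NAME#" pattern and captures NAME.
EVENT_PATTERN = re.compile(r"#Event\s+(.+?)#")


def parse_event(text, event_types, rng):
    """Extract the event name from an LLM reply.

    Falls back to a uniformly random event type when nothing usable
    can be parsed.
    """
    match = EVENT_PATTERN.search(text)
    if match:
        name = match.group(1).strip()
        if name in event_types:
            return name
    return rng.choice(event_types)


# Route 1: an action tokenized into two tokens with conditional probabilities 0.5 and 0.4.
p_action = math.exp(sequence_log_prob([math.log(0.5), math.log(0.4)]))

# Route 2: parse a free-text reply, with a random fallback.
verbs = ["pour", "stir", "chop"]
parsed = parse_event("The answer is #Event pour#.", verbs, random.Random(0))
guessed = parse_event("I am not sure.", verbs, random.Random(0))
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow for actions that span many tokens.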

G.3 Additional Experiments

Figure 8: Convergence rate of the M-step under different EM-loop alternating frequencies.
Figure 9: Convergence rate of the E-step under different EM-loop alternating frequencies. Left: four different alternating frequencies, including the E-step-only setting. Right: the same as Left, with the E-step-only setting excluded.

G.4 Training Details

Implementation Details. All models are implemented using the PyTorch framework. All experiments were conducted on a server with 512 GB of RAM, two CPUs with 64 logical cores each (AMD Ryzen Threadripper PRO 5995WX 64-Cores), and four NVIDIA RTX A6000 GPUs with 50 GB of memory.

Hyperparameter Selection. We present the selected hyperparameters for the synthetic datasets and the three real-world datasets in Table 6 and Table 7, respectively.

Fine-tuning Quantized Large Language Model.

In our experiments, we use QLoRA (Dettmers et al., 2023) to fine-tune the language model, which reduces memory requirements during fine-tuning without compromising performance compared to conventional 16-bit fine-tuning. Specifically, QLoRA applies 4-bit quantization to compress a pre-trained language model. The quantized parameters are then frozen, and a small set of trainable parameters is added in the form of Low-Rank Adapters. During fine-tuning, QLoRA backpropagates gradients through the frozen 4-bit quantized parameters into the Low-Rank Adapters. Only the LoRA layers are updated during training.
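As a reminder of what is actually trained, the standard LoRA parameterization can be written as below. This is the usual formulation from the LoRA literature and is assumed here; the text does not spell out the exact variant. $W_0$ denotes the frozen (4-bit quantized) weight, and $A$, $B$ are the trainable adapter matrices of rank $r$:

$$h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}.$$

If the "LoRA scaling factor" listed in Tables 6 and 7 denotes $\alpha$, then with rank $r = 512$ and $\alpha = 512$ the multiplier $\alpha / r$ equals 1, so the adapter output is added to the frozen path without rescaling.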

Prompt Details. We present the prompts used for reasoning, denoted as $P(\mathcal{R} \mid X, Y)$, and for inference, denoted as $P(Y \mid X, \mathcal{R})$, in Tables 8 and 9, respectively. We iteratively grow the logic tree by applying the structure learning prompt to each successive node in the current tree. Importantly, our prompts are crafted using a simple, predefined template. In this template, events_history is a textual version of the observed event sequence $X$, and target_event is the event id/name associated with $Y$. We also provide three inference examples in Table 9.
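Filling the structure learning template can be sketched as follows. The template string is copied from Table 8; the helper names `format_history` and `build_structure_prompt` are hypothetical, and the numbered "Event i is activated at time ..." layout for events_history is an assumption borrowed from the examples in Table 9.

```python
from collections import defaultdict

# Template text copied verbatim from Table 8.
STRUCTURE_PROMPT = (
    "I want you to do the reasoning over social events. "
    "Given event list: {total_events}\n\n"
    "We have the observations:\n{events_history}\n\n"
    "If the activation time of one event happens before Event {target_event}, "
    "it means that event could have caused Event {target_event} to be activated.\n"
    "If the activation time of one event do not happens before Event {target_event}, "
    "it means that event cannot cause the other event to be activated.\n"
    "Using this logic and based on the previous observation, You need to reason "
    "all possible events from above that can cause Event {target_event} to be activated.\n"
    "Start your answer from the most confident one and stop if you cannot find "
    "any other events.\n"
    "Answer: Event"
)


def format_history(events):
    """Turn (time, event) pairs into numbered lines, one line per event type."""
    times = defaultdict(list)
    for t, event in events:
        times[event].append(t)
    lines = []
    for idx, event in enumerate(sorted(times), start=1):
        stamps = ", ".join(str(t) for t in sorted(times[event]))
        lines.append(f"{idx}. Event {event} is activated at time {stamps}")
    return "\n".join(lines)


def build_structure_prompt(event_types, events, target_event):
    """Fill the Table 8 template for one target event."""
    return STRUCTURE_PROMPT.format(
        total_events=", ".join(str(e) for e in event_types),
        events_history=format_history(events),
        target_event=target_event,
    )


history = [(0.2, 0), (0.5, 1), (0.1, 2), (0.3, 0)]
prompt = build_structure_prompt([0, 1, 2], history, target_event=2)
```

Applying the same call with each newly added node as target_event mirrors the level-by-level tree growth described above.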

Table 6: Descriptions and values of hyperparameters used for models trained on the four synthetic datasets.

| HYPERPARAMETERS | SYNTHETIC@5 | SYNTHETIC@10 | SYNTHETIC@20 | SYNTHETIC@40 |
|---|---|---|---|---|
| EPOCHS | 10 | 10 | 10 | 10 |
| ALTERNATE EVERY | 1 | 1 | 1 | 1 |
| BATCH SIZE | 8 | 8 | 8 | 8 |
| LLM LR | 5e-4 | 5e-4 | 5e-4 | 5e-4 |
| LLM SIZE (E-STEP) | opt-1.3b | opt-1.3b | opt-1.3b | opt-1.3b |
| LLM SIZE (M-STEP) | zephyr-3b | zephyr-3b | zephyr-3b | zephyr-3b |
| LOGIC MODEL UPDATE STEPS | 1 | 1 | 1 | 1 |
| LOGIC MODEL LR | 0.001 | 0.001 | 0.001 | 0.001 |
| LOGIC TREE DEPTH | 3 | 3 | 3 | 3 |
| LOGIC TREE WIDTH | 5 | 8 | 5 | 3 |
| TOP K | 2 | 2 | 2 | 2 |
| WARMUP LEARNING RATE | True | True | True | True |
| LoRA RANK | 512 | 512 | 512 | 512 |
| LoRA SCALING FACTOR | 512 | 512 | 512 | 512 |
| LoRA DROPOUT | 0. | 0. | 0. | 0. |
Table 7: Descriptions and values of hyperparameters used for models trained on the three real-world datasets.

| HYPERPARAMETERS | EPIC-100 | STACKOVERFLOW | MIMIC-3 |
|---|---|---|---|
| EPOCHS | 20 | 20 | 20 |
| ALTERNATE EVERY | 1 | 1 | 1 |
| BATCH SIZE | 2 | 2 | 2 |
| LLM LR | 5e-4 | 5e-4 | 5e-4 |
| LLM SIZE (E-STEP) | opt-1.3b | opt-1.3b | opt-1.3b |
| LLM SIZE (M-STEP) | mistral-7b | zephyr-3b | zephyr-3b |
| LOGIC MODEL UPDATE STEPS | 1 | 1 | 1 |
| LOGIC MODEL LR | 0.001 | 0.001 | 0.001 |
| TREE DEPTH | 3 | 3 | 3 |
| TREE WIDTH | 4 | 3 | 2 |
| TOP K | 2 | 3 | 2 |
| WARMUP LEARNING RATE | True | True | True |
| LoRA RANK | 512 | 512 | 512 |
| LoRA SCALING FACTOR | 512 | 512 | 512 |
| LoRA DROPOUT | 0. | 0. | 0. |
Bayesian Structure Learning $P(\mathcal{R} \mid X, Y)$
Template I want you to do the reasoning over social events. Given event list: {total_events}

We have the observations:
{events_history}

If the activation time of one event happens before Event {target_event}, it means that event could have caused Event {target_event} to be activated.
If the activation time of one event do not happens before Event {target_event}, it means that event cannot cause the other event to be activated.
Using this logic and based on the previous observation, You need to reason all possible events from above that can cause Event {target_event} to be activated.
Start your answer from the most confident one and stop if you cannot find any other events.
Answer: Event
Table 8: Prompts used for structure learning
Table 9: Prompts used for next-event inference. The rationales field stores the text representation of the logic tree, obtained by traversing all of its paths. The reasoning path is highlighted in red.
Template, Direct Inference $P(Y \mid X)$:
I want you to perform inference over social events.
{examples}
Now you have event: {total_events}
We have the observations: {events_history}
then, the most likely event (chosen from event list : {possible_events}) to happen after {time} is Event:

Template, Reasoning-based Inference $P(Y \mid X, \mathcal{R})$:
I want you to perform inference over social events.
{examples}
Now you have event: {total_events}
and rules: {rationales}
We have the observations: {events_history}
then, the most likely event (chosen from event list : {possible_events}) to happen after {time} is Event:

Example 1, Direct Inference:
Given Events 0, 1
We have the observations:
1. Event 0 is activated at time 0.4

then, the most likely event (choose from event list: 0, 1) to happen after 0.4 is Event 1

Example 1, Reasoning-based Inference:
Given Events 0, 1 and rules:
1. Event 1 ← (Event 0) and (Time of Event 1 after Time of Event 0)

We have the observations:
1. Event 0 is activated at time 0.4

then, the most likely event (choose from event list: 0, 1) to happen after 0.4 is Event 1

Example 2, Direct Inference:
Given Events 0, 1, 2
We have the observations:
1. Event 1 is activated at time 0.2

then, the most likely event (chosen from event list : 0, 1, 2) to happen after 0.2 is Event 0

Example 2, Reasoning-based Inference:
Given Events 0, 1, 2 and rules:

1. Event 0 ← (Event 1) and (Time of Event 0 after Time of Event 1),
2. Event 0 ← (Event 2) and (Time of Event 0 after Time of Event 2)
We have the observations:
1. Event 1 is activated at time 0.2
then, the most likely event (chosen from event list : 0, 1, 2) to happen after 0.2 is Event 0

Example 3, Direct Inference:
Given Events 0, 1, 2

We have the following observation:
1. Event 0 is activated at time 0.2, 0.3, 0.5
2. Event 1 is activated at time 0.5, 0.6
3. Event 2 is activated at time 0.1, 0.4

then, the most likely event (chosen from event list : 0, 1, 2) to happen after 0.8 is Event 2

Example 3, Reasoning-based Inference:
Given Events 0, 1, 2 and rules:

1. Event 2 ← (Event 1) and (Event 0) and (Time of Event 2 after Time of Event 1) and (Time of Event 1 after Event 0)

We have the following observation:
1. Event 0 is activated at time 0.2, 0.3, 0.5
2. Event 1 is activated at time 0.5, 0.6
3. Event 2 is activated at time 0.1, 0.4

then, the most likely event (chosen from event list : 0, 1, 2) to happen after 0.8 is Event 2