Amortized Equation Discovery in Hybrid Dynamical Systems

Yongtuo Liu    Sara Magliacane    Miltiadis Kofinas    Efstratios Gavves
Abstract

Hybrid dynamical systems are prevalent in science and engineering to express complex systems with continuous and discrete states. To learn the laws of such systems, all previous methods for equation discovery in hybrid systems follow a two-stage paradigm, i.e. they first group time series into small cluster fragments and then discover equations in each fragment separately through methods developed for non-hybrid systems. Although effective, these methods do not take full advantage of the commonalities in the shared dynamics of multiple fragments that are driven by the same equations. Besides, the two-stage paradigm breaks the interdependence between categorizing and representing dynamics that jointly form hybrid systems. In this paper, we reformulate the problem and propose an end-to-end learning framework, i.e. Amortized Equation Discovery (AMORE), to jointly categorize modes and discover equations characterizing the dynamics of each mode by all segments of the mode. Experiments on four hybrid and six non-hybrid systems show that our method outperforms previous methods on equation discovery, segmentation, and forecasting.

Machine Learning, ICML

1 Introduction

Complex systems in science and engineering often exhibit behaviors and patterns that change over time. Hybrid dynamical systems (Van Der Schaft & Schumacher, 2007) characterize these systems by continuous time series which are interleaved with structural changes producing discrete modes. For instance, consider the motions of antelopes in a herd and how these suddenly change in the presence of lions. Hybrid systems are researched widely with applications in epidemiology (Keeling et al., 2001), legged locomotion (Holmes et al., 2006), robotics (Cortes, 2008), the design of cyber-physical systems (Sanfelice et al., 2016), and systems with interacting objects (Liu et al., 2023).

A major challenge with hybrid dynamical systems is that one cannot know a priori the number of possible modes and when the switching between them happens. The dynamic modes might alternate from one to another constantly and at any time, due to either internal mechanisms or external influences. When modeling generalized time series as hybrid dynamical systems, it is thus crucial that we categorize the complex dynamics into the most likely discrete modes while characterizing the continuous motion dynamics in between.

Figure 1: (a) Previous methods for equation discovery in hybrid dynamical systems typically follow a two-stage paradigm, i.e. they first group time series into small cluster fragments and then apply methods proposed in non-hybrid systems, e.g. SINDy (Brunton et al., 2016) to discover equations in each fragment separately. (b) Different from all previous methods, we reformulate the problem and propose a one-stage end-to-end learning framework, Amortized Equation Discovery (a.k.a. AMORE), to jointly categorize hybrid systems into discrete modes and discover equations characterizing motion dynamics of each mode based on all segments belonging to the mode.

Another challenge with characterizing dynamics in hybrid systems, especially physical ones, is that predictive models are often not interpretable. We are often interested in the underlying laws that govern the dynamics, thus preferring analytic models, usually in the form of closed-form ordinary differential equations. Equation discovery from first principles is a challenging problem in all fields of science. To bypass expensive and cumbersome targeted experimentation, researchers have explored using data-driven methods for equation discovery of systems from observations (Langley, 1981; Lemos et al., 2023). They distill parsimonious equations from data and find that compared with black-box neural networks, learned equations can provide insight into the essential dynamics of systems and tend to generalize better (Lutter et al., 2019; Karniadakis et al., 2021).

Equation discovery for hybrid dynamical systems has been a topic of interest for a long time (Vidal et al., 2003; Ozay et al., 2008; Bako, 2011; Ohlsson & Ljung, 2013). Recently, Mangan et al. (2019) and Novelli et al. (2022) proposed methods for equation discovery in non-linear hybrid systems. Both methods consist of two stages: they first group time series into a large number of small cluster fragments and then apply an equation discovery method proposed for non-hybrid systems, e.g. SINDy (Brunton et al., 2016), to discover equations in each fragment separately. This separate two-stage processing limits the potential performance because it does not leverage the commonalities across fragments from the same mode and splits learning into two separate stages, i.e. categorizing and then representing motion dynamics, which jointly form hybrid systems.

In this paper, we reformulate the problem of equation discovery in hybrid dynamical systems and propose a one-stage end-to-end learning framework, Amortized Equation Discovery (a.k.a. AMORE), to jointly categorize motion dynamics and discover equations by modeling categorical modes and mode-switching behaviors within systems. Equations are discovered to characterize the dynamics of each mode based on all segments that are assigned to the mode, by learning combinations of candidate basis functions and encouraging parsimony. To model switching behaviors, inspired by REDSDS (Ansari et al., 2021), we infer latent categorical variables, i.e. mode variables, to categorize motion dynamics into discrete modes and learn probabilistic transition behaviors within them. Equations, mode variables, and mode-switching behaviors are jointly learned in the proposed end-to-end learning framework by maximizing the system observation likelihood. We also consider another challenge in previous methods for equation discovery for hybrid systems: they are limited to single-object scenarios where the dynamics of only one object or objects as a whole are considered. We extend our method to multi-object scenarios, AMORE-MIO, where multiple objects interact with each other and change their dynamics accordingly. Extensive experiments on single- and multi-object hybrid systems demonstrate the superior performance of our method on equation discovery, segmentation, and forecasting. The code and datasets are available at https://github.com/yongtuoliu/Amortized-Equation-Discovery-in-Hybrid-Dynamical-Systems.

2 Related Work

Equation discovery in hybrid dynamical systems.

Prior works focus on the simplest hybrid dynamical models, i.e. piece-wise affine systems with linear transition rules (Vidal et al., 2003; Ferrari-Trecate et al., 2003; Roll et al., 2004; Juloski et al., 2005; Paoletti et al., 2007; Ozay et al., 2008; Bako, 2011; Ohlsson & Ljung, 2013). Recently, Mangan et al. (2019) and Novelli et al. (2022) relax these constraints and propose methods for non-linear hybrid systems. Among them, Hybrid-SINDy (Mangan et al., 2019) proposes a two-stage method, i.e. it first uses k-NN to group time series into small cluster fragments and then discovers governing equations separately in each fragment by models proposed for non-hybrid systems, e.g. SINDy (Brunton et al., 2016). Building on Hybrid-SINDy, Novelli et al. (2022) also follow a two-stage paradigm while introducing the number of discontinuities in hybrid systems as a known prior for better performance. Although effective, these two-stage methods learn the mode of each segment individually and do not leverage the commonalities across segments. In this paper, we reformulate the problem and propose an amortized end-to-end learning framework to jointly categorize modes, discover equations, and learn mode-switching behaviors.

Equation discovery in non-hybrid dynamical systems.

Many methods have been proposed for equation discovery in non-hybrid dynamical systems. Bongard & Lipson (2007) and Schmidt & Lipson (2009) leverage genetic programming (Koza et al., 1994) to discover nonlinear differential equations from data. SINDy (Brunton et al., 2016) uses symbolic sparse regression on a library of candidate model terms to select the fewest terms required to describe the observed dynamics. Several methods extend this approach to new settings, e.g. identifying partial differential equations (Rudy et al., 2017), considering additional physical constraints (Loiseau & Brunton, 2018), including control signals (Kaiser et al., 2018), and introducing integral terms for denoising (Schaeffer & McCalla, 2017). These methods cannot be directly applied to hybrid systems because they cannot model an unknown number of modes and unknown mode-switching behaviors.

Switching dynamical systems.

Switching dynamical systems refer to the same systems as hybrid dynamical systems, but highlight different aspects in the literature. Hybrid systems denote systems with a mixture of continuous and discrete states, while switching dynamical systems highlight the switching behaviors of system states. Many methods focus on switching linear dynamical systems where they set matrix calculations to model linear state transitions (Ackerson & Fu, 1970; Ghahramani & Hinton, 2000; Oh et al., 2005). Recently, switching non-linear dynamical systems model state transitions as neural networks, e.g. SNLDS (Dong et al., 2020), REDSDS (Ansari et al., 2021), and GRASS (Liu et al., 2023). While effective in modeling state-switching behaviors, they cannot discover closed-form equations from data. To categorize dynamics, our method is inspired by previous switching dynamical systems (Dong et al., 2020; Ansari et al., 2021; Liu et al., 2023) to infer latent mode variables. Differently, our method can jointly discover parsimonious closed-form equations to characterize dynamics and infer the values of the mode variables.

3 Equation Discovery in Dynamical Systems

In dynamical systems, the dynamics can be expressed by sets of differential equations in the form:

$$\dot{\mathbf{y}}_{t}\coloneqq\frac{d\mathbf{y}_{t}}{dt}=\mathbf{f}(\mathbf{y}_{t}). \tag{1}$$

Equation discovery in dynamical systems is the task of learning the function $\mathbf{f}:\mathbb{R}^{M}\rightarrow\mathbb{R}^{M}$ from time-series observations $\mathbf{y}=\{\mathbf{y}_{1},\cdots,\mathbf{y}_{T}\}$, where each state $\mathbf{y}_{t}=[y_{t}^{1},\cdots,y_{t}^{M}]\in\mathbb{R}^{M}$. Following SINDy (Brunton et al., 2016), we approximate each dimension $\dot{y}_{t}^{m}$ for $m\in[M]$ of $\dot{\mathbf{y}}_{t}$ in Eq. (1) as

$$\dot{y}_{t}^{m}=\frac{dy_{t}^{m}}{dt}=f_{m}(\mathbf{y}_{t})\approx\Theta(\mathbf{y}_{t})\xi_{m}, \tag{2}$$

where $\Theta(\mathbf{y}_{t})=[\theta_{1}(\mathbf{y}_{t}),\theta_{2}(\mathbf{y}_{t}),\cdots,\theta_{P}(\mathbf{y}_{t})]$ is a set of candidate basis functions and $\xi_{m}$ is a sparse vector indicating which of these function terms are active in characterizing the dynamics. We encourage the sparsity of $\xi_{m}$ based on Occam’s razor principle, where the simplest model is likely the correct one. Ideally, we could encourage this principle by minimizing the $L_{0}$ norm of the coefficients and solving the following constrained minimization problem

$$\min_{\xi}\|\xi\|_{0}\quad\text{subject to}\quad\|\Theta(\mathbf{y}_{t})\xi-\dot{\mathbf{y}}_{t}\|\leq\epsilon, \tag{3}$$

where $\epsilon$ is a hyperparameter representing maximal optimization errors. The $L_{0}$ regularization penalizes the number of non-zero entries to encourage sparsity. However, optimization under this penalty is computationally intractable due to the non-differentiability and the combinatorial nature of all possible states. Various continuous relaxation methods are proposed in the literature to handle the optimization problems of the $L_{0}$ norm, e.g. $L_{1}$, $L_{2}$, etc. As our focus in this paper is not to design advanced optimization methods, we utilize the simple and effective $L_{1}$ norm to optimize Eq. (3).
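
As a point of reference, one common way to write an $L_{1}$ relaxation of Eq. (3) is the penalized form below. This is only a sketch: the squared $L_{2}$ data term and the trade-off weight $\lambda$ are assumptions introduced here for illustration and are not specified in the text above.

```latex
\min_{\xi}\; \left\|\Theta(\mathbf{y}_{t})\,\xi-\dot{\mathbf{y}}_{t}\right\|_{2}^{2}
  + \lambda\,\|\xi\|_{1}, \qquad \lambda > 0 .
```

Larger values of $\lambda$ push more entries of $\xi$ toward zero, trading fitting accuracy for parsimony.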

We implement the coefficients $\xi_{m}$ as learnable weights in neural networks. We set the polynomial degree as $D$ and use a set of learnable weights $\mathbf{w}=[w_{1},\cdots,w_{C}]$ to model the coefficients of $C$ candidate basis functions. For instance, if the observation $\mathbf{y}_{t}=[a,b]$ is a two-dimensional vector and we set the polynomial degree $D$ as 2, the candidate basis polynomial functions are $\Theta(\mathbf{y}_{t})=[1,a,b,a^{2},b^{2},ab]$. In this case, $C=6$ and $\mathbf{w}=[w_{1},\cdots,w_{6}]$.
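
The construction of the polynomial candidate library can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function name `polynomial_library` and the term names `y0`, `y1` are hypothetical, and the terms come out in the order $[1,a,b,a^{2},ab,b^{2}]$, which differs from the ordering in the text but contains the same $C=6$ functions.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_library(y, degree):
    """Evaluate all monomials of the entries of `y` up to total degree `degree`.

    Returns (names, values); the constant term 1 comes first.
    """
    dims = range(len(y))
    names, values = [], []
    for d in range(degree + 1):
        # Each multiset of d variable indices corresponds to one monomial of degree d.
        for idx in combinations_with_replacement(dims, d):
            names.append("*".join(f"y{i}" for i in idx) or "1")
            values.append(prod(y[i] for i in idx))
    return names, values


# Two-dimensional observation [a, b] = [2, 3] with polynomial degree D = 2.
names, values = polynomial_library([2.0, 3.0], degree=2)
print(names)
print(values)
```

Each entry of `values` would then be multiplied by one learnable weight, so the length of the list is the number of coefficients per state dimension.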

4 Equation Discovery in Hybrid Systems

Hybrid dynamical systems produce generalized time series with continuous states and discrete events that need to be modeled, featuring multiple modes that represent different types of dynamics. Instead of learning a single equation for each dimension $m\in[M]$, as described in Sec. 3, we learn $K$ sets of equations for each dimension $m$ that represent $K$ different modes in hybrid systems. We first introduce how we model mode-switching behaviors and then introduce our whole framework for equation discovery in hybrid systems.

Mode variables.

To model modes and mode-switching behaviors in hybrid systems, inspired by REDSDS (Ansari et al., 2021), we introduce latent categorical variables, i.e. mode variables, to learn categorical distributions of modes and index each set of equations representing each type of dynamics. Specifically, mode variables are discrete variables $\mathbf{z}\coloneqq\mathbf{z}_{1:T}$, where each $z_{t}\in\{1,\dots,K\}$ categorizes the mode at time step $t\in\{1,\dots,T\}$.

Count variables.

Besides mode variables, we also model latent count variables to learn the duration distributions of each mode. Count variables can help us avoid frequent mode switching, thanks to the fact that mode durations typically follow a geometric distribution (Ansari et al., 2021). They are modeled as discrete categorical variables $\mathbf{c}\coloneqq\mathbf{c}_{1:T}$, where each state $c_{t}\in\{1,\dots,d_{\max}\}$ explicitly models the run-length of the currently active mode at time $t$ and $d_{\max}$ is the maximal number of steps before a mode switch. Count variables $c_{t}$ are incremented by 1 when the mode $z_{t}=k$ stays the same at the next time step, $z_{t+1}=k$, or they reset to 1 when mode $z_{t}=k$ switches to another one, $z_{t+1}\neq k$.
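
The increment-or-reset rule for count variables can be illustrated with a short Python sketch. It assumes the mode sequence is already known, whereas in the model both modes and counts are latent and inferred; the function name `run_length_counts` is hypothetical, and the cap $d_{\max}$ is not enforced here.

```python
def run_length_counts(modes):
    """Return c_t for every time step: 1 at the start of a segment,
    then incremented by 1 for each further step spent in the same mode."""
    counts = []
    for t, z in enumerate(modes):
        if t > 0 and z == modes[t - 1]:
            counts.append(counts[-1] + 1)  # mode unchanged: increment the run length
        else:
            counts.append(1)  # first step or mode switch: reset to 1
    return counts


modes = [1, 1, 1, 2, 2, 1, 1, 1, 1]
counts = run_length_counts(modes)
print(counts)
```

A long run of the same mode shows up as a steadily increasing count, which is the quantity a duration model can condition on when deciding whether to switch.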

Mode-specific equation discovery.

Each mode $k\in[K]$ is assigned its own set of candidate basis functions $\Theta_{k}(\mathbf{y}_{t})$ and learnable coefficient weights $\mathbf{w}_{k}$, which we will use to discover its equation. For instance, at time $t$, the mode variable $z_{t}=k$ indexes the candidate basis functions $\Theta_{k}(\mathbf{y}_{t})$ and the learnable weights $\mathbf{w}_{k}$, which together define the equation representing the dynamics of mode $k$. In practice, different modes share the same candidate basis functions, i.e. $\Theta_{j}(\mathbf{y}_{t})=\Theta_{k}(\mathbf{y}_{t}),\ \forall j,k\in\{1,\cdots,K\}$, unless we have some prior knowledge of the hybrid system. However, the learnable coefficient weights of different modes are individual and never shared, i.e. $\mathbf{w}_{j}$ and $\mathbf{w}_{k}$ are separate parameters for all $j\neq k$ with $j,k\in\{1,\cdots,K\}$. We collect all the candidate basis functions in a single vector $\Theta(\mathbf{y}_{t})=(\Theta_{1}(\mathbf{y}_{t}),\cdots,\Theta_{K}(\mathbf{y}_{t}))$ and similarly we collect all learnable coefficient weights $\mathbf{w}=(\mathbf{w}_{1},\cdots,\mathbf{w}_{K})$.

Figure 2: Generative model for amortized equation discovery. $p(c_{t}|c_{t-1},z_{t-1})$ and $p(z_{t}|z_{t-1},c_{t},\mathbf{y}_{t-1})$ are count and mode transition probabilities, respectively. $p(\mathbf{y}_{t}|\mathbf{y}_{t-1},z_{t})$ denotes the observation transition probability where equations are discovered to characterize the dynamics of each mode.

Figure 3: (a) Generative model of AMORE-MIO. $p(\mathbf{e}_{t}^{1:N^{2}}|\mathbf{e}_{t-1}^{1:N^{2}},\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t-1}^{1:N},\mathbf{y}_{t-1}^{1:N})$, $p(\mathbf{c}_{t}^{1:N}|\mathbf{c}_{t-1}^{1:N},\mathbf{z}_{t-1}^{1:N})$, $p(\mathbf{z}_{t}^{1:N}|\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t}^{1:N},\mathbf{y}_{t-1}^{1:N},\mathbf{e}_{t}^{1:N^{2}})$, and $p(\mathbf{y}_{t}^{1:N}|\mathbf{y}_{t-1}^{1:N},\mathbf{z}_{t}^{1:N})$ denote the edge, count, mode, and observation transition probabilities, respectively. Equations are modeled at $p(\mathbf{y}_{t}^{1:N}|\mathbf{y}_{t-1}^{1:N},\mathbf{z}_{t}^{1:N})$, which characterizes object-shared and mode-specific dynamics. (b) Inference model of AMORE-MIO. Left: approximate posterior inference of edge variables $\mathbf{e}_{t}^{1:N^{2}}$. Right: exact inference of discrete mode and count variables $\mathbf{z}_{t}^{1:N}$ and $\mathbf{c}_{t}^{1:N}$ based on observations $\mathbf{y}_{t}^{1:N}$ and the approximate edge variables $\mathbf{e}_{t}^{1:N^{2}}$. Orange arrows denote the approximate inference flow.

Generative model for AMORE.

Assuming Markovian dynamics, the joint generative probability of hybrid systems in our model is described as

$$p(\mathbf{y},\mathbf{z},\mathbf{c})=\underbrace{p(\mathbf{y}_{1}|z_{1})\,p(z_{1})}_{\text{Initial States}}\cdot\prod_{t=2}^{T}\Big[p(\mathbf{y}_{t}|\mathbf{y}_{t-1},z_{t})\,p(z_{t}|z_{t-1},c_{t},\mathbf{y}_{t-1})\,p(c_{t}|c_{t-1},z_{t-1})\Big] \tag{4}$$

In the initial states, count variables are ignored as they are always 1 when starting. $p(z_1)$ is the initial distribution over all possible modes. $p(\mathbf{y}_1 | z_1)$ models the initial observation states conditioned on the initial modes. For later time steps $t \geq 2$, the count transition probability $p(c_t | c_{t-1}, z_{t-1})$ models how the count variables at time $t$ change depending on their previous values and the mode variables at time $t-1$. The mode transition probability $p(z_t | z_{t-1}, c_t, \mathbf{y}_{t-1})$ models how modes switch at time $t$ conditioned on the previous mode and observation states at time $t-1$ as well as the updated count state $c_t$ at time $t$.
The observation transition probability $p(\mathbf{y}_t | \mathbf{y}_{t-1}, z_t)$ models how the observations at time $t$ are influenced by their previous values at time $t-1$ conditioned on the updated mode variables at time $t$. Equations are amortized and learned at $p(\mathbf{y}_t | \mathbf{y}_{t-1}, z_t)$ by all segments of each mode to characterize mode dynamics. More specifically, conditioned on motion mode $z_t = k$, $p(\mathbf{y}_t | \mathbf{y}_{t-1}, z_t)$ first indexes a set of candidate basis functions $\Theta_k$ and coefficient weights $\mathbf{w}_k$, which are used together to obtain derivatives $\dot{\mathbf{y}}_{t-1} = \Theta_k \cdot \mathbf{w}_k$ of $\mathbf{y}_{t-1}$ at time $t-1$.
With known time intervals $\Delta_t$, we finally obtain $\mathbf{y}_t = \dot{\mathbf{y}}_{t-1} \cdot \Delta_t + \mathbf{y}_{t-1}$, assuming the dynamics do not change much in short time intervals. For inference of the latent mode and count variables, we conduct exact inference similar to the forward-backward algorithm in HMM (Eddy, 1996). The graphical model for the exact inference is the same as the generative model, which is illustrated in Figure 2. The neural network implementations and details of the inference model are in Appendix A.1 and A.2.
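
The observation transition described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the library `library`, the helper `euler_step`, and the toy weight values are all assumptions made for the example.

```python
from typing import List, Sequence

def library(y: Sequence[float]) -> List[float]:
    """Hypothetical candidate terms Theta(y) = [1, y0, y1, y0*y1] for a 2-D state."""
    return [1.0, y[0], y[1], y[0] * y[1]]

def euler_step(y_prev: Sequence[float],
               w_k: Sequence[Sequence[float]],
               dt: float) -> List[float]:
    """Return y_t = y_{t-1} + (Theta(y_{t-1}) . w_k) * dt for the selected mode k.

    w_k[d][j] is the coefficient of library term j in the equation of
    state dimension d.
    """
    theta = library(y_prev)
    y_dot = [sum(c * th for c, th in zip(row, theta)) for row in w_k]
    return [y + yd * dt for y, yd in zip(y_prev, y_dot)]

# Illustrative weights for K = 2 modes (toy values, not taken from the paper).
weights = [
    [[0.0, 0.0, 1.0, 0.0], [0.0, -1.0, 0.0, 0.0]],  # mode 0: y0' = y1, y1' = -y0
    [[0.0, 0.0, 1.0, 0.0], [-1.0, 0.0, 0.0, 0.0]],  # mode 1: y0' = y1, y1' = -1
]

z_t = 1                  # mode active at time t
y_prev = [1.0, 0.5]      # observation at time t-1
y_next = euler_step(y_prev, weights[z_t], dt=0.1)
print(y_next)
```

The mode variable only selects which row block of coefficients is used; the update rule itself is identical for every mode.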

Learnable parameters of AMORE are optimized by maximizing the observation likelihood with sparse regularization on coefficient weights $\mathbf{w}$ of candidate basis functions

$$
\begin{aligned}
\mathcal{L}_{\textrm{AMORE}} &= -\log p_{\theta}(\mathbf{y}) + \|\mathbf{w}\|_1 \\
&= -\mathbb{E}_{p_{\theta}(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\log p_{\theta}(\mathbf{y},\mathbf{z},\mathbf{c})\right] + \|\mathbf{w}\|_1.
\end{aligned}
\tag{5}
$$

The derivatives of the training objective and further expansions over time are detailed in Appendix A.3.
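
To make the objective in Eq. (5) concrete, the sketch below computes $-\log p(\mathbf{y})$ with an HMM-style forward recursion and adds an $\ell_1$ penalty. It is a deliberately simplified stand-in under stated assumptions: there are no count variables, the mode transition matrix is constant instead of depending on $\mathbf{y}_{t-1}$ and $c_t$, the state is scalar with library $[1, y]$, the observation noise is Gaussian with a fixed `std`, and the initial term $p(\mathbf{y}_1|z_1)$ is dropped as a constant. The function name `sparse_nll` and the weight `lam` are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gaussian_log_pdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def sparse_nll(y, w, log_init, log_trans, dt=0.1, std=0.02, lam=1.0):
    """-log p(y) + lam * ||w||_1 for a simplified switching model.

    y[t]            scalar observations
    w[k] = (a, b)   mode-k equation y_dot = a + b * y
    log_init[k]     log p(z_1 = k)
    log_trans[j][k] log p(z_t = k | z_{t-1} = j)  (constant here, an assumption)
    """
    num_modes = len(w)
    alpha = list(log_init)  # forward messages in log space
    for t in range(1, len(y)):
        new_alpha = []
        for k in range(num_modes):
            # Euler prediction of y_t under mode k, then Gaussian likelihood.
            mean = y[t - 1] + (w[k][0] + w[k][1] * y[t - 1]) * dt
            emit = gaussian_log_pdf(y[t], mean, std)
            prior = log_sum_exp([alpha[j] + log_trans[j][k] for j in range(num_modes)])
            new_alpha.append(emit + prior)
        alpha = new_alpha
    nll = -log_sum_exp(alpha)
    l1 = sum(abs(c) for row in w for c in row)
    return nll + lam * l1

# Toy series: rises with slope +1 for three steps, then falls with slope -1.
y = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
loss_true = sparse_nll(y, [[1.0, 0.0], [-1.0, 0.0]], log_init, log_trans)
loss_zero = sparse_nll(y, [[0.0, 0.0], [0.0, 0.0]], log_init, log_trans)
print(loss_true, loss_zero)
```

On this toy series the two-mode weights that generated the data give a lower loss than all-zero weights even after paying the $\ell_1$ penalty, because the likelihood gain outweighs the penalty.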

5 Equation Discovery in Multi-object Hybrid Systems

While equation discovery in hybrid dynamical systems has been researched in single-object scenarios, the more general setting of systems with multiple potentially interacting objects is an unexplored yet natural setting. In this section, we elaborate on how our model can be extended for multi-object scenarios, and present AMORtized Equation discovery in MultI-Object hybrid systems, a.k.a. AMORE-MIO. We first introduce how our method models interactions and then introduce the whole framework of AMORE-MIO.

Edge variables.

Assume that $N$ objects and $K$ motion modes exist in multi-object hybrid systems. Inspired by Kipf et al. (2018); Liu et al. (2023), we model interactions between objects by latent edge variables $\mathbf{e} = \mathbf{e}_{1:T}^{1:N^2} = \{e_t^1, \cdots, e_t^{N^2}\}_{t=1}^{T}$ including self-loops, for a total of $N^2$ edges for $N$ objects. For each pair of objects, the interaction $e_t^{m \rightarrow n}$ models whether object $m$ interacts with object $n$ at time $t$.
The edge variables are modeled in a latent temporal graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$, where edges $e_t^{m \rightarrow n} \in \mathcal{E}_t$ and nodes $\mathcal{V}_t$ summarize states of objects. For instance, $\mathbf{v}^m_t = \{z^m_t, c^m_t, \mathbf{y}^m_t\}$ for $\mathbf{v}^m_t \in \mathcal{V}_t$ defines one graph node summarizing the states of object $m$ at time $t$, which include the observation $\{\mathbf{y}^m_t\}$ and the latent states $\{z^m_t, c^m_t\}$. Edge $e_t^{m \rightarrow n}$ signals interaction relationships between node $\mathbf{v}_t^m$ and node $\mathbf{v}_t^n$ in graph $\mathcal{G}_t$.
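
As a small illustration of this bookkeeping, the sketch below enumerates the $N^2$ directed edges (self-loops included) and stores one binary edge state per edge at a single time step. The helper names `directed_edges` and `edge_index`, the flat row-major indexing, and the binary encoding are assumptions for the example only.

```python
from itertools import product

def directed_edges(num_objects):
    """All ordered pairs (m, n), self-loops included: N^2 edges for N objects."""
    return list(product(range(num_objects), repeat=2))

def edge_index(m, n, num_objects):
    """Flat index of edge m -> n in the list returned by directed_edges."""
    return m * num_objects + n

N = 3
edges = directed_edges(N)

# Hypothetical edge states e_t at one time step: 1 = interacting, 0 = not.
e_t = [0] * len(edges)
e_t[edge_index(0, 2, N)] = 1  # object 0 interacts with object 2

# Objects whose interaction can influence object 2 at this time step.
senders_to_2 = [m for (m, n) in edges if n == 2 and e_t[edge_index(m, n, N)] == 1]
print(len(edges), senders_to_2)
```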

Object-shared and mode-specific equation discovery.

We set the number of all possible motion modes as $K$ in multi-object hybrid dynamical systems. The $K$ motion modes are shared across $N$ objects, which are implemented by $K$ sets of candidate basis functions $\Theta(\mathbf{y}_t) = (\Theta_1(\mathbf{y}_t), \cdots, \Theta_K(\mathbf{y}_t))$ and learnable coefficient weights $\mathbf{w} = (\mathbf{w}_1, \cdots, \mathbf{w}_K)$. Each mode $k \in [K]$ has its own $\Theta_k(\mathbf{y}_t)$ and $\mathbf{w}_k$ as in Sec. 4. Thus there are $K$ sets of learnable weights for learning dynamics of $K$ modes across $N$ objects. Both the time and space complexity of AMORE-MIO regarding learnable weights of basis functions is $\mathcal{O}(K)$.
For instance, the mode variable $z_t^n = k$ of the object $n$ at time $t$ indexes $\Theta_k(\mathbf{y}_t)$ and $\mathbf{w}_k$ which together form equations to represent the dynamics of mode $k$. Different from single-object scenarios, the mode-switching behaviors of each object are not only influenced by their own evolving nature but also by external influences of potentially interacting objects. We model the influences of interactions on the mode-switching behaviors between objects, which are detailed in the following generative model of AMORE-MIO.
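
The sharing scheme can be sketched as a lookup table indexed by each object's mode variable. The library, weights, modes, and positions below are toy values chosen for illustration, and `derivative` is a hypothetical helper; the point is only that the number of coefficients depends on $K$ and not on $N$.

```python
K, N = 2, 5                  # toy numbers of modes and objects
terms = ["1", "x", "y"]      # hypothetical candidate library for a 2-D state (x, y)
state_dim = 2

# One coefficient table per mode: weights[k][d][j] multiplies term j in dimension d.
weights = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, -1.0]],  # mode 0: x' = x, y' = -y
    [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],   # mode 1: x' = 0, y' = 2
]

z_t = [0, 1, 1, 0, 1]        # mode variable z_t^n of each object n
positions = [(1.0, 2.0), (0.0, 1.0), (3.0, 3.0), (-1.0, 0.5), (2.0, 0.0)]

def derivative(n):
    """Derivative of object n under the equations of its current mode."""
    x, y = positions[n]
    theta = [1.0, x, y]
    w_k = weights[z_t[n]]    # shared lookup: no per-object parameters
    return [sum(c * th for c, th in zip(row, theta)) for row in w_k]

y_dots = [derivative(n) for n in range(N)]
num_params = K * state_dim * len(terms)  # independent of N
print(y_dots, num_params)
```

Objects 1, 2, and 4 are all in mode 1 here, so they read the same coefficient table, and any gradient signal from their segments would update the same weights.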

Generative model for amortized equation discovery in multi-object settings.

Assuming Markovian dynamics, the joint generative probability of multi-object hybrid systems in AMORE-MIO is calculated as

$$
\begin{aligned}
p(\mathbf{y},\mathbf{e},\mathbf{z},\mathbf{c}) = {} & \underbrace{p(\mathbf{y}_1^{1:N}|\mathbf{z}_1^{1:N})\,p(\mathbf{z}_1^{1:N})}_{\rm Initial\,\,States} \cdot \\
& \prod_{t=2}^{T}\bigg[ p(\mathbf{y}_t^{1:N}|\mathbf{y}_{t-1}^{1:N},\mathbf{z}_t^{1:N})\, p(\mathbf{z}_t^{1:N}|\mathbf{z}_{t-1}^{1:N},\mathbf{y}_{t-1}^{1:N},\mathbf{c}_t^{1:N},\mathbf{e}_t^{1:N^2}) \\
& \quad\; p(\mathbf{c}_t^{1:N}|\mathbf{c}_{t-1}^{1:N},\mathbf{z}_{t-1}^{1:N})\, p(\mathbf{e}_t^{1:N^2}|\mathbf{e}_{t-1}^{1:N^2},\mathbf{c}_{t-1}^{1:N},\mathbf{z}_{t-1}^{1:N},\mathbf{y}_{t-1}^{1:N}) \bigg]
\end{aligned}
\tag{6}
$$

where the initial states and count transition probability are defined as in Eq. 4 of single-object scenarios but with $N$ objects. For later time steps $t \geq 2$, the edge transition probability $p(\mathbf{e}_t^{1:N^2}|\mathbf{e}_{t-1}^{1:N^2},\mathbf{c}_{t-1}^{1:N},\mathbf{z}_{t-1}^{1:N},\mathbf{y}_{t-1}^{1:N})$ models how the edge variables evolve depending on all the states at the previous time step.
We model the influences of interactions on the mode transition probability $p(\mathbf{z}_t^{1:N}|\mathbf{z}_{t-1}^{1:N},\mathbf{y}_{t-1}^{1:N},\mathbf{c}_t^{1:N},\mathbf{e}_t^{1:N^2})$, which characterizes the mode-switching behaviors of multi-object hybrid dynamical systems.
Based on the updated modes of each object, the observation transition probability $p(\mathbf{y}_t^{1:N}|\mathbf{y}_{t-1}^{1:N},\mathbf{z}_t^{1:N})$ can be factorized over objects as $\prod_{n=1}^{N} p(\mathbf{y}_t^{n}|\mathbf{y}_{t-1}^{n},z_t^{n})$, where equations of each mode are amortized and learned by all segments from all objects belonging to the same mode. The further expansion over objects of the joint generative probability is in Appendix B.1.
For inference of latent mode, count, and edge variables, we conduct posterior approximate inference for edge variables $q_{\phi_e}(\mathbf{e}|\mathbf{y})$ conditioned on observations $\mathbf{y}$, and then conduct exact inference of mode and count variables $p_{\theta}(\mathbf{z},\mathbf{c}|\mathbf{y},\tilde{\mathbf{e}})$ conditioned on observations $\mathbf{y}$ and the approximate edge variables $\tilde{\mathbf{e}} \sim q_{\phi_e}(\mathbf{e}|\mathbf{y})$. The generative and inference models of AMORE-MIO are illustrated in Fig. 3. Learnable parameters of AMORE-MIO are optimized by maximizing the evidence lower bound with sparse regularization on the learnable coefficient weights

$$
\mathcal{L}_{\textrm{AMORE-MIO}} = -\log p_{\theta}(\mathbf{y}) + D_{K\!L}\left[q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y}) \,\|\, p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\right] + \|\mathbf{w}\|_1
\tag{7}
$$

Neural network implementations, the derivations, and the detailed inference model are in Appendix B.2, B.4, and B.3.
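
Because the KL term in Eq. (7) is taken against the exact posterior, the first two terms of the loss together equal the negative evidence lower bound. The short derivation below uses only the product rule $p_{\theta}(\mathbf{y},\mathbf{z},\mathbf{c},\mathbf{e}) = p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\,p_{\theta}(\mathbf{y})$; it is a standard identity and not specific to AMORE-MIO, and $q_{\phi}$ abbreviates $q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})$.

$$
\begin{aligned}
D_{K\!L}\left[q_{\phi} \,\|\, p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\right]
&= \mathbb{E}_{q_{\phi}}\left[\log q_{\phi} - \log p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\right] \\
&= \mathbb{E}_{q_{\phi}}\left[\log q_{\phi} - \log p_{\theta}(\mathbf{y},\mathbf{z},\mathbf{c},\mathbf{e})\right] + \log p_{\theta}(\mathbf{y}), \\
-\log p_{\theta}(\mathbf{y}) + D_{K\!L}\left[q_{\phi} \,\|\, p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\right]
&= -\mathbb{E}_{q_{\phi}}\left[\log p_{\theta}(\mathbf{y},\mathbf{z},\mathbf{c},\mathbf{e}) - \log q_{\phi}\right].
\end{aligned}
$$

Minimizing the loss in Eq. (7) is therefore equivalent to maximizing the evidence lower bound with an added $\ell_1$ penalty on $\mathbf{w}$.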

6 Experiments

We extensively validate our method on 10 dynamical systems. Specifically, we validate on single-object scenarios using the Mass-spring Hopper dataset and the Susceptible, Infected and Recovered (SIR) disease dataset from Hybrid-SINDy (Mangan et al., 2019). We validate on multi-object scenarios using the ODE-driven particle dataset and the Salsa-dancing dataset from GRASS (Liu et al., 2023). Further, we test the robustness of our method on non-hybrid systems using the Coupled linear, Cubic oscillator, Lorenz’63, Hopf bifurcation, Selkov glycolysis, and Duffing oscillator datasets from Course & Nair (2023). Detailed settings of the datasets are in Appendix C.1.

Figure 4: Qualitative time series segmentation results of AMORE compared to Hybrid-SINDy (Mangan et al., 2019) on the Mass-spring Hopper dataset. For Hybrid-SINDy, we aggregate the discovered equations with the same number of coefficients as one mode. With joint learning of modes and equations, AMORE can categorize the exact number of modes and achieve superior segmentation results with fewer switching errors.

Table 1: Segmentation results on Mass-spring Hopper dataset.

| Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
| --- | --- | --- | --- | --- |
| Hybrid-SINDy | 0.426 | 0.383 | 0.705 | 0.691 |
| AMORE (ours) | 0.928 | 0.967 | 0.991 | 0.993 |

Table 2: Forecasting results of Location/Velocity on the Mass-spring Hopper dataset.

| Method | NMAE ↓ | NRMSE ↓ |
| --- | --- | --- |
| LLMTime | 0.113 / 0.305 | 0.417 / 0.454 |
| TimeGPT | 0.092 / 0.217 | 0.322 / 0.340 |
| SVI | 0.068 / 0.075 | 0.148 / 0.262 |
| Hybrid-SINDy | 0.240 / 0.314 | 0.336 / 0.372 |
| AMORE (ours) | 0.008 / 0.039 | 0.026 / 0.059 |

Table 3: Segmentation results on the SIR disease dataset.

| Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
| --- | --- | --- | --- | --- |
| Hybrid-SINDy | 0.296 | 0.283 | 0.538 | 0.519 |
| AMORE (ours) | 0.475 | 0.483 | 0.731 | 0.735 |

Table 4: Forecasting results of Susceptible/Infected on the SIR disease dataset.

| Method | NMAE ↓ | NRMSE ↓ |
| --- | --- | --- |
| LLMTime | 0.352 / 0.396 | 0.481 / 0.523 |
| TimeGPT | 0.301 / 0.347 | 0.403 / 0.452 |
| SVI | 0.257 / 0.273 | 0.355 / 0.401 |
| Hybrid-SINDy | 0.316 / 0.363 | 0.414 / 0.453 |
| AMORE (ours) | 0.088 / 0.113 | 0.142 / 0.181 |

Implementation Details.

We train on all datasets with a fixed batch size of 40 for 20,000 training steps. We use the Adam optimizer with $10^{-5}$ weight decay and clip the gradient norm to 10. The learning rate is warmed up linearly from $5 \times 10^{-5}$ to $2 \times 10^{-4}$ over the first 2,000 steps, and then decays in a cosine manner with a rate of 0.99. Each experiment runs on one Nvidia GeForce RTX 3090 GPU. $d_{min}$ and $d_{max}$ of the count variables are set to 20 and 50, respectively, for all datasets. The number of edge types $L$ is set to 2, containing one no-interaction type and one with-interaction type. More details are in Appendix C.2.

Evaluation metrics.

For evaluation of discovered equations, following Course & Nair (2023), we use the reconstruction error between the discovered coefficients of equations and ground truth, i.e. ${\rm RER} = \frac{1}{T}\sum_{t=1}^{T}\left(\|\mathbf{w}_t - \xi_t\|_2 \;/\; \|\xi_t\|_2\right)$ where $\mathbf{w}_t$ and $\xi_t$ are the learned and ground-truth coefficients at time $t$. For evaluation of segmentation, following Ansari et al. (2021), we use frame-wise segmentation accuracy, i.e. Accuracy and $F_1$ after matching the labels using the Hungarian algorithm (Kuhn, 1955), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) to measure similarities. For evaluation of forecasting, we use Normalized Mean Absolute Error (NMAE) and Normalized Root Mean Squared Error (NRMSE). We conducted each experiment with 5 random seeds. We report the average score of each experiment in the main paper and put the statistics (error bars) in the appendix due to limited space.
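
Two of these metrics are simple enough to sketch directly. The code below implements RER as defined above and a label-matched frame-wise accuracy. Note one substitution: instead of the Hungarian algorithm, it searches all label permutations by brute force, which finds the same optimal matching but is only practical for a small number of modes. The function names and toy inputs are assumptions for the example.

```python
import math
from itertools import permutations

def rer(learned, truth):
    """Mean over time steps of ||w_t - xi_t||_2 / ||xi_t||_2."""
    total = 0.0
    for w_t, xi_t in zip(learned, truth):
        num = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_t, xi_t)))
        den = math.sqrt(sum(b ** 2 for b in xi_t))
        total += num / den
    return total / len(truth)

def matched_accuracy(pred, true, num_modes):
    """Frame-wise accuracy under the best one-to-one relabeling of predicted modes."""
    best_hits = 0
    for perm in permutations(range(num_modes)):
        hits = sum(perm[p] == t for p, t in zip(pred, true))
        best_hits = max(best_hits, hits)
    return best_hits / len(true)

# Toy inputs: two time steps of coefficients, five frames of mode labels.
example_rer = rer([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])
example_acc = matched_accuracy([1, 1, 0, 0, 0], [0, 0, 1, 1, 0], num_modes=2)
print(example_rer, example_acc)
```

The matching step matters because discovered mode indices are arbitrary: in the toy labels above, the prediction is mostly correct once labels 0 and 1 are swapped.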

Baselines.

Hybrid-SINDy (Mangan et al., 2019) uses a two-stage paradigm and cannot perform forecasting, so we compare with it mainly on discovered equations and segmentation. To still include Hybrid-SINDy in the forecasting comparison, we carry forward the value of the last observed time point as its forecast. For forecasting, we compare with three other recent representative methods, i.e. SVI (Course & Nair, 2023), which is designed for equation discovery in non-hybrid systems and can perform forecasting, LLMTime (Gruver et al., 2023), which utilizes pre-trained large language models (LLMs) for forecasting, and TimeGPT (Garza & Mergenthaler-Canseco, 2023), which is the first foundation model for time series. GRASS (Liu et al., 2023) does not discover equations but models multi-object switching dynamical systems, so it is used for comparison in multi-object systems.
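
The forecasting stand-in used for Hybrid-SINDy amounts to a persistence forecast. A minimal sketch, with the hypothetical helper name `persistence_forecast`:

```python
def persistence_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon

# Toy series of (location, velocity) pairs; values are illustrative only.
observed = [(0.9, 0.2), (1.0, 0.1), (1.1, -0.1)]
forecast = persistence_forecast(observed, horizon=3)
print(forecast)
```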

6.1 Single-object Dynamical Systems

6.1.1 Mass-spring Hopper

In the mass-spring hopper system, a mass connected to a spring hops on the ground with two modes, i.e. flight and compression. Details of the dataset are in Appendix C.1.1. Comparison results of time series segmentation on the dataset are in Table 1. AMORE achieves significant and consistent performance improvements across all metrics. AMORE categorizes exactly two modes from the system and discovers equations for each mode

$$
\begin{cases}
\dot{l} = v \;\;\textrm{and}\;\; \dot{v} = 11.03 - 10.08\,l \\
\dot{l} = v \;\;\textrm{and}\;\; \dot{v} = -1
\end{cases}
$$

which are nearly identical to the ground truth in Eq. (8). In Hybrid-SINDy, equations are discovered in each cluster fragment, thus producing a massive number of equations. To quantitatively compare discovered equations, we compute ${\rm RER}$ for Hybrid-SINDy and AMORE, which are $7.5 \times 10^{-3}$ and $2.4 \times 10^{-4}$, respectively.
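
To show how a sparse coefficient vector maps to a readable equation, the sketch below formats weights over a hypothetical library $[1, l, v]$ into strings. The coefficient values are the ones reported above for the two discovered modes; the library choice, the helper `format_equation`, and the pruning tolerance `tol` are assumptions for illustration.

```python
def format_equation(lhs, coeffs, terms, tol=1e-6):
    """Render sum_j coeffs[j] * terms[j] as text, skipping near-zero coefficients."""
    parts = []
    for c, term in zip(coeffs, terms):
        if abs(c) < tol:
            continue  # sparsity: pruned terms do not appear in the equation
        mag = f"{abs(c):g}"
        if term == "1":
            body = mag
        elif mag == "1":
            body = term
        else:
            body = f"{mag} {term}"
        parts.append(("-" if c < 0 else "+", body))
    if not parts:
        return f"{lhs} = 0"
    first_sign, first_body = parts[0]
    rhs = ("-" if first_sign == "-" else "") + first_body
    for sign, body in parts[1:]:
        rhs += f" {sign} {body}"
    return f"{lhs} = {rhs}"

terms = ["1", "l", "v"]  # hypothetical candidate library
modes = [
    {"dl/dt": [0.0, 0.0, 1.0], "dv/dt": [11.03, -10.08, 0.0]},
    {"dl/dt": [0.0, 0.0, 1.0], "dv/dt": [-1.0, 0.0, 0.0]},
]
equations = [[format_equation(lhs, w, terms) for lhs, w in mode.items()] for mode in modes]
for k, eqs in enumerate(equations):
    print(f"mode {k}: " + " and ".join(eqs))
```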

Qualitative segmentation results of Hybrid-SINDy and AMORE are shown in Fig. 4. Thanks to the amortized joint learning of modes and equations, AMORE can categorize the exact number of modes, achieve superior segmentation results, and discover high-quality equations. In these experiments, the maximal number of possible modes $K$ is set to 3 in our model. After learning, our model finds that 2 modes are enough to categorize and express the dynamics of this hybrid system. Note that the number of discovered equations in Hybrid-SINDy is the same as the number of time points, which is much larger than the fixed number of modes, e.g. $K=3$ in our model. To visualize discovered modes of Hybrid-SINDy, we aggregate the discovered equations with the same type of function terms as one mode, so more than 3 modes appear in the system. Besides, “AMORE w/o count” denotes our model without count variables. Count variables help AMORE learn fewer false-positive mode-switching behaviors. More quantitative ablation studies on count variables are in Appendix C.4.3.

We summarize time series forecasting results on the Mass-spring Hopper dataset in Table 2. Our method significantly outperforms SVI, which is designed for non-hybrid systems, as well as LLMTime and TimeGPT, which utilize pre-trained large models for forecasting, thanks to the proposed joint learning framework designed for hybrid systems.

Table 5: Forecasting results on non-hybrid dynamical systems. Results are shown in ${\rm log}_{10}({\rm NRMSE})$ where lower is better.

| System | LLMTime | SVI | AMORE (ours) |
| --- | --- | --- | --- |
| Coupled linear | -0.39 | -1.13 | -1.18 |
| Cubic oscillator | -0.45 | -1.02 | -1.06 |
| Lorenz’63 | -0.41 | -1.27 | -1.23 |
| Hopf bifurcation | -0.32 | -0.94 | -1.03 |
| Selkov glycolysis | -0.68 | -1.55 | -1.49 |
| Duffing oscillator | -0.53 | -1.12 | -1.17 |

6.1.2 SIR Disease Dataset

The Susceptible, Infected and Recovered (SIR) disease model is an epidemiological model used to understand the spread of infectious diseases. Numbers of susceptible, infected, and recovered individuals are involved in model dynamics where some external events describe the modes, e.g. school in session or not. Detailed settings for this dataset are in Appendix C.1.2.

We summarize the segmentation and forecasting results on the dataset in Tables 3 and 4. The findings are similar to those on the Mass-spring Hopper dataset. AMORE achieves consistently higher segmentation accuracy and lower forecasting errors across all metrics compared to Hybrid-SINDy, SVI, LLMTime, and TimeGPT. AMORE categorizes exactly two modes from the system and discovers equations for each mode

$$
\begin{cases}
\dot{S} = 2.74 - 0.0172\,IS - 0.0024\,S, \;\; \dot{I} = 0.0171\,IS - 0.2\,I \\
\dot{S} = 2.74 - 0.0057\,IS - 0.0021\,S, \;\; \dot{I} = 0.0051\,IS - 0.2\,I
\end{cases}
$$

which are nearly identical to the ground truth in Eq. (9). Quantitatively, the ${\rm RER}$ of the discovered equations is $3.4 \times 10^{-3}$ for Hybrid-SINDy and $1.8 \times 10^{-4}$ for AMORE. AMORE thus discovers high-quality equations and achieves superior segmentation and forecasting results, thanks to the proposed joint learning framework designed for equation discovery in hybrid dynamical systems.

6.1.3 Non-hybrid Dynamical Systems

In some cases, we know in advance whether a dynamical system is hybrid or not. To test whether our method, which is originally designed for hybrid systems, still performs well when the system is known to be non-hybrid, we conduct experiments on six non-hybrid physical systems (Course & Nair, 2023), including Coupled linear, Cubic oscillator, Lorenz’63, Hopf bifurcation, Selkov glycolysis, and Duffing oscillator. Detailed settings of the datasets are in Appendix C.1.3. As we have this prior, we set the maximal possible number of modes in AMORE to 1 for all physical systems. Following Course & Nair (2023), we summarize the forecasting results in Table 5. Although our model is not specialized for non-hybrid systems, AMORE still achieves the best forecasting results on 4 out of 6 non-hybrid physical systems, which supports the robustness of our model on non-hybrid dynamical systems.
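
The "4 out of 6" count can be checked directly from Table 5. The sketch below copies the reported ${\rm log}_{10}({\rm NRMSE})$ values, finds the systems where AMORE has the lowest score, and converts the scores back to plain NRMSE; the variable names are illustrative.

```python
# log10(NRMSE) values copied from Table 5, ordered as (LLMTime, SVI, AMORE).
table5 = {
    "Coupled linear": (-0.39, -1.13, -1.18),
    "Cubic oscillator": (-0.45, -1.02, -1.06),
    "Lorenz'63": (-0.41, -1.27, -1.23),
    "Hopf bifurcation": (-0.32, -0.94, -1.03),
    "Selkov glycolysis": (-0.68, -1.55, -1.49),
    "Duffing oscillator": (-0.53, -1.12, -1.17),
}

# Systems where AMORE (index 2) has the lowest, i.e. best, score.
amore_best = [name for name, scores in table5.items() if scores[2] == min(scores)]

# Undo the log to compare on the original NRMSE scale.
nrmse = {name: tuple(10 ** s for s in scores) for name, scores in table5.items()}

print(len(amore_best), "of", len(table5))
```

On the remaining two systems (Lorenz’63 and Selkov glycolysis), SVI has the lower error, though the gap on the log scale is small.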

Figure 5: Discovered equations on the Salsa-dancing dataset. Locations $(x, y)$ of the hip joints are used as observations.

6.2 Multi-object Hybrid Dynamical Systems

Equation discovery in multi-object hybrid dynamical systems is an unexplored but more general setting. In this section, we verify the effectiveness of the multi-object variant of our method, i.e. AMORE-MIO, on two multi-object datasets (Liu et al., 2023), i.e. the ODE-driven Particle dataset and the Salsa-dancing dataset.

Table 6: Segmentation results on ODE-driven Particle Dataset.

| Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
|---|---|---|---|---|
| Hybrid-SINDy | 0.205 | 0.192 | 0.414 | 0.407 |
| GRASS | 0.445 | 0.437 | 0.732 | 0.726 |
| AMORE (ours) | 0.418 | 0.405 | 0.692 | 0.684 |
| AMORE-MIO (ours) | 0.453 | 0.442 | 0.741 | 0.735 |

Table 7: Forecasting results in terms of NMAE / NRMSE on ODE-driven Particle dataset.

| Method | One-step | Multi-step |
|---|---|---|
| LLMTime | 0.335 / 0.438 | 0.370 / 0.473 |
| TimeGPT | 0.351 / 0.445 | 0.392 / 0.490 |
| SVI | 0.319 / 0.432 | 0.346 / 0.465 |
| Hybrid-SINDy | 0.340 / 0.431 | 0.372 / 0.487 |
| GRASS | 0.151 / 0.224 | 0.193 / 0.270 |
| AMORE (ours) | 0.184 / 0.265 | 0.217 / 0.302 |
| AMORE-MIO (ours) | 0.146 / 0.217 | 0.186 / 0.259 |

6.2.1 ODE-driven Particle Dataset

In ODE-driven particle systems, particle trajectories are governed by ordinary differential equations (ODEs), and particles switch their governing equations, i.e. modes, when they collide with each other. Detailed settings of the ODE-driven Particle dataset are in Appendix C.1.4. We summarize the segmentation results on the dataset in Table 6. AMORE-MIO achieves the best segmentation results on all metrics, ahead of both Hybrid-SINDy and GRASS, while AMORE clearly outperforms Hybrid-SINDy and remains close to GRASS. Besides, AMORE-MIO outperforms AMORE consistently across all metrics. AMORE-MIO categorizes 4 modes from the system and the discovered equations for each mode are

$$\begin{cases}\dot{x}=1.08x-0.92xy;\;\;\dot{y}=-0.93y+1.11xy\\ \dot{x}=-0.17x^{3}+2.00y^{3};\;\;\dot{y}=-2.13x^{3}-0.06y^{3}\\ \dot{x}=0;\;\;\dot{y}=2.00\\ \dot{x}=0;\;\;\dot{y}=-2.00\end{cases}$$

which share the same number of coefficients as the ground truth in Eq. (10), with similar values. The $\rm RER$ of the equations discovered by Hybrid-SINDy, AMORE, and AMORE-MIO is $2.7e^{-2}$, $6.1e^{-3}$, and $4.3e^{-3}$, respectively. As a multi-object extension of AMORE, AMORE-MIO thus consistently outperforms AMORE and Hybrid-SINDy for equation discovery and mode categorization in multi-object hybrid systems, thanks to its specially designed interaction modeling. We further show the forecasting results in Table 7. AMORE-MIO consistently achieves the lowest forecasting errors for both one-step and multi-step predictions. Compared with GRASS, AMORE-MIO obtains better results thanks to the equation priors introduced on the latent motion dynamics.
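To make the discovered modes concrete, the following minimal sketch rolls out two of them with a classical fourth-order Runge-Kutta integrator. The coefficients are copied from the equations above; the integrator, step size, initial states, and all function names (`mode_1`, `mode_3`, `rk4_step`, `rollout`) are illustrative assumptions and not part of our implementation.

```python
def mode_1(state):
    """First discovered mode: x' = 1.08x - 0.92xy, y' = -0.93y + 1.11xy."""
    x, y = state
    return (1.08 * x - 0.92 * x * y, -0.93 * y + 1.11 * x * y)


def mode_3(state):
    """Third discovered mode: x' = 0, y' = 2.00 (constant drift along y)."""
    return (0.0, 2.0)


def rk4_step(f, state, dt):
    """One classical Runge-Kutta step for the autonomous ODE state' = f(state)."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2))
    k3 = f(shifted(state, k2, dt / 2))
    k4 = f(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def rollout(f, state, dt, steps):
    """Integrate one mode for a fixed number of steps and return all states."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(f, state, dt)
        trajectory.append(state)
    return trajectory


# Hypothetical initial states and step size, chosen only for illustration.
traj_mode_1 = rollout(mode_1, (1.0, 1.0), dt=0.01, steps=100)
traj_mode_3 = rollout(mode_3, (0.5, 0.0), dt=0.01, steps=100)
```

A hybrid trajectory would be obtained by switching the function passed to `rk4_step` whenever the mode changes, e.g. at a particle collision.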

Table 8: Segmentation results on the Salsa-dancing dataset.

| Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
|---|---|---|---|---|
| Hybrid-SINDy | 0.102 | 0.097 | 0.325 | 0.309 |
| GRASS | 0.173 | 0.177 | 0.579 | 0.526 |
| AMORE (ours) | 0.167 | 0.173 | 0.565 | 0.518 |
| AMORE-MIO (ours) | 0.179 | 0.182 | 0.583 | 0.531 |

Table 9: Forecasting results in terms of NMAE / NRMSE on the Salsa-dancing dataset.

| Method | One-step | Multi-step |
|---|---|---|
| LLMTime | 0.402 / 0.452 | 0.449 / 0.480 |
| TimeGPT | 0.341 / 0.417 | 0.394 / 0.446 |
| SVI | 0.384 / 0.441 | 0.423 / 0.465 |
| Hybrid-SINDy | 0.362 / 0.405 | 0.416 / 0.433 |
| GRASS | 0.285 / 0.344 | 0.313 / 0.359 |
| AMORE (ours) | 0.291 / 0.361 | 0.334 / 0.373 |
| AMORE-MIO (ours) | 0.272 / 0.335 | 0.301 / 0.352 |

Table 10: Analyses on robustness to different orders of polynomial as candidate basis functions on Mass-spring Hopper dataset.

| Method | Order 2: NMI ↑ | Order 2: RER ↓ | Order 3: NMI ↑ | Order 3: RER ↓ | Order 5: NMI ↑ | Order 5: RER ↓ |
|---|---|---|---|---|---|---|
| Hybrid-SINDy | 0.426 | $7.5e^{-3}$ | 0.384 | $8.1e^{-3}$ | 0.316 | $9.7e^{-3}$ |
| AMORE (ours) | 0.934 | $\mathbf{2.1e^{-4}}$ | 0.936 | $\mathbf{2.3e^{-4}}$ | 0.933 | $\mathbf{2.8e^{-4}}$ |

Table 11: Analyses on robustness to different maximal numbers of predefined modes on Mass-spring Hopper dataset.

| Method | 3 modes: NMI ↑ | 3 modes: RER ↓ | 5 modes: NMI ↑ | 5 modes: RER ↓ | 10 modes: NMI ↑ | 10 modes: RER ↓ |
|---|---|---|---|---|---|---|
| AMORE (ours) | 0.934 | $2.1e^{-4}$ | 0.932 | $2.0e^{-4}$ | 0.937 | $2.1e^{-4}$ |

6.2.2 Salsa-dancing Dataset

The Salsa-dancing dataset contains four modes, i.e. “moving forward”, “moving backward”, “clockwise turning”, and “counter-clockwise turning”. Details of the dataset are in Appendix C.1.5. We summarize the segmentation and forecasting results on the Salsa-dancing dataset in Table 8 and Table 9. On this real-world video dataset, we observe findings similar to those on the ODE-driven Particle dataset. AMORE-MIO achieves higher segmentation accuracy than Hybrid-SINDy and AMORE. Unlike the previous datasets, the Salsa-dancing data are not generated synthetically from equations, yet the results show that structural learning in the form of equations still benefits forecasting compared to purely autoregressive data-driven methods, i.e. LLMTime, SVI, and GRASS. Qualitative results of the discovered equations on the dancing dataset are in Figure 5.

6.3 Ablation Studies

Sensitivity to order of polynomial functions.

To test the sensitivity of our method to different orders of polynomials as candidate basis functions, we conduct experiments on the Mass-spring Hopper dataset by changing the order of polynomial functions to 2, 3, and 5. We present results in Table 10. AMORE consistently outperforms Hybrid-SINDy and is far less sensitive to the polynomial order: its NMI stays between 0.933 and 0.936 across all orders, whereas the NMI of Hybrid-SINDy drops from 0.426 to 0.316 as the order increases.

Sensitivity to number of dynamic modes.

We test the robustness of our method to different maximum numbers of modes, namely 3, 5, and 10, while the true number of modes on the Mass-spring Hopper dataset is 2. The results of segmentation and discovered equations are in Table 11. AMORE is robust to this misspecification, which indicates that we can set a large number of possible modes and AMORE still learns only those that are needed.

Sensitivity to more complex dynamical systems.

We originally followed the setup of Hybrid-SINDy, where all of the dynamics can be approximated by polynomial basis functions. However, our model is not limited to these functions. To show results on more complex dynamical systems, we conduct experiments on a synthetic dataset where two modes are driven by $\dot{x}=x+x^{2}+\cos(x)$ and $\dot{x}=x+e^{x}$, respectively. We set the basis functions as polynomials of order 3 together with $\{\cos(x),\sin(x),e^{x}\}$. The equations discovered by our model are

$$\begin{cases}\dot{x}=0.97x+1.02x^{2}+1.08\cos(x)\\ \dot{x}=0.05+1.12x+0.96e^{x}\end{cases}$$

When we set the basis functions as polynomials of order 3 without $\{\cos(x),\sin(x),e^{x}\}$, the equations discovered by our model are $\dot{x}=0.92+x+0.76x^{2}$ and $\dot{x}=1.26+1.31x+0.83x^{2}+0.34x^{3}$, respectively. These results show that our model can be extended to equation discovery with more complex basis functions. When the candidate basis functions are limited to polynomials, our model discovers polynomial approximations that have more terms, higher complexity, and larger errors.
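The effect of the candidate library can be illustrated with a small, self-contained sketch. It fits the first synthetic mode, $\dot{x}=x+x^{2}+\cos(x)$, from noise-free samples using sequentially thresholded least squares in the style of SINDy (Brunton et al., 2016). This is a stand-in for illustration only and is not the training procedure of AMORE; the sampling interval, threshold, and all names (`LIBRARY`, `solve`, `stlsq`) are assumptions.

```python
import math

# Candidate library: polynomials of order 3 plus {cos(x), sin(x), e^x}.
LIBRARY = [
    ("1", lambda x: 1.0),
    ("x", lambda x: x),
    ("x^2", lambda x: x ** 2),
    ("x^3", lambda x: x ** 3),
    ("cos(x)", math.cos),
    ("sin(x)", math.sin),
    ("exp(x)", math.exp),
]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def stlsq(xs, dxs, library, threshold=0.1, iterations=5):
    """Sequentially thresholded least squares over a candidate library."""
    feats = [[g(x) for _, g in library] for x in xs]
    active = list(range(len(library)))
    coeffs = [0.0] * len(library)
    for _ in range(iterations):
        # Least squares on the active columns via the normal equations.
        gram = [[sum(f[i] * f[j] for f in feats) for j in active] for i in active]
        proj = [sum(f[i] * d for f, d in zip(feats, dxs)) for i in active]
        weights = solve(gram, proj)
        # Keep only coefficients whose magnitude reaches the threshold.
        coeffs = [0.0] * len(library)
        for idx, w in zip(active, weights):
            if abs(w) >= threshold:
                coeffs[idx] = w
        active = [i for i, c in enumerate(coeffs) if c != 0.0]
    return coeffs


# Noise-free samples of x' = x + x^2 + cos(x) on the interval [-2, 2].
xs = [-2.0 + 4.0 * i / 199 for i in range(200)]
dxs = [x + x ** 2 + math.cos(x) for x in xs]

full = stlsq(xs, dxs, LIBRARY)           # polynomials plus {cos, sin, exp}
poly_only = stlsq(xs, dxs, LIBRARY[:4])  # polynomials only
for (name, _), c in zip(LIBRARY, full):
    print(f"{name:>7s}: {c:+.3f}")
```

With the full library, the fit selects only the terms $x$, $x^{2}$, and $\cos(x)$. With `LIBRARY[:4]`, the cosine has to be absorbed by the polynomial terms, which mirrors the approximate equations reported above.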

Model complexity analysis.

The numbers of learnable parameters of all compared methods are summarized in Table 12. As Hybrid-SINDy is not a deep learning method and does not use neural networks, it has no learnable parameters and therefore needs little training data. This comes at the expense that Hybrid-SINDy tends not to generalize beyond simple dynamical settings, as shown in our experiments in the main paper. Given a complex dynamical setting with sufficient data, AMORE and AMORE-MIO perform better than the other deep learning-based approaches while using fewer parameters: fewer than SVI and GRASS, and orders of magnitude fewer than LLMTime.

Table 12: Comparisons on model complexity regarding the numbers of learnable parameters.

| Method | Number of parameters |
|---|---|
| Hybrid-SINDy | 0 |
| AMORE (ours) | 2,240 |
| AMORE-MIO (ours) | 2,512 |
| GRASS | 4,628 |
| SVI | 2,826 |
| LLMTime | 175 billion (GPT-3) |

7 Conclusion and Future work

In this paper, we reformulate the problem of equation discovery in hybrid dynamical systems and propose an end-to-end learning framework, Amortized Equation Discovery (AMORE), which jointly categorizes motion dynamics and discovers equations by modeling categorical modes and mode-switching behaviors. Besides, we extend our method to multi-object scenarios with AMORE-MIO, a more natural setting that previous methods have not explored. Extensive experiments on 10 hybrid and non-hybrid systems demonstrate the effectiveness of our method. Future work includes equation discovery with partially known prior knowledge, equation discovery from videos of hybrid systems, and more complex candidate basis functions.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning and Dynamical Systems. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

This work is financially supported by NWO TIMING VI.Vidi.193.129. We also thank SURF for the support in using the National Supercomputer Snellius.

References

  • Ackerson & Fu (1970) Ackerson, G. and Fu, K. On state estimation in switching environments. IEEE transactions on automatic control, 15(1):10–17, 1970.
  • Ansari et al. (2021) Ansari, A. F., Benidis, K., Kurle, R., Turkmen, A. C., Soh, H., Smola, A. J., Wang, B., and Januschowski, T. Deep explicit duration switching models for time series. Advances in Neural Information Processing Systems, 34:29949–29961, 2021.
  • Bako (2011) Bako, L. Identification of switched linear systems via sparse optimization. Automatica, 47(4):668–677, 2011.
  • Bongard & Lipson (2007) Bongard, J. and Lipson, H. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 104(24):9943–9948, 2007.
  • Brunton et al. (2016) Brunton, S. L., Proctor, J. L., and Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the national academy of sciences, 113(15):3932–3937, 2016.
  • Cortes (2008) Cortes, J. Discontinuous dynamical systems. IEEE Control systems magazine, 28(3):36–73, 2008.
  • Course & Nair (2023) Course, K. and Nair, P. B. State estimation of a physical system with unknown governing equations. Nature, 622(7982):261–267, 2023.
  • Dong et al. (2020) Dong, Z., Seybold, B., Murphy, K., and Bui, H. Collapsed amortized variational inference for switching nonlinear dynamical systems. In International Conference on Machine Learning, pp.  2638–2647. PMLR, 2020.
  • Eddy (1996) Eddy, S. R. Hidden markov models. Current opinion in structural biology, 6(3):361–365, 1996.
  • Ferrari-Trecate et al. (2003) Ferrari-Trecate, G., Muselli, M., Liberati, D., and Morari, M. A clustering technique for the identification of piecewise affine systems. Automatica, 39(2):205–217, 2003.
  • Garza & Mergenthaler-Canseco (2023) Garza, A. and Mergenthaler-Canseco, M. Timegpt-1. arXiv preprint arXiv:2310.03589, 2023.
  • Ghahramani & Hinton (2000) Ghahramani, Z. and Hinton, G. E. Variational learning for switching state-space models. Neural computation, 12(4):831–864, 2000.
  • Gruver et al. (2023) Gruver, N., Finzi, M., Qiu, S., and Wilson, A. G. Large language models are zero-shot time series forecasters. arXiv preprint arXiv:2310.07820, 2023.
  • Holmes et al. (2006) Holmes, P., Full, R. J., Koditschek, D., and Guckenheimer, J. The dynamics of legged locomotion: Models, analyses, and challenges. SIAM review, 48(2):207–304, 2006.
  • Juloski et al. (2005) Juloski, A. L., Weiland, S., and Heemels, W. M. H. A bayesian approach to identification of hybrid systems. IEEE Transactions on Automatic Control, 50(10):1520–1533, 2005.
  • Kaiser et al. (2018) Kaiser, E., Kutz, J. N., and Brunton, S. L. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proceedings of the Royal Society A, 474(2219):20180335, 2018.
  • Karniadakis et al. (2021) Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
  • Keeling et al. (2001) Keeling, M. J., Rohani, P., and Grenfell, B. T. Seasonally forced disease dynamics explored as switching between attractors. Physica D: Nonlinear Phenomena, 148(3-4):317–335, 2001.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kipf et al. (2018) Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In International conference on machine learning, pp.  2688–2697. PMLR, 2018.
  • Koza et al. (1994) Koza, J. R. et al. Genetic programming II, volume 17. MIT press Cambridge, 1994.
  • Kuhn (1955) Kuhn, H. W. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • Langley (1981) Langley, P. Data-driven discovery of physical laws. Cognitive Science, 5(1):31–54, 1981.
  • Lemos et al. (2023) Lemos, P., Jeffrey, N., Cranmer, M., Ho, S., and Battaglia, P. Rediscovering orbital mechanics with machine learning. Machine Learning: Science and Technology, 4(4):045002, 2023.
  • Liu et al. (2023) Liu, Y., Magliacane, S., Kofinas, M., and Gavves, E. Graph switching dynamical systems. arXiv preprint arXiv:2306.00370, 2023.
  • Loiseau & Brunton (2018) Loiseau, J.-C. and Brunton, S. L. Constrained sparse galerkin regression. Journal of Fluid Mechanics, 838:42–67, 2018.
  • Lutter et al. (2019) Lutter, M., Ritter, C., and Peters, J. Deep lagrangian networks: Using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490, 2019.
  • Mangan et al. (2019) Mangan, N. M., Askham, T., Brunton, S. L., Kutz, J. N., and Proctor, J. L. Model selection for hybrid dynamical systems via sparse regression. Proceedings of the Royal Society A, 475(2223):20180534, 2019.
  • McMahon et al. (2020) McMahon, A., Robb, N. C., et al. Reinfection with sars-cov-2: Discrete sir (susceptible, infected, recovered) modeling using empirical infection data. JMIR public health and surveillance, 6(4):e21168, 2020.
  • Novelli et al. (2022) Novelli, N., Lenci, S., and Belardinelli, P. Boosting the model discovery of hybrid dynamical systems in an informed sparse regression approach. Journal of Computational and Nonlinear Dynamics, 17(5):051007, 2022.
  • Oh et al. (2005) Oh, S. M., Ranganathan, A., Rehg, J. M., and Dellaert, F. A variational inference method for switching linear dynamic systems. TR GIT-GVU-05-16, 2005.
  • Ohlsson & Ljung (2013) Ohlsson, H. and Ljung, L. Identification of switched linear regression models using sum-of-norms regularization. Automatica, 49(4):1045–1050, 2013.
  • Ozay et al. (2008) Ozay, N., Sznaier, M., Lagoa, C., and Camps, O. A sparsification approach to set membership identification of a class of affine hybrid systems. In 2008 47th IEEE Conference on Decision and Control, pp.  123–130. IEEE, 2008.
  • Paoletti et al. (2007) Paoletti, S., Juloski, A. L., Ferrari-Trecate, G., and Vidal, R. Identification of hybrid systems a tutorial. European journal of control, 13(2-3):242–260, 2007.
  • Roll et al. (2004) Roll, J., Bemporad, A., and Ljung, L. Identification of piecewise affine systems via mixed-integer programming. Automatica, 40(1):37–50, 2004.
  • Rudy et al. (2017) Rudy, S. H., Brunton, S. L., Proctor, J. L., and Kutz, J. N. Data-driven discovery of partial differential equations. Science advances, 3(4):e1602614, 2017.
  • Sanfelice et al. (2016) Sanfelice, R. G. et al. Analysis and design of cyber-physical systems. a hybrid control systems approach. In Cyber-physical systems: From theory to practice, pp.  1–29. CRC Press Boca Raton, FL, USA, 2016.
  • Schaeffer & McCalla (2017) Schaeffer, H. and McCalla, S. G. Sparse model selection via integral terms. Physical Review E, 96(2):023302, 2017.
  • Schmidt & Lipson (2009) Schmidt, M. and Lipson, H. Distilling free-form natural laws from experimental data. science, 324(5923):81–85, 2009.
  • Toda (2020) Toda, A. A. Susceptible-infected-recovered (sir) dynamics of covid-19 and economic impact. arXiv preprint arXiv:2003.11221, 2020.
  • Van Der Schaft & Schumacher (2007) Van Der Schaft, A. J. and Schumacher, H. An introduction to hybrid dynamical systems, volume 251. springer, 2007.
  • Vidal et al. (2003) Vidal, R., Soatto, S., Ma, Y., and Sastry, S. An algebraic geometric approach to the identification of a class of linear hybrid systems. In 42nd IEEE International Conference on Decision and Control (IEEE Cat. No. 03CH37475), volume 1, pp.  167–172. IEEE, 2003.

Appendix

Appendix A More Details of AMORE

A.1 Neural Network Implementation

We use neural networks to model the joint generative probabilities of hybrid systems in our model, i.e. Eq. (4). For the initial states, we model the initial prior distributions as:

$$\begin{aligned}p(z_{1}) &= {\rm Cat}(z_{1};\bm{\pi}),\\ p(\mathbf{y}_{1}|z_{1}) &= \mathcal{N}(\mathbf{y}_{1};\bm{\mu}_{z_{1}};\bm{\Sigma}_{z_{1}}),\end{aligned}$$

where ${\rm Cat}$ and $\mathcal{N}$ denote categorical and multivariate Gaussian distributions, respectively. We set the prior distribution $p(z_{1})$ as uniform to encourage diversity.
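As a small worked example of these initial distributions, the sketch below evaluates $\log p(z_{1})+\log p(\mathbf{y}_{1}|z_{1})$ for one mode under the uniform mode prior. For brevity it assumes a diagonal covariance $\bm{\Sigma}_{z_{1}}$, which is a simplification made here and not a statement about our implementation; the function name and the example values are hypothetical.

```python
import math


def log_initial_joint(y1, mode, means, variances):
    """log p(z_1 = mode) + log p(y_1 | z_1 = mode) with a uniform mode prior.

    Assumes a diagonal covariance: variances[mode][i] is the i-th diagonal
    entry of Sigma_{z_1}, and means[mode] is the mean vector mu_{z_1}.
    """
    num_modes = len(means)
    log_prior = -math.log(num_modes)  # uniform Cat(z_1; pi)
    log_lik = 0.0
    for y, m, v in zip(y1, means[mode], variances[mode]):
        log_lik += -0.5 * (math.log(2 * math.pi * v) + (y - m) ** 2 / v)
    return log_prior + log_lik


# Hypothetical two-mode example with two-dimensional observations.
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
value = log_initial_joint([0.0, 0.0], 0, means, variances)
```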

Count variables and count transition probability.

To implement the count variables, we set a categorical distribution over $\{d_{\rm min},\cdots,d_{\rm max}\}$ for each mode, where $d_{\rm min}$ and $d_{\rm max}$ are the minimal and maximal numbers of time steps before making a mode switch. The count transition probability $p(c_{t}|c_{t-1},z_{t-1})$ is modeled as a learnable matrix $\mathbf{P}\in\mathbb{R}^{K\times(d_{\rm max}-d_{\rm min}+1)}$, which is fixed across all time steps. Each term $\rho_{k}(c)$ in $\mathbf{P}$ represents the probability of the $k$-th mode switching to another mode when its current count is $c$. The probability of a count increment at count $c$ for mode $k$ can be calculated as

$$\mu_{k}(c)=1-\frac{\rho_{k}(c)}{\sum_{d=c}^{d_{\rm max}}\rho_{k}(d)}.$$

The count transition probability is thus defined as

$$p(c_{t}|c_{t-1},z_{t-1}=k)=\begin{cases}\mu_{k}(c_{t-1}) & {\rm if}\;c_{t}=c_{t-1}+1\\ 1-\mu_{k}(c_{t-1}) & {\rm if}\;c_{t}=1\end{cases}.$$
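The two formulas above can be sketched in a few lines. The example assumes $d_{\rm min}=1$, so that index `i` of a row of $\mathbf{P}$ corresponds to count `i + 1`, and it assumes that the row is already normalized and strictly positive; the function names and the uniform example row are hypothetical.

```python
def increment_probabilities(rho):
    """mu[i] = 1 - rho[i] / sum(rho[i:]) for one mode, i.e. one row of P."""
    tails = [0.0] * len(rho)
    running = 0.0
    # Accumulate suffix sums from the right so the last tail equals rho[-1].
    for i in reversed(range(len(rho))):
        running += rho[i]
        tails[i] = running
    return [1.0 - p / t for p, t in zip(rho, tails)]


def count_transition(mu, prev_count):
    """p(c_t | c_{t-1}, z_{t-1} = k) as a dict {next_count: probability}.

    Assumes d_min = 1, so mu[prev_count - 1] belongs to count prev_count.
    """
    stay = mu[prev_count - 1]
    return {prev_count + 1: stay, 1: 1.0 - stay}


rho_k = [0.25, 0.25, 0.25, 0.25]  # hypothetical row of P with d_max = 4
mu_k = increment_probabilities(rho_k)
step = count_transition(mu_k, prev_count=2)
```

For the uniform row, the increment probability decreases with the count and reaches zero at $d_{\rm max}$, so a mode switch is forced once the maximal duration is reached.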
Mode variables and mode transition probability.

Since the mode variables $z_{t}$ take one out of $K$ possible values, we model them as categorical variables, parameterized by mode transition matrix $\mathbf{T}_{t}$ at timestep $t$. The mode transition probability is modeled as

$$p(z_{t}|z_{t-1},c_{t},\mathbf{y}_{t-1})=\begin{cases}\delta_{z_{t}=z_{t-1}} & {\rm if}\;c_{t}>1\\ {\rm Cat}(z_{t};\mathbf{T}_{t}) & {\rm if}\;c_{t}=1\end{cases},$$

where we resample the modes or preserve them depending on whether count variables are reset to 1 or not. We model the parameters $\mathbf{T}_{t}$ of the categorical distributions with a neural network, i.e. a simple MLP, $\mathbf{T}_{t}=f_{z}(\mathbf{y}_{t-1})$ that takes as input the observations. The network returns a $K\times K$ transition matrix per time step $t$, where rows correspond to past modes $z_{t-1}$ and columns to current modes $z_{t}$. Each term $\tau_{t}^{j,k}$ in $\mathbf{T}_{t}$ represents the probability of mode $j$ switching to mode $k$ at timestep $t$. To satisfy the positivity constraints $\tau_{t}^{j,k}>0,\;\forall j,k=1,\cdots,K$ and the $\ell_{1}$ constraints $\sum_{k}\tau_{t}^{j,k}=1,\;\forall j=1,\cdots,K$, we apply a tempered softmax after $f_{z}$, i.e. $\mathcal{S}_{\tau_{z}}\circ f_{z}(\cdot)$.
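A minimal sketch of this construction is given below. It takes the tempered softmax to be a row-wise softmax of the network outputs divided by a temperature $\tau_{z}$, which is our reading of the notation and should be treated as an assumption; the raw outputs of $f_{z}$ are replaced by a hypothetical flat list of $K^{2}$ numbers, and all function names are illustrative.

```python
import math


def tempered_softmax(logits, temperature):
    """Softmax of logits / temperature, stabilised by subtracting the maximum."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def transition_matrix(flat_logits, num_modes, temperature=1.0):
    """Turn K*K raw network outputs into a row-stochastic K x K matrix T_t."""
    rows = [
        flat_logits[j * num_modes:(j + 1) * num_modes] for j in range(num_modes)
    ]
    return [tempered_softmax(row, temperature) for row in rows]


def mode_transition(prev_mode, count, matrix):
    """p(z_t | z_{t-1}, c_t, y_{t-1}) as a list of probabilities over modes."""
    if count > 1:
        # The count was incremented, so the previous mode is preserved.
        return [1.0 if k == prev_mode else 0.0 for k in range(len(matrix))]
    # The count was reset to 1, so the mode is resampled from row z_{t-1}.
    return matrix[prev_mode]


# Hypothetical raw outputs of f_z for K = 2 modes.
T_t = transition_matrix([0.0, 1.0, 2.0, 0.0], num_modes=2, temperature=0.5)
```

Each row of `T_t` is strictly positive and sums to one, so both constraints hold by construction; lower temperatures make the rows more peaked.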

A.2 Inference Model of AMORE

We perform conditionally exact inference for the two discrete latent variables, i.e. modes $\mathbf{z}_{1:T}$ and counts $\mathbf{c}_{1:T}$, similar to the forward-backward procedure for HMMs (Eddy, 1996). Conditioned on observations $\mathbf{y}_{1:T}$, the posterior joint distribution $p_{\theta}(\mathbf{z}_{1:T},\mathbf{c}_{1:T}|\mathbf{y}_{1:T})$ is calculated by modifying the forward-backward recursions to handle the joint hierarchical latent variables. Specifically, the forward $\alpha_{t}$ and backward $\beta_{t}$ parts are defined as

$$\begin{aligned}\alpha_{t}(z_{t},c_{t}) &= p(z_{t},c_{t},\mathbf{y}_{1:t}),\\ \beta_{t}(z_{t},c_{t}) &= p(\mathbf{y}_{t+1:T}|\mathbf{y}_{t},z_{t},c_{t}).\end{aligned}$$

The posterior joint probability of the mode and count variables $\mathbf{z}$, $\mathbf{c}$ conditioned on observations $\mathbf{y}$ is then calculated as

$$\begin{aligned}p(z_{t},c_{t}|\mathbf{y}_{1:T}) &\propto p(z_{t},c_{t},\mathbf{y}_{1:T})\\ &= \underbrace{p(z_{t},c_{t},\mathbf{y}_{1:t})}_{\text{Forward}}\,\underbrace{p(\mathbf{y}_{t+1:T}|\mathbf{y}_{t},z_{t},c_{t})}_{\text{Backward}}\\ &= \alpha_{t}(z_{t},c_{t})\cdot\beta_{t}(z_{t},c_{t}).\end{aligned}$$

The derivation of the forward part $\alpha_t(z_t,c_t)$ is

\begin{align*}
\alpha_1(z_1,c_1) &= p(z_1,c_1,\mathbf{y}_1)\\
&= \delta_{c_1=1}\,p(z_1)\,p(\mathbf{y}_1|z_1),\\
\underline{\alpha_t(z_t,c_t)} &= p(z_t,c_t,\mathbf{y}_{1:t})\\
&= \sum_{z_{t-1},c_{t-1}} p(z_t,c_t,\mathbf{y}_{1:t},z_{t-1},c_{t-1})\\
&= \sum_{z_{t-1},c_{t-1}} p(z_{t-1},c_{t-1},\mathbf{y}_{1:t-1})\,p(c_t|c_{t-1},z_{t-1})\,p(z_t|z_{t-1},c_t,\mathbf{y}_{t-1})\,p(\mathbf{y}_t|\mathbf{y}_{t-1},z_t)\\
&= p(\mathbf{y}_t|\mathbf{y}_{t-1},z_t)\sum_{z_{t-1},c_{t-1}} \underline{\alpha_{t-1}(z_{t-1},c_{t-1})}\,p(c_t|c_{t-1},z_{t-1})\,p(z_t|z_{t-1},c_t,\mathbf{y}_{t-1})\\
&= p(\mathbf{y}_t|\mathbf{y}_{t-1},z_t)\Bigg[\delta_{c_t=1}\sum_{z_{t-1}} p(z_t|z_{t-1},c_t,\mathbf{y}_{t-1})\sum_{c_{t-1}}\big(1-\mu_{z_{t-1}}(c_{t-1})\big)\,\alpha_{t-1}(z_{t-1},c_{t-1})\\
&\qquad\quad + \delta_{\substack{z_{t-1}=z_t\\ c_t>1\\ c_{t-1}=c_t-1}}\,\mu_{z_{t-1}}(c_{t-1})\,\alpha_{t-1}(z_{t-1},c_{t-1})\Bigg],
\end{align*}

where $\alpha_t(z_t,c_t)$ is expressed recursively in terms of $\alpha_{t-1}(z_{t-1},c_{t-1})$ and the state transitions.
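As a rough illustration of the forward recursion, the following Python sketch computes $\alpha_t(z_t,c_t)$ on a toy problem. It is not the authors' implementation: the array layout (`pi`, `B`, `A`, `mu`), the precomputed emission terms, and all numbers are assumptions made for this example. Duration limits are encoded through `mu`, with a continue probability of 1 below the minimum duration and 0 at the maximum count.

```python
def forward_pass(pi, B, A, mu):
    """Forward messages alpha[t][k][c-1] = p(z_t=k, c_t=c, y_{1:t}).

    pi[k]      : p(z_1 = k)
    B[t][k]    : p(y_t | y_{t-1}, z_t = k); B[0][k] = p(y_1 | z_1 = k)
    A[t][j][k] : p(z_t = k | z_{t-1} = j, c_t = 1, y_{t-1}); A[0] is unused
    mu[k][c-1] : probability that mode k continues after c steps
                 (assumed to be 0 at the largest count)
    """
    T, K, D = len(B), len(pi), len(mu[0])
    alpha = [[[0.0] * D for _ in range(K)] for _ in range(T)]
    for k in range(K):
        alpha[0][k][0] = pi[k] * B[0][k]  # delta_{c_1 = 1}
    for t in range(1, T):
        for k in range(K):
            # c_t = 1: some mode j ended (prob. 1 - mu) and switched to k.
            switch = sum(
                A[t][j][k]
                * sum((1.0 - mu[j][c]) * alpha[t - 1][j][c] for c in range(D))
                for j in range(K)
            )
            alpha[t][k][0] = B[t][k] * switch
            # c_t > 1: mode k continued and its count grew by one.
            for c in range(1, D):
                alpha[t][k][c] = B[t][k] * mu[k][c - 1] * alpha[t - 1][k][c - 1]
    return alpha


# Toy inputs (hypothetical): 2 modes, durations between 2 and 3 steps.
pi = [0.6, 0.4]
mu = [[1.0, 0.7, 0.0], [1.0, 0.5, 0.0]]
A_switch = [[0.0, 1.0], [1.0, 0.0]]  # a switch always moves to the other mode
B = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.7], [0.2, 0.6]]
A = [A_switch] * len(B)

alpha = forward_pass(pi, B, A, mu)
likelihood = sum(sum(row) for row in alpha[-1])  # p(y_{1:T})
```

Summing the final messages over all modes and counts gives the likelihood $p(\mathbf{y}_{1:T})$, which is a convenient sanity check against brute-force enumeration on small problems.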

The derivation of the backward part $\beta_t(z_t,c_t)$ is

\begin{align*}
\beta_T(z_T,c_T) &= 1,\\
\underline{\beta_t(z_t,c_t)} &= p(\mathbf{y}_{t+1:T}|\mathbf{y}_t,z_t,c_t)\\
&= \sum_{z_{t+1},c_{t+1}} p(\mathbf{y}_{t+1:T},z_{t+1},c_{t+1}|\mathbf{y}_t,z_t,c_t)\\
&= \sum_{z_{t+1},c_{t+1}} p(c_{t+1}|c_t,z_t)\,p(z_{t+1}|z_t,c_{t+1},\mathbf{y}_t)\,p(\mathbf{y}_{t+1}|\mathbf{y}_t,z_{t+1})\,p(\mathbf{y}_{t+2:T}|\mathbf{y}_{t+1},z_{t+1},c_{t+1})\\
&= \sum_{z_{t+1},c_{t+1}} p(c_{t+1}|c_t,z_t)\,p(z_{t+1}|z_t,c_{t+1},\mathbf{y}_t)\,p(\mathbf{y}_{t+1}|\mathbf{y}_t,z_{t+1})\,\underline{\beta_{t+1}(z_{t+1},c_{t+1})}\\
&= \delta_{\substack{c_{t+1}=1\\ c_t\geq d_{\rm min}}}\big(1-\mu_{z_t}(c_t)\big)\sum_{z_{t+1}} p(z_{t+1}|z_t,c_{t+1},\mathbf{y}_t)\,p(\mathbf{y}_{t+1}|\mathbf{y}_t,z_{t+1})\,\beta_{t+1}(z_{t+1},c_{t+1})\\
&\quad + \delta_{\substack{c_{t+1}=c_t+1\\ z_{t+1}=z_t}}\,\mu_{z_t}(c_t)\,p(\mathbf{y}_{t+1}|\mathbf{y}_t,z_{t+1})\,\beta_{t+1}(z_{t+1},c_{t+1}),
\end{align*}

where $\beta_t(z_t,c_t)$ is computed recursively from $\beta_{t+1}(z_{t+1},c_{t+1})$ and the state transitions.
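The backward recursion can be sketched in the same spirit. This is an illustrative Python example under assumed conventions (the array layout of `B`, `A`, `mu` and the toy numbers are hypothetical), not the authors' code. The minimum-duration constraint $c_t \geq d_{\rm min}$ is encoded by setting the continue probability to 1 for shorter counts, so the switch branch carries zero weight there.

```python
def backward_pass(B, A, mu):
    """Backward messages beta[t][k][c-1] = p(y_{t+1:T} | y_t, z_t=k, c_t=c).

    B[t][k]    : p(y_t | y_{t-1}, z_t = k)
    A[t][j][k] : p(z_t = k | z_{t-1} = j, c_t = 1, y_{t-1}); A[0] is unused
    mu[k][c-1] : probability that mode k continues after c steps
                 (assumed to be 0 at the largest count)
    """
    T, K, D = len(B), len(A[0]), len(mu[0])
    beta = [[[1.0] * D for _ in range(K)] for _ in range(T)]  # beta_T = 1
    for t in range(T - 2, -1, -1):
        for k in range(K):
            # Switch branch (c_{t+1} = 1), shared by every count of mode k.
            switch = sum(
                A[t + 1][k][n] * B[t + 1][n] * beta[t + 1][n][0] for n in range(K)
            )
            for c in range(D):
                stay = 0.0
                if c + 1 < D:  # continue branch: same mode, count + 1
                    stay = mu[k][c] * B[t + 1][k] * beta[t + 1][k][c + 1]
                beta[t][k][c] = (1.0 - mu[k][c]) * switch + stay
    return beta


# Toy inputs (hypothetical): 2 modes, durations between 2 and 3 steps.
pi = [0.6, 0.4]
mu = [[1.0, 0.7, 0.0], [1.0, 0.5, 0.0]]
A_switch = [[0.0, 1.0], [1.0, 0.0]]
B = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.7], [0.2, 0.6]]
A = [A_switch] * len(B)

beta = backward_pass(B, A, mu)
# alpha_1(k, 1) = pi(k) * p(y_1 | z_1 = k), so the likelihood follows from beta_1.
likelihood = sum(pi[k] * B[0][k] * beta[0][k][0] for k in range(len(pi)))
```

Multiplying the first forward message by `beta[0]` and summing over modes recovers the same likelihood that the forward pass yields at the final step.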

A.3 Derivation of Optimization Objective

The optimization objective of our model is to maximize the observation likelihood $\log p(\mathbf{y})$ with sparse regularization on the coefficients of the candidate basis functions, where the observation likelihood $\log p(\mathbf{y})$ can be calculated as

\begin{align*}
\log p(\mathbf{y}) &= \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\log p(\mathbf{y})\right]\\
&= \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\log p(\mathbf{y},\mathbf{z},\mathbf{c})\right]-\mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\log p(\mathbf{z},\mathbf{c}|\mathbf{y})\right].
\end{align*}

Taking the gradient with respect to the model parameters $\theta$ inside the expectations, the second term vanishes, since

\begin{align*}
\mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\nabla_{\theta}\log p(\mathbf{z},\mathbf{c}|\mathbf{y})\right]
= \int p(\mathbf{z},\mathbf{c}|\mathbf{y})\,\frac{\nabla_{\theta}\,p(\mathbf{z},\mathbf{c}|\mathbf{y})}{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\,d(\mathbf{z},\mathbf{c})
= \nabla_{\theta}\int p(\mathbf{z},\mathbf{c}|\mathbf{y})\,d(\mathbf{z},\mathbf{c})
= \nabla_{\theta}\,1 = 0.
\end{align*}

Hence $\nabla_{\theta}\log p(\mathbf{y}) = \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\nabla_{\theta}\log p(\mathbf{y},\mathbf{z},\mathbf{c})\right]$, and the likelihood is maximized by ascending the expected complete-data log-likelihood $\mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\log p(\mathbf{y},\mathbf{z},\mathbf{c})\right]$ with the posterior weights treated as constants.
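Dropping the posterior term rests on the score-function identity: the expected gradient of a log-probability, taken under that same distribution, is zero. The sketch below checks this numerically for a small softmax-parameterized categorical distribution, used here as a hypothetical stand-in for $p(\mathbf{z},\mathbf{c}|\mathbf{y})$. The parameter values and the finite-difference step are arbitrary choices for the example. It also evaluates the expected log-probability itself, which is the negative entropy and is generally not zero, so the identity concerns gradients only.

```python
import math


def softmax(theta):
    """Categorical probabilities from unnormalized logits."""
    m = max(theta)
    e = [math.exp(x - m) for x in theta]
    s = sum(e)
    return [x / s for x in e]


def expected_score(theta, eps=1e-6):
    """E_p[d/d theta_i log p(z)] for each i, via central finite differences."""
    p = softmax(theta)
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        dn = list(theta)
        dn[i] -= eps
        lp_up = [math.log(x) for x in softmax(up)]
        lp_dn = [math.log(x) for x in softmax(dn)]
        grads.append(
            sum(p[z] * (lp_up[z] - lp_dn[z]) / (2 * eps) for z in range(len(theta)))
        )
    return grads


theta = [0.3, -1.2, 2.0]  # hypothetical logits
score = expected_score(theta)  # each entry is numerically close to zero
neg_entropy = sum(p * math.log(p) for p in softmax(theta))  # strictly negative here
```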

Following the Markov property, we expand $\log p(\mathbf{y},\mathbf{z},\mathbf{c})$ over time and calculate it as

\begin{align*}
\log p(\mathbf{y},\mathbf{z},\mathbf{c}) &= \log p(\mathbf{y}_{1:T},\mathbf{z}_{1:T},\mathbf{c}_{1:T})\\
&= \log\left[p(\mathbf{y}_1|z_1)\,p(z_1)\right]+\sum_{t=2}^{T}\log\left[p(\mathbf{y}_t|\mathbf{y}_{t-1},z_t)\,p(z_t|z_{t-1},c_t,\mathbf{y}_{t-1})\,p(c_t|c_{t-1},z_{t-1})\right].
\end{align*}

Finally, taking the expectation under the posterior, the objective $\mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\log p(\mathbf{y},\mathbf{z},\mathbf{c})\right]$ can be calculated as

\begin{align*}
\mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y})}\left[\log p(\mathbf{y},\mathbf{z},\mathbf{c})\right]
&= \mathbb{E}_{p(\mathbf{z}_{1:T},\mathbf{c}_{1:T}|\mathbf{y}_{1:T})}\left[\log p(\mathbf{y}_{1:T},\mathbf{z}_{1:T},\mathbf{c}_{1:T})\right]\\
&= \sum_{k}p(z_1=k|\mathbf{y}_{1:T})\,\log\left[p(\mathbf{y}_1|z_1=k)\,p(z_1=k)\right]\\
&\quad+\sum_{t=2}^{T}\sum_{k,j,u,v}\xi(k,j,u,v)\,\log\left[p(\mathbf{y}_t|\mathbf{y}_{t-1},z_t=k)\,p(z_t=k|z_{t-1}=j,c_t=v,\mathbf{y}_{t-1})\,p(c_t=v|c_{t-1}=u,z_{t-1}=j)\right]\\
&= \sum_{k}\gamma(k)\,\log\left[B_1(k)\cdot\pi(k)\right]\\
&\quad+\sum_{t=2}^{T}\sum_{k,j,u,v}\xi(k,j,u,v)\,\log\left[B_t(k)\cdot A_t(k,j,v)\cdot C_t(j,u,v)\right],
\end{align*}

where $\pi(k)$, $\gamma(k)$, $\xi(k,j,u,v)$, $B_t(k)$, $A_t(k,j,v)$, and $C_t(j,u,v)$ are defined as

\begin{align*}
\pi(k) &= p(z_1=k),\\
\gamma(k) &= p(z_1=k|\mathbf{y}_{1:T}),\\
\xi(k,j,u,v) &= p(z_t=k,z_{t-1}=j,c_t=v,c_{t-1}=u|\mathbf{y}_{1:T}),\\
B_t(k) &= p(\mathbf{y}_t|\mathbf{y}_{t-1},z_t=k),\\
A_t(k,j,v) &= p(z_t=k|z_{t-1}=j,c_t=v,\mathbf{y}_{t-1}),\\
C_t(j,u,v) &= p(c_t=v|c_{t-1}=u,z_{t-1}=j).
\end{align*}

$\pi(k)$ is the initial discrete mode probability. $B_t(k)$ is the continuous state transition probability conditioned on the discrete mode $k$, with $B_1(k)=p(\mathbf{y}_1|z_1=k)$ at the first time step. $A_t(k,j,v)$ is the discrete mode transition probability. $C_t(j,u,v)$ is the mode duration count transition probability. The posteriors $\gamma(k)$ and $\xi(k,j,u,v)$, the latter evaluated at each time step $t$, can be calculated similarly to the forward-backward algorithm in HMMs (Eddy, 1996), as detailed in Appendix A.2.
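To make the assembly of this objective concrete, here is a minimal Python sketch that evaluates it from $\gamma$, $\xi$, $\pi$, $B$, $A$, and $C$. The data layout (a sparse dictionary for $\xi$, callables for $A$ and $C$), the helper name `expected_complete_loglik`, and the toy numbers are all assumptions for illustration. For brevity the posteriors are obtained by enumerating the feasible paths of a two-step example rather than by the forward-backward recursions.

```python
import math


def expected_complete_loglik(gamma, xi, pi, B, A, C):
    """sum_k gamma(k) log[B_1(k) pi(k)]
    + sum_{t>=2} sum_{k,j,u,v} xi_t(k,j,u,v) log[B_t(k) A_t(k,j,v) C_t(j,u,v)].

    gamma[k]            : p(z_1 = k | y_{1:T})
    xi[t][(k, j, u, v)] : pairwise posterior at step t (xi[0] is unused)
    B[t][k]             : emission term; A(t, k, j, v) and C(t, j, u, v) are callables
    """
    q = sum(g * math.log(B[0][k] * pi[k]) for k, g in enumerate(gamma) if g > 0.0)
    for t in range(1, len(B)):
        for (k, j, u, v), w in xi[t].items():
            if w > 0.0:  # zero-weight terms are skipped (0 * log 0 := 0)
                q += w * math.log(B[t][k] * A(t, k, j, v) * C(t, j, u, v))
    return q


# Toy two-step model (hypothetical numbers).
pi = [0.6, 0.4]
B = [[0.9, 0.2], [0.8, 0.3]]
mu = [[0.7, 0.0], [0.5, 0.0]]      # continue probabilities per mode and count
switch = [[0.0, 1.0], [1.0, 0.0]]  # mode transitions when the count resets


def A(t, k, j, v):
    # c_t > 1 forces the mode to persist; c_t = 1 uses the switch matrix.
    return switch[j][k] if v == 1 else float(k == j)


def C(t, j, u, v):
    if v == u + 1:
        return mu[j][u - 1]
    return 1.0 - mu[j][u - 1] if v == 1 else 0.0


# Enumerate the feasible (z_2=k, z_1=j, c_1=1, c_2=v) paths and normalize.
paths = {}
for j in range(2):
    for k in range(2):
        for v in (1, 2):
            p = pi[j] * B[0][j] * B[1][k] * A(1, k, j, v) * C(1, j, 1, v)
            if p > 0.0:
                paths[(k, j, 1, v)] = p
evidence = sum(paths.values())  # p(y_{1:2})
xi = [None, {key: p / evidence for key, p in paths.items()}]
gamma = [sum(w for (k, j, u, v), w in xi[1].items() if j == m) for m in range(2)]

Q = expected_complete_loglik(gamma, xi, pi, B, A, C)
```

In practice the weights $\gamma$ and $\xi$ would come from the forward and backward messages of Appendix A.2 and be held fixed while the gradient of this quantity is taken with respect to the model parameters.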

Appendix B More Details of AMORE-MIO

B.1 Expansion of Generative Model over Objects

The joint generative probability of AMORE-MIO for multi-object hybrid systems is expanded over objects as

$$
\begin{aligned}
p(\mathbf{y},\mathbf{z},\mathbf{c},\mathbf{e}) ={}& \underbrace{\prod_{n=1}^{N} p(\mathbf{y}_1^n \mid z_1^n) \cdot \prod_{n=1}^{N} p(z_1^n) \cdot \prod_{n=1}^{N}\prod_{m=1}^{N} p(e_1^{m\rightarrow n})}_{\text{Initial states}} \cdot \prod_{t=2}^{T}\Bigg[\prod_{n=1}^{N} p(\mathbf{y}_t^n \mid \mathbf{y}_{t-1}^n, z_t^n) \cdot \prod_{n=1}^{N} p(c_t^n \mid c_{t-1}^n, z_{t-1}^n) \cdot \\
& \prod_{n=1}^{N}\sum_{m=1}^{N} p(z_t^n \mid z_{t-1}^m, c_t^n, e_t^{m\rightarrow n}, \mathbf{y}_{t-1}^m, \mathbf{y}_{t-1}^n) \cdot \prod_{n=1}^{N}\sum_{m=1}^{N} p(e_t^{m\rightarrow n} \mid e_{t-1}^{m\rightarrow n}, \mathbf{v}_{t-1}^m, \mathbf{v}_{t-1}^n)\Bigg],
\end{aligned}
$$

where, in the initial states, we model for each object $n$ an initial mode distribution $p(z_1^n)$ and an initial observation distribution $p(\mathbf{y}_1^n \mid z_1^n)$. For each ordered pair of objects $(m, n)$, $p(e_1^{m\rightarrow n})$ models the initial edge distribution. For later time steps $t \geq 2$, $p(e_t^{m\rightarrow n} \mid e_{t-1}^{m\rightarrow n}, \mathbf{v}_{t-1}^m, \mathbf{v}_{t-1}^n)$ models the edge variable transition probability conditioned on the node states $\{\mathbf{v}_{t-1}^m, \mathbf{v}_{t-1}^n\}$ in graph $\mathcal{G}_t$. $p(z_t^n \mid z_{t-1}^m, c_t^n, e_t^{m\rightarrow n}, \mathbf{y}_{t-1}^m, \mathbf{y}_{t-1}^n)$ models how the mode of each object is affected by the modes of all other objects, conditioned on the count variables $c_t^n$, the edge variables $e_t^{m\rightarrow n}$, and the observations $\{\mathbf{y}_{t-1}^m, \mathbf{y}_{t-1}^n\}$. $p(\mathbf{y}_t^n \mid \mathbf{y}_{t-1}^n, z_t^n)$ and $p(c_t^n \mid c_{t-1}^n, z_{t-1}^n)$ model, for each object, the observation transition probability and the count variable transition probability, respectively.

B.2 Neural Network Implementation

Implementations of $p(\mathbf{y}_1^n \mid z_1^n)$, $p(z_1^n)$, $p(\mathbf{y}_t^n \mid \mathbf{y}_{t-1}^n, z_t^n)$, and $p(c_t^n \mid c_{t-1}^n, z_{t-1}^n)$ in multi-object scenarios are the same as those in single-object scenarios. Next, we elaborate on how we implement the other terms, i.e. $p(e_1^{m\rightarrow n})$, $p(e_t^{m\rightarrow n} \mid e_{t-1}^{m\rightarrow n}, \mathbf{v}_{t-1}^m, \mathbf{v}_{t-1}^n)$, and $p(z_t^n \mid z_{t-1}^m, c_t^n, e_t^{m\rightarrow n}, \mathbf{y}_{t-1}^m, \mathbf{y}_{t-1}^n)$.

Edge variables and edge transition probability.

We implement the edge variable $\mathbf{e}$ as a categorical distribution over $\{1,\cdots,L\}$ for $L$ possible interaction types, including a no-interaction type. In $p(e_1^{m\rightarrow n})$, we assign a higher prior probability to no-interaction edges to encourage sparse graphs. The edge transition probability is modeled as

$$p(e_t^{m\rightarrow n} \mid e_{t-1}^{m\rightarrow n}, \mathbf{v}_{t-1}^m, \mathbf{v}_{t-1}^n) = {\rm Cat}\big(e_t^{m\rightarrow n};\, \mathcal{S}_{\tau_e}(f_e(e_{t-1}^{m\rightarrow n}, \mathbf{v}_{t-1}^m, \mathbf{v}_{t-1}^n))\big),$$

where the neural network $f_e$ takes $e_{t-1}^{m\rightarrow n}$, $\mathbf{v}_{t-1}^m$, and $\mathbf{v}_{t-1}^n$ as input and outputs unnormalized probabilities of all possible edge types at time step $t$, which are then post-processed by a tempered softmax function $\mathcal{S}_{\tau_e}$ with temperature $\tau_e$ to ensure normalization. In practice, the edge transition network $f_e$ is a single-hidden-layer MLP.
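The edge transition can be illustrated with a minimal sketch in plain Python. It assumes a bias-free MLP with a tanh hidden layer, random weights, a one-hot encoding of the previous edge type, and zero-based edge indices; none of these details are specified in the text, and the names `make_mlp`, `mlp_forward`, and `edge_transition_probs` are hypothetical.

```python
import math
import random


def tempered_softmax(scores, temperature):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a one-hidden-layer MLP (illustrative initialisation)."""
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def mlp_forward(params, x):
    """Hidden layer with tanh (an assumption), then a linear output layer."""
    w1, w2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def edge_transition_probs(params, prev_edge, v_m, v_n, num_types, temperature):
    """Categorical parameters of p(e_t | e_{t-1}, v_{t-1}^m, v_{t-1}^n)."""
    one_hot = [1.0 if l == prev_edge else 0.0 for l in range(num_types)]
    scores = mlp_forward(params, one_hot + v_m + v_n)
    return tempered_softmax(scores, temperature)


rng = random.Random(0)
L = 3  # number of edge types, with index 0 standing for "no interaction"
params = make_mlp(in_dim=L + 2 + 2, hidden_dim=8, out_dim=L, rng=rng)
probs = edge_transition_probs(
    params, prev_edge=0, v_m=[0.5, -1.0], v_n=[1.5, 0.2],
    num_types=L, temperature=0.5,
)
print(probs)  # a length-L probability vector
```

A lower temperature sharpens the resulting categorical distribution towards a single edge type, while a higher temperature flattens it.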

Extension of mode transition probability.

After obtaining $e_t^{m\rightarrow n}$ from the edge transition probability, we show how $e_t^{m\rightarrow n}$ affects the mode-switching behaviors. We model the mode transition probability in multi-object hybrid systems as

$$
p(z_t^n \mid z_{t-1}^m, c_t^n, e_t^{m\rightarrow n}, \mathbf{y}_{t-1}^m, \mathbf{y}_{t-1}^n) =
\begin{cases}
\delta_{z_t^n = z_{t-1}^n} & \text{if } c_t^n > 1 \\
{\rm Cat}\big(z_t^n;\, \mathcal{S}_{\tau_z}\big(\sum_{l} e_{t,l}^{m\rightarrow n} f_l(\mathbf{y}_{t-1}^m, \mathbf{y}_{t-1}^n)\big)\big) & \text{if } c_t^n = 1
\end{cases},
$$

where $\delta$ and $\mathcal{S}_{\tau_z}$ are a Kronecker delta function and a tempered softmax function, respectively. $e_{t,l}^{m\rightarrow n}$ denotes the probability of edge type $l$. We use a separate neural network $f_l$ for each edge type $l$ ($L$ in total) to model different interaction effects; their outputs are weighted by $e_{t,l}^{m\rightarrow n}$ to aggregate the effects of all interaction types.
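The two cases of the mode transition can be sketched for a single ordered pair $(m, n)$ as follows. This is an illustration under assumptions: the per-type network outputs $f_l(\mathbf{y}_{t-1}^m, \mathbf{y}_{t-1}^n)$ are passed in as precomputed score lists (`type_scores`, a hypothetical name), modes are zero-indexed, and the aggregation over source objects $m$ is omitted.

```python
import math


def tempered_softmax(scores, temperature):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mode_transition_probs(prev_mode, count, edge_probs, type_scores, temperature):
    """Distribution over z_t^n for one ordered pair (m, n).

    edge_probs[l]     : probability e_{t,l} of edge type l
    type_scores[l][k] : score of mode k from the network for edge type l
    """
    num_modes = len(type_scores[0])
    if count > 1:
        # The current segment continues: Kronecker delta on the previous mode.
        return [1.0 if k == prev_mode else 0.0 for k in range(num_modes)]
    # A new segment starts: weight each edge type's scores by its probability.
    mixed = [
        sum(e_l * scores_l[k] for e_l, scores_l in zip(edge_probs, type_scores))
        for k in range(num_modes)
    ]
    return tempered_softmax(mixed, temperature)


edge_probs = [0.7, 0.2, 0.1]  # index 0 plays the role of "no interaction"
type_scores = [[0.0, 0.0], [2.0, -1.0], [-0.5, 1.5]]

stay = mode_transition_probs(1, 3, edge_probs, type_scores, temperature=1.0)
switch = mode_transition_probs(1, 1, edge_probs, type_scores, temperature=1.0)
print(stay, switch)
```

When the count variable exceeds 1 the previous mode is copied deterministically, so the edge variables only influence the distribution at segment boundaries.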

B.3 Inference Model of AMORE-MIO

Approximate inference of edge variables.

We use a graph neural network $f_{\phi_e}(\mathbf{y})$ to conduct approximate inference of edge variables $\mathbf{e}$, i.e. $q_{\phi_e}(\mathbf{e} \mid \mathbf{y})$. The node embeddings in the latent graph $\mathcal{G}_t$ are the observations $\mathbf{y}$, and the edge embeddings are calculated by two rounds of message-passing

$$
\begin{aligned}
& \mathbf{h}_n^1 = f_{\phi_e}^{\rm emb}(\mathbf{y}_t^n), \\
v \rightarrow e:\;\; & \mathbf{h}_{m\rightarrow n}^1 = f_{\phi_e}^{e,1}([\mathbf{h}_m^1, \mathbf{h}_n^1]), \\
e \rightarrow v:\;\; & \mathbf{h}_n^2 = f_{\phi_e}^{v,1}\Big(\sum_{m=1}^{N} \mathbf{h}_{m\rightarrow n}^1\Big), \\
v \rightarrow e:\;\; & \mathbf{h}_{m\rightarrow n}^2 = f_{\phi_e}^{e,2}([\mathbf{h}_m^2, \mathbf{h}_n^2]),
\end{aligned}
$$

where $\mathbf{h}_{m\rightarrow n}^2$ is further processed by a tempered Gumbel softmax, ${\rm softmax}((\mathbf{h}_{m\rightarrow n}^2 + \mathbf{g})/\tau)$, to obtain $q_{\phi_e}(\mathbf{e} \mid \mathbf{y})$, or more specifically $q_{\phi_e}(e_t^{m\rightarrow n} \mid \mathbf{y}_t^m, \mathbf{y}_t^n)$. Here, we use continuous relaxation and reparameterization of discrete distributions for gradient backpropagation (Kipf et al., 2018). $\mathbf{g}$ is a vector sampled from a ${\rm Gumbel}(0,1)$ distribution, and the softmax temperature $\tau$ controls the smoothness of the relaxation.
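The relaxed sampling step can be sketched in plain Python as follows. The edge embedding is treated as a list of logits, one per edge type; Gumbel noise is drawn by inverse transform sampling, and the clamp on the uniform draw is an implementation detail assumed here to avoid $\log 0$. The function names are illustrative.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse CDF: g = -log(-log(u))."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + g) / temperature)."""
    noisy = [(h + sample_gumbel(rng)) / temperature for h in logits]
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
edge_logits = [2.0, 0.0, -1.0]  # stands in for the final edge embedding
relaxed = gumbel_softmax(edge_logits, temperature=0.5, rng=rng)
smooth = gumbel_softmax(edge_logits, temperature=100.0, rng=rng)
print(relaxed, smooth)
```

Because the noise enters additively and the softmax is differentiable, gradients can flow through the sample to the logits; small temperatures give samples close to one-hot vectors, and large temperatures give samples close to uniform.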

Exact inference of mode and count variables.

Given the approximate edge variables $\tilde{\mathbf{e}} \sim q_{\phi_e}(\mathbf{e} \mid \mathbf{y})$, we perform exact inference of the mode and count variables, $p_\theta(\mathbf{z}, \mathbf{c} \mid \mathbf{y}, \tilde{\mathbf{e}})$. Similar to the single-object scenarios, the conditional joint distribution is calculated by modifying the forward-backward algorithm. Specifically, the forward term $\alpha_t$ and the backward term $\beta_t$ are defined as

$$
\begin{aligned}
\alpha_t(\mathbf{z}_t^{1:N}, \mathbf{c}_t^{1:N}) &= p(\mathbf{z}_t^{1:N}, \mathbf{c}_t^{1:N}, \mathbf{y}_{1:t}^{1:N}, \mathbf{e}_{1:t}^{1:N^2}), \\
\beta_t(\mathbf{z}_t^{1:N}, \mathbf{c}_t^{1:N}) &= p(\mathbf{y}_{t+1:T}^{1:N} \mid \mathbf{y}_t^{1:N}, \mathbf{z}_t^{1:N}, \mathbf{c}_t^{1:N}, \mathbf{e}_t^{1:N^2}).
\end{aligned}
$$

The joint probability of the mode and count variables $\mathbf{z}$, $\mathbf{c}$ conditioned on the observations $\mathbf{y}$ and the approximate edge variables $\mathbf{e}$ is calculated as

$$
\begin{aligned}
p(\mathbf{z}_t, \mathbf{c}_t \mid \mathbf{y}_{1:T}, \mathbf{e}_{1:T}) &\propto p(\mathbf{z}_t, \mathbf{c}_t, \mathbf{y}_{1:T}, \mathbf{e}_{1:T}) \\
&= \underbrace{p(\mathbf{z}_t, \mathbf{c}_t, \mathbf{y}_{1:t}, \mathbf{e}_{1:t})}_{\text{Forward}} \, \underbrace{p(\mathbf{y}_{t+1:T}, \mathbf{e}_{t+1:T} \mid \mathbf{y}_t, \mathbf{z}_t, \mathbf{c}_t)}_{\text{Backward}} \\
&= \alpha_t(\mathbf{z}_t, \mathbf{c}_t) \cdot \beta_t(\mathbf{z}_t, \mathbf{c}_t).
\end{aligned}
$$
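The forward-backward factorization can be illustrated on a deliberately simplified model. The sketch below is a standard discrete HMM: each hidden state stands in for a joint configuration of $(\mathbf{z}_t, \mathbf{c}_t)$, and the per-step likelihoods `lik[t][k]` are assumed to be given. It omits the autoregressive observations, the count dynamics, and the edge variables of AMORE-MIO, and it does no rescaling, so it is only suitable for short sequences.

```python
def forward_backward(init, trans, lik):
    """Posterior over hidden states from unscaled forward and backward passes.

    init[k]     : p(s_1 = k)
    trans[j][k] : p(s_t = k | s_{t-1} = j)
    lik[t][k]   : likelihood of the observation at step t under state k
    """
    num_steps, num_states = len(lik), len(init)

    # Forward pass: alpha[t][k] = p(s_t = k, y_{1:t}).
    alpha = [[init[k] * lik[0][k] for k in range(num_states)]]
    for t in range(1, num_steps):
        prev = alpha[t - 1]
        alpha.append([
            lik[t][k] * sum(prev[j] * trans[j][k] for j in range(num_states))
            for k in range(num_states)
        ])

    # Backward pass: beta[t][k] = p(y_{t+1:T} | s_t = k), with beta[T-1] = 1.
    beta = [[1.0] * num_states for _ in range(num_steps)]
    for t in range(num_steps - 2, -1, -1):
        beta[t] = [
            sum(trans[k][j] * lik[t + 1][j] * beta[t + 1][j]
                for j in range(num_states))
            for k in range(num_states)
        ]

    # Posterior: normalise alpha * beta at every step.
    posterior = []
    for t in range(num_steps):
        joint = [alpha[t][k] * beta[t][k] for k in range(num_states)]
        evidence = sum(joint)
        posterior.append([p / evidence for p in joint])
    return alpha, beta, posterior


init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
lik = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.7]]
alpha, beta, posterior = forward_backward(init, trans, lik)
print(posterior)
```

The normalizing constant obtained by summing $\alpha_t \beta_t$ over states is the same at every time step, which offers a simple consistency check on an implementation.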

The forward term $\alpha_{t}(\mathbf{z}_{t},\mathbf{c}_{t})$ is derived as:

$$
\begin{aligned}
\alpha_{1}(\mathbf{z}_{1},\mathbf{c}_{1}) &= p(\mathbf{z}_{1},\mathbf{c}_{1},\mathbf{y}_{1},\mathbf{e}_{1}) \\
&= p(\mathbf{z}_{1}^{1:N},\mathbf{c}_{1}^{1:N},\mathbf{y}_{1}^{1:N},\mathbf{e}_{1}^{1:N^{2}}) \\
&= \delta_{\mathbf{c}_{1}^{1:N}=1}\, p(\mathbf{z}_{1}^{1:N})\, p(\mathbf{e}_{1}^{1:N^{2}})\, p(\mathbf{y}_{1}^{1:N}|\mathbf{z}_{1}^{1:N}) \\
&= \delta_{\mathbf{c}_{1}^{1:N}=1}\, p(\mathbf{z}_{1}^{1:N})\, p(\mathbf{e}_{1}^{1:N^{2}}) \prod_{n=1}^{N} p(\mathbf{y}_{1}^{n}|\mathbf{z}_{1}^{n}) \\
\underline{\alpha_{t}(\mathbf{z}_{t},\mathbf{c}_{t})} &= p(\mathbf{z}_{t},\mathbf{c}_{t},\mathbf{y}_{1:t},\mathbf{e}_{1:t}) \\
&= p(\mathbf{z}_{t}^{1:N},\mathbf{c}_{t}^{1:N},\mathbf{y}_{1:t}^{1:N},\mathbf{e}_{1:t}^{1:N^{2}}) \\
&= \sum_{\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t-1}^{1:N}} p(\mathbf{z}_{t}^{1:N},\mathbf{c}_{t}^{1:N},\mathbf{y}_{1:t}^{1:N},\mathbf{e}_{1:t}^{1:N^{2}},\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t-1}^{1:N}) \\
&= \sum_{\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t-1}^{1:N}} \Bigg[ p(\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t-1}^{1:N},\mathbf{y}_{1:t-1}^{1:N},\mathbf{e}_{1:t-1}^{1:N^{2}})\, p(\mathbf{y}_{t}^{1:N}|\mathbf{y}_{t-1}^{1:N},\mathbf{z}_{t}^{1:N})\, p(\mathbf{z}_{t}^{1:N}|\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t}^{1:N},\mathbf{y}_{t-1}^{1:N},\mathbf{e}_{t}^{1:N^{2}}) \\
&\qquad\qquad \cdot\, p(\mathbf{c}_{t}^{1:N}|\mathbf{c}_{t-1}^{1:N},\mathbf{z}_{t-1}^{1:N})\, p(\mathbf{e}_{t}^{1:N^{2}}|\mathbf{e}_{t-1}^{1:N^{2}},\mathbf{z}_{t-1}^{1:N},\mathbf{y}_{t-1}^{1:N}) \Bigg] \\
&= \sum_{\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t-1}^{1:N}} \Bigg[ \underline{\alpha_{t-1}(\mathbf{z}_{t-1},\mathbf{c}_{t-1})} \cdot \prod_{n=1}^{N} p(\mathbf{y}_{t}^{n}|\mathbf{y}_{t-1}^{n},z_{t}^{n}) \cdot \prod_{n=1}^{N}\prod_{m=1}^{N} p(z_{t}^{n}|z_{t-1}^{m},c_{t}^{n},\mathbf{y}_{t-1}^{m},\mathbf{y}_{t-1}^{n},e_{t}^{m\rightarrow n}) \\
&\qquad\qquad \cdot \prod_{n=1}^{N} p(c_{t}^{n}|c_{t-1}^{n},z_{t-1}^{n}) \prod_{n=1}^{N}\prod_{m=1}^{N} p(e_{t}^{m\rightarrow n}|e_{t-1}^{m\rightarrow n},z_{t-1}^{m},z_{t-1}^{n},c_{t-1}^{m},c_{t-1}^{n},\mathbf{y}_{t-1}^{m},\mathbf{y}_{t-1}^{n}) \Bigg],
\end{aligned}
$$

where $\alpha_{t}(\mathbf{z}_{t},\mathbf{c}_{t})$ is computed recursively from $\alpha_{t-1}(\mathbf{z}_{t-1},\mathbf{c}_{t-1})$ via the state transitions.
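The forward recursion can be illustrated on a much smaller model. The sketch below is not the paper's implementation: it assumes a single object, no edge variables, a single discrete state `s` standing in for the pair $(\mathbf{z}_t, \mathbf{c}_t)$, and discrete observations that do not depend on $\mathbf{y}_{t-1}$. All names (`forward`, `brute_force_likelihood`, `init`, `trans`, `emit`) and all numbers are hypothetical.

```python
from itertools import product


def forward(init, trans, emit, obs):
    """Forward messages alpha[t][s] = p(s_t = s, y_{1:t}) for a discrete chain.

    Simplifying assumption: s is a single joint discrete state playing the
    role of (z_t, c_t); edges and the dependence on y_{t-1} are left out.
    """
    n_states = len(init)
    # Base case: alpha_1(s) = p(s_1) p(y_1 | s_1).
    alpha = [[init[s] * emit[s][obs[0]] for s in range(n_states)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # Recursion: sum over the previous state, then apply the emission term.
        alpha.append([
            emit[s][obs[t]] * sum(prev[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ])
    return alpha


def brute_force_likelihood(init, trans, emit, obs):
    """Sum the joint probability over every state path (exponential cost)."""
    n_states, total = len(init), 0.0
    for path in product(range(n_states), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total


# Hypothetical toy parameters: two joint states, two observation symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 1, 0]

alpha = forward(init, trans, emit, obs)
likelihood = sum(alpha[-1])  # p(y_{1:T}) obtained by summing the last message
```

Summing the last forward message gives the same likelihood as enumerating every state path, but at a cost linear in the sequence length rather than exponential.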

The backward term $\beta_{t}(\mathbf{z}_{t},\mathbf{c}_{t})$ is derived as:

$$
\begin{aligned}
\beta_{T}(\mathbf{z}_{T},\mathbf{c}_{T}) &= 1 \\
\underline{\beta_{t}(\mathbf{z}_{t},\mathbf{c}_{t})} &= p(\mathbf{y}_{t+1:T},\mathbf{e}_{t+1:T}|\mathbf{y}_{t},\mathbf{z}_{t},\mathbf{c}_{t}) \\
&= p(\mathbf{y}_{t+1:T}^{1:N},\mathbf{e}_{t+1:T}^{1:N^{2}}|\mathbf{y}_{t}^{1:N},\mathbf{z}_{t}^{1:N},\mathbf{c}_{t}^{1:N}) \\
&= \sum_{\mathbf{z}_{t+1}^{1:N},\mathbf{c}_{t+1}^{1:N}} p(\mathbf{y}_{t+1:T}^{1:N},\mathbf{e}_{t+1:T}^{1:N^{2}},\mathbf{z}_{t+1}^{1:N},\mathbf{c}_{t+1}^{1:N}|\mathbf{y}_{t}^{1:N},\mathbf{z}_{t}^{1:N},\mathbf{c}_{t}^{1:N}) \\
&= \sum_{\mathbf{z}_{t+1}^{1:N},\mathbf{c}_{t+1}^{1:N}} \Bigg[ p(\mathbf{y}_{t+1}^{1:N}|\mathbf{y}_{t}^{1:N},\mathbf{z}_{t+1}^{1:N})\, p(\mathbf{z}_{t+1}^{1:N}|\mathbf{z}_{t}^{1:N},\mathbf{c}_{t+1}^{1:N},\mathbf{y}_{t}^{1:N},\mathbf{e}_{t+1}^{1:N^{2}}) \\
&\qquad\qquad \cdot\, p(\mathbf{c}_{t+1}^{1:N}|\mathbf{c}_{t}^{1:N},\mathbf{z}_{t}^{1:N})\, p(\mathbf{e}_{t+1}^{1:N^{2}}|\mathbf{e}_{t}^{1:N^{2}},\mathbf{z}_{t}^{1:N},\mathbf{y}_{t}^{1:N})\, p(\mathbf{y}_{t+2:T}^{1:N},\mathbf{e}_{t+2:T}^{1:N^{2}}|\mathbf{y}_{t+1}^{1:N},\mathbf{z}_{t+1}^{1:N},\mathbf{c}_{t+1}^{1:N}) \Bigg] \\
&= \sum_{\mathbf{z}_{t+1}^{1:N},\mathbf{c}_{t+1}^{1:N}} \Bigg[ \prod_{n=1}^{N} p(\mathbf{y}_{t+1}^{n}|\mathbf{y}_{t}^{n},z_{t+1}^{n}) \cdot \prod_{n=1}^{N}\prod_{m=1}^{N} p(z_{t+1}^{n}|z_{t}^{m},c_{t+1}^{n},\mathbf{y}_{t}^{m},\mathbf{y}_{t}^{n},e_{t+1}^{m\rightarrow n}) \\
&\qquad\qquad \cdot \prod_{n=1}^{N} p(c_{t+1}^{n}|c_{t}^{n},z_{t}^{n}) \cdot \prod_{n=1}^{N}\prod_{m=1}^{N} p(e_{t+1}^{m\rightarrow n}|e_{t}^{m\rightarrow n},z_{t}^{m},z_{t}^{n},c_{t}^{m},c_{t}^{n},\mathbf{y}_{t}^{m},\mathbf{y}_{t}^{n})\; \underline{\beta_{t+1}(\mathbf{z}_{t+1},\mathbf{c}_{t+1})} \Bigg],
\end{aligned}
$$

where $\beta_{t}(\mathbf{z}_{t},\mathbf{c}_{t})$ is computed recursively from $\beta_{t+1}(\mathbf{z}_{t+1},\mathbf{c}_{t+1})$ via the state transitions.
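The backward recursion and the resulting posterior marginal $\alpha_t \cdot \beta_t$ can be sketched under strong simplifying assumptions: a single object, no edge variables, a single discrete state `s` standing in for $(\mathbf{z}_t, \mathbf{c}_t)$, and discrete observations independent of $\mathbf{y}_{t-1}$. This is an illustration only, not the paper's implementation; all function names and toy numbers are hypothetical.

```python
def forward(init, trans, emit, obs):
    """alpha[t][s] = p(s_t = s, y_{1:t}) for a discrete-state chain."""
    n_states = len(init)
    alpha = [[init[s] * emit[s][obs[0]] for s in range(n_states)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            emit[s][obs[t]] * sum(prev[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ])
    return alpha


def backward(trans, emit, obs):
    """beta[t][s] = p(y_{t+1:T} | s_t = s), with beta at the final step set to 1."""
    n_states, n_steps = len(trans), len(obs)
    beta = [[1.0] * n_states for _ in range(n_steps)]
    for t in range(n_steps - 2, -1, -1):
        # Sum over the next state: transition, next emission, next message.
        beta[t] = [
            sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r]
                for r in range(n_states))
            for s in range(n_states)
        ]
    return beta


def posterior(alpha, beta):
    """Normalize alpha_t * beta_t to get p(s_t | y_{1:T}) at every step."""
    post = []
    for a_row, b_row in zip(alpha, beta):
        joint = [a * b for a, b in zip(a_row, b_row)]
        norm = sum(joint)
        post.append([j / norm for j in joint])
    return post


# Hypothetical toy parameters: two joint states, two observation symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 1, 0]

alpha = forward(init, trans, emit, obs)
beta = backward(trans, emit, obs)
post = posterior(alpha, beta)
```

A useful sanity check is that $\sum_s \alpha_t(s)\,\beta_t(s)$ yields the same likelihood at every time step, which is why normalizing the product gives the posterior marginal.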

B.4 Derivation of Optimization Objective

The learnable parameters of our model are optimized by maximizing the evidence lower bound (ELBO) with a sparsity regularizer on the coefficients of the candidate basis functions. The ELBO is derived as follows. For brevity, $\mathbf{y}$, $\mathbf{z}$, $\mathbf{c}$, and $\mathbf{e}$ denote $\mathbf{y}_{1:T}^{1:N}$, $\mathbf{z}_{1:T}^{1:N}$, $\mathbf{c}_{1:T}^{1:N}$, and $\mathbf{e}_{1:T}^{1:N^{2}}$, respectively, where $N$ is the number of objects and $T$ is the number of time steps.

$$
\begin{aligned}
\mathrm{ELBO} &= \log p_{\theta}(\mathbf{y}) - D_{KL}\left[q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\,\|\,p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\right] \\
&= \int q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\,\log p_{\theta}(\mathbf{y})\,d(\mathbf{z},\mathbf{c},\mathbf{e}) - \int q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\,\log\frac{q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})}{p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})}\,d(\mathbf{z},\mathbf{c},\mathbf{e}) \\
&= \int q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\left[\log p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e},\mathbf{y}) - \log q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\right]d(\mathbf{z},\mathbf{c},\mathbf{e}) \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})}\left[\log p_{\theta}(\mathbf{z},\mathbf{c},\mathbf{e},\mathbf{y}) - \log q_{\phi}(\mathbf{z},\mathbf{c},\mathbf{e}|\mathbf{y})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})\,p_{\theta}(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})}\left[\log\big(p_{\theta}(\mathbf{y},\mathbf{e})\,p_{\theta}(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})\big) - \log\big(q_{\phi}(\mathbf{e}|\mathbf{y})\,p_{\theta}(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})\big)\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})}\left[\log p_{\theta}(\mathbf{y},\mathbf{e}) - \log q_{\phi}(\mathbf{e}|\mathbf{y})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})}\left[\log p_{\theta}(\mathbf{y},\mathbf{e})\right] + H(q_{\phi}(\mathbf{e}|\mathbf{y})),
\end{aligned}
$$

where $\log p_{\theta}(\mathbf{y},\mathbf{e})$ is the joint log-likelihood, and $H(q_{\phi}(\mathbf{e}|\mathbf{y}))$ is the conditional entropy of the approximate posterior over the edge variable $\mathbf{e}$.
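As a hedged illustration of the last line of the derivation, the sketch below evaluates $\mathbb{E}_{q}[\log p(\mathbf{y},\mathbf{e})] + H(q)$ exactly for a toy discrete latent with three values. The joint probabilities, the function name `elbo`, and the three "edge types" are assumptions made only for this example; they are not taken from the paper. The sketch checks the standard property that the ELBO never exceeds $\log p(\mathbf{y})$ and matches it when $q$ equals the true posterior.

```python
import numpy as np

def elbo(log_joint, q):
    """ELBO = E_q[log p(y, e)] + H(q) for a discrete latent e.

    log_joint[i] holds log p(y, e = i) for the observed y; q[i] = q(e = i | y).
    """
    q = np.asarray(q, dtype=float)
    log_joint = np.asarray(log_joint, dtype=float)
    support = q > 0  # terms with q = 0 contribute nothing (0 * log 0 := 0)
    expected_log_joint = np.sum(q[support] * log_joint[support])
    entropy = -np.sum(q[support] * np.log(q[support]))
    return expected_log_joint + entropy

# Toy joint over three hypothetical edge types for one fixed observation y.
joint = np.array([0.10, 0.25, 0.05])      # p(y, e = i), illustrative numbers
log_joint = np.log(joint)
log_evidence = np.log(joint.sum())        # log p(y)
posterior = joint / joint.sum()           # p(e | y)

uniform_q = np.full(3, 1.0 / 3.0)
elbo_uniform = elbo(log_joint, uniform_q)  # lower than log p(y)
elbo_exact = elbo(log_joint, posterior)    # equal to log p(y) up to rounding
```

The gap `log_evidence - elbo_uniform` is the KL divergence from `uniform_q` to the posterior, which is why the bound tightens as $q$ approaches $p(\mathbf{e}|\mathbf{y})$.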

B.4.1 Training of ELBO

We maximize the ELBO with mini-batch stochastic gradient descent. The gradients of the ELBO with respect to $\theta$ and $\phi$ are calculated as

$$
\begin{aligned}
\nabla_{\theta} \mathrm{ELBO} &= \nabla_{\theta} \left[ \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})} \log p_{\theta}(\mathbf{y},\mathbf{e}) \right] = \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})} \nabla_{\theta} \log p_{\theta}(\mathbf{y},\mathbf{e}), \\
\nabla_{\phi} \mathrm{ELBO} &= \nabla_{\phi} \left[ \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})} \log p_{\theta}(\mathbf{y},\mathbf{e}) + H(q_{\phi}(\mathbf{e}|\mathbf{y})) \right] \\
&= \nabla_{\phi} \left[ \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})} \log p_{\theta}(\mathbf{y},\mathbf{e}) \right] + \nabla_{\phi} H(q_{\phi}(\mathbf{e}|\mathbf{y})) \\
&= \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}} \left[ \nabla_{\phi} \log p_{\theta}(\mathbf{y}, \mathbf{e}_{\phi}(\boldsymbol{\epsilon}, \mathbf{y})) \right] + \nabla_{\phi} H(q_{\phi}(\mathbf{e}|\mathbf{y})),
\end{aligned}
$$

where we use the reparameterization trick (Kingma & Welling, 2013) to compute $\nabla_{\phi} \left[ \mathbb{E}_{q_{\phi}(\mathbf{e}|\mathbf{y})} \log p_{\theta}(\mathbf{y},\mathbf{e}) \right]$, and $\nabla_{\phi} H(q_{\phi}(\mathbf{e}|\mathbf{y}))$ is the gradient of the entropy loss. Among the derivative terms, the challenging part is the gradient of the joint log-probability, $\nabla_{\theta} \log p_{\theta}(\mathbf{y},\mathbf{e})$, which is calculated as

$$
\begin{aligned}
\nabla \log p(\mathbf{y},\mathbf{e}) &= \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})} \left[ \nabla \log p(\mathbf{y},\mathbf{e}) \right] \\
&= \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})} \left[ \nabla \log p(\mathbf{y},\mathbf{e},\mathbf{z},\mathbf{c}) \right] - \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})} \left[ \nabla \log p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e}) \right] \\
&= \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})} \left[ \nabla \log p(\mathbf{y},\mathbf{e},\mathbf{z},\mathbf{c}) \right] - \int p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e}) \frac{\nabla p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})}{p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})} \, d(\mathbf{z},\mathbf{c}) \\
&= \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})} \left[ \nabla \log p(\mathbf{y},\mathbf{e},\mathbf{z},\mathbf{c}) \right].
\end{aligned}
$$
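The identity above (the gradient of a marginal log-likelihood equals the posterior expectation of the gradient of the joint log-likelihood) can be checked numerically on a small model. The following is a minimal sketch under stated assumptions: a single discrete latent $z \in \{0, 1\}$ with a fixed prior, a unit-variance Gaussian likelihood with mode-specific means `mu`, and gradients taken with respect to `mu` only. The model, the numbers, and all function names are hypothetical and chosen for illustration; they are not the paper's model.

```python
import numpy as np

def log_joint(y, mu, prior):
    """log p(y, z = k) for each k, with y | z = k ~ N(mu[k], 1)."""
    return np.log(prior) - 0.5 * (y - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_marginal(y, mu, prior):
    """log p(y) = log sum_k p(y, z = k), computed stably."""
    lj = log_joint(y, mu, prior)
    m = lj.max()
    return m + np.log(np.exp(lj - m).sum())

def posterior_expected_gradient(y, mu, prior):
    """E_{p(z|y)}[grad_mu log p(y, z)], the right-hand side of the identity."""
    lj = log_joint(y, mu, prior)
    posterior = np.exp(lj - log_marginal(y, mu, prior))  # p(z = k | y)
    # d/d mu[k] of log p(y, z = k') is (y - mu[k]) if k == k', else 0.
    return posterior * (y - mu)

def finite_difference_gradient(y, mu, prior, h=1e-6):
    """Central differences on log p(y), the left-hand side of the identity."""
    grad = np.zeros_like(mu)
    for k in range(mu.size):
        step = np.zeros_like(mu)
        step[k] = h
        grad[k] = (log_marginal(y, mu + step, prior)
                   - log_marginal(y, mu - step, prior)) / (2.0 * h)
    return grad

y = 0.7
mu = np.array([-1.0, 2.0])
prior = np.array([0.3, 0.7])
lhs = finite_difference_gradient(y, mu, prior)
rhs = posterior_expected_gradient(y, mu, prior)
max_gap = np.max(np.abs(lhs - rhs))  # should be at finite-difference noise level
```

The practical appeal of this form is that the right-hand side needs only posterior weights and gradients of the complete-data log-likelihood, which factorizes over time, rather than a gradient through the marginalization itself.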

Following the Markovian property, we unfold the joint likelihood $p(\mathbf{y},\mathbf{e},\mathbf{z},\mathbf{c})$ over time as:

$$
\begin{aligned}
&\nabla \log p(\mathbf{y},\mathbf{e},\mathbf{z},\mathbf{c}) \\
&= \nabla \log p(\mathbf{y}_{1:T}^{1:N}, \mathbf{e}_{1:T}^{1:N^{2}}, \mathbf{z}_{1:T}^{1:N}, \mathbf{c}_{1:T}^{1:N}) \\
&= \nabla \log \left[ p(\mathbf{y}_{1}^{1:N}|\mathbf{z}_{1}^{1:N})\, p(\mathbf{z}_{1}^{1:N}) \right] + \sum_{t=2}^{T} \nabla \log \bigg[ p(\mathbf{y}_{t}^{1:N}|\mathbf{y}_{t-1}^{1:N},\mathbf{z}_{t}^{1:N})\, p(\mathbf{z}_{t}^{1:N}|\mathbf{z}_{t-1}^{1:N},\mathbf{c}_{t}^{1:N},\mathbf{y}_{t-1}^{1:N},\mathbf{e}_{t}^{1:N^{2}}) \\
&\qquad \cdot\, p(\mathbf{c}_{t}^{1:N}|\mathbf{c}_{t-1}^{1:N},\mathbf{z}_{t-1}^{1:N})\, p(\mathbf{e}_{t}^{1:N^{2}}|\mathbf{e}_{t-1}^{1:N^{2}},\mathbf{z}_{t-1}^{1:N},\mathbf{y}_{t-1}^{1:N}) \bigg] \\
&= \nabla \log \left[ \prod_{n=1}^{N} p(\mathbf{y}_{1}^{n}|z_{1}^{n}) \cdot \prod_{n=1}^{N} p(z_{1}^{n}) \right] + \sum_{t=2}^{T} \nabla \log \Bigg[ \prod_{n=1}^{N} p(\mathbf{y}_{t}^{n}|\mathbf{y}_{t-1}^{n},z_{t}^{n}) \cdot \prod_{n=1}^{N} \prod_{m=1}^{N} p(z_{t}^{n}|z_{t-1}^{m},c_{t}^{n},\mathbf{y}_{t-1}^{m},\mathbf{y}_{t-1}^{n},e_{t}^{m\rightarrow n}) \\
&\qquad \cdot \prod_{n=1}^{N} p(c_{t}^{n}|c_{t-1}^{n},z_{t-1}^{n}) \cdot \prod_{n=1}^{N} \prod_{m=1}^{N} p(e_{t}^{m\rightarrow n}|e_{t-1}^{m\rightarrow n},z_{t-1}^{m},z_{t-1}^{n},c_{t-1}^{m},c_{t-1}^{n},\mathbf{y}_{t-1}^{m},\mathbf{y}_{t-1}^{n}) \Bigg],
\end{aligned}
$$

where each edge variable evolves based on the previous states of both objects it connects. We model the influence of the interaction between each pair of objects by $p(z_{t}^{n}|z_{t-1}^{m},c_{t}^{n},\mathbf{y}_{t-1}^{m},\mathbf{y}_{t-1}^{n},e_{t}^{m\rightarrow n})$, without instantaneous dependencies. Taking the expectation under the posterior $p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})$, $\nabla \log p(\mathbf{y},\mathbf{e})$ is finally calculated as

$$
\begin{aligned}
\nabla \log p(\mathbf{y},\mathbf{e}) &= \mathbb{E}_{p(\mathbf{z},\mathbf{c}|\mathbf{y},\mathbf{e})} \left[ \nabla \log p(\mathbf{y},\mathbf{e},\mathbf{z},\mathbf{c}) \right] \\
&= \mathbb{E}_{p(\mathbf{z}_{1:T}^{1:N},\mathbf{c}_{1:T}^{1:N}|\mathbf{y}_{1:T}^{1:N},\mathbf{e}_{1:T}^{1:N^{2}})} \left[ \nabla \log p(\mathbf{y}_{1:T}^{1:N},\mathbf{e}_{1:T}^{1:N^{2}},\mathbf{z}_{1:T}^{1:N},\mathbf{c}_{1:T}^{1:N}) \right] \\
&= \sum_{\mathbf{k}} p(\mathbf{z}_{1}^{1:N}=\mathbf{k}|\mathbf{y}_{1:T}^{1:N},\mathbf{e}_{1:T}^{1:N^{2}}) \, \nabla \log \left[ \prod_{n=1}^{N} p(\mathbf{y}_{1}^{n}|\mathbf{z}_{1}^{n}=k^{n}) \cdot p(\mathbf{z}_{1}^{1:N}=\mathbf{k}) \right] \\
&\quad + \sum_{t=2}^{T} \sum_{\mathbf{k},\mathbf{j},\mathbf{u},\mathbf{v}} \xi(\mathbf{k},\mathbf{j},\mathbf{u},\mathbf{v}) \, \nabla \log \Bigg[ \prod_{n=1}^{N} \prod_{m=1}^{N} p(e_{t}^{m\rightarrow n}|e_{t-1}^{m\rightarrow n},\mathbf{z}_{t}^{m,n}=\mathbf{k}^{m,n},\mathbf{y}_{t}^{m,n}) \cdot \prod_{n=1}^{N} p(\mathbf{y}_{t}^{n}|\mathbf{y}_{t-1}^{n},z_{t}^{n}=k^{n}) \\
&\quad \cdot \prod_{n=1}^{N} \prod_{m=1}^{N} p(z_{t}^{n}=k^{n}|z_{t-1}^{m}=j^{m},c_{t}^{n}=v^{n},\mathbf{y}_{t-1}^{m,n},e_{t-1}^{m\rightarrow n}) \cdot \prod_{n=1}^{N} p(c_{t}^{n}=v^{n}|c_{t-1}^{n}=u^{n},z_{t-1}^{n}=j^{n}) \Bigg] \\
&= \sum_{\mathbf{k}} \gamma(\mathbf{k}) \, \nabla \log \left[ B_{1}(k^{n}) \cdot \pi(\mathbf{k}) \right] \\
&\quad + \sum_{t=2}^{T} \sum_{\mathbf{k},\mathbf{j},\mathbf{u},\mathbf{v}} \xi(\mathbf{k},\mathbf{j},\mathbf{u},\mathbf{v}) \, \nabla \log \left[ B_{t}(\mathbf{k}) \cdot E_{t}(\mathbf{k}) \cdot A_{t}(\mathbf{k},\mathbf{j},\mathbf{v}) \cdot C_{t}(\mathbf{u},\mathbf{v},\mathbf{j}) \right],
\end{aligned}
$$

where $\pi(\mathbf{k})$, $\gamma(\mathbf{k})$, $\xi(\mathbf{k},\mathbf{j},\mathbf{u},\mathbf{v})$, $B_{t}(\mathbf{k})$, $E_{t}(\mathbf{k})$, $A_{t}(\mathbf{k},\mathbf{j},\mathbf{v})$, and $C_{t}(\mathbf{u},\mathbf{v},\mathbf{j})$ are defined as

$$
\begin{aligned}
\pi(\mathbf{k}) &= p(\mathbf{z}_{1}^{1:N}=\mathbf{k}), \\
\gamma(\mathbf{k}) &= p(\mathbf{z}_{1}^{1:N}=\mathbf{k}|\mathbf{y}_{1:T}^{1:N},\mathbf{e}_{1:T}^{1:N^{2}}), \\
\xi(\mathbf{k},\mathbf{j},\mathbf{u},\mathbf{v}) &= p(\mathbf{z}_{t}^{1:N}=\mathbf{k},\mathbf{z}_{t-1}^{1:N}=\mathbf{j},\mathbf{c}_{t}^{1:N}=\mathbf{v},\mathbf{c}_{t-1}^{1:N}=\mathbf{u}|\mathbf{y}_{1:T}^{1:N},\mathbf{e}_{1:T}^{1:N^{2}}), \\
B_{t}(\mathbf{k}) &= \prod_{n=1}^{N} p(\mathbf{y}_{t}^{n}|\mathbf{y}_{t-1}^{n},z_{t}^{n}=k^{n}), \\
E_{t}(\mathbf{k}) &= \prod_{n=1}^{N} \prod_{m=1}^{N} p(e_{t}^{m\rightarrow n}|e_{t-1}^{m\rightarrow n},\mathbf{z}_{t}^{m,n}=\mathbf{k}^{m,n},\mathbf{y}_{t}^{m,n}), \\
A_{t}(\mathbf{k},\mathbf{j},\mathbf{v}) &= \prod_{n=1}^{N} \prod_{m=1}^{N} p(z_{t}^{n}=k^{n}|z_{t-1}^{m}=j^{m},c_{t}^{n}=v^{n},\mathbf{y}_{t-1}^{m,n},e_{t-1}^{m\rightarrow n}),
\end{aligned}
$$
Ct(𝐮,𝐯,𝐣)subscript𝐶𝑡𝐮𝐯𝐣\displaystyle C_{t}(\mathbf{u},\mathbf{v},\mathbf{j})italic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_u , bold_v , bold_j ) =n=1Np(ctn=vn|ct1n=un,zt1n=jn),\displaystyle=\prod_{n=1}^{N}p(c_{t}^{n}\!=\!v^{n}|c_{t-1}^{n}\!=\!u^{n},z_{t-% 1}^{n}\!=\!j^{n}),= ∏ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_p ( italic_c start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT = italic_v start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT | italic_c start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT = italic_u start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT = italic_j start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) ,

Among these, $\pi(\mathbf{k})$ is the initial joint discrete mode probability. $B_{t}(\mathbf{k})$ is the observation transition probability conditioned on motion modes $\mathbf{k}$. $E_{t}(\mathbf{k})$ is the discrete edge transition probability. $A_{t}(\mathbf{k},\mathbf{j},\mathbf{v})$ is the discrete motion mode transition probability. $C_{t}(\mathbf{u},\mathbf{v},\mathbf{j})$ is the mode count transition probability. Besides, $\gamma(\mathbf{k})$ and $\xi(\mathbf{k},\mathbf{j},\mathbf{u},\mathbf{v})$ are conditional posterior distributions, which can be calculated by the forward-backward algorithm in Appendix B.3.

Appendix C More Experiments

C.1 Details of Datasets

Figure 6: An illustration of the Mass-spring hopper system (Brunton et al., 2016).

C.1.1 Mass-spring hopper

Figure 6 shows an illustration of a Mass-spring system that contains two motion modes, i.e. flying and compression. A minimal model of the Mass-spring hopper system is defined as

$$m\ddot{x} = \begin{cases} -k(x-x_{0}) - mg, & x \leq x_{0} \\ -mg, & x > x_{0} \end{cases},$$

where $k$, $m$, and $g$ are the spring constant, mass, and gravitational acceleration, respectively. $x_{0}$ is the unstretched spring length, which separates the flying ($x > x_{0}$) and compression ($x \leq x_{0}$) modes. After scaling by $\kappa = kx_{0}/mg$, the equation above becomes

$$\ddot{y} = \begin{cases} 1 - \kappa(y-1), & y \leq 1 \\ -1, & y > 1 \end{cases}.$$

Following Hybrid-SINDy (Mangan et al., 2019), we set $\kappa = 10$ for data generation. Denoting $y$ as $l$ and $\dot{y}$ as $v$, the target closed-form ordinary differential equations are

$$\begin{cases} \dot{l} = v \;\;\text{and}\;\; \dot{v} = 11 - 10\,l, & l \leq 1 \\ \dot{l} = v \;\;\text{and}\;\; \dot{v} = -1, & l > 1 \end{cases} \tag{8}$$

The generated positions and velocities are concatenated as $[l, v]$ and used as observations. Instead of generating only a few samples as in Hybrid-SINDy (Mangan et al., 2019) (3 for training and 5 for validation), we scale up the dataset and sample 240 initial conditions from the ranges $(0.5, 3)$ and $(-1, 1)$ for positions $a$ and velocities $b$, respectively. Of these, 200 samples are used for training, 20 for validation, and 20 for testing. The system is simulated for 150 time steps per time series, with a sampling interval of $\triangle_{\tau} = 0.033$. We add Gaussian noise with zero mean and standard deviation $10^{-6}$ to the generated samples. By default, we use the first 100 time steps as context and predict the following 50 time steps one by one, each based on the ground truth of the previous time step, i.e. one-step prediction. By default, the order of polynomial functions is set to 2, and the maximal number of possible modes is 3.
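The data generation described above can be sketched as follows. This is a minimal illustration rather than the authors' generation script: the Runge-Kutta integrator, the random seed, the order of the train/validation/test split, and the names `hopper_rhs`, `rk4_step`, and `simulate` are assumptions introduced here; only Eq. (8), the sampling ranges, the step count, the sampling interval, and the noise level come from the text.

```python
import random

KAPPA = 10.0    # scaled spring constant used for data generation
DT = 0.033      # sampling interval
N_STEPS = 150   # time steps per time series


def hopper_rhs(l, v, kappa=KAPPA):
    """Right-hand side of Eq. (8): returns (dl/dt, dv/dt)."""
    if l <= 1.0:                        # compression mode
        return v, 1.0 - kappa * (l - 1.0)
    return v, -1.0                      # flying mode


def rk4_step(l, v, dt):
    """One classical Runge-Kutta step (the integrator is an assumption)."""
    k1l, k1v = hopper_rhs(l, v)
    k2l, k2v = hopper_rhs(l + 0.5 * dt * k1l, v + 0.5 * dt * k1v)
    k3l, k3v = hopper_rhs(l + 0.5 * dt * k2l, v + 0.5 * dt * k2v)
    k4l, k4v = hopper_rhs(l + dt * k3l, v + dt * k3v)
    l_next = l + dt * (k1l + 2 * k2l + 2 * k3l + k4l) / 6.0
    v_next = v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
    return l_next, v_next


def simulate(l0, v0, rng, noise_std=1e-6):
    """Return noisy observations [l, v] and mode labels (0: compression, 1: flying)."""
    l, v = l0, v0
    observations, modes = [], []
    for _ in range(N_STEPS):
        modes.append(0 if l <= 1.0 else 1)
        observations.append([l + rng.gauss(0.0, noise_std),
                             v + rng.gauss(0.0, noise_std)])
        l, v = rk4_step(l, v, DT)
    return observations, modes


rng = random.Random(0)  # hypothetical seed
dataset = [simulate(rng.uniform(0.5, 3.0), rng.uniform(-1.0, 1.0), rng)
           for _ in range(240)]
train, val, test = dataset[:200], dataset[200:220], dataset[220:]
```

The mode labels returned by `simulate` are derived from the noise-free position, which is one plausible way to obtain segmentation ground truth for this system.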

C.1.2 Susceptible, Infected and Recovered (SIR) Disease Dataset

The SIR disease model in the epidemiological community has been widely studied in the literature (Toda, 2020; McMahon et al., 2020). The model can be defined as

$$\dot{S} = vN - \frac{\beta_{t}}{N}IS - dS,$$
$$\dot{I} = \frac{\beta_{t}}{N}IS - (\gamma + d)I,$$
$$\dot{R} = \gamma I - dR,$$

where the rate of transmission $\beta_{t}$ is time-varying and takes one of two discrete values, depending on whether school is in session:

$$\beta_{t} = \begin{cases} \hat{\beta} \cdot (1+b), & t \in \text{school in session}, \\ \hat{\beta}/(1+b), & t \in \text{school out of session}. \end{cases}$$

Following Hybrid-SINDy (Mangan et al., 2019), for dataset generation, the rates at which students enter and leave the population are set as $v = 1/365$ and $d = 1/365$. The total population of students is set as $N = 1000$. The recovery rate is set as $\gamma = 1/5$, assuming an average infectious period of 5 days. The base transmission rate is set as $\hat{\beta} = 9.336$, and $b = 0.8$ tunes the change in transmission rate. Following Hybrid-SINDy (Mangan et al., 2019), the concatenation $[S, I]$ of $S$ and $I$ is used as observations. Thus the target closed-form ordinary differential equations are

$$\begin{cases} \dot{S} = 2.74 - 0.0168\,IS - 0.0027\,S \;\;\text{and}\;\; \dot{I} = 0.0168\,IS - 0.20\,I, & t \in \text{school in session} \\ \dot{S} = 2.74 - 0.0052\,IS - 0.0027\,S \;\;\text{and}\;\; \dot{I} = 0.0052\,IS - 0.20\,I, & t \in \text{school out of session}. \end{cases} \tag{9}$$
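The numerical coefficients in Eq. (9) follow directly from the parameter values listed above. The short check below recomputes them; the function name `sir_coefficients` and the dictionary keys are hypothetical labels chosen for this illustration.

```python
N = 1000.0          # total student population
v = 1.0 / 365.0     # rate at which students enter the population
d = 1.0 / 365.0     # rate at which students leave the population
gamma = 1.0 / 5.0   # recovery rate
beta_hat = 9.336    # base transmission rate
b = 0.8             # tunes the transmission rate change


def sir_coefficients(in_session):
    """Coefficients of the S and I equations for one of the two modes."""
    beta_t = beta_hat * (1.0 + b) if in_session else beta_hat / (1.0 + b)
    return {
        "const": v * N,      # constant term in dS/dt
        "IS": beta_t / N,    # coefficient of the I*S interaction term
        "S": d,              # coefficient of S in dS/dt
        "I": gamma + d,      # coefficient of I in dI/dt
    }


for in_session in (True, False):
    coeffs = sir_coefficients(in_session)
    print(in_session, {name: round(value, 4) for name, value in coeffs.items()})
```

Rounded to the precision used in Eq. (9), the interaction coefficient is 0.0168 in session and 0.0052 out of session, while the remaining coefficients are shared by both modes.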

In a school year, the in-class periods are days 35-155 and 225-365, and the break periods are days 0-35 and 155-225. Instead of creating only one time series for training and one for validation as in Hybrid-SINDy, we scale up the dataset and sample 240 initial conditions for $S_{0}$, $I_{0}$, and $R_{0}$. Specifically, for each sample, we first sample an $R_{0}$ from the range $(900, 980)$, then sample an $I_{0}$ from the range $(0, 1000 - R_{0})$, and finally calculate $S_{0}$ as $S_{0} = 1000 - R_{0} - I_{0}$. We simulate each time series for 2 years with a daily interval, thus producing 730 time steps per time series. We add a random perturbation at the start of each session by changing the states $S$, $I$, and $R$ independently by either -2, -1, 0, 1, or 2. By default, we use the first 600 time steps as context and predict the next 130 time steps one by one, each based on the ground truth of the previous time step, i.e. one-step prediction. By default, the order of polynomial functions is set to 2, and the maximal number of possible modes is 3 for our methods.
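The session schedule and the sampling of initial conditions can be sketched as below. This is an illustration under assumptions: the half-open treatment of the period boundaries (day 35 counted as in session, day 155 as break), the continuous uniform sampling, the seed, and the function names are choices made here and are not stated in the text.

```python
import random


def school_in_session(day):
    """True if `day` (counted from the start of the simulation) is an in-class day.

    Boundaries are treated as half-open intervals, which is an assumption.
    """
    day_of_year = day % 365
    return 35 <= day_of_year < 155 or 225 <= day_of_year < 365


def sample_initial_condition(rng, population=1000.0):
    """Sample R0 first, then I0, and derive S0 so that the three sum to N."""
    r0 = rng.uniform(900.0, 980.0)
    i0 = rng.uniform(0.0, population - r0)
    s0 = population - r0 - i0
    return s0, i0, r0


rng = random.Random(0)  # hypothetical seed
initial_conditions = [sample_initial_condition(rng) for _ in range(240)]
mode_labels = [school_in_session(day) for day in range(730)]  # 2 years, daily
```

Under this boundary convention each year contains 260 in-class days, so the two modes are unevenly represented in every time series.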

C.1.3 Non-hybrid Physical Systems

Following Course & Nair (2023), non-hybrid physical systems include the Coupled linear, Cubic oscillator, Lorenz'63, Hopf bifurcation, Selkov glycolysis, and Duffing oscillator. A Damped linear oscillator is defined as $\dot{x} = -0.1x + 2y$ and $\dot{y} = -2x - 0.1y$. A Damped cubic oscillator is $\dot{x} = -0.1x^{3} + 2y^{3}$ and $\dot{y} = -2x^{3} - 0.1y^{3}$. A Coupled linear system is $\ddot{x} = -6x + 2y$ and $\ddot{y} = 2x - 6y$. A Duffing oscillator is $\dot{x} = y$ and $\dot{y} = -x^{3} + x - 0.35y$. A Selkov glycolysis system is $\dot{x} = -x + 0.08y + x^{2}y$ and $\dot{y} = 0.6 - 0.08y - x^{2}y$. A Lorenz'63 system is $\dot{x} = 10y - 10x$, $\dot{y} = 28x - xz - y$, and $\dot{z} = xy - 2.67z$. A Hopf bifurcation is $\dot{x} = 0.5x + y - x^{3} - xy^{2}$ and $\dot{y} = -x + 0.5y - x^{2}y - y^{3}$. We refer readers to Course & Nair (2023) for details. By default, the orders of polynomial functions for the Coupled linear, Cubic oscillator, Lorenz'63, Hopf bifurcation, Selkov glycolysis, and Duffing oscillator are 2, 3, 2, 3, 3, and 3, respectively, for our methods.
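For readers who want to reproduce trajectories, some of the vector fields listed above can be written directly as functions. The sketch below only transcribes the stated equations; the function names and the tuple-based state representation are choices made for this illustration, and the simulation settings of Course & Nair (2023) are not reproduced here.

```python
def lorenz63(state):
    """Lorenz'63 vector field with the coefficients given in the text."""
    x, y, z = state
    return (10.0 * y - 10.0 * x, 28.0 * x - x * z - y, x * y - 2.67 * z)


def duffing(state):
    """Duffing oscillator."""
    x, y = state
    return (y, -x**3 + x - 0.35 * y)


def selkov(state):
    """Selkov glycolysis model."""
    x, y = state
    return (-x + 0.08 * y + x**2 * y, 0.6 - 0.08 * y - x**2 * y)


def hopf(state):
    """Hopf bifurcation normal form with the coefficients given in the text."""
    x, y = state
    return (0.5 * x + y - x**3 - x * y**2, -x + 0.5 * y - x**2 * y - y**3)


SYSTEMS = {"Lorenz'63": lorenz63, "Duffing": duffing,
           "Selkov": selkov, "Hopf": hopf}
```

Each function maps a state tuple to its time derivative, so any standard ODE integrator can be applied to `SYSTEMS[name]`.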

C.1.4 ODE-driven Particle Dataset

Following GRASS (Liu et al., 2023), ordinary differential equations are introduced as motion modes to generate trajectories of particles, i.e. Lotka-Volterra, Spiral, and Bouncing Ball:

$$\begin{aligned} \text{Lotka-Volterra:}\quad & \dot{x} = x - xy;\;\; \dot{y} = -y + xy, \\ \text{Spiral:}\quad & \dot{x} = -0.1x^{3} + 2y^{3};\;\; \dot{y} = -2x^{3} - 0.1y^{3}, \\ \text{Bouncing Ball+:}\quad & \dot{x} = 0;\;\; \dot{y} = 2, \\ \text{Bouncing Ball-:}\quad & \dot{x} = 0;\;\; \dot{y} = -2 \end{aligned} \tag{10}$$

Balls of radius $r$ are placed on a square 2D canvas of size $64 \times 64$, with randomly initialized locations. Trajectories of the balls are generated from the numerical values of the different equations over time, which are mapped onto the canvas. To simulate mode-switching behaviors, the driving ODE modes of two objects are switched when they collide on the canvas. Different from GRASS (Liu et al., 2023), the single mode Bouncing Ball is treated as two modes, Bouncing Ball+ and Bouncing Ball-, in this work, since they have different explicit equations for equation discovery. In total, 4,928 samples are used for training, 191 for validation, and 204 for testing. Each trajectory has 150 time steps at 10 frames per second. By default, the order of polynomial functions is set to 3, and the maximal number of possible modes is 5 for our methods.
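One possible reading of the collision-triggered mode switching is that two colliding balls exchange their driving modes. The sketch below implements that reading only; the collision test (centre distance at most $2r$), the pairwise exchange, and the function name `swap_modes_on_collision` are assumptions and may differ from the GRASS data generator.

```python
import math


def swap_modes_on_collision(positions, modes, radius):
    """Exchange the ODE modes of every pair of balls that currently collide.

    Two balls are considered colliding when their centres are at most
    2 * radius apart (an assumed criterion).
    """
    modes = list(modes)  # copy so the caller's list is left untouched
    n_balls = len(positions)
    for i in range(n_balls):
        for j in range(i + 1, n_balls):
            (xi, yi), (xj, yj) = positions[i], positions[j]
            if math.hypot(xi - xj, yi - yj) <= 2.0 * radius:
                modes[i], modes[j] = modes[j], modes[i]
    return modes


positions = [(10.0, 10.0), (12.0, 10.0), (50.0, 50.0)]
modes = ["Spiral", "Lotka-Volterra", "Bouncing Ball+"]
print(swap_modes_on_collision(positions, modes, radius=2.0))
```

In this example the first two balls overlap and exchange modes, while the third ball is far away and keeps its mode.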

C.1.5 Salsa-dancing Dataset

Following GRASS (Liu et al., 2023), four modes are annotated and used in the Salsa-dancing dataset, i.e. "moving forward", "moving backward", "clockwise turning", and "counter-clockwise turning". In total, 1,321 samples are used for training and 156 for testing. Each sample has 100 time steps at 5 frames per second, of which 80 are used as context and the remaining 20 for prediction. The coordinates of the skeletal joints of dancers in 3D space are used as observations. In practice, for all methods, we utilize two representative joints, i.e. right hip and left hip. By default, the order of polynomial functions is set to 3, and the maximal number of possible modes is 5 for our methods.

C.2 More Implementation Details

For each dataset, we set different numbers of modes $K$ and orders of polynomial functions $D$ for our model. By default, $K=3$ and $D=2$ for the Mass-spring Hopper dataset, and $K=3$ and $D=2$ for the SIR dataset. $D$ for the Coupled linear, Cubic oscillator, Lorenz'63, Hopf bifurcation, Selkov glycolysis, and Duffing oscillator is 2, 3, 2, 3, 3, and 3, respectively. $K=5$ and $D=3$ for the ODE-driven particle dataset, and $K=5$ and $D=3$ for the Salsa-dancing dataset.
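To give a sense of how $D$ affects the size of the search space, the sketch below enumerates a polynomial candidate library. It assumes a SINDy-style library containing every monomial of total degree at most $D$, including the constant term; whether AMORE uses exactly this library is an assumption, and `polynomial_terms` is a hypothetical helper.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_terms(variables, order):
    """All monomials of total degree <= order, as tuples of variable names.

    The empty tuple stands for the constant term 1.
    """
    terms = [()]
    for degree in range(1, order + 1):
        terms.extend(combinations_with_replacement(variables, degree))
    return terms


# Mass-spring Hopper: two state variables (l, v) and D = 2.
hopper_terms = polynomial_terms(["l", "v"], 2)
print(hopper_terms)

# The library size matches the closed form C(n + D, D) for n variables.
for n_vars, order in [(2, 2), (2, 3), (3, 2), (2, 5)]:
    names = ["x%d" % i for i in range(n_vars)]
    assert len(polynomial_terms(names, order)) == comb(n_vars + order, order)
```

Under this assumption, raising $D$ from 2 to 5 for a two-dimensional state grows the library from 6 to 21 candidate terms per equation, which is why sparsity regularization matters for larger $D$.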

C.3 Statistics of Experiments

C.3.1 Mass-spring Hopper

Experiments with statistics on the Mass-spring Hopper dataset are reported in Tables 13 and 14, which are extended versions of Tables 1 and 2 in the main paper.

Table 13: Segmentation results with statistics on Mass-spring Hopper dataset.

| Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
|---|---|---|---|---|
| Hybrid-SINDy | 0.426 | 0.383 | 0.705 | 0.691 |
| AMORE (ours) | 0.928±0.011 | 0.967±0.013 | 0.991±0.005 | 0.993±0.007 |

Table 14: Forecasting results with statistics on Mass-spring Hopper dataset.

| Method | NMAE ↓ | NRMSE ↓ |
|---|---|---|
| LLMTime | 0.113±0.032 / 0.305±0.036 | 0.417±0.051 / 0.454±0.072 |
| SVI | 0.068±0.016 / 0.075±0.011 | 0.148±0.023 / 0.262±0.030 |
| AMORE (ours) | 0.008±0.003 / 0.039±0.008 | 0.026±0.005 / 0.059±0.006 |

C.3.2 SIR disease

Experiments with statistics on the SIR disease dataset are reported in Tables 15 and 16, which are extended versions of Tables 3 and 4 in the main paper.

Table 15: Segmentation results with statistics on the SIR disease dataset.

| Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
|---|---|---|---|---|
| Hybrid-SINDy | 0.296 | 0.283 | 0.538 | 0.519 |
| AMORE (ours) | 0.475±0.027 | 0.483±0.032 | 0.731±0.054 | 0.735±0.051 |

Table 16: Forecasting results of Susceptible/Infected with statistics on the SIR disease dataset.

| Method | NMAE ↓ | NRMSE ↓ |
|---|---|---|
| LLMTime | 0.352±0.073 / 0.396±0.091 | 0.481±0.084 / 0.523±0.096 |
| SVI | 0.257±0.031 / 0.273±0.054 | 0.355±0.050 / 0.401±0.078 |
| AMORE (ours) | 0.088±0.012 / 0.113±0.018 | 0.142±0.029 / 0.181±0.035 |

C.4 Additional Ablation Studies

C.4.1 Sampling intervals Analysis

In our experiments, we follow the experimental setup of Hybrid-SINDy for the sampling intervals of the Mass-spring Hopper dataset and the SIR disease dataset. That is, we use their standard sampling intervals $\Delta_{t}$, e.g. $\Delta_{t} = 0.033$ on the Mass-spring Hopper dataset. In Table 17, we report the segmentation results as $\Delta_{t}$ increases. We double $\Delta_{t}$ each time and thus obtain $\{0.033, 0.066, 0.132, 0.264\}$. When $\Delta_{t} \geq 0.132$, the segmentation performance of Hybrid-SINDy decreases considerably due to the disruption of temporal patterns. The performance of our model also decreases, but it remains higher than that of Hybrid-SINDy on all four metrics. The reason is that such a coarse discretization disrupts the original temporal patterns of the time series, so a model trained on the coarsely sampled data performs significantly worse on labels that are annotated based on the original temporal patterns.

Table 17: Analyses of $\Delta_{t}$ on segmentation results of the Mass-spring Hopper dataset.

| Sampling interval $\Delta_{t}$ | Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
|---|---|---|---|---|---|
| 0.033 | Hybrid-SINDy | 0.426 | 0.383 | 0.705 | 0.691 |
| 0.033 | AMORE (ours) | 0.928±0.011 | 0.967±0.013 | 0.991±0.005 | 0.993±0.007 |
| 0.066 | Hybrid-SINDy | 0.422 | 0.385 | 0.701 | 0.697 |
| 0.066 | AMORE (ours) | 0.925±0.017 | 0.973±0.014 | 0.986±0.007 | 0.982±0.010 |
| 0.132 | Hybrid-SINDy | 0.235 | 0.201 | 0.447 | 0.413 |
| 0.132 | AMORE (ours) | 0.458±0.021 | 0.369±0.016 | 0.627±0.013 | 0.644±0.017 |
| 0.264 | Hybrid-SINDy | 0.226 | 0.183 | 0.382 | 0.376 |
| 0.264 | AMORE (ours) | 0.417±0.015 | 0.335±0.008 | 0.574±0.020 | 0.580±0.012 |
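The doubling of the sampling interval can be illustrated with a simple subsampling of a series recorded at the base interval. Whether the coarser datasets were obtained by subsampling or by re-simulating with a larger step is not stated, so the subsampling below, and the helper name `subsample`, are assumptions made for illustration.

```python
def subsample(series, factor):
    """Keep every `factor`-th time step, multiplying the sampling interval by `factor`."""
    return series[::factor]


base_interval = 0.033
base_series = list(range(150))  # placeholder for a 150-step time series

for factor in (1, 2, 4, 8):
    coarse = subsample(base_series, factor)
    print(round(base_interval * factor, 3), len(coarse))
```

Under this assumption, the coarsest setting leaves only 19 of the original 150 time steps, which gives an intuition for why short segments become hard to detect.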

C.4.2 Number of Training Samples Analysis

To answer the question "Given the significantly smaller datasets used by Hybrid-SINDy, can the proposed method maintain this level of performance difference?", we rerun experiments on the Mass-spring Hopper dataset, varying the number of samples in the training set from 3 (the same as Hybrid-SINDy) to 20 and 200. The comparison results are summarized in Tables 18, 19, and 20. In the few-shot setting with a very low number of samples, e.g. 3 samples, Hybrid-SINDy outperforms AMORE on segmentation and equation reconstruction. This is expected and is a common limitation of deep learning methods, which usually require larger numbers of training samples. On the other hand, when given more samples, e.g. 20 or more, AMORE outperforms Hybrid-SINDy consistently.

Table 18: Analyses of numbers of training samples on segmentation results of the Mass-spring Hopper dataset.

| Number of training samples | Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
|---|---|---|---|---|---|
| 3 | Hybrid-SINDy | 0.425 | 0.377 | 0.693 | 0.684 |
| 3 | AMORE (ours) | 0.238±0.052 | 0.217±0.065 | 0.474±0.134 | 0.429±0.110 |
| 20 | Hybrid-SINDy | 0.422 | 0.383 | 0.698 | 0.693 |
| 20 | AMORE (ours) | 0.774±0.037 | 0.762±0.025 | 0.846±0.094 | 0.853±0.071 |
| 200 | Hybrid-SINDy | 0.426 | 0.383 | 0.705 | 0.691 |
| 200 | AMORE (ours) | 0.928±0.011 | 0.967±0.013 | 0.991±0.005 | 0.993±0.007 |

Table 19: Analyses of numbers of training samples on forecasting results of Location/Velocity on the Mass-spring Hopper dataset.

| Number of training samples | Method | NMAE ↓ | NRMSE ↓ |
|---|---|---|---|
| 3 | LLMTime | 0.113±0.032 / 0.305±0.036 | 0.417±0.051 / 0.454±0.072 |
| 3 | SVI | 0.173±0.039 / 0.341±0.053 | 0.450±0.081 / 0.481±0.094 |
| 3 | AMORE (ours) | 0.091±0.018 / 0.160±0.026 | 0.315±0.049 / 0.348±0.042 |
| 20 | LLMTime | 0.113±0.032 / 0.305±0.036 | 0.417±0.051 / 0.454±0.072 |
| 20 | SVI | 0.094±0.020 / 0.147±0.024 | 0.302±0.038 / 0.381±0.044 |
| 20 | AMORE (ours) | 0.036±0.012 / 0.057±0.018 | 0.106±0.025 / 0.129±0.031 |
| 200 | LLMTime | 0.113±0.032 / 0.305±0.036 | 0.417±0.051 / 0.454±0.072 |
| 200 | SVI | 0.068±0.016 / 0.075±0.011 | 0.148±0.023 / 0.262±0.030 |
| 200 | AMORE (ours) | 0.008±0.003 / 0.039±0.008 | 0.026±0.005 / 0.059±0.006 |

Table 20: Analyses of numbers of training samples on reconstruction errors (RER) of discovered equations on the Mass-spring Hopper dataset. Values are in units of 1e-3.

| Number of training samples | Method | RER (1e-3) ↓ |
|---|---|---|
| 3 | Hybrid-SINDy | 8.3 |
| 3 | AMORE (ours) | 17.2±2.4 |
| 20 | Hybrid-SINDy | 8.2 |
| 20 | AMORE (ours) | 5.1±0.6 |
| 200 | Hybrid-SINDy | 7.5 |
| 200 | AMORE (ours) | 2.4±0.3 |

C.4.3 Count Variables Analysis

The count variables are introduced by REDSDS (Ansari et al., 2021) to learn the duration distribution of each mode from the data and to avoid frequent mode-switching behaviors. We show below some ablation studies on count variables in the Mass-spring Hopper system, where the flying mode usually takes more than twice as many time steps as the compression mode. To quantitatively compare the discovered equations, we first report the equation reconstruction error (RER) for Hybrid-SINDy, AMORE, and AMORE w/o count variables, which are 7.5e-3, 2.4e-4, and 2.8e-4, respectively. With count variables, AMORE has a lower equation reconstruction error than its counterpart without count variables. Tables 21 and 22 show that count variables help AMORE learn fewer false-positive mode-switching behaviors, benefitting segmentation and forecasting.

Table 21: Analysis of count variables on segmentation results of the Mass-spring Hopper dataset.

| Method | NMI ↑ | ARI ↑ | Accuracy ↑ | $F_1$ ↑ |
|---|---|---|---|---|
| Hybrid-SINDy | 0.426 | 0.383 | 0.705 | 0.691 |
| AMORE (ours) | 0.928±0.011 | 0.967±0.013 | 0.991±0.005 | 0.993±0.007 |
| AMORE w/o count (ours) | 0.903±0.017 | 0.929±0.019 | 0.970±0.012 | 0.975±0.013 |

Table 22: Analysis of count variables on forecasting results of Location/Velocity on the Mass-spring Hopper dataset.

| Method | NMAE ↓ | NRMSE ↓ |
|---|---|---|
| LLMTime | 0.113±0.032 / 0.305±0.036 | 0.417±0.051 / 0.454±0.072 |
| SVI | 0.068±0.016 / 0.075±0.011 | 0.148±0.023 / 0.262±0.030 |
| AMORE (ours) | 0.008±0.003 / 0.039±0.008 | 0.026±0.005 / 0.059±0.006 |
| AMORE w/o count (ours) | 0.014±0.004 / 0.046±0.007 | 0.052±0.011 / 0.068±0.014 |
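False-positive mode switches of the kind discussed above can be inspected by run-length encoding a predicted mode sequence. The sketch below is only a diagnostic written for illustration; it is not the count-variable mechanism of REDSDS, and the helper names and the toy sequences are hypothetical.

```python
from itertools import groupby


def segment_durations(modes):
    """Run-length encode a mode sequence into (mode, duration) pairs."""
    return [(mode, sum(1 for _ in run)) for mode, run in groupby(modes)]


def count_switches(modes):
    """Number of mode changes along the sequence."""
    return max(len(segment_durations(modes)) - 1, 0)


# Toy example: label 1 for flying, 0 for compression.
reference = [1] * 8 + [0] * 3 + [1] * 8
prediction = list(reference)
prediction[4] = 0  # a single spurious flip inside a flying segment

print(segment_durations(reference), count_switches(reference))
print(segment_durations(prediction), count_switches(prediction))
```

A single mislabeled time step adds two extra switches and creates a segment of duration one, which is the kind of implausibly short segment that a learned duration distribution is meant to discourage.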

C.4.4 Polynomial orders and mode numbers analysis

To qualitatively show the discovered equations when the order of candidates and the number of modes are increased, we increase the order of candidates from 2 to 5, i.e. $D=5$, and the number of modes from 3 to 5, i.e. $K=5$, on the Mass-spring Hopper dataset. The discovered equations are summarized in Table 23. Our model identifies exactly 2 modes, i.e. the same as the ground truth, in both settings, even though more potential modes are available. Moreover, thanks to the sparsity regularization, the discovered equations of the 2 modes do not involve irrelevant function terms when the order of the polynomial basis functions is increased.

Table 23: Analysis of equation discovery of AMORE when increasing the number of modes $K$ and the order of candidate basis functions $D$ on the Mass-spring Hopper dataset.

| Settings | Discovered Equations | Ground-truth Equations |
|---|---|---|
| $K=3$ and $D=2$ | $\dot{l}=v$ and $\dot{v}=11.03-10.08l$; $\dot{l}=v$ and $\dot{v}=-1$ | $\dot{l}=v$ and $\dot{v}=11-10l$; $\dot{l}=v$ and $\dot{v}=-1$ |
| $K=5$ and $D=5$ | $\dot{l}=v$ and $\dot{v}=10.95-10.06l$; $\dot{l}=v$ and $\dot{v}=-1$ | $\dot{l}=v$ and $\dot{v}=11-10l$; $\dot{l}=v$ and $\dot{v}=-1$ |
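As a quick reading aid for Table 23, the coefficients of the compression-mode equation can be compared with the ground truth term by term. The relative error used below is a simple illustrative measure and is not necessarily the RER metric reported elsewhere in this appendix; the dictionary layout and the helper `relative_errors` are hypothetical.

```python
# Coefficients of dv/dt = const + c_l * l in the compression mode (Table 23).
ground_truth = {"const": 11.0, "l": -10.0}
discovered = {
    "K=3, D=2": {"const": 11.03, "l": -10.08},
    "K=5, D=5": {"const": 10.95, "l": -10.06},
}


def relative_errors(found, truth):
    """Per-term relative error |found - truth| / |truth|."""
    return {name: abs(found[name] - truth[name]) / abs(truth[name])
            for name in truth}


for setting, coeffs in discovered.items():
    errors = relative_errors(coeffs, ground_truth)
    print(setting, {name: round(100.0 * err, 2) for name, err in errors.items()})
```

In both settings every coefficient deviates from the ground truth by less than 1%, and the flying-mode equation $\dot{v}=-1$ is reported without deviation.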