Towards Learning Stochastic Population Models
by Gradient Descent

Justin N. Kreikemeyer (ORCID 0000-0002-4109-3608), Philipp Andelfinger (ORCID 0000-0002-0211-7136), and Adelinde M. Uhrmacher (ORCID 0000-0001-5256-4682)
Institute for Visual and Analytic Computing, University of Rostock, Albert-Einstein-Straße 22, 18059 Rostock, Germany
(2024)
Abstract.

Increasing effort is put into the development of methods for learning mechanistic models from data. This task entails not only the accurate estimation of parameters but also a suitable model structure. Recent work on the discovery of dynamical systems formulates this problem as a linear equation system. Here, we explore several simulation-based optimization approaches, which allow much greater freedom in the objective formulation and weaker conditions on the available data. We show that even for relatively small stochastic population models, simultaneous estimation of parameters and structure poses major challenges for optimization procedures. Particularly, we investigate the application of the local stochastic gradient descent method, commonly used for training machine learning models. We demonstrate accurate estimation of models but find that enforcing the inference of parsimonious, interpretable models drastically increases the difficulty. We give an outlook on how this challenge can be overcome.

automatic model generation, gradient descent, stochastic simulation algorithm, discrete-event simulation, differentiable simulation
journalyear: 2024
copyright: acmlicensed
conference: 38th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation; June 24–26, 2024; Atlanta, GA, USA
booktitle: 38th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (SIGSIM PADS ’24), June 24–26, 2024, Atlanta, GA, USA
doi: 10.1145/3615979.3656058
isbn: 979-8-4007-0363-8/24/06
submissionid: 3965
ccs: Computing methodologies → Discrete-event simulation
ccs: Computing methodologies → Modeling methodologies
ccs: Computing methodologies → Machine learning

1. Introduction

Statistical machine learning methods provide exciting advances in automatically learning (deep) models from data. Whereas these models exhibit impressive predictive abilities (Noé et al., 2020), their black-box nature does not directly contribute to understanding the reference system’s mechanics and impedes precise manual refinement. This motivated the development of methods for automatically deriving (white-box) mechanistic models from data (Nobile et al., 2013; Brunton et al., 2016; Klimovskaia et al., 2016; Burrage et al., 2024; Martinelli et al., 2023). With these, manual, hypothesis-driven knowledge discovery can increasingly be augmented by automatic, data-driven approaches (Maass et al., 2018). Such an automatic modeling approach is useful when (parts of) the mechanisms of the reference system are unknown, but there are measurements of its behavior over time. Learning mechanistic models from data then entails not only parameter estimation but also the simultaneous identification of a suitable model structure.

In this paper, we study learning stochastic, discrete-event models with an underlying continuous representation of time from time-series snapshots of some traversed state distributions by gradient descent. Specifically, we focus on Markovian population models that are expressed as reaction systems. Our contributions are:

  • Section 5 provides different possible formulations of the model learning problem.

  • Section 5.1 shows how reparametrization enables parameter estimation over different orders of magnitude.

  • Section 6 provides first results on the simultaneous learning of structure and parameters by gradient descent. It discusses the challenges and opportunities of the approach.

We briefly introduce the reaction system formalism in Section 2 and stochastic gradient estimation in Section 3. Section 4 reviews related work. After presenting our methods in Section 5 as outlined above, we conclude in Section 6.

2. Population-based Modeling

In the biology and chemistry domains, reaction systems are a commonly used modeling formalism (Keating et al., 2020). They describe system dynamics in terms of the consumption and production of entities at certain rates. Their underlying assumption is that entities can be grouped into homogeneous populations of species $S_i, i \in \{1, \dots, n_S\}$ residing in a well-stirred medium. A reaction takes the form

\[
R_i : \sum_{j=1}^{n_S} c_{ij} S_j \xrightarrow{r_i} \sum_{j=n_S+1}^{2n_S} c_{ij} S_{j-n_S}
\]

with $\mathbf{C} \in \mathbb{N}^{n_R \times 2n_S}$ being a matrix of coefficients (“model structure”), $\mathbf{r}$ the vector of rate constants (“parameters”), and $n_R, n_S$ the number of reactions and species in the system, respectively. A reaction system can be completely represented by providing $\mathbf{C}$ and $\mathbf{r}$. A vector of species counts gives the starting conditions of a reaction system, i.e., $\mathbf{S}_{init}$.

As a running example, consider the well-studied SIR model of disease spread, comprising three species representing populations of susceptible, infected and recovered individuals:

\[
\begin{aligned}
R_0 &\colon 1S + 1I \xrightarrow{0.02} \fcolorbox{red}{white}{2}I \\
R_1 &\colon 1I \xrightarrow{5.00} 1R
\end{aligned}
\quad\Rightarrow\quad
\begin{aligned}
\mathbf{C}_{SIR} &= \left(\begin{array}{cccccc} 1 & 1 & 0 & 0 & \fcolorbox{red}{white}{2} & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \end{array}\right) \\
\mathbf{r}_{SIR} &= \left(\begin{array}{cc} 0.02 & 5.00 \end{array}\right)^{T}
\end{aligned}
\]

This reaction system has two reactions with coefficient matrix $\mathbf{C}_{SIR}$ and rate vector $\mathbf{r}_{SIR}$. The first reaction describes the infection of a susceptible individual and the second its recovery. Note the correspondence between entries in $\mathbf{C}$ and $R_i$ indicated by the red box. Species participating with coefficient 0 are omitted from $R_i$. We will use $\mathbf{S}_{init} = (1980 \;\, 20 \;\, 0)$ as the initial state for the S, I, and R species, respectively.
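To make the representation concrete, the following minimal sketch encodes the SIR example as arrays. The variable names and the use of NumPy are illustrative assumptions, not part of the original work. The left half of each row of the coefficient matrix holds the reactant coefficients and the right half the product coefficients, so the net state change of a reaction is their difference.

```python
import numpy as np

# Columns 0..2: reactant coefficients for (S, I, R).
# Columns 3..5: product coefficients for (S, I, R).
C_SIR = np.array([
    [1, 1, 0, 0, 2, 0],  # R0: 1S + 1I -> 2I
    [0, 1, 0, 0, 0, 1],  # R1: 1I -> 1R
])
r_SIR = np.array([0.02, 5.00])    # rate constants
S_init = np.array([1980, 20, 0])  # initial counts of S, I, R

n_S = C_SIR.shape[1] // 2
reactants = C_SIR[:, :n_S]
products = C_SIR[:, n_S:]

# Net change in species counts caused by one firing of each reaction.
net_change = products - reactants
print(net_change)
```

Each row of `net_change` sums to zero here, which reflects that both SIR reactions conserve the total population.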

Population-based models defined as reaction systems can be simulated either through numerical integration with ordinary differential equation (ODE) semantics (Kurtz, 1972; Hahl and Kremling, 2016) or the stochastic simulation algorithm (SSA) (Gillespie, 1976) with continuous-time Markov chain (CTMC) semantics. In many cases, stochastic effects cannot be ignored (Ramaswamy et al., 2012; McAdams and Arkin, 1999). Therefore, instead of focusing on the mean continuous dynamics, our approach will take the stochasticity of the system into account.

The vector of species counts $\mathbf{S}_t$ fully represents the state of the model at the current time $t$. We make the common assumption that the transition probabilities are governed by the probability of two entities in the well-stirred medium reacting, so the transitions of the CTMC are governed by the stochastic mass action law (Kurtz, 1972; Gillespie, 1976). The effective rate of a reaction in a given state is called its propensity $\alpha$. For example, for the SIR model, we have $\alpha_0 = 0.02 \cdot S \cdot I$, i.e., the more susceptible and infected individuals there are, the likelier an infection event is to happen. Note that other functions may be used to calculate the propensity depending on the modeled system. Another common assumption is that the probability of more than two species colliding (interacting) is very low. Thus, we only consider binary reactions with at most two reactants. Despite making these assumptions here for simplicity, our approach is theoretically able to accommodate any dependence of the propensities on the state as well as n-ary reactions.
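As an illustration of the propensity calculation, the sketch below computes stochastic mass-action propensities for the SIR example. The function name is hypothetical. The treatment of reactant coefficients larger than one via binomial coefficients is an assumption (the common combinatorial convention); the text above only states the case $\alpha_0 = 0.02 \cdot S \cdot I$, which this sketch reproduces.

```python
from math import comb

def propensities(state, reactants, rates):
    """Return one propensity per reaction: the rate constant times the
    number of distinct reactant combinations available in `state`.
    Assumption: a coefficient c on a species with count s contributes
    comb(s, c) combinations."""
    alphas = []
    for coeffs, rate in zip(reactants, rates):
        combos = 1
        for count, c in zip(state, coeffs):
            combos *= comb(int(count), int(c))
        alphas.append(rate * combos)
    return alphas

# SIR example: reactant coefficients for (S, I, R) per reaction.
reactants = [[1, 1, 0],   # R0: S + I -> 2I
             [0, 1, 0]]   # R1: I -> R
rates = [0.02, 5.00]
alphas = propensities([1980, 20, 0], reactants, rates)
print(alphas)
```

For the initial state, the infection propensity is 0.02 · 1980 · 20 and the recovery propensity is 5 · 20, so infection events initially dominate.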

As a simulator, we use Gillespie’s direct method (Gillespie, 1976), which takes sample trajectories through the CTMC defined by $\mathbf{C}$ and $\mathbf{r}$ using a Monte Carlo strategy. At each event, $t$ is advanced according to an exponential distribution based on the sum of the propensities $\alpha_i$. The state is updated by choosing from a categorical distribution over the reactions, subtracting the reactants, and adding the products. With the number of samples approaching infinity, the probability distribution over system states and time (likelihood) is obtained.
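The simulation loop described above can be sketched as follows. This is a minimal, unoptimized illustration and not the authors' implementation. The function name `ssa_direct`, the binomial form of the propensities, and the choice to stop at a fixed end time are assumptions made for this example.

```python
import numpy as np
from math import comb

def ssa_direct(C, r, S_init, t_end, rng):
    """Sample one trajectory with Gillespie's direct method.
    C has shape (n_R, 2 * n_S): reactant then product coefficients."""
    C = np.asarray(C)
    n_S = C.shape[1] // 2
    reactants, products = C[:, :n_S], C[:, n_S:]
    state = np.array(S_init, dtype=np.int64)
    t = 0.0
    times, states = [t], [state.copy()]
    while True:
        # Stochastic mass-action propensities (binomial form assumed).
        alpha = np.array([
            rate * np.prod([comb(int(s), int(c)) for s, c in zip(state, row)])
            for row, rate in zip(reactants, r)
        ], dtype=float)
        total = alpha.sum()
        if total <= 0.0:
            break  # no reaction can fire anymore
        # Waiting time is exponential with rate equal to the propensity sum.
        t += rng.exponential(1.0 / total)
        if t > t_end:
            break
        # Choose the reaction from a categorical distribution.
        i = rng.choice(len(alpha), p=alpha / total)
        state = state - reactants[i] + products[i]
        times.append(t)
        states.append(state.copy())
    return np.array(times), np.array(states)

C_SIR = [[1, 1, 0, 0, 2, 0], [0, 1, 0, 0, 0, 1]]
rng = np.random.default_rng(0)
times, states = ssa_direct(C_SIR, [0.02, 5.00], [1980, 20, 0], 1.0, rng)
print(len(times), states[-1])
```

Averaging many such trajectories at fixed snapshot times gives an estimate of the mean time series. Histograms over replications estimate the state distribution.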

3. Stochastic Gradient Estimation

When there is a closed form of the likelihood, its gradient is an effective tool for optimization. However, a closed form is unattainable for many real-world systems, necessitating Gillespie’s SSA. Determining the gradient of this algorithm is not straightforward. The well-established method of automatic differentiation (AD) provides performant means to calculate the gradient of algorithms at runtime (Margossian, 2019). However, this gradient cannot account for the jumps (discontinuities) inherent to individual SSA trajectories, resulting from the discrete state changes. So even with the mean over trajectories being a smooth function, AD is not useful for optimization.

Thus, we resort to recent advances in estimating the gradient of an alternative objective function, which is smoothed over jumps (Kreikemeyer and Andelfinger, 2023). For simplicity, we use a finite-differences estimator with stochastic step size (cf. Polyak, 1987, Chapter 3.4; for further analysis, see Nesterov and Spokoiny, 2017):

\[
\nabla f(\mathbf{\theta}) \approx \frac{1}{N} \sum_{n=1}^{N} \frac{f(\mathbf{\theta} + \sigma \mathbf{u}) - f(\mathbf{\theta})}{\sigma} \mathbf{u} \tag{1}
\]

where $\mathbf{\theta}$ is the parameter vector, $\sigma$ is a smoothing factor that determines the smoothing applied to the objective $f$, and $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is a vector of i.i.d. normal variates with mean $0$ and variance $1$. In contrast to finite differences, which need at least one sample per dimension of $\mathbf{\theta}$, through simultaneous perturbation, this estimator requires only two samples for estimating the full gradient. For the number of samples $n$ approaching infinity, the estimate converges to the gradient of a smoothed version of $f$ (Nesterov and Spokoiny, 2017). Further, it can handle jumps and noise in the objective through the smoothing controlled by $\sigma$.
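A minimal sketch of the estimator in Equation (1) is given below. The function name and the step-function test objective are illustrative assumptions. As a simplification, the sketch evaluates the unperturbed objective once and reuses it for all samples, which is one possible choice when the objective is deterministic or when its noise is averaged elsewhere.

```python
import numpy as np

def smoothed_gradient(f, theta, sigma, n_samples, rng):
    """Estimate the gradient of a Gaussian-smoothed version of f at theta
    by averaging randomly perturbed forward differences."""
    theta = np.asarray(theta, dtype=float)
    f0 = f(theta)  # simplification: evaluated once, reused for all samples
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)  # u ~ N(0, I)
        grad += (f(theta + sigma * u) - f0) / sigma * u
    return grad / n_samples

def step(x):
    """Discontinuous objective: a unit jump at x[0] = 0."""
    return 1.0 if x[0] > 0 else 0.0

rng = np.random.default_rng(1)
g = smoothed_gradient(step, [0.0], sigma=0.5, n_samples=20000, rng=rng)
print(g)
```

At the jump, the pointwise derivative of `step` is undefined, whereas the estimate approaches the derivative of the smoothed step, $1 / (\sigma \sqrt{2\pi})$, which is informative for descent.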

4. Related Work

Originating from system identification (Kozin and Natke, 1986), learning mechanistic models has recently inspired various research in many application fields (Tan et al., 2023; Askari and Crevecoeur, 2023; Martinelli et al., 2023). Related to our work, two major approaches can be distinguished: genetic programming, which for the first time provided strategic means of searching in the space of programs or models (Koza et al., [n. d.]; Nobile et al., 2013), and sparse regression, which identifies short yet accurate symbolic expressions, such as differential equations (Daniels and Nemenman, 2015; Brunton et al., 2016). Recently, these approaches have also been combined, e.g., to discover multibody physics systems (Askari and Crevecoeur, 2023).

Specifically in the case of reaction networks, (Nobile et al., 2013) proposed genetic programming to identify reaction systems with ODE semantics. A population of candidate structures is evolved, and evolutionary operators are applied based on the candidates’ fitness. To accurately rank a structure, its fitness is determined by the best solution found with particle swarm optimization and numerical integration. The authors of (Martinelli et al., 2023) propose a statistical search algorithm called Reactmine to infer chemical reactions with ODE semantics. In (Klimovskaia et al., 2016), the sparse identification of non-linear dynamics (SINDy) (Brunton et al., 2016) is adapted to the stochastic semantics (cf. Section 2). This is achieved by working with the moment-equations of the CTMC, an ODE system describing the time-evolution of the Markov chain’s moments. A two-step regression approach is employed to achieve robustness against heteroscedastic, noisy measurements and reaction constants of different magnitudes.

A recent publication adjusts the SINDy approach to accommodate coupled differential equations such as those resulting from the ODE semantics of reaction networks (Burrage et al., 2024); (Jiang et al., 2022) also brings SINDy to the case of biochemical systems with mass-action kinetics accounting for uncertainty and enabling an informed model selection.

In contrast to the above, here we aim at a simulation-based optimization approach, which also allows, e.g., the straightforward inclusion of unmeasured species, arbitrary kinetics, and accounting for probability distributions (instead of their moments). Further, our proposed methods do not rely on numerical differentiation of the time-series data, which can be inaccurate in the presence of noise and large or uneven sampling intervals.

Using gradient descent for parameter estimation of simulation models also saw great interest recently (Andelfinger, 2023; Chopra et al., 2023), including biochemical reaction systems (Wang et al., 2010). In (Yang et al., 2020), gradient descent enables Bayesian inference over general ODE models.

5. Learning Reaction Systems with Gradient Descent

Consider a reaction system $\mathbf{R}$ with coefficients $\mathbf{C}$, stochastic rate constants $\mathbf{r} \in \mathbb{R}^{n}$ and initial populations $\mathbf{S}_{init}$. Assuming the structure $\mathbf{C}$ of the model is known, we can simulate trajectories over states $\mathbf{S}_t, t \geq 0$ of the CTMC given by $\mathbf{C}$ and a certain parametrization $\mathbf{r}$. Typically, we want the trajectories produced by $\mathbf{R}$ to resemble the behavior of a reference system. To achieve this, suitable parameter values $\mathbf{r}$ have to be estimated from collected time-series data: Given measurements $D_t$ at discrete times $t \in \{1, \dots, n\}$, the goal is to maximize the likelihood $\mathcal{L}(D; \mathbf{r})$ or some other measure of goodness of fit. Determining the parameters $\mathbf{r}$ that maximize the likelihood is also referred to as the inverse problem, since “forward” simulation provides a sample from $\mathcal{L}$ for a given $\mathbf{r}$.

Here, our goal is to simultaneously infer the structure of the model, i.e., we try to find $\mathbf{r}$ and $\mathbf{C}$, such that $\mathcal{L}(D; \mathbf{C}, \mathbf{r})$ is maximal. Obviously, this is a much harder task than just estimating parameters, as the degrees of freedom in the inverse problem are drastically increased. Further, the optimization landscape will exhibit additional jumps, introduced by the discrete entries in $\mathbf{C}$. In fact, we can formulate the problem with varying degrees of smoothness (prior to considering a smoothed objective, cf. Section 3). The following formulations are adapted to the goal of recovering the SIR model (cf. Section 2), which we later use for evaluation.

Library of Reactions. Our first problem formulation is inspired by the use of reaction libraries in (Burrage et al., 2024; Klimovskaia et al., 2016). This approach can directly be translated to a simulation-based optimization problem: the reaction system to optimize comprises (a selection of) all reactions for a given number of species. The task is to adjust $\mathbf{r}$, where reactions $i$ with $r_i$ below a certain threshold are dropped from the final model. Our library consists of the $36$ binary reactions that abide by the conservation law $S + I + R = 2000$. This problem is completely smooth in all dimensions.

Coefficient Steps. In the second problem formulation, we fix the number of reactions to two and try to adjust $\mathbf{C}$ with $c_{ij} \in \{0, 1, 2\}$ and $\mathbf{r}$ directly, yielding a $14$-dimensional problem (two reactions with six coefficients each, plus two rate constants). This problem is non-smooth in the coefficient dimensions.

Reaction Steps. In the third formulation, we again work with a library of reactions but introduce a (continuous) ranking vector of the same dimensionality as $\mathbf{r}$. In each simulation run, only the two reactions with the highest rank are considered, enforcing a certain model size. The task is then to adjust the ranking together with the two rate parameters, one for each reaction in the top two.

Library of Systems. The final formulation, which we adopt for didactic purposes, is a brute-force approach. It simultaneously optimizes the $1260$ rates for all possible combinations of two reactions from our library of $36$. Although running one optimization per model would be much more performant, this example showcases the gradient estimator’s ability to steer the rate adjustment for large numbers of structures.
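The problem sizes quoted in the formulations above can be reproduced with a short enumeration. The sketch below rests on an assumption that the text does not spell out: library reactions have one or two reactants, the same number of products (so that $S + I + R$ is conserved), and differing left and right sides. Under this assumption, the library has 36 entries. The helper `top_two` is a hypothetical illustration of the selection used in Reaction Steps.

```python
from itertools import combinations, combinations_with_replacement

SPECIES = ("S", "I", "R")

def reaction_library(species=SPECIES, max_order=2):
    """Enumerate population-conserving reactions with up to `max_order`
    reactants. Assumed rule: equally many reactants and products, and
    the two sides must differ."""
    library = []
    for order in range(1, max_order + 1):
        sides = list(combinations_with_replacement(species, order))
        for lhs in sides:
            for rhs in sides:
                if lhs != rhs:
                    library.append((lhs, rhs))
    return library

def top_two(ranking):
    """Indices of the two highest-ranked reactions (Reaction Steps)."""
    order = sorted(range(len(ranking)), key=lambda i: ranking[i], reverse=True)
    return order[:2]

library = reaction_library()
pairs = list(combinations(range(len(library)), 2))  # Library of Systems
n_rates = 2 * len(pairs)                            # two rates per pair
print(len(library), len(pairs), n_rates)
```

Both SIR reactions, `(("S", "I"), ("I", "I"))` and `(("I",), ("R",))`, are contained in this library, so the original model corresponds to one of the enumerated pairs.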

Generally, more than one reaction system can produce trajectories from the distribution in $D$ (Craciun and Pantea, 2008). It is often hard to choose the “right” system automatically, so the choice must involve domain experts (Jiang et al., 2022). However, certain criteria can constrain the optimization process to desirable solutions, such as parsimony (choosing a low number of reactions producing a good fit) and prior knowledge (such as number of species, conservation laws, or even known reactions). Some of these constraints may result in an NP-hard problem for which the best-known solution is brute force (Gupte and Vaikuntanathan, 2022). This can be overcome, e.g., by regularization (like in SINDy) and relaxation.

As we will demonstrate on the example of the problems above, there is a tradeoff between the ability to strongly enforce these constraints and the smoothness of the objective function, which in turn determines the difficulty of the optimization task.

Figure 1. The SIR model’s response surface (left) and the effect of reparametrization (right). A darker color equals a lower loss and the star marks the optimum.
Figure 2. Convergence of gradient descent on the four problems (top) and chosen inferred models (bottom). Progress is shown on the unsmoothed objective; the optimal solution has a loss of about $0.01$ (depending on the inferred system’s stochasticity). The reaction system depicted for Library of Reactions shows only the top 3 of the 17 learned reactions above the threshold $10^{-4}$.

5.1. Reparametrization

In both parameter estimation and structure identification, order-of-magnitude differences in the rate constants pose a problem for optimization: the appropriate step size depends on the dimension of $\mathbf{r}$. This has been tackled in (Klimovskaia et al., 2016) by a separate optimization to determine the orders of magnitude. The authors of (Nobile et al., 2022) use hand-crafted and learned dilation functions. Here, we use a simple logarithmic reparametrization, decreasing the dynamic range of the parameters:

\[
\mathbf{r}' = \exp(a\mathbf{r} + c) - \exp(c) \text{, with } a = \frac{1}{4} \text{ and } c = -20
\]

Optimizing in this space means that a step in $\mathbf{r}$ between, e.g., $0.1$ and $0.2$ is the same as between $1$ and $2$. The specific shifting and scaling ensure (1) that the value $r_i = 0$ is mapped to $0$ and (2) that the values between $10^{-4}$ and $10^{2}$ are sufficiently spread. This way, the sensitivity of the response with respect to changes in $\mathbf{r}$ is decreased, aiding the optimization (cf. Figure 1). Specifically, in the case of our stochastic gradient estimator, this allows setting a single smoothing factor $\sigma$ for all dimensions, which would otherwise lead to oversmoothing and occlude narrow minima.
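The effect of the reparametrization can be checked numerically. In the sketch below, the function names are hypothetical, and the reading that the optimizer works on the argument of the exponential while the simulator receives its output is an interpretation of the formula above.

```python
import numpy as np

A = 0.25         # scaling a
C_SHIFT = -20.0  # shift c

def to_rate(x):
    """Map an optimization variable x to a rate constant."""
    return np.exp(A * x + C_SHIFT) - np.exp(C_SHIFT)

def to_opt(rate):
    """Inverse mapping: rate constant back to the optimization variable."""
    return (np.log(rate + np.exp(C_SHIFT)) - C_SHIFT) / A

# Doubling a rate costs (almost exactly) the same step at any magnitude.
step_small = to_opt(0.2) - to_opt(0.1)
step_large = to_opt(2.0) - to_opt(1.0)
print(step_small, step_large, 4 * np.log(2))

# The target range of rates is spread over a moderate interval.
print(to_opt(1e-4), to_opt(1e2))
```

Both steps are close to $4 \ln 2$, and `to_rate(0.0)` is exactly zero, matching properties (1) and (2) stated above.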

5.2. Evaluation Setup

To identify the challenges and opportunities of gradient descent in the context of a stochastic simulation-based model inference, we evaluate the convergence of our four problem formulations on recovering the SIR model as parametrized in Section 2. Our time-series reference data is generated by simulating the model until $t = 1$ and collecting state snapshots at $100$ discrete, equidistant simulation times (although we generally require neither equidistance nor completeness). For optimization, we employ the stochastic gradient estimator introduced in Section 3 and combine it with the Adam gradient descent optimizer (Kingma and Ba, 2014). For each problem, we manually determined hyperparameters (sample size $n$, smoothing factor $\sigma$, and learning rate $\eta$) that achieved good results. In the order of the problems from Section 5, these are $(100, 0.2, 1)$, $(1000, 1, 1)$, $(100, 0.2, 0.1)$, and $(100, 0.2, 0.5)$. Initial parameters are drawn from problem-specific uniform distributions. Our simple demonstration aims to minimize the root mean squared error (RMSE) between the reference and the simulation mean time-series, the latter being determined from $20$ replications. Note that it is easily possible to change this objective, e.g., to minimizing Wasserstein distances on distribution estimates (Öcal et al., 2020). We repeat the optimization process $10$ times to account for the stochasticity.
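The two generic building blocks of this setup, the RMSE objective over replication means and an Adam update loop driven by an arbitrary gradient estimate, can be sketched as follows. This is not the authors' code: the function names are hypothetical, the Adam constants are the commonly used defaults, and the quadratic toy gradient merely stands in for the stochastic gradient estimator applied to the simulation loss.

```python
import numpy as np

def rmse(reference, replications):
    """RMSE between a reference time series and the mean over replications.
    `replications` has one more leading axis than `reference`."""
    mean_series = np.mean(replications, axis=0)
    return float(np.sqrt(np.mean((mean_series - reference) ** 2)))

def adam_minimize(grad_fn, theta0, lr, steps,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam loop; `grad_fn` may return a noisy gradient estimate."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)  # first-moment estimate
    v = np.zeros_like(theta)  # second-moment estimate
    for k in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** k)  # bias correction
        v_hat = v / (1 - beta2 ** k)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy stand-in: exact gradient of ||theta - target||^2.
target = np.array([1.0, -2.0])
theta = adam_minimize(lambda th: 2.0 * (th - target), [0.0, 0.0],
                      lr=0.1, steps=500)
print(theta)

# RMSE for 20 constant replications against a zero reference.
print(rmse(np.zeros((100, 3)), np.ones((20, 100, 3))))
```

In the evaluation described above, `grad_fn` would wrap the smoothed gradient estimator, and the loss inside it would call `rmse` on the mean of the simulated replications.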

6. Results and Discussion

The evaluation results in Figure 2 show the mean convergence behavior over gradient descent steps on each problem, as well as the final model inferred by a chosen optimization run. For the brute-force Library of Systems problem, the lowest RMSE of all structures is shown.

The Library of Reactions formulation yields a very precise fit to the input data but lacks parsimony. Convergence is attained fast, as the objective is smooth. Here, a parsimony-encouraging initialization, such as the horseshoe prior for Bayesian regression, may be beneficial (Jiang et al., 2022), albeit introducing bias towards certain solutions.

On Coefficient Steps, on the other hand, the smoothed gradient descent struggles to converge to a good solution. Our further experiments showed that convergence to very good solutions is possible, but strongly depends on the initialization. This hints at the existence of hard-to-escape local minima.

In Reaction Steps, the smoothed gradient should be able to capture the effects of possible alternate rankings, and we observe good initial progress toward a parsimonious solution. Still, the decoupling of rates and structure seems to be challenging to overcome. When the ranking vector tends to a local minimum, means of escaping it by (partially) shuffling the current ranking could help to identify better solutions in other parts of the search space. However, in preliminary experiments of this sort, we observed inferior results.

Being completely smooth, the brute force Library of Systems approach is similar in convergence to the Library of Reactions. In contrast to the latter, it is able to recover the parsimonious original model. This indicates the ability of gradient descent to optimize a vast number of reaction systems at a time. Since the combinatorial explosion puts larger systems out of reach, the main missing piece for this approach is a goal-driven exploration of structures.

Our initial results demonstrate a tradeoff between parsimony, goodness of fit, and scalability. This is the result of different response surfaces and their amenability to gradient descent. In all cases, the scaling of rate constants poses a problem, which can be dealt with by reparametrization (cf. Section 5.1). Whereas the rate constant space clearly places solutions of similar quality close to each other (cf. Figure 1), it is generally unclear which steps in the structure dimension (on the coefficients in $\mathbf{C}$) lead to lower loss. The simultaneous adjustment of both $\mathbf{C}$ and $\mathbf{r}$ further complicates solutions that try to (smoothly) enforce a certain model size. A major step towards better convergence would thus be a combined reparametrization of $\mathbf{C}$ and $\mathbf{r}$ that enables a goal-driven exploration of structures. Clearly, such a reparametrization must be approximate, and its existence is unclear, demanding further investigation. Promisingly, in the related case of learning (imperative) programs, first steps have been taken in this direction (Kusner et al., 2017). Besides parsimony, identifiability could be facilitated by constraining solutions on background knowledge, as for example derived from a conceptual model in a simulation study.

Beyond considering the challenges outlined above, future work may explore the application of other smooth gradient estimation schemes based on automatic differentiation, such as StochasticAD (Arya et al., 2022) or DiscoGrad (Kreikemeyer and Andelfinger, 2023). Finally, the full potential of the simulation-based approach needs to be explored, e.g., by considering unmeasured variables and alternative loss functions.

References

  • Andelfinger (2023) Philipp Andelfinger. 2023. Towards Differentiable Agent-Based Simulation. ACM Trans. Model. Comput. Simul. 32, 4, Article 27 (jan 2023), 26 pages. https://doi.org/10.1145/3565810
  • Arya et al. (2022) Gaurav Arya, Moritz Schauer, Frank Schäfer, and Christopher Rackauckas. 2022. Automatic Differentiation of Programs with Discrete Randomness. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 10435–10447. https://proceedings.neurips.cc/paper_files/paper/2022/file/43d8e5fc816c692f342493331d5e98fc-Paper-Conference.pdf
  • Askari and Crevecoeur (2023) Ehsan Askari and Guillaume Crevecoeur. 2023. Evolutionary sparse data-driven discovery of multibody system dynamics. Multibody System Dynamics 58 (6 2023), 197–226. Issue 2. https://doi.org/10.1007/s11044-023-09901-z
  • Brunton et al. (2016) Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. 2016. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113 (4 2016), 3932–3937. Issue 15. https://doi.org/10.1073/pnas.1517384113
  • Burrage et al. (2024) Pamela M. Burrage, Hasitha N. Weerasinghe, and Kevin Burrage. 2024. Using a library of chemical reactions to fit systems of ordinary differential equations to agent-based models: a machine learning approach. Numerical Algorithms (1 2024). https://doi.org/10.1007/s11075-023-01737-0
  • Chopra et al. (2023) Ayush Chopra, Alexander Rodríguez, Jayakumar Subramanian, Arnau Quera-Bofarull, Balaji Krishnamurthy, B. Aditya Prakash, and Ramesh Raskar. 2023. Differentiable Agent-based Epidemiology. arXiv:2207.09714 [cs.LG]
  • Craciun and Pantea (2008) Gheorghe Craciun and Casian Pantea. 2008. Identifiability of chemical reaction networks. Journal of Mathematical Chemistry 44, 1 (2008), 244–259. https://doi.org/10.1007/s10910-007-9307-x
  • Daniels and Nemenman (2015) Bryan C. Daniels and Ilya Nemenman. 2015. Automated adaptive inference of phenomenological dynamical models. Nature Communications 6 (8 2015). Issue 1. https://doi.org/10.1038/ncomms9133
  • Gillespie (1976) Daniel T Gillespie. 1976. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22 (12 1976), 403–434. Issue 4. https://doi.org/10.1016/0021-9991(76)90041-3
  • Gupte and Vaikuntanathan (2022) Aparna Gupte and Vinod Vaikuntanathan. 2022. The Fine-Grained Hardness of Sparse Linear Regression. arXiv:2106.03131 [cs.LG]
  • Hahl and Kremling (2016) Sayuri K Hahl and Andreas Kremling. 2016. A comparison of deterministic and stochastic modeling approaches for biochemical reaction systems: on fixed points, means, and modes. Frontiers in genetics 7 (2016), 157. https://doi.org/10.3389/fgene.2016.00157
  • Jiang et al. (2022) Richard Jiang, Prashant Singh, Fredrik Wrede, Andreas Hellander, and Linda Petzold. 2022. Identification of dynamic mass-action biochemical reaction networks using sparse Bayesian methods. PLOS Computational Biology 18, 1 (01 2022), 1–21. https://doi.org/10.1371/journal.pcbi.1009830
  • Keating et al. (2020) Sarah M Keating, Dagmar Waltemath, Matthias König, Fengkai Zhang, Andreas Dräger, Claudine Chaouiya, Frank T Bergmann, Andrew Finney, Colin S Gillespie, Tomáš Helikar, et al. 2020. SBML Level 3: an extensible format for the exchange and reuse of biological models. Molecular systems biology 16, 8 (2020), e9110. https://doi.org/10.15252/msb.20199110
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. (Dec 2014). arXiv:1412.6980v9 [cs.LG]
  • Klimovskaia et al. (2016) Anna Klimovskaia, Stefan Ganscha, and Manfred Claassen. 2016. Sparse Regression Based Structure Learning of Stochastic Reaction Networks from Single Cell Snapshot Time Series. PLOS Computational Biology 12 (12 2016), e1005234. Issue 12. https://doi.org/10.1371/journal.pcbi.1005234
  • Koza et al. ([n. d.]) John R. Koza, William Mydlowec, Guido Lanza, Jessen Yu, and Martin A. Keane. [n. d.]. Reverse Engineering of Metabolic Pathways From Observed Data Using Genetic Programming. 434–445. https://doi.org/10.1142/9789814447362_0043
  • Kozin and Natke (1986) F Kozin and HG Natke. 1986. System identification techniques. Structural safety 3, 3-4 (1986), 269–316. https://doi.org/10.1016/0167-4730(86)90006-8
  • Kreikemeyer and Andelfinger (2023) Justin N. Kreikemeyer and Philipp Andelfinger. 2023. Smoothing Methods for Automatic Differentiation Across Conditional Branches. IEEE Access 11 (2023), 143190–143211. https://doi.org/10.1109/access.2023.3342136
  • Kurtz (1972) Thomas G Kurtz. 1972. The relationship between stochastic and deterministic models for chemical reactions. The Journal of Chemical Physics 57, 7 (1972), 2976–2978. https://doi.org/10.1063/1.1678692
  • Kusner et al. (2017) Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. 2017. Grammar Variational Autoencoder. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1945–1954. https://proceedings.mlr.press/v70/kusner17a.html
  • Maass et al. (2018) Wolfgang Maass, Jeffrey Parsons, Sandeep Purao, Veda C Storey, and Carson Woo. 2018. Data-driven meets theory-driven research in the era of big data: Opportunities and challenges for information systems research. Journal of the Association for Information Systems 19, 12 (2018), 1. https://doi.org/10.17705/1jais.00526
  • Margossian (2019) Charles C. Margossian. 2019. A review of automatic differentiation and its efficient implementation. WIREs Data Mining and Knowledge Discovery 9 (7 2019). Issue 4. https://doi.org/10.1002/widm.1305
  • Martinelli et al. (2023) Julien Martinelli, Jeremy Grignard, Sylvain Soliman, Annabelle Ballesta, and François Fages. 2023. Reactmine: a statistical search algorithm for inferring chemical reactions from time series data. arXiv:2209.03185 [q-bio.QM]
  • McAdams and Arkin (1999) Harley H McAdams and Adam Arkin. 1999. It’s a noisy business! Genetic regulation at the nanomolar scale. Trends in Genetics 15 (2 1999), 65–69. Issue 2. https://doi.org/10.1016/s0168-9525(98)01659-x
  • Nesterov and Spokoiny (2017) Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Foundations of Computational Mathematics 17 (4 2017), 527–566. Issue 2. https://doi.org/10.1007/s10208-015-9296-2
  • Nobile et al. (2013) Marco S. Nobile, Daniela Besozzi, Paolo Cazzaniga, Dario Pescini, and Giancarlo Mauri. 2013. Reverse engineering of kinetic reaction networks by means of Cartesian Genetic Programming and Particle Swarm Optimization. In 2013 IEEE Congress on Evolutionary Computation (CEC). https://doi.org/10.1109/cec.2013.6557752
  • Nobile et al. (2022) Marco S. Nobile, Daniele M. Papetti, Simone Spolaor, Paolo Cazzaniga, and Luca Manzoni. 2022. Shaping and Dilating the Fitness Landscape for Parameter Estimation in Stochastic Biochemical Models. Applied Sciences 12 (7 2022), 6671. Issue 13. https://doi.org/10.3390/app12136671
  • Noé et al. (2020) Frank Noé, Gianni De Fabritiis, and Cecilia Clementi. 2020. Machine learning for protein folding and dynamics. Current opinion in structural biology 60 (2020), 77–84. https://doi.org/10.1016/j.sbi.2019.12.005
  • Öcal et al. (2020) Kaan Öcal, Ramon Grima, and Guido Sanguinetti. 2020. Parameter estimation for biochemical reaction networks using Wasserstein distances. Journal of Physics A: Mathematical and Theoretical 53 (1 2020), 034002. Issue 3. https://doi.org/10.1088/1751-8121/ab5877
  • Polyak (1987) B.T. Polyak. 1987. Introduction to Optimization. Optimization Software, New York.
  • Ramaswamy et al. (2012) Rajesh Ramaswamy, Nélido González-Segredo, Ivo F. Sbalzarini, and Ramon Grima. 2012. Discreteness-induced concentration inversion in mesoscopic chemical systems. Nature Communications 3 (4 2012). Issue 1. https://doi.org/10.1038/ncomms1775
  • Tan et al. (2023) Wen Jun Tan, Moon Gi Seok, and Wentong Cai. 2023. Automatic Model Generation and Data Assimilation Framework for Cyber-Physical Production Systems. In Proceedings of the 2023 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (Orlando, FL, USA) (SIGSIM-PADS ’23). Association for Computing Machinery, New York, NY, USA, 73–83. https://doi.org/10.1145/3573900.3591112
  • Wang et al. (2010) Yuanfeng Wang, Scott Christley, Eric Mjolsness, and Xiaohui Xie. 2010. Parameter inference for discretely observed stochastic kinetic models using stochastic gradient descent. BMC systems biology 4, 1 (2010), 1–16. https://doi.org/10.1186/1752-0509-4-99
  • Yang et al. (2020) Yibo Yang, Mohamed Aziz Bhouri, and Paris Perdikaris. 2020. Bayesian differential programming for robust systems identification under uncertainty. Proceedings of the Royal Society A 476, 2243 (2020), 20200290. https://doi.org/10.1098/rspa.2020.0290