1 Introduction


Deep Optimal Experimental Design for Parameter Estimation Problems

Md Shahriar Rahim Siddiqui ¹ Arman Rahmim ¹ Eldad Haber ²

¹Department of Physics and Astronomy, University of British Columbia, Vancouver, Canada. ²Department of Earth, Ocean, and Atmospheric Sciences, University of British Columbia, Vancouver, Canada. Correspondence to: Md Shahriar Rahim Siddiqui <[email protected]>, Eldad Haber <[email protected]>.
Copyright 2024 by the author(s).

Abstract

Optimal experimental design is a well-studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. In recent years, however, parameter estimation has been changing rapidly as deep learning techniques replace traditional estimation methods. This in turn requires adapting optimal experimental design to these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that the training of a network as a Likelihood Free Estimator can be used to significantly simplify the design process and circumvent the need for the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As proof of concept we apply our methodology to two different systems of Ordinary Differential Equations.

1 Introduction

Mathematical models that are based on ordinary differential equations (ODEs) are used in various fields such as biology Helms (2008), ecology Bellman et al. (1966), physiology Zenker et al. (2007), pharmacometrics Ette & Williams (2007), climate modeling Keane et al. (2017) and more Lenhart & Workman (2007); Betts (2001). These models often involve unknown parameters that need to be estimated from experimental data. The parameters represent rates and constants that are usually important in understanding the dynamics of the underlying system, and could be crucial to decision making and to control of the system. Examples range from determining the insulin resistance of an individual from the clinically used intravenous glucose tolerance test (IVGTT) Chung & Haber (2013), to finding the rates of transfer of radio-pharmaceuticals, to determining when a system will exhibit chaotic behavior Hunt et al. (2007).

Parameter identification is typically performed in two steps: data collection and data fitting. In the data collection stage experiments are performed and data are measured. In the data fitting stage, a data fitting procedure is used to evaluate parameters from the (typically noisy) measurements. For this process to be effective, the experiment must be carefully designed.

In many cases one can measure the data densely, yielding a grossly over-determined problem Björck (1996). In other cases, however, the measurement process is difficult and expensive, which restricts the type, amount and quality (that is, signal-to-noise ratio) of the data that is collected. A decision must be made about which data should be measured, at what frequency and to what accuracy. A consequent trade-off exists between the accuracy of parameter recovery and the amount and quality of the data. The goal of this paper is to propose an experimental design technique that balances parameter estimation and the cost of the experiment.

Previous Work: Optimal experimental design is an important topic in applied science and engineering with applications in medical imaging, geoscience, optimal control, public health and many other fields (see Bardow (2008); Pukelsheim (1993); Allaire et al. (2001); Chaloner & Verdinelli (1995); Haber et al. (2008); Atkinson & Donev (1992) and references therein).

In this paper we focus our attention on the design of experiments for nonlinear systems that are governed by ODEs, with the aim of identifying parameters within the equations. While there is significant work on the topic (see Haber et al. (2009); Chung & Haber (2013); McLellan (1994); Banga & Balsa-Canto (2008); Bock et al. (2013) and references therein), the methods proposed for the solution of the problem are difficult to apply and require classical recovery techniques.

The design process is based on three pillars. At the core stands the ability to quickly solve the so-called forward problem for different experimental settings and parameters. In our case, the forward problem is a discretized system of ordinary differential equations that needs to be solved many times (in some cases hundreds of thousands of times). The second pillar of the design is the solution of the so-called inverse problem, that is, the identification of parameters given some experimental settings. For linear problems with quadratic regularization (that is, Gaussian priors), it is possible to obtain an analytic expression for the solution (see Haber et al. (2008)). However, for nonlinear inverse problems such expressions are generally unavailable, and numerical optimization techniques are therefore typically used to estimate the parameters. The solution of the inverse problem is sometimes referred to as the inner optimization problem. Finally, in the third and outer pillar of the design process stands an optimization problem that aims to obtain the best parameter recovery (on average) from the inner optimization problem by changing the experimental setting. The process is illustrated in Figure 1.

Figure 1: A schematic process of parameter estimation, where the data of the forward process is measured and parameters are estimated. The design links the quality of the estimated parameters to the type and quality of the data.

Optimal experimental design for nonlinear systems is therefore a bilevel optimization problem. The process of solving the outer problem (the design problem) is iterative. For a given design, one chooses some parameters to be identified from some distribution and solves many inner optimization problems to estimate those parameters. By comparing the estimated parameters to their true values and taking derivatives, it is possible to obtain a direction that improves the recovery of those parameters. Design methods for parameter estimation differ in their outer objective function and in their statistical framework, Bayesian or frequentist Chaloner & Verdinelli (1995); Haber et al. (2008). Nonetheless, all methods known to us rely on repeated solution of the inverse problem for different parameters, which makes the problem very difficult to solve and computationally challenging. To this end, a few authors have proposed to replace the forward solution with reduced models Wang et al. (2023); Ushijima & Yeh (2015); however, those can behave poorly, especially when changing the design.

In recent years deep learning has been used in the context of generative models that are able to sample the main modes of a distribution (see Yang et al. (2023) and references therein). The ability to learn a distribution or its main mode can be very useful when considering optimal design. In particular, we consider likelihood free estimators (see, e.g., Sainsbury-Dale et al. (2022); Didelot et al. (2011); Papamakarios (2019)) that allow for the estimation of the solution of the inverse problem without the solution of the forward problem. With improvements in neural network architectures, such estimators have become very attractive for problems where the solution of the forward problem is computationally challenging.

The novelty of this work is the development of a methodology that uses likelihood free estimators for optimal experimental design for a system governed by differential equations. Our method circumvents the bilevel optimization problem that is traditionally used. We propose a mathematical framework that allows for the construction of a deep network that yields an optimal recovery of the parameters and the experimental setting. Given a parameter to be recovered and a forward problem, our framework requires the ability to sample from the prior distribution of the parameter and to solve the forward problem for those samples. Our framework does not require the solution of the inverse problem via optimization and rather learns a direct map from the data to the parameters under different experimental settings. By using an appropriate loss function, we are able to find a network that is optimal for the recovery of the parameter and the optimal experimental settings.

The rest of this paper is organized as follows. In Section 2 we give a mathematical background to the field of design in the context of differential equations. In Section 3 we discuss likelihood free estimators and show how they can be used for the design problem. In Section 4 we propose two training methods to train such estimators. In Section 5 we conduct a number of experiments with systems from diverse fields that show that our method is robust. Finally, in Section 6 we summarize the paper.

2 Mathematical Background for Optimal Design

Consider a parameter estimation problem, and let ${\bf q}\in{\cal Q}$ be a $p$ dimensional parameter vector in a differential equation. Let ${\bm{\omega}}\in\Omega$ be a set of $e$ parameters that represent some experimental settings. Finally, let ${\bf d}\in{\cal D}$ be the data measured. Formally, we write

$${\bf d}=F({\bf q},{\bm{\omega}})+{\bm{\epsilon}} \tag{1}$$

Here $F:{\cal Q}\times\Omega\rightarrow{\cal D}$ is the forward problem that takes the parameter ${\bf q}$ and experimental setting ${\bm{\omega}}$ and yields some measured data, ${\bf d}$. In our context, the application of $F$ requires the solution of a system of ODEs, typically an initial value problem. The data are assumed to be polluted with random noise ${\bm{\epsilon}}\sim N(0,\sigma^{2}{\bf I})$, which is assumed to be normal with mean $0$ and standard deviation $\sigma$ (variance $\sigma^{2}$). We assume that ${\bf q}$ is associated with a known probability density function ${\bf q}\sim\pi({\bf q})$. In practice one does not need a mathematical expression for the density; however, we assume that a sufficient number of samples, ${\bf q}_{i},i=1,\ldots,N_{q}$, can be generated from this density. The experimental setting vector ${\bm{\omega}}$ is assumed to be under our control. Such a vector may represent measurement times, frequencies for the forward problem, source terms or other controllable parameters in the experiment.

A common approach to estimate the parameter ${\bf q}$ given the data ${\bf d}$ is to use Bayesian inference. Using Bayes’ theorem, the probability density function of ${\bf q}$ given ${\bf d}$ is

$$\pi_{{\bm{\omega}}}({\bf q}|{\bf d})\propto\pi({\bf q})\,\pi_{{\bm{\omega}}}({\bf d}|{\bf q})\propto\pi({\bf q})\exp\left(-{\frac{1}{2\sigma^{2}}}\|{\bf d}-F({\bf q},{\bm{\omega}})\|^{2}\right), \tag{2}$$

where $\pi_{{\bm{\omega}}}({\bf d}|{\bf q})$ is the likelihood, that is, the probability density of the data ${\bf d}$ given the parameter ${\bf q}$ and the experimental setting ${\bm{\omega}}$. The Maximum A Posteriori (MAP) estimate is the parameter that maximizes this posterior; that is

$$\widehat{\bf q}={\rm arg}\min_{{\bf q}}\,{\frac{1}{2\sigma^{2}}}\|F({\bf q},{\bm{\omega}})-{\bf d}\|^{2}+R({\bf q}) \tag{3}$$

Here $R({\bf q})=-\log(\pi({\bf q}))$ is the negative log of the prior.
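To make the step from the posterior to equation 3 explicit, the following short worked derivation may help. It assumes the Gaussian likelihood $\pi_{{\bm{\omega}}}({\bf d}|{\bf q})\propto\exp\left(-{\frac{1}{2\sigma^{2}}}\|{\bf d}-F({\bf q},{\bm{\omega}})\|^{2}\right)$ implied by the noise model in equation 1; the symbol $c$ is introduced here only for illustration and denotes a constant that does not depend on ${\bf q}$.

$$-\log\pi_{{\bm{\omega}}}({\bf q}|{\bf d})={\frac{1}{2\sigma^{2}}}\|F({\bf q},{\bm{\omega}})-{\bf d}\|^{2}-\log\pi({\bf q})+c={\frac{1}{2\sigma^{2}}}\|F({\bf q},{\bm{\omega}})-{\bf d}\|^{2}+R({\bf q})+c$$

Since the logarithm is monotone, maximizing the posterior over ${\bf q}$ is the same as minimizing the right-hand side, which is the objective in equation 3.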

Associated with any estimator $\widehat{\bf q}$ is the recovery loss or the risk defined by

$$\ell_{q}({\bm{\omega}})={\mathbb{E}}_{{\bf q},{\bm{\epsilon}}}\,{\frac{1}{2}}\|\widehat{\bf q}-{\bf q}\|^{2} \tag{4}$$

Note that the risk is obtained by taking the expectation over the parameter ${\bf q}$ and the noise ${\bm{\epsilon}}$ , and therefore, the risk depends on the experimental setting alone. Associated with the estimator $\widehat{\bf q}$ is the estimated data

$$\widehat{\bf d}=F(\widehat{\bf q},{\bm{\omega}}) \tag{5}$$

We also consider the data risk

$$\ell_{d}({\bm{\omega}})={\mathbb{E}}_{{\bf q},{\bm{\epsilon}}}\,{\frac{1}{2}}\int_{0}^{T}\|F(\widehat{\bf q},{\bm{\omega}})-{\bf d}(t)\|^{2}\,dt. \tag{6}$$

When approximating the integral, we use the trapezoidal method, which allows the application of non-uniform time intervals.
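As a minimal sketch of how the time integral in the data risk could be approximated on a non-uniform grid, the following Python snippet implements the composite trapezoidal rule. The function names `trapezoid` and `data_misfit` are hypothetical, and the signal is assumed to be scalar-valued at each time point; this is an illustration under those assumptions, not the authors' implementation.

```python
def trapezoid(values, times):
    """Composite trapezoidal rule on a possibly non-uniform time grid."""
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    total = 0.0
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]  # local step size, may differ for each k
        total += 0.5 * dt * (values[k] + values[k + 1])
    return total


def data_misfit(d_hat, d, times):
    """Approximate 0.5 * integral of (d_hat(t) - d(t))^2 dt for scalar signals."""
    squared_residual = [(a - b) ** 2 for a, b in zip(d_hat, d)]
    return 0.5 * trapezoid(squared_residual, times)
```

Because each interval contributes its own step size, the same routine works for uniform and logarithmically spaced grids alike.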

Note that two possible estimated solutions $\widehat{\bf q}$ with similar risk $\ell_{q}({\bm{\omega}})$ may have different data risk $\ell_{d}({\bm{\omega}})$. We therefore consider the total loss as a weighted sum of the risk (equation 4) and the data risk (equation 6), that is

$$\ell_{T}({\bm{\omega}})=\ell_{q}({\bm{\omega}})+\gamma\ell_{d}({\bm{\omega}}). \tag{7}$$

where $\gamma$ is a hyper-parameter (we chose $\gamma=1$ in our experiments). The definition in equation 7 provides a framework for optimal design. In particular, let

$${\bm{\omega}}^{*}={\rm arg}{\min}_{{\bm{\omega}}}\,\ell_{T}({\bm{\omega}}) \tag{8}$$

The solution ${\bm{\omega}}^{*}$ is interpreted as the experimental setting that yields, on average, the best recovery of the parameter ${\bf q}$ while also fitting the measured data.

One main difficulty in solving equation 8 is that the problem may not be differentiable with respect to the experimental setting ${\bm{\omega}}$. A common way to overcome this difficulty, reviewed in Boyd & Vandenberghe (2004), is to finely discretize the space of experimental settings, $\Omega$, obtaining many plausible experiments $[F({\bf q},{\bm{\omega}}_{1}),\ldots,F({\bf q},{\bm{\omega}}_{s})]$. Assume that we have $s$ possible experiments and, for simplicity, that each datum ${\bf d}_{i}$ corresponds to the data recorded from a different experiment ${\bm{\omega}}_{i}$. Introducing weights ${\bf w}\geq 0$, one replaces the original posterior (equation 2) with one that contains the data for all possible experiments, weighted by ${\bf w}$, that is

$$\pi_{{\bf w}}({\bf q}|{\bf d})\propto\pi({\bf q})\exp\left(-{\frac{1}{2\sigma^{2}}}\|{\bf w}\odot({\bf d}-F({\bf q}))\|^{2}\right). \tag{9}$$

where $F({\bf q})$ assumes that all possible experiments are conducted. The problem of estimating ${\bm{\omega}}$ is replaced with the estimation of the weights ${\bf w}$ . Clearly, if ${\bf w}$ is sparse then only a few experiments are to be conducted. Therefore, it has been proposed Haber et al. (2008) to replace the original problem with a penalized problem

$${\bf w}^{*}={\rm arg}{\min}_{{\bf w}}\,\ell_{T}({\bf w})+\alpha\,{\rm Sp}({\bf w}) \tag{10}$$

where ${\rm Sp}({\bf w})$ promotes sparsity in ${\bf w}$, and $\alpha$ is a hyper-parameter. A common choice is the 1-norm, although approximations to the zero-norm have also been proposed Donoho (2006); Bruckstein et al. (2009).

Even after the reformulation of the problem, solving the optimization equation 10 is very difficult. It requires solving the optimization problem equation 3 for many different ${\bf q}$ ’s in order to estimate the risk $\ell_{T}({\bf w})$ . This in turn requires solving many forward problems which can be very expensive especially in the context of ODEs. Furthermore, computing derivatives with respect to ${\bf w}$ is challenging and requires implicit differentiation.

Therefore, in the next section we describe an alternative to this process that leads to a much more efficient algorithm.

3 Likelihood Free Estimators and Optimal Design

The optimization problem in equation 10 is difficult because it requires estimating $\ell_{T}({\bf w})$, which in turn requires the solution of the optimization problem in equation 3. The optimization stems from our use of the MAP estimator. The MAP estimator can be thought of as a nonlinear function of the form

$$\widehat{\bf q}=F^{{\dagger}}_{\rm map}({\bf w}\odot{\bf d}) \tag{11}$$

where the nonlinear function $F^{{\dagger}}_{\rm map}$ maps the data to an estimator $\widehat{\bf q}$ .

The MAP estimator can be highly useful and has very desirable properties. However, it may be far from optimal. If the posterior is highly skewed, then the MAP may not lead to the minimization of the risk equation 4. In practice, the main reason that the MAP estimator is commonly used is our ability to compute it. In many cases, estimators such as the conditional mean may be attractive; however they require non-trivial computations and Monte-Carlo integration. Thus, the MAP estimator is certainly not the only estimator that can be used. In the context of learning it is possible to directly learn an estimator that minimizes the risk equation 4. In many cases, such an estimator can perform even better than the MAP estimator (see for example Papamakarios (2019)). Furthermore, by appropriately setting its architecture, it is possible to include different experimental setting and noise levels in this estimator and then use it for the solution of the optimal design.

We now derive a likelihood free estimator that enables us to solve both the estimation problem and the design problem. To this end, let us define the estimator

$$\widehat{\bf q}=F^{{\dagger}}_{{\bm{\theta}}}({\bf w}\odot{\bf d},{\bf w},\sigma). \tag{12}$$

The estimator depends on the experimental design vector ${\bf w}$ that controls the design and trainable parameters ${\bm{\theta}}$ . It also uses the parameter $\sigma$ that represents the noise level. Note that this estimator does not require solving the forward problem directly and computing the likelihood, hence its name, Likelihood Free Estimator (LFE).

The risk for this estimator can be written as

$$\ell_{T}({\bm{\theta}},{\bf w})={\mathbb{E}}_{{\bf q},{\bm{\epsilon}}}\|F^{{\dagger}}_{{\bm{\theta}}}({\bf w}\odot{\bf d},{\bf w},\sigma)-{\bf q}\|^{2}+\gamma\ell_{d}({\bm{\theta}},{\bf w}) \tag{13}$$

Clearly, the best estimator is obtained by minimizing the risk with respect to the weights ${\bm{\theta}}$ and the experimental setting ${\bf w}$. It is important to note that since the estimator is a simple function evaluation, we can solve a single optimization problem to estimate both the estimator parameters ${\bm{\theta}}$ and the experimental design parameters ${\bf w}$. Thus, this formulation avoids the need for bilevel optimization altogether.

We now discuss a proposed architecture that we use for the solution of the problem. The architecture uses the data ${\bf d}$, the experimental setting ${\bf w}$ and the noise level $\sigma$ in order to compute the estimator $\widehat{\bf q}$. A minimal structure of the network is summarized in Algorithm 1.

Algorithm 1 Neural Network Architecture for Design

Require: ${\bf d},\sigma$
Set ${\bf y}={\bf Q}_{0}({\bf w}\odot{\bf d})$
Embed ${\bf s}={\bf E}(\sigma),\ {\bf z}={\bf W}_{0}{\bf w}$
Generate the augmented tensor ${\bf x}_{0}=[{\bf y},{\bf z},{\bf s}]$
for $i=0,...,{\rm(nlayers-1)}$
    ${\bf x}_{i+1}={\bf x}_{i}+{\bf K}_{i}\eta({\bf Q}_{i}{\bf x}_{i}+{\bf b}_{i})$
end for
$\widehat{\bf q}={\bf Q}_{f}{\bf x}$
return $\widehat{\bf q}$

The network is a simple residual network with weights ${\bm{\theta}}=\{{\bf W}_{0},{\bf Q}_{0},{\bf Q}_{f},{\bf Q}_{i},{\bf K}_{i},{\bf b}_{i}\},\ i=0,...,{\rm(nlayers-1)}$, with a few important differences from standard residual networks. The first difference is that the input data ${\bf d}$ is weighted with the parameters ${\bf w}$, allowing us to control the experimental design and train for the optimal parameters. Second, the parameters ${\bf w}$ are embedded into the network. This allows the network to be aware of the experimental setting. Finally, similar to diffusion models Li et al. (2018), we embed the noise into the network, allowing it to be trained for different noise levels. In the network, $\eta$ stands for the activation function bundled with possible layer normalization.

The network above can be modified to have different embedding for $\sigma$ and ${\bf w}$ at each layer. However, we have found that even this simple architecture yields very good results. The network needs to be trained for its weights as well as for the best experimental settings, ${\bf w}$ . This is discussed in the next section.
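The forward pass of Algorithm 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the activation $\eta$ is taken to be `tanh` without layer normalization, the noise embedding ${\bf E}(\sigma)$ is taken to be a learned vector scaled by $\sigma$, and the names `init_params`, `lfe_forward`, `Q_in` (standing for the input matrix ${\bf Q}_{0}$) and all dimensions are hypothetical choices that do not come from the paper.

```python
import math
import random


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random Gaussian matrix used as a placeholder initialization."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(n_data, n_embed, n_hidden, n_params, n_layers, seed=0):
    """Hypothetical parameter container for the sketch."""
    rng = random.Random(seed)
    width = 3 * n_embed  # size of the augmented tensor x = [y, z, s]
    return {
        "Q_in": rand_matrix(n_embed, n_data, rng),  # embeds the weighted data w * d
        "W0": rand_matrix(n_embed, n_data, rng),    # embeds the design vector w
        "E": [rng.gauss(0.0, 0.1) for _ in range(n_embed)],  # assumed linear embedding of sigma
        "layers": [
            {"Q": rand_matrix(n_hidden, width, rng),
             "K": rand_matrix(width, n_hidden, rng),
             "b": [0.0] * n_hidden}
            for _ in range(n_layers)
        ],
        "Q_f": rand_matrix(n_params, width, rng),   # final read-out to the parameter estimate
    }


def lfe_forward(params, d, w, sigma):
    """Estimate q_hat from weighted data, design vector and noise level."""
    y = matvec(params["Q_in"], [wi * di for wi, di in zip(w, d)])
    z = matvec(params["W0"], w)
    s = [e * sigma for e in params["E"]]
    x = y + z + s  # list concatenation gives the augmented tensor x_0
    for layer in params["layers"]:
        pre = [p + b for p, b in zip(matvec(layer["Q"], x), layer["b"])]
        update = matvec(layer["K"], [math.tanh(p) for p in pre])  # eta assumed to be tanh
        x = [xi + ui for xi, ui in zip(x, update)]  # residual connection
    return matvec(params["Q_f"], x)
```

One consequence visible in the sketch is that entries of ${\bf d}$ whose weight is zero cannot influence the estimate, since the data only enter through ${\bf w}\odot{\bf d}$.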

4 Training the Estimator

In this section we discuss training the network by using a self-supervised approach Jaiswal et al. (2020); Zhai et al. (2019). Such techniques have been the cornerstone of recent advances in the application of deep learning to realistic applications. However, before we proceed, we need to discuss the particular parametrization used for this problem. While choosing the network weights is standard, there are two options for the choice of the design vector ${\bf w}$. In the first option, we allow ${\bf w}$ to have any real non-negative value. This is the simplest case as it allows us to simply train the experimental setting together with the parameters of the network, estimating the network and the optimal design simultaneously. In the second option we treat ${\bf w}$ as a binary variable. Binary variables are appropriate when an experiment is either conducted or not, meaning that the corresponding data is either measured or not. In this case, training the network implies solving a mixed nonlinear-binary programming problem. We solve this problem by a combination of Tabu search for the binary variables and stochastic gradient descent for the continuous variables.

Regardless of the optimization process, we use a self-supervised approach. That is, the algorithm generates its own data and trains on it. A summary of the algorithm is given in Algorithm 2.

Algorithm 2 Training an LFE estimator

Require: Network, sampling routine for ${\bf q}$
for $i=0,...$
    Sample a batch ${\bf q}$
    Use the forward problem to compute ${\bf d}=F({\bf q})+{\bm{\epsilon}}$
    Use the network to recover $\widehat{\bf q}$
    Compute $\widehat{\bf d}=F(\widehat{\bf q})$
    Compute the loss in equation 10
    Update parameters to minimize the loss
end for

The algorithm is self-trained in the sense that it continuously samples ${\bf q}$ and computes its corresponding data, and then uses the data in order to estimate ${\bf q}$. The algorithm requires a way to sample from the prior distribution of ${\bf q}$. For many parameter estimation problems such a distribution is readily available. In the examples below we sample the parameters from a log-normal distribution. The ability to generate an effectively unlimited amount of data allows for a very robust training procedure in which over-fitting does not play a role.

Algorithm 2 minimizes the loss with respect to the design variables ${\bf w}$ and the network parameters ${\bm{\theta}}$. However, the update of the design variables depends on their type. We now discuss the two approaches used for the update of the design variables.

4.1 Continuous Design Variables (Method 1)

Consider first the case where the entries of the design vector ${\bf w}$ can take any value in $\mathbb{R}_{\geq 0}$. In this case we seek to minimize the recovery loss $\ell_{T}({\bf w})$ and enhance the sparsity of ${\bf w}$.

A simple algorithm to achieve that is to use the soft shrink function Bruckstein et al. (2009). Here we modify the classical soft shrink function to keep non-negativity. The modified soft shrink function is defined as

$$s_{\rho}(t,\rho)=\begin{cases}t-\rho & t>\rho\\ 0 & {\rm otherwise}\end{cases} \tag{14}$$

This function is used during minimization in a straightforward manner (see Bruckstein et al. (2009) for convergence proofs). Using the stochastic gradient descent algorithm we update ${\bm{\theta}}$ in the standard way, but for ${\bf w}$ we use the following formula

$${\bf w}\leftarrow s_{\rho}({\bf w}-\mu\delta{\bf w},\rho) \tag{15}$$

Here the soft shrink function is applied element-wise, $\mu$ is the learning rate and $\delta{\bf w}$ is the update direction, that is, the stochastic gradient estimate of the loss with respect to ${\bf w}$.
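A minimal Python sketch of the non-negative soft shrink (equation 14) and the design update (equation 15) is given below. The function names `soft_shrink_nonneg` and `update_design` are hypothetical, and the sketch assumes that $\delta{\bf w}$ is supplied as the stochastic gradient of the loss with respect to ${\bf w}$.

```python
def soft_shrink_nonneg(t, rho):
    """Modified soft shrink: subtract rho above the threshold, clamp to zero otherwise."""
    return t - rho if t > rho else 0.0


def update_design(w, grad_w, lr, rho):
    """One proximal-style step: gradient step on w followed by element-wise shrinkage.

    grad_w is assumed to be the stochastic gradient of the loss with respect to w.
    """
    return [soft_shrink_nonneg(wi - lr * gi, rho) for wi, gi in zip(w, grad_w)]
```

Entries that fall at or below the threshold $\rho$ after the gradient step are set exactly to zero, which is how the update produces a sparse, non-negative design vector.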

4.2 Binary ${\bf w}$ and Tabu Search (Method 2)

Using continuous weights ${\bf w}$ is attractive as it enables the use of continuous optimization techniques. Nonetheless, such an algorithm can yield many small weights, raising the question of how important the data associated with those weights is. In many applications, a binary ${\bf w}$ makes more sense. In this case, ${\bf w}_{i}=1$ implies that the data is measured, and when ${\bf w}_{i}=0$ the data is ignored and not measured. Minimizing the loss $\ell_{T}$ then becomes a mixed nonlinear binary programming problem, which is challenging. To approximate the solution of this problem we combine a Tabu search method (for the binary variables) and standard stochastic gradient descent (for the continuous variables). Tabu search methods are commonly used in binary programming Hertz et al. (1995). The algorithm uses the current iterate to define neighbors. These neighbors are tested, and the algorithm moves to the one with the lowest loss. The algorithm retains a list (the Tabu list) of visited points so that points are not repeated. Although the algorithm is very simple, it finds reasonable approximations to the minimum of binary programming problems (see Gendreau & Potvin (2005) for details). To use Tabu search within the context of the optimization problem we incorporate it in a block coordinate descent algorithm. The algorithm is summarized in Algorithm 3.

Algorithm 3 Training LFE with binary variables

Require: Network
Initialize ${\bf w}$ and ${\bm{\theta}}$
for $i=0,...,{\rm outer\_iter}$
    for $j=0,...,{\rm inner\_iter}$
        Compute the loss in equation 10
        Compute gradients with respect to ${\bm{\theta}}$
        Update ${\bm{\theta}}$
    end for
    Set the Tabu list to empty.
    for $j=0,...,{\rm inner\_iter}$
        Compute the neighbors for ${\bf w}$
        Compute the loss in equation 10 for all neighbors
        Update ${\bf w}$ to the lowest loss
        Update the Tabu list
    end for
end for

Solving a mixed binary nonlinear programming problem is substantially more expensive than solving the problem with continuous variables. However, in our application we have found that a very small number of outer iterations usually suffices to obtain reasonable accuracy.
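The Tabu step for the binary design vector can be sketched in Python as follows. The paper does not specify how neighbors are defined, so this sketch assumes single bit-flip neighbors; it also keeps track of the best point seen so far, which is an addition for illustration. The function name `tabu_search` and the callable `loss` are hypothetical; in the setting above, `loss` would evaluate the training loss for a candidate ${\bf w}$ with ${\bm{\theta}}$ held fixed.

```python
def tabu_search(loss, w0, n_steps):
    """Tabu search over binary vectors with single bit-flip neighbors (an assumption)."""
    current = tuple(w0)
    best, best_loss = current, loss(current)
    tabu = {current}  # visited points that may not be revisited
    for _ in range(n_steps):
        neighbors = []
        for i in range(len(current)):
            cand = current[:i] + (1 - current[i],) + current[i + 1:]
            if cand not in tabu:
                neighbors.append(cand)
        if not neighbors:
            break  # every neighbor has already been visited
        current = min(neighbors, key=loss)  # move to the lowest-loss neighbor, even if worse
        tabu.add(current)
        current_loss = loss(current)
        if current_loss < best_loss:
            best, best_loss = current, current_loss
    return list(best), best_loss
```

A toy usage, with a made-up loss that counts mismatches against a target design, would be `tabu_search(lambda w: sum((a - b) ** 2 for a, b in zip(w, (1, 0, 1, 0))), [0, 0, 0, 0], n_steps=10)`. Accepting moves to worse neighbors is what lets the search leave local minima, while the Tabu list prevents it from cycling back.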

5 Numerical Experiments

In this section we study two different problems. The first problem is of practical importance to the field of nuclear medicine. It uses a "3-Tissue Compartment" (3-TC) model, an ODE system from the range of compartmental models used in kinetic modeling Morris et al. (2004); Bentourkia & Zaidi (2007), to estimate physiological parameters of interest from nuclear medicine imaging. The second is an example of a non-linear system of differential equations that models the populations of predators and prey, known as the Lotka-Volterra predator-prey model (PPM). We now briefly describe these problems; a detailed implementation can be found in Appendix A.

5.1 3-Tissue Compartment Model

A multi-compartment model is a simplified (or reduced) mathematical model used to describe the movement of mass or energy between different compartments within a system. Such models are often used to generate a reduced yet useful model of a complex system that is difficult to represent accurately. Each compartment in the model is considered to be homogeneous, so that only the interaction between compartments needs to be modeled. Compartment models have applications in several fields, including pharmacokinetics, epidemiology, systems theory, complexity theory, engineering, physics, and social science Morris et al. (2004); Zhang et al. (2022); Wagner (1969). Here we consider the 3-Tissue Compartment (3-TC) model Zakariaei et al. (2023) from the field of nuclear medicine and molecular imaging. The model is a system of ODEs, and is an example of a compartmental model from kinetic modeling. The 3-TC model (Figure 6 in the Appendix) describes the kinetics of a pharmaceutical in tissue, and how its concentration changes with time in each of the compartments. The model is written as


$$\frac{dP_{int}}{dt}=P_{v}k_{1}-(k_{2}+k_{3})P_{int}+P_{b}k_{4} \tag{16a}$$
$$\frac{dP_{b}}{dt}=P_{int}k_{3}-(k_{4}+k_{5})P_{b} \tag{16b}$$
$$\frac{dP_{intern}}{dt}=P_{b}k_{5}-k_{6}P_{intern} \tag{16c}$$

The equations describe the dynamics of the three states $P_{int},P_{b}$ and $P_{intern}$ which represent, respectively, the concentration of the radiopharmaceutical in the interstitial space between cells, bound to cell receptors, and finally internalized into the cells. $P_{v}$ represents the blood input function (source) describing the delivery of the radiopharmaceutical to the tissue by the vascular system, and is the instantaneous concentration of the radiopharmaceutical in the blood. This input function exhibits a sharp spike in concentration at earlier times. The dynamics of the model depends on the parameter vector ${\bf q}=[k_{1},\ldots,k_{6}]$ . The parameters typically exhibit a log-normal prior distribution, that is

$$\ln({\bf q})\sim\mathcal{N}(\mu_{i},\sigma_{i}^{2})\quad\text{for }i=k_{1},k_{2},k_{3},k_{4},k_{5},k_{6} \tag{17}$$

The parameters $\mu_{i},\sigma_{i}$ were calculated from Kletting et al. (2016), and the numbers used are given in Appendix A. Data collection is done as follows: a patient is injected with a radiopharmaceutical and is subsequently imaged with a PET/CT scanner. The imaging data comprises so-called Time-Activity-Curves (TACs), obtained from dynamic nuclear medicine imaging, that show the evolution of radioactivity concentration (kBq/ml) in the patient over time. We model these TAC signals as the sum of contributions from the 3 tissues, namely $P_{int}+P_{b}+P_{intern}$ , with Gaussian noise levels $\sigma=0\%,1\%,2\%...19\%$ (picked at random at each sampling). Imaging at numerous time points is costly, and an optimal design aims to identify the best times for data collection. An example of noisy TACs that were used in our experiments with the assumed multiplicative Gaussian noise at varying levels $\sigma$ is presented in Figure 2.

5.2 Lotka-Volterra Predator-Prey Model

In our second set of experiments we consider the Predator-Prey Model (PPM) which is a system of non-linear differential equations that exhibits very different behaviour from the 3-TC model. While the 3-TC model is controlled by decay, the PPM model is cyclical in nature. The model describes the dynamics of two species $x$ and $y$ and can be written as

$$\frac{dx}{dt}=\alpha x-\beta xy\quad\quad\frac{dy}{dt}=\delta xy-\gamma y \tag{18}$$

equipped with some initial conditions. The model depends on the parameter vector ${\bf q}=[\alpha,\beta,\gamma,\delta]$ . The parameter $\alpha$ denotes the maximum per capita growth rate of prey, while $\beta$ signifies the rate at which prey are consumed by predators, reflecting the predator’s impact on prey population reduction. Parameters $\delta$ and $\gamma$ for predators represent their natural growth rate through prey consumption and per capita death rate, respectively. The model assumes prey have an unlimited food supply, reproducing exponentially unless preyed upon. Additionally, it assumes prey are the sole food source for predators, with all environmental variables held constant.

To estimate the parameters, measurements of the quantities $x$ or $y$ (or both) are needed. In practice, it is difficult to measure both species, and we therefore assume that the experimentalist is only able to measure the population density of the prey. Using this data, the goal is to infer the unknown parameters $\alpha$, $\beta$, $\delta$ and $\gamma$. In practice, it is impossible to monitor $x$ or $y$ continuously. Assume that we are given a budget for the measurement process. Similar to the 3-TC model, the experimental design parameters are the number of measurements and the times at which they need to be performed. Finally, in order to learn a likelihood free estimator, we require samples of the parameters ${\bf q}$. In Mahaffy (2009), data on the populations of lynx and hares from the Hudson Bay Company is used to fit the four unknown parameters $\alpha$, $\beta$, $\gamma$ and $\delta$. To demonstrate our proposed methods, we assumed lognormal prior distributions with means equal to the fitted parameters and a standard deviation of $5\%$ of the respective means:

\ln(q_{i})\sim\mathcal{N}(\mu_{i},\sigma_{i}^{2})\quad\text{for }i=\alpha,\beta,\gamma,\delta

(19)

with numbers specified in Appendix A. Samples are drawn from this prior distribution to generate noisy data samples ${\bf d}$ (equation 1) for training. Gaussian noise at levels from $0\%$ to $10\%$ was added to generate the synthetic noisy prey data ${\bf d}$ (see Figure 4).
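A minimal sketch of how noisy prey observations could be produced from a clean trajectory is given below. The exact noise model is the one defined in equation 1; the multiplicative form $d_{j}=x_{j}(1+\sigma\varepsilon_{j})$ with $\varepsilon_{j}\sim\mathcal{N}(0,1)$ used here, the function name `add_noise` and the seed are assumptions made for illustration. The noise level is drawn at random from $\{0\%,1\%,...,10\%\}$ , as described in Appendix A.2.

```python
import random


def add_noise(clean, rng, levels_percent=range(0, 11)):
    """Perturb a clean trajectory with multiplicative Gaussian noise.

    A noise level (in percent) is picked uniformly at random from
    `levels_percent`; the multiplicative form is an assumption.
    Returns the noisy trajectory and the noise level as a fraction.
    """
    sigma = rng.choice(list(levels_percent)) / 100.0
    noisy = [v * (1.0 + sigma * rng.gauss(0.0, 1.0)) for v in clean]
    return noisy, sigma


rng = random.Random(0)  # fixed seed, for reproducibility of the sketch only
clean_prey = [30.0, 25.0, 20.0, 16.0]  # placeholder values, not real data
noisy_prey, sigma = add_noise(clean_prey, rng)
```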

5.3 Numerical Results

The network structure for the experiments is given in Algorithm 1, trained for continuous ${\bf w}$ (section 4.1) and binary ${\bf w}$ (section 4.2). Each ODE system was tested with both methods. For the 3-TC model, ${\bf d}$ was generated from the sampled ${\bf q}$ with multiplicative Gaussian noise added at levels $\sigma=0\%,1\%,...,19\%$ , and for the PPM model at $\sigma=0\%,1\%,...,10\%$ . A logarithmically spaced time grid of $400$ points from $t=0$ to $t=10^{4}$ minutes was used for the 3-TC model, as it exhibits a spike at early times (see Appendix A). For the PPM system, 200 equally spaced time points were chosen from $t=0$ to $t=30$ years. The details of the trials are given below.

Table 1: Total risk $\ell_{T}({\bf w}_{\rm opt})$ and parameter risk $\ell_{q}({\bf w}_{\rm opt})$ obtained from optimal designs for each ODE system at different sparsities, using both proposed methods. Corresponding averages $\overline{\ell_{T}({\bf w}_{\rm rand})}$ and $\overline{\ell_{q}({\bf w}_{\rm rand})}$ for 100 networks trained on random ${\bf w}$ are shown for comparison. See Appendix B for further details.

ODE	Method	Sparsity	$\ell_{T}({\bf w}_{\text{opt}})$	$\overline{\ell_{T}({\bf w}_{\text{rand}})}$	$\ell_{q}({\bf w}_{\text{opt}})$	$\overline{\ell_{q}({\bf w}_{\text{rand}})}$
3-TC	Cont. w	2	$6.18\times 10^{-2}$	$1.46\times 10^{+0}$	$3.81\times 10^{-2}$	$8.94\times 10^{-2}$
3-TC	Cont. w	6	$5.42\times 10^{-2}$	$1.87\times 10^{-1}$	$3.84\times 10^{-2}$	$5.79\times 10^{-2}$
3-TC	Cont. w	10	$5.29\times 10^{-2}$	$8.73\times 10^{-2}$	$3.84\times 10^{-2}$	$4.35\times 10^{-2}$
3-TC	TS (Binary w)	2	$5.78\times 10^{-2}$	$8.07\times 10^{-2}$	$3.82\times 10^{-2}$	$3.83\times 10^{-2}$
3-TC	TS (Binary w)	6	$5.29\times 10^{-2}$	$5.81\times 10^{-2}$	$3.80\times 10^{-2}$	$3.82\times 10^{-2}$
3-TC	TS (Binary w)	10	$5.22\times 10^{-2}$	$5.55\times 10^{-2}$	$3.82\times 10^{-2}$	$3.84\times 10^{-2}$
PPM	Cont. w	2	$8.96\times 10^{-3}$	$2.07\times 10^{-2}$	$1.07\times 10^{-3}$	$1.38\times 10^{-3}$
PPM	Cont. w	4	$6.45\times 10^{-3}$	$1.43\times 10^{-2}$	$5.45\times 10^{-4}$	$1.11\times 10^{-3}$
PPM	Cont. w	10	$4.72\times 10^{-3}$	$9.18\times 10^{-3}$	$4.27\times 10^{-4}$	$5.54\times 10^{-4}$
PPM	TS (Binary w)	2	$7.42\times 10^{-3}$	$2.45\times 10^{-2}$	$7.88\times 10^{-4}$	$1.60\times 10^{-3}$
PPM	TS (Binary w)	4	$5.87\times 10^{-3}$	$1.19\times 10^{-2}$	$5.73\times 10^{-4}$	$8.18\times 10^{-4}$
PPM	TS (Binary w)	10	$4.61\times 10^{-3}$	$8.16\times 10^{-3}$	$3.58\times 10^{-4}$	$6.00\times 10^{-4}$

For both training methods, at each iteration $3500$ realizations of ${\bf q}$ were generated from the prior distributions for both ODE systems. Learning rates were chosen by trial and error.

We defined sparsity as the number of weights higher than a threshold of $10^{-3}$ . For Method 1, the design vector ${\bf w}$ was initialized as a vector of ones, ${\bf w}_{{\rm init}}=\mathbf{1}$ . The training iterations were continued until the desired sparsity level was reached. Weights that were lower than the threshold at the end of this first phase were set to $0$ , and a further $5000$ iterations were performed with the obtained ${\bf w}_{\rm opt}$ fixed, which concluded the training phase. For training using Algorithm 3 (Method 2) at each sparsity, initialization was found to be important for convergence to a ${\bf w}_{\rm opt}$ after $200$ Tabu iterations. Hyper-parameters such as the size of the subset of neighbours used and the Tabu list length were found by trial and error, and are $10$ and $8$ respectively for both ODEs. For initialization of ${\bf w}$ , a network was trained on a random binary ${\bf w}$ at the chosen sparsity, and then used as a pre-trained model to train $100$ networks using random binary designs ${\bf w}_{\rm rand}$ (at fixed sparsity) for a smaller number of iterations. The random ${\bf w}_{\rm rand}$ that gave the network with the lowest risk $\ell_{T}({\bf w})$ was used to initialize the Tabu Search routine given by Algorithm 3 to obtain a final ${\bf w}_{\rm opt}$ . The two proposed training methods were repeated for each ODE system at a number of different sparsities. For each ODE system, method, and sparsity, the model was evaluated on $175K$ newly sampled ${\bf q}$ ’s (and corresponding ${\bf d}$ ’s generated as in equation 1). The results are given in Table 1. Higher sparsities (that is, more measurements) are expected to give better parameter recoveries and data fits, lowering $\ell_{T}({\bf w})$ , while the opposite is expected for lower sparsities. This trend is seen in the comparison of the two methods in Figure 5 for the PPM ODE system, and in Figure 3 for the 3-TC ODE system.
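To make the search over binary designs concrete, the following is a generic Tabu Search sketch over binary vectors with a fixed number of ones, using the neighbour-subset size ( $10$ ), Tabu list length ( $8$ ) and iteration count ( $200$ ) quoted above. It is not a reproduction of Algorithm 3: the swap move, the storage of visited designs in the Tabu list, and the toy objective `toy_risk` (a hypothetical stand-in for $\ell_{T}({\bf w})$ , which in the experiments requires training and evaluating a network) are all assumptions made for illustration.

```python
import random
from collections import deque


def tabu_search(risk, w_init, n_iters=200, n_neighbours=10, tabu_len=8, seed=0):
    """Generic Tabu Search over binary designs with a fixed number of ones.

    `risk` maps a tuple of 0/1 entries to a float. A move swaps one
    selected time point with one unselected time point, so the sparsity
    of the design never changes.
    """
    rng = random.Random(seed)
    current = tuple(w_init)
    best, best_risk = current, risk(current)
    tabu = deque([current], maxlen=tabu_len)  # recently visited designs
    for _ in range(n_iters):
        ones = [i for i, v in enumerate(current) if v == 1]
        zeros = [i for i, v in enumerate(current) if v == 0]
        if not ones or not zeros:
            break  # no swap move is possible
        candidates = []
        for _ in range(n_neighbours):
            i, j = rng.choice(ones), rng.choice(zeros)
            neighbour = list(current)
            neighbour[i], neighbour[j] = 0, 1
            neighbour = tuple(neighbour)
            if neighbour not in tabu:
                candidates.append((risk(neighbour), neighbour))
        if not candidates:
            continue  # every sampled neighbour was tabu
        # Move to the best admissible neighbour, even if it is worse.
        cand_risk, current = min(candidates)
        tabu.append(current)
        if cand_risk < best_risk:
            best, best_risk = current, cand_risk
    return best, best_risk


def toy_risk(w):
    """Hypothetical objective: selected indices should lie near 5 or 15."""
    return sum(min(abs(i - 5), abs(i - 15)) for i, v in enumerate(w) if v == 1)


w0 = [1, 1] + [0] * 18  # 20 candidate time points, sparsity 2
w_best, risk_best = tabu_search(toy_risk, w0)
```

Accepting the best admissible neighbour even when it increases the risk is what lets the search leave local minima, while the Tabu list prevents it from cycling straight back.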

6 Conclusions

In this paper we have introduced a methodology for experimental design for parameter estimation using a likelihood free estimator that involves the use of deep networks. Similar to other design methodologies, we require the availability of training data that samples the potential parameters to be recovered. Our methodology is self-supervised. It utilizes parameters to obtain the data under some experimental settings, and then uses the likelihood free estimator to recover the parameters. Since likelihood free estimators circumvent the need to solve an optimization problem for the parameters, the experimental design problem does not require the solution of a bilevel optimization problem. We have introduced two design criteria: in the first the data is multiplied by a continuous weight, and in the second the data is multiplied by a binary weight. We experiment with these methods on two problems and show that it is possible to obtain an efficient design that can substantially reduce the cost of collecting data that is used in parameter estimation.

7 Impact Statement

Differential equation models are ubiquitous in science and engineering research. Experimental design is commonly used, and better identification of the parameters in the face of resource or budgetary constraints can aid various disciplines, such as cancer therapy and the control of disease propagation. To our understanding, our work lacks discernible negative ethical or social impacts.

8 Acknowledgements

We acknowledge the Canadian Institutes of Health Research (CIHR) Project Grant PJT-180251. We would like to thank Tamila Kalimullina for help with Figures 1 and 6, and Shadab Ahamed for technical support.

References

Allaire et al. (2001) Allaire, G., Aubry, S., and Jouve, F. Eigenfrequency optimization in optimal design. Comp. Meth. Appl. Mech. Eng., 190(28):3565–3579, 2001.
Atkinson & Donev (1992) Atkinson, A. C. and Donev, A. N. Optimum Experimental Designs. Oxford University Press, 1992.
Banga & Balsa-Canto (2008) Banga, J. R. and Balsa-Canto, E. Parameter estimation and optimal experimental design. Essays in biochemistry, 45:195–210, 2008.
Bardow (2008) Bardow, A. Optimal experimental design for ill-posed problems, the meter approach. Computers and chemical engineering, 32, 2008.
Bellman et al. (1966) Bellman, R., Kagiwada, H., and Kalaba, R. Inverse problems in ecology. J. Theor. Biol., 11(1):164–167, 1966.
Bentourkia & Zaidi (2007) Bentourkia, M. and Zaidi, H. Tracer kinetic modeling in pet. Pet Clinics, 2(2):267–277, 2007.
Betts (2001) Betts, J. Practical Methods for Optimal Control using Nonlinear Programming. Advances in Design and Control. SIAM, Philadelphia, 2001.
Björck (1996) Björck, Å. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
Bock et al. (2013) Bock, H. G., Körkel, S., and Schlöder, J. P. Parameter estimation and optimum experimental design for differential equation models. Model Based Parameter Estimation: Theory and Applications, pp. 1–30, 2013.
Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University press, 2004.
Bruckstein et al. (2009) Bruckstein, A., Donoho, D., and Elad, M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51:34–81, 2009.
Chaloner & Verdinelli (1995) Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Statis. Sci., 10:237–304, 1995.
Chung & Haber (2013) Chung, M. and Haber, E. Experimental design for biological systems. SIAM Journal on Control and Optimization, 50:471–489, 2013.
Didelot et al. (2011) Didelot, X., Everitt, R. G., Johansen, A. M., and Lawson, D. J. Likelihood-free estimation of model evidence. 2011.
Donoho (2006) Donoho, D. D. For most large underdetermined systems of linear equations the minimal $\ell_{1}$ -norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
Ette & Williams (2007) Ette, E. I. and Williams, P. (eds.). Pharmacometrics: The Science of Quantitative Pharmacology. Wiley-Interscience, New York, 2007.
Feng et al. (1994) Feng, D., Wang, X., and Yan, H. A computer simulation study on the input function sampling schedules in tracer kinetic modeling with positron emission tomography (pet). Computer Methods and Programs in Biomedicine, 45(3):175–186, 1994. ISSN 0169-2607. doi: https://doi.org/10.1016/0169-2607(94)90201-1. URL https://www.sciencedirect.com/science/article/pii/0169260794902011.
Gendreau & Potvin (2005) Gendreau, M. and Potvin, J.-Y. Tabu search. Search methodologies: introductory tutorials in optimization and decision support techniques, pp. 165–186, 2005.
Haber et al. (2008) Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems, 24, 2008.
Haber et al. (2009) Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of nonlinear ill-posed inverse problems. Inverse Problems, 25, 2009.
Helms (2008) Helms, V. (ed.). Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks. Wiley-VCH, New York, 2008.
Hertz et al. (1995) Hertz, A., Taillard, E., and De Werra, D. A tutorial on tabu search. In Proc. of Giornate di Lavoro AIRO, volume 95, pp. 13–24, 1995.
Hunt et al. (2007) Hunt, B. R., Kostelich, E. J., and Szunyogh, I. Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D, 230:112–126, 2007.
Jaiswal et al. (2020) Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., and Makedon, F. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
Keane et al. (2017) Keane, A., Krauskopf, B., and Postlethwaite, C. M. Climate models with delay differential equations. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(11), 2017.
Kletting et al. (2016) Kletting, P., Schuchardt, C., Kulkarni, H., Shahinfar, M., Singh, A., Glatting, G., Baum, R. P., and Beer, A. Investigating the effect of ligand amount and injected therapeutic activity: A simulation study for 177lu-labeled psma-targeting peptides. PloS one, 11:e0162303, 09 2016. doi: 10.1371/journal.pone.0162303.
Lenhart & Workman (2007) Lenhart, S. and Workman, J. (eds.). Optimal Control Applied to Biological Models (1st ed.). Chapman and Hall/CRC, New York, 2007.
Li et al. (2018) Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations, 2018.
Mahaffy (2009) Mahaffy, J. Qualitative analysis of 2D systems of Ordinary Differential Equations. https://jmahaffy.sdsu.edu/courses/f09/math636/lectures/lotka/qualde2.html, 2009. Accessed: Dec 25, 2023.
McLellan (1994) McLellan, P. A differential-algebraic perspective on nonlinear controller design methodologies. Chem. Eng. Science, 49:1663–1679, 1994.
Morris et al. (2004) Morris, E. D., Endres, C. J., Schmidt, K. C., Christian, B. T., Muzic, R. F., and Fisher, R. E. Kinetic modeling in positron emission tomography. Emission Tomography: The Fundamentals of PET and SPECT, 46(1):499–540, 2004.
Papamakarios (2019) Papamakarios, G. Neural density estimation and likelihood-free inference. arXiv preprint arXiv:1910.13233, 2019.
Pukelsheim (1993) Pukelsheim, F. Optimal design of experiments. John Wiley & Sons, 1993.
Sainsbury-Dale et al. (2022) Sainsbury-Dale, M., Zammit-Mangion, A., and Huser, R. Fast optimal estimation with intractable models using permutation-invariant neural networks. arXiv preprint arXiv:2208.12942, 2022.
Ushijima & Yeh (2015) Ushijima, T. T. and Yeh, W. W. Experimental design for estimating unknown hydraulic conductivity in an aquifer using a genetic algorithm and reduced order model. Advances in Water Resources, 86:193–208, 2015.
Wagner (1969) Wagner, J. G. Pharmacokinetics: 10. introduction to compartment models. Drug Intelligence & Clinical Pharmacy, 3(9):250–257, 1969. doi: 10.1177/106002806900300904. URL https://doi.org/10.1177/106002806900300904.
Wang et al. (2023) Wang, L., Martins, J. R., and Du, X. Optimal experimental design-based reduced order modeling for learning optimal aerodynamic designs. In AIAA AVIATION 2023 Forum, pp. 3716, 2023.
Yang et al. (2023) Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
Zakariaei et al. (2023) Zakariaei, N., Paranj, A. F., Abdollahi, H., and Rahmim, A. Using the cluster gauss newton algorithm to estimate theranostic pharmacokinetic model parameters. Journal of Nuclear Medicine, 64(supplement 1):P1449–P1449, 2023. ISSN 0161-5505. URL https://jnm.snmjournals.org/content/64/supplement_1/P1449.
Zenker et al. (2007) Zenker, S., Rubin, J., and Clermont, G. From inverse problems in mathematical physiology to quantitative differential diagnoses. PLOS Computational Biology, 3(11):1–15, 2007.
Zhai et al. (2019) Zhai, X., Oliver, A., Kolesnikov, A., and Beyer, L. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1476–1485, 2019.
Zhang et al. (2022) Zhang, P., Feng, K., Gong, Y., Lee, J., Lomonaco, S., and Zhao, L. Usage of compartmental models in predicting covid-19 outbreaks. The AAPS Journal, 24(5):98, 2022. ISSN 1550-7416. doi: 10.1208/s12248-022-00743-9. URL https://doi.org/10.1208/s12248-022-00743-9. PMID: 36056223.

Appendix A Appendix: Additional Details on ODE Models and Setup

Here we provide additional details on our experiments. For all experiments concerned, the network structure given by Algorithm 1 used $nlayers=3$ , and no layer normalization was used. The $SiLU$ function was used as the activation function. Further results of our experimental runs are given in the tables shown herein.

The samples ${\bf q}$ (and therefore ${\bf d}$ that were generated from them) were generated from a lognormal prior distribution for each of the parameters in ${\bf q}$ as mentioned in the main text (equation 17 and equation 19). The mean $\mu$ and standard deviation $\sigma$ parameters of a lognormal distribution are related to the mean $\mu_{q}$ and standard deviation $\sigma_{q}$ of the random variable $q$ as follows:

\mu=\ln\left(\frac{\mu_{q}^{2}}{\sqrt{\mu_{q}^{2}+\sigma_{q}^{2}}}\right),\quad\sigma^{2}=\ln\left(1+\frac{\sigma_{q}^{2}}{\mu_{q}^{2}}\right)

(20)

where $q$ in our case refers to each of the parameters being used in the corresponding ODE model.
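Equation 20 can be checked numerically with the short sketch below, which converts a desired mean and standard deviation of $q$ into the parameters of the underlying normal distribution and then draws a sample. The function names are hypothetical; `random.Random.lognormvariate` from the Python standard library is used for sampling, and the example values are the prior mean and standard deviation of $\alpha$ given in Appendix A.2.

```python
import math
import random


def lognormal_params(mean_q, std_q):
    """Equation 20: parameters (mu, sigma) of ln(q) from the mean/std of q."""
    mu = math.log(mean_q ** 2 / math.sqrt(mean_q ** 2 + std_q ** 2))
    sigma = math.sqrt(math.log(1.0 + std_q ** 2 / mean_q ** 2))
    return mu, sigma


def sample_parameter(mean_q, std_q, rng):
    """Draw one lognormally distributed parameter value."""
    mu, sigma = lognormal_params(mean_q, std_q)
    return rng.lognormvariate(mu, sigma)


# Prior for alpha in the PPM model: mean 0.4, standard deviation 5% of the mean.
mu_alpha, sigma_alpha = lognormal_params(0.4, 0.05 * 0.4)
alpha_sample = sample_parameter(0.4, 0.05 * 0.4, random.Random(0))

# Round trip: the mean and std implied by (mu, sigma) recover the inputs.
implied_mean = math.exp(mu_alpha + sigma_alpha ** 2 / 2.0)
implied_std = implied_mean * math.sqrt(math.exp(sigma_alpha ** 2) - 1.0)
```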

A.1 Model Setup: 3-Tissue Compartment Model

A diagram of the 4 compartments in the 3-TC ODE model is provided in Figure 6. The source function $P_{v}$ is a known blood input function (prior to entry into tissue).

In our experiments for the ODE system, we used the input function proposed in Feng et al. (1994) given by:

P_{v}(t)=(A_{1}t-A_{2}-A_{3})\exp(-\lambda_{1}t)+A_{2}\exp(-\lambda_{2}t)+A_{3}\exp(-\lambda_{3}t)

(21)

We fit equation 21 to patient data available to us at our institution, arriving at $A_{1}=408.87,A_{2}=A_{3}=14.78$ and $\lambda_{1}=-8.46,\lambda_{2}=\lambda_{3}=-0.1362$ which we used in equation 21 to obtain our input function $P_{v}$ . This input function exhibits a sharp spike in concentration at earlier times, and as such, we selected a log-spaced discretization of $400$ points for ${\bf t}$ from $t=0$ to $t=10^{4}$ minutes. For the initial conditions we set $P_{v}$ , $P_{int}$ , $P_{b}$ and $P_{intern}$ to $0$ $\rm kBq/ml$ .

The means $\mu_{i}$ and standard deviations $\sigma_{i}$ , $i\in\{k_{1},k_{2},k_{3},k_{4},k_{5},k_{6}\}$ , used for the prior distributions of the parameters were $\mu_{k_{1}}=1.5\times 10^{-2}$ , $\mu_{k_{2}}=1.6\times 10^{-3}$ , $\mu_{k_{3}}=121.0$ , $\mu_{k_{4}}=4\times 10^{-2}$ , $\mu_{k_{5}}=1\times 10^{-3}$ , $\mu_{k_{6}}=2\times 10^{-4}$ , and $\sigma_{k_{1}}=0.2\mu_{k_{1}},\sigma_{k_{2}}=0.2\mu_{k_{2}},\sigma_{k_{3}}=0.2\mu_{k_{3}},\sigma_{k_{4}}=0.2\mu_{k_{4}},\sigma_{k_{5}}=0.2\mu_{k_{5}},\sigma_{k_{6}}=0.2\mu_{k_{6}}$ respectively. That is, we took the standard deviations of the parameters to be $20\%$ of the means. The noise level $\sigma$ added to the generated ${\bf d}$ was randomly picked at each sampling from $\{0\%,1\%,2\%,...,19\%\}$ .

A.2 PPM Model Setup

For the Predator-Prey Model described in Section 5, we used an equally spaced time grid of 200 points from $t=0$ to $t=30$ years. In Mahaffy (2009), data on the populations of lynx and hares from the Hudson Bay Company is used to fit the four unknown parameters $\alpha$ , $\beta$ , $\gamma$ and $\delta$ , with initial populations of hares (prey) and lynx (predator) of 30 and 4 (in units of thousands) respectively, which we also used. We took the fitted parameters obtained by Mahaffy (2009) as the means of the respective prior distributions of the parameters $\alpha$ , $\beta$ , $\gamma$ and $\delta$ , and $5\%$ of the means as their respective standard deviations. The means and standard deviations of the parameters for their respective prior distributions were: $(\mu_{\alpha},\sigma_{\alpha})=(0.4,0.05\times 0.4)$ , $(\mu_{\beta},\sigma_{\beta})=(0.018,0.05\times 0.018)$ , $(\mu_{\gamma},\sigma_{\gamma})=(0.8,0.05\times 0.8)$ , $(\mu_{\delta},\sigma_{\delta})=(0.023,0.05\times 0.023)$ . The noise level $\sigma$ added to the generated ${\bf d}$ was randomly picked at each sampling from $\{0\%,1\%,2\%,...,10\%\}$ .

A.3 Scaling of Risks

The loss function equation 7 is a sum of the data risk $\ell_{d}(\omega)$ and the parameter risk $\ell_{q}(\omega)$ . For the data risk $\ell_{d}(\omega)$ we used the trapezoidal rule for variable time widths (owing to our use of logarithmically spaced time points). Note that the risks $\ell(\omega)$ reported in our results and used during training include the scaling $k$ shown below in equation 22, whose discretized form was used during training.

\ell_{d}(\omega)={\mathbb{E}}_{{\bf q},{\bm{\epsilon}}}\,\frac{1}{2}\int_{0}^{T}k\cdot\|F(\widehat{\bf q}_{i},{\bm{\omega}})-{\bf d}_{i}(t)\|^{2}dt,\qquad k=\frac{1}{\int_{0}^{T}\|{\bf d}_{i}(t)\|^{2}dt}

(22)

where $k$ is a scaling factor. Similarly, for the parameter recovery risk $\ell_{q}(\omega)$ :

\ell_{q}({\bm{\omega}})={\mathbb{E}}_{{\bf q},{\bm{\epsilon}}}\,\frac{1}{2}\cdot k\cdot\|\widehat{\bf q}-{\bf q}\|^{2}

(23)

k=\frac{1}{\|{\bf q}\|^{2}}

(24)
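The scaled risks of equations 22 to 24 can be sketched for a single sample as follows; the expectation is then an average of these quantities over samples. The function names are hypothetical, and treating the data as a scalar time series (so that the norm reduces to an absolute value) is an assumption made to keep the example short.

```python
def trapezoid(values, times):
    """Trapezoidal rule on a possibly non-uniform time grid."""
    total = 0.0
    for j in range(1, len(times)):
        total += 0.5 * (values[j] + values[j - 1]) * (times[j] - times[j - 1])
    return total


def scaled_data_misfit(pred, data, times):
    """Single-sample, discretized form of equation 22."""
    k = 1.0 / trapezoid([d * d for d in data], times)
    residual_sq = [(p - d) ** 2 for p, d in zip(pred, data)]
    return 0.5 * k * trapezoid(residual_sq, times)


def scaled_parameter_misfit(q_hat, q):
    """Single-sample form of equations 23 and 24."""
    k = 1.0 / sum(v * v for v in q)
    return 0.5 * k * sum((a - b) ** 2 for a, b in zip(q_hat, q))


# Non-uniform (roughly log-spaced) grid with placeholder data values.
times = [0.0, 0.1, 1.0, 10.0]
data = [1.0, 2.0, 3.0, 4.0]
perfect = scaled_data_misfit(data, data, times)
trivial = scaled_data_misfit([0.0] * 4, data, times)
```

With this scaling a perfect fit gives a risk of zero and the trivial prediction of all zeros gives one half, so risks from time series of very different magnitudes become comparable.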

Appendix B Appendix: Numerical Details of Figures and Tables

In Figure 3 and Figure 5 we exhibit the performance of networks trained using the proposed methods at different sparsities. For each training method at each sparsity, the trained optimal network was evaluated on a new batch of $175K$ samples of ${\bf q}$ (and corresponding noisy ${\bf d}$ ). Specifically, for the computation of $\ell_{T}({\bf w})$ , the network was evaluated on $50$ sets of $3500$ newly sampled ${\bf q}$ and ${\bf d}$ , amounting to an evaluation over $175K$ unseen samples. The mean value of the total risk $\ell_{T}({\bf w})$ of the optimal network over these $50$ sets was computed and taken as $\ell_{T}({\bf w}_{\rm opt})$ , and the Standard Error of the Mean (SEM) for this value was calculated. The $\ell_{T}({\bf w}_{\rm opt})$ are plotted against sparsity (number of non-zero elements in ${\bf w}$ ) in Figure 3 and in Figure 5. The corresponding numerical values are presented in Table 2. Note that in Figure 3 and Figure 5, the error bars look point-like because the SEM( $\ell_{T}({\bf w})$ ) values are roughly two to three orders of magnitude smaller than the values of $\ell_{T}({\bf w})$ that are plotted (see Table 2). The error bars in the figures were calculated as:

\log_{10}\left(\ell_{T}({\bf w})+\text{SEM}(\ell_{T}({\bf w}))\right)-\log_{10}\left(\ell_{T}({\bf w})\right)

(25)
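The SEM and the log-scale error bar of equation 25 can be computed as in the sketch below. The function names are hypothetical, and the use of the sample standard deviation (denominator $n-1$ ) in the SEM is an assumption, since the text does not state which estimator was used. The example values are the total risk and SEM from Table 2 for the 3-TC model with continuous ${\bf w}$ at sparsity 2.

```python
import math


def mean_and_sem(values):
    """Mean and standard error of the mean of a list of per-set risks."""
    n = len(values)
    mean = sum(values) / n
    sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(sample_var / n)


def log10_error_bar(risk, sem):
    """Equation 25: length of the upper error bar on a log10 axis."""
    return math.log10(risk + sem) - math.log10(risk)


# Values from Table 2 (3-TC, continuous w, sparsity 2).
bar = log10_error_bar(6.18e-2, 1.81e-4)
```

For these values the bar is about $1.3\times 10^{-3}$ in $\log_{10}$ units, which is consistent with the error bars appearing point-like in the figures.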

The corresponding numerical values for $\ell_{T}({\bf w}_{\rm opt})$ and SEM( $\ell_{T}({\bf w}_{\rm opt})$ ) are in Table 2.

We chose to conduct numerical experiments to compare the risks obtained by our methods to random designs for a subset of sparsities. For the 3-TC model, this was at $sparsity=2,6,10,20$ , and at $sparsity=2,4,10,20$ for the PPM system of equations.

Namely, the risks $\ell_{T}({\bf w})$ , $\ell_{q}({\bf w})$ obtained by training optimal networks for a given method and sparsity are compared against networks trained on random ${\bf w}$ . In Table 1, Table 3 and Table 5, for each ODE model, method, and sparsity, 100 networks were trained on random designs ${\bf w}_{\rm rand}$ for comparison. For the random runs for continuous ${\bf w}$ , the entries of ${\bf w}_{\rm rand}$ were sampled from a uniform distribution, ${\bf w}_{\rm rand}\sim\text{Uniform}(0,2)$ . For binary design variables, ${\bf w}_{\rm rand}$ had a random subset of weights (of size equal to the sparsity) set to $1$ , with the rest set to $0$ . For the computation of each $\ell_{T}({\bf w}_{\rm opt})$ , the optimal network was evaluated on $50$ sets of $3500$ newly sampled ${\bf q}$ and ${\bf d}$ , amounting to an evaluation over $175K$ unseen samples. A mean value for the risk of the optimal network over these samples was computed and taken as $\ell_{T}({\bf w}_{\rm opt})$ , and the Standard Error of the Mean for this value was calculated. This is given as the SEM( $\ell_{T}({\bf w}_{\rm opt})$ ) column in Table 2 and Table 3. The parameter risks $\ell_{q}({\bf w}_{\rm opt})$ for the optimal networks were calculated similarly. Each of the $100$ networks trained on random designs ${\bf w}_{\rm rand}$ was evaluated for $\ell_{T}({\bf w}_{\rm rand})$ on $70K$ new samples of ${\bf q}$ and ${\bf d}$ . The STD( $\ell_{T}({\bf w}_{\rm rand})$ ) and STD( $\ell_{q}({\bf w}_{\rm rand})$ ) columns in Table 3 and Table 5 are the standard deviations of the corresponding risks among the 100 networks trained on random ${\bf w}_{\rm rand}$ for each corresponding entry in the tables. In Table 4 we confirm that the total risks for the trained optimal networks $\ell_{T}({\bf w}_{\rm opt})$ are lower than the risk obtained by the best performing network trained on random ${\bf w}_{\rm rand}$ for each entry.
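The two kinds of random baseline designs described above can be generated as in the following sketch, together with the sparsity count based on the $10^{-3}$ threshold from Section 5.3. The function names and the seed are hypothetical, and the grid size of 200 corresponds to the PPM time grid.

```python
import random


def random_continuous_design(n_times, rng):
    """Entries drawn independently from Uniform(0, 2)."""
    return [rng.uniform(0.0, 2.0) for _ in range(n_times)]


def random_binary_design(n_times, sparsity, rng):
    """Exactly `sparsity` randomly chosen entries set to 1, the rest to 0."""
    active = set(rng.sample(range(n_times), sparsity))
    return [1 if i in active else 0 for i in range(n_times)]


def count_sparsity(w, threshold=1e-3):
    """Number of weights above the threshold."""
    return sum(1 for v in w if v > threshold)


rng = random.Random(0)  # fixed seed, for reproducibility of the sketch only
w_cont = random_continuous_design(200, rng)
w_bin = random_binary_design(200, 10, rng)
```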

Table 2: Total risks $\ell_{T}({\bf w}_{\rm opt})$ obtained by networks trained for optimal designs ${\bf w}_{\rm opt}$ on $175K$ unseen samples of ${\bf q}$ and ${\bf d}$ . The Standard Error of the Mean (SEM) for each are also presented.

ODE	Method	Sparsity	$\ell_{T}({\bf w}_{\text{opt}})$	SEM( $\ell_{T}({\bf w}_{\text{opt}})$ )
3-TC	Cont. w	1	$6.44\times 10^{-2}$	$1.42\times 10^{-4}$
3-TC	Cont. w	2	$6.18\times 10^{-2}$	$1.81\times 10^{-4}$
3-TC	Cont. w	4	$5.48\times 10^{-2}$	$1.67\times 10^{-4}$
3-TC	Cont. w	6	$5.42\times 10^{-2}$	$1.66\times 10^{-4}$
3-TC	Cont. w	10	$5.29\times 10^{-2}$	$1.47\times 10^{-4}$
3-TC	Cont. w	15	$5.17\times 10^{-2}$	$1.23\times 10^{-4}$
3-TC	Cont. w	20	$5.125\times 10^{-2}$	$1.07\times 10^{-4}$
3-TC	Cont. w	30	$5.120\times 10^{-2}$	$1.60\times 10^{-4}$
3-TC	TS (Binary w)	1	$6.28\times 10^{-2}$	$1.43\times 10^{-4}$
3-TC	TS (Binary w)	2	$5.78\times 10^{-2}$	$1.77\times 10^{-4}$
3-TC	TS (Binary w)	4	$5.48\times 10^{-2}$	$1.51\times 10^{-4}$
3-TC	TS (Binary w)	6	$5.29\times 10^{-2}$	$1.52\times 10^{-4}$
3-TC	TS (Binary w)	10	$5.22\times 10^{-2}$	$1.48\times 10^{-4}$
3-TC	TS (Binary w)	15	$5.17\times 10^{-2}$	$1.19\times 10^{-4}$
3-TC	TS (Binary w)	20	$5.14\times 10^{-2}$	$1.22\times 10^{-4}$
3-TC	TS (Binary w)	30	$5.11\times 10^{-2}$	$1.55\times 10^{-4}$
PPM	Cont. w	1	$1.49\times 10^{-2}$	$3.28\times 10^{-5}$
PPM	Cont. w	2	$8.96\times 10^{-3}$	$1.88\times 10^{-5}$
PPM	Cont. w	4	$6.45\times 10^{-3}$	$1.74\times 10^{-5}$
PPM	Cont. w	6	$5.83\times 10^{-3}$	$1.32\times 10^{-5}$
PPM	Cont. w	10	$4.72\times 10^{-3}$	$9.76\times 10^{-6}$
PPM	Cont. w	15	$4.60\times 10^{-3}$	$1.29\times 10^{-5}$
PPM	Cont. w	20	$4.28\times 10^{-3}$	$1.14\times 10^{-5}$
PPM	Cont. w	30	$4.08\times 10^{-3}$	$8.24\times 10^{-6}$
PPM	TS (Binary w)	1	$1.32\times 10^{-2}$	$3.30\times 10^{-5}$
PPM	TS (Binary w)	2	$7.42\times 10^{-3}$	$1.17\times 10^{-5}$
PPM	TS (Binary w)	4	$5.87\times 10^{-3}$	$1.24\times 10^{-5}$
PPM	TS (Binary w)	6	$5.12\times 10^{-3}$	$1.25\times 10^{-5}$
PPM	TS (Binary w)	10	$4.61\times 10^{-3}$	$9.44\times 10^{-6}$
PPM	TS (Binary w)	15	$4.35\times 10^{-3}$	$1.11\times 10^{-5}$
PPM	TS (Binary w)	20	$4.15\times 10^{-3}$	$6.90\times 10^{-6}$
PPM	TS (Binary w)	30	$4.09\times 10^{-3}$	$9.11\times 10^{-6}$

Table 3: Experimental results for total risks $\ell_{T}({\bf w})$ . The $\ell_{T}({\bf w})$ were computed over $175K$ unseen samples for each ODE system at the given sparsity values. The corresponding Standard Errors of the Mean (SEM) are shown. Also given are the mean total risks $\overline{\ell_{T}({\bf w}_{\rm rand})}$ obtained by 100 networks trained on random designs ${\bf w}_{\rm rand}$ . The standard deviations of $\ell_{T}({\bf w}_{\rm rand})$ and the percentage difference of $\ell_{T}({\bf w}_{\rm opt})$ from the random runs are similarly presented. SP refers to the sparsity of the corresponding row.

ODE	Method	SP	$\ell_{T}({\bf w}_{\rm{opt}})$	SEM( $\ell_{T}({\bf w}_{\rm{opt}})$ )	$\overline{\ell_{T}({\bf w}_{\rm{rand}})}$	$\frac{\ell_{T}({\bf w}_{\rm{opt}})-\overline{\ell_{T}({\bf w}_{\rm{rand}})}}{\overline{\ell_{T}({\bf w}_{\rm{rand}})}}\times 100\%$	$\textbf{STD}(\ell_{T}({\bf w}_{\rm rand}))$
3-TC	Cont. w	2	$6.18\times 10^{-2}$	$1.81\times 10^{-4}$	$1.46\times 10^{+0}$	$-95.78\%$	$2.05\times 10^{+0}$
3-TC	Cont. w	6	$5.42\times 10^{-2}$	$1.66\times 10^{-4}$	$1.87\times 10^{-1}$	$-71.10\%$	$2.88\times 10^{-1}$
3-TC	Cont. w	10	$5.29\times 10^{-2}$	$1.47\times 10^{-4}$	$8.73\times 10^{-2}$	$-39.43\%$	$1.49\times 10^{-1}$
3-TC	Cont. w	20	$5.125\times 10^{-2}$	$1.23\times 10^{-4}$	$5.98\times 10^{-2}$	$-14.30\%$	$2.44\times 10^{-3}$
3-TC	TS (Binary w)	2	$5.78\times 10^{-2}$	$1.77\times 10^{-4}$	$8.07\times 10^{-2}$	$-28.37\%$	$8.69\times 10^{-3}$
3-TC	TS (Binary w)	6	$5.29\times 10^{-2}$	$1.52\times 10^{-4}$	$5.81\times 10^{-2}$	$-9.06\%$	$2.19\times 10^{-3}$
3-TC	TS (Binary w)	10	$5.22\times 10^{-2}$	$1.48\times 10^{-4}$	$5.55\times 10^{-2}$	$-5.92\%$	$1.43\times 10^{-3}$
3-TC	TS (Binary w)	20	$5.14\times 10^{-2}$	$1.22\times 10^{-4}$	$5.35\times 10^{-2}$	$-3.76\%$	$7.66\times 10^{-4}$
PPM	Cont. w	2	$8.96\times 10^{-3}$	$1.88\times 10^{-5}$	$2.07\times 10^{-2}$	$-56.72\%$	$7.46\times 10^{-3}$
PPM	Cont. w	4	$6.45\times 10^{-3}$	$1.74\times 10^{-5}$	$1.43\times 10^{-2}$	$-54.89\%$	$8.62\times 10^{-3}$
PPM	Cont. w	10	$4.72\times 10^{-3}$	$9.76\times 10^{-6}$	$9.18\times 10^{-3}$	$-48.59\%$	$2.33\times 10^{-3}$
PPM	Cont. w	20	$4.28\times 10^{-3}$	$1.14\times 10^{-5}$	$7.16\times 10^{-3}$	$-40.22\%$	$3.14\times 10^{-3}$
PPM	TS (Binary w)	2	$7.42\times 10^{-3}$	$1.17\times 10^{-5}$	$2.45\times 10^{-2}$	$-69.77\%$	$8.93\times 10^{-3}$
PPM	TS (Binary w)	4	$5.87\times 10^{-3}$	$1.24\times 10^{-5}$	$1.19\times 10^{-2}$	$-50.89\%$	$4.76\times 10^{-3}$
PPM	TS (Binary w)	10	$4.61\times 10^{-3}$	$9.44\times 10^{-6}$	$8.16\times 10^{-3}$	$-43.51\%$	$5.82\times 10^{-3}$
PPM	TS (Binary w)	20	$4.15\times 10^{-3}$	$6.90\times 10^{-6}$	$5.63\times 10^{-3}$	$-26.25\%$	$2.16\times 10^{-3}$

Table 4: Performance comparison between total risks $\ell_{T}({\bf w}_{\rm opt})$ obtained by evaluating optimal networks, with the best performing network with the minimum total risk $\min(\ell_{T}({\bf w}_{\rm rand}))$ among $100$ networks that were trained on random designs ${\bf w}_{\rm rand}$ .

ODE	Method	Sparsity	$\ell_{T}({\bf w}_{\rm{opt}})$	$\min(\ell_{T}({\bf w}_{\rm rand}))$	$\frac{\ell_{T}({\bf w}_{\rm{opt}})-\min(\ell_{T}({\bf w}_{\rm rand}))}{\min(\ell_{T}({\bf w}_{\rm rand}))}\times 100\%$
3-TC	Cont. w	2	$6.18\times 10^{-2}$	$1.35\times 10^{-1}$	$-54.38\%$
3-TC	Cont. w	6	$5.42\times 10^{-2}$	$6.28\times 10^{-2}$	$-13.80\%$
3-TC	Cont. w	10	$5.29\times 10^{-2}$	$5.32\times 10^{-2}$	$-0.62\%$
3-TC	Cont. w	20	$5.11\times 10^{-2}$	$5.34\times 10^{-2}$	$-4.36\%$
3-TC	TS (Binary w)	2	$5.78\times 10^{-2}$	$5.49\times 10^{-2}$	$-5.60\%$
3-TC	TS (Binary w)	6	$5.29\times 10^{-2}$	$5.49\times 10^{-2}$	$-3.65\%$
3-TC	TS (Binary w)	10	$5.22\times 10^{-2}$	$5.32\times 10^{-2}$	$-1.72\%$
3-TC	TS (Binary w)	20	$5.14\times 10^{-2}$	$5.20\times 10^{-2}$	$-0.99\%$
PPM	Cont. w	2	$8.96\times 10^{-3}$	$9.23\times 10^{-3}$	$-3.01\%$
PPM	Cont. w	4	$6.45\times 10^{-3}$	$6.94\times 10^{-3}$	$-7.02\%$
PPM	Cont. w	10	$4.72\times 10^{-3}$	$6.01\times 10^{-3}$	$-21.53\%$
PPM	Cont. w	20	$4.28\times 10^{-3}$	$4.91\times 10^{-3}$	$-12.83\%$
PPM	TS (Binary w)	2	$7.42\times 10^{-3}$	$1.14\times 10^{-2}$	$-35.10\%$
PPM	TS (Binary w)	4	$5.87\times 10^{-3}$	$6.14\times 10^{-3}$	$-4.47\%$
PPM	TS (Binary w)	10	$4.61\times 10^{-3}$	$5.27\times 10^{-3}$	$-12.49\%$
PPM	TS (Binary w)	20	$4.15\times 10^{-3}$	$4.33\times 10^{-3}$	$-4.31\%$

Table 5: Experimental results for parameter risks $\ell_{q}({\bf w})$ . The $\ell_{q}({\bf w})$ were computed over $175K$ unseen samples for each ODE system at the given sparsity values. The corresponding Standard Error of the Mean (SEM) are shown. Also given are the mean parameter risks $\overline{\ell_{q}({\bf w}_{\rm rand})}$ obtained by 100 networks trained on random designs ${\bf w}_{\rm rand}$ . The standard deviations of $\ell_{q}({\bf w}_{\rm rand})$ and the percentage difference of $\ell_{q}({\bf w}_{\rm opt})$ from the random runs are similarly presented. SP refers to the sparsity of the corresponding row.

ODE	Method	SP	$\ell_{q}({\bf w}_{\rm{opt}})$	SEM( $\ell_{q}({\bf w}_{\rm{opt}})$ )	$\overline{\ell_{q}({\bf w}_{\rm{rand}})}$	$\frac{\ell_{q}({\bf w}_{\rm{opt}})-\overline{\ell_{q}({\bf w}_{\rm{rand}})}}{\overline{\ell_{q}({\bf w}_{\rm{rand}})}}\times 100\%$	$\textbf{STD}(\ell_{q}({\bf w}_{\rm{rand}}))$
3-TC	Cont. w	2	$3.81\times 10^{-2}$	$1.35\times 10^{-4}$	$8.94\times 10^{-2}$	$-57.34\%$	$2.77\times 10^{-1}$
3-TC	Cont. w	6	$3.84\times 10^{-2}$	$1.60\times 10^{-4}$	$5.79\times 10^{-2}$	$-33.69\%$	$1.14\times 10^{-1}$
3-TC	Cont. w	10	$3.84\times 10^{-2}$	$1.49\times 10^{-4}$	$4.35\times 10^{-2}$	$-11.87\%$	$3.69\times 10^{-2}$
3-TC	Cont. w	20	$3.78\times 10^{-2}$	$1.07\times 10^{-4}$	$3.84\times 10^{-2}$	$-1.57\%$	$6.19\times 10^{-4}$
3-TC	TS (Binary w)	2	$3.82\times 10^{-2}$	$1.65\times 10^{-4}$	$3.828\times 10^{-2}$	$-0.28\%$	$3.94\times 10^{-4}$
3-TC	TS (Binary w)	6	$3.80\times 10^{-2}$	$1.43\times 10^{-4}$	$3.819\times 10^{-2}$	$-0.43\%$	$2.39\times 10^{-4}$
3-TC	TS (Binary w)	10	$3.83\times 10^{-2}$	$1.34\times 10^{-4}$	$3.821\times 10^{-2}$	$+0.37\%$	$2.13\times 10^{-4}$
3-TC	TS (Binary w)	20	$3.81\times 10^{-2}$	$1.25\times 10^{-4}$	$3.82\times 10^{-2}$	$-0.22\%$	$2.34\times 10^{-4}$
PPM	Cont. w	2	$1.07\times 10^{-3}$	$3.27\times 10^{-6}$	$1.38\times 10^{-3}$	$-22.31\%$	$4.88\times 10^{-4}$
PPM	Cont. w	4	$5.45\times 10^{-4}$	$1.99\times 10^{-6}$	$1.11\times 10^{-3}$	$-51.05\%$	$6.76\times 10^{-4}$
PPM	Cont. w	10	$4.27\times 10^{-4}$	$1.29\times 10^{-6}$	$5.54\times 10^{-4}$	$-22.97\%$	$1.95\times 10^{-4}$
PPM	Cont. w	20	$3.75\times 10^{-4}$	$2.10\times 10^{-4}$	$5.09\times 10^{-4}$	$-26.44\%$	$2.10\times 10^{-4}$
PPM	TS (Binary w)	2	$7.88\times 10^{-4}$	$2.31\times 10^{-6}$	$1.60\times 10^{-3}$	$-50.76\%$	$5.24\times 10^{-4}$
PPM	TS (Binary w)	4	$5.73\times 10^{-4}$	$1.85\times 10^{-6}$	$8.18\times 10^{-4}$	$-29.98\%$	$3.11\times 10^{-4}$
PPM	TS (Binary w)	10	$3.58\times 10^{-4}$	$1.15\times 10^{-6}$	$6.00\times 10^{-4}$	$-40.42\%$	$3.92\times 10^{-4}$
PPM	TS (Binary w)	20	$3.13\times 10^{-4}$	$1.15\times 10^{-6}$	$3.91\times 10^{-4}$	$-19.95\%$	$1.29\times 10^{-4}$

\pi_{{\bm{\omega}}}({\bf q}|{\bf d})\propto\pi({\bf q})\pi_{{\bm{\omega}}}({\bf d}|{\bf q})=\pi({\bf q})\exp\left(-\frac{1}{2\sigma^{2}}\|{\bf d}-F({\bf q};{\bm{\omega}})\|^{2}\right).
