

Deep Optimal Experimental Design for Parameter Estimation Problems


Md Shahriar Rahim Siddiqui 1  Arman Rahmim 1  Eldad Haber 2 


††footnotetext: 1Department of Physics and Astronomy, University of British Columbia, Vancouver, Canada 2Department of Earth, Ocean, and Atmospheric Sciences, University of British Columbia, Vancouver, Canada. Correspondence to: Md Shahriar Rahim Siddiqui <[email protected]>, Eldad Haber <[email protected]>. Β 
Copyright 2024 by the author(s).
Abstract

Optimal experimental design is a well-studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. In recent years, however, parameter estimation has been changing rapidly as deep learning techniques replace traditional estimation methods. This in turn requires adapting optimal experimental design to these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that training a network as a Likelihood Free Estimator can significantly simplify the design process and circumvent the computationally expensive bilevel optimization problem that is inherent in optimal experimental design for nonlinear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As a proof of concept, we apply our methodology to two different systems of ordinary differential equations.

1 Introduction

Mathematical models that are based on ordinary differential equations (ODEs) are used in various fields such as biology Helms (2008), ecology Bellman et al. (1966), physiology Zenker et al. (2007), pharmacometrics Ette & Williams (2007), climate modeling Keane et al. (2017) and more Lenhart & Workman (2007); Betts (2001). These models often involve unknown parameters that need to be estimated from experimental data. The parameters represent rates and constants that are usually important in understanding the dynamics of the underlying system, and can be crucial to decision making and to the control of the system. Examples include determining the insulin resistance of an individual from the clinically used intravenous glucose tolerance test (IVGTT) Chung & Haber (2013), finding the rates of transfer of radio-pharmaceuticals, and determining when a system will exhibit chaotic behavior Hunt et al. (2007).

Parameter identification is typically performed in two steps: data collection and data fitting. In the data collection stage experiments are performed and data are measured. In the data fitting stage, a data fitting procedure is used to evaluate parameters from the (typically noisy) measurements. For this process to be effective, the experiment must be carefully designed.

In many cases one can measure the data densely, yielding a grossly over-determined problem Björck (1996). In other cases, however, the measurement process is difficult and expensive, which restricts the type, amount and quality (that is, signal-to-noise ratio) of the data that is collected. A decision must be made about which data should be measured, at what frequency and to what accuracy. A consequent trade-off exists between the accuracy of parameter recovery and the amount and quality of the data. The goal of this paper is to propose an experimental design technique that balances parameter estimation and the cost of the experiment.

Previous Work: Optimal experimental design is an important topic in applied science and engineering with applications in medical imaging, geoscience, optimal control, public health and many other fields (see Bardow (2008); Pukelsheim (1993); Allaire et al. (2001); Chaloner & Verdinelli (1995); Haber et al. (2008); Atkinson & Donev (1992) and references therein).

In this paper we focus our attention on the design of experiments for nonlinear systems that are governed by ODEs, with the aim of identifying parameters within the equations. While there is significant work on the topic (see Haber et al. (2009); Chung & Haber (2013); McLellan (1994); Banga & Balsa-Canto (2008); Bock et al. (2013) and references therein), the methods proposed for the solution of the problem are difficult to apply and require classical recovery techniques.

The design process is based on three pillars. At the core stands the ability to quickly solve the so-called forward problem for different experimental settings and parameters. In our case, the forward problem is a discretized system of ordinary differential equations that needs to be solved many times (in some cases hundreds of thousands of times). The second pillar of the design is the solution of the so-called inverse problem, that is, the identification of parameters given some experimental settings. For linear problems with quadratic regularization (that is, Gaussian priors), it is possible to obtain an analytic expression for this solution (see Haber et al. (2008)). For nonlinear inverse problems, however, such expressions are generally unavailable, and numerical optimization techniques are typically used to solve the problem and estimate the parameters. The solution of the inverse problem is sometimes referred to as the inner optimization problem. Finally, the third and outer pillar of the design process is an optimization problem that aims to obtain the best parameter recovery (on average) from the inner optimization problem by changing the experimental setting. The process is illustrated in Figure 1.

Figure 1: A schematic of the parameter estimation process, in which the data of the forward process is measured and the parameters are estimated. The design links the quality of the estimated parameters to the type and quality of the data.

Optimal experimental design for nonlinear systems is therefore a bilevel optimization problem. The process of solving the outer problem (the design problem) is iterative. For a given design, one draws parameters to be identified from some distribution and solves many inner optimization problems to estimate those parameters. By comparing the estimated parameters to their true values and taking derivatives, it is possible to obtain a direction that improves the recovery of those parameters. Design methods for parameter estimation differ in their outer objective function and in their statistical framework, Bayesian or frequentist Chaloner & Verdinelli (1995); Haber et al. (2008). Nonetheless, all methods known to us rely on the repeated solution of the inverse problem for different parameters, which makes the problem very difficult to solve and computationally challenging. To this end, a few authors have proposed replacing the forward solution with reduced models Wang et al. (2023); Ushijima & Yeh (2015); however, those can behave poorly, especially when the design changes.

In recent years deep learning has been used in the context of generative models that are able to sample the main modes of a distribution (see Yang et al. (2023) and references therein). The ability to learn a distribution or its main mode can be very useful when considering optimal design. In particular, we consider likelihood free estimators (see e.g. Sainsbury-Dale et al. (2022); Didelot et al. (2011); Papamakarios (2019)) that allow for the estimation of the solution of the inverse problem without the solution of the forward problem. With improvements in neural network architectures, such estimators have become very attractive for problems where the solution of the forward problem is computationally challenging.

The novelty of this work is the development of a methodology that uses likelihood free estimators for optimal experimental design for a system governed by differential equations. Our method circumvents the bilevel optimization problem that is traditionally used. We propose a mathematical framework that allows for the construction of a deep network that yields an optimal recovery of the parameters and the experimental setting. Given a parameter to be recovered and a forward problem, our framework requires the ability to sample from the prior distribution of the parameter and to solve the forward problem for those samples. Our framework does not require the solution of the inverse problem via optimization and rather learns a direct map from the data to the parameters under different experimental settings. By using an appropriate loss function, we are able to find a network that is optimal for the recovery of the parameter and the optimal experimental settings.

The rest of this paper is organized as follows. In Section 2 we give a mathematical background to the field of design in the context of differential equations. In Section 3 we discuss likelihood free estimators and show how they can be used for the design problem. In Section 4 we propose two training methods to train such estimators. In Section 5 we conduct a number of experiments with systems from diverse fields that show that our method is robust. Finally, in Section 6 we summarize the paper.

2 Mathematical Background for Optimal Design

Consider a parameter estimation problem, and let ${\bf q} \in {\cal Q}$ be a $p$-dimensional parameter vector in a differential equation. Let ${\bm{\omega}} \in \Omega$ be a set of $e$ parameters that represent some experimental settings. Finally, let ${\bf d} \in {\cal D}$ be the measured data. Formally, we write

$${\bf d} = F({\bf q}, {\bm{\omega}}) + {\bm{\epsilon}} \tag{1}$$

Here $F: {\cal Q} \times \Omega \rightarrow {\cal D}$ is the forward problem that takes the parameter ${\bf q}$ and the experimental setting ${\bm{\omega}}$ and yields some measured data, ${\bf d}$. In our context, the application of $F$ requires the solution of a system of ODEs, typically an initial value problem. The data are assumed to be polluted with random noise ${\bm{\epsilon}} \sim N(0, \sigma^2 {\bf I})$, that is, normally distributed noise with zero mean and variance $\sigma^2$. We assume that ${\bf q}$ is associated with a known probability density function, ${\bf q} \sim \pi({\bf q})$. In practice one does not need a mathematical expression for the density; however, we assume that a sufficient number of samples, ${\bf q}_i$, $i = 1, \ldots, N_q$, can be generated from this density. The experimental setting vector ${\bm{\omega}}$ is assumed to be under our control. Such a vector may represent measurement times, frequencies for the forward problem, source terms or other controllable parameters in the experiment.
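As a minimal sketch of equation 1, the following Python snippet simulates noisy data for a toy forward problem. The scalar decay ODE $y' = -q y$, the classical fourth-order Runge-Kutta integrator, the function names, the log-normal prior parameters, and the choice of measurement times as the experimental setting are all illustrative assumptions; they are not the systems studied in this paper.

```python
import math
import random


def rk4_solve(rhs, y0, t_grid):
    """Integrate y' = rhs(t, y) with classical RK4 over the points of t_grid."""
    ys = [y0]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = t1 - t0
        y = ys[-1]
        k1 = rhs(t0, y)
        k2 = rhs(t0 + 0.5 * h, y + 0.5 * h * k1)
        k3 = rhs(t0 + 0.5 * h, y + 0.5 * h * k2)
        k4 = rhs(t1, y + h * k3)
        ys.append(y + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
    return ys


def forward(q, omega, substeps=20):
    """Toy forward problem F(q, omega): solve y' = -q*y, y(0) = 1, and return
    the solution at the measurement times in omega (sorted, positive)."""
    data = []
    t_prev, y_prev = 0.0, 1.0
    for t_meas in omega:
        # Refine each interval between consecutive measurements.
        grid = [t_prev + (t_meas - t_prev) * k / substeps
                for k in range(substeps + 1)]
        y_prev = rk4_solve(lambda t, y: -q * y, y_prev, grid)[-1]
        t_prev = t_meas
        data.append(y_prev)
    return data


def simulate_data(q, omega, sigma, rng):
    """Return d = F(q, omega) + eps with eps ~ N(0, sigma^2 I)."""
    return [f + rng.gauss(0.0, sigma) for f in forward(q, omega)]


rng = random.Random(0)
omega = [0.1, 0.5, 1.0, 2.0]        # measurement times (the design)
q = rng.lognormvariate(0.0, 0.25)   # one sample from an assumed log-normal prior
d = simulate_data(q, omega, sigma=0.01, rng=rng)
```

Repeating the last three lines for many samples of `q` produces the pairs of parameters and data on which the risks defined below are averaged.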

A common approach to estimating the parameter ${\bf q}$ given the data ${\bf d}$ is to use Bayesian inference. Using Bayes' theorem, the probability density function of ${\bf q}$ given ${\bf d}$ is

$$\pi_{{\bm{\omega}}}({\bf q}|{\bf d}) \propto \pi({\bf q})\, \pi_{{\bm{\omega}}}({\bf d}|{\bf q}) \propto \pi({\bf q}) \exp\left(-\frac{1}{2\sigma^2} \|{\bf d} - F({\bf q}, {\bm{\omega}})\|^2\right), \tag{2}$$

where $\pi_{{\bm{\omega}}}({\bf d}|{\bf q})$ is the likelihood, that is, the probability of the data ${\bf d}$ given the parameter ${\bf q}$ and the experimental setting ${\bm{\omega}}$. The Maximum A Posteriori (MAP) estimate is the parameter that maximizes this distribution, or equivalently minimizes its negative logarithm; that is

$$\widehat{\bf q} = {\rm arg}\min_{{\bf q}} \frac{1}{2\sigma^2} \|F({\bf q}, {\bm{\omega}}) - {\bf d}\|^2 + R({\bf q}) \tag{3}$$

Here $R({\bf q}) = -\log(\pi({\bf q}))$ is the negative log of the prior.

Associated with any estimator $\widehat{\bf q}$ is the recovery loss, or the risk, defined by

$$\ell_q({\bm{\omega}}) = {\mathbb{E}}_{{\bf q}, {\bm{\epsilon}}}\, \frac{1}{2} \|\widehat{\bf q} - {\bf q}\|^2 \tag{4}$$

Note that the risk is obtained by taking the expectation over the parameter ${\bf q}$ and the noise ${\bm{\epsilon}}$, and therefore, the risk depends on the experimental setting alone. Associated with the estimator $\widehat{\bf q}$ is the estimated data

$$\widehat{\bf d} = F(\widehat{\bf q}, {\bm{\omega}}) \tag{5}$$

We also consider the data risk

$$\ell_d({\bm{\omega}}) = {\mathbb{E}}_{{\bf q}, {\bm{\epsilon}}}\, \frac{1}{2} \int_0^T \|F(\widehat{\bf q}, {\bm{\omega}}) - {\bf d}(t)\|^2\, dt. \tag{6}$$

When approximating the integral, we use the trapezoidal method, which allows the application of non-uniform time intervals.
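A minimal Python sketch of this quadrature, for scalar-valued data sampled at possibly non-uniform times, is given below. The function names are hypothetical, and the snippet evaluates the integrand of equation 6 for a single sample only; the expectation would be approximated by averaging over many samples of the parameter and the noise.

```python
def trapezoid(values, times):
    """Composite trapezoidal rule on a possibly non-uniform time grid."""
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    total = 0.0
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        total += 0.5 * dt * (values[k] + values[k + 1])
    return total


def data_misfit(d_hat, d, times):
    """Approximate 0.5 * integral of (d_hat(t) - d(t))^2 dt for one sample."""
    squared = [(a - b) ** 2 for a, b in zip(d_hat, d)]
    return 0.5 * trapezoid(squared, times)
```

Because each interval carries its own width `dt`, the same routine applies unchanged when the measurement times are clustered where the solution varies quickly.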

Note that two possible estimated solutions $\widehat{\bf q}$ with similar risk $\ell_q({\bm{\omega}})$ may have different data risk $\ell_d({\bm{\omega}})$. We therefore consider the total loss as a weighted sum of the risk equation 4 and the data risk equation 6, that is

$$\ell_T({\bm{\omega}}) = \ell_q({\bm{\omega}}) + \gamma\, \ell_d({\bm{\omega}}), \tag{7}$$

where $\gamma$ is a hyper-parameter (we chose $\gamma = 1$ in our experiments). The definition in equation 7 provides a framework for the optimal design. In particular, let

$${\bm{\omega}}^{*} = {\rm arg}\min_{{\bm{\omega}}}\, \ell_T({\bm{\omega}}) \tag{8}$$

The solution ${\bm{\omega}}^{*}$ is interpreted as the experimental setting that yields, on average, the best recovery of the parameter ${\bf q}$ while also fitting the measured data.

One main difficulty in solving equation 8 is that the problem may not be differentiable with respect to the experimental setting ${\bm{\omega}}$. A common way to overcome this difficulty, reviewed in Boyd & Vandenberghe (2004), is to discretize the space of experimental settings, $\Omega$, finely, obtaining many plausible experiments $[F({\bf q}, {\bm{\omega}}_1), \ldots, F({\bf q}, {\bm{\omega}}_s)]$ that can be conducted. Assume that we have $s$ possible different experiments and assume for simplicity that each datum ${\bf d}_i$ corresponds to the data that is recorded from a different experiment ${\bm{\omega}}_i$. Introducing weights ${\bf w} \geq 0$, one replaces the original posterior equation 2 with one that contains all the data for all possible experiments, weighted by ${\bf w}$, that is

$$\pi_{{\bf w}}({\bf q}|{\bf d}) \propto \pi({\bf q}) \exp\left(-\frac{1}{2\sigma^2} \|{\bf w} \odot ({\bf d} - F({\bf q}))\|^2\right), \tag{9}$$

where $F({\bf q})$ assumes that all possible experiments are conducted. The problem of estimating ${\bm{\omega}}$ is replaced with the estimation of the weights ${\bf w}$. Clearly, if ${\bf w}$ is sparse then only a few experiments are to be conducted. Therefore, it has been proposed Haber et al. (2008) to replace the original problem with a penalized problem

$${\bf w}^{*} = {\rm arg}\min_{{\bf w}}\, \ell_T({\bf w}) + \alpha\, {\rm Sp}({\bf w}) \tag{10}$$

where ${\rm Sp}({\bf w})$ promotes sparsity in ${\bf w}$, and $\alpha$ is a hyper-parameter. A common choice is the 1-norm, although approximations to the zero-norm have also been proposed Donoho (2006); Bruckstein et al. (2009).

Even after the reformulation of the problem, solving the optimization problem equation 10 is very difficult. It requires solving the optimization problem equation 3 for many different values of ${\bf q}$ in order to estimate the risk $\ell_T({\bf w})$. This in turn requires solving many forward problems, which can be very expensive, especially in the context of ODEs. Furthermore, computing derivatives with respect to ${\bf w}$ is challenging and requires implicit differentiation.
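To make this cost structure concrete, the Python sketch below estimates the recovery risk of a given weight vector by Monte Carlo, with one inner MAP solve per sample. Everything specific in it is an assumption made for illustration: the toy forward model $\exp(-q t)$, the log-normal prior, and the brute-force grid search that stands in for a numerical optimizer. It is not the procedure of the cited works.

```python
import math
import random


def forward(q, times):
    """Toy forward model: y(t) = exp(-q t) sampled at the candidate times."""
    return [math.exp(-q * t) for t in times]


def map_estimate(d, w, times, sigma, mu=0.0, tau=0.5, grid_size=400):
    """Inner problem: minimize the weighted misfit plus the negative log of a
    log-normal prior, by brute force over a grid of candidate q values."""
    best_q, best_val = None, float("inf")
    for k in range(1, grid_size + 1):
        q = 5.0 * k / grid_size                  # candidates in (0, 5]
        f = forward(q, times)
        misfit = sum((wi * (di - fi)) ** 2 for wi, di, fi in zip(w, d, f))
        # Negative log of the log-normal density, up to an additive constant.
        prior = math.log(q) + (math.log(q) - mu) ** 2 / (2.0 * tau ** 2)
        val = misfit / (2.0 * sigma ** 2) + prior
        if val < best_val:
            best_q, best_val = q, val
    return best_q


def recovery_risk(w, times, sigma, n_samples, rng):
    """Outer objective: Monte Carlo estimate of E 0.5 * (q_hat - q)^2."""
    total = 0.0
    for _ in range(n_samples):
        q = rng.lognormvariate(0.0, 0.5)
        d = [f + rng.gauss(0.0, sigma) for f in forward(q, times)]
        total += 0.5 * (map_estimate(d, w, times, sigma) - q) ** 2
    return total / n_samples


rng = random.Random(0)
times = [0.5, 1.0, 2.0]
risk = recovery_risk([1.0, 0.0, 1.0], times, sigma=0.05, n_samples=50, rng=rng)
```

Each call to `recovery_risk` performs `n_samples` inner solves, and an outer optimizer over the weights would call it repeatedly, which is the nesting that makes the bilevel formulation expensive.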

Therefore, in the next section we describe an alternative to this process that leads to a much more efficient algorithm.

3 Likelihood Free Estimators and Optimal Design

The optimization problem equation 10 is difficult because it requires estimating $\ell_T({\bf w})$, which in turn requires the solution of the optimization problem equation 3. The optimization stems from our use of the MAP estimator. The MAP estimator can be thought of as a nonlinear function of the form

$$\widehat{\bf q} = F^{\dagger}_{\rm map}({\bf w} \odot {\bf d}) \tag{11}$$

where the nonlinear function $F^{\dagger}_{\rm map}$ maps the data to an estimator $\widehat{\bf q}$.

The MAP estimator can be highly useful and has very desirable properties. However, it may be far from optimal. If the posterior is highly skewed, then the MAP estimate may not minimize the risk equation 4. In practice, the main reason that the MAP estimator is commonly used is our ability to compute it. In many cases, estimators such as the conditional mean may be attractive; however, they require non-trivial computations and Monte Carlo integration. The MAP estimator is therefore not the only estimator that can be used. In the context of learning it is possible to directly learn an estimator that minimizes the risk equation 4. In many cases, such an estimator can perform even better than the MAP estimator (see for example Papamakarios (2019)). Furthermore, by appropriately setting its architecture, it is possible to include different experimental settings and noise levels in this estimator and then use it for the solution of the optimal design problem.

We now derive a likelihood free estimator for the solution of the problem, which enables us to solve both the estimation problem and the design problem. To this end, let us define the estimator

$$\widehat{\bf q} = F^{\dagger}_{{\bm{\theta}}}({\bf w} \odot {\bf d}, {\bf w}, \sigma). \tag{12}$$

The estimator depends on the experimental design vector ${\bf w}$ that controls the design and on the trainable parameters ${\bm{\theta}}$. It also uses the parameter $\sigma$ that represents the noise level. Note that evaluating this estimator requires neither solving the forward problem directly nor computing the likelihood, hence its name, Likelihood Free Estimator (LFE).

The risk for this estimator can be written as

$$\ell_T({\bm{\theta}}, {\bf w}) = {\mathbb{E}}_{{\bf q}, {\bm{\epsilon}}} \|F^{\dagger}_{{\bm{\theta}}}({\bf w} \odot {\bf d}, {\bf w}, \sigma) - {\bf q}\|^2 + \gamma\, \ell_d({\bm{\theta}}, {\bf w}) \tag{13}$$

Clearly, the best estimator is obtained by minimizing the risk with respect to the weights ${\bm{\theta}}$ and the experimental setting ${\bf w}$. It is important to note that since the estimator is a simple function evaluation, we can solve a single optimization problem to estimate both the estimator parameters ${\bm{\theta}}$ and the experimental design parameters ${\bf w}$. Thus, this formulation avoids the need for bilevel optimization altogether.

We now discuss the architecture that we use for the solution of the problem. The architecture uses the data ${\bf d}$, the experimental setting ${\bf w}$ and the noise level $\sigma$ in order to compute the estimator $\widehat{\bf q}$. A minimal structure of the network is summarized in Algorithm 1.

Algorithm 1 Neural Network Architecture for Design
Require: ${\bf d}, \sigma$
  Set ${\bf y} = {\bf Q}_0({\bf w} \odot {\bf d})$
  Embed ${\bf s} = {\bf E}(\sigma)$, ${\bf z} = {\bf W}_0 {\bf w}$
  Generate the augmented tensor ${\bf x}_0 = [{\bf y}, {\bf z}, {\bf s}]$
  for $i = 0, \ldots, ({\rm nlayers} - 1)$ do
    ${\bf x}_{i+1} = {\bf x}_i + {\bf K}_i\, \eta({\bf Q}_i {\bf x}_i + {\bf b}_i)$
  end for
  $\widehat{\bf q} = {\bf Q}_f {\bf x}_{\rm nlayers}$
  return $\widehat{\bf q}$

The network is a simple residual network with weights ${\bm{\theta}} = \{{\bf W}_0, {\bf Q}_0, {\bf Q}_f, {\bf Q}_i, {\bf K}_i, {\bf b}_i\}$, $i = 0, \ldots, ({\rm nlayers} - 1)$, with a few important differences from standard residual networks. The first difference is that the input data ${\bf d}$ is weighted with the parameters ${\bf w}$, allowing us to control the experimental design and train for the optimal parameters. Second, the parameters ${\bf w}$ are embedded into the network. This allows the network to be aware of the experimental setting. Finally, similar to diffusion models Li et al. (2018), we embed the noise into the network, allowing it to be trained for different noise levels. In the network, $\eta$ stands for the activation function bundled with possible layer normalization.
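The forward pass of such a network can be sketched in plain Python as follows. The layer widths, the random initialization, the `tanh` activation without layer normalization, and the sinusoidal form of the noise embedding are assumptions made for illustration. The opening matrix is named `Q_open` here so that its shape can differ from the residual-layer matrices; this naming is also an assumption.

```python
import math
import random


def matvec(A, x):
    """Dense matrix-vector product with A stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def embed_sigma(sigma, dim):
    """Assumed sinusoidal embedding E(sigma) of the scalar noise level."""
    out = []
    for k in range(dim):
        freq = 2.0 ** (k // 2)
        out.append(math.sin(freq * sigma) if k % 2 == 0 else math.cos(freq * sigma))
    return out


def init_params(n_data, p, n_layers, rng, n_y=8, n_z=4, n_s=4, n_hidden=16):
    m = n_y + n_z + n_s                     # width of the augmented state x
    return {
        "Q_open": rand_matrix(n_y, n_data, rng),
        "W0": rand_matrix(n_z, n_data, rng),
        "n_s": n_s,
        "layers": [
            {"Q": rand_matrix(n_hidden, m, rng),
             "K": rand_matrix(m, n_hidden, rng),
             "b": [0.0] * n_hidden}
            for _ in range(n_layers)
        ],
        "Q_f": rand_matrix(p, m, rng),
    }


def lfe_forward(params, d, w, sigma):
    """Forward pass following the steps of Algorithm 1; returns q_hat."""
    y = matvec(params["Q_open"], [wi * di for wi, di in zip(w, d)])
    z = matvec(params["W0"], w)
    s = embed_sigma(sigma, params["n_s"])
    x = y + z + s                           # list concatenation [y, z, s]
    for layer in params["layers"]:
        pre = [a + b for a, b in zip(matvec(layer["Q"], x), layer["b"])]
        act = [math.tanh(v) for v in pre]   # eta: activation only
        x = [xi + ui for xi, ui in zip(x, matvec(layer["K"], act))]
    return matvec(params["Q_f"], x)


rng = random.Random(0)
params = init_params(n_data=5, p=2, n_layers=3, rng=rng)
q_hat = lfe_forward(params, d=[1.0, 0.8, 0.6, 0.4, 0.2],
                    w=[1.0, 0.0, 1.0, 0.0, 1.0], sigma=0.05)
```

Because the data enter only through the product of `w` and `d`, a measurement whose weight is zero has no influence on the estimate, which is what allows a sparse `w` to encode the design.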

The network above can be modified to have a different embedding for $\sigma$ and ${\bf w}$ at each layer. However, we have found that even this simple architecture yields very good results. The network needs to be trained for its weights as well as for the best experimental settings, ${\bf w}$. This is discussed in the next section.

4 Training the Estimator

In this section we discuss training the network by using a self-supervised approach Jaiswal et al. (2020); Zhai et al. (2019). Such techniques have been the cornerstone of recent advances in the application of deep learning to realistic problems. Before we proceed, however, we need to discuss the particular parametrization used for this problem. While choosing the network weights is standard, there are two options for the choice of the design vector ${\bf w}$. In the first option, we allow ${\bf w}$ to take any non-negative real value. This is the simplest case, as it allows us to train the experimental setting together with the parameters of the network, estimating the network and the optimal design simultaneously. In the second option we treat ${\bf w}$ as a binary variable. Binary variables arise when we either conduct an experiment or not, so that the corresponding data is either measured or not. In this case, training the network requires solving a mixed nonlinear-binary programming problem. We solve this problem by a combination of Tabu search for the binary variables and stochastic gradient descent for the continuous variables.

Regardless of the optimization process, we use a self-supervised approach. That is, the algorithm generates its own data and trains on it. A summary of the algorithm is given in Algorithm 2.

Algorithm 2 Training an LFE estimator
Require: Network, sampling routine for ${\bf q}$
  for $i = 0, \ldots$ do
    Sample a batch ${\bf q}$
    Use the forward problem to compute ${\bf d} = F({\bf q}) + {\bm{\epsilon}}$
    Use the network to recover $\widehat{\bf q}$
    Compute $\widehat{\bf d} = F(\widehat{\bf q})$
    Compute the loss in equation 10
    Update parameters to minimize the loss
  end for

The algorithm is self-trained in the sense that it continuously samples ${\bf q}$ and computes its corresponding data, and then uses the data in order to estimate ${\bf q}$. The algorithm requires a way to sample from the prior distribution of ${\bf q}$. For many parameter estimation problems such a distribution is readily available. In the examples below we sample the parameters from a log-normal distribution. The ability to generate an unlimited amount of data allows for a very robust training procedure in which over-fitting does not play a role.
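The self-supervised loop can be sketched on a toy problem in plain Python. This is a sketch under strong simplifying assumptions rather than the training code used here: the forward model is $\exp(-q t)$, the estimator is linear in the weighted data instead of a residual network, gradients are written by hand, the data misfit is a plain sum over the measurement times, the sparsity penalty is omitted, and the design weights are simply projected onto non-negative values after each step.

```python
import math
import random

TIMES = [0.25, 0.5, 1.0, 2.0]      # candidate measurement times (assumed)
SIGMA, GAMMA, LR = 0.02, 1.0, 0.01


def forward(q):
    """Toy forward problem F(q): y(t) = exp(-q t) at the candidate times."""
    return [math.exp(-q * t) for t in TIMES]


def train(n_iters=2000, batch=16, seed=0):
    rng = random.Random(seed)
    n = len(TIMES)
    # Linear stand-in for the network: q_hat = theta . (w * d) + b
    theta, b, w = [0.0] * n, 0.0, [1.0] * n
    history = []
    for _ in range(n_iters):
        g_theta, g_b, g_w, loss = [0.0] * n, 0.0, [0.0] * n, 0.0
        for _ in range(batch):
            q = rng.lognormvariate(0.0, 0.25)                     # sample q
            d = [f + rng.gauss(0.0, SIGMA) for f in forward(q)]   # d = F(q) + eps
            q_hat = b + sum(th * wi * di for th, wi, di in zip(theta, w, d))
            d_hat = forward(q_hat)                                # d_hat = F(q_hat)
            res = [fh - di for fh, di in zip(d_hat, d)]
            loss += 0.5 * (q_hat - q) ** 2 + GAMMA * 0.5 * sum(r * r for r in res)
            # dL/dq_hat, using dF_j/dq = -t_j * exp(-q t_j)
            g_q = (q_hat - q) + GAMMA * sum(
                r * (-t * fh) for r, t, fh in zip(res, TIMES, d_hat))
            g_b += g_q
            for j in range(n):
                g_theta[j] += g_q * w[j] * d[j]
                g_w[j] += g_q * theta[j] * d[j]
        b -= LR * g_b / batch
        for j in range(n):
            theta[j] -= LR * g_theta[j] / batch
            w[j] = max(0.0, w[j] - LR * g_w[j] / batch)   # keep weights non-negative
        history.append(loss / batch)
    return theta, b, w, history


theta, b, w, history = train()
```

Every batch is freshly sampled, so no sample is seen twice, and both the estimator parameters and the design weights are updated from the same loss in a single loop with no inner optimization problem.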

Algorithm 2 minimizes the loss with respect to the design variables ${\bf w}$ and the network parameters ${\bm{\theta}}$. However, the update of the design variables depends on their type. We now discuss the two approaches used to update the design variables.

4.1 Continuous Design Variables (Method 1)

Consider first the case where the entries of the design vector ${\bf w}$ can take any value in $\mathbb{R}_{\geq 0}$. In this case we seek to minimize the recovery loss $\ell_{T}({\bf w})$ while promoting sparsity in ${\bf w}$.

A simple way to achieve this is to use the soft shrink function Bruckstein et al. (2009). Here we modify the classical soft shrink function to preserve non-negativity. The modified soft shrink function is defined as

\begin{equation}
s_{\rho}(t,\rho)=\begin{cases}t-\rho & t>\rho\\ 0 & {\rm otherwise}\end{cases} \tag{14}
\end{equation}

This function is used during minimization in a straightforward manner (see Bruckstein et al. (2009) for convergence proofs). Using the stochastic gradient descent algorithm we update ${\bm{\theta}}$ in the standard way, but for ${\bf w}$ we use the update

\begin{equation}
{\bf w}\leftarrow s_{\rho}({\bf w}-\mu\delta{\bf w},\rho) \tag{15}
\end{equation}

Here the soft shrink function is applied element-wise, $\mu$ is the learning rate, and $\delta{\bf w}$ is the stochastic gradient of the loss with respect to ${\bf w}$, so that $-\delta{\bf w}$ is the descent direction used by the stochastic gradient descent algorithm.
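As a minimal sketch of equations 14 and 15, the update can be written in a few lines of plain Python. The numerical values of the weights, of `delta_w` (the quantity the text denotes $\delta{\bf w}$), of the learning rate and of $\rho$ are hypothetical and chosen only to show one weight being shrunk and another being set exactly to zero.

```python
def soft_shrink(t, rho):
    """Modified (non-negative) soft shrink of equation 14."""
    return t - rho if t > rho else 0.0

def update_design(w, delta_w, lr, rho):
    """One step of equation 15, applied element-wise."""
    return [soft_shrink(w_i - lr * g_i, rho) for w_i, g_i in zip(w, delta_w)]

w = [1.0, 0.2, 0.02]            # hypothetical current design weights
delta_w = [0.5, 1.0, 0.0]       # hypothetical value of delta w
w_new = update_design(w, delta_w, lr=0.1, rho=0.05)
# w_new is approximately [0.90, 0.05, 0.0]: the last weight is removed from the design
```

Unlike the classical soft shrink, this variant never returns a negative value, so the weights stay non-negative without a separate projection step.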

4.2 Binary ${\bf w}$ and Tabu Search (Method 2)

Using continuous weights ${\bf w}$ is attractive as it enables the use of continuous optimization techniques. Nonetheless, such an algorithm can yield many small weights, which raises the question of how important the data associated with those weights are. In many applications, a binary ${\bf w}$ makes more sense. In this case, ${\bf w}_{i}=1$ implies that the datum is measured, and ${\bf w}_{i}=0$ implies that it is not measured and is ignored. Minimizing the loss $\ell_{T}$ then becomes a mixed nonlinear binary programming problem, which is challenging. To approximate the solution of this problem we combine a Tabu search method (for the binary variables) with standard stochastic gradient descent (for the continuous variables). Tabu search methods are commonly used in binary programming Hertz et al. (1995). The algorithm uses the current iterate to define a set of neighbors. These neighbors are tested, and the algorithm moves to the neighbor with the lowest loss. The algorithm retains a list (the Tabu list) of visited points so that they are not revisited. Although the algorithm is very simple, it finds reasonable approximations to the minimum of binary programming problems (see Gendreau & Potvin (2005) for details). To use Tabu search within our optimization problem we incorporate it in a block coordinate descent algorithm, summarized in Algorithm 3.

Algorithm 3 Training LFE with binary variables
Require: Network
  Initialize ${\bf w}$ and ${\bm{\theta}}$
  for $i=0,\ldots,{\rm outer\_iter}$ do
    for $j=0,\ldots,{\rm inner\_iter}$ do
      Compute the loss in equation 10
      Compute gradients with respect to ${\bm{\theta}}$
      Update ${\bm{\theta}}$
    end for
    Set the Tabu list to empty.
    for $j=0,\ldots,{\rm inner\_iter}$ do
      Compute the neighbors for ${\bf w}$
      Compute the loss in equation 10 for all neighbors
      Update ${\bf w}$ to the neighbor with the lowest loss
      Update the Tabu list
    end for
  end for

Solving a mixed binary nonlinear programming problem is substantially more expensive than solving the problem with continuous variables. However, in our application we have found that a very small number of outer iterations usually suffices to obtain reasonable accuracy.
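The Tabu step of Algorithm 3 can be sketched on its own for a fixed network. The neighborhood used here (move one measurement to an unused time slot, which keeps the number of measurements fixed), the toy loss, and the `value` table are assumptions for illustration; the text does not specify how neighbors are defined, and in the paper's setting the loss would be the network loss of equation 10 rather than a closed-form expression.

```python
import itertools

def neighbors(w):
    """Designs reachable by moving one measurement to an unused time slot."""
    on = [i for i, v in enumerate(w) if v == 1]
    off = [i for i, v in enumerate(w) if v == 0]
    for i, j in itertools.product(on, off):
        n = list(w)
        n[i], n[j] = 0, 1
        yield tuple(n)

def tabu_search(loss, w0, n_iter=50, tabu_len=8):
    current = tuple(w0)
    best, best_loss = current, loss(current)
    tabu = [current]                          # recently visited designs
    for _ in range(n_iter):
        candidates = [n for n in neighbors(current) if n not in tabu]
        if not candidates:
            break
        current = min(candidates, key=loss)   # move even if the loss increases
        tabu = (tabu + [current])[-tabu_len:]
        current_loss = loss(current)
        if current_loss < best_loss:
            best, best_loss = current, current_loss
    return best, best_loss

value = [0.1, 0.9, 0.3, 0.8, 0.2, 0.4]        # hypothetical usefulness of each slot

def toy_loss(w):
    return 1.0 / sum(v * w_i for v, w_i in zip(value, w))

best, best_loss = tabu_search(toy_loss, (1, 0, 0, 0, 1, 0))
```

Accepting the best non-Tabu neighbor even when it is worse than the current design is what lets the search leave local minima, while the separately stored `best` ensures the returned design is never worse than the starting one.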

5 Numerical Experiments

In this section we study two different problems. The first problem is of practical importance to the field of nuclear medicine. It uses a "3-Tissue Compartment" (3-TC) model, an ODE system that belongs to a range of compartmental models used in kinetic modeling Morris et al. (2004); Bentourkia & Zaidi (2007), to estimate physiological parameters of interest from nuclear medicine imaging. The second is a non-linear system of differential equations that models the populations of predators and prey, known as the Lotka-Volterra predator-prey model (PPM). We now briefly describe these problems; a detailed implementation can be found in Appendix A.

5.1 3-Tissue Compartment Model

A multi-compartment model is a simplified (or reduced) mathematical model used to describe the movement of mass or energy between different compartments within a system. Such models are often used to obtain a reduced yet useful description of a complex system that is difficult to represent accurately. Each compartment is considered to be homogeneous, so that only the interactions between compartments need to be modeled. Compartment models have applications in several fields, including pharmacokinetics, epidemiology, systems theory, complexity theory, engineering, physics, and social science Morris et al. (2004); Zhang et al. (2022); Wagner (1969). Here we consider the 3-Tissue Compartment (3-TC) model Zakariaei et al. (2023) from the field of nuclear medicine and molecular imaging. The model is a system of ODEs and is an example of a compartmental model from kinetic modeling. The 3-TC model (Figure 6 in the Appendix) describes the kinetics of a pharmaceutical in tissue, and how its concentration changes with time in each of the compartments. The model is written as

\begin{align}
\frac{dP_{int}}{dt} &= P_{v}k_{1}-(k_{2}+k_{3})P_{int}+P_{b}k_{4} \tag{16a}\\
\frac{dP_{b}}{dt} &= P_{int}k_{3}-(k_{4}+k_{5})P_{b} \tag{16b}\\
\frac{dP_{intern}}{dt} &= P_{b}k_{5}-k_{6}P_{intern} \tag{16c}
\end{align}

The equations describe the dynamics of the three states $P_{int}$, $P_{b}$ and $P_{intern}$, which represent, respectively, the concentration of the radiopharmaceutical in the interstitial space between cells, bound to cell receptors, and internalized into the cells. $P_{v}$ represents the blood input function (source) describing the delivery of the radiopharmaceutical to the tissue by the vascular system, and is the instantaneous concentration of the radiopharmaceutical in the blood. This input function exhibits a sharp spike in concentration at early times. The dynamics of the model depend on the parameter vector ${\bf q}=[k_{1},\ldots,k_{6}]$. The parameters typically exhibit a log-normal prior distribution, that is

\begin{equation}
\ln({\bf q})\sim\mathcal{N}(\mu_{i},\sigma_{i}^{2})\quad\text{for }i=k_{1},k_{2},k_{3},k_{4},k_{5},k_{6} \tag{17}
\end{equation}
Figure 2: Synthetic Time Activity Curves for the 3-TC model at various (multiplicative) Gaussian noise levels. The vertical dotted lines correspond to an optimal data sampling scheme for $sparsity=6$ that minimizes $\ell_{T}({\bf w})$, obtained using binary design variables ${\bf w}$ (Method 2). The design weight vector ${\bf w}$ has the value $1$ at the optimal time points and $0$ at the others.

The parameters $\mu_{i},\sigma_{i}$ were calculated from Kletting et al. (2016), and the numbers used are given in Appendix A. Data collection is done as follows: a patient is injected with a radiopharmaceutical and is subsequently imaged with a PET/CT scanner. The imaging data comprise so-called Time-Activity-Curves (TACs), obtained from dynamic nuclear medicine imaging, that show the evolution of radioactivity concentration (kBq/ml) in the patient over time. We model these TAC signals as the sum of the contributions from the 3 tissue compartments, namely $P_{int}+P_{b}+P_{intern}$, with Gaussian noise levels $\sigma=0\%,1\%,2\%,\ldots,19\%$ (picked at random at each sampling). Imaging at numerous time points is costly, and an optimal design aims to identify the best times for data collection. An example of the noisy TACs used in our experiments, with the assumed multiplicative Gaussian noise at varying levels $\sigma$, is presented in Figure 2.
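To make the data model concrete, the following sketch integrates equations 16a to 16c with a classical Runge-Kutta scheme and forms a noisy TAC as the sum of the three compartments. The input function `p_v`, the rate constants `k`, the uniform time grid, and the specific multiplicative form $d = c\,(1+\sigma\epsilon)$ with standard normal $\epsilon$ are assumptions for illustration; the values actually used are given in Appendix A and are not reproduced here.

```python
import math
import random

def rhs_3tc(t, p, k, p_v):
    """Right-hand side of equations 16a-16c; p = (P_int, P_b, P_intern)."""
    k1, k2, k3, k4, k5, k6 = k
    p_int, p_b, p_intern = p
    return (
        p_v(t) * k1 - (k2 + k3) * p_int + p_b * k4,
        p_int * k3 - (k4 + k5) * p_b,
        p_b * k5 - k6 * p_intern,
    )

def rk4(rhs, p0, times, *args):
    """Classical fourth-order Runge-Kutta on a given (possibly non-uniform) grid."""
    states = [tuple(p0)]
    for t0, t1 in zip(times[:-1], times[1:]):
        h, p = t1 - t0, states[-1]
        s1 = rhs(t0, p, *args)
        s2 = rhs(t0 + h / 2, tuple(x + h / 2 * s for x, s in zip(p, s1)), *args)
        s3 = rhs(t0 + h / 2, tuple(x + h / 2 * s for x, s in zip(p, s2)), *args)
        s4 = rhs(t1, tuple(x + h * s for x, s in zip(p, s3)), *args)
        states.append(tuple(
            x + h / 6 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, s1, s2, s3, s4)))
    return states

def noisy_tac(k, times, sigma, rng, p_v):
    """TAC = P_int + P_b + P_intern with multiplicative Gaussian noise."""
    clean = [sum(p) for p in rk4(rhs_3tc, (0.0, 0.0, 0.0), times, k, p_v)]
    return [c * (1.0 + sigma * rng.gauss(0.0, 1.0)) for c in clean]

# Hypothetical input function and rate constants, for illustration only.
p_v = lambda t: 10.0 * t * math.exp(-t)
k = (0.1, 0.05, 0.02, 0.01, 0.005, 0.001)
times = [0.1 * i for i in range(201)]
tac = noisy_tac(k, times, sigma=0.05, rng=random.Random(0), p_v=p_v)
```

A design vector would then select or weight entries of `tac`; with multiplicative noise, samples taken where the signal is small are also perturbed less in absolute terms.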

Figure 3: A comparison of the performance of the estimator trained using continuous ${\bf w}$ (Method 1) and binary ${\bf w}$ (Method 2) for the 3-Tissue Compartment (3-TC) model. Networks trained for optimal designs ${\bf w}_{\rm opt}$ at the sparsities shown were evaluated for $\ell_{T}({\bf w})$ over $175K$ unseen samples of ${\bf q}$ (and corresponding ${\bf d}$). Error bars, which appear point-like, show the Standard Error of the Mean (SEM) of the mean risk $\ell_{T}({\bf w})$ for each network at its corresponding sparsity. See Appendix B for details.

5.2 Lotka-Volterra Predator-Prey Model

In our second set of experiments we consider the Predator-Prey Model (PPM), a system of non-linear differential equations that exhibits very different behaviour from the 3-TC model. While the 3-TC model is controlled by decay, the PPM is cyclical in nature. The model describes the dynamics of two species $x$ and $y$ and can be written as

\begin{equation}
\frac{dx}{dt}=\alpha x-\beta xy\quad\quad\frac{dy}{dt}=\delta xy-\gamma y \tag{18}
\end{equation}

equipped with some initial conditions. The model depends on the parameter vector ${\bf q}=[\alpha,\beta,\gamma,\delta]$. The parameter $\alpha$ denotes the maximum per capita growth rate of the prey, while $\beta$ is the rate at which prey are consumed by predators, reflecting the predator's impact on the prey population. For the predators, $\delta$ and $\gamma$ represent the growth rate through prey consumption and the per capita death rate, respectively. The model assumes that the prey have an unlimited food supply and reproduce exponentially unless preyed upon. Additionally, it assumes that prey are the sole food source for predators, with all environmental variables held constant.
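A short simulation shows the cyclical behaviour described above. The sketch integrates equation 18 with a classical Runge-Kutta scheme; the parameter values and initial populations are hypothetical and are not the fitted values used in the paper (those are listed in Appendix A).

```python
def lotka_volterra(state, alpha, beta, gamma, delta):
    """Right-hand side of equation 18 for state = (x, y) = (prey, predators)."""
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)

def simulate(state0, params, t_end=30.0, n_steps=3000):
    """Integrate with classical RK4 and return the list of (x, y) states."""
    h = t_end / n_steps
    f = lambda s: lotka_volterra(s, *params)
    traj = [tuple(state0)]
    for _ in range(n_steps):
        s = traj[-1]
        k1 = f(s)
        k2 = f(tuple(v + h / 2 * k for v, k in zip(s, k1)))
        k3 = f(tuple(v + h / 2 * k for v, k in zip(s, k2)))
        k4 = f(tuple(v + h * k for v, k in zip(s, k3)))
        traj.append(tuple(v + h / 6 * (a + 2 * b + 2 * c + d)
                          for v, a, b, c, d in zip(s, k1, k2, k3, k4)))
    return traj

params = (0.5, 0.02, 0.8, 0.01)    # hypothetical (alpha, beta, gamma, delta)
traj = simulate((40.0, 9.0), params)
prey = [x for x, _ in traj]        # only the prey population is observed
```

For these values the system has a non-trivial equilibrium at $x=\gamma/\delta=80$, $y=\alpha/\beta=25$, and trajectories started elsewhere in the positive quadrant cycle around it, which is why the choice of sampling times relative to the cycle matters for identifying the parameters.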

Figure 4: Synthetic prey population data ${\bf d}=x(t)$ for the Predator-Prey system at noise levels $\sigma=0\%,1\%,2\%,3\%,\ldots,10\%$. 200 sampled parameter sets $[\alpha,\beta,\gamma,\delta]$ were used to generate ${\bf d}$, which are plotted for $t=0$ to $t=30$ years. The four dotted lines indicate an optimal sampling scheme of $sparsity=4$ at $t=5.1,8.9,26.1,29.9$ years, obtained using continuous design variables ${\bf w}$ (Method 1).

To estimate the parameters, measurements of the quantities $x$ or $y$ (or both) are needed. In practice, it is difficult to measure both species, and therefore we assume that the experimentalist is only able to measure the population density of the prey. Using this data the goal is to infer the unknown parameters $\alpha$, $\beta$, $\delta$ and $\gamma$. In practice, it is also impossible to monitor $x$ or $y$ continuously. We therefore assume that we are given a budget for the measurement process. As for the 3-TC model, the experimental design parameters are the number of measurements and the times at which they need to be performed. Finally, in order to learn a likelihood-free estimator, we require samples of the parameters ${\bf q}$. In Mahaffy (2009), data on the populations of lynx and hares from the Hudson Bay Company are used to fit the four unknown parameters $\alpha$, $\beta$, $\gamma$ and $\delta$. To demonstrate our proposed methods, we assumed log-normal prior distributions with means equal to the fitted parameters and standard deviations of $5\%$ of the respective means:

\begin{equation}
\ln({\bf q})\sim\mathcal{N}(\mu_{i},\sigma_{i}^{2})\quad\text{for }i=\alpha,\beta,\gamma,\delta \tag{19}
\end{equation}

with numbers specified in Appendix A. Samples are drawn from this prior distribution to generate noisy data samples ${\bf d}$ (equation 1) for training. Gaussian noise of $0\%,\ldots,10\%$ was added to generate the synthetic noisy prey data ${\bf d}$ (see Figure 4).
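One way to build such a prior is by moment matching. The sketch below assumes that the stated mean and $5\%$ standard deviation refer to the parameter itself rather than to its logarithm, which the text does not make explicit; under that assumption, $\sigma^{2}=\ln(1+s^{2}/m^{2})$ and $\mu=\ln m-\sigma^{2}/2$ for a parameter with mean $m$ and standard deviation $s$. The values in `fitted` are hypothetical stand-ins, not the fitted values from Appendix A.

```python
import math
import random

def lognormal_params(mean, rel_std):
    """Return (mu, sigma) of ln(q) so that q has the given mean and std = rel_std * mean."""
    sigma2 = math.log(1.0 + rel_std ** 2)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)

def sample_prior(means, rel_std, rng):
    """Draw one parameter vector with independent log-normal entries."""
    return [rng.lognormvariate(*lognormal_params(m, rel_std)) for m in means]

rng = random.Random(0)
fitted = [0.5, 0.02, 0.8, 0.01]    # hypothetical stand-ins for the fitted values
batch = [sample_prior(fitted, 0.05, rng) for _ in range(3500)]
```

Sampling in log space keeps every drawn parameter strictly positive, which is required for rate constants in both ODE systems.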

Figure 5: A comparison of the performance of the estimator trained using continuous ${\bf w}$ (Method 1) and binary ${\bf w}$ (Method 2) for the Predator-Prey model (PPM). Networks trained for optimal designs ${\bf w}_{\rm opt}$ at the sparsities shown were evaluated for $\ell_{T}({\bf w})$ over $175K$ unseen samples of ${\bf q}$ (and corresponding ${\bf d}$). Error bars, which appear point-like, show the Standard Error of the Mean (SEM) of the mean risk $\ell_{T}({\bf w})$ for each network at its corresponding sparsity. See Appendix B for details.

5.3 Numerical Results

The network structure for the experiments is given in Algorithm 1, and was trained for continuous ${\bf w}$ (Section 4.1) and binary ${\bf w}$ (Section 4.2). Each ODE system was tested with the two methods. For the 3-TC model, ${\bf d}$ was generated from the sampled ${\bf q}$ with multiplicative Gaussian noise at levels $\sigma=0\%,1\%,\ldots,19\%$, and for the PPM at $\sigma=0\%,1\%,\ldots,10\%$. A logarithmically spaced time grid of size $400$ from $t=0$ to $t=10^{4}$ minutes was used for the 3-TC model, as it exhibits a spike at early times (see Appendix A). For the PPM system, 200 equally spaced time points were chosen from $t=0$ to $t=30$ years. We now describe the details of the trials.

Table 1: Total risk $\ell_{T}({\bf w}_{\rm opt})$ and parameter risk $\ell_{q}({\bf w}_{\rm opt})$ obtained from optimal designs for each ODE system at different sparsities, using both proposed methods. Corresponding averages $\overline{\ell_{T}({\bf w}_{\rm rand})}$ and $\overline{\ell_{q}({\bf w}_{\rm rand})}$ for 100 networks trained on random ${\bf w}$ are shown for comparison. See Appendix B for further details.

\begin{tabular}{llrcccc}
ODE & Method & Sparsity & $\ell_{T}({\bf w}_{\text{opt}})$ & $\overline{\ell_{T}({\bf w}_{\text{rand}})}$ & $\ell_{q}({\bf w}_{\text{opt}})$ & $\overline{\ell_{q}({\bf w}_{\text{rand}})}$ \\
\hline
3-TC & Cont. w & 2 & $6.18\times 10^{-2}$ & $1.46\times 10^{+0}$ & $3.81\times 10^{-2}$ & $8.94\times 10^{-2}$ \\
3-TC & Cont. w & 6 & $5.42\times 10^{-2}$ & $1.87\times 10^{-1}$ & $3.84\times 10^{-2}$ & $5.79\times 10^{-2}$ \\
3-TC & Cont. w & 10 & $5.29\times 10^{-2}$ & $8.73\times 10^{-2}$ & $3.84\times 10^{-2}$ & $4.35\times 10^{-2}$ \\
3-TC & TS (Binary w) & 2 & $5.78\times 10^{-2}$ & $8.07\times 10^{-2}$ & $3.82\times 10^{-2}$ & $3.83\times 10^{-2}$ \\
3-TC & TS (Binary w) & 6 & $5.29\times 10^{-2}$ & $5.81\times 10^{-2}$ & $3.80\times 10^{-2}$ & $3.82\times 10^{-2}$ \\
3-TC & TS (Binary w) & 10 & $5.22\times 10^{-2}$ & $5.55\times 10^{-2}$ & $3.82\times 10^{-2}$ & $3.84\times 10^{-2}$ \\
PPM & Cont. w & 2 & $8.96\times 10^{-3}$ & $2.07\times 10^{-2}$ & $1.07\times 10^{-3}$ & $1.38\times 10^{-3}$ \\
PPM & Cont. w & 4 & $6.45\times 10^{-3}$ & $1.43\times 10^{-2}$ & $5.45\times 10^{-4}$ & $1.11\times 10^{-3}$ \\
PPM & Cont. w & 10 & $4.72\times 10^{-3}$ & $9.18\times 10^{-3}$ & $4.27\times 10^{-4}$ & $5.54\times 10^{-4}$ \\
PPM & TS (Binary w) & 2 & $7.42\times 10^{-3}$ & $2.45\times 10^{-2}$ & $7.88\times 10^{-4}$ & $1.60\times 10^{-3}$ \\
PPM & TS (Binary w) & 4 & $5.87\times 10^{-3}$ & $1.19\times 10^{-2}$ & $5.73\times 10^{-4}$ & $8.18\times 10^{-4}$ \\
PPM & TS (Binary w) & 10 & $4.61\times 10^{-3}$ & $8.16\times 10^{-3}$ & $3.58\times 10^{-4}$ & $6.00\times 10^{-4}$ \\
\end{tabular}

For both training methods, $3500$ realizations of ${\bf q}$ were generated from the prior distributions at each iteration, for both ODE systems. Learning rates were chosen by trial and error.

We define sparsity as the number of weights higher than a threshold of $10^{-3}$. For Method 1, the design vector ${\bf w}$ was initialized as a vector of ones, ${\bf w}_{\rm init}=\mathbf{1}$. Training iterations were continued until the desired sparsity level was reached. Weights that were lower than the threshold at the end of this first phase were set to $0$, and a further $5000$ iterations were performed with the obtained ${\bf w}_{\rm opt}$ fixed, which concluded the training phase. For training using Algorithm 3 (Method 2) at each sparsity, initialization was found to be important for convergence to a ${\bf w}_{\rm opt}$ after $200$ Tabu iterations. Hyper-parameters such as the size of the subset of neighbours used and the Tabu list length were found by trial and error, and are $10$ and $8$ respectively for both ODEs. To initialize ${\bf w}$, a network was first trained on a random binary ${\bf w}$ at the chosen sparsity, and then used as a pre-trained model to train $100$ networks using random binary designs ${\bf w}_{\rm rand}$ (at fixed sparsity) for a smaller number of iterations. The random ${\bf w}_{\rm rand}$ that gave the network with the lowest risk $\ell_{T}({\bf w})$ was used to initialize the Tabu search routine of Algorithm 3 to obtain the final ${\bf w}_{\rm opt}$.
The two training methods that we propose as algorithms for fast and efficient experimental design were applied to each ODE system at a number of different sparsities. For each ODE system, method, and sparsity, the model was evaluated on $175K$ newly sampled ${\bf q}$'s (and corresponding ${\bf d}$'s generated as in equation 1). The results are given in Table 1. Higher sparsities (that is, more retained measurements under the definition above) are expected to give better parameter recoveries and data fits, lowering $\ell_{T}({\bf w})$, while the opposite is expected for lower sparsities. This trend is seen in the comparison of the two methods in Figure 5 for the PPM ODE system, and in Figure 3 for the 3-TC ODE system.
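The bookkeeping described above is simple enough to state in code. The sketch below counts sparsity with the $10^{-3}$ threshold, zeroes the sub-threshold weights before the final training phase, and computes the mean risk with its standard error over a list of per-sample losses. The function names are hypothetical, and the treatment of weights exactly equal to the threshold is an assumption, since the text only distinguishes weights above and below it.

```python
import math

def sparsity(w, threshold=1e-3):
    """Number of design weights above the threshold."""
    return sum(1 for w_i in w if w_i > threshold)

def prune(w, threshold=1e-3):
    """Set weights that are not above the threshold to zero."""
    return [w_i if w_i > threshold else 0.0 for w_i in w]

def mean_and_sem(losses):
    """Sample mean and standard error of the mean of per-sample losses."""
    n = len(losses)
    mean = sum(losses) / n
    var = sum((l - mean) ** 2 for l in losses) / (n - 1)
    return mean, math.sqrt(var / n)

w = [0.5, 5e-4, 0.0, 0.02]      # hypothetical design weights after the first phase
n_active = sparsity(w)          # 2 measurements are retained
w_fixed = prune(w)              # design held fixed for the final iterations
```

Because the standard error scales as $1/\sqrt{n}$, evaluating on a very large number of fresh samples makes it small relative to the mean, consistent with the point-like error bars noted in the figure captions.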

6 Conclusions

In this paper we have introduced a methodology for experimental design for parameter estimation using a likelihood-free estimator based on deep networks. Similar to other design methodologies, we require the availability of training data that sample the parameters to be recovered. Our methodology is self-supervised: it uses the sampled parameters to generate data under some experimental setting, and then uses the likelihood-free estimator to recover the parameters. Since likelihood-free estimators circumvent the need to solve an optimization problem for the parameters, the experimental design problem does not require the solution of a bilevel optimization problem. We have introduced two design criteria: in the first the data are multiplied by a continuous weight, and in the second by a binary weight. We experiment with these methods on two problems and show that it is possible to obtain an efficient design that can substantially reduce the cost of collecting the data used in parameter estimation.

7 Impact Statement

Differential equation models are ubiquitous in science and engineering research. Experimental design is commonly used, and better identification of parameters under resource or budgetary constraints can aid various disciplines, such as cancer therapy and controlling the propagation of disease. To our understanding, our work has no discernible negative ethical or social impacts.

8 Acknowledgements

We acknowledge the Canadian Institutes of Health Research (CIHR) Project Grant PJT-180251. We would like to thank Tamila Kalimullina for help with Figures 1 and 6, and Shadab Ahamed for technical support.

References

  • Allaire et al. (2001) Allaire, G., Aubry, S., and Jouve, F. Eigenfrequency optimization in optimal design. Comp. Meth. Appl. Mech. Eng., 190(28):3565–3579, 2001.
  • Atkinson & Donev (1992) Atkinson, A. C. and Donev, A. N. Optimum Experimental Designs. Oxford University Press, 1992.
  • Banga & Balsa-Canto (2008) Banga, J. R. and Balsa-Canto, E. Parameter estimation and optimal experimental design. Essays in biochemistry, 45:195–210, 2008.
  • Bardow (2008) Bardow, A. Optimal experimental design for ill-posed problems, the meter approach. Computers and chemical engineering, 32, 2008.
  • Bellman et al. (1966) Bellman, R., Kagiwada, H., and Kalaba, R. Inverse problems in ecology. J. Theor. Biol., 11(1):164–167, 1966.
  • Bentourkia & Zaidi (2007) Bentourkia, M. and Zaidi, H. Tracer kinetic modeling in pet. Pet Clinics, 2(2):267–277, 2007.
  • Betts (2001) Betts, J. Practical Methods for Optimal Control using Nonlinear Programming. Advances in Design and Control. SIAM, Philadelphia, 2001.
  • Björck (1996) Björck, Å. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
  • Bock et al. (2013) Bock, H. G., Körkel, S., and Schlöder, J. P. Parameter estimation and optimum experimental design for differential equation models. Model Based Parameter Estimation: Theory and Applications, pp. 1–30, 2013.
  • Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University press, 2004.
  • Bruckstein et al. (2009) Bruckstein, A., Donoho, D., and Elad, M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51:34–81, 2009.
  • Chaloner & Verdinelli (1995) Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Statis. Sci., 10:237–304, 1995.
  • Chung & Haber (2013) Chung, M. and Haber, E. Experimental design for biological systems. SIAM Journal on Control and Optimization, 50:471–489, 2013.
  • Didelot et al. (2011) Didelot, X., Everitt, R. G., Johansen, A. M., and Lawson, D. J. Likelihood-free estimation of model evidence. 2011.
  • Donoho (2006) Donoho, D. D. For most large underdetermined systems of linear equations the minimal $\ell_{1}$-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
  • Ette & Williams (2007) Ette, E. I. and Williams, P. (eds.). Pharmacometrics: The Science of Quantitative Pharmacology. Wiley-Interscience, New York, 2007.
  • Feng et al. (1994) Feng, D., Wang, X., and Yan, H. A computer simulation study on the input function sampling schedules in tracer kinetic modeling with positron emission tomography (pet). Computer Methods and Programs in Biomedicine, 45(3):175–186, 1994. ISSN 0169-2607. doi: https://doi.org/10.1016/0169-2607(94)90201-1. URL https://www.sciencedirect.com/science/article/pii/0169260794902011.
  • Gendreau & Potvin (2005) Gendreau, M. and Potvin, J.-Y. Tabu search. Search methodologies: introductory tutorials in optimization and decision support techniques, pp. 165–186, 2005.
  • Haber et al. (2008) Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems, 24, 2008.
  • Haber et al. (2009) Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of nonlinear ill-posed inverse problems. Inverse Problems, 25, 2009.
  • Helms (2008) Helms, V. (ed.). Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks. Wiley-VCH, New York, 2008.
  • Hertz et al. (1995) Hertz, A., Taillard, E., and De Werra, D. A tutorial on tabu search. In Proc. of Giornate di Lavoro AIRO, volume 95, pp. 13–24, 1995.
  • Hunt et al. (2007) Hunt, B. R., Kostelich, E. J., and Szunyogh, I. Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D, 230:112–126, 2007.
  • Jaiswal et al. (2020) Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., and Makedon, F. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
  • Keane et al. (2017) Keane, A., Krauskopf, B., and Postlethwaite, C. M. Climate models with delay differential equations. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(11), 2017.
  • Kletting et al. (2016) Kletting, P., Schuchardt, C., Kulkarni, H., Shahinfar, M., Singh, A., Glatting, G., Baum, R. P., and Beer, A. Investigating the effect of ligand amount and injected therapeutic activity: A simulation study for 177lu-labeled psma-targeting peptides. PloS one, 11:e0162303, 09 2016. doi: 10.1371/journal.pone.0162303.
  • Lenhart & Workman (2007) Lenhart, S. and Workman, J. (eds.). Optimal Control Applied to Biological Models (1st ed.). Chapman and Hall/CRC, New York, 2007.
  • Li et al. (2018) Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations, 2018.
  • Mahaffy (2009) Mahaffy, J. Qualitative analysis of 2D systems of Ordinary Differential Equations. https://jmahaffy.sdsu.edu/courses/f09/math636/lectures/lotka/qualde2.html, 2009. Accessed: Dec 25, 2023.
  • McLellan (1994) McLellan, P. A differential-algebraic perspective on nonlinear controller design methodologies. Chem. Eng. Science, 49:1663–1679, 1994.
  • Morris et al. (2004) Morris, E. D., Endres, C. J., Schmidt, K. C., Christian, B. T., Muzic, R. F., and Fisher, R. E. Kinetic modeling in positron emission tomography. Emission Tomography: The Fundamentals of PET and SPECT, 46(1):499–540, 2004.
  • Papamakarios (2019) Papamakarios, G. Neural density estimation and likelihood-free inference. arXiv preprint arXiv:1910.13233, 2019.
  • Pukelsheim (1993) Pukelsheim, F. Optimal design of experiments. John Wiley & Sons, 1993.
  • Sainsbury-Dale et al. (2022) Sainsbury-Dale, M., Zammit-Mangion, A., and Huser, R. Fast optimal estimation with intractable models using permutation-invariant neural networks. arXiv preprint arXiv:2208.12942, 2022.
  • Ushijima & Yeh (2015) Ushijima, T. T. and Yeh, W. W. Experimental design for estimating unknown hydraulic conductivity in an aquifer using a genetic algorithm and reduced order model. Advances in Water Resources, 86:193–208, 2015.
  • Wagner (1969) Wagner, J. G. Pharmacokinetics: 10. introduction to compartment models. Drug Intelligence & Clinical Pharmacy, 3(9):250–257, 1969. doi: 10.1177/106002806900300904. URL https://doi.org/10.1177/106002806900300904.
  • Wang et al. (2023) Wang, L., Martins, J. R., and Du, X. Optimal experimental design-based reduced order modeling for learning optimal aerodynamic designs. In AIAA AVIATION 2023 Forum, pp. 3716, 2023.
  • Yang et al. (2023) Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
  • Zakariaei et al. (2023) Zakariaei, N., Paranj, A. F., Abdollahi, H., and Rahmim, A. Using the cluster gauss newton algorithm to estimate theranostic pharmacokinetic model parameters. Journal of Nuclear Medicine, 64(supplement 1):P1449–P1449, 2023. ISSN 0161-5505. URL https://jnm.snmjournals.org/content/64/supplement_1/P1449.
  • Zenker et al. (2007) Zenker, S., Rubin, J., and Clermont, G. From inverse problems in mathematical physiology to quantitative differential diagnoses. PLOS Computational Biology, 3(11):1–15, 2007.
  • Zhai et al. (2019) Zhai, X., Oliver, A., Kolesnikov, A., and Beyer, L. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1476–1485, 2019.
  • Zhang et al. (2022) Zhang, P., Feng, K., Gong, Y., Lee, J., Lomonaco, S., and Zhao, L. Usage of compartmental models in predicting covid-19 outbreaks. The AAPS Journal, 24(5):98, 2022. ISSN 1550-7416. doi: 10.1208/s12248-022-00743-9. URL https://doi.org/10.1208/s12248-022-00743-9. PMID: 36056223.

Appendix A Appendix: Additional Details on ODE Models and Setup

Here we provide additional details on our experiments. For all experiments, the network structure given by Algorithm 1 used $nlayers=3$, and no layer normalization was used. The $SiLU$ function was used as the activation function. Further results of our experimental runs are given in the tables shown herein.

The samples ${\bf q}$ (and therefore the ${\bf d}$ generated from them) were drawn from a lognormal prior distribution for each of the parameters in ${\bf q}$, as mentioned in the main text (equation 17 and equation 19). The mean $\mu$ and standard deviation $\sigma$ parameters of a lognormal distribution are related to the mean $\mu_{q}$ and standard deviation $\sigma_{q}$ of the random variable $q$ as follows:

$$\mu=\ln\left(\frac{\mu_{q}^{2}}{\sqrt{\mu_{q}^{2}+\sigma_{q}^{2}}}\right),\quad\sigma^{2}=\ln\left(1+\frac{\sigma_{q}^{2}}{\mu_{q}^{2}}\right) \tag{20}$$

where $q$ in our case refers to each of the parameters being used in the corresponding ODE model.
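The conversion in equation 20 can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names, the seed, and the example values (a mean of 0.4 with a 5% standard deviation) are chosen here for the sketch and are not taken from our implementation.

```python
import math
import random

def lognormal_params(mean_q, std_q):
    """Convert the mean and standard deviation of q into the (mu, sigma)
    of the underlying normal distribution, following equation 20."""
    mu = math.log(mean_q**2 / math.sqrt(mean_q**2 + std_q**2))
    sigma = math.sqrt(math.log(1.0 + (std_q / mean_q) ** 2))
    return mu, sigma

def sample_parameter(mean_q, std_q, rng):
    """Draw one lognormal sample of q with the requested mean and std."""
    mu, sigma = lognormal_params(mean_q, std_q)
    return rng.lognormvariate(mu, sigma)

# Illustrative values only: a prior mean of 0.4 with a 5% standard deviation.
rng = random.Random(0)  # fixed seed, chosen only for reproducibility
mu, sigma = lognormal_params(0.4, 0.05 * 0.4)
samples = [sample_parameter(0.4, 0.05 * 0.4, rng) for _ in range(5)]
```

A quick sanity check is that $\exp(\mu+\sigma^{2}/2)$ recovers the requested mean $\mu_{q}$, which holds for the formulas above.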

A.1 Model Setup: 3-Tissue Compartment Model

A diagram of the 4 compartments in the 3-TC ODE model is provided in Figure 6. The source function $P_{v}$ is a known blood input function (prior to entry into tissue).

Figure 6: The three-tissue compartment model that describes radiopharmaceutical dynamics, containing six different parameters to be estimated. The compartment $P_{v}$ is a known source term. A PET scan allows the noisy measurement of $P_{int}+P_{b}+P_{intern}$.

In our experiments for the ODE system, we used the input function proposed in Feng et al. (1994), given by:

$$\begin{aligned}P_{v}(t)=\;&(A_{1}t-A_{2}-A_{3})\exp(-\lambda_{1}t)\\&+A_{2}\exp(-\lambda_{2}t)+A_{3}\exp(-\lambda_{3}t)\end{aligned} \tag{21}$$

We fit equation 21 to patient data available to us at our institution, arriving at $A_{1}=408.87$, $A_{2}=A_{3}=14.78$ and $\lambda_{1}=-8.46$, $\lambda_{2}=\lambda_{3}=-0.1362$, which we used in equation 21 to obtain our input function $P_{v}$. This input function exhibits a sharp spike in concentration at earlier times, and as such, we selected a log-spaced discretization of $400$ points for ${\bf t}$ from $t=0$ to $t=10^{4}$ minutes. For the initial conditions we set $P_{v}$, $P_{int}$, $P_{b}$ and $P_{intern}$ to $0$ ${\rm kBq/ml}$.
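A minimal Python sketch of equation 21 evaluated on a log-spaced grid is given below. Two points are assumptions made for the sketch rather than statements from the text: the decay rates are entered as magnitudes so that the exponentials decay (the fitted $\lambda$ values are quoted with negative signs), and, since a logarithmic grid cannot start at zero, $t=0$ is prepended to a grid that starts at a hypothetical smallest positive time `t_first`.

```python
import math

# Fitted coefficients quoted in the text.
A1, A2, A3 = 408.87, 14.78, 14.78
# ASSUMPTION: decay rates taken as magnitudes of the quoted (negative)
# values, so that exp(-lam * t) decays with time.
LAM1, LAM2, LAM3 = 8.46, 0.1362, 0.1362

def input_function(t):
    """Blood input function P_v(t) of equation 21, with t in minutes."""
    return ((A1 * t - A2 - A3) * math.exp(-LAM1 * t)
            + A2 * math.exp(-LAM2 * t)
            + A3 * math.exp(-LAM3 * t))

def log_spaced_grid(t_first, t_last, n_points):
    """Return n_points times: 0 followed by log-spaced positive entries.
    t_first is a hypothetical smallest positive time, not from the paper."""
    ratio = (t_last / t_first) ** (1.0 / (n_points - 2))
    return [0.0] + [t_first * ratio**i for i in range(n_points - 1)]

t_grid = log_spaced_grid(1e-3, 1e4, 400)
p_v = [input_function(t) for t in t_grid]
```

With this form the input function vanishes at $t=0$, consistent with the zero initial condition for $P_{v}$.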

The means $\mu_{i}$ and standard deviations $\sigma_{i}$, $i\in\{k_{1},k_{2},k_{3},k_{4},k_{5},k_{6}\}$, used for the prior distributions of the parameters were $\mu_{k_{1}}=1.5\times 10^{-2}$, $\mu_{k_{2}}=1.6\times 10^{-3}$, $\mu_{k_{3}}=121.0$, $\mu_{k_{4}}=4\times 10^{-2}$, $\mu_{k_{5}}=1\times 10^{-3}$, $\mu_{k_{6}}=2\times 10^{-4}$, and $\sigma_{k_{1}}=0.2\mu_{k_{1}}$, $\sigma_{k_{2}}=0.2\mu_{k_{2}}$, $\sigma_{k_{3}}=0.2\mu_{k_{3}}$, $\sigma_{k_{4}}=0.2\mu_{k_{4}}$, $\sigma_{k_{5}}=0.2\mu_{k_{5}}$, $\sigma_{k_{6}}=0.2\mu_{k_{6}}$ respectively. As can be seen, we took the standard deviations of the parameters to be $20\%$ of the means. The noise level $\sigma$ added to the generated ${\bf d}$ was randomly picked at each sampling from $\{0\%,1\%,2\%,\dots,19\%\}$.

A.2 PPM Model Setup

For the Predator-Prey Model described in Section 5, we used an equally spaced time grid of 200 points from $t=0$ to $t=30$ years. In Mahaffy (2009), data on the populations of lynx and hares from the Hudson Bay Company is used to fit the four unknown parameters $\alpha$, $\beta$, $\gamma$ and $\delta$, with initial populations of hares (prey) and lynx (predator) of 30 and 4 (in units of thousands) respectively, which we also used. We took the fitted parameters obtained by Mahaffy (2009) as the means of the respective prior distributions of the parameters $\alpha$, $\beta$, $\gamma$ and $\delta$, and $5\%$ of the means as their respective standard deviations. The means and standard deviations of the parameters for their respective prior distributions were: $(\mu_{\alpha},\sigma_{\alpha})=(0.4,0.05\times 0.4)$, $(\mu_{\beta},\sigma_{\beta})=(0.018,0.05\times 0.018)$, $(\mu_{\gamma},\sigma_{\gamma})=(0.8,0.05\times 0.8)$, $(\mu_{\delta},\sigma_{\delta})=(0.023,0.05\times 0.023)$. The noise level $\sigma$ added to the generated ${\bf d}$ was randomly picked at each sampling from $\{0\%,1\%,2\%,\dots,10\%\}$.
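To illustrate the setup, the sketch below integrates a predator-prey system on the 200-point grid at the prior means. It assumes the classical Lotka-Volterra form, $\dot{x}=\alpha x-\beta xy$ for the prey and $\dot{y}=-\gamma y+\delta xy$ for the predator, which may differ from the exact form used in Section 5. The fixed-step Runge-Kutta integrator is likewise a stand-in chosen to keep the example dependency-free, not the solver used in our experiments.

```python
import math

def lotka_volterra(state, alpha, beta, gamma, delta):
    """Right-hand side of the (assumed) classical predator-prey system."""
    prey, predator = state
    return (alpha * prey - beta * prey * predator,
            -gamma * predator + delta * prey * predator)

def rk4_solve(rhs, state0, t_grid, params):
    """Integrate with the classical fourth-order Runge-Kutta scheme."""
    states = [tuple(state0)]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = t1 - t0
        s = states[-1]
        k1 = rhs(s, *params)
        k2 = rhs(tuple(x + 0.5 * h * k for x, k in zip(s, k1)), *params)
        k3 = rhs(tuple(x + 0.5 * h * k for x, k in zip(s, k2)), *params)
        k4 = rhs(tuple(x + h * k for x, k in zip(s, k3)), *params)
        states.append(tuple(
            x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(s, k1, k2, k3, k4)))
    return states

# 200 equally spaced time points on [0, 30] years, as in the text.
t_grid = [30.0 * i / 199 for i in range(200)]
prior_means = (0.4, 0.018, 0.8, 0.023)  # (alpha, beta, gamma, delta)
# Initial populations in thousands: 30 hares (prey) and 4 lynx (predator).
trajectory = rk4_solve(lotka_volterra, (30.0, 4.0), t_grid, prior_means)
```

For this assumed form the quantity $\delta x-\gamma\ln x+\beta y-\alpha\ln y$ is conserved along exact solutions, which gives a simple check on the numerical trajectory.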

A.3 Scaling of Risks

The loss function in equation 7 is a sum of the data risk $\ell_{d}(\omega)$ and the parameter risk $\ell_{q}(\omega)$. For the data risk $\ell_{d}(\omega)$ we used the trapezoidal rule for variable time widths (owing to our use of logarithmically spaced time points). Note that the risks $\ell(\omega)$ reported in our results and used during training include the scaling $k$ shown below in equation 22; the discretized form of this equation was used during training.

$$\begin{aligned}\ell_{d}(\omega)&={\mathbb{E}}_{{\bf q},{\bm{\epsilon}}}\,\frac{1}{2}\int_{0}^{T}k\cdot\|F(\widehat{\bf q}_{i},{\bm{\omega}})-{\bf d}_{i}(t)\|^{2}\,dt\\k&=\frac{1}{\int_{0}^{T}\|{\bf d}_{i}(t)\|^{2}\,dt}\end{aligned} \tag{22}$$

where $k$ is a scaling factor. Similarly, for the parameter recovery risk $\ell_{q}(\omega)$:

$$\ell_{q}({\bm{\omega}})={\mathbb{E}}_{{\bf q},{\bm{\epsilon}}}\,\frac{1}{2}\cdot k\cdot\|\widehat{\bf q}-{\bf q}\|^{2} \tag{23}$$

$$k=\frac{1}{\|{\bf q}\|^{2}} \tag{24}$$
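The scaled risks can be sketched for a single sample as follows, with the integrals replaced by the trapezoidal rule on a non-uniform grid. This is a simplified illustration that assumes a scalar signal $d(t)$ and plain Python lists; the function names and the toy numbers are hypothetical. The expectation in equations 22 and 23 would then be approximated by averaging these per-sample values over a batch.

```python
def trapezoid(values, t_grid):
    """Trapezoidal rule on a possibly non-uniform time grid."""
    return sum(0.5 * (t1 - t0) * (v0 + v1)
               for t0, t1, v0, v1 in zip(t_grid[:-1], t_grid[1:],
                                         values[:-1], values[1:]))

def scaled_data_risk(predicted, observed, t_grid):
    """Per-sample data risk of equation 22 for a scalar signal d(t)."""
    misfit = [(p - d) ** 2 for p, d in zip(predicted, observed)]
    energy = [d ** 2 for d in observed]
    k = 1.0 / trapezoid(energy, t_grid)  # scaling by the data energy
    return 0.5 * k * trapezoid(misfit, t_grid)

def scaled_parameter_risk(q_hat, q_true):
    """Per-sample parameter risk of equations 23 and 24."""
    k = 1.0 / sum(q ** 2 for q in q_true)  # scaling by ||q||^2
    return 0.5 * k * sum((a - b) ** 2 for a, b in zip(q_hat, q_true))

# Toy check: a uniform 10% overshoot gives a risk of 0.5 * 0.1**2 = 0.005.
t_grid = [0.0, 0.1, 0.3, 1.0, 3.0]
observed = [1.0, 2.0, 1.5, 0.7, 0.2]
predicted = [1.1 * d for d in observed]
data_risk = scaled_data_risk(predicted, observed, t_grid)
param_risk = scaled_parameter_risk([1.1, 2.2], [1.0, 2.0])
```

Because of the scaling $k$, both risks are relative quantities, so a given percentage error yields the same value regardless of the magnitude of the data or of the parameters.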

Appendix B Appendix: Numerical Details of Figures and Tables

In Figure 3 and Figure 5 we exhibit the performance of networks trained using the proposed methods at different sparsities. For each training method at each sparsity, a new batch of $175K$ samples of ${\bf q}$ (and corresponding noisy ${\bf d}$) was drawn for the trained optimal network to be evaluated on. For the computation of $\ell_{T}({\bf w})$, the network was evaluated on $50$ sets of $3500$ newly sampled ${\bf q}$ and ${\bf d}$, amounting to an evaluation over $175K$ unseen samples. The mean value of the total risk $\ell_{T}({\bf w})$ of the optimal network over these $50$ sets was computed and taken as $\ell_{T}({\bf w}_{\rm opt})$, and the Standard Error of the Mean (SEM) for this value was calculated. The $\ell_{T}({\bf w}_{\rm opt})$ are plotted against $sparsity$ (number of non-zero elements in ${\bf w}$) in Figure 3 and in Figure 5. The corresponding numerical values are presented in Table 2. Note that in Figure 3 and Figure 5 the error bars look point-like because the SEM($\ell_{T}({\bf w})$) values are a factor of $\sim 10^{-2}$ smaller than the plotted values of $\ell_{T}({\bf w})$ (see Table 2). The error bars in the figures were calculated as:

$$\log_{10}\left(\ell_{T}({\bf w})+\text{SEM}(\ell_{T}({\bf w}))\right)-\log_{10}\left(\ell_{T}({\bf w})\right) \tag{25}$$

The corresponding numerical values for $\ell_{T}({\bf w}_{\rm opt})$ and SEM($\ell_{T}({\bf w}_{\rm opt})$) are in Table 2.
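The mean, SEM and log-scale error bar of equation 25 can be computed as in the following sketch. It assumes the SEM is the sample standard deviation (with the $n-1$ normalization) of the per-set risks divided by $\sqrt{n}$; the five risk values are made up for illustration, whereas the evaluation described above uses 50 sets.

```python
import math
import statistics

def mean_and_sem(set_risks):
    """Mean risk over evaluation sets and its standard error."""
    mean = statistics.fmean(set_risks)
    # ASSUMPTION: sample standard deviation (n - 1 normalization).
    sem = statistics.stdev(set_risks) / math.sqrt(len(set_risks))
    return mean, sem

def log10_error_bar(mean, sem):
    """Upper error-bar length on a log10 axis, as in equation 25."""
    return math.log10(mean + sem) - math.log10(mean)

# Hypothetical per-set risks, one value per evaluation set.
set_risks = [0.064, 0.066, 0.063, 0.065, 0.062]
mean, sem = mean_and_sem(set_risks)
bar = log10_error_bar(mean, sem)
```

When the SEM is much smaller than the mean, the bar length is approximately $\text{SEM}/(\ell_{T}\ln 10)$, which is why the bars appear point-like on a logarithmic axis.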

We chose to conduct numerical experiments to compare the risks obtained by our methods to random designs for a subset of sparsities. For the 3-TC model, this was at $sparsity=2,6,10,20$, and at $sparsity=2,4,10,20$ for the PPM system of equations.

Namely, the risks β„“T⁒(𝐰)subscriptℓ𝑇𝐰\ell_{T}({\bf w})roman_β„“ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_w ), β„“q⁒(𝐰)subscriptβ„“π‘žπ°\ell_{q}({\bf w})roman_β„“ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( bold_w ) obtained by training optimal networks for a given method and sparsity are compared against networks trained on random 𝐰𝐰{\bf w}bold_w. In Table 1, Table 3 and Table 5, for each ODE model, method, and sparsity, 100 networks were trained on random designs 𝐰r⁒a⁒n⁒dsubscriptπ°π‘Ÿπ‘Žπ‘›π‘‘{\bf w}_{rand}bold_w start_POSTSUBSCRIPT italic_r italic_a italic_n italic_d end_POSTSUBSCRIPT for comparison. For the random runs for continuous 𝐰𝐰{\bf w}bold_w, the entries of 𝐰randsubscript𝐰rand{\bf w}_{\rm rand}bold_w start_POSTSUBSCRIPT roman_rand end_POSTSUBSCRIPT were sampled from a uniform distribution 𝐰rand∼Uniform⁒(0,2)similar-tosubscript𝐰randUniform02{{\bf w}}_{\rm{rand}}\sim\text{Uniform}(0,2)bold_w start_POSTSUBSCRIPT roman_rand end_POSTSUBSCRIPT ∼ Uniform ( 0 , 2 ). For binary design variables, 𝐰r⁒a⁒n⁒dsubscriptπ°π‘Ÿπ‘Žπ‘›π‘‘{\bf w}_{rand}bold_w start_POSTSUBSCRIPT italic_r italic_a italic_n italic_d end_POSTSUBSCRIPT had a random subset of weights (equal to s⁒p⁒a⁒r⁒s⁒i⁒t⁒yπ‘ π‘π‘Žπ‘Ÿπ‘ π‘–π‘‘π‘¦sparsityitalic_s italic_p italic_a italic_r italic_s italic_i italic_t italic_y) set to 1111, with the rest set to 00. For the computation of each β„“T⁒(𝐰opt)subscriptℓ𝑇subscript𝐰opt\ell_{T}({\bf w}_{\rm opt})roman_β„“ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_w start_POSTSUBSCRIPT roman_opt end_POSTSUBSCRIPT ), the optimal network was evaluated on 50505050 sets of 3500350035003500 newly sampled πͺπͺ{\bf q}bold_q and 𝐝𝐝{\bf d}bold_d, amounting to an evaluation over 175⁒K175𝐾175K175 italic_K unseen samples. 
A mean value for the risk of the optimal network over these samples was computed and taken as β„“T⁒(𝐰)optsubscriptℓ𝑇subscript𝐰opt\ell_{T}({\bf w})_{\rm opt}roman_β„“ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_w ) start_POSTSUBSCRIPT roman_opt end_POSTSUBSCRIPT, and the Standard Error of the Mean for this value was calculated. This is given as the SEM(β„“T⁒(𝐰)optsubscriptℓ𝑇subscript𝐰opt\ell_{T}({\bf w})_{\rm opt}roman_β„“ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_w ) start_POSTSUBSCRIPT roman_opt end_POSTSUBSCRIPT) column in Table 3 and Table 2. The parameter risks β„“q⁒(𝐰opt)subscriptβ„“π‘žsubscript𝐰opt\ell_{q}({\bf w}_{\rm opt})roman_β„“ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( bold_w start_POSTSUBSCRIPT roman_opt end_POSTSUBSCRIPT ) for the optimal networks were calculated similarly. Each of the 100100100100 networks trained on random designs 𝐰randsubscript𝐰rand{\bf w}_{\rm rand}bold_w start_POSTSUBSCRIPT roman_rand end_POSTSUBSCRIPT were evaluated for β„“T⁒(𝐰rand)subscriptℓ𝑇subscript𝐰rand\ell_{T}({\bf w}_{\rm rand})roman_β„“ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_w start_POSTSUBSCRIPT roman_rand end_POSTSUBSCRIPT ) on 70⁒K70𝐾70K70 italic_K new samples of πͺπͺ{\bf q}bold_q and 𝐝𝐝{\bf d}bold_d. The STD(β„“T⁒(𝐰rand)subscriptℓ𝑇subscript𝐰rand\ell_{T}({\bf w}_{\rm rand})roman_β„“ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_w start_POSTSUBSCRIPT roman_rand end_POSTSUBSCRIPT )) and STD(β„“q(𝐰rand))\ell_{q}({\bf w}_{\rm rand}))roman_β„“ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( bold_w start_POSTSUBSCRIPT roman_rand end_POSTSUBSCRIPT ) ) columns in Table 3 and Table 5 are the standard deviations of the corresponding risks among the 100 networks trained on random 𝐰r⁒a⁒n⁒dsubscriptπ°π‘Ÿπ‘Žπ‘›π‘‘{\bf w}_{rand}bold_w start_POSTSUBSCRIPT italic_r italic_a italic_n italic_d end_POSTSUBSCRIPT for each corresponding entry in the tables. 
In Table 4 we confirm that the total risks $\ell_{T}({\bf w}_{\rm opt})$ of the trained optimal networks are lower than the risk obtained by the best performing network trained on random ${\bf w}_{\rm rand}$ for each entry.
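The evaluation protocol described above can be illustrated with a minimal Python sketch. The function names, the number of candidate measurement times `n_times = 40`, and the per-set risks are hypothetical placeholders and are not taken from the experiments. The sketch also assumes that the SEM is computed over the 50 per-set mean risks, which the text does not state explicitly, and it does not impose sparsity on the continuous random design.

```python
import math
import random
import statistics


def sample_continuous_design(n_times, rng):
    """Random continuous design: every entry drawn from Uniform(0, 2)."""
    return [rng.uniform(0.0, 2.0) for _ in range(n_times)]


def sample_binary_design(n_times, sparsity, rng):
    """Random binary design with exactly `sparsity` entries set to 1."""
    active = set(rng.sample(range(n_times), sparsity))
    return [1 if i in active else 0 for i in range(n_times)]


def mean_and_sem(set_risks):
    """Mean risk and its standard error over the per-set risks (assumed protocol)."""
    mean = statistics.fmean(set_risks)
    sem = statistics.stdev(set_risks) / math.sqrt(len(set_risks))
    return mean, sem


rng = random.Random(0)
w_cont = sample_continuous_design(n_times=40, rng=rng)        # n_times is hypothetical
w_bin = sample_binary_design(n_times=40, sparsity=6, rng=rng)

# Synthetic stand-ins for the risks of 50 evaluation sets of 3500 samples each.
set_risks = [rng.gauss(0.05, 0.001) for _ in range(50)]
risk_mean, risk_sem = mean_and_sem(set_risks)
print(f"mean risk = {risk_mean:.4e}, SEM = {risk_sem:.2e}")
```

With real data, `set_risks` would hold the risk of the trained network on each of the 50 freshly sampled sets of ${\bf q}$ and ${\bf d}$.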

Table 2: Total risks $\ell_{T}({\bf w}_{\rm opt})$ obtained by networks trained for optimal designs ${\bf w}_{\rm opt}$, evaluated on $175$K unseen samples of ${\bf q}$ and ${\bf d}$. The Standard Error of the Mean (SEM) for each is also presented.
ODE Method Sparsity $\ell_{T}({\bf w}_{\rm opt})$ SEM($\ell_{T}({\bf w}_{\rm opt})$)
3-TC Cont. w 1 $6.44\times 10^{-2}$ $1.42\times 10^{-4}$
3-TC Cont. w 2 $6.18\times 10^{-2}$ $1.81\times 10^{-4}$
3-TC Cont. w 4 $5.48\times 10^{-2}$ $1.67\times 10^{-4}$
3-TC Cont. w 6 $5.42\times 10^{-2}$ $1.66\times 10^{-4}$
3-TC Cont. w 10 $5.29\times 10^{-2}$ $1.47\times 10^{-4}$
3-TC Cont. w 15 $5.17\times 10^{-2}$ $1.23\times 10^{-4}$
3-TC Cont. w 20 $5.125\times 10^{-2}$ $1.07\times 10^{-4}$
3-TC Cont. w 30 $5.120\times 10^{-2}$ $1.60\times 10^{-4}$
3-TC TS (Binary w) 1 $6.28\times 10^{-2}$ $1.43\times 10^{-4}$
3-TC TS (Binary w) 2 $5.78\times 10^{-2}$ $1.77\times 10^{-4}$
3-TC TS (Binary w) 4 $5.48\times 10^{-2}$ $1.51\times 10^{-4}$
3-TC TS (Binary w) 6 $5.29\times 10^{-2}$ $1.52\times 10^{-4}$
3-TC TS (Binary w) 10 $5.22\times 10^{-2}$ $1.48\times 10^{-4}$
3-TC TS (Binary w) 15 $5.17\times 10^{-2}$ $1.19\times 10^{-4}$
3-TC TS (Binary w) 20 $5.14\times 10^{-2}$ $1.22\times 10^{-4}$
3-TC TS (Binary w) 30 $5.11\times 10^{-2}$ $1.55\times 10^{-4}$
PPM Cont. w 1 $1.49\times 10^{-2}$ $3.28\times 10^{-5}$
PPM Cont. w 2 $8.96\times 10^{-3}$ $1.88\times 10^{-5}$
PPM Cont. w 4 $6.45\times 10^{-3}$ $1.74\times 10^{-5}$
PPM Cont. w 6 $5.83\times 10^{-3}$ $1.32\times 10^{-5}$
PPM Cont. w 10 $4.72\times 10^{-3}$ $9.76\times 10^{-6}$
PPM Cont. w 15 $4.60\times 10^{-3}$ $1.29\times 10^{-5}$
PPM Cont. w 20 $4.28\times 10^{-3}$ $1.14\times 10^{-5}$
PPM Cont. w 30 $4.08\times 10^{-3}$ $8.24\times 10^{-6}$
PPM TS (Binary w) 1 $1.32\times 10^{-2}$ $3.30\times 10^{-5}$
PPM TS (Binary w) 2 $7.42\times 10^{-3}$ $1.17\times 10^{-5}$
PPM TS (Binary w) 4 $5.87\times 10^{-3}$ $1.24\times 10^{-5}$
PPM TS (Binary w) 6 $5.12\times 10^{-3}$ $1.25\times 10^{-5}$
PPM TS (Binary w) 10 $4.61\times 10^{-3}$ $9.44\times 10^{-6}$
PPM TS (Binary w) 15 $4.35\times 10^{-3}$ $1.11\times 10^{-5}$
PPM TS (Binary w) 20 $4.15\times 10^{-3}$ $6.90\times 10^{-6}$
PPM TS (Binary w) 30 $4.09\times 10^{-3}$ $9.11\times 10^{-6}$

Table 3: Experimental results for total risks $\ell_{T}({\bf w})$. The $\ell_{T}({\bf w})$ were computed over $175$K unseen samples for each ODE system at the given sparsity values. The corresponding Standard Errors of the Mean (SEM) are shown. Also given are the mean total risks $\overline{\ell_{T}({\bf w}_{\rm rand})}$ obtained by 100 networks trained on random designs ${\bf w}_{\rm rand}$. The standard deviations of $\ell_{T}({\bf w}_{\rm rand})$ and the percentage difference of $\ell_{T}({\bf w}_{\rm opt})$ from the random runs are similarly presented. SP refers to the sparsity of the corresponding row.
ODE Method SP $\ell_{T}({\bf w}_{\rm opt})$ SEM($\ell_{T}({\bf w}_{\rm opt})$) $\overline{\ell_{T}({\bf w}_{\rm rand})}$ $\frac{\ell_{T}({\bf w}_{\rm opt})-\overline{\ell_{T}({\bf w}_{\rm rand})}}{\overline{\ell_{T}({\bf w}_{\rm rand})}}\times 100\%$ STD($\ell_{T}({\bf w}_{\rm rand})$)
3-TC Cont. w 2 $6.18\times 10^{-2}$ $1.81\times 10^{-4}$ $1.46\times 10^{0}$ $-95.78\%$ $2.05\times 10^{0}$
3-TC Cont. w 6 $5.42\times 10^{-2}$ $1.66\times 10^{-4}$ $1.87\times 10^{-1}$ $-71.10\%$ $2.88\times 10^{-1}$
3-TC Cont. w 10 $5.29\times 10^{-2}$ $1.47\times 10^{-4}$ $8.73\times 10^{-2}$ $-39.43\%$ $1.49\times 10^{-1}$
3-TC Cont. w 20 $5.125\times 10^{-2}$ $1.23\times 10^{-4}$ $5.98\times 10^{-2}$ $-14.30\%$ $2.44\times 10^{-3}$
3-TC TS (Binary w) 2 $5.78\times 10^{-2}$ $1.77\times 10^{-4}$ $8.07\times 10^{-2}$ $-28.37\%$ $8.69\times 10^{-3}$
3-TC TS (Binary w) 6 $5.29\times 10^{-2}$ $1.52\times 10^{-4}$ $5.81\times 10^{-2}$ $-9.06\%$ $2.19\times 10^{-3}$
3-TC TS (Binary w) 10 $5.22\times 10^{-2}$ $1.48\times 10^{-4}$ $5.55\times 10^{-2}$ $-5.92\%$ $1.43\times 10^{-3}$
3-TC TS (Binary w) 20 $5.14\times 10^{-2}$ $1.22\times 10^{-4}$ $5.35\times 10^{-2}$ $-3.76\%$ $7.66\times 10^{-4}$
PPM Cont. w 2 $8.96\times 10^{-3}$ $1.88\times 10^{-5}$ $2.07\times 10^{-2}$ $-56.72\%$ $7.46\times 10^{-3}$
PPM Cont. w 4 $6.45\times 10^{-3}$ $1.74\times 10^{-5}$ $1.43\times 10^{-2}$ $-54.89\%$ $8.62\times 10^{-3}$
PPM Cont. w 10 $4.72\times 10^{-3}$ $9.76\times 10^{-6}$ $9.18\times 10^{-3}$ $-48.59\%$ $2.33\times 10^{-3}$
PPM Cont. w 20 $4.28\times 10^{-3}$ $1.14\times 10^{-5}$ $7.16\times 10^{-3}$ $-40.22\%$ $3.14\times 10^{-3}$
PPM TS (Binary w) 2 $7.42\times 10^{-3}$ $1.17\times 10^{-5}$ $2.45\times 10^{-2}$ $-69.77\%$ $8.93\times 10^{-3}$
PPM TS (Binary w) 4 $5.87\times 10^{-3}$ $1.24\times 10^{-5}$ $1.19\times 10^{-2}$ $-50.89\%$ $4.76\times 10^{-3}$
PPM TS (Binary w) 10 $4.61\times 10^{-3}$ $9.44\times 10^{-6}$ $8.16\times 10^{-3}$ $-43.51\%$ $5.82\times 10^{-3}$
PPM TS (Binary w) 20 $4.15\times 10^{-3}$ $6.90\times 10^{-6}$ $5.63\times 10^{-3}$ $-26.25\%$ $2.16\times 10^{-3}$
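The percentage-difference column of Table 3 can be checked by hand. As a worked example, take the first row (3-TC, continuous ${\bf w}$, sparsity 2) and assume the tabulated values are rounded to three significant figures:

```latex
\frac{\ell_{T}({\bf w}_{\rm opt})-\overline{\ell_{T}({\bf w}_{\rm rand})}}
     {\overline{\ell_{T}({\bf w}_{\rm rand})}}\times 100\%
= \frac{6.18\times 10^{-2}-1.46}{1.46}\times 100\%
\approx -95.77\%
```

This agrees with the tabulated $-95.78\%$ up to rounding of the inputs. A negative value means that the network trained with the optimized design attains a lower risk than the average over the random designs.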
Table 4: Performance comparison between the total risks $\ell_{T}({\bf w}_{\rm opt})$ obtained by evaluating optimal networks and the minimum total risk $\min(\ell_{T}({\bf w}_{\rm rand}))$, i.e. the risk of the best performing network among the $100$ networks that were trained on random designs ${\bf w}_{\rm rand}$.
ODE Method Sparsity $\ell_{T}({\bf w}_{\rm opt})$ $\min(\ell_{T}({\bf w}_{\rm rand}))$ $\frac{\ell_{T}({\bf w}_{\rm opt})-\min(\ell_{T}({\bf w}_{\rm rand}))}{\min(\ell_{T}({\bf w}_{\rm rand}))}\times 100\%$
3-TC Cont. w 2 $6.18\times 10^{-2}$ $1.35\times 10^{-1}$ $-54.38\%$
3-TC Cont. w 6 $5.42\times 10^{-2}$ $6.28\times 10^{-2}$ $-13.80\%$
3-TC Cont. w 10 $5.29\times 10^{-2}$ $5.32\times 10^{-2}$ $-0.62\%$
3-TC Cont. w 20 $5.11\times 10^{-2}$ $5.34\times 10^{-2}$ $-4.36\%$
3-TC TS (Binary w) 2 $5.78\times 10^{-2}$ $5.49\times 10^{-2}$ $-5.60\%$
3-TC TS (Binary w) 6 $5.29\times 10^{-2}$ $5.49\times 10^{-2}$ $-3.65\%$
3-TC TS (Binary w) 10 $5.22\times 10^{-2}$ $5.32\times 10^{-2}$ $-1.72\%$
3-TC TS (Binary w) 20 $5.14\times 10^{-2}$ $5.20\times 10^{-2}$ $-0.99\%$
PPM Cont. w 2 $8.96\times 10^{-3}$ $9.23\times 10^{-3}$ $-3.01\%$
PPM Cont. w 4 $6.45\times 10^{-3}$ $6.94\times 10^{-3}$ $-7.02\%$
PPM Cont. w 10 $4.72\times 10^{-3}$ $6.01\times 10^{-3}$ $-21.53\%$
PPM Cont. w 20 $4.28\times 10^{-3}$ $4.91\times 10^{-3}$ $-12.83\%$
PPM TS (Binary w) 2 $7.42\times 10^{-3}$ $1.14\times 10^{-2}$ $-35.10\%$
PPM TS (Binary w) 4 $5.87\times 10^{-3}$ $6.14\times 10^{-3}$ $-4.47\%$
PPM TS (Binary w) 10 $4.61\times 10^{-3}$ $5.27\times 10^{-3}$ $-12.49\%$
PPM TS (Binary w) 20 $4.15\times 10^{-3}$ $4.33\times 10^{-3}$ $-4.31\%$
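The relationship between the mean baseline of Table 3 and the best-of-100 baseline of Table 4 can be sketched as follows. This is a minimal illustration on synthetic numbers: `random_risks`, `risk_opt`, and `percent_difference` are hypothetical names and values rather than quantities from the experiments, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics


def percent_difference(value, baseline):
    """Signed relative difference of `value` from `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


rng = random.Random(1)
# Synthetic stand-ins for the total risks of 100 networks trained on random designs.
random_risks = [0.06 + abs(rng.gauss(0.0, 0.01)) for _ in range(100)]
risk_opt = 0.055  # hypothetical risk of the network trained with the optimized design

baseline_mean = statistics.fmean(random_risks)   # mean over the random designs
baseline_std = statistics.stdev(random_risks)    # sample standard deviation (assumed)
baseline_min = min(random_risks)                 # best of the 100 random designs

vs_mean = percent_difference(risk_opt, baseline_mean)
vs_best = percent_difference(risk_opt, baseline_min)
print(f"vs mean: {vs_mean:.2f}%, vs best random design: {vs_best:.2f}%")
```

Because the minimum never exceeds the mean, the difference relative to the best random design is always the smaller improvement of the two, which makes it the stricter comparison.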
Table 5: Experimental results for parameter risks $\ell_{q}({\bf w})$. The $\ell_{q}({\bf w})$ were computed over $175$K unseen samples for each ODE system at the given sparsity values. The corresponding Standard Errors of the Mean (SEM) are shown. Also given are the mean parameter risks $\overline{\ell_{q}({\bf w}_{\rm rand})}$ obtained by 100 networks trained on random designs ${\bf w}_{\rm rand}$. The standard deviations of $\ell_{q}({\bf w}_{\rm rand})$ and the percentage difference of $\ell_{q}({\bf w}_{\rm opt})$ from the random runs are similarly presented. SP refers to the sparsity of the corresponding row.
ODE Method SP $\ell_{q}({\bf w}_{\rm opt})$ SEM($\ell_{q}({\bf w}_{\rm opt})$) $\overline{\ell_{q}({\bf w}_{\rm rand})}$ $\frac{\ell_{q}({\bf w}_{\rm opt})-\overline{\ell_{q}({\bf w}_{\rm rand})}}{\overline{\ell_{q}({\bf w}_{\rm rand})}}\times 100\%$ STD($\ell_{q}({\bf w}_{\rm rand})$)
3-TC Cont. w 2 $3.81\times 10^{-2}$ $1.35\times 10^{-4}$ $8.94\times 10^{-2}$ $-57.34\%$ $2.77\times 10^{-1}$
3-TC Cont. w 6 $3.84\times 10^{-2}$ $1.60\times 10^{-4}$ $5.79\times 10^{-2}$ $-33.69\%$ $1.14\times 10^{-1}$
3-TC Cont. w 10 $3.84\times 10^{-2}$ $1.49\times 10^{-4}$ $4.35\times 10^{-2}$ $-11.87\%$ $3.69\times 10^{-2}$
3-TC Cont. w 20 $3.78\times 10^{-2}$ $1.07\times 10^{-4}$ $3.84\times 10^{-2}$ $-1.57\%$ $6.19\times 10^{-4}$
3-TC TS (Binary w) 2 $3.82\times 10^{-2}$ $1.65\times 10^{-4}$ $3.828\times 10^{-2}$ $-0.28\%$ $3.94\times 10^{-4}$
3-TC TS (Binary w) 6 $3.80\times 10^{-2}$ $1.43\times 10^{-4}$ $3.819\times 10^{-2}$ $-0.43\%$ $2.39\times 10^{-4}$
3-TC TS (Binary w) 10 $3.83\times 10^{-2}$ $1.34\times 10^{-4}$ $3.821\times 10^{-2}$ $+0.37\%$ $2.13\times 10^{-4}$
3-TC TS (Binary w) 20 $3.81\times 10^{-2}$ $1.25\times 10^{-4}$ $3.82\times 10^{-2}$ $-0.22\%$ $2.34\times 10^{-4}$
PPM Cont. w 2 $1.07\times 10^{-3}$ $3.27\times 10^{-6}$ $1.38\times 10^{-3}$ $-22.31\%$ $4.88\times 10^{-4}$
PPM Cont. w 4 $5.45\times 10^{-4}$ $1.99\times 10^{-6}$ $1.11\times 10^{-3}$ $-51.05\%$ $6.76\times 10^{-4}$
PPM Cont. w 10 $4.27\times 10^{-4}$ $1.29\times 10^{-6}$ $5.54\times 10^{-4}$ $-22.97\%$ $1.95\times 10^{-4}$
PPM Cont. w 20 $3.75\times 10^{-4}$ $2.10\times 10^{-4}$ $5.09\times 10^{-4}$ $-26.44\%$ $2.10\times 10^{-4}$
PPM TS (Binary w) 2 $7.88\times 10^{-4}$ $2.31\times 10^{-6}$ $1.60\times 10^{-3}$ $-50.76\%$ $5.24\times 10^{-4}$
PPM TS (Binary w) 4 $5.73\times 10^{-4}$ $1.85\times 10^{-6}$ $8.18\times 10^{-4}$ $-29.98\%$ $3.11\times 10^{-4}$
PPM TS (Binary w) 10 3.58Γ—10βˆ’43.58superscript1043.58\times 10^{-4}3.58 Γ— 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT 1.15Γ—10βˆ’61.15superscript1061.15\times 10^{-6}1.15 Γ— 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT 6.00Γ—10βˆ’46.00superscript1046.00\times 10^{-4}6.00 Γ— 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT βˆ’40.42%percent40.42-40.42\%- 40.42 % 3.92Γ—10βˆ’43.92superscript1043.92\times 10^{-4}3.92 Γ— 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
PPM TS (Binary w) 20 3.13Γ—10βˆ’43.13superscript1043.13\times 10^{-4}3.13 Γ— 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT 1.15Γ—10βˆ’61.15superscript1061.15\times 10^{-6}1.15 Γ— 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT 3.91Γ—10βˆ’43.91superscript1043.91\times 10^{-4}3.91 Γ— 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT βˆ’19.95%percent19.95-19.95\%- 19.95 % 1.29Γ—10βˆ’41.29superscript1041.29\times 10^{-4}1.29 Γ— 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT