Deep Optimal Experimental Design for Parameter Estimation Problems
Md Shahriar Rahim Siddiqui¹ Arman Rahmim¹ Eldad Haber²
Copyright 2024 by the author(s).
Abstract
Optimal experimental design is a well-studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. Nonetheless, in recent years parameter estimation techniques have been changing rapidly, with deep learning techniques replacing traditional estimation methods. This in turn requires adapting optimal experimental design to these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that training a network as a Likelihood Free Estimator can significantly simplify the design process and circumvent the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As a proof of concept we apply our methodology to two different systems of Ordinary Differential Equations.
1 Introduction
Mathematical models that are based on ordinary differential equations (ODEs) are used in various fields such as biology Helms (2008), ecology Bellman et al. (1966), physiology Zenker et al. (2007), pharmacometrics Ette & Williams (2007), climate modeling Keane et al. (2017) and more Lenhart & Workman (2007); Betts (2001). These models often involve unknown parameters that need to be estimated from experimental data. The parameters represent rates and constants that are usually important in understanding the dynamics of the underlying system, and can be crucial to decision making and to the control of the system. Examples include determining the insulin resistance of an individual from the clinically used intravenous glucose tolerance test (IVGTT) Chung & Haber (2013), finding the rates of transfer of radio-pharmaceuticals, and determining when a system will exhibit chaotic behavior Hunt et al. (2007).
Parameter identification is typically performed in two steps: data collection and data fitting. In the data collection stage experiments are performed and data are measured. In the data fitting stage, a data fitting procedure is used to evaluate parameters from the (typically noisy) measurements. For this process to be effective, the experiment must be carefully designed.
In many cases one can densely measure the data, yielding a grossly over-determined problem Björck (1996). In other cases, however, the measurement process is difficult and expensive, and this leads to restrictions on the type, amount and quality (that is, signal to noise) of the data that is collected. A decision must be made about which data should be measured, at what frequency and to what accuracy. A consequent trade-off exists between the accuracy of parameter recovery and the amount and quality of the data. The goal of this paper is to propose an experimental design technique that balances parameter estimation and the cost of the experiment.
Previous Work: Optimal experimental design is an important topic in applied science and engineering with applications in medical imaging, geoscience, optimal control, public health and many other fields (see Bardow (2008); Pukelsheim (1993); Allaire et al. (2001); Chaloner & Verdinelli (1995); Haber et al. (2008); Atkinson & Donev (1992) and references therein).
In this paper we focus our attention on the design of experiments for nonlinear systems that are governed by ODEs, with the aim of identifying parameters within the equations. While there is significant work on the topic (see Haber et al. (2009); Chung & Haber (2013); McLellan (1994); Banga & Balsa-Canto (2008); Bock et al. (2013) and references therein), the methods proposed for the solution of the problem are difficult to apply and require classical recovery techniques.
The design process is based on three pillars. At the core stands the ability to quickly solve the so-called forward problem for different experimental settings and parameters. In our case, the forward problem is a discretized system of ordinary differential equations that needs to be solved many times (in some cases hundreds of thousands). The second pillar of the design is the solution of the so-called inverse problem, that is, the identification of parameters given some experimental settings. For linear problems with quadratic regularization (that is, Gaussian priors), it is possible to obtain an analytic expression for this problem (see Haber et al. (2008)). However, for nonlinear inverse problems such expressions are impossible to obtain, and therefore numerical optimization techniques are typically used to solve the problem and estimate the parameters. The solution of the inverse problem is sometimes referred to as the inner optimization problem. Finally, in the third and outer pillar of the design process stands an optimization problem that aims to obtain the best parameter recovery (on average) from the inner optimization problem by changing the experimental setting. The process is illustrated in Figure 1.
![Refer to caption](extracted/5684597/workflow_schematic.png)
Optimal experimental design for nonlinear systems is therefore a bilevel optimization problem. The process of solving the outer problem (the design problem) is iterative. For a given design, one chooses some parameters to be identified from some distribution and solves many inner optimization problems to estimate those parameters. Comparing the parameters to their true values and taking derivatives, it is possible to obtain a direction that improves the recovery of those parameters. Design methods for parameter estimation differ in their outer objective function and the statistical framework, Bayesian vs Frequentist Chaloner & Verdinelli (1995); Haber et al. (2008). Nonetheless, all methods known to us rely on repeated solution of the inverse problem, for different parameters, which makes the problem very difficult to solve and computationally challenging. To this end, a few authors have proposed to replace the forward solution with reduced models Wang et al. (2023); Ushijima & Yeh (2015); however, those can behave poorly, especially when changing the design.
In recent years deep learning has been used in the context of generative models that are able to sample the main modes of a distribution (see Yang et al. (2023) and references therein). The ability to learn a distribution or its main mode can be very useful when considering optimal design. In particular, we consider likelihood free estimators (see e.g. Sainsbury-Dale et al. (2022); Didelot et al. (2011); Papamakarios (2019)) that allow for the estimation of the solution of the inverse problem without the solution of the forward problem. With improvements in neural network architectures, such estimators have become very attractive for problems where the solution of the forward problem is computationally challenging.
The novelty of this work is the development of a methodology that uses likelihood free estimators for optimal experimental design for a system governed by differential equations. Our method circumvents the bilevel optimization problem that is traditionally used. We propose a mathematical framework that allows for the construction of a deep network that yields an optimal recovery of the parameters and the experimental setting. Given a parameter to be recovered and a forward problem, our framework requires the ability to sample from the prior distribution of the parameter and to solve the forward problem for those samples. Our framework does not require the solution of the inverse problem via optimization and rather learns a direct map from the data to the parameters under different experimental settings. By using an appropriate loss function, we are able to find a network that is optimal for the recovery of the parameter and the optimal experimental settings.
The rest of this paper is organized as follows. In Section 2 we give a mathematical background to the field of design in the context of differential equations. In Section 3 we discuss likelihood free estimators and show how they can be used for the design problem. In Section 4 we propose two training methods to train such estimators. In Section 5 we conduct a number of experiments with systems from diverse fields that show that our method is robust. Finally, in Section 6 we summarize the paper.
2 Mathematical Background for Optimal Design
Consider a parameter estimation problem, and let $\mathbf{p} \in \mathbb{R}^k$ be a $k$-dimensional parameter vector in a differential equation. Let $\boldsymbol{\omega}$ be a set of parameters that represent some experimental settings. Finally, let $\mathbf{d}$ be the data measured. Formally, we write

$$\mathbf{d} = F(\mathbf{p}, \boldsymbol{\omega}) + \boldsymbol{\epsilon}. \tag{1}$$

Here $F$ is the forward problem that takes the parameter $\mathbf{p}$ and experimental setting $\boldsymbol{\omega}$ and yields some measured data, $\mathbf{d}$. In our context, the application of $F$ requires the solution of a system of ODEs, typically an initial value problem. The data are assumed to be polluted with random noise $\boldsymbol{\epsilon}$, which is assumed to be normal with mean $0$ and standard deviation $\sigma$. We assume that $\mathbf{p}$ is associated with a known probability density function $\pi(\mathbf{p})$. In practice one does not need a mathematical expression for the density; however, we assume that a sufficient number of samples, $\mathbf{p}_1, \ldots, \mathbf{p}_N$, can be generated from this density. The experimental setting vector $\boldsymbol{\omega}$ is assumed to be under our control. Such a vector may represent measurement times, frequencies for the forward problem, source terms or other controllable parameters in the experiment.
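A common approach to estimate the parameter $\mathbf{p}$ given the data $\mathbf{d}$ is to use Bayesian inference. Using Bayes' theorem, the probability density function of $\mathbf{p}$ given $\mathbf{d}$ is

$$\pi(\mathbf{p} \mid \mathbf{d}, \boldsymbol{\omega}) \propto \pi(\mathbf{d} \mid \mathbf{p}, \boldsymbol{\omega})\, \pi(\mathbf{p}), \tag{2}$$

where $\pi(\mathbf{d} \mid \mathbf{p}, \boldsymbol{\omega})$ is the likelihood, that is, the probability of the data given the parameter $\mathbf{p}$ and the experimental setting $\boldsymbol{\omega}$. The Maximum A Posteriori (MAP) estimate is the parameter that maximizes this distribution; that is

$$\hat{\mathbf{p}} = \arg\min_{\mathbf{p}} \; -\log \pi(\mathbf{d} \mid \mathbf{p}, \boldsymbol{\omega}) + R(\mathbf{p}). \tag{3}$$

Here $R(\mathbf{p}) = -\log \pi(\mathbf{p})$ is the negative log of the prior.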
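Associated with any estimator $\hat{\mathbf{p}}$ is the recovery loss, or risk, defined by

$$\mathcal{R}(\boldsymbol{\omega}) = \mathbb{E}_{\mathbf{p}, \boldsymbol{\epsilon}} \, \| \hat{\mathbf{p}}(\mathbf{d}, \boldsymbol{\omega}) - \mathbf{p} \|^2. \tag{4}$$

Note that the risk is obtained by taking the expectation over the parameter $\mathbf{p}$ and the noise $\boldsymbol{\epsilon}$, and therefore the risk depends on the experimental setting $\boldsymbol{\omega}$ alone. Associated with the estimator $\hat{\mathbf{p}}$ is the estimated data

$$\hat{\mathbf{d}} = F(\hat{\mathbf{p}}, \boldsymbol{\omega}). \tag{5}$$

We also consider the data risk

$$\mathcal{R}_d(\boldsymbol{\omega}) = \mathbb{E}_{\mathbf{p}, \boldsymbol{\epsilon}} \int_0^T \big( \hat{d}(t) - d(t) \big)^2 \, dt. \tag{6}$$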
When approximating the integral, we use the trapezoidal method, which allows the application of non-uniform time intervals.
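As an illustration of the quadrature mentioned above, the following is a minimal sketch of the trapezoidal rule on a non-uniform time grid, applied to a squared data misfit. The function names `trapezoid` and `data_misfit` and the sample grid are hypothetical and chosen only for this example; the squared-difference integrand is an assumption about the form of the data risk.

```python
def trapezoid(times, values):
    """Trapezoidal rule for samples on a (possibly non-uniform) time grid."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]  # interval widths may differ
        total += 0.5 * dt * (values[k] + values[k - 1])
    return total


def data_misfit(times, d_est, d_obs):
    """Approximate the integral of (d_est(t) - d_obs(t))^2 over the grid."""
    squared = [(a - b) ** 2 for a, b in zip(d_est, d_obs)]
    return trapezoid(times, squared)


# Illustrative non-uniform grid: dense early, sparse late.
times = [0.0, 0.1, 0.3, 0.7, 1.5]
d_obs = [t for t in times]         # observed signal d(t) = t
d_est = [t + 1.0 for t in times]   # estimate offset by a constant 1
misfit = data_misfit(times, d_est, d_obs)
```

Because the rule only uses the local interval width, the same routine works unchanged for uniform and logarithmically spaced grids.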
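Note that two possible estimated solutions with similar risk $\mathcal{R}$ may have different data risk $\mathcal{R}_d$. We therefore consider the total loss as a weighted sum of the risk (equation 4) and the data risk (equation 6), that is

$$\mathcal{L}(\boldsymbol{\omega}) = \mathcal{R}(\boldsymbol{\omega}) + \gamma \, \mathcal{R}_d(\boldsymbol{\omega}), \tag{7}$$

where $\gamma$ is a hyper-parameter that is held fixed in our experiments. The definition in equation 7 provides a framework for the optimal design. In particular, let

$$\hat{\boldsymbol{\omega}} = \arg\min_{\boldsymbol{\omega}} \; \mathcal{L}(\boldsymbol{\omega}). \tag{8}$$

The solution $\hat{\boldsymbol{\omega}}$ is interpreted as the experimental setting that yields, on average, the best recovery of the parameter $\mathbf{p}$ while also fitting the measured data.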
One main difficulty in solving equation 8 is that the problem may not be differentiable with respect to the experimental setting $\boldsymbol{\omega}$. A common way to overcome this difficulty, reviewed in Boyd & Vandenberghe (2004), is to discretize the space of experimental settings finely, obtaining many plausible experiments $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_n$ that can be done. Assume that we have $n$ possible different experiments and assume for simplicity that each datum $d_i$ corresponds to the data that is recorded from a different experiment $\boldsymbol{\omega}_i$. Introducing weights $w_i \ge 0$, one replaces the original posterior (equation 2) with one that contains all the data for all possible experiments, weighted by $\mathbf{w}$, that is

$$\pi(\mathbf{p} \mid \mathbf{d}, \mathbf{w}) \propto \exp\Big( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} w_i \big( F_i(\mathbf{p}) - d_i \big)^2 \Big) \, \pi(\mathbf{p}), \tag{9}$$

where the forward problem $F = (F_1, \ldots, F_n)$ now assumes that all possible experiments are conducted. The problem of estimating $\boldsymbol{\omega}$ is replaced with the estimation of the weights $\mathbf{w}$. Clearly, if $\mathbf{w}$ is sparse then only a few experiments are to be conducted. Therefore, it has been proposed Haber et al. (2008) to replace the original problem with a penalized problem

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w} \ge 0} \; \mathcal{L}(\mathbf{w}) + \alpha \, S(\mathbf{w}), \tag{10}$$

where $S(\mathbf{w})$ promotes sparsity in $\mathbf{w}$, and $\alpha$ is a hyper-parameter. A common approach is to use the 1-norm, although approximations to the zero-norm have also been proposed Donoho (2006); Bruckstein et al. (2009).
Even after the reformulation of the problem, solving the optimization problem in equation 10 is very difficult. It requires solving the optimization problem in equation 3 for many different data realizations in order to estimate the risk. This in turn requires solving many forward problems, which can be very expensive, especially in the context of ODEs. Furthermore, computing derivatives with respect to $\mathbf{w}$ is challenging and requires implicit differentiation.
Therefore, in the next section we describe an alternative to this process that leads to a much more efficient algorithm.
3 Likelihood Free Estimators and Optimal Design
The optimization problem in equation 10 is difficult because we need to evaluate the estimate $\hat{\mathbf{p}}$, which in turn requires the solution of the optimization problem in equation 3. The optimization stems from our use of the MAP estimator. The MAP estimator can be thought of as a nonlinear function of the form

$$\hat{\mathbf{p}} = f(\mathbf{d}, \mathbf{w}), \tag{11}$$

where the nonlinear function $f$ maps the data $\mathbf{d}$ to an estimator $\hat{\mathbf{p}}$.
The MAP estimator can be highly useful and has very desirable properties. However, it may be far from optimal. If the posterior is highly skewed, then the MAP may not lead to the minimization of the risk in equation 4. In practice, the main reason that the MAP estimator is commonly used is our ability to compute it. In many cases, estimators such as the conditional mean may be attractive; however, they require non-trivial computations and Monte-Carlo integration. Thus, the MAP estimator is certainly not the only estimator that can be used. In the context of learning it is possible to directly learn an estimator that minimizes the risk in equation 4. In many cases, such an estimator can perform even better than the MAP estimator (see for example Papamakarios (2019)). Furthermore, by appropriately setting its architecture, it is possible to include different experimental settings and noise levels in this estimator and then use it for the solution of the optimal design.
We now derive a likelihood free estimator for the solution of the problem that enables us to solve both the estimation problem and the design problem. To this end, let us define the estimator

$$\hat{\mathbf{p}} = f_{\boldsymbol{\theta}}(\mathbf{w} \odot \mathbf{d}, \mathbf{w}, \sigma). \tag{12}$$

The estimator depends on the experimental design vector $\mathbf{w}$ that controls the design and on trainable parameters $\boldsymbol{\theta}$. It also uses the parameter $\sigma$ that represents the noise level. Note that this estimator does not require solving the forward problem directly and computing the likelihood, hence its name, Likelihood Free Estimator (LFE).
The risk for this estimator can be written as

$$\mathcal{R}(\boldsymbol{\theta}, \mathbf{w}) = \mathbb{E}_{\mathbf{p}, \boldsymbol{\epsilon}} \, \| f_{\boldsymbol{\theta}}(\mathbf{w} \odot \mathbf{d}, \mathbf{w}, \sigma) - \mathbf{p} \|^2. \tag{13}$$

Clearly, the best estimator is obtained by minimizing the risk with respect to the weights $\boldsymbol{\theta}$ and the experimental setting $\mathbf{w}$. It is important to note that since the estimator is a simple function evaluation, we can solve a single optimization problem to estimate both the estimator parameters $\boldsymbol{\theta}$ and the experimental design parameters $\mathbf{w}$. Thus, this formulation avoids the need for bilevel optimization altogether.
We now discuss a proposed architecture that we use for the solution of the problem. The architecture uses the data $\mathbf{d}$, the experimental setting $\mathbf{w}$ and the noise level $\sigma$ in order to compute the estimator $\hat{\mathbf{p}}$. A minimal structure of the network is summarized in Algorithm 1.
The network is a simple residual network with weights $\boldsymbol{\theta}$, with a few important differences from standard residual networks. The first difference is that the input data is weighted with the parameters $\mathbf{w}$, allowing us to control the experimental design and train for the optimal parameters. Second, the parameters $\mathbf{w}$ are embedded into the network. This allows the network to be aware of the experimental setting. Finally, similar to diffusion models Li et al. (2018), we embed the noise level into the network, allowing it to be trained for different noise levels. The activation function in the network is bundled with possible layer normalization.
The network above can be modified to have a different embedding for $\mathbf{w}$ and $\sigma$ at each layer. However, we have found that even this simple architecture yields very good results. The network needs to be trained for its weights $\boldsymbol{\theta}$ as well as for the best experimental settings, $\mathbf{w}$. This is discussed in the next section.
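To make the description above concrete, here is a minimal sketch of a forward pass through such an estimator, under several assumptions that are not taken from the paper: a `tanh` activation without layer normalization, linear embeddings of the design vector and the noise level added once at the input, and a fixed residual step size. All names (`init_params`, `estimator`, `K_in`, `E_w`, `e_sigma`, `K_out`) are hypothetical.

```python
import math
import random


def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(n_data, n_hidden, n_params, n_layers, seed=0):
    """Randomly initialised trainable weights (hypothetical layout)."""
    rng = random.Random(seed)
    return {
        "K_in": random_matrix(n_hidden, n_data, rng),   # opens the weighted data
        "E_w": random_matrix(n_hidden, n_data, rng),    # embeds the design vector
        "e_sigma": [rng.gauss(0.0, 0.1) for _ in range(n_hidden)],  # embeds noise level
        "layers": [
            (random_matrix(n_hidden, n_hidden, rng),
             random_matrix(n_hidden, n_hidden, rng))
            for _ in range(n_layers)
        ],
        "K_out": random_matrix(n_params, n_hidden, rng),
    }


def estimator(theta, d, w, sigma, step=0.5):
    """Map data d, design weights w and noise level sigma to a parameter estimate."""
    weighted = [wi * di for wi, di in zip(w, d)]         # design weights gate the data
    z = matvec(theta["K_in"], weighted)
    emb = matvec(theta["E_w"], w)
    z = [zi + ei + si * sigma for zi, ei, si in zip(z, emb, theta["e_sigma"])]
    for K1, K2 in theta["layers"]:                       # residual updates
        hidden = [math.tanh(a) for a in matvec(K1, z)]
        update = matvec(K2, hidden)
        z = [zi + step * ui for zi, ui in zip(z, update)]
    return matvec(theta["K_out"], z)


theta = init_params(n_data=5, n_hidden=8, n_params=3, n_layers=2)
d = [0.9, 0.7, 0.5, 0.4, 0.3]
w = [1.0, 0.0, 1.0, 0.0, 1.0]
p_hat = estimator(theta, d, w, sigma=0.05)
```

One consequence of weighting the input is visible directly: a datum whose weight is zero cannot influence the estimate, so it never needs to be measured.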
4 Training the Estimator
In this section we discuss training the network by using a self-supervised approach Jaiswal et al. (2020); Zhai et al. (2019). Such techniques have been the cornerstone of recent advances in the application of deep learning to realistic applications. However, before we proceed, we need to discuss the particular parametrization used for this problem. While choosing the network weights $\boldsymbol{\theta}$ is standard, there are two options for the choice of the design vector $\mathbf{w}$. In the first option, we allow $\mathbf{w}$ to take any real non-negative value. This is the simplest case, as it allows us to train the experimental setting together with the parameters of the network, estimating the network and the optimal design simultaneously. In the second option we treat $\mathbf{w}$ as a binary variable. Binary variables are appropriate when we either conduct an experiment or not, so that each datum is either measured or not. In this case, training the network implies solving a mixed nonlinear-binary programming problem. We solve this problem by a combination of Tabu search for the binary variables and stochastic gradient descent for the continuous variables.
Regardless of the optimization process, we use a self-supervised approach. That is, the algorithm generates its own data and trains on it. A summary of the algorithm is given in Algorithm 2.
The algorithm is self-trained in the sense that it continuously samples $\mathbf{p}$, computes the corresponding data, and then uses the data in order to estimate $\mathbf{p}$. The algorithm requires a way to sample from the prior distribution of $\mathbf{p}$. For many parameter estimation problems such a distribution is readily available. In the examples below we sample the parameters from a log-normal distribution. The ability to generate an unlimited amount of data allows for a very robust training procedure in which over-fitting does not play a role.
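The data-generation step of this self-supervised loop can be sketched as follows. This is only an illustration: the forward model here is a hypothetical two-parameter exponential decay standing in for an ODE solve, the noise is additive Gaussian, and the log-normal hyper-parameters are arbitrary values chosen for the example.

```python
import math
import random


def forward(p, times):
    """Hypothetical forward model: amplitude p[0] decaying at rate p[1]."""
    return [p[0] * math.exp(-p[1] * t) for t in times]


def sample_batch(n_samples, times, sigma, rng):
    """Draw parameters from a log-normal prior and simulate noisy data."""
    batch = []
    for _ in range(n_samples):
        # Illustrative prior: independent log-normal parameters.
        p = [rng.lognormvariate(0.0, 0.25), rng.lognormvariate(-1.0, 0.25)]
        clean = forward(p, times)
        noisy = [c + rng.gauss(0.0, sigma) for c in clean]
        batch.append((p, noisy))  # (target, input) pair for the estimator
    return batch


rng = random.Random(0)
times = [0.5 * k for k in range(10)]
batch = sample_batch(4, times, sigma=0.05, rng=rng)
```

Calling `sample_batch` afresh at every training iteration means the estimator never sees the same pair twice, which is the sense in which over-fitting is not a concern.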
Algorithm 2 minimizes the loss with respect to the design variables $\mathbf{w}$ and the network parameters $\boldsymbol{\theta}$. However, the update of the design variables depends on their type. We now discuss the two approaches used for the update of the design variables.
4.1 Continuous Design Variables (Method 1)
Consider first the case where the design vector $\mathbf{w}$ can take on any values $\mathbf{w} \ge 0$. In this case we seek to minimize the recovery loss and enhance the sparsity of $\mathbf{w}$.
A simple algorithm to achieve that is to use the soft shrink function Bruckstein et al. (2009). Here we modify the classical soft shrink function to keep non-negativity. The modified soft shrink function is defined as

$$S_{\alpha}(t) = \max(t - \alpha, 0). \tag{14}$$

This function is used during minimization in a straightforward manner (see Bruckstein et al. (2009) for convergence proofs). Using the stochastic gradient descent algorithm we update $\boldsymbol{\theta}$ in a standard way, but for $\mathbf{w}$ we use the following formula

$$\mathbf{w}_{k+1} = S_{\alpha}(\mathbf{w}_k + \mu \, \mathbf{s}_k). \tag{15}$$

Here the soft shrink function is applied element-wise, $\mu$ is the learning rate and $\mathbf{s}_k$ is the descent direction obtained by the stochastic gradient descent algorithm.
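A minimal sketch of this update is given below, assuming the modified soft shrink takes the form $\max(t - \tau, 0)$ and that the descent direction is the negative gradient. The names `soft_shrink_nonneg`, `design_step`, `lr` and `tau`, as well as the numbers, are hypothetical.

```python
def soft_shrink_nonneg(w, tau):
    """Element-wise shrinkage that also clips negative entries to zero."""
    return [max(wi - tau, 0.0) for wi in w]


def design_step(w, grad, lr, tau):
    """One gradient step on the design weights followed by shrinkage."""
    stepped = [wi - lr * gi for wi, gi in zip(w, grad)]  # descent direction = -grad
    return soft_shrink_nonneg(stepped, tau)


w = [1.0, 0.05, 0.5]
grad = [0.0, 1.0, -1.0]   # illustrative stochastic gradient of the loss
w_new = design_step(w, grad, lr=0.1, tau=0.02)
```

In this example the second weight is driven below the threshold and is set exactly to zero, which is how the shrinkage step produces sparse designs while keeping every weight non-negative.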
4.2 Binary and Tabu Search (Method 2)
Using continuous weights is attractive as it enables the use of continuous optimization techniques. Nonetheless, such an algorithm can produce many weights that are small, raising the question of how important the data associated with those weights is. In many applications, a binary $\mathbf{w}$ makes more sense. In this case, $w_i = 1$ implies that the datum is measured, and when $w_i = 0$ the datum is ignored and not measured. Thus, minimizing the loss involves mixed nonlinear binary programming, which is a challenging problem. To approximate the solution of this problem we combine a Tabu search method (for the binary variables) and standard stochastic gradient descent (for the continuous variables). Tabu search methods are commonly used in binary programming Hertz et al. (1995). The algorithm uses the current iterate to define neighbors. These neighbors are tested, and the algorithm steps to the one with the lowest loss. The algorithm retains a list (the Tabu list) of points that have been visited in order not to repeat points. Although the algorithm is very simple, it finds reasonable approximations to the minimum of binary programming problems (see Gendreau & Potvin (2005) for details). To use Tabu search within the context of the optimization problem we incorporate it in a block coordinate descent algorithm. The algorithm is summarized in Algorithm 3.
Solving a mixed binary nonlinear programming problem is substantially more expensive compared with solving the problem with continuous variables. However, in our application we have found that a very small number of steps on the outer iteration usually suffice to obtain reasonable accuracy.
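The binary block of such a scheme can be sketched as a plain Tabu search over designs of fixed sparsity. This is an illustration under stated assumptions, not Algorithm 3 itself: neighbours are generated by swapping one measured datum for an unmeasured one, the loss is a toy function instead of a trained network, and the gradient updates of the network weights between Tabu steps are omitted. All names and settings are hypothetical.

```python
import random


def swap_neighbours(w):
    """All designs obtained by switching one measurement off and another on."""
    ones = [i for i, wi in enumerate(w) if wi == 1]
    zeros = [i for i, wi in enumerate(w) if wi == 0]
    result = []
    for i in ones:
        for j in zeros:
            cand = list(w)
            cand[i], cand[j] = 0, 1
            result.append(tuple(cand))
    return result


def tabu_search(loss, w0, n_iters, n_candidates, tabu_len, rng):
    current = tuple(w0)
    best, best_loss = current, loss(current)
    tabu = [current]                      # recently visited designs
    for _ in range(n_iters):
        candidates = [c for c in swap_neighbours(current) if c not in tabu]
        if not candidates:
            break
        subset = rng.sample(candidates, min(n_candidates, len(candidates)))
        current = min(subset, key=loss)   # move even if the loss increases
        tabu.append(current)
        if len(tabu) > tabu_len:
            tabu.pop(0)                   # forget the oldest entry
        current_loss = loss(current)
        if current_loss < best_loss:
            best, best_loss = current, current_loss
    return best, best_loss


# Toy loss: distance to a hypothetical "ideal" design with three measurements.
target = (0, 1, 0, 0, 1, 0, 1, 0)


def toy_loss(w):
    return sum((a - b) ** 2 for a, b in zip(w, target))


rng = random.Random(0)
w0 = (1, 1, 1, 0, 0, 0, 0, 0)
best, best_loss = tabu_search(toy_loss, w0, n_iters=20, n_candidates=15,
                              tabu_len=10, rng=rng)
```

Swap moves keep the number of measurements constant, so the sparsity chosen at initialization is preserved throughout the search.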
5 Numerical Experiments
In this section we study two different problems. The first problem is of practical importance to the field of nuclear medicine. It uses a "3-Tissue Compartment" (3-TC) model, an ODE system from a range of compartmental models in the field of kinetic modeling Morris et al. (2004); Bentourkia & Zaidi (2007), to estimate physiological parameters of importance using nuclear medicine imaging. The second is an example of a non-linear system of differential equations that models the populations of predators and prey, known as the Lotka-Volterra predator-prey model (PPM). We now briefly describe these problems; a detailed implementation can be found in Appendix A.
5.1 3-Tissue Compartment Model
A multi-compartment model is a simplified (or reduced) mathematical model used to describe the movement of mass or energy between different compartments within a system. Such models are often used to generate a reduced yet useful model of a complex system that is difficult to represent accurately. Each compartment in the model is considered to be homogeneous, so that only the interactions between compartments need to be modeled. Compartment models have various applications in several fields, including pharmacokinetics, epidemiology, systems theory, complexity theory, engineering, physics, and social science Morris et al. (2004); Zhang et al. (2022); Wagner (1969). Here we consider the 3-Tissue Compartment (3-TC) model Zakariaei et al. (2023) from the field of nuclear medicine and molecular imaging. The model is a system of ODEs, and is an example of a compartmental model from kinetic modeling. The 3-TC model (Figure 6 in the Appendix) describes the kinetics of a pharmaceutical in tissue, and how its concentration changes with time in each of the compartments. The model is written as
(16a) | |||||
(16b) | |||||
(16c) |
The equations describe the dynamics of the three states, which represent, respectively, the concentration of the radiopharmaceutical in the interstitial space between cells, bound to cell receptors, and finally internalized into the cells. The blood input function (source) describes the delivery of the radiopharmaceutical to the tissue by the vascular system, and is the instantaneous concentration of the radiopharmaceutical in the blood. This input function exhibits a sharp spike in concentration at earlier times. The dynamics of the model depend on the parameter vector $\mathbf{p}$. The parameters typically exhibit a log-normal prior distribution, that is

$$\log p_i \sim \mathcal{N}(\mu_i, \sigma_i^2). \tag{17}$$
![Refer to caption](extracted/5684597/10_ExamplePetSignals_TabuOptLines.png)
The parameters of these distributions were calculated from Kletting et al. (2016), and the numbers used are given in Appendix A. Data collection is done as follows: a patient is injected with a radiopharmaceutical and is subsequently imaged with a PET/CT scanner. The imaging data comprises so-called Time-Activity-Curves (TACs), obtained from dynamic nuclear medicine imaging, that show the evolution of radioactivity concentration (kBq/ml) in the patient over time. We model these TAC signals as the sum of the contributions from the three tissue compartments, with Gaussian noise levels picked at random at each sampling. Imaging at numerous time points is costly, and an optimal design aims to identify the best times for data collection. An example of noisy TACs that were used in our experiments, with the assumed multiplicative Gaussian noise at varying levels, is presented in Figure 2.
![Refer to caption](extracted/5684597/3TC_SparsityPlot_4_updated.png)
5.2 Lotka-Volterra Predator-Prey Model
In our second set of experiments we consider the Predator-Prey Model (PPM), which is a system of non-linear differential equations that exhibits very different behaviour from the 3-TC model. While the 3-TC model is controlled by decay, the PPM model is cyclical in nature. The model describes the dynamics of two species, the prey $x$ and the predators $y$, and can be written as

$$\frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y, \tag{18}$$

equipped with some initial conditions. The model depends on the parameter vector $\mathbf{p} = (\alpha, \beta, \delta, \gamma)$. The parameter $\alpha$ denotes the maximum per capita growth rate of prey, while $\beta$ signifies the rate at which prey are consumed by predators, reflecting the predator's impact on prey population reduction. Parameters $\delta$ and $\gamma$ for predators represent their natural growth rate through prey consumption and per capita death rate, respectively. The model assumes prey have an unlimited food supply, reproducing exponentially unless preyed upon. Additionally, it assumes prey are the sole food source for predators, with all environmental variables held constant.
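As a hedged illustration of the forward problem for this model, the sketch below integrates the standard Lotka-Volterra equations with a classical fourth-order Runge-Kutta scheme. The symbols `alpha`, `beta`, `delta`, `gamma`, the parameter values, the initial populations and the step count are illustrative assumptions and are not the values used in the experiments.

```python
import math


def lotka_volterra(state, alpha, beta, delta, gamma):
    """Right-hand side: x is the prey density, y the predator density."""
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)


def rk4(rhs, state, t_end, n_steps, *params):
    """Classical Runge-Kutta integration with a fixed step size."""
    h = t_end / n_steps
    traj = [state]
    for _ in range(n_steps):
        k1 = rhs(state, *params)
        k2 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k1)), *params)
        k3 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k2)), *params)
        k4 = rhs(tuple(s + h * k for s, k in zip(state, k3)), *params)
        state = tuple(
            s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        traj.append(state)
    return traj


def invariant(state, alpha, beta, delta, gamma):
    """Quantity conserved by the exact dynamics; useful as an accuracy check."""
    x, y = state
    return delta * x - gamma * math.log(x) + beta * y - alpha * math.log(y)


params = (1.0, 0.1, 0.075, 1.5)   # illustrative alpha, beta, delta, gamma
traj = rk4(lotka_volterra, (10.0, 5.0), 10.0, 4000, *params)
prey = [x for x, _ in traj]       # only the prey signal is assumed observable
drift = abs(invariant(traj[-1], *params) - invariant(traj[0], *params))
```

The conserved quantity reflects the cyclical nature of the model: exact trajectories are closed orbits, so a small `drift` indicates that the numerical solution stays on its orbit.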
![Refer to caption](extracted/5684597/ContW_Sparsity4_NoisySignals_OptTimes_200_signals.png)
In order to estimate the parameters, measurements of the prey population $x$ or the predator population $y$ (or both) are needed. In practice, it is difficult to measure both species, and therefore we assume that the experimentalist is only able to measure the population density of the prey. Using this data the goal is to infer the unknown parameters $\alpha$, $\beta$, $\delta$ and $\gamma$. In practice, it is impossible to obtain continuous monitoring of $x$ or $y$. Assume that we are given a budget for the measurement process. Similar to the 3-TC model, the experimental design parameters are the number of measurements and the times at which they need to be performed. Finally, in order to learn a likelihood free estimator, we need samples of the parameters $\mathbf{p}$. In Mahaffy (2009), data on the populations of lynx and hares from the Hudson Bay Company is used to fit the four unknown parameters $\alpha$, $\beta$, $\delta$ and $\gamma$. To demonstrate our proposed methods, we assumed log-normal prior distributions with means equal to the fitted parameters and standard deviations equal to a fixed fraction of the respective means:

$$\log p_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \tag{19}$$

with numbers specified in Appendix A. Samples are drawn from this prior distribution to generate noisy data samples (equation 1) for training. Gaussian noise was added to generate the synthetic noisy prey data (see Figure 4).
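One way to build such a prior is moment matching, sketched below under the assumption that the stated mean and standard deviation refer to the log-normal variable itself rather than to its logarithm. For a target mean $m$ and relative standard deviation $c$, the underlying normal has variance $s^2 = \ln(1 + c^2)$ and mean $\mu = \ln m - s^2/2$. The function names and the numerical values are hypothetical.

```python
import math
import random


def lognormal_params(mean, rel_std):
    """Parameters (mu, s) of the underlying normal for a target mean and std/mean."""
    s2 = math.log(1.0 + rel_std ** 2)
    mu = math.log(mean) - 0.5 * s2
    return mu, math.sqrt(s2)


def sample_prior(means, rel_std, rng):
    """Draw one parameter vector with independent log-normal components."""
    return [rng.lognormvariate(*lognormal_params(m, rel_std)) for m in means]


mu, s = lognormal_params(2.0, 0.1)
implied_mean = math.exp(mu + 0.5 * s * s)
implied_std = implied_mean * math.sqrt(math.exp(s * s) - 1.0)

rng = random.Random(0)
p = sample_prior([0.5, 0.02, 0.01, 0.8], rel_std=0.1, rng=rng)  # illustrative means
```

Sampling in this way guarantees positive parameters, which matters for rate constants in ODE models.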
![Refer to caption](extracted/5684597/LVSparsityPlot_2_updated.png)
5.3 Numerical Results
The network structure for the experiments is given in Algorithm 1, and was trained with continuous weights (Section 4.1) and with binary weights (Section 4.2). Each ODE system was trialled with the two methods. For both the 3-TC model and the PPM model, the data $\mathbf{d}$ were generated from the sampled parameters $\mathbf{p}$ with multiplicative Gaussian noise added. A logarithmically spaced time grid (in minutes) was used for the 3-TC model, as it exhibits a spike at earlier times (see Appendix A). For the PPM system, 200 equally spaced time points (in years) were chosen. Here we provide an exposition of the details of the trials.
ODE | Method | Sparsity | ||||
---|---|---|---|---|---|---|
3-TC | Cont. w | 2 | ||||
3-TC | Cont. w | 6 | ||||
3-TC | Cont. w | 10 | ||||
3-TC | TS (Binary w) | 2 | ||||
3-TC | TS (Binary w) | 6 | ||||
3-TC | TS (Binary w) | 10 | ||||
PPM | Cont. w | 2 | ||||
PPM | Cont. w | 4 | ||||
PPM | Cont. w | 10 | ||||
PPM | TS (Binary w) | 2 | ||||
PPM | TS (Binary w) | 4 | ||||
PPM | TS (Binary w) | 10 |
For both training methods, at each iteration a fresh batch of realizations of $\mathbf{p}$ was generated from the prior distributions for both ODE systems. Learning rates were chosen by trial and error.
We defined sparsity as the number of weights higher than a fixed threshold. For Method 1, the design vector $\mathbf{w}$ was initialized as a vector of ones. The training iterations were continued until the desired sparsity level was reached. Weights that were lower than the threshold at the end of this first phase were set to zero, and further iterations were performed with the obtained $\mathbf{w}$ fixed, which concluded the training phase.
For training using Algorithm 3 (Method 2) at each sparsity, initialization was found to be important for convergence to a design $\mathbf{w}$ after the Tabu iterations. Hyper-parameters such as the size of the subset of neighbours used and the Tabu list length were found using trial and error, and were the same for both ODEs. For initialization of $\mathbf{w}$, a network was trained on a random binary $\mathbf{w}$ at the chosen sparsity, and then used as a pre-trained model to train networks using random binary designs (at fixed sparsity) for a smaller number of iterations. The random $\mathbf{w}$ that gave the network with the lowest risk was used to initialize the Tabu search routine given by Algorithm 3 to obtain a final $\mathbf{w}$.
The two training methods that we propose as algorithms for fast and efficient experimental design were repeated for each ODE system at a number of different sparsities. For each ODE system, method, and sparsity, the model was evaluated on newly sampled $\mathbf{p}$'s (and corresponding $\mathbf{d}$'s generated as in equation 1). The results are given in Table 1. Higher sparsities, that is, more retained measurements, are expected to give better parameter recoveries and data fits, lowering the loss, while the opposite is expected for lower sparsities. This trend is seen in the comparison of the two methods in Figure 5 for the PPM ODE system, and in Figure 3 for the 3-TC ODE system.
6 Conclusions
In this paper we have introduced a methodology for experimental design for parameter estimation using a likelihood free estimator that involves the use of deep networks. Similar to other design methodologies we require the availability of training data, that samples the potential parameters to be recovered. Our methodology is self-supervised. It utilizes parameters to obtain the data under some experimental settings, and then, uses the likelihood free estimator to evaluate the parameters. Since likelihood free estimators circumvent the need of solving an optimization problem for the parameter, the experimental design problem does not require the solution of a bilevel optimization problem. We have introduced two design criteria. The first is where the data is multiplied by a continuous weight, and the second, when the data is multiplied by a binary weight. We experiment with these methods on two problems and show that it is possible to obtain an efficient design that can substantially reduce the cost of collecting data that is used in parameter estimation.
7 Impact Statement
Differential equation models are ubiquitous in science and engineering research. Experimental design is commonly used and better identifying the parameters in the face of resource or budgetary constraints can aid in various disciplines such as cancer therapy and controlling the propagation of disease. To our understanding, our work lacks discernible negative ethical or social impacts.
8 Acknowledgements
References
- Allaire et al. (2001) Allaire, G., Aubry, S., and Jouve, F. Eigenfrequency optimization in optimal design. Comp. Meth. Appl. Mech. Eng., 190(28):3565–3579, 2001.
- Atkinson & Donev (1992) Atkinson, A. C. and Donev, A. N. Optimum Experimental Designs. Oxford University Press, 1992.
- Banga & Balsa-Canto (2008) Banga, J. R. and Balsa-Canto, E. Parameter estimation and optimal experimental design. Essays in biochemistry, 45:195–210, 2008.
- Bardow (2008) Bardow, A. Optimal experimental design for ill-posed problems, the meter approach. Computers and chemical engineering, 32, 2008.
- Bellman et al. (1966) Bellman, R., Kagiwada, H., and Kalaba, R. Inverse problems in ecology. J. Theor. Biol., 11(1):164–167, 1966.
- Bentourkia & Zaidi (2007) Bentourkia, M. and Zaidi, H. Tracer kinetic modeling in pet. Pet Clinics, 2(2):267–277, 2007.
- Betts (2001) Betts, J. Practical Methods for Optimal Control using Nonlinear Programming. Advances in Design and Control. SIAM, Philadelphia, 2001.
- Björck (1996) Björck, Å. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
- Bock et al. (2013) Bock, H. G., Körkel, S., and Schlöder, J. P. Parameter estimation and optimum experimental design for differential equation models. Model Based Parameter Estimation: Theory and Applications, pp. 1–30, 2013.
- Boyd & Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University press, 2004.
- Bruckstein et al. (2009) Bruckstein, A., Donoho, D., and Elad, M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51:34–81, 2009.
- Chaloner & Verdinelli (1995) Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Statis. Sci., 10:237–304, 1995.
- Chung & Haber (2013) Chung, M. and Haber, E. Experimental design for biological systems. SIAM Journal on Control and Optimization, 50:471–489, 2013.
- Didelot et al. (2011) Didelot, X., Everitt, R. G., Johansen, A. M., and Lawson, D. J. Likelihood-free estimation of model evidence. 2011.
- Donoho (2006) Donoho, D. D. For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
- Ette & Williams (2007) Ette, E. I. and Williams, P. (eds.). Pharmacometrics: The Science of Quantitative Pharmacology. Wiley-Interscience, New York, 2007.
- Feng et al. (1994) Feng, D., Wang, X., and Yan, H. A computer simulation study on the input function sampling schedules in tracer kinetic modeling with positron emission tomography (pet). Computer Methods and Programs in Biomedicine, 45(3):175–186, 1994. ISSN 0169-2607. doi: https://doi.org/10.1016/0169-2607(94)90201-1. URL https://www.sciencedirect.com/science/article/pii/0169260794902011.
- Gendreau & Potvin (2005) Gendreau, M. and Potvin, J.-Y. Tabu search. Search methodologies: introductory tutorials in optimization and decision support techniques, pp. 165–186, 2005.
- Haber et al. (2008) Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems, 24, 2008.
- Haber et al. (2009) Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of nonlinear ill-posed inverse problems. Inverse Problems, 25, 2009.
- Helms (2008) Helms, V. (ed.). Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks. Wiley-VCH, New York, 2008.
- Hertz et al. (1995) Hertz, A., Taillard, E., and De Werra, D. A tutorial on tabu search. In Proc. of Giornate di Lavoro AIRO, volume 95, pp. 13–24, 1995.
- Hunt et al. (2007) Hunt, B. R., Kostelich, E. J., and Szunyogh, I. Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Physica D, 230:112–126, 2007.
- Jaiswal et al. (2020) Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D., and Makedon, F. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
- Keane et al. (2017) Keane, A., Krauskopf, B., and Postlethwaite, C. M. Climate models with delay differential equations. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(11), 2017.
- Kletting et al. (2016) Kletting, P., Schuchardt, C., Kulkarni, H., Shahinfar, M., Singh, A., Glatting, G., Baum, R. P., and Beer, A. Investigating the effect of ligand amount and injected therapeutic activity: A simulation study for 177lu-labeled psma-targeting peptides. PloS one, 11:e0162303, 09 2016. doi: 10.1371/journal.pone.0162303.
- Lenhart & Workman (2007) Lenhart, S. and Workman, J. (eds.). Optimal Control Applied to Biological Models (1st ed.). Chapman and Hall/CRC, New York, 2007.
- Li et al. (2018) Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations, 2018.
- Mahaffy (2009) Mahaffy, J. Qualitative analysis of 2D systems of Ordinary Differential Equations. https://jmahaffy.sdsu.edu/courses/f09/math636/lectures/lotka/qualde2.html, 2009. Accessed: Dec 25, 2023.
- McLellan (1994) McLellan, P. A differential-algebraic perspective on nonlinear controller design methodologies. Chem. Eng. Science, 49:1663–1679, 1994.
- Morris et al. (2004) Morris, E. D., Endres, C. J., Schmidt, K. C., Christian, B. T., Muzic, R. F., and Fisher, R. E. Kinetic modeling in positron emission tomography. Emission Tomography: The Fundamentals of PET and SPECT, 46(1):499–540, 2004.
- Papamakarios (2019) Papamakarios, G. Neural density estimation and likelihood-free inference. arXiv preprint arXiv:1910.13233, 2019.
- Pukelsheim (1993) Pukelsheim, F. Optimal design of experiments. John Wiley & Sons, 1993.
- Sainsbury-Dale et al. (2022) Sainsbury-Dale, M., Zammit-Mangion, A., and Huser, R. Fast optimal estimation with intractable models using permutation-invariant neural networks. arXiv preprint arXiv:2208.12942, 2022.
- Ushijima & Yeh (2015) Ushijima, T. T. and Yeh, W. W. Experimental design for estimating unknown hydraulic conductivity in an aquifer using a genetic algorithm and reduced order model. Advances in Water Resources, 86:193–208, 2015.
- Wagner (1969) Wagner, J. G. Pharmacokinetics: 10. introduction to compartment models. Drug Intelligence & Clinical Pharmacy, 3(9):250–257, 1969. doi: 10.1177/106002806900300904. URL https://doi.org/10.1177/106002806900300904.
- Wang et al. (2023) Wang, L., Martins, J. R., and Du, X. Optimal experimental design-based reduced order modeling for learning optimal aerodynamic designs. In AIAA AVIATION 2023 Forum, pp. 3716, 2023.
- Yang et al. (2023) Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
- Zakariaei et al. (2023) Zakariaei, N., Paranj, A. F., Abdollahi, H., and Rahmim, A. Using the cluster gauss newton algorithm to estimate theranostic pharmacokinetic model parameters. Journal of Nuclear Medicine, 64(supplement 1):P1449–P1449, 2023. ISSN 0161-5505. URL https://jnm.snmjournals.org/content/64/supplement_1/P1449.
- Zenker et al. (2007) Zenker, S., Rubin, J., and Clermont, G. From inverse problems in mathematical physiology to quantitative differential diagnoses. PLOS Computational Biology, 3(11):1–15, 2007.
- Zhai et al. (2019) Zhai, X., Oliver, A., Kolesnikov, A., and Beyer, L. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1476–1485, 2019.
- Zhang et al. (2022) Zhang, P., Feng, K., Gong, Y., Lee, J., Lomonaco, S., and Zhao, L. Usage of compartmental models in predicting covid-19 outbreaks. The AAPS Journal, 24(5):98, 2022. ISSN 1550-7416. doi: 10.1208/s12248-022-00743-9. URL https://doi.org/10.1208/s12248-022-00743-9. PMID: 36056223.
Appendix A Appendix: Additional Details on ODE Models and Setup
Here we provide additional details on our experiments. For all experiments concerned, the network structure given by Algorithm 1 used , and no layer normalization was used. The function was used as the activation function. Further results of our experimental runs are given in the tables shown herein.
The samples (and therefore that were generated from them) were generated from a lognormal prior distribution for each of the parameters in as mentioned in the main text (equation 17 and equation 19). The mean and standard deviation parameters of a lognormal distribution are related to the mean and standard deviation of the random variable as follows:
(20) |
where in our case refers to each of the parameters being used in the corresponding ODE model.
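For readers who wish to reproduce the prior sampling, the conversion between the mean and standard deviation of a lognormally distributed variable and the parameters of the underlying normal distribution can be sketched as below. This uses the standard textbook relation for the lognormal distribution; the function names, the seed, and the example values are assumptions for illustration and are not taken from our code.

```python
import numpy as np

def lognormal_params(mean, std):
    """Map the mean/std of a lognormal variable to the (mu, sigma) of log(X).

    Standard relation: sigma^2 = ln(1 + std^2 / mean^2),
                       mu      = ln(mean) - sigma^2 / 2.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    sigma2 = np.log(1.0 + (std / mean) ** 2)
    mu = np.log(mean) - 0.5 * sigma2
    return mu, np.sqrt(sigma2)

def sample_lognormal_prior(means, stds, n_samples, seed=0):
    """Draw n_samples parameter vectors, one lognormal prior per parameter."""
    rng = np.random.default_rng(seed)
    mu, sigma = lognormal_params(means, stds)
    return rng.lognormal(mean=mu, sigma=sigma, size=(n_samples, mu.size))

# Hypothetical prior means and standard deviations for three parameters.
samples = sample_lognormal_prior([2.0, 0.5, 10.0], [0.5, 0.1, 2.0], n_samples=1000)
```

Sampling in this way guarantees strictly positive parameter values, which is the usual reason for choosing a lognormal prior for rate constants.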
A.1 Model Setup: 3-Tissue Compartment Model
A diagram of the 4 compartments in the 3-TC ODE model is provided in Figure 6. The source function is a known blood input function (prior to entry into tissue).
[Figure 6: Diagram of the 4 compartments in the 3-TC ODE model.]
In our experiments for the ODE system, we used the input function proposed in Feng et al. (1994) given by:
(21) | ||||
We fit equation 21 to patient data available to us at our institution, arriving at and which we used in equation 21 to obtain our input function . This input function exhibits a sharp spike in concentration at early times; we therefore selected a log-spaced discretization of points for from to minutes. For the initial conditions we set , , and to .
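To illustrate why a log-spaced grid suits an input function with an early spike, the following sketch builds such a grid with NumPy. The start time, end time, and number of points shown here are placeholders chosen for illustration, not the values used in our experiments.

```python
import numpy as np

# Hypothetical grid settings (minutes); the start time must be strictly positive.
t_start, t_end, n_points = 0.01, 100.0, 101

# Points are equally spaced in log(t), so they cluster at early times.
t = np.geomspace(t_start, t_end, num=n_points)

# The step widths grow with time, giving fine resolution near the spike.
dt = np.diff(t)
```

With these placeholder settings, more than half of the grid points fall within the first 1.5 minutes, where the concentration changes most rapidly.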
The means and standard deviations and for the prior distribution of the parameters , and that were used were , , , , , , and respectively. As can be seen, we took the standard deviations of the parameters to be of the means. The noise added to the generated was randomly picked at each sampling from .
A.2 PPM Model Setup
For the Predator-Prey Model described in Section 5, we used an equally spaced time grid of 200 points from to years. Mahaffy (2009) uses data on the populations of lynx and hares from the Hudson Bay Company to fit the four unknown parameters , , and , with initial populations of hares (prey) and lynx (predator) of 30 and 4 (in units of thousands) respectively; we used the same initial populations. We took the fitted parameters obtained by Mahaffy (2009) as the means of the respective prior distributions of the parameters and , and of the means as their respective standard deviations. The means and standard deviations of the parameters for their respective prior distributions were: , , , . The noise added to the generated was randomly picked at each sampling from .
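The forward simulation for this setup can be sketched as follows. The sketch assumes the classical Lotka-Volterra form of the predator-prey equations, a final time of 20 years, and placeholder parameter values; none of these are taken from our experiments, and only the 200-point equally spaced grid and the initial populations of 30 and 4 come from the text above. A fixed-step fourth-order Runge-Kutta integrator is used here purely to keep the example self-contained.

```python
import numpy as np

def lotka_volterra(state, a, b, c, d):
    """Assumed predator-prey right-hand side: x is prey (hares), y is predator (lynx)."""
    x, y = state
    return np.array([a * x - b * x * y, -c * y + d * x * y])

def rk4_solve(rhs, y0, t, *params):
    """Integrate an autonomous ODE on the grid t with classical RK4 steps."""
    y = np.empty((t.size, len(y0)))
    y[0] = y0
    for i in range(t.size - 1):
        h = t[i + 1] - t[i]
        k1 = rhs(y[i], *params)
        k2 = rhs(y[i] + 0.5 * h * k1, *params)
        k3 = rhs(y[i] + 0.5 * h * k2, *params)
        k4 = rhs(y[i] + h * k3, *params)
        y[i + 1] = y[i] + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return y

t = np.linspace(0.0, 20.0, 200)      # 200 equally spaced points; end time is hypothetical
theta = (0.5, 0.03, 0.9, 0.03)       # hypothetical parameter values (a, b, c, d)
traj = rk4_solve(lotka_volterra, np.array([30.0, 4.0]), t, *theta)
```

In a training loop, `theta` would be replaced by a draw from the lognormal prior and noise would be added to `traj` to form the synthetic observations.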
A.3 Scaling of Risks
The loss function equation 7 is a sum of the data risk and the parameter risk . For the data risk we used the trapezoidal rule for variable time widths (owing to our use of logarithmically spaced time points). Note that the risks reported in our results and those used during training had the scaling shown below in equation 22, whose discretized form was used during training.
(22) | ||||
where is a scaling factor. Similarly, for the parameter recovery risk :
(23) | |||||
(24) |
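The trapezoidal rule for variable time widths mentioned above can be sketched as below. The squared-misfit form of the integrand and the `scale` argument are assumptions made for illustration; the exact scaling is the one defined in equation 22, and the function names are hypothetical.

```python
import numpy as np

def trapezoid_nonuniform(values, t):
    """Trapezoidal rule along the last axis for a possibly non-uniform grid t."""
    values = np.asarray(values, dtype=float)
    t = np.asarray(t, dtype=float)
    dt = np.diff(t)  # variable step widths
    return np.sum(0.5 * dt * (values[..., 1:] + values[..., :-1]), axis=-1)

def data_risk(pred, obs, t, scale=1.0):
    """Assumed discretized data risk: scaled time integral of the squared misfit."""
    misfit = np.asarray(pred, dtype=float) - np.asarray(obs, dtype=float)
    return scale * trapezoid_nonuniform(misfit ** 2, t)

# Example on a log-spaced grid; the rule is exact for integrands linear in t.
t_grid = np.geomspace(1.0, 3.0, 7)
integral = trapezoid_nonuniform(2.0 * t_grid + 1.0, t_grid)
```

Weighting each interval by its own width keeps the densely sampled early times from dominating the risk simply because they contain more grid points.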
Appendix B Appendix: Numerical Details of Figures and Tables
In Figure 3 and Figure 5 we exhibit the performance of networks trained using the proposed method at different sparsities. For each training method at each sparsity, a new batch of samples of (and corresponding noisy ) were sampled for the trained optimal network to be evaluated on. For the computation of , the network was evaluated on sets of newly sampled and , amounting to an evaluation over unseen samples. The mean value for the total risk of the optimal network over these sets was computed and taken as , and the Standard Error of the Mean (SEM) for this value was calculated. The are plotted against (number of non-zero elements in ) in Figure 3 and in Figure 5. The corresponding numerical values are presented in Table 2. Note that in Figure 3 and Figure 5, the error bars look point-like due to the SEM() values being smaller than the values of that are plotted (see Table 2). The error bars in the figures were calculated as:
(25) |
The corresponding numerical values for and SEM() are in Table 2.
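As a concrete illustration of the quantities described above, the mean risk and its Standard Error of the Mean over a collection of evaluation sets can be computed as in the sketch below. The use of the sample standard deviation (`ddof=1`), the function name, and the example values are assumptions for illustration.

```python
import numpy as np

def mean_and_sem(risks):
    """Return the mean of per-set risks and the standard error of that mean."""
    risks = np.asarray(risks, dtype=float)
    mean = risks.mean()
    # SEM = sample standard deviation / sqrt(number of evaluation sets)
    sem = risks.std(ddof=1) / np.sqrt(risks.size)
    return mean, sem

# Hypothetical per-set risks, for illustration only.
mean_risk, sem_risk = mean_and_sem([1.0, 2.0, 3.0, 4.0, 5.0])
error_bar = (mean_risk - sem_risk, mean_risk + sem_risk)
```

Because the SEM shrinks with the square root of the number of evaluation sets, error bars computed this way can become too small to see when many sets are used.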
We chose to conduct numerical experiments to compare the risks obtained by our methods to random designs for a subset of sparsities. For the 3-TC model, this was at , and at for the PPM system of equations.
Namely, the risks obtained by training optimal networks for a given method and sparsity are compared against networks trained on random . In Table 1, Table 3 and Table 5, for each ODE model, method, and sparsity, 100 networks were trained on random designs for comparison. For the random runs for continuous , the entries of were sampled from a uniform distribution . For binary design variables, had a random subset of weights (equal to ) set to , with the rest set to . For the computation of each , the optimal network was evaluated on sets of newly sampled and , amounting to an evaluation over unseen samples. A mean value for the risk of the optimal network over these samples was computed and taken as , and the Standard Error of the Mean for this value was calculated. This is given as the SEM() column in Table 3 and Table 2. The parameter risks for the optimal networks were calculated similarly. Each of the networks trained on random designs was evaluated for on new samples of and . The STD() and STD() columns in Table 3 and Table 5 are the standard deviations of the corresponding risks among the 100 networks trained on random for each corresponding entry in the tables. In Table 4 we confirm that the total risks for the trained optimal networks are lower than the risk obtained by the best performing network trained on random for each entry.
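The random baseline designs described above can be generated along the lines of the following sketch. The uniform range of [0, 1], the design length, and the function names are assumptions for illustration; only the idea of uniformly sampled continuous weights and of a random subset of fixed size being switched on comes from the text.

```python
import numpy as np

def random_continuous_design(n, rng, low=0.0, high=1.0):
    """Continuous design: each weight drawn independently from U(low, high)."""
    return rng.uniform(low, high, size=n)

def random_binary_design(n, k, rng):
    """Binary design: exactly k randomly chosen weights set to 1, the rest to 0."""
    w = np.zeros(n)
    w[rng.choice(n, size=k, replace=False)] = 1.0
    return w

rng = np.random.default_rng(0)
n_times, k_active = 200, 10  # hypothetical design length and sparsity
w_cont = random_continuous_design(n_times, rng)
# 100 random binary designs, mirroring the 100 comparison networks.
w_bin_batch = np.stack([random_binary_design(n_times, k_active, rng) for _ in range(100)])
```

Sampling without replacement keeps every binary baseline at exactly the target sparsity, so the comparison against the optimized design is made at equal measurement cost.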
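on unseen samples of and . The Standard Error of the Mean (SEM) for each are also presented.
ODE | Method | Sparsity | | SEM() |
---|---|---|---|---|
3-TC | Cont. w | 1 | | |
3-TC | Cont. w | 2 | | |
3-TC | Cont. w | 4 | | |
3-TC | Cont. w | 6 | | |
3-TC | Cont. w | 10 | | |
3-TC | Cont. w | 15 | | |
3-TC | Cont. w | 20 | | |
3-TC | Cont. w | 30 | | |
3-TC | TS (Binary w) | 1 | | |
3-TC | TS (Binary w) | 2 | | |
3-TC | TS (Binary w) | 4 | | |
3-TC | TS (Binary w) | 6 | | |
3-TC | TS (Binary w) | 10 | | |
3-TC | TS (Binary w) | 15 | | |
3-TC | TS (Binary w) | 20 | | |
3-TC | TS (Binary w) | 30 | | |
PPM | Cont. w | 1 | | |
PPM | Cont. w | 2 | | |
PPM | Cont. w | 4 | | |
PPM | Cont. w | 6 | | |
PPM | Cont. w | 10 | | |
PPM | Cont. w | 15 | | |
PPM | Cont. w | 20 | | |
PPM | Cont. w | 30 | | |
PPM | TS (Binary w) | 1 | | |
PPM | TS (Binary w) | 2 | | |
PPM | TS (Binary w) | 4 | | |
PPM | TS (Binary w) | 6 | | |
PPM | TS (Binary w) | 10 | | |
PPM | TS (Binary w) | 15 | | |
PPM | TS (Binary w) | 20 | | |
PPM | TS (Binary w) | 30 | | |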
ODE | Method | SP | SEM() | ||||
---|---|---|---|---|---|---|---|
3-TC | Cont. w | 2 | |||||
3-TC | Cont. w | 6 | |||||
3-TC | Cont. w | 10 | |||||
3-TC | Cont. w | 20 | |||||
3-TC | TS (Binary w) | 2 | |||||
3-TC | TS (Binary w) | 6 | |||||
3-TC | TS (Binary w) | 10 | |||||
3-TC | TS (Binary w) | 20 | |||||
PPM | Cont. w | 2 | |||||
PPM | Cont. w | 4 | |||||
PPM | Cont. w | 10 | |||||
PPM | Cont. w | 20 | |||||
PPM | TS (Binary w) | 2 | |||||
PPM | TS (Binary w) | 4 | |||||
PPM | TS (Binary w) | 10 | |||||
PPM | TS (Binary w) | 20 |
ODE | Method | Sparsity | |||
---|---|---|---|---|---|
3-TC | Cont. w | 2 | |||
3-TC | Cont. w | 6 | |||
3-TC | Cont. w | 10 | |||
3-TC | Cont. w | 20 | |||
3-TC | TS (Binary w) | 2 | |||
3-TC | TS (Binary w) | 6 | |||
3-TC | TS (Binary w) | 10 | |||
3-TC | TS (Binary w) | 20 | |||
PPM | Cont. w | 2 | |||
PPM | Cont. w | 4 | |||
PPM | Cont. w | 10 | |||
PPM | Cont. w | 20 | |||
PPM | TS (Binary w) | 2 | |||
PPM | TS (Binary w) | 4 | |||
PPM | TS (Binary w) | 10 | |||
PPM | TS (Binary w) | 20 |
ODE | Method | SP | SEM() | ) | |||
---|---|---|---|---|---|---|---|
3-TC | Cont. w | 2 | |||||
3-TC | Cont. w | 6 | |||||
3-TC | Cont. w | 10 | |||||
3-TC | Cont. w | 20 | |||||
3-TC | TS (Binary w) | 2 | |||||
3-TC | TS (Binary w) | 6 | |||||
3-TC | TS (Binary w) | 10 | |||||
3-TC | TS (Binary w) | 20 | |||||
PPM | Cont. w | 2 | |||||
PPM | Cont. w | 4 | |||||
PPM | Cont. w | 10 | |||||
PPM | Cont. w | 20 | |||||
PPM | TS (Binary w) | 2 | |||||
PPM | TS (Binary w) | 4 | |||||
PPM | TS (Binary w) | 10 | |||||
PPM | TS (Binary w) | 20 |