
On the Feedback Law in Stochastic Optimal Nonlinear Control

Mohamed Naveed Gul Mohamed, Suman Chakravorty, Raman Goyal, and Ran Wang

The authors are with the Department of Aerospace Engineering, Texas A&M University, College Station, TX 77843 USA. {naveed, schakrav, ramaniitrgoyal92, rwang0417}@tamu.edu

Abstract

We consider the problem of nonlinear stochastic optimal control. This problem is thought to be fundamentally intractable owing to Bellman’s “curse of dimensionality”. We present a result that shows that repeatedly solving an open-loop deterministic problem from the current state with progressively shorter horizons, similar to Model Predictive Control (MPC), results in a feedback policy that is $O(\epsilon^{4})$ near to the true global stochastic optimal policy, where $\epsilon$ is a perturbation parameter modulating the noise. We show that the optimal deterministic feedback problem has a perturbation structure in that higher-order terms of the feedback law do not affect lower-order terms, and that this structure is lost in the optimal stochastic feedback problem. Consequently, solving the Stochastic Dynamic Programming problem is highly susceptible to noise, even when tractable, and in practice, the MPC-type feedback law offers superior performance even for stochastic systems.

Index Terms:

Stochastic Optimal Control, Nonlinear Systems, Model Predictive Control.

I INTRODUCTION

In this paper, we consider the problem of finite-time nonlinear stochastic optimal control, specifically the stochastic dynamical system:

dx=(f(x)+g(x)u)dt+\epsilon dw,

where $w$ is a Wiener process, $\epsilon$ is a perturbation parameter modulating the noise, and the cost to be optimized is $J^{\pi}=\mathbb{E}\left[\int_{0}^{T}c(x_{t},\pi_{t}(x_{t}))dt+c_{T}(x_{T})\right]$, where the incremental cost has the form $c(x,u)=l(x)+\frac{1}{2}{u}^{\mathsf{T}}Ru$, $c_{T}(x_{T})$ is the terminal cost, $\pi_{t}(x_{t})$ is a control policy, and the cost is minimized over all such policies. We present a result that establishes that repeatedly solving a deterministic optimal control, or open-loop, problem from the current state, shown to be equivalent to applying the deterministic feedback policy to the system, results in a feedback policy that is $O(\epsilon^{4})$ near-optimal with respect to the optimal stochastic feedback policy, in terms of the small noise parameter $\epsilon$. Our analysis shows that under the relatively mild conditions of affine-in-control dynamics and quadratic-in-control cost, the open-loop solution obtained by satisfying the Minimum Principle [bryson] is globally optimal. Further, the deterministic feedback law has a perturbation structure in the sense that the higher-order feedback terms do not affect the lower-order terms, and this structure is lost for the optimal stochastic problem. We obtain the equations that need to be satisfied by the linear and higher-order feedback terms in the optimal feedback law. Although this replanning-based Model Predictive Control (MPC)-type policy is only near-optimal, empirical evidence shows that it is the best we can do in practice: although the optimal stochastic law should, in theory, perform better, solving the stochastic problem is highly susceptible to noise owing to its lack of the perturbation structure, and in practice the MPC-type law gives better performance. 
Thus, this result resolves the trade-off between tractability and optimality in stochastic feedback control problems, showing that, in practice, “what is tractable is also optimal”. In this paper, we consider the case where an analytical model is available for the control synthesis; the case of data-based control is treated in a companion paper [wang2022search]. A final note: the system considered in this work is not the most general, and our goal is not to analyze the most general case. Rather, we show that even in the simplest case considered here, the stochastic problem is fundamentally intractable, and the best we can do in practice is to use the deterministic feedback law, which is implemented by re-planning the open-loop solution as necessary.

A large majority of sequential decision-making problems under uncertainty can be posed as a nonlinear stochastic optimal control problem that requires the solution of an associated Dynamic Programming (DP) problem. However, the computational complexity grows exponentially in the state dimension [bertsekas1], a manifestation of Bellman’s so-called “curse of dimensionality (CoD)” [bellman]. In Approximate DP (ADP), or alternatively in Reinforcement Learning (RL), simulations of the process under a policy are used to approximate the cost-to-go function by sampling the domain [parr3, bertsekas1]. But as the dimension $d$ increases, the number of samples required for evaluation goes up exponentially. There has been recent success using the Deep RL paradigm, where deep neural networks are used as nonlinear function approximators to keep the parametrization tractable [RLHD1, haarnoja2018soft, fujimoto2018addressing, RLHD4, RLHD5]; however, the training times required for these approaches, and the variance of the solutions, are still prohibitive. Hence, the primary problem with ADP/RL techniques is the CoD inherent in the complex representation of the cost-to-go function, together with the exponentially large number of evaluations required for its estimation, which results in high solution variance and makes these techniques unreliable and inaccurate.

In the case of continuous state, control, and observation space problems, the Model Predictive Control (MPC) [Mayne_1, Mayne_2] approach has been used with a lot of success in the control systems and robotics communities. For deterministic systems, the process results in solving the original DP problem in a recursive online fashion. However, stochastic control problems, and the control of uncertain systems in general, are still unresolved problems in MPC. As succinctly noted in [Mayne_1], the problem arises because, in stochastic control problems, the MPC optimization at every time step cannot be over deterministic control sequences but rather has to be over feedback policies, which is, in general, difficult to accomplish since a tractable parameterization of such policies over which to perform the optimization is generally unavailable. Thus, the tube-based MPC approach, and its stochastic counterparts, typically consider linear systems [T-MPC1, T-MPC2, T-MPC3], for which a linear parametrization of the feedback policy suffices, but the methods become intractable when dealing with nonlinear systems [Mayne_3]. In more recent work, event-triggered MPC [ETMPC1, ETMPC2] keeps the online planning computationally efficient by triggering replanning in an event-driven fashion rather than at every time step. We note that event-triggered MPC inherits the same issues mentioned above with respect to the stochastic control problem, and consequently, the techniques are intractable for nonlinear systems. There has been recent work showing the near-optimality of MPC with a perturbation analysis [bounded_regret_mpc, bounded_regret_ltv, superconvergence_mpc], but this work considers a deterministic problem setting with unknown model parameters in the system dynamics, and the regret bound provided is with respect to the controller that has perfect knowledge of the model. In contrast, we show the near-optimality of the deterministic feedback law with respect to the optimal stochastic law.

The fundamental problem is that, although solving the open-loop problem via the Minimum Principle (MP) is much easier, solving for the optimal feedback control under uncertainty requires the solution of the DP equation, which is intractable. Moreover, this raises the question: since all systems are subject to uncertainty, what is the utility of deterministic optimal control?
Contributions: In this work, we establish that the basic MPC-type approach of solving the deterministic open-loop problem (with progressively shorter horizons) at every time step results in a near-optimal policy, to $O(\epsilon^{4})$ , for a nonlinear stochastic system. The result uses a perturbation expansion of the cost-to-go function in terms of a perturbation parameter $\epsilon$ . We show the global optimality of the open-loop solution obtained by satisfying the Minimum Principle [bryson] using the classical Method of Characteristics [Courant-Hilbert] thereby establishing that the MPC feedback law is indeed the optimal deterministic feedback law. Further, we show that the deterministic feedback law has a perturbation structure that is lost in the stochastic problem. We obtain the true linear and higher order feedback gain equations of the optimal deterministic policy as a by-product, which is very different from the Riccati equation governing a typical LQR perturbation feedback design [bryson]. Finally, albeit the MPC law is only “near-optimum”, our empirical evidence shows that this deterministic law has better performance than the stochastic law, obtained by solving the stochastic DP problem computationally, showing the susceptibility of the stochastic DP problem to noise owing to the loss of the perturbation structure, quite apart from the usual curse of dimensionality. Thus, in practice, the MPC law is the best one can do. In contrast to [parunandi2019TPFC], we show fourth order near-optimality to the optimal stochastic solution, the global optimality of the open-loop solution and the perturbation structure of the deterministic feedback law, without which MPC is heuristic, and analytical as well as empirical evidence regarding the superiority of MPC to stochastic DP even when the DP problem is not subject to the curse of dimensionality. The current manuscript expands on our previously published conference paper [mohamed2022acc]. 
In particular, we provide detailed proofs of all our developments, explain the loss of perturbation structure in the stochastic problem, and provide a comprehensive empirical evaluation of the theory proposed in this manuscript.

The rest of the document is organized as follows: Section II states the problem, Section III presents three fundamental results that represent the three legs of the stool that supports the fact that the MPC feedback law is near-optimal, which is established in Section IV. We illustrate our results numerically in Section V using a simple 1-dimensional example for which the stochastic DP problem can be solved, and more practical examples from nonlinear robotic planning.

II PRELIMINARIES

The following outlines the finite time stochastic optimal control problem formulation, and the associated deterministic problem, along with the associated Dynamic Programming (DP) problems that we shall study in this work.

System Model

For a dynamical system, we denote the state and control vectors by $x\in\mathbb{X}\subset\mathbb{R}^{n_{x}}$ and $u\in\mathbb{U}\subset\mathbb{R}^{n_{u}}$, respectively. The dynamics of the system are governed by the stochastic differential equation (SDE):

dx=(\mathcal{F}(x)+\mathcal{G}(x)u)dt+\epsilon dw,

(1)

where $w\in\mathbb{R}^{n_{x}}$ is a Wiener process with covariance $Q\in\mathbb{R}^{n_{x}\times n_{x}}$, and $\epsilon$ is a small parameter that modulates the amplitude of the noise entering the system and thereby affects the signal-to-noise ratio.

1-Dimensional/ Scalar case: For the sake of simplicity in presenting the results, we will consider the scalar or 1-dimensional version of the problem, i.e., $n_{x}=n_{u}=1$ . The final results for the vector case will also be provided. The dynamics of the system for the scalar case is denoted by the following SDE:

dx=(f(x)+g(x)u)dt+\epsilon dw,

(2)

where $f(x)$ and $g(x)$ are the 1-dimensional equivalent of $\mathcal{F}(x)$ and $\mathcal{G}(x)$ , respectively.

Stochastic optimal control problem

The stochastic optimal control problem for an initial state $x_{0}$ is defined as:

J^{\pi^{*}}(x_{0})=\min_{\Pi}\ \mathbb{E}\left[\int^{T}_{0}c(x_{t},\pi_{t}(x_{t}))dt+c_{T}(x_{T})\right],

(3)

subject to the SDE (1), where the optimization is over a family of time-varying feedback policies $\Pi:=\{\pi_{t}(x);t\in[0,T]\}$; $J^{\pi^{*}}(\cdot):\mathbb{X}\rightarrow\mathbb{R}$ is the cost function on applying the optimal policy $\pi^{*}$; $c(\cdot,\cdot):\mathbb{X}\times\mathbb{U}\rightarrow\mathbb{R}$ is the incremental cost function; $c_{T}(\cdot):\mathbb{X}\rightarrow\mathbb{R}$ is the terminal cost function; and $T$ is the finite time horizon of the problem.

Assumptions

We shall make the following assumptions in the rest of the paper, and unless otherwise stated, all results assume the following.

Assumption 1

(A1) Cost Structure. We assume that the incremental cost $c(x,u)$ is quadratic in the control variable, i.e., $c(x,u)=l(x)+\frac{1}{2}{u}^{\mathsf{T}}Ru$ , with $R$ positive definite. The matrix $R$ will be replaced by $r$ for the scalar case.

Assumption 2

(A2) Smoothness. We shall also assume that all the involved functions: $\mathcal{F}(x),\mathcal{G}(x)$ ( $f(x),g(x)$ for the scalar case), $l(x),c_{T}(x),\pi_{t}(x)$ are five times continuously differentiable ( $\mathcal{C}^{5}$ ) in their arguments.

II-A Stochastic Dynamic Programming

The continuous time DP or the stochastic Hamilton-Jacobi-Bellman (HJB) equation for the system in Eq. (1) is given by [OPT_todorov]

-\frac{\partial J}{\partial t}=\min_{u}H(x,u)+\frac{\epsilon^{2}}{2}\sum_{i}\sum_{j}\frac{\partial^{2}J}{\partial x_{i}\partial x_{j}}Q_{ij},

(4)

where $J=J(t,x)$, $J(T,x)=c_{T}(x)$ is the terminal condition, $H(x,u)=l(x)+\frac{1}{2}{u}^{\mathsf{T}}Ru+{\frac{\partial J}{\partial x}}^{\mathsf{T}}(\mathcal{F}(x)+\mathcal{G}(x)u)$ is the Hamiltonian of the system, and $Q=[Q_{ij}]$ is the intensity of the vector Wiener process.
Let $u(t,x)$ denote the corresponding optimal policy. Then, it is sufficient that the optimal control $u$ satisfies the first-order necessary condition (since the Hamiltonian $H(x,u)$ is strictly quadratic in $u$ ):

u=-R^{-1}{\mathcal{G}(x)}^{\mathsf{T}}J^{x},~\text{where}~J^{x}=\frac{\partial J(t,x)}{\partial x}.

(5)
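As a short worked step (a sketch, assuming $R$ is symmetric in addition to positive definite), the expression (5) follows by setting the gradient of the Hamiltonian with respect to $u$ to zero:

\frac{\partial H}{\partial u}=Ru+{\mathcal{G}(x)}^{\mathsf{T}}J^{x}=0\;\Rightarrow\;u=-R^{-1}{\mathcal{G}(x)}^{\mathsf{T}}J^{x},\qquad\frac{\partial^{2}H}{\partial u^{2}}=R\succ 0.

Substituting this minimizer back into the Hamiltonian gives

\min_{u}H(x,u)=l(x)+{J^{x}}^{\mathsf{T}}\mathcal{F}(x)-\frac{1}{2}{J^{x}}^{\mathsf{T}}\mathcal{G}(x)R^{-1}{\mathcal{G}(x)}^{\mathsf{T}}J^{x},

so that the HJB equation (4) becomes a partial differential equation in $J$ alone.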

II-B The Deterministic Problem

Let us now consider the deterministic problem, i.e., Eq. (1) with $\epsilon=0$ and the same cost as in (3), except there is no expectation due to the lack of stochasticity. Utilizing essentially identical arguments as for the stochastic case, the optimal cost-to-go of the deterministic system, $\phi(t,x)$ , satisfies the deterministic HJB equation:

-\frac{\partial{\phi}}{\partial t}=\min_{u}H(x,u),

(6)

where the terminal condition is $\phi(T,x)=c_{T}(x)$, and the Hamiltonian $H(x,u)=l(x)+\frac{1}{2}{u}^{\mathsf{T}}Ru+{\frac{\partial\phi}{\partial x}}^{\mathsf{T}}(\mathcal{F}(x)+\mathcal{G}(x)u)$ is exactly the same as that in the stochastic problem; the only difference is the missing diffusion term $\epsilon^{2}\frac{\partial^{2}J}{\partial x^{2}}$. Finally, identical to the stochastic case, the optimal control in the deterministic case is given by:

u^{d}=-R^{-1}{\mathcal{G}(x)}^{\mathsf{T}}\phi^{x},~\text{where}~\phi^{x}=\frac{\partial\phi}{\partial x}.

(7)

III A PERTURBATION ANALYSIS OF OPTIMAL FEEDBACK CONTROL

In the following four subsections, we establish four basic results that we shall use to establish the near optimality of the MPC law in Section IV. In Section III-A, we characterize the performance of any given feedback policy as a perturbation (series) expansion in the parameter $\epsilon$ . We establish that the $O(\epsilon^{0})$ term depends only on the nominal action, while the $O(\epsilon^{2})$ depends only on the linear part of the feedback law. In Section III-B, we find the differential equations satisfied by these different perturbation costs using the HJB equation and show that the stochastic and deterministic optimal feedback laws share the same nominal and first order costs. In Section III-C, we analyze the nominal/ open-loop problem using the Method of Characteristics and show that the open-loop optimal control has a unique global minimum. Also, we show that the deterministic optimal feedback control problem has a perturbation structure in that the higher order terms do not affect the lower order terms in a Taylor expansion of the optimal feedback law, and obtain the equations governing the optimal linear feedback term in the nonlinear problem, which is shown to be different from a traditional LQR design [bryson]. In Section III-D, we show that this perturbation structure is lost for the stochastic problem leading to a fundamental computational intractability, quite apart from the usual curse of dimensionality.

III-A Characterizing the Performance of a Feedback Policy

In order to derive the results in this section, we first discretize the SDE in Eq. (1) via a Forward Euler approximation [kloedon_numerical_sde, Ch.9] with discretization time $\Delta t$ :

x_{k+1}=x_{k}+(\mathcal{F}(x_{k})+\mathcal{G}(x_{k})u_{k})\Delta t+\epsilon w_{k}\sqrt{\Delta t}+o(\Delta t),

(8)

where $\epsilon<1$ is a perturbation parameter, $w_{k}$ is a white noise sequence with covariance $Q=I_{n_{x}\times n_{x}}$ , $k=0,1\cdots N$ , where $N=T/\Delta t$ , and $||o(\Delta t)||\rightarrow 0$ as $\Delta t\rightarrow 0$ . At the end of this Section, we will obtain the continuous time result by letting $\Delta t\rightarrow 0$ . For notational convenience, we shall not explicitly write the $o(\Delta t)$ term in the following.
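As an illustration of the discretization (8) in the scalar case (2), the following minimal Python sketch simulates one sample path under a given feedback policy and accumulates the discretized cost. The drift $f(x)=-x^{3}$, the input gain $g(x)=1$, the quadratic costs, the policy, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def rollout_cost(policy, x0, eps, T=1.0, n_steps=200, seed=0,
                 f=lambda x: -x ** 3,           # assumed drift f(x)
                 g=lambda x: 1.0,               # assumed input gain g(x)
                 l=lambda x: 0.5 * x * x,       # assumed state cost l(x)
                 r=1.0,                         # assumed control weight
                 c_T=lambda x: 0.5 * x * x):    # assumed terminal cost
    """Simulate one sample path of the scalar system with the forward
    Euler scheme and return the accumulated discretized cost."""
    rng = random.Random(seed)
    dt = T / n_steps
    x, cost = x0, 0.0
    for k in range(n_steps):
        u = policy(k * dt, x)
        # incremental cost c(x, u) * dt with c(x, u) = l(x) + 0.5 * r * u^2
        cost += (l(x) + 0.5 * r * u * u) * dt
        # x_{k+1} = x_k + (f + g u) dt + eps * w_k * sqrt(dt), w_k ~ N(0, 1)
        w = rng.gauss(0.0, 1.0)
        x = x + (f(x) + g(x) * u) * dt + eps * w * math.sqrt(dt)
    return cost + c_T(x)


def mean_cost(policy, x0, eps, n_paths=100, **kwargs):
    """Monte Carlo estimate of the expected cost over n_paths sample paths."""
    total = sum(rollout_cost(policy, x0, eps, seed=s, **kwargs)
                for s in range(n_paths))
    return total / n_paths


def linear_policy(t, x):
    """Hypothetical time-invariant feedback policy, for illustration only."""
    return -x


nominal_cost = rollout_cost(linear_policy, 1.0, eps=0.0)
noisy_mean_cost = mean_cost(linear_policy, 1.0, eps=0.1)
```

Setting `eps=0.0` recovers the noiseless (nominal) rollout, so the same routine can be used to compare the nominal cost with the sample mean of the cost for small `eps`.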

Let us also consider a noiseless version of the system dynamics given by (8), obtained by setting $w_{k}=0$ for all $k$: $\bar{x}_{k+1}=\bar{x}_{k}+(\mathcal{F}(\bar{x}_{k})+\mathcal{G}(\bar{x}_{k})\bar{u}_{k})\Delta t$, where we denote the “nominal” state trajectory as $\bar{x}_{k}$ and the “nominal” control as $\bar{u}_{k}$, with $\bar{u}_{k}=\pi_{k\Delta t}(\bar{x}_{k})$, where $\{\pi_{k\Delta t}(\cdot),k=0,1,\cdots,N\}$ is a discretization of a given continuous-time control policy $\Pi=\{\pi_{t}(x),t\in[0,T]\}$. In the following, to simplify notation, we shall drop the explicit reference to the discretization time $\Delta t$ and denote the discretized policy as $\{\pi_{k}(x)\},k=0,1,\cdots,N$.

Assuming that $\mathcal{F}(\cdot)$ and $\pi_{k}(\cdot)$ are sufficiently smooth (assumption A2), we can expand the dynamics about the nominal trajectory using a Taylor series. Denoting $\delta x_{k}=x_{k}-\bar{x}_{k},\delta u_{k}=u_{k}-\bar{u}_{k}$ , we can express,

\delta x_{k+1}=A_{k}\delta x_{k}+B_{k}\delta u_{k}+S_{k}(\delta x_{k})+\epsilon w_{k}\sqrt{\Delta t},

(9)

\delta u_{k}=K_{k}\delta x_{k}+\tilde{S}_{k}(\delta x_{k}),

(10)

where $A_{k}=I_{n_{x}\times n_{x}}+\frac{\partial(\mathcal{F}(x)+\mathcal{G}(x)u)\Delta t}{\partial x}|_{\bar{x}_{k},\bar{u}_{k}}$, $B_{k}=\frac{\partial(\mathcal{F}(x)+\mathcal{G}(x)u)\Delta t}{\partial u}|_{\bar{x}_{k},\bar{u}_{k}}=\mathcal{G}(\bar{x}_{k})\Delta t$, $K_{k}=\frac{\partial\pi_{k}}{\partial x}|_{\bar{x}_{k}}$, and $S_{k}(\cdot),\tilde{S}_{k}(\cdot)$ are second and higher order terms in the respective expansions.

Using (9) and (10), we can write the closed-loop dynamics of the trajectory $(\delta x_{k})^{N}_{k=1}$ as,

\delta x_{k+1}=\underbrace{(A_{k}+B_{k}K_{k})}_{\bar{A}_{k}}\delta x_{k}+\underbrace{B_{k}\tilde{S}_{k}(\delta x_{k})+S_{k}(\delta x_{k})}_{\bar{S}_{k}(\delta x_{k})}+\epsilon w_{k}\sqrt{\Delta t},

(11)

where $\bar{A}_{k}$ represents the linear part of the closed-loop system and the term $\bar{S}_{k}(\cdot)$ represents the second and higher order terms in the closed-loop system.

Similarly, we can expand the instantaneous cost $c(x_{k},u_{k})$ about the nominal values $c(\bar{x}_{k},\bar{u}_{k})$ as,

c(x_{k},u_{k})\Delta t=\Big(l(\bar{x}_{k})+L_{k}\delta x_{k}+H_{k}(\delta x_{k})+\frac{1}{2}{\bar{u}_{k}}^{\mathsf{T}}R\bar{u}_{k}+{\delta u_{k}}^{\mathsf{T}}R\bar{u}_{k}+\frac{1}{2}{\delta u_{k}}^{\mathsf{T}}R\delta u_{k}\Big)\Delta t,

(12)

c_{T}(x_{N})=c_{T}(\bar{x}_{N})+C_{T}\delta x_{N}+H_{T}(\delta x_{N}),

(13)

where $L_{k}=\frac{\partial l}{\partial x}|_{\bar{x}_{k}}$ , $C_{T}=\frac{\partial c_{T}}{\partial x}|_{\bar{x}_{N}}$ , and $H_{k}(\cdot)$ and $H_{T}(\cdot)$ are second and higher order terms in the respective expansions. The closed-loop incremental cost given in (12) can be expressed as

c(x_{k},u_{k})\Delta t=\underbrace{\{l(\bar{x}_{k})+\frac{1}{2}{\bar{u}_{k}}^{\mathsf{T}}R\bar{u}_{k}\}\Delta t}_{\bar{c}_{k}}+\underbrace{[L_{k}+{\bar{u}_{k}}^{\mathsf{T}}RK_{k}]\Delta t}_{\bar{C}_{k}}\delta x_{k}+\bar{H}_{k}(\delta x_{k}),

where $\bar{H}_{k}(\delta x_{k})$ are the second and higher order terms. Therefore, the cumulative cost of any given closed-loop trajectory $(x_{k},u_{k})^{N}_{k=0}$ can be expressed as $\mathcal{J}^{\pi}(x_{0})=\sum^{N}_{k=0}c(x_{k},\pi_{k}(x_{k}))\Delta t+c_{T}(x_{N})$, which can be written in the following form:

\mathcal{J}^{\pi}(x_{0})=\sum_{k=0}^{N}\bar{c}_{k}+\sum_{k=0}^{N}\bar{C}_{k}\delta x_{k}+\sum_{k=0}^{N}\bar{H}_{k}(\delta x_{k}),

(14)

where $\bar{c}_{N}=c_{T}(\bar{x}_{N}),\bar{C}_{N}=C_{T}$ .

We first show the following critical result. Note: The proofs for the results shown here are given in the appendix.

Lemma 1

Given any sample path, the state perturbation equation given in (11) can be equivalently characterized as

\delta x_{k}=\delta x_{k}^{l}+e_{k},~\delta x_{k+1}^{l}=\bar{A}_{k}\delta x_{k}^{l}+\epsilon w_{k}\sqrt{\Delta t},

(15)

where $e_{k}$ is an $O(\epsilon^{2})$ function that depends on the entire noise history $\{w_{0},w_{1},\cdots,w_{k}\}$ and $\delta x_{k}^{l}$ evolves according to the linear closed-loop system. Furthermore, $e_{k}=e_{k}^{(2)}+O(\epsilon^{3})$, where $e_{k}^{(2)}=\bar{A}_{k-1}e_{k-1}^{(2)}+{\delta x_{k-1}^{l}}^{\mathsf{T}}\bar{S}_{k-1}^{(2)}\delta x_{k-1}^{l}$, $e_{0}^{(2)}=0$, and $\bar{S}_{k}^{(2)}$ represents the Hessian matrix corresponding to the Taylor series expansion of the function $\bar{S}_{k}(\cdot)$.
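The $O(\epsilon^{2})$ scaling of $e_{k}$ can be checked numerically on a toy scalar example. The sketch below is an illustration only: the closed-loop drift $h(x)=\sin(x)-2x$ (that is, $f(x)=\sin x$, $g=1$, $\pi(x)=-2x$), the step size, and the function name are assumptions, not quantities from the paper. It drives the nonlinear closed loop, the nominal trajectory, and the linear perturbation system with the same noise sequence, and compares $e_{N}$ for two values of $\epsilon$.

```python
import math
import random


def perturbation_error(eps, x0=1.0, T=1.0, n_steps=100, seed=0):
    """Return e_N = (x_N - xbar_N) - dx^l_N for an assumed scalar closed loop."""
    h = lambda x: math.sin(x) - 2.0 * x    # closed-loop drift f(x) + g * pi(x)
    dh = lambda x: math.cos(x) - 2.0       # its derivative, giving Abar_k
    rng = random.Random(seed)              # same seed -> same noise sequence w_k
    dt = T / n_steps
    x, xbar, dxl = x0, x0, 0.0
    for _ in range(n_steps):
        noise = eps * rng.gauss(0.0, 1.0) * math.sqrt(dt)
        x_next = x + h(x) * dt + noise               # nonlinear closed loop
        dxl = (1.0 + dh(xbar) * dt) * dxl + noise    # linear perturbation system
        xbar = xbar + h(xbar) * dt                   # noiseless nominal trajectory
        x = x_next
    return (x - xbar) - dxl


# Halving eps should reduce e_N by roughly a factor of four if e_N = O(eps^2).
ratio = perturbation_error(0.02) / perturbation_error(0.01)
```

Because the noise sequence is identical in both runs, the ratio isolates the dependence of $e_{N}$ on $\epsilon$; a value near four is consistent with the leading term $e_{k}^{(2)}$ being quadratic in $\epsilon$.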

Next, we have the following result for the expansion of the cost-to-go function $\mathcal{J}^{\pi}(x_{0})$ .

Lemma 2

Given any sample path, the cost-to-go under a policy can be expanded about the nominal as:

\mathcal{J}^{\pi}(x_{0})=\underbrace{\sum_{k}\bar{c}_{k}}_{\bar{J}^{\pi}}+\underbrace{\sum_{k}\bar{C}_{k}\delta x_{k}^{l}}_{\delta J_{1}^{\pi}}+\underbrace{\sum_{k}{\delta x_{k}^{l}}^{\mathsf{T}}\bar{H}_{k}^{(2)}\delta x_{k}^{l}+\bar{C}_{k}e_{k}^{(2)}}_{\delta J_{2}^{\pi}}+O(\epsilon^{3}),

where $\bar{H}_{k}^{(2)}$ denotes the second order coefficient of the Taylor expansion of $\bar{H}_{k}(\cdot)$ .

Finally, we have the following result characterizing the cost of the policy as the discretization time $\Delta t\rightarrow 0$ .

Proposition 1

Under A2, and given that the closed-loop system under the policy $\pi_{t}(\cdot)$ has a solution over the interval $[0,T]$, the mean of the cost-to-go function obeys: $\lim_{\Delta t\rightarrow 0}\mathbb{E}[\mathcal{J}^{\pi}(x_{0})]\equiv J^{\pi}(x_{0})={J}^{\pi,0}(x_{0})+\epsilon^{2}{J}^{\pi,1}(x_{0})+\epsilon^{4}{J}^{\pi,2}(x_{0})+\mathcal{R}^{\pi}(x_{0})$, for some constants ${J}^{\pi,k}(x_{0})$, $k=0,1,2$, where $\mathcal{R}^{\pi}(x_{0})$ is $o(\epsilon^{4})$, i.e., $\lim_{\epsilon\rightarrow 0}\epsilon^{-4}\mathcal{R}^{\pi}(x_{0})=0$. Furthermore, the term ${J}^{\pi,0}$ arises solely from the nominal control sequence, while ${J}^{\pi,1}$ is solely dependent on the nominal control and the linear part of the perturbation closed-loop.

Remark 1

The interpretation of the result above is as follows: it shows that the $\epsilon^{0}$ term, ${J}^{\pi,0}$ , in the cost, stems from the nominal action of the control policy, the $\epsilon^{2}$ term, $J^{\pi,1}$ , stems from the linear feedback action of the closed-loop, while the higher order terms stem from the higher order terms in the feedback law. In the next section, we use the HJB equation to find the equations satisfied by these terms.

Remark 2

In the above development, we have derived the expression for the cost-to-go of a policy from the initial state $x_{0}$ at the initial time $t=0$, i.e., the above expressions are for $J^{\pi}(0,x_{0})$. However, such an expression is also valid for any pair $(t,x)$, simply by repeating the above development starting at time $t$ from state $x$, i.e., $J^{\pi}(t,x)=J^{\pi,0}(t,x)+\epsilon^{2}J^{\pi,1}(t,x)+\epsilon^{4}J^{\pi,2}(t,x)+\mathcal{R}^{\pi}(t,x)$.

III-B A Closeness Result for Optimal Stochastic and Deterministic Control

Recall the stochastic and deterministic HJB equations (4) and (6) from Section II, and the associated optimal control policies (5) and (7). For simplicity, we consider the scalar case here; the vector case is detailed in the Appendix. Let $\varphi(t,x)$ denote the cost-to-go of the deterministic policy when applied to the stochastic system, i.e., $u^{d}$ applied to Eq. (2). Note that the cost-to-go of the deterministic policy applied to the stochastic system, $\varphi(t,x)$, is different from the deterministic cost-to-go $\phi(t,x)$, and $\varphi(t,x)$ satisfies a policy evaluation equation [bertsekas1]. Similar to the stochastic HJB, the continuous-time policy evaluation equation for $\varphi(t,x)$ can be written as:

-\frac{\partial\varphi}{\partial t}=l(x)+\frac{1}{2}r(u^{d})^{2}+\varphi^{x}(f(x)+g(x)u^{d})+\frac{\epsilon^{2}}{2}\varphi^{xx},

(16)

where $u^{d}=-\frac{1}{r}g(x)\phi^{x}$ . Then, we have the following key result. An analogous version of the following result was originally proved in a seminal paper [fleming1971stochastic] for first passage problems. We provide a simple derivation of the result for a finite time final value problem below.

Proposition 2

The cost function of the optimal stochastic policy, $J(t,x)$, and the cost function of the “deterministic policy applied to the stochastic system”, $\varphi(t,x)$, satisfy: $J(t,x)=J^{0}(t,x)+\epsilon^{2}J^{1}(t,x)+\epsilon^{4}J^{2}(t,x)+\cdots$, and $\varphi(t,x)=\varphi^{0}(t,x)+\epsilon^{2}\varphi^{1}(t,x)+\epsilon^{4}\varphi^{2}(t,x)+\cdots$. Furthermore, $J^{0}(t,x)=\varphi^{0}(t,x)$, and $J^{1}(t,x)=\varphi^{1}(t,x)$, for all $(t,x)$.

Proof:

We show a sketch here for the case of a scalar state, please refer to the appendix for the complete proof.
Due to Proposition 1, the optimal cost function satisfies: $J(t,x)=J^{0}(t,x)+\epsilon^{2}J^{1}(t,x)+\epsilon^{4}J^{2}(t,x)+\cdots$ . Next, we substitute the above equation into the HJB equation (4), along with the minimizing control (5) to obtain a perturbation expansion of the optimal cost function as a power series in $\epsilon^{2}$ . Equating the $O(\epsilon^{0})$ and $O(\epsilon^{2})$ terms on both sides results in governing equations for the $J^{0}$ and $J^{1}$ terms. We also know that the cost function of the deterministic policy when applied to the stochastic system satisfies $\varphi(t,x)=\varphi^{0}(t,x)+\epsilon^{2}\varphi^{1}(t,x)+\cdots$ . Similar to above, we substitute this expression into the policy evaluation equation (16), along with the deterministic optimal control expression $u^{d}=-\frac{1}{r}g(x)\phi^{x}$ , to obtain the governing equations for $\varphi^{0}$ and $\varphi^{1}$ . These equations, when compared with those for $J^{0}$ and $J^{1}$ , are seen to be identical with the same terminal conditions thereby proving the result. ∎
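To make the order matching in the sketch above explicit for the scalar case, the following worked step may help. It is our reconstruction from (4), (5) and the policy evaluation equation, assuming unit noise intensity ($Q=1$) and that the series can be differentiated term by term; the shorthand $J^{k,x}=\partial J^{k}/\partial x$ and $J^{0,xx}=\partial^{2}J^{0}/\partial x^{2}$ (and likewise for $\varphi$) is introduced here for illustration. Substituting (5) and the expansion of $J$ into (4) and collecting powers of $\epsilon$ gives

O(\epsilon^{0}):\quad-\frac{\partial J^{0}}{\partial t}=l-\frac{g^{2}}{2r}(J^{0,x})^{2}+fJ^{0,x},\qquad J^{0}(T,x)=c_{T}(x),

O(\epsilon^{2}):\quad-\frac{\partial J^{1}}{\partial t}=\Big(f-\frac{g^{2}}{r}J^{0,x}\Big)J^{1,x}+\frac{1}{2}J^{0,xx},\qquad J^{1}(T,x)=0.

The first equation is the deterministic HJB equation (6), so $J^{0}=\phi$. Writing the policy evaluation equation for $\varphi$ in the same backward-in-time form as (4), with $u^{d}=-\frac{1}{r}g\phi^{x}$, the same procedure gives

O(\epsilon^{0}):\quad-\frac{\partial\varphi^{0}}{\partial t}=l+\frac{g^{2}}{2r}(\phi^{x})^{2}+\Big(f-\frac{g^{2}}{r}\phi^{x}\Big)\varphi^{0,x},\qquad\varphi^{0}(T,x)=c_{T}(x),

O(\epsilon^{2}):\quad-\frac{\partial\varphi^{1}}{\partial t}=\Big(f-\frac{g^{2}}{r}\phi^{x}\Big)\varphi^{1,x}+\frac{1}{2}\varphi^{0,xx},\qquad\varphi^{1}(T,x)=0.

The choice $\varphi^{0}=\phi$ satisfies the first of these, since it then reduces to (6). With $\varphi^{0}=J^{0}=\phi$, the $O(\epsilon^{2})$ equations for $J^{1}$ and $\varphi^{1}$ coincide, including their terminal conditions.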

Remark 3

$\mathcal{O}(\epsilon^{4})$ Near-Optimality of Linear Perturbation Feedback. According to Proposition 1, we know that the $O(\epsilon^{2})$ term in the perturbation expansion above stems from the linear feedback term for any policy, and thus, the same is true for the deterministic policy. Given an initial state $x_{0}$ , let $(\bar{x}(t),\bar{u}(t))$ denote the optimal nominal trajectory under the deterministic feedback law and let $K_{t}$ denote the linear feedback corresponding to the expansion of the feedback law about this nominal trajectory. Therefore, it follows that if one applies the perturbation linear feedback law $u(t,x_{t})=\bar{u}_{t}+K_{t}\delta x_{t}$ , where the feedback acts on the perturbation from the nominal, $\delta x_{t}=x_{t}-\bar{x}_{t}$ , starting at the initial state $x_{0}$ , then the performance of this linear feedback policy is also within $O(\epsilon^{4})$ of the optimal stochastic policy.

III-C A Perturbation Expansion of Deterministic Optimal Feedback Control: the Method of Characteristics (MOC)

In this section, we will use the classical Method of Characteristics [Courant-Hilbert] to derive results regarding the deterministic optimal control problem. In particular, we will show that satisfying the Minimum Principle is sufficient to assure us of a global optimum for the open-loop problem. Perhaps most importantly, we shall show that the deterministic cost-to-go function has a perturbation structure in that the higher-order terms do not affect the lower-order terms in a Taylor expansion of the optimal feedback law. We also obtain the equations governing the linear and higher-order feedback terms, and show that the linear feedback gain is different from the standard LQR design. Again, for simplicity, we derive the following for the case of a scalar state, please see the Appendix for the vector case.

Let us recall the Hamilton-Jacobi-Bellman (HJB) equation in continuous-time under the same assumptions as above, i.e., quadratic in control cost $c(x,u)=l(x)+\frac{1}{2}ru^{2}$ , and affine in control dynamics $\dot{x}=f(x)+g(x)u$ [bryson]:

\frac{\partial J}{\partial t}+l(x)-\frac{1}{2}\frac{g(x)^{2}}{r}(J^{x})^{2}+f(x)J^{x}=0,

(17)

where $J=J(t,x),\;J^{x}=\frac{\partial J}{\partial x}$ , and the equation is integrated back in time with terminal condition $J(T,x)=c_{T}(x)$ . Define $\frac{\partial J}{\partial t}=p,\;J^{x}=q$ , then the HJB can be written as $F(t,x,J,p,q)=0$ , where $F(t,x,J,p,q)=p+l(x)-\frac{1}{2}\frac{g(x)^{2}}{r}q^{2}+f(x)q$ . One can now write the Lagrange-Charpit equations [Courant-Hilbert] for the HJB as:

\dot{x}=F_{q}=f(x)-\frac{g(x)^{2}}{r}q,

(18)

\dot{q}=-F_{x}-qF_{J}=-l^{x}+\frac{g(x)g^{x}}{r}q^{2}-f^{x}q,

(19)

with the terminal conditions $x(T)=x_{T},\;q(T)=c_{T}^{x}(x_{T})$ , where $F_{x}=\frac{\partial F}{\partial x}$ , $F_{q}=\frac{\partial F}{\partial q}$ , $g^{x}=\frac{\partial g}{\partial x}$ , $l^{x}=\frac{\partial l}{\partial x}$ , $f^{x}=\frac{\partial f}{\partial x}$ and $c_{T}^{x}=\frac{\partial c_{T}}{\partial x}$ .
Given a terminal condition $x_{T}$ , the equations above can be integrated back in time to yield a characteristic curve of the HJB PDE. Now, we show how one can use these equations to get a perturbation solution of the HJB, and in particular, the linear feedback gain $K_{t}$ corresponding to the optimal policy. The development also shows that the solution has a perturbation structure in that higher order terms do not affect the lower order terms.
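The backward integration of the characteristic equations (18) and (19) can be illustrated on a scalar linear-quadratic special case, where the co-state along a characteristic should equal $P_{t}x_{t}$ with $P_{t}$ solving a scalar Riccati equation. The Python sketch below is an assumption-laden illustration: the choices $f(x)=ax$, constant $g$, $l(x)=\frac{1}{2}q_{l}x^{2}$, $c_{T}(x)=\frac{1}{2}q_{T}x^{2}$, the numerical values, the RK4 integrator, and all names are ours and not part of the paper.

```python
def rk4_step(rhs, y, dt):
    """One classical Runge-Kutta step for y' = rhs(y); dt may be negative."""
    k1 = rhs(y)
    k2 = rhs([yi + 0.5 * dt * ki for yi, ki in zip(y, k1)])
    k3 = rhs([yi + 0.5 * dt * ki for yi, ki in zip(y, k2)])
    k4 = rhs([yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate_backward(rhs, y_T, T=1.0, n_steps=1000):
    """Integrate y' = rhs(y) from t = T back to t = 0, starting at y(T) = y_T."""
    y, dt = list(y_T), -T / n_steps
    for _ in range(n_steps):
        y = rk4_step(rhs, y, dt)
    return y


# Hypothetical scalar problem data: f(x) = A x, g = G (so g^x = 0),
# l(x) = 0.5 * QL * x^2, c_T(x) = 0.5 * QT * x^2, control weight R.
A, G, R, QL, QT = 0.5, 1.0, 2.0, 1.0, 3.0


def characteristic_rhs(y):
    """Characteristic equations: x' = f - (g^2 / r) q, q' = -l^x - f^x q."""
    x, q = y
    return [A * x - (G * G / R) * q, -QL * x - A * q]


def riccati_rhs(y):
    """Scalar Riccati equation obtained by inserting J = 0.5 * P * x^2 in the HJB."""
    (P,) = y
    return [-QL - 2.0 * A * P + (G * G / R) * P * P]


x_T = 1.0
x0, q0 = integrate_backward(characteristic_rhs, [x_T, QT * x_T])  # q(T) = c_T^x(x_T)
(P0,) = integrate_backward(riccati_rhs, [QT])                      # P(T) = QT
gain_from_characteristic = q0 / x0   # should agree with P0 up to integration error
```

In this special case the ratio `q0 / x0` recovered from a single characteristic agrees with the Riccati solution `P0`, which is the linear-quadratic analogue of reading the quadratic cost coefficient off the co-state along the nominal trajectory.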

Suppose now that one is given an optimal nominal trajectory $\bar{x}_{t}$, $t\in[0,T]$, for a given initial condition $x_{0}$, from solving the open-loop optimal control problem. Let the nominal terminal state be $\bar{x}_{T}$. We now expand the HJB solution around this nominal optimal solution. To this end, let $x_{t}=\bar{x}_{t}+\delta x_{t}$, for $t\in[0,T]$. Then, expanding the optimal cost function around the nominal yields: $J(t,x_{t})=\bar{J}_{t}+G_{t}\delta x_{t}+\frac{1}{2}P_{t}\delta x_{t}^{2}+\frac{1}{6}S_{t}\delta x_{t}^{3}+\cdots,$ where $\bar{J}_{t}=J(t,\bar{x}_{t})$, $G_{t}=\frac{\partial J}{\partial x_{t}}|_{\bar{x}_{t}}$, $P_{t}=\frac{\partial^{2}J}{\partial x_{t}^{2}}|_{\bar{x}_{t}}$, $S_{t}=\frac{\partial^{3}J}{\partial x_{t}^{3}}|_{\bar{x}_{t}}$. Then, the co-state is $q=\frac{\partial J}{\partial x_{t}}=G_{t}+P_{t}\delta x_{t}+\frac{1}{2}S_{t}\delta x^{2}_{t}+\cdots$.

For simplicity, we assume that $g^{x}=0$ (this is relaxed but at the expense of a rather tedious derivation detailed in the Appendix). Hence,

	$\displaystyle\underbrace{\frac{d}{dt}(\bar{x}_{t}+\delta x_{t})}_{\dot{\bar{x}% }_{t}+\dot{\delta x}_{t}}=$	$\displaystyle\underbrace{f(\bar{x}_{t}+\delta x_{t})}_{(\bar{f}_{t}+\bar{f}_{t% }^{x}\delta x_{t}+\frac{1}{2}\bar{f}_{t}^{xx}\delta x^{2}_{t}+O(\delta x_{t}^{% 3}))}$
		$\displaystyle-\frac{g^{2}}{r}(G_{t}+P_{t}\delta x_{t}+\frac{1}{2}S_{t}\delta x% ^{2}_{t}+O(\delta x_{t}^{3})),$

where $\bar{f}_{t}=f(\bar{x}_{t}),\bar{f}_{t}^{x}=\frac{\partial f}{\partial x_{t}}|_% {\bar{x}_{t}}$ . Expanding in powers of the perturbation variable $\delta x_{t}$ , the equation above can be written as (after noting that $\dot{\bar{x}}_{t}=\bar{f}_{t}-\frac{g^{2}}{r}G_{t}$ due to the nominal trajectory $\bar{x}_{t}$ satisfying the characteristic equation):

	$\displaystyle\dot{\delta x}_{t}=(\bar{f}_{t}^{x}-\frac{g^{2}}{r}P_{t})\delta x_{t}+\frac{1}{2}(\bar{f}_{t}^{xx}-\frac{g^{2}}{r}S_{t})\delta x^{2}_{t}+O(\delta x_{t}^{3}).$		(20)

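Next, since $g^{x}=0$ , the co-state equation (19) reduces to $\frac{dq}{dt}=-l^{x}-f^{x}q$ , and substituting the expansions of $q$ , $l^{x}$ and $f^{x}$ about the nominal gives: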

	$\displaystyle\frac{d}{dt}(G_{t}+P_{t}\delta x_{t}+\frac{1}{2}S_{t}\delta x^{2}% _{t}+O(\delta x^{3}))=-(\bar{l}_{t}^{x}+\bar{l}_{t}^{xx}\delta x_{t}$
	$\displaystyle+\frac{1}{2}\bar{l}_{t}^{xxx}\delta x^{2}_{t}+O(\delta x^{3}))-% \Big{(}\bar{f}_{t}^{x}+\bar{f}_{t}^{xx}\delta x_{t}+\frac{1}{2}\bar{f}_{t}^{% xxx}\delta x^{2}_{t}$
	$\displaystyle+O(\delta x^{3})\Big{)}(G_{t}+P_{t}\delta x_{t}+\frac{1}{2}S_{t}% \delta x^{2}_{t}+O(\delta x^{3})),$		(21)

where $\bar{f}_{t}^{xx}=\frac{\partial^{2}f}{\partial x^{2}}|_{\bar{x}_{t}},\bar{f}_{t}^{xxx}=\frac{\partial^{3}f}{\partial x^{3}}|_{\bar{x}_{t}},\bar{l}^{x}_{t}=\frac{\partial l}{\partial x}|_{\bar{x}_{t}},\bar{l}^{xx}_{t}=\frac{\partial^{2}l}{\partial x^{2}}|_{\bar{x}_{t}},\bar{l}^{xxx}_{t}=\frac{\partial^{3}l}{\partial x^{3}}|_{\bar{x}_{t}}$ . Using $\frac{d}{dt}(P_{t}\delta x_{t})=\dot{P}_{t}\delta x_{t}+P_{t}\dot{\delta x}_{t}$ , $\frac{d}{dt}(S_{t}\delta x^{2}_{t})=\dot{S}_{t}\delta x^{2}_{t}+2S_{t}\delta x_{t}\dot{\delta x}_{t}$ , substituting for $\dot{\delta x}_{t}$ from (20), and expanding the two sides above in powers of $\delta x_{t}$ yields:

	$\displaystyle\dot{G}_{t}+(\dot{P}_{t}+P_{t}(\bar{f}_{t}^{x}-\frac{g^{2}}{r}P_{% t}))\delta x_{t}+\frac{1}{2}\Big{(}P_{t}(\bar{f}_{t}^{xx}-\frac{g^{2}}{r}S_{t})$
	$\displaystyle+\dot{S}_{t}+2S_{t}(\bar{f}_{t}^{x}-\frac{g^{2}}{r}P_{t})\Big{)}% \delta x^{2}+O(\delta x^{3})$
	$\displaystyle\quad=-(\bar{l}_{t}^{x}+\bar{f}_{t}^{x}G_{t})-(\bar{l}_{t}^{xx}+% \bar{f}_{t}^{x}P_{t}+\bar{f}_{t}^{xx}G_{t})\delta x_{t}$
	$\displaystyle-\frac{1}{2}(\bar{l}_{t}^{xxx}+\bar{f}_{t}^{xxx}G_{t}+2\bar{f}_{t% }^{xx}P_{t}+\bar{f}_{t}^{x}S_{t})\delta x^{2}+O(\delta x^{3}).$

Equating the first three powers of $\delta x_{t}$ yields:

	$\displaystyle\dot{G}_{t}+\bar{l}_{t}^{x}+\bar{f}_{t}^{x}G_{t}=0,$		(22)
	$\displaystyle\dot{P}_{t}+\bar{l}_{t}^{xx}+P_{t}\bar{f}_{t}^{x}+\bar{f}_{t}^{x}% P_{t}-P_{t}\frac{g^{2}}{r}P_{t}+\bar{f}_{t}^{xx}G_{t}=0,$		(23)
	$\displaystyle\dot{S}_{t}+\bar{l}_{t}^{xxx}+P_{t}\bar{f}_{t}^{xx}+2\bar{f}_{t}^% {xx}P_{t}+\bar{f}_{t}^{x}S_{t}+2S_{t}\bar{f}_{t}^{x}-P_{t}\frac{g^{2}}{r}S_{t}$
	$\displaystyle-2S_{t}\frac{g^{2}}{r}P_{t}+\bar{f}_{t}^{xxx}G_{t}=0.$		(24)

Using the first-order necessary condition $u(t,x_{t})=-\frac{g}{r}J^{x}$ , the optimal feedback law is given by:

	$\displaystyle u(t,x_{t})$	$\displaystyle=-\frac{g}{r}J^{x}=\underbrace{-\frac{g}{r}G_{t}}_{\bar{u}_{t}}% \underbrace{-\frac{g}{r}P_{t}}_{K_{t}}\delta x_{t}\underbrace{-\frac{g}{2r}S_{% t}}_{K^{(2)}_{t}}\delta x^{2}_{t}+O(\delta x_{t}^{3})$		(25)
	$\displaystyle u(t,x_{t})$	$\displaystyle=\bar{u}_{t}+K_{t}\delta x_{t}+K^{(2)}_{t}\delta x^{2}_{t}+O(% \delta x_{t}^{3}).$

Thus, the optimal feedback law has a perturbation structure: the second-order term $P_{t}$ does not affect the first-order term $G_{t}$ , the third-order term $S_{t}$ does not affect $P_{t}$ , and so on for the higher-order terms.
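
The triangular structure of (22)-(24) can be illustrated with a short backward sweep. The sketch below is an assumption-laden example rather than the paper's code: it takes the scalar case with $g=r=1$ , the drift $f(x)=-\cos(x)$ , and the hypothetical costs $l(x)=\frac{1}{2}x^{2}$ and $c_{T}(x)=\frac{1}{2}x^{2}$ , and integrates the nominal state together with $G_{t}$ , $P_{t}$ , $S_{t}$ from the terminal time using a fourth-order Runge-Kutta step. The function names are made up for this illustration.

```python
import math


def rhs(state, g=1.0, r=1.0):
    """Time derivatives of (x_bar, G, P, S) from (18) and (22)-(24), scalar case."""
    x, G, P, S = state
    # Assumed drift f(x) = -cos(x) and its first three derivatives.
    f, fx, fxx, fxxx = -math.cos(x), math.sin(x), math.cos(x), -math.sin(x)
    # Assumed incremental cost l(x) = x^2 / 2 and its derivatives.
    lx, lxx, lxxx = x, 1.0, 0.0
    b = g * g / r
    x_dot = f - b * G                                      # (18) with q = G
    G_dot = -(lx + fx * G)                                 # (22): no P, no S
    P_dot = -(lxx + 2.0 * fx * P - b * P * P + fxx * G)    # (23): no S
    S_dot = -(lxxx + 3.0 * fxx * P + 3.0 * fx * S
              - 3.0 * b * P * S + fxxx * G)                # (24)
    return (x_dot, G_dot, P_dot, S_dot)


def _shift(s, h, k):
    """Return s + h * k componentwise."""
    return tuple(si + h * ki for si, ki in zip(s, k))


def sweep_backward(x_T, T=1.0, steps=400):
    """Integrate (x_bar, G, P, S) from t = T back to t = 0 with RK4."""
    h = -T / steps
    # Terminal conditions for c_T(x) = x^2 / 2: G = x_T, P = 1, S = 0.
    s = (x_T, x_T, 1.0, 0.0)
    for _ in range(steps):
        k1 = rhs(s)
        k2 = rhs(_shift(s, 0.5 * h, k1))
        k3 = rhs(_shift(s, 0.5 * h, k2))
        k4 = rhs(_shift(s, h, k3))
        s = tuple(si + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
                  for si, a, b, c, d in zip(s, k1, k2, k3, k4))
    return s  # (x_0, G_0, P_0, S_0)


x0, G0, P0, S0 = sweep_backward(0.3)
```

One way to sanity-check the sweep is to compare $P_{0}$ with the finite-difference slope of $G_{0}$ against $x_{0}$ across neighbouring characteristics (and likewise $S_{0}$ with the slope of $P_{0}$ ), since $P=\partial q/\partial x$ and $S=\partial P/\partial x$ along the nominal.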
Now, we provide the final result for the general vector case with a state-dependent control influence matrix (please see the Appendix for details). We ignore the $O(\delta x_{t}^{2})$ and higher-order terms in the feedback law purely for notational convenience.

Definition 1

Let the control influence matrix be given as: $\mathcal{G}(x)=\begin{bmatrix}g_{1}^{1}(x)\cdots g_{1}^{p}(x)\\ \ddots\\ g_{n}^{1}(x)\cdots g_{n}^{p}(x)\end{bmatrix}=\begin{bmatrix}\Gamma^{1}(x)% \cdots\Gamma^{p}(x)\end{bmatrix}$ , i.e., $\Gamma^{j}$ represents the control influence vector corresponding to the $j^{th}$ input. Let $\bar{\mathcal{G}}_{t}=\mathcal{G}(\bar{x}_{t})$ , where $\{\bar{x}_{t}\}$ represents the optimal nominal trajectory. Further, let $\mathcal{F}=\begin{bmatrix}f_{1}(x)\cdots f_{n}(x)\end{bmatrix}^{\intercal}$ denote the drift of the system. Let $G_{t}=[G_{t}^{1}\cdots G_{t}^{n}]^{\intercal}$ , and $R^{-1}\bar{\mathcal{G}}_{t}^{\intercal}G_{t}=-[\bar{u}_{t}^{1}\cdots\bar{u}_{t% }^{p}]^{\intercal}$ , denote the optimal nominal co-state and control vectors respectively. Let the Jacobian and Hessian of our system matrices be defined as:

	$\displaystyle\bar{\mathcal{F}}_{t}^{x}$	$\displaystyle=\begin{bmatrix}\frac{\partial f_{1}}{\partial x_{1}}\cdots\frac{\partial f_{1}}{\partial x_{n}}\\ \ddots\\ \frac{\partial f_{n}}{\partial x_{1}}\cdots\frac{\partial f_{n}}{\partial x_{n}}\end{bmatrix}\Big|_{\bar{x}_{t}},~~\bar{\mathcal{F}}_{t}^{xx,i}=\begin{bmatrix}\frac{\partial^{2}f_{1}}{\partial x_{1}\partial x_{i}}\cdots\frac{\partial^{2}f_{1}}{\partial x_{n}\partial x_{i}}\\ \ddots\\ \frac{\partial^{2}f_{n}}{\partial x_{1}\partial x_{i}}\cdots\frac{\partial^{2}f_{n}}{\partial x_{n}\partial x_{i}}\end{bmatrix}\Big|_{\bar{x}_{t}},$
	$\displaystyle\bar{\mathcal{G}}_{t}^{x,i}$	$\displaystyle=\begin{bmatrix}\frac{\partial g_{1}^{1}}{\partial x_{i}}\cdots\frac{\partial g_{1}^{p}}{\partial x_{i}}\\ \ddots\\ \frac{\partial g_{n}^{1}}{\partial x_{i}}\cdots\frac{\partial g_{n}^{p}}{\partial x_{i}}\end{bmatrix}\Big|_{\bar{x}_{t}}.$		(26)

Similarly $\bar{\Gamma}^{j,x}_{t}=\nabla_{x}\Gamma^{j}|_{\bar{x}_{t}}$ , $\bar{\Gamma}^{j,xx,i}_{t}=\nabla_{xx}\Gamma^{j}|_{\bar{x}_{t}}$ for the vector function $\Gamma^{j}$ . Finally, define $\mathcal{A}_{t}=\bar{\mathcal{F}}_{t}^{x}+\sum_{j=1}^{p}\bar{\Gamma}_{t}^{j,x}% \bar{u}_{t}^{j}$ , $\bar{L}_{t}^{x}=\nabla_{x}l|_{\bar{x}_{t}}$ , and $\bar{L}_{t}^{xx}=\nabla^{2}_{xx}l|_{\bar{x}_{t}}$ .

Proposition 3

Under A1, and given the above definitions, the following result holds for the evolution of the co-state/gradient vector $G_{t}$ , and the Hessian matrix $P_{t}$ , of the optimal cost function ${J}_{t}(x_{t})$ , evaluated on the optimal nominal trajectory $\bar{x}_{t},t\in[0,T]$ :

$\displaystyle\dot{G}_{t}$	$\displaystyle+\bar{L}_{t}^{x}+\mathcal{A}_{t}^{\intercal}G_{t}=0,$	(27)
$\displaystyle\dot{P}_{t}$	$\displaystyle+\mathcal{A}_{t}^{\intercal}P_{t}+P_{t}\mathcal{A}_{t}+\bar{L}_{t% }^{xx}$
$\displaystyle+\sum_{i=1}^{n}$	$\displaystyle[\bar{\mathcal{F}}_{t}^{xx,i}+\sum_{j=1}^{p}\bar{\Gamma}_{t}^{j,% xx,i}\bar{u}_{t}^{j}]G_{t}^{i}-K_{t}^{\intercal}RK_{t}=0,$	(28)
$\displaystyle K_{t}$	$\displaystyle=-R^{-1}[\sum_{i=1}^{n}\bar{\mathcal{G}}_{t}^{x,i,\intercal}G_{t}% ^{i}+\bar{\mathcal{G}}_{t}^{\intercal}P_{t}],$	(29)

with terminal conditions $G_{T}=\nabla_{x}c_{T}|_{\bar{x}_{T}}$ , and $P_{T}=\nabla_{xx}^{2}c_{T}|_{\bar{x}_{T}}$ and the control input with the optimal linear feedback is given by $u_{t}=\bar{u}_{t}+K_{t}\delta x_{t}$ .
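
As a consistency check (a sketch added for illustration, under the assumption $n=p=1$ with a state-independent $g$ , so that all derivatives of the control influence vanish), equations (27)-(29) collapse to the scalar results (22), (23) and the linear gain in (25):

```latex
% Scalar case, n = p = 1, g independent of x:
% \bar{\Gamma}_{t}^{j,x} = 0, \bar{\Gamma}_{t}^{j,xx,i} = 0, \bar{\mathcal{G}}_{t}^{x,i} = 0.
\mathcal{A}_{t} = \bar{f}_{t}^{x}, \qquad
K_{t} = -\frac{g}{r} P_{t}, \qquad
K_{t}^{\intercal} R K_{t} = \frac{g^{2}}{r} P_{t}^{2},
% so that (27) and (28) become
\dot{G}_{t} + \bar{l}_{t}^{x} + \bar{f}_{t}^{x} G_{t} = 0, \qquad
\dot{P}_{t} + 2 \bar{f}_{t}^{x} P_{t} + \bar{l}_{t}^{xx}
  + \bar{f}_{t}^{xx} G_{t} - \frac{g^{2}}{r} P_{t}^{2} = 0.
```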

Remark 4

Not standard LQR. The co-state equation (27) above is identical to the co-state equation in the Minimum Principle [bryson, Pontryagin]. However, the Hessian $P_{t}$ equation (28) is Riccati-like with some important differences: note the extra second-order terms due to $\bar{\mathcal{F}}_{t}^{xx,i}$ and $\bar{\Gamma}_{t}^{j,xx,i}$ in the second line, stemming from the nonlinear drift and input influence vectors, and an extra term in the gain equation (29) coming from the state-dependent influence matrix. These terms are not present in the LQR Riccati equation, and thus, it is clear that this cannot be a traditional perturbation feedback design [bryson, Ch. 6]. If the input influence matrix is independent of the state, the first term in the second line remains, and hence, it is still different from the LQR case.

Remark 5

Convexity and Global Minimum. Recall the Lagrange-Charpit equations (18), (19) for solving the HJB. Given an unconstrained control, under standard smoothness assumptions on the involved functions, the characteristic curves governed by the equations in $(x,q)$ space are unique, and do not intersect. Therefore, the open-loop optimal trajectory, found by satisfying the Minimum Principle, is also the unique global minimum even though the open-loop problem is non-convex. This observation is formalized in the following result.

Proposition 4

Global Optimality of open-loop solution. Let the cost functions $l(\cdot)$ , $c_{T}(\cdot)$ , the drift $f(\cdot)$ and the input influence function $g(\cdot)$ be $\mathcal{C}^{2}$ , i.e., twice continuously differentiable, and let a solution to (18)-(19) exist in $[0,T]$ for any terminal condition $(x_{T},q_{T})$ . Under A1, an optimal trajectory that satisfies the Minimum Principle from a given initial state $x_{0}$ , is the unique global minimum of the open-loop problem starting at the initial state $x_{0}$ .

III-D Loss of Perturbation Structure in Stochastic Control

Finally, we outline the loss of the perturbation structure in the stochastic problem. For the sake of simplicity, we only consider the scalar case in continuous time; even this case brings out the difficulty associated with stochastic control, while the generalization to the vector case is relatively straightforward, albeit somewhat tedious.

Recall the stochastic HJB:

	$\displaystyle-\frac{\partial J}{\partial t}=\min_{u}[H(x,u)]+\frac{\epsilon^{2}}{2}\frac{\partial^{2}J}{\partial x^{2}},$		(30)

where $H(x,u)=l(x)+\frac{1}{2}ru^{2}+(f(x)+gu)\frac{\partial J}{\partial x}$ is the Hamiltonian of the system, and the equation is integrated backwards from a terminal condition $J(T,x)=c_{T}(x)$ . For simplicity, we assume that $g$ is not state dependent in the following derivation and we also assume the noise variance $Q=1$ , which otherwise would appear in the diffusion term in Eq. (30). Suppose now that we are given the optimal policy $u(t,x)$ and suppose that the nominal trajectory of the system (without noise) starting at some $x_{0}$ is given by $\{\bar{x}_{t}\}$ under the nominal control $\{\bar{u}_{t}\}$ . As was done previously, let us now expand the solution of the equation above in terms of the perturbations from this nominal trajectory, $\delta x_{t}=x_{t}-\bar{x}_{t}$ . Then, given the optimal nominal control $\bar{u}_{t}$ , we can solve the minimization of the Hamiltonian as:

$\displaystyle\min_{u_{t}}$	$\displaystyle\ H(x_{t},u_{t})=\min_{\delta u_{t}}H(\bar{x}_{t}+\delta x_{t},% \bar{u}_{t}+\delta u_{t}),$	(31)
	$\displaystyle=\min_{\delta u_{t}}\Big{[}l(\bar{x}_{t}+\delta x_{t})+\frac{r}{2% }\bar{u}_{t}^{2}+(f(\bar{x}_{t}+\delta x_{t})+g\bar{u}_{t})\frac{\partial J}{% \partial x}$
	$\displaystyle+r\bar{u}_{t}\delta u_{t}+\frac{r}{2}\delta u_{t}^{2}+g\delta u_{% t}\frac{\partial J}{\partial x}\Big{]},$

which leads to the necessary condition for a minimum:

	$\displaystyle(g\frac{\partial J}{\partial x}+r\bar{u}_{t})+r\delta u_{t}=0,$		(32)

which is also sufficient for a minimum since $r>0$ leading to $H$ being strictly quadratic in the variable $\delta u_{t}$ . From Eq. (32), the optimizing perturbation control is given by $\delta u_{t}=-\bar{u}_{t}-\frac{g}{r}\frac{\partial J}{\partial x}$ .

Now, let us expand the dynamics and the optimal cost function in the HJB in terms of their perturbations from the nominal trajectory: $f(x_{t})=f(\bar{x}_{t})+F_{t}^{1}\delta x_{t}+\frac{1}{2}F_{t}^{2}\delta x_{t}% ^{2}+\cdots$ , $J(t,x_{t})=\bar{J}_{t}(\bar{x}_{t})+K_{t}^{1}\delta x_{t}+\frac{1}{2}K_{t}^{2}% \delta x_{t}^{2}+\cdots$ , where the $F_{t}^{i},K_{t}^{i}$ represent the Taylor coefficients of the series expansion of these functions. Therefore, $\frac{\partial J}{\partial x}=K_{t}^{1}+K_{t}^{2}\delta x_{t}+\frac{K_{t}^{3}}% {2}\delta x_{t}^{2}+\cdots$ , $\frac{\partial^{2}J}{\partial x^{2}}=K_{t}^{2}+K_{t}^{3}\delta x_{t}+\frac{1}{% 2}K_{t}^{4}\delta x_{t}^{2}+\cdots$ . Noting that the variable $x_{t}=\bar{x}_{t}+\delta x_{t}$ , i.e., the space variable has an explicit time dependence via the nominal trajectory, it follows that:

	$\displaystyle\frac{\partial J(t,x_{t})}{\partial t}=[\dot{\bar{J}}_{t}(\bar{x}% _{t})+\dot{K}_{t}^{1}\delta x_{t}+\frac{1}{2}\dot{K}_{t}^{2}\delta x_{t}^{2}+\cdots]$
	$\displaystyle-\dot{\bar{x}}_{t}[K_{t}^{1}+K_{t}^{2}\delta x_{t}+\frac{K_{t}^{3% }}{2}\delta x_{t}^{2}+\cdots],$		(33)

where, $\dot{\bar{J}}_{t}(\bar{x}_{t}),\dot{K}_{t}^{1},\cdots$ , are total derivatives with respect to $t$ , since they only depend on the time.

Then, using the above expressions, one can express the minimum value of the Hamiltonian in terms of the state perturbations $\delta x_{t}$ as:

$\displaystyle\min_{u_{t}}H(x_{t},u_{t})$	$\displaystyle=[l(\bar{x}_{t})+L_{t}^{1}\delta x_{t}+\frac{L_{t}^{2}}{2}\delta x% _{t}^{2}+\cdots]$
	$\displaystyle-\frac{g^{2}}{2r}[K_{t}^{1}+K_{t}^{2}\delta x_{t}+\frac{K_{t}^{3}% }{2}\delta x_{t}^{2}+\cdots]^{2}$
	$\displaystyle+(f(\bar{x}_{t})+F_{t}^{1}\delta x_{t}+\frac{F_{t}^{2}}{2}\delta x% _{t}^{2}+\cdots)(K_{t}^{1}$
	$\displaystyle+K_{t}^{2}\delta x_{t}+\frac{K_{t}^{3}}{2}\delta x_{t}^{2}+\cdots).$	(34)

Next, noting that $\dot{\bar{x}}_{t}=f(\bar{x}_{t})+g\bar{u}_{t}$ , we obtain the following equations for the evolution of the Taylor coefficients of the optimal cost function by substituting Eqs. (33) and (34) into the stochastic HJB (Eq. (30)) and equating the different powers of $\delta x_{t}$ on both sides.

$\displaystyle-\dot{\bar{J}}_{t}$	$\displaystyle=\bar{l}_{t}-\frac{g}{r}(\frac{gK_{t}^{1}}{2}+r\bar{u}_{t})K_{t}^{1}+\frac{\epsilon^{2}}{2}K_{t}^{2},$	(35)
$\displaystyle-\dot{K}_{t}^{1}$	$\displaystyle=L_{t}^{1}+F_{t}^{1}K_{t}^{1}-\frac{g}{r}(gK_{t}^{1}+r\bar{u}_{t})K_{t}^{2}+\frac{\epsilon^{2}}{2}K_{t}^{3},$	(36)
$\displaystyle-\dot{K}_{t}^{2}$	$\displaystyle=L_{t}^{2}+2F_{t}^{1}K_{t}^{2}+F_{t}^{2}K_{t}^{1}-\frac{g^{2}}{r}(K_{t}^{2})^{2}$
	$\displaystyle-\frac{g}{r}(gK_{t}^{1}+r\bar{u}_{t})K_{t}^{3}+\frac{\epsilon^{2}}{2}K_{t}^{4},$	(37)

where we have expanded the first three terms of the expansion in the equations above, and similar expansions may be done for the higher order terms as well. At this point, we make the following remarks regarding the perturbation expansion above.

Remark 6

Computational Intractability of the Stochastic Problem. The equations above show that the lower order terms in the stochastic problem are affected by the higher order terms unlike in the deterministic case. Thus, in order to compute the stochastic law, we have to approximate to a high enough order to ensure accuracy in the solution, which in turn implies that the solution of the stochastic problem is very prone to errors. To see this, note that if we were to expand the solution to the $n^{th}$ order, the Taylor co-efficient $K_{t}^{n}$ would be affected by the coefficients $K_{t}^{n+1}$ and $K_{t}^{n+2}$ , and therefore these higher order coefficients would need to be sufficiently small for the resulting solution to be accurate. However, if one approximates to a very high order $n$ , quite apart from the obvious curse of dimensionality issue, the resulting system of equations becomes severely ill-conditioned, and consequently, highly sensitive to small errors in the data. Please see our related paper [RL_conv] for the relevant details on this aspect.

Remark 7

The Deterministic Problem. The expressions above also allow us to find the perturbation expansions for the deterministic problem. It is key to note that if the problem considered is deterministic then $gK_{t}^{1}+r\bar{u}_{t}=0$ due to the minimum principle and since $\epsilon=0$ in the deterministic problem, we obtain the expressions that we derived via the Method of Characteristics in the previous section. The Method of Characteristics is still necessary since it allows us to establish the uniqueness of the optimal nominal trajectory $(\bar{x}_{t},\bar{u}_{t})$ . Thus, the above development can be thought of as an alternative way to derive the perturbation expansion result. Furthermore, we can see that if we are required to derive the cost-to-go of the deterministic policy when applied to the stochastic system, albeit $gK_{t}^{1}+r\bar{u}_{t}=0$ due to optimality, nonetheless, there is coupling from the higher order terms due to stochasticity arising from the $O(\epsilon^{2})$ terms above, and thus, even this case is intractable to compute. However, since we are interested only in the deterministic feedback law, such a computation is unnecessary.

IV THE OPTIMALITY OF SHRINKING HORIZON MODEL PREDICTIVE CONTROL

In our developments till this point, we have shown that the deterministic feedback law is near-optimal with respect to the optimal stochastic law and that it has a perturbation structure that is lost in the stochastic problem. However, solving the deterministic DP problem is also subject to the Curse of Dimensionality. Nonetheless, owing to the perturbation structure, one can solve the deterministic problem locally (up to the linear feedback term), and then replan at fixed decision time epochs, assuming that the time between the decision epochs is small enough that the local feedback law remains valid in between the epochs. Thus, consider a Model Predictive type approach to solving the stochastic control problem. We outline the algorithmic procedure in Algorithm 1 to highlight that our advocated procedure is slightly different from the traditional MPC approach studied in the literature [Mayne_1, Mayne_2].

Given: initial state $x_{0}$ , time horizon $T$ , cost $c(x,u)=l(x)+\frac{1}{2}{u}^{\mathsf{T}}Ru$ , terminal cost $c_{T}(x)$ , and decision epoch time $\Delta$ .

Set $N=\frac{T}{\Delta}$ , $t=0$ , $x_{i}=x_{0}$ .

while $t<N\Delta$ do

1. Solve the open-loop (noise free) optimal control problem for initial state $x_{i}$ , along with the associated linear perturbation feedback, for the horizon ( $N\Delta-t$ ). Let the perturbation feedback law be denoted by $u(t,x)=\bar{u}_{t}+K_{t}\delta x_{t}$ , where $\delta x_{t}=x_{t}-\bar{x}_{t}$ and $(\bar{x}_{t},\bar{u}_{t})$ is the optimal nominal trajectory.

2. Apply the perturbation feedback law $u(t,x)$ till time $(t+\Delta)$ and observe the state $x_{f}=x_{(t+\Delta)}$ .

3. Set $t=t+\Delta$ , $x_{i}=x_{f}$ .

end while

Algorithm 1 Shrinking Horizon MPC (MPC-SH)
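
The loop structure of Algorithm 1 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the open-loop solver is replaced by a hypothetical stand-in, `solve_open_loop_lq`, which handles only the scalar linear-quadratic case $\dot{x}=ax+gu$ with costs $\frac{1}{2}qx^{2}+\frac{1}{2}ru^{2}$ and $\frac{1}{2}q_{T}x^{2}$ through an explicit Euler sweep of the Riccati equation, and the noisy system is simulated with an Euler-Maruyama step. For a nonlinear problem the stand-in would be swapped for a trajectory optimizer.

```python
import math
import random


def solve_open_loop_lq(x_i, horizon, dt, a, g, q, r, q_T):
    """Stand-in open-loop solver for dx/dt = a x + g u with quadratic costs.

    Returns the nominal states, nominal controls, linear gains and step size.
    """
    n = max(1, round(horizon / dt))
    h = horizon / n
    P = [0.0] * (n + 1)
    P[n] = q_T
    for k in range(n, 0, -1):  # explicit Euler sweep of the Riccati ODE, backward in time
        P[k - 1] = P[k] + h * (q + 2.0 * a * P[k] - (g * g / r) * P[k] ** 2)
    K = [-(g / r) * p for p in P]
    x_bar, u_bar = [x_i], []
    for k in range(n):
        u_bar.append(K[k] * x_bar[k])
        x_bar.append(x_bar[k] + h * (a * x_bar[k] + g * u_bar[k]))
    return x_bar, u_bar, K, h


def mpc_sh(x0, T, delta, dt, eps, rng, a=0.5, g=1.0, q=1.0, r=1.0, q_T=1.0):
    """Shrinking-horizon MPC loop; returns the observed state at each epoch."""
    N = round(T / delta)
    x_i, observed = x0, [x0]
    for i in range(N):  # same as: while t < N * delta
        t = i * delta
        # Step 1: plan over the remaining horizon N * delta - t from x_i.
        x_bar, u_bar, K, h = solve_open_loop_lq(x_i, N * delta - t, dt, a, g, q, r, q_T)
        # Step 2: apply u = u_bar + K * (x - x_bar) for one epoch.
        x = x_i
        for k in range(round(delta / h)):
            u = u_bar[k] + K[k] * (x - x_bar[k])
            x += h * (a * x + g * u) + eps * math.sqrt(h) * rng.gauss(0.0, 1.0)
        # Step 3: advance to the next epoch and replan from the observed state.
        x_i = x
        observed.append(x_i)
    return observed


noise_free = mpc_sh(1.0, T=1.0, delta=0.1, dt=0.01, eps=0.0, rng=random.Random(0))
noisy = mpc_sh(1.0, T=1.0, delta=0.1, dt=0.01, eps=0.3, rng=random.Random(1))
```

With `eps=0.0` the replanned trajectory coincides with the single plan computed at the initial time, so replanning only changes the outcome when noise is present.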

Remark 8

In traditional MPC [Mayne_1, Mayne_2], the horizon $N$ to solve the open-loop problem is fixed. The setting is deterministic, and the necessity of replanning for the problem stems from the assumption that the actual problem horizon is infinite, and therefore, computationally intractable. In contrast, our problem horizon is finite, the repeated replanning takes place over progressively shorter horizons, and the need for replanning arises from the stochasticity of the problem. In particular, note that if the system were really deterministic, there would be no need for replanning.

Theorem 1

Near-Optimality of MPC-SH. The MPC feedback policy obtained from the application of the Shrinking Horizon MPC algorithm is near-optimal to $O(\epsilon^{4})$ to the optimal stochastic feedback policy for the stochastic system (2).

Proof:

We know that $J_{t}^{0}(x)=\varphi_{t}^{0}(x)$ , and $J_{t}^{1}(x)=\varphi_{t}^{1}(x)$ from Proposition 2, for all $(t,x)$ . Owing to the uniqueness and global optimality of the open-loop solution from Proposition 4, it follows that the nominal control sequence, and the associated linear perturbation feedback law, found by the MPC procedure outlined above coincide locally with the optimal deterministic feedback law given any state $x$ and any time $t$ . Therefore, the result follows. $\hfill\blacksquare$

Note that the proof above also shows that the MPC-SH procedure furnishes the optimal deterministic feedback law which is stated in the following corollary.

Corollary 1

The MPC-SH algorithm furnishes the optimal deterministic feedback law given any initial condition.

The result above establishes that repeatedly solving the deterministic optimal control problem from the current state at the decision making epochs results in a near-optimal stochastic policy. We examine two particularly important consequences in the following.

Stochastic MPC

A major computational bottleneck with stochastic MPC [Mayne_1] is that the MPC search needs to be over (time-varying) feedback policies rather than control sequences owing to the stochasticity of the problem, which leads to an intractable optimization for nonlinear systems. Because of this intractability, most of the work in stochastic MPC deals with linear systems using the stochastic tube approach [smpc_mesbah2016, smpc_heirung2018], and some more recent work uses generalized polynomial chaos (gPC) [kim2013generalised, FISHER2009polychaos]. Nonlinear stochastic MPC using gPC also typically solves over control sequences instead of feedback policies for tractability. However, as our results demonstrate, the MPC feedback law we propose (MPC-SH) is near-optimal to the fourth order. Further, as we have shown analytically in Section III-D and as will be seen from our simulation results in Section V, in practice, the solution of the stochastic DP problem is highly sensitive to noise, quite apart from the usual issue of dimensionality, and MPC-SH gives much better performance than the solution of the stochastic DP problem. A further important practical consequence of Theorem 1 is that we can get performance comparable to MPC by wrapping the optimal linear feedback law around the nominal control sequence ( $u_{t}=\bar{u}_{t}+K_{t}\delta x_{t}$ ), and replanning the nominal sequence only when the deviation is large enough, similar to the event driven MPC philosophy [ETMPC1, ETMPC2].

Reinforcement Learning

The problems considered in reinforcement learning can be construed as one of finding the optimal feedback policy for a stochastic nonlinear dynamical system [bertsekas1]. Typically, this is done via simulations or rollouts of the dynamical system of interest, which, allied with a suitable function approximator such as a (deep) neural net, yield a nonlinear feedback policy. However, these methods tend to be highly data intensive, slow to converge, and suffer from extremely high variance in the solution since they try to solve the DP equation [RL_conv]. This is a manifestation of the inherent curse of dimensionality in trying to solve the stochastic DP problem. Thus, in our opinion, although the DP equation is an excellent analytical tool to study the structure of the feedback problem, it is not the correct synthesis tool. In fact, it is much easier to repeatedly solve the open-loop problem as prescribed by MPC, i.e., solve for the characteristic curves of the DP problem. Of course, there remains the problem of whether we can solve the open-loop problem online. In our opinion, this is feasible today, when allied with efficient computational algorithms like iLQR [ILQG_tassa2012synthesis] that exploit the causal structure of optimal control problems, suitable high performance computing (HPC) modifications, and suitable randomization of the computations via rollouts that can help us very efficiently estimate the system parameters involved. In fact, this is the subject of the second part of this paper on data-based control [wang2022search].

V Simulation Results

This section presents simulation evidence for the theoretical results derived previously. In Subsection V-A, the inaccuracy of the stochastic solution, as discussed in Remark 6, is shown for a simple 1-D problem in comparison with the deterministic solution. The near-optimality of MPC-SH, which was theoretically shown to be the optimal deterministic solution in Theorem 1, is also compared with the stochastic solution in a nonlinear problem. Further, we show why it is intractable to solve the stochastic HJB accurately. In Subsection V-B, the performance of the optimal linear perturbation feedback derived in Section III-C is compared with MPC-SH on nonlinear robotic problems. The experiments shown in this section are carried out over 500 Monte Carlo simulations, and the performance statistics are computed from these simulations.

V-A Deterministic vs. Stochastic policy

In this section, we aim to show through simulations, that computing the optimal stochastic feedback law is subject to errors, as explained by the theory discussed previously in Sec. III-D. We show this by comparing the performance of the deterministic solution applied to the stochastic problem and the stochastic solution in a nonlinear problem. We consider the following problem:


	$\displaystyle J(0,x(0))=$
	$\displaystyle\min_{\{u_{t}\}}\mathop{\mathbb{E}}_{\begin{subarray}{c}{}\end{% subarray}}\left[{\frac{1}{2}\left(\int_{0}^{T}(qx(t)^{2}+ru(t)^{2})dt+q_{T}x(T% )^{2}\right)}\right]$		(38a)
	$\displaystyle\text{s.t.}~{}dx=(f(x)+g(x)u)dt+\epsilon dw,~{}\text{given}~{}x(0).$		(38b)

The solution to the above problem is calculated by solving the HJB equation (written for the scalar case):

	$\displaystyle-\frac{\partial J}{\partial t}=\frac{1}{2}\Big(qx^{2}-\frac{g(x)^{2}}{r}\Big(\frac{\partial J}{\partial x}\Big)^{2}\Big)+f(x)\frac{\partial J}{\partial x}+\frac{\epsilon^{2}Q}{2}\frac{\partial^{2}J}{\partial x^{2}},$		(39)

where $J=J(t,x)$ is the expected cost-to-go from state $x$ at time $t$ , with terminal condition $J(T,x)=\frac{1}{2}q_{T}x^{2}$ . The minimizing optimal control is $u=-\frac{1}{r}g(x)\frac{\partial J}{\partial x}$ , and we take $q=100$ , $q_{T}=500$ and $r=1$ . The noise $w$ added to the system in the stochastic cases is zero-mean Gaussian white noise, with standard deviation equal to the maximum value of the control input obtained from the nominal trajectory by solving the deterministic problem, i.e., $\sqrt{Q}=\bar{u}_{max}$ . The HJB equation in Eq. (39) is solved by the finite difference (FD) method in a fixed domain, since FD is the standard method in the computational fluid dynamics community [FD_CFD] for solving advection-diffusion PDEs such as Eq. (39). The parameters used in FD are shown in Table I. The time and space discretization was chosen to satisfy the Courant–Friedrichs–Lewy (CFL) conditions [CFL]. We consider only a 1-D problem for the sake of easy illustration since Eq. (39) becomes computationally intractable to solve for high-dimensional problems; nevertheless, these simple low dimensional problems clearly illustrate the issues with solving the HJB equation.

Domain	$\Delta x$	$\Delta t$	$T$
$[-2,2]$	$0.02$	$3.33\times 10^{-6}$	1

TABLE I: Parameters used in finite difference solution to HJB PDE.
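
To give a feel for why the time step in Table I is so small, the sketch below evaluates the two textbook stability numbers for an explicit finite difference scheme: the Courant number $|a|\Delta t/\Delta x$ for the advection term and the diffusion number $D\Delta t/\Delta x^{2}$ with $D=\frac{\epsilon^{2}Q}{2}$ . The paper does not state which FD scheme was used, so the explicit scheme, the thresholds of $1$ and $\frac{1}{2}$ , the values $\epsilon=1$ and $Q=1$ , and the advection speed bound are all assumptions made for illustration; the function name is hypothetical.

```python
def explicit_fd_numbers(dx, dt, eps, Q, max_speed):
    """Courant and diffusion numbers for an (assumed) explicit FD scheme."""
    courant = max_speed * dt / dx
    diffusion = 0.5 * eps ** 2 * Q * dt / dx ** 2
    return courant, diffusion


# Grid values from Table I.
dx, dt = 0.02, 3.33e-6

# Assumed speed bound: at t = T, J_x = q_T * x, so |J_x| <= 500 * 2 on [-2, 2].
# With g = r = 1 and a drift bounded by |f| <= 1, the advection speed
# |f - (g^2 / r) J_x| is at most 1001.
courant, diffusion = explicit_fd_numbers(dx, dt, eps=1.0, Q=1.0, max_speed=1001.0)

# Classical sufficient conditions for each term taken separately.
stable = courant <= 1.0 and diffusion <= 0.5
```

Under these assumptions the Courant number comes out near $0.17$ and the diffusion number near $0.004$ , which suggests that the steep terminal cost, rather than the diffusion term, is what drives the fine time discretization.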

Nonlinear case

We consider the nonlinear system $dx=(-\cos(x)+u)dt+\epsilon dw$ with initial condition $x(0)=1$ . As discussed in Sec. IV, the MPC-SH feedback law is the optimal feedback law for the deterministic problem and its cost is $O(\epsilon^{4})$ near-optimal to the stochastic cost. The algorithm for MPC-SH is given in Algorithm 1. To solve the open-loop optimization problem in MPC-SH, the iterative linear quadratic regulator (ILQR) algorithm is used [ILQG_tassa2012synthesis]. ILQR is used specifically since the converged optimal solution satisfies the necessary conditions of the minimum principle given in Eqs. (18), (19). As discussed in Proposition 4, the deterministic open-loop problem has a unique minimum for our case, and ILQR will guarantee convergence to it [wang2022search].

In our experiment, we solve the HJB equation in (39) for a particular value of $\epsilon$ in the domain $[-2,2]$ . We take the resulting feedback policy $u=-\frac{1}{r}\frac{\partial J}{\partial x}$ and apply it to the nonlinear system, with the noise acting on the system scaled by the same $\epsilon$ for which the HJB was solved. The system is simulated under this feedback policy for a time interval of $[0,1]$ . We do Monte Carlo simulations of the system for different noise samples of $w$ and obtain the mean and standard deviation of the cost incurred by the system over these experiments. The open-loop optimization in MPC-SH is solved using ILQR as discussed above for the specific initial condition and tested on the stochastic nonlinear system for a value of $\epsilon$ . The experiment is repeated for different noise levels by varying $\epsilon$ . The decision epoch time chosen for MPC-SH was $\Delta=0.005$ , approximately $1500\times$ the $\Delta t$ used in FD. The mean and standard deviation of the cost incurred in these experiments are shown in Fig. 1. Fig. 1 shows that the MPC-SH feedback law has comparable performance with the stochastic HJB-FD solution. MPC-SH is also computationally more efficient to solve, as HJB-FD requires very fine time discretization to solve without numerical issues even for the 1-D case owing to the CFL conditions (see Table I). Also, MPC-SH finds an optimal trajectory for a single initial condition, as opposed to HJB-FD, which finds all solutions over the entire domain and is therefore computationally expensive.
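
The Monte Carlo evaluation described above can be sketched as follows. This is an illustrative example rather than the code used for the reported results: it assumes an Euler-Maruyama discretization with a hypothetical step `dt`, unit noise variance ( $Q=1$ ), and a user-supplied `policy(t, x)` callable; the linear policy used in the demonstration is a placeholder, not the HJB-FD or MPC-SH policy.

```python
import math
import random
import statistics


def simulate_cost(policy, eps, rng, x0=1.0, T=1.0, dt=1e-3,
                  q=100.0, r=1.0, q_T=500.0):
    """Cost of one noisy rollout of dx = (-cos(x) + u) dt + eps dw."""
    steps = round(T / dt)
    x, cost = x0, 0.0
    for k in range(steps):
        u = policy(k * dt, x)
        cost += 0.5 * (q * x * x + r * u * u) * dt     # running cost
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)       # Wiener increment
        x += (-math.cos(x) + u) * dt + eps * dw        # Euler-Maruyama step
    return cost + 0.5 * q_T * x * x                    # add terminal cost


def monte_carlo(policy, eps, runs=500, seed=0):
    """Mean and sample standard deviation of the cost over `runs` rollouts."""
    rng = random.Random(seed)
    costs = [simulate_cost(policy, eps, rng) for _ in range(runs)]
    return statistics.mean(costs), statistics.stdev(costs)


def placeholder_policy(t, x):
    """Hypothetical linear feedback used only to exercise the harness."""
    return -10.0 * x


mean_cost, std_cost = monte_carlo(placeholder_policy, eps=0.2, runs=50)
```

Repeating the call over a grid of `eps` values for each policy under test yields the mean and standard deviation curves of the kind compared in this section.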

[Figure: (a) Full noise spectrum.]

Even when the deterministic solution, which MPC-SH is, is applied to the stochastic case, the performance is almost equivalent, due to the $O(\epsilon^{4})$ near-optimality of the deterministic solution to the stochastic. Moreover, the stochastic policy has higher variance than the deterministic MPC-SH policy at $\epsilon=0.8$ , and fails beyond that, which is further evidence that the calculated stochastic policy is inaccurate. To illustrate the inaccuracy in the HJB-FD solution, we compare the expected cost-to-go value calculated by solving the HJB with the true cost of operation in Fig. 2a. It can be seen that the cost-to-go becomes inaccurate after $\epsilon=0.6$ . The reason for the inaccuracy of the stochastic HJB-FD solution is illustrated in Figs. 2c and 2d. The plots show the trajectories taken by the system under the HJB-FD feedback policy for different values of $\epsilon$ . When the $\epsilon$ value parametrizing the strength of the noise becomes large, it can be seen that the trajectories leave the domain on which the solution is obtained, due to the noise acting on the system. Since the cost-to-go solution is unavailable outside the domain, one has to approximate the cost of these trajectories with the cost at the boundary. To get an accurate solution, the domain over which one solves needs to expand with time. Since most computational methods do a fixed domain approximation, the stochastic solution obtained will inherently be inaccurate because the states inside the boundary need the cost-to-go values of states outside the domain as the noise intensity increases. Expanding the domain makes the problem more computationally expensive and trajectories will still leave the domain in high noise cases. In contrast, MPC-SH does not face the issue of computational inaccuracy when a trajectory exits the boundary since it can compute a new trajectory from any given state without worrying about the boundary and the boundary conditions as required by HJB-FD.
In particular, this may be construed as the primary computational benefit of using the MPC-SH approach. In the deterministic case, owing to the absence of noise, the control takes the system towards the origin and not outside the domain. So, the deterministic cost-to-go and feedback policy are always accurate. Furthermore, note that when the stochastic HJB solution is accurate, the system does not leave the domain owing to the control dominating the effect of the noise.

V-B Comparison between MPC-SH and Optimal Linear feedback

In this section, we compare the performance of two deterministic feedback laws: the optimal linear feedback and the MPC-SH feedback law. In Remark 3, it was shown that the optimal linear feedback controller given by Eqs. (27)-(29), designed around the optimal open-loop nominal trajectory, is also $O(\epsilon^{4})$ near-optimal for the stochastic system. This design is referred to as the trajectory-optimized perturbation feedback controller (T-PFC) [parunandi2019TPFC]. The difference between T-PFC and MPC-SH is that T-PFC plans the nominal trajectory only once, from the initial state, and uses the linear feedback to correct for errors during its execution, whereas MPC-SH continuously replans the nominal trajectory from the current state and uses the linear feedback only for a short interval $\Delta$ between replans. The advantage of T-PFC is that the open-loop optimization has to be carried out only once (preferably offline), and the precomputed linear feedback gains can be used online to correct for deviations due to uncertainty. In a stochastic setting, the nominal trajectory generated initially remains optimal only if the system stays close to it. If the system deviates, the trajectory has to be replanned from the current state, as done by MPC-SH, to maintain optimal performance. We examine how the performance of T-PFC compares with that of MPC-SH in nonlinear robotics problems, namely the car-like robot and the cart-pole system, for different noise levels in Fig. 3.

In Fig. 3, we see that T-PFC shows comparable performance to MPC-SH for low values of $\epsilon$ . As the noise increases, the trajectory deviates from the nominal computed initially, and the feedback policy is no longer optimal, necessitating a replanned nominal trajectory from the current state. Hence, the performance of T-PFC deteriorates at high noise levels. Nevertheless, T-PFC-like deterministic feedback laws remain valuable in applications that need to minimize onboard computing and operate in low-noise settings.
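
The structural difference between the two laws can be sketched on a toy problem. The code below assumes a hypothetical scalar discrete-time linear-quadratic problem (all parameter values are illustrative and are not the car-like robot or cart-pole models). One controller plans once and applies time-varying linear feedback (the T-PFC pattern); the other re-solves the open-loop problem over the remaining horizon at every step (the shrinking-horizon pattern). In this linear-quadratic special case the two coincide exactly, so the sketch shows only the structure of the two loops. The performance gap reported in Fig. 3 arises from nonlinearity, which this toy problem does not have.

```python
import numpy as np

def open_loop_plan(x0, n, a, b, q, r, qf):
    """Minimize sum_{k<n}(q x_k^2 + r u_k^2) + qf x_n^2 for x_{k+1} = a x_k + b u_k."""
    phi = a ** np.arange(1, n + 1)             # x_{1..n} = phi * x0 + M @ u
    M = np.zeros((n, n))
    for k in range(n):
        for j in range(k + 1):
            M[k, j] = a ** (k - j) * b
    W = np.diag([q] * (n - 1) + [qf])          # weights on x_1, ..., x_n
    H = M.T @ W @ M + r * np.eye(n)
    g = M.T @ W @ phi * x0
    return np.linalg.solve(H, -g)              # optimal open-loop control sequence

def riccati_gains(n, a, b, q, r, qf):
    """Time-varying gains K_k (u_k = -K_k x_k) from the backward Riccati recursion."""
    P = qf
    gains = np.zeros(n)
    for k in reversed(range(n)):
        gains[k] = a * b * P / (r + b * b * P)
        P = q + a * a * P - a * b * P * gains[k]
    return gains

def simulate(eps, a=1.1, b=0.5, q=1.0, r=0.1, qf=5.0, n=20, x0=1.0, seed=0):
    """Run both controllers on the same noise; return the largest control mismatch."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    K = riccati_gains(n, a, b, q, r, qf)
    x_fb, x_mpc, max_gap = x0, x0, 0.0
    for k in range(n):
        u_fb = -K[k] * x_fb                                       # plan once + feedback
        u_mpc = open_loop_plan(x_mpc, n - k, a, b, q, r, qf)[0]   # replan, horizon n - k
        max_gap = max(max_gap, abs(u_fb - u_mpc))
        x_fb = a * x_fb + b * u_fb + eps * w[k]
        x_mpc = a * x_mpc + b * u_mpc + eps * w[k]
    return max_gap

gap = simulate(eps=0.3)
```

For nonlinear dynamics the call to `open_loop_plan` would be replaced by a nonlinear open-loop solver, and the two control sequences would no longer agree once the state drifts away from the initial nominal trajectory.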

V-C Discussion

The primary takeaway from Sections V-A and V-B is that deterministic policies are not only near-optimal but also accurate, scalable, and repeatable. As shown in Sec. V-A, the stochastic policy cannot be computed accurately. Note that this inaccuracy is not a limitation of the finite difference method used. Galerkin finite element methods and collocation methods, such as those based on Chebyshev polynomials, are also solved on a bounded domain and, consequently, are not immune to the errors observed in FD. As discussed in Sec. IV, random sampling-based methods like approximate dynamic programming and reinforcement learning depend on their samples to explore the domain and inherently have the same issue in the stochastic case. In high-dimensional problems, one needs a prohibitively large number of samples to explore the domain. An inefficient sampling of the domain leads to inaccurate policies, as the cost-to-go is not accurately captured by the samples. Due to this issue, there is an inherent variance in the solution obtained by such methods [RL_conv]. We have carried out an exhaustive investigation comparing the deterministic feedback approach with RL methods in the companion paper [wang2022search], where we report the accuracy, scalability, efficiency, and repeatability of the deterministic policy that the stochastic RL methods lack. To summarize, as shown in Fig. 2, the regime where the stochastic solutions can be computed accurately is that of low noise, where the deterministic solution gives near-identical performance; consequently, in practice, the deterministic feedback is sufficient.

VI Conclusion

In this paper, we have considered the problem of stochastic nonlinear control. We have shown that recursively solving the deterministic optimal control problem from the current state, à la MPC, results in a policy that is near-optimal to fourth order in a small noise parameter. In practice, empirical evidence shows that the MPC law performs better than the law obtained by computationally solving the stochastic DP problem, owing to the perturbation structure of the deterministic optimal control problem. An important current limitation is the requirement that the nominal trajectory be smooth enough for suitable Taylor expansions to exist. This requirement breaks down when trajectories are non-smooth, as in hybrid systems like legged robots, or when maneuvers have kinks, as for car-like robots in a tight parking application. It remains to be seen if, and how, one may extend the result to such applications that are piecewise smooth in the dynamics. Also, a careful investigation into the relative merits and demerits of the shrinking horizon approach to MPC compared to the traditional fixed horizon approach is required, as is the generalization to the more practical and important partially observed problem.

Appendix A DETAILED PROOFS OF RESULTS

A-A Proof of Lemma 1

Proof:

We proceed by induction. The first general instance of the recursion occurs at $k=3$ . It can be shown that:

	$\displaystyle\delta x_{3}$	$\displaystyle=\underbrace{(\bar{A}_{2}\bar{A}_{1}(\epsilon w_{0}\sqrt{\Delta t% })+\bar{A}_{2}(\epsilon w_{1}\sqrt{\Delta t})+\epsilon w_{2}\sqrt{\Delta t})}_% {\delta x_{3}^{l}}+e_{3},$
	$\displaystyle e_{3}$	$\displaystyle=\bar{A}_{2}\bar{S}_{1}(\epsilon w_{0}\sqrt{\Delta t})+\bar{S}_{2% }(\bar{A}_{1}(\epsilon w_{0}\sqrt{\Delta t})+\epsilon w_{1}\sqrt{\Delta t}+$
		$\displaystyle\bar{S}_{1}(\epsilon w_{0}\sqrt{\Delta t})).$

Noting that $\bar{S}_{1}(.)$ and $\bar{S}_{2}(.)$ are second and higher order terms, it follows that $e_{3}$ is $O(\epsilon^{2})$ .
Suppose now that $\delta x_{k}=\delta x_{k}^{l}+e_{k}$ where $e_{k}$ is $O(\epsilon^{2})$ . Then: $\delta x_{k+1}=\bar{A}_{k}(\delta x_{k}^{l}+e_{k})+\epsilon w_{k}\sqrt{\Delta t}+\bar{S}_{k}(\delta x_{k})=\underbrace{(\bar{A}_{k}\delta x_{k}^{l}+\epsilon w_{k}\sqrt{\Delta t})}_{\delta x_{k+1}^{l}}+\underbrace{\{\bar{A}_{k}e_{k}+\bar{S}_{k}(\delta x_{k})\}}_{e_{k+1}}.$ Noting that $\bar{S}_{k}(\delta x_{k})$ is $O(\epsilon^{2})$ and that $e_{k}$ is $O(\epsilon^{2})$ by assumption, it follows that $e_{k+1}$ is $O(\epsilon^{2})$ .
Now, let us take a closer look at the term $e_{k}$ and again proceed by induction. It is clear that $e_{1}=e_{1}^{(2)}=0$ . Next, it can be seen that $e_{2}=\bar{A}_{1}e_{1}^{(2)}+{\delta x_{1}^{l}}^{\mathsf{T}}\bar{S}_{1}^{(2)}% \delta x_{1}^{l}+O(\epsilon^{3})=(\epsilon^{2}\Delta t){w_{0}}^{\mathsf{T}}% \bar{S}_{1}^{(2)}w_{0}+O(\epsilon^{3})$ , which shows the recursion is valid for $k=2$ given it is so for $k=1$ .
Suppose that it is true for $k$ . Then: $\delta x_{k+1}=\bar{A}_{k}\delta x_{k}+\bar{S}_{k}(\delta x_{k})+\epsilon w_{k% }\sqrt{\Delta t}=\bar{A}_{k}(\delta x_{k}^{l}+e_{k})+\bar{S}_{k}(\delta x_{k}^% {l}+e_{k})+\epsilon w_{k}\sqrt{\Delta t}=\underbrace{(\bar{A}_{k}\delta x_{k}^% {l}+\epsilon w_{k}\sqrt{\Delta t})}_{\delta x_{k+1}^{l}}+\underbrace{\bar{A}_{% k}e_{k}^{(2)}+{\delta x_{k}^{l}}^{\mathsf{T}}\bar{S}_{k}^{(2)}\delta x_{k}^{l}% }_{e_{k+1}^{(2)}}+O(\epsilon^{3}),$ where the last line follows because $e_{k}=e_{k}^{(2)}+O(\epsilon^{3})$ , and $\bar{S}_{k}^{(2)}$ is the second order term of $\bar{S}_{k}(.)$ . This completes the induction and the proof. $\hfill\blacksquare$ ∎
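
The $O(\epsilon^{2})$ scaling of the error term can be checked numerically. The sketch below assumes a hypothetical scalar instance of the recursion with constant $\bar{A}_{k}=a$ and a purely quadratic $\bar{S}_{k}(\delta x)=s\,\delta x^{2}$ (the values of `a`, `s`, `dt`, and `n` are illustrative). It propagates the full and the linearized perturbations under the same noise sample and compares the error $e_{n}$ at two noise levels.

```python
import numpy as np

def perturbation_error(eps, a=0.9, s=0.5, dt=0.01, n=50, seed=1):
    """Return e_n = dx_n - dx_n^l for dx_{k+1} = a dx_k + s dx_k^2 + eps w_k sqrt(dt)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)            # same noise sample for both recursions
    dx, dx_lin = 0.0, 0.0
    for k in range(n):
        noise = eps * w[k] * np.sqrt(dt)
        # full recursion (with second-order term) and its linearization
        dx, dx_lin = a * dx + s * dx**2 + noise, a * dx_lin + noise
    return dx - dx_lin

# Doubling eps should roughly quadruple the error if e_n is O(eps^2).
ratio = perturbation_error(0.01) / perturbation_error(0.005)
```

The ratio is close to 4, consistent with $e_{k}=e_{k}^{(2)}+O(\epsilon^{3})$ ; the small departure from 4 comes from the third-order remainder.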

A-B Proof of Lemma 2

Proof:

We have that: $\mathcal{J}^{\pi}=\sum_{k}\bar{c}_{k}+\sum_{k}\bar{C}_{k}(\delta x_{k}^{l}+e_{k})+\sum_{k}\bar{H}_{k}(\delta x_{k}^{l}+e_{k})=\sum_{k}\bar{c}_{k}+\sum_{k}\bar{C}_{k}\delta x_{k}^{l}+\sum_{k}\left({\delta x_{k}^{l}}^{\mathsf{T}}\bar{H}_{k}^{(2)}\delta x_{k}^{l}+\bar{C}_{k}e_{k}^{(2)}\right)+O(\epsilon^{3})$ , where the last equality follows from an application of Lemma 1. $\hfill\blacksquare$ ∎

A-C Proof of Proposition 1

In order to prove this result, we first need the following preparatory result. Consider the following deterministic continuous time system:

	$\displaystyle J^{\pi}(0,x_{0})$	$\displaystyle=\int_{0}^{T}\underbrace{c(x_{t},\pi_{t}(x_{t}))}_{\bar{c}(t,x_{t})}dt+c_{T}(x_{T}),$
	$\displaystyle\dot{x}$	$\displaystyle=\underbrace{f(x)+g(x)\pi_{t}(x)}_{\bar{f}(t,x)}+\epsilon v,$

where $v(t)$ is a given continuous time input. We rewrite the above policy evaluation equation in state-space form as follows: $\dot{x}=\bar{f}(t,x)+\epsilon v,\;\dot{R}=\bar{c}(t,x),\;\dot{t}=1,\;Z(t)=R(t)+c_{T}(x),$ where the above equations can now be expressed in a time-invariant state space form as: $\dot{X}=F(X)+\epsilon Gv$ , and $Z(t)=H(X(t))$ , where $X=[x,R,t]^{\prime}$ , $F=[\bar{f}(t,x),\bar{c}(t,x),1]^{\prime}$ , $G=[I_{n},0,0]^{\prime}$ and $H(X)=R+c_{T}(x)$ .
Given that the component functions $f(\cdot),~{}g(\cdot),~{}c(\cdot,\cdot),~{}c_{T}(\cdot),~{}\pi_{t}(\cdot)$ are five times continuously differentiable ( $\mathcal{C}^{5}$ ) in their arguments (assumption A2), the output $Z(T)=J^{\pi}(0,x_{0})$ can be expressed in terms of the inputs $v(t)$ as the unique Volterra series (Theorem 2.5 in [Volterra_Krener]) where we have suppressed the dependence on $\pi$ for notational convenience:

	$\displaystyle Z(T)=J^{(0)}(x_{0})+\epsilon\int_{0}^{T}J^{(1)}(T,s)v(s)ds$
	$\displaystyle+\epsilon^{2}\int_{0}^{T}\int_{0}^{s_{1}}J^{(2)}(T,s_{1},s_{2})v(% s_{1})v(s_{2})ds_{2}ds_{1}$
	$\displaystyle+\epsilon^{3}\int_{0}^{T}\int_{0}^{s_{1}}\int_{0}^{s_{2}}J^{(3)}(% T,s_{1},s_{2},s_{3})v(s_{3})v(s_{2})v(s_{1})ds_{3}ds_{2}ds_{1}$
	$\displaystyle+\epsilon^{4}\int_{0}^{T}\int_{0}^{s_{1}}\int_{0}^{s_{2}}\int_{0}^{s_{3}}J^{(4)}(T,s_{1},s_{2},s_{3},s_{4})[v(s_{4})v(s_{3})$
	$\displaystyle\quad\quad v(s_{2})v(s_{1})]ds_{4}ds_{3}ds_{2}ds_{1}+\mathcal{G},$		(40)

where the Volterra kernels $J^{(k)}(.)$ are unique and continuous in their arguments, and $\mathcal{G}$ is an $o(\epsilon^{4})$ function.

Proof:

We show the result for a scalar input, the generalization to a vector input is straightforward. We first write the sample path cost in an input-output fashion in the discrete time case. Let $v(t)$ be a given input sequence, and given a discretization time $\Delta t$ such that $N=T/\Delta t$ , let $v_{k}=v(k\Delta t)$ , $k=0,1,2\cdots N-1,$ denote a piecewise constant approximation of the input. Under A2, the cost of any sample path from a given initial state $x_{0}$ can be expanded as follows in discrete time (where we have suppressed the explicit dependence of the different terms on $x_{0}$ for simplifying notation): $\mathcal{V}^{\pi}_{N}=\mathcal{V}^{\pi,0}_{N}+\epsilon\mathcal{V}^{\pi,1}_{N}+% \epsilon^{2}\mathcal{V}^{\pi,2}_{N}+\epsilon^{3}\mathcal{V}^{\pi,3}_{N}+% \epsilon^{4}\mathcal{V}^{\pi,4}_{N}+\mathcal{G}_{N}^{\pi},$ where $\mathcal{V}^{\pi,0}_{N}$ represents the nominal/ zero input cost and

	$\displaystyle\mathcal{V}^{\pi,1}_{N}=\sum_{s=0}^{N-1}\mathcal{J}^{(1)}_{N}(N% \Delta t,s\Delta t)v_{s}{\Delta t},$
	$\displaystyle\mathcal{V}^{\pi,2}_{N}=\sum_{s_{1}=0}^{N-1}\sum_{s_{2}=0}^{s_{1}% }\mathcal{J}^{(2)}_{N}(N\Delta t,s_{1}\Delta t,s_{2}\Delta t)v_{s_{2}}v_{s_{1}% }\Delta t^{2},$
	$\displaystyle\mathcal{V}^{\pi,3}_{N}=\sum_{s_{1}=0}^{N-1}\sum_{s_{2}=0}^{s_{1}% }\sum_{s_{3}=0}^{s_{2}}\mathcal{J}^{(3)}_{N}(N\Delta t,s_{1}\Delta t,s_{2}% \Delta t,s_{3}\Delta t)$
	$\displaystyle\quad\quad\times v_{s_{3}}v_{s_{2}}v_{s_{1}}\Delta t^{3},$
	$\displaystyle\mathcal{V}^{\pi,4}_{N}=\sum_{s_{1}=0}^{N-1}\sum_{s_{2}=0}^{s_{1}% }\sum_{s_{3}=0}^{s_{2}}\sum_{s_{4}=0}^{s_{3}}\mathcal{J}^{(4)}_{N}(N\Delta t,s% _{1}\Delta t,s_{2}\Delta t,s_{3}\Delta t,s_{4}\Delta t)$
	$\displaystyle\quad\quad\times v_{s_{4}}v_{s_{3}}v_{s_{2}}v_{s_{1}}\Delta t^{4},$

where $\mathcal{J}^{(k)}_{N}(\cdot)$ represent the piecewise constant discretized kernels corresponding to the Volterra kernels defined in (40). Further, the remainder function $\mathcal{G}_{N}^{\pi}$ is an $o(\epsilon^{4})$ function.
Let $V^{\pi}(x_{0})$ denote the cost of the trajectory under the continuous time input $v(t)$ . Then it follows that $\mathcal{V}^{\pi}_{N}(x_{0})\rightarrow V^{\pi}(x_{0})$ as $N\rightarrow\infty$ , regardless of the input sequence $v(t)$ . Therefore, it follows that the discretized piecewise constant kernels $\mathcal{J}^{(k)}_{N}\rightarrow J^{(k)}$ in the $L_{1}$ sense as $N\rightarrow\infty$ .
If the inputs were a discretized Wiener sequence $\omega(k\Delta t)=w_{k}\sqrt{\Delta t}$ , where $w_{k}$ is a Gaussian white noise sequence, we can write the cost of a sample path as: $\mathcal{J}^{\pi}_{N}=\mathcal{J}^{\pi,0}_{N}+\epsilon\mathcal{J}^{\pi,1}_{N}+% \epsilon^{2}\mathcal{J}^{\pi,2}_{N}+\epsilon^{3}\mathcal{J}^{\pi,3}_{N}+% \epsilon^{4}\mathcal{J}^{\pi,4}_{N}+\mathcal{R}_{N}^{\pi}$ , where $\mathcal{J}^{\pi,0}_{N}$ is the zero noise cost and

	$\displaystyle\mathcal{J}^{\pi,1}_{N}=\sum_{s=0}^{N-1}\mathcal{J}^{(1)}_{N}(N% \Delta t,s\Delta t)w_{s}{\sqrt{\Delta t}},$
	$\displaystyle\mathcal{J}^{\pi,2}_{N}=\sum_{s_{1}=0}^{N-1}\sum_{s_{2}=0}^{s_{1}% }\mathcal{J}^{(2)}_{N}(N\Delta t,s_{1}\Delta t,s_{2}\Delta t)w_{s_{2}}w_{s_{1}% }\Delta t,$
	$\displaystyle\mathcal{J}^{\pi,3}_{N}=\sum_{s_{1}=0}^{N-1}\sum_{s_{2}=0}^{s_{1}% }\sum_{s_{3}=0}^{s_{2}}\Big{(}\mathcal{J}^{(3)}_{N}(N\Delta t,s_{1}\Delta t,s_% {2}\Delta t,s_{3}\Delta t)$
	$\displaystyle\quad\quad w_{s_{3}}w_{s_{2}}w_{s_{1}}(\Delta t)^{3/2}\Big{)},$
	$\displaystyle\mathcal{J}^{\pi,4}_{N}=\sum_{s_{1}=0}^{N-1}\sum_{s_{2}=0}^{s_{1}% }\sum_{s_{3}=0}^{s_{2}}\sum_{s_{4}=0}^{s_{3}}\Big{(}\mathcal{J}^{(4)}_{N}(N% \Delta t,s_{1}\Delta t,s_{2}\Delta t,s_{3}\Delta t,$
	$\displaystyle\quad\quad s_{4}\Delta t)w_{s_{4}}w_{s_{3}}w_{s_{2}}w_{s_{1}}(% \Delta t)^{2}\Big{)},$

Moreover, due to the whiteness of the noise sequence $\{w_{k}\}$ , it follows that $E[\mathcal{J}^{\pi,1}_{N}]=0$ , and $E[\mathcal{J}^{\pi,3}_{N}]=0$ , since these terms consist of products of an odd number of zero-mean Gaussian noise values, while $E[\mathcal{J}^{\pi,2}_{N}],E[\mathcal{J}^{\pi,4}_{N}]$ are both finite owing to the finiteness of the moments of the noise values. Next, taking the limit of the terms above as $N\rightarrow\infty$ , we obtain:

	$\displaystyle\lim_{N\rightarrow\infty}E[\mathcal{J}^{\pi,2}_{N}]=\int_{0}^{T}J% ^{(2)}(T,t,t)dt\equiv J^{\pi,1}<\infty,$
	$\displaystyle\lim_{N\rightarrow\infty}E[\mathcal{J}^{\pi,4}_{N}]=\int_{0}^{T}% \int_{0}^{t}J^{(4)}(T,t,t,\tau,\tau)d\tau dt\equiv J^{\pi,2}<\infty,$

where the first equality above follows from the convergence of the discretized kernels $\mathcal{J}^{(k)}_{N}\rightarrow J^{(k)}$ for $k=2,4$ , while the integrals are finite owing to the continuity of the functions $J^{(2)}$ and $J^{(4)}$ as established in (40). Further $\lim_{\epsilon\rightarrow 0}\epsilon^{-4}\lim_{N\rightarrow\infty}E[\mathcal{R% }_{N}^{\pi}]=\lim_{N\rightarrow\infty}E[\lim_{\epsilon\rightarrow 0}\epsilon^{% -4}\mathcal{R}_{N}^{\pi}]=0$ , i.e., $\lim_{N\rightarrow\infty}E[\mathcal{R}_{N}^{\pi}]$ is $o(\epsilon^{4})$ . Therefore, taking expectations on both sides, we obtain: $\lim_{N\rightarrow\infty}E[\mathcal{J}^{\pi}_{N}]={J}^{\pi,0}+\epsilon^{2}{J}^% {\pi,1}+\epsilon^{4}{J}^{\pi,2}+o(\epsilon^{4}),$ where $J^{\pi,0}=\lim_{N\rightarrow\infty}\mathcal{J}^{\pi,0}_{N}$ , which proves the first part of the result.
Next, from Lemma 2, as we take the limit $\Delta t\rightarrow 0$ , it is clear that ${J}^{\pi,0}$ stems solely from the continuous-time nominal trajectory, and that ${J}^{\pi,1}$ is dependent on the continuous-time nominal and the linear closed-loop feedback. Therefore, the result follows. $\hfill\blacksquare$ ∎
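
The vanishing of the odd-order terms in expectation can be seen in a one-step toy case. The sketch below assumes a hypothetical quartic cost $(x_{0}+\epsilon w)^{4}$ with a single standard Gaussian $w$ ; it is not the sample-path cost of the proof. The expectation is evaluated by Gauss-Hermite quadrature and compared with the closed form $x_{0}^{4}+6x_{0}^{2}\epsilon^{2}+3\epsilon^{4}$ , which contains only even powers of $\epsilon$ because $E[w]=E[w^{3}]=0$ , $E[w^{2}]=1$ , and $E[w^{4}]=3$ .

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def expected_cost(x0, eps, n_nodes=8):
    """E[(x0 + eps*w)^4] for w ~ N(0, 1), computed by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)     # weight function exp(-x^2 / 2)
    weights = weights / weights.sum()        # normalize to a probability measure
    return float(np.sum(weights * (x0 + eps * nodes) ** 4))

x0, eps = 1.5, 0.3                           # illustrative values
closed_form = x0**4 + 6 * x0**2 * eps**2 + 3 * eps**4
numeric = expected_cost(x0, eps)
```

The $\epsilon$ and $\epsilon^{3}$ contributions are absent from the expected cost, mirroring $E[\mathcal{J}^{\pi,1}_{N}]=E[\mathcal{J}^{\pi,3}_{N}]=0$ .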

A-D Proof of Proposition 2

Proof:

Using Proposition 1, we know that any cost function, and hence, the optimal cost-to-go function $J(t,x)$ can be expanded as:

J=J^{0}+\epsilon^{2}J^{1}+\epsilon^{4}J^{2}+\cdots.

(41)

Consider the HJB in Eq. (4) and substitute the minimizing control $u=-{R}^{\mathsf{-1}}{\mathcal{G}(x)}^{\mathsf{T}}J^{x}$ (Eq. (5)). This gives the PDE

	$\displaystyle-\frac{\partial J}{\partial t}=$	$\displaystyle\bar{l}+\frac{1}{2}{(J^{x})}^{\mathsf{T}}\bar{\mathcal{G}}{R}^{% \mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}J^{x}+{(J^{x})}^{\mathsf{T}}(\bar{% \mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}% }J^{x})$
		$\displaystyle+\frac{\epsilon^{2}}{2}tr(J^{xx}),$		(42)

with terminal condition $J(T,x)=c_{T}(x).$ Here, $\bar{l}=l(x)$ , $\bar{\mathcal{F}}=\mathcal{F}(x)$ , $\bar{\mathcal{G}}=\mathcal{G}(x)$ and $tr(\cdot)$ is the trace operator. Substituting Eq. (41) into Eq. (42) we obtain that:

	$\displaystyle(-\frac{\partial J^{0}}{\partial t}-\epsilon^{2}\frac{\partial J^% {1}}{\partial t}-\epsilon^{4}\frac{\partial J^{2}}{\partial t}+\cdots)=\bar{l}+$
	$\displaystyle\frac{1}{2}{(J^{0,x}+\epsilon^{2}J^{1,x}+\cdots)}^{\mathsf{T}}% \bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}(J^{0,x}+% \epsilon^{2}J^{1,x}+\cdots)$
	$\displaystyle+{(J^{0,x}+\epsilon^{2}J^{1,x}+\cdots)}^{\mathsf{T}}\Big{(}\bar{% \mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}% }(J^{0,x}+$
	$\displaystyle\epsilon^{2}J^{1,x}+\cdots)\Big{)}+\frac{\epsilon^{2}}{2}tr(J^{0,% xx}+\epsilon^{2}J^{1,xx}+\cdots).$		(43)

Now, we equate the $\epsilon^{0}$ , $\epsilon^{2}$ terms on both sides to obtain perturbation equations for the cost functions $J^{0},J^{1},J^{2}\cdots$ .
First, let us consider the $\epsilon^{0}$ term. Utilizing Eq. (43) above, we obtain:

	$\displaystyle-\frac{\partial J^{0}}{\partial t}$	$\displaystyle=\bar{l}+\frac{1}{2}{(J^{0,x})}^{\mathsf{T}}\bar{\mathcal{G}}{R}^% {\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}(J^{0,x})$
		$\displaystyle+{(J^{0,x})}^{\mathsf{T}}\underbrace{(\bar{\mathcal{F}}-\bar{% \mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}J^{0,x})}_{\bar{f% }^{0}},$		(44)

with the terminal condition $J^{0}(T,x)=c_{T}(x)$ .
Similarly, one can obtain the $J^{1}$ equations by equating the $O(\epsilon^{2})$ terms in Eq. (43), which after regrouping and cancelling some of the terms yields:

\displaystyle-\frac{\partial J^{1}}{\partial t}={(J^{1,x})}^{\mathsf{T}}% \underbrace{(\bar{\mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{% \mathcal{G}}}^{\mathsf{T}}J^{0,x})}_{=\bar{f}^{0}}+\frac{1}{2}tr(J^{0,xx}),

(45)

with terminal boundary condition $J^{1}(T,x)=0$ . Note the perturbation structure of Eqs. (44) and (45): $J^{0}(t,x)$ can be solved without knowledge of $J^{1}(t,x),J^{2}(t,x)$ etc., while $J^{1}(t,x)$ requires knowledge only of $J^{0}(t,x)$ , and so on. In other words, the equations can be solved sequentially rather than simultaneously.

Now, let us consider the deterministic HJB equation in Eq. (6). Recall, $\phi(t,x)$ represents the optimal cost-to-go of the deterministic problem, and $u^{d}=-{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x}$ is the deterministic policy, analogous to the stochastic case. Substituting $u^{d}$ in Eq. (6) gives

-\frac{\partial\phi}{\partial t}=\bar{l}+\frac{1}{2}{(\phi^{x})}^{\mathsf{T}}% \bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x}+{(% \phi^{x})}^{\mathsf{T}}(\bar{\mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{% \bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x}),

(46)

with terminal condition $\phi(T,x)=c_{T}(x).$

Next, let $\varphi(t,x)$ denote the cost-to-go of the deterministic policy $u^{d}(\cdot)$ when applied to the stochastic system, i.e., Eq. (1) with $\epsilon>0$ . Then, the cost-to-go of the deterministic policy, when applied to the stochastic system, satisfies:

	$\displaystyle-\frac{\partial\varphi}{\partial t}=$	$\displaystyle\bar{l}+\frac{1}{2}{(\phi^{x})}^{\mathsf{T}}\bar{\mathcal{G}}{R}^% {\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x}+{(\varphi^{x})}^{\mathsf% {T}}(\bar{\mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{% \mathsf{T}}\phi^{x})$
		$\displaystyle+\frac{\epsilon^{2}}{2}tr(\varphi^{xx}),$		(47)

with terminal condition $\varphi(T,x)=c_{T}(x)$ . From Proposition 1, we know $\varphi=\varphi^{0}+\epsilon^{2}\varphi^{1}+\epsilon^{4}\varphi^{2}+\cdots$ . Substituting this in Eq. (47) gives

	$\displaystyle-\frac{\partial\varphi^{0}}{\partial t}-\epsilon^{2}\frac{% \partial\varphi^{1}}{\partial t}-\epsilon^{4}\frac{\partial\varphi^{2}}{% \partial t}+\cdots=\bar{l}+\frac{1}{2}{(\phi^{x})}^{\mathsf{T}}\bar{\mathcal{G% }}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x}$
	$\displaystyle+{(\varphi^{0,x}+\epsilon^{2}\varphi^{1,x}+\cdots)}^{\mathsf{T}}% \Big{(}\bar{\mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}% ^{\mathsf{T}}\phi^{x}\Big{)}$
	$\displaystyle+\frac{\epsilon^{2}}{2}tr(\varphi^{0,xx}+\epsilon^{2}\varphi^{1,% xx}+\cdots).$		(48)

As before, if we gather the terms for $\epsilon^{0}$ , $\epsilon^{2}$ , etc., on both sides of the above equation, we shall get the equations governing $\varphi^{0},\varphi^{1}$ , etc. First, looking at the $\epsilon^{0}$ term in Eq. (48), we obtain:

-\frac{\partial\varphi^{0}}{\partial t}=\bar{l}+\frac{1}{2}{(\phi^{x})}^{% \mathsf{T}}\bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}% \phi^{x}+{(\varphi^{0,x})}^{\mathsf{T}}(\bar{\mathcal{F}}-\bar{\mathcal{G}}{R}% ^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x}),

(49)

with the terminal condition $\varphi^{0}(T,x)=c_{T}(x)$ .

Comparing Eqs. (49) and (46), it follows that $\phi(t,x)=\varphi^{0}(t,x)$ for all $(t,x)$ . Further, comparing them to Eq. (44), it follows that $\varphi^{0}(t,x)=J^{0}(t,x)$ , for all $(t,x)$ . Also, note that the closed-loop system above satisfies $\bar{\mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{\bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x}=\bar{f}^{0}$ (see Eqs. (44) and (45)).

Next, consider the $\epsilon^{2}$ terms in Eq. (48). We obtain:

\displaystyle-\frac{\partial\varphi^{1}}{\partial t}={(\varphi^{1,x})}^{% \mathsf{T}}\underbrace{(\bar{\mathcal{F}}-\bar{\mathcal{G}}{R}^{\mathsf{-1}}{% \bar{\mathcal{G}}}^{\mathsf{T}}\phi^{x})}_{\bar{f}^{0}}+\frac{1}{2}tr(\varphi^% {0,xx}),

(50)

with terminal condition $\varphi^{1}(T,x)=0$ . Again, comparing Eq. (50) to Eq. (45), and noting that $\varphi^{0}=J^{0}$ , it follows that $\varphi^{1}(t,x)=J^{1}(t,x)$ , for all $(t,x)$ . This completes the proof of the result. $\hfill\blacksquare$
∎

The result above has used the fact that the noise sequence $w_{t}$ is white. However, this is not necessary to show that $J^{0}(t,x)=\varphi^{0}(t,x)$ for all $(t,x)$ .
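
As a sanity check of the perturbation structure, consider a scalar linear-quadratic instance. This is an illustrative assumption, not a case treated above: $\mathcal{F}(x)=ax$ , $\mathcal{G}(x)=b$ , $l(x)=\frac{1}{2}mx^{2}$ , $c_{T}(x)=\frac{1}{2}m_{T}x^{2}$ , and scalar $R=r$ , where $a$ , $b$ , $m$ , $m_{T}$ , $r$ are hypothetical constants. Substituting the ansatz below into Eq. (42) and matching powers of $\epsilon$ gives:

```latex
\begin{align*}
J(t,x) &= \tfrac{1}{2}P(t)\,x^{2} + \epsilon^{2}s(t), \\
J^{0}(t,x) &= \tfrac{1}{2}P(t)\,x^{2}, \qquad
  -\dot{P} = m + 2aP - \tfrac{b^{2}}{r}P^{2}, \quad P(T) = m_{T}, \\
J^{1}(t,x) &= s(t), \qquad
  -\dot{s} = \tfrac{1}{2}\,tr(J^{0,xx}) = \tfrac{1}{2}P, \quad s(T) = 0 .
\end{align*}
```

Here the $\epsilon^{0}$ equation, Eq. (44), reduces to the Riccati equation for $P$ and does not involve $s$ , while the $\epsilon^{2}$ equation, Eq. (45), is driven only by $J^{0}$ through $P$ (the transport term vanishes since $J^{1,x}=0$ ). In this special case the series terminates, $J^{2}=0$ , and the deterministic policy $u^{d}=-\frac{b}{r}P(t)x$ coincides with the stochastic optimal policy, so $\varphi^{0}=J^{0}$ and $\varphi^{1}=J^{1}$ hold trivially.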

A-E Proof of Proposition 3

Proof:

Let the system model be given as $\dot{x}=\mathcal{F}(x)+\mathcal{G}(x)u$ where, the system matrices, its Jacobians, and Hessians are defined as in Definition 1.
Using indicial notation, the Lagrange-Charpit equations are (the subscript $t$ is ignored for the sake of simplicity):

	$\displaystyle\dot{x}_{i}$	$\displaystyle=f_{i}(x)-\Gamma_{i}^{j}{R}^{\mathsf{-1}}_{jm}\Gamma_{m}^{n}q_{n},$		(51)
	$\displaystyle\dot{q}_{i}$	$\displaystyle=-L_{i}^{x}-f_{ij}^{x}q_{j}+q_{n}\Gamma_{m}^{n}{R}^{\mathsf{-1}}_% {lm}\Gamma^{l,x}_{ik}q_{k}.$		(52)

Performing a perturbation expansion of $\dot{x}$ around a nominal trajectory $\bar{x}$ gives

\delta\dot{x}_{i}=(f_{ij}^{x}-\Gamma_{ij}^{k,x}{R}^{\mathsf{-1}}_{km}\Gamma_{m}^{n}q_{n}-\Gamma_{k}^{i}{R}^{\mathsf{-1}}_{km}\Gamma_{nj}^{m,x}q_{n})\delta x_{j}+\frac{1}{2}(f_{ijk}^{xx}-{R}^{\mathsf{-1}}_{lm}\Gamma_{m}^{n}q_{n}\Gamma_{ikj}^{l,xx}-\Gamma_{ik}^{l,x}{R}^{\mathsf{-1}}_{lm}\Gamma_{nj}^{m,x}q_{n}-\Gamma_{i}^{l}{R}^{\mathsf{-1}}_{lm}\Gamma_{nkj}^{m,xx}q_{n})\delta x_{k}\delta x_{j}-\Gamma_{i}^{j}{R}^{\mathsf{-1}}_{jm}\Gamma_{m}^{n}\delta q_{n}+\tilde{H}(\delta x^{3})+\tilde{S}(\delta q^{2}).

(53)

Expanding the co-states about the nominal gives

	$\displaystyle q_{i}$	$\displaystyle=G_{i}+P_{ij}\delta x_{j}+H(\delta x^{2}),$		(54)
	$\displaystyle\delta q_{i}$	$\displaystyle=P_{ij}\delta x_{j}+H(\delta x^{2}).$		(55)

Substituting Eq. (54) and (55) in Eq. (53), we get

	$\displaystyle\delta\dot{x}_{i}$	$\displaystyle=(\bar{f}_{ij}^{x}-\bar{\Gamma}_{ij}^{k,x}{R}^{\mathsf{-1}}_{km}% \bar{\Gamma}_{m}^{n}q_{n}-\bar{\Gamma}_{k}^{i}{R}^{\mathsf{-1}}_{km}\bar{% \Gamma}_{nj}^{m,x}q_{n}$
		$\displaystyle-\bar{\Gamma}_{i}^{l}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{m}^{n}P_% {nj})\delta x_{j}+H.O.T.$		(56)

Let $\mathcal{M}_{ij}=\bar{f}_{ij}^{x}-\bar{\Gamma}_{ij}^{k,x}{R}^{\mathsf{-1}}_{km% }\bar{\Gamma}_{m}^{n}q_{n}-\bar{\Gamma}_{k}^{i}{R}^{\mathsf{-1}}_{km}\bar{% \Gamma}_{nj}^{m,x}q_{n}-\bar{\Gamma}_{i}^{l}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}% _{m}^{n}P_{nj}.$ Differentiating Eq. (54) and using Eq. (56), we get

	$\displaystyle\dot{q}_{i}$	$\displaystyle=\dot{G}_{i}+P_{ij}\delta\dot{x}_{j}+\dot{P}_{ij}\delta x_{j}+\cdots,$		(57)
	$\displaystyle\dot{q}_{i}$	$\displaystyle=\dot{G}_{i}+P_{ij}(\mathcal{M}_{jk}\delta x_{k}+\cdots)+\dot{P}_% {ij}\delta x_{j}+\cdots.$		(58)

Expanding Eq. (52) up to first order about the nominal trajectory and substituting Eq. (54),

$\displaystyle\dot{q}_{i}$	$\displaystyle=-(\bar{L}_{i}^{x}+L_{ij}^{xx}\delta{x}_{j}+\cdots)-\bar{f}_{ij}^{x}(G_{j}+P_{jk}\delta x_{k}+\cdots)$
	$\displaystyle-\delta x_{m}\bar{f}_{ijm}^{xx}(G_{j}+P_{jk}\delta x_{k}+\cdots)+(G_{n}+P_{nk}\delta x_{k}+\cdots)$
	$\displaystyle\times(\bar{\Gamma}_{m}^{n}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{ip}^{l,x}+\bar{\Gamma}_{mj}^{n,x}\delta x_{j}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{ip}^{l,x}+\bar{\Gamma}_{m}^{n}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{ipj}^{l,xx}\delta x_{j}+\cdots)$
	$\displaystyle\times(G_{p}+P_{pr}\delta x_{r}+\cdots).$	(59)

Comparing the terms up to 1st order in $\delta x$ in Eq. (58) and Eq. (59) with appropriate change in indices, we get

	$\displaystyle\dot{G}_{i}=-\bar{L}_{i}^{x}-\bar{f}_{ij}^{x}G_{j}+G_{n}\bar{% \Gamma}_{m}^{n}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{ip}^{l,x}G_{p},$		(60)
	$\displaystyle\dot{P}_{ij}=-P_{ik}(\bar{f}_{kj}^{x}-\bar{\Gamma}_{kj}^{l,x}{R}^% {\mathsf{-1}}_{lm}\bar{\Gamma}_{m}^{n}G_{n})-(\bar{f}_{ik}^{x}-G_{n}\bar{% \Gamma}_{m}^{n}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{ik}^{l,x})P_{kj}$
	$\displaystyle-L_{ij}^{xx}-(\bar{f}_{ipj}^{xx}-G_{n}\bar{\Gamma}_{m}^{n}{R}^{% \mathsf{-1}}_{lm}\bar{\Gamma}_{ipj}^{l,xx})G_{p}+P_{ik}\bar{\Gamma}_{l}^{k}{R}% ^{\mathsf{-1}}_{lm}\bar{\Gamma}_{nj}^{m,x}G_{n}$
	$\displaystyle+P_{ik}\bar{\Gamma}_{k}^{l}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{m}% ^{n}P_{nj}+P_{nj}\bar{\Gamma}_{m}^{n}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{ip}^{% l,x}G_{p}$
	$\displaystyle+G_{n}\bar{\Gamma}_{mj}^{n,x}{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{% ip}^{l,x}G_{p}.$		(61)

Substituting $\bar{u}_{l}=-{R}^{\mathsf{-1}}_{lm}\bar{\Gamma}_{m}^{n}G_{n}$ and changing indices to group terms, Eq. (60) can be written as $\dot{G}_{i}=-\bar{L}_{i}^{x}-(\bar{f}_{ij}^{x}+\bar{u}_{l}\bar{\Gamma}_{ij}^{l% ,x})G_{j},$ whose vector form is Eq. (27). Similarly, $\bar{u}_{l}$ can be substituted in Eq. (61) and can be written as

$\displaystyle\dot{P}_{ij}$	$\displaystyle=-P_{ik}(\bar{f}_{kj}^{x}+\bar{\Gamma}_{kj}^{l,x}\bar{u}_{l})-(% \bar{f}_{ik}^{x}+\bar{u}_{l}\bar{\Gamma}_{ik}^{l,x})P_{kj}-L_{ij}^{xx}$
	$\displaystyle-(\bar{f}_{ipj}^{xx}-\bar{u}_{l}\bar{\Gamma}_{ipj}^{l,xx})G_{p}+K% _{li}R_{lm}K_{mj},$	(62)
$\displaystyle\text{where, }K_{ij}$	$\displaystyle=-{R}^{\mathsf{-1}}_{im}(\bar{\Gamma}_{k}^{m}P_{kj}+\bar{\Gamma}_% {kj}^{m,x}G_{k}).$	(63)

Eq. (28) and Eq. (29) are the vector form of Eq. (62) and Eq. (63) respectively. $\hfill\blacksquare$ ∎
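
As a consistency check, Eqs. (62) and (63) can be specialized to the linear-quadratic case. The following assumes, for illustration only, $\mathcal{F}(x)=Ax$ , a constant input matrix $\mathcal{G}(x)=B$ , and $l(x)=\frac{1}{2}{x}^{\mathsf{T}}Qx$ , so that all derivatives of $\Gamma$ and all second derivatives of $f$ vanish. The Jacobian indices are read so that the two linear terms in Eq. (62) are transposes of each other, which is an interpretation of the index notation rather than something stated above.

```latex
\begin{align*}
K &= -R^{-1}B^{\mathsf{T}}P, \\
\dot{P} &= -PA - A^{\mathsf{T}}P - Q + K^{\mathsf{T}}RK
         = -PA - A^{\mathsf{T}}P - Q + PBR^{-1}B^{\mathsf{T}}P .
\end{align*}
```

Under these assumptions the gain and the second-order equation reduce to the familiar finite-horizon LQR gain and differential Riccati equation, which suggests that Eqs. (62) and (63) generalize that pair to a nonlinear nominal trajectory.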

A-F Proof of Proposition 4

Proof:

We show the scalar case, the vector case is a straightforward extension. We need the function $F(t,x,p,q,J)=p+l-\frac{1}{2}\frac{g^{2}}{r}q^{2}+fq$ to be $\mathcal{C}^{2}$ in all its arguments for unique characteristic curves, i.e., characteristic curves that do not intersect, since then the functions $F_{x}$ and $F_{q}$ are $\mathcal{C}^{1}$ , and therefore Lipschitz continuous. From the existence and uniqueness results of ODEs [nonlinear_systems_khalil, Ch. 3.1], it follows that the Lagrange-Charpit characteristic ODEs $\dot{x}=F_{q}$ , $\dot{q}=-F_{x}-qF_{J}$ , are Lipschitz continuous in their right hand side functions, and therefore, have unique solutions in the interval $[0,T]$ . Moreover, the state $x$ and co-state $q=J^{x}$ vary continuously with respect to the terminal condition $x_{T}$ at any time $t$ . Let us denote $q_{T}=c_{T}^{x}(x_{T})\equiv\phi_{T}(x_{T})$ . Thus, $q_{T}$ is a function of $x_{T}$ , i.e., $q_{T}$ is uniquely determined by the value of $x_{T}$ .
Next, we show that under the Lagrange-Charpit equations, the function $\phi_{T}(x_{T})$ remains a function, i.e., we can write $q_{t}=\phi_{t}(x_{t})$ , for some suitable smooth function $\phi_{t}(.)$ , for any $t\in[0,T]$ . In order to show this, suppose that this is not the case for some $t$ . Then, it is necessary that there exist $x_{t}^{*}$ such that $\frac{dq_{t}}{dx_{t}}|_{x_{t}^{*}}=\pm\infty$ , or equivalently that $\frac{dx_{t}}{dq_{t}}|_{x_{t}^{*}}=0$ (see Fig. 4). This in turn implies that there exists a terminal condition $x_{T}^{*}$ such that $\frac{dx_{t}}{dx_{T}}|_{x_{T}^{*}}=0$ , where the terminal state $x_{T}^{*}$ maps to the state $x_{t}^{*}$ under the Lagrange-Charpit equations. We will now show that this is not feasible. Owing to the uniqueness of the solutions of the Lagrange-Charpit equations, the Jacobian determinant satisfies $\begin{vmatrix}\frac{\partial x_{t}}{\partial x_{T}}&\frac{\partial x_{t}}{\partial q_{T}}\\ \frac{\partial q_{t}}{\partial x_{T}}&\frac{\partial q_{t}}{\partial q_{T}}\end{vmatrix}_{(x_{T},q_{T})}\neq 0$ , for any $(x_{T},q_{T})$ . Thus, for $q_{T}=\phi_{T}(x_{T})$ , substituting into the above equation implies that:

\frac{\partial x_{t}}{\partial x_{T}}-\phi_{T}^{\prime}(x_{T})\frac{\partial x% _{t}}{\partial q_{T}}\neq 0,

(64)

for any terminal state $x_{T}$ , where $\phi_{T}^{\prime}(.)$ represents the derivative of the function. Consider now the state $x_{T}^{*}$ , owing to the fact that $\frac{dx_{t}}{dx_{T}}|_{x_{T}^{*}}=0$ , we obtain that:

\frac{dx_{t}}{dx_{T}}=\frac{\partial x_{t}}{\partial x_{T}}\frac{dx_{T}}{dx_{T% }}+\frac{\partial x_{t}}{\partial q_{T}}\frac{dq_{T}}{dx_{T}}=\frac{\partial x% _{t}}{\partial x_{T}}+\phi_{T}^{\prime}(x_{T}^{*})\frac{\partial x_{t}}{% \partial q_{T}}=0,

(65)

where the partial derivatives are taken at $x_{T}^{*}$ . The above implies that $\frac{dq_{T}}{dx_{T}}=-\phi_{T}^{\prime}(x_{T}^{*})$ , however, by definition: $\frac{dq_{T}}{dx_{T}}=\phi_{T}^{\prime}(x_{T}^{*})$ , which means that $\phi_{T}^{\prime}(x_{T}^{*})=0$ . Owing to Eq. (64), this means that $\frac{\partial x_{t}}{\partial x_{T}}|_{x_{T}^{*}}\neq 0$ . However, using the second equality in Eq. (65), this implies that $\frac{dx_{t}}{dx_{T}}|_{x_{T}^{*}}\neq 0$ , which contradicts the assumption that $\frac{dx_{t}}{dx_{T}}|_{x_{T}^{*}}=0$ . Thus, it follows that $q_{t}=\phi_{t}(x_{t})$ , for some smooth function $\phi_{t}(\cdot)$ , for any $t\in[0,T]$ .

Next, note that if a characteristic curve flows through the initial state $x_{0}$ , then it means that we have found a terminal state $x_{T}$ , along with the terminal co-state $q_{T}=c_{T}^{x}(x_{T})$ , that satisfies the Lagrange-Charpit equations. However, this is, by definition, a solution that is found by satisfying the Minimum Principle. Therefore, owing to the development above, the co-state $q_{0}=\phi_{0}(x_{0})$ is uniquely determined by the initial state $x_{0}$ , and a solution that satisfies the minimum principle is necessarily unique. Moreover, since this solution is the unique characteristic curve of the HJB flowing through $x_{0}$ , it is also the global optimum.
The arguments made above can be generalized to the vector case where the function $F(t,x,J,p,q)$ in the vector case is defined as $F(t,x,J,p,q)=p+l-\frac{1}{2}{q}^{\mathsf{T}}\mathcal{G}(x){R}^{\mathsf{-1}}{% \mathcal{G}(x)}^{\mathsf{T}}q+{q}^{\mathsf{T}}\mathcal{F}(x),$ and the equivalent Lagrange-Charpit characteristic ODEs are: $\dot{x}_{i}=f_{i}(x)-\Gamma_{i}^{j}{R}^{\mathsf{-1}}_{jm}\Gamma_{m}^{n}q_{n},% \ \dot{q}_{i}=-L_{i}^{x}-f_{ij}^{x}q_{j}+q_{n}\Gamma_{m}^{n}{R}^{\mathsf{-1}}_% {lm}\Gamma^{l,x}_{ik}q_{k}.$ $\hfill\blacksquare$

∎
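
The statement that the co-state remains a function of the state along the characteristics can be illustrated numerically. The sketch below assumes a hypothetical scalar linear-quadratic instance, $f=Ax$ , $g=B$ , $l=\frac{1}{2}M_{W}x^{2}$ , $c_{T}=\frac{1}{2}M_{T}x^{2}$ (all constants are illustrative), for which the Lagrange-Charpit ODEs are linear. It integrates the characteristics backward from several terminal states with $q_{T}=c_{T}^{x}(x_{T})$ and checks that $q_{0}/x_{0}$ is the same for all of them and equals the Riccati solution $P(0)$ , i.e., $q_{0}=\phi_{0}(x_{0})=P(0)x_{0}$ .

```python
import numpy as np

A, B, M_W, R_W, M_T, T = 0.5, 1.0, 1.0, 0.5, 2.0, 1.0   # hypothetical scalar LQ data

def char_rhs(z):
    """Lagrange-Charpit ODEs for z = (x, q): dx/dt = A x - B^2 q / R, dq/dt = -M x - A q."""
    x, q = z
    return np.array([A * x - B**2 / R_W * q, -M_W * x - A * q])

def riccati_rhs(p):
    """dP/dt implied by q = P(t) x along the characteristics."""
    return np.array([-M_W - 2 * A * p[0] + B**2 / R_W * p[0] ** 2])

def rk4_backward(rhs, z_T, n_steps=2000):
    """Integrate dz/dt = rhs(z) from t = T back to t = 0 with classical RK4."""
    h = -T / n_steps
    z = np.asarray(z_T, dtype=float)
    for _ in range(n_steps):
        k1 = rhs(z)
        k2 = rhs(z + 0.5 * h * k1)
        k3 = rhs(z + 0.5 * h * k2)
        k4 = rhs(z + h * k3)
        z = z + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

p0 = rk4_backward(riccati_rhs, [M_T])[0]                 # P(0) from the Riccati ODE
slopes = []
for x_T in (-1.0, 0.5, 2.0):
    x0, q0 = rk4_backward(char_rhs, [x_T, M_T * x_T])    # terminal co-state q_T = M_T x_T
    slopes.append(q0 / x0)
```

In this linear case the map $x_{0}\mapsto q_{0}$ is a single line through the origin. For nonlinear problems the same backward integration traces out the curve $\phi_{0}(\cdot)$ point by point.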

Appendix B Acknowledgment

This work was supported by the NSF under grants ECCS-1637889, CDSE 1802867, and the AFOSR DDIP program under grant FA9550-17-1-0068. The simulations were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.

\printbibliography

Author biography:

Mohamed Naveed Gul Mohamed is pursuing his Ph.D. in Aerospace Engineering at Texas A&M University, College Station. He holds a bachelor’s degree in Instrumentation and Control Engineering from NIT Trichy, India. His research interests are on optimal control of nonlinear dynamical systems, focusing on overcoming challenges such as stochasticity, partial observation, unknown models, and stability concerns.

Suman Chakravorty is a Professor of Aerospace Engineering at Texas A&M University. He holds a Ph.D. in Aerospace Engineering from the University of Michigan, Ann Arbor. His research interests broadly lie in Estimation and Stochastic Optimal Control Theory with application to Robotic Control and Situational Awareness Problems.

Raman Goyal earned his Ph.D. in Aerospace Engineering from Texas A&M University, College Station and his B.Tech. degree in Mechanical Engineering from IIT Roorkee, India. Raman is interested in intelligent learning approaches for optimal control of stochastic nonlinear systems. He has also worked on modeling, design, control, and security of various cyber-physical systems.

Ran Wang received his Ph.D. in Aerospace Engineering from Texas A&M University, College Station and his bachelor’s degree in Mechanical Engineering from Huazhong University of Science and Technology. Ran’s research interests include optimal control and reinforcement learning of soft-body robots.