Learning to Boost the Performance
of Stable Nonlinear Systems

Luca Furieri, Clara Lucía Galimberti, and Giancarlo Ferrari-Trecate

L. Furieri, C. L. Galimberti, and G. Ferrari-Trecate are with the Institute of Mechanical Engineering, EPFL, Switzerland. E-mail addresses: {luca.furieri, clara.galimberti, giancarlo.ferraritrecate}@epfl.ch. Research supported by the Swiss National Science Foundation (SNSF) under the NCCR Automation (grant agreement 51NF40_80545). Luca Furieri is also grateful to the SNSF for the Ambizione grant PZ00P2_208951.

Abstract

The growing scale and complexity of safety-critical control systems underscore the need to evolve current control architectures aiming for the unparalleled performance achievable through state-of-the-art optimization and machine learning algorithms. However, maintaining closed-loop stability while boosting the performance of nonlinear control systems using data-driven and deep-learning approaches stands as an important unsolved challenge. In this paper, we tackle the performance-boosting problem with closed-loop stability guarantees. Specifically, we establish a synergy between the Internal Model Control (IMC) principle for nonlinear systems and state-of-the-art unconstrained optimization approaches for learning stable dynamics. Our methods enable learning over arbitrarily deep neural network classes of performance-boosting controllers for stable nonlinear systems; crucially, we guarantee $\mathcal{L}_{p}$ closed-loop stability even if optimization is halted prematurely, and even when the ground-truth dynamics are unknown, with vanishing conservatism in the class of stabilizing policies as the model uncertainty is reduced to zero. We discuss the implementation details of the proposed control schemes, including distributed ones, along with the corresponding optimization procedures, demonstrating the potential of freely shaping the cost functions through several numerical experiments.

Index Terms:

Optimal control, Closed-loop stability, Learning for control, Internal model control, Uncertain systems, Distributed control

I Introduction

The success of control systems across a broad spectrum of applications — from manufacturing to water, power, and transportation networks [1] — is rooted not only in advancements in sensing, computation, and communication but also in the growing availability of methods for designing model-based controllers capable of stabilizing nonlinear systems at nominal operating conditions.

However, in many applications, merely stabilizing the closed-loop system is not sufficient; achieving satisfactory performance is also crucial, often necessitating the integration of additional control loops. In Nonlinear Optimal Control (NOC), performance requirements are typically encoded in the shape of the cost function that the control policy strives to minimize. Consequently, it is beneficial to develop NOC algorithms that accommodate general nonlinear costs to enable sophisticated closed-loop behaviors, such as collision avoidance or waypoint tracking in swarms of robots.

In this paper, we tackle the following performance-boosting problem: given a discrete-time nonlinear system that is stable or has been pre-stabilized using a base controller, how can we enhance its performance during the transient — that is, before the system settles into a steady state — by employing general cost functions without compromising stability?

A first approach to designing performance-boosting regulators involves resorting to NOC methods with stability guarantees. Despite extensive research in this area [2], the NOC problem is fully understood only when the system dynamics are linear and the cost admits a convex reformulation. For nonlinear systems, traditional methods for addressing NOC include dynamic programming and the maximum principle [3, 4]. However, the computation of NOC policies through these methods often faces significant computational challenges [4]. Furthermore, to ensure stability, stringent limitations must be imposed on the class of costs that can be utilized. An alternative approach to tackling performance-boosting is offered by receding-horizon control schemes, such as Nonlinear Model Predictive Control (NMPC) [5]. These controllers are based on real-time optimization; a finite-horizon NOC problem is solved at each time instant to determine the control input. However, a significant limitation of NMPC is that the control policy can seldom be precomputed and stored in an explicit form, which makes NMPC inapplicable when the control platform lacks the computational resources necessary to solve mathematical programs in real-time. Moreover, similar to NOC, ensuring stability requires imposing strong limitations on the class of admissible cost functions [5].

More recently, Reinforcement Learning (RL) and Deep Neural Networks (DNNs) have emerged as powerful tools that enable agents to understand and optimally interact with complex environments and dynamical systems, e.g., [6, 7]. Many RL approaches are based on minimizing arbitrary cost functions, calling for the use of broad sets of candidate nonlinear control policies. To this end, RL methods often employ families of policies that incorporate deep Neural Networks (NNs), due to their ability to model rich classes of nonlinear functions. These capabilities have led to remarkable applications, such as four-legged robots navigating challenging terrains [8] and drones that can outperform humans in races [9, 10]. On the other hand, general methodologies for designing RL policies for nonlinear dynamical systems, while ensuring closed-loop stability, are currently scarce and may be limited by strong assumptions [11, 12, 13]. As a result, so far the applicability of RL approaches has been mainly limited to systems that are not safety-critical.

Independent of their application in RL, NNs have been employed in model-based control since the 1990s for approximating nonlinear receding horizon policies [14, 15] or synthesizing nonlinear regulators from scratch [16]. Recent results on the design of provably stabilizing DNN control policies fall into two categories. The first one comprises constrained optimization approaches [11, 17, 18] that ensure global or local stability by enforcing Lyapunov-like inequalities during optimization. However, conservative stability constraints can severely restrict the range of admissible policies or fail to produce a viable controller even when it exists. Additionally, enforcing constraints such as linear matrix inequalities becomes a computational bottleneck in large-scale applications.

The second category embraces unconstrained optimization approaches, aiming to define classes of control policies with built-in stability guarantees [19, 20, 21]. These methods, which are similar to those developed in this paper, allow unconstrained optimization over finitely many parameters — using, for instance, standard gradient descent techniques — without sacrificing stability, regardless of the chosen parameter values. Optimizing over sets of stabilizing policies has two main benefits. First, it completely decouples the stabilization problem from the choice of the cost being optimized. Second, it enables fail-safe design, that is, the ability to guarantee closed-loop stability even if the policy optimization ends at a local minimum or is prematurely halted. However, these approaches are limited to discrete-time linear systems [19, 20] or to continuous-time systems in the port-Hamiltonian form [21]. While recent work surpasses the limitations above [22, 23], in real-world applications, the knowledge about the system model is not perfect. The impact of modeling errors on the parametrizations of stable closed-loop maps for nonlinear systems has remained largely unexplored.

I-A Contributions

This paper explores approaches to solve performance-boosting problems in general discrete-time, time-varying systems. Specifically, we develop unconstrained optimization approaches based on classes of state-feedback policies that induce closed-loop dynamics described by stable and arbitrarily deep NNs.

After formally stating the performance-boosting problem in Section II, we present our first contribution, which provides a complete characterization of the class of stability-preserving controllers for stable systems. This result is presented in Section III and reveals that an Internal Model Control (IMC) structure [24, 25, 26] allows characterizing, without conservatism, the class of all stability-preserving controllers, where the only free parameter is an $\mathcal{L}_{p}$ operator. Our results hinge on adapting nonlinear variants of the Youla parametrization [27, 28] to discrete-time systems. Further, we examine the relationship with the recently proposed nonlinear System Level Synthesis (SLS) framework developed in [29]. In Section IV, our main contribution is that the proposed approach is compatible with scenarios where only an approximate system description is available, such as models identified from data or derived from simplified physical principles. Specifically, under a finite gain assumption on the model mismatch, stability can always be preserved by embedding a nominal system model and optimizing over nonlinear controllers with a sufficiently reduced gain on the free $\mathcal{L}_{p}$ parameter. Importantly, the method ensures vanishing conservatism as the model uncertainty approaches zero. Additionally, by considering networks of interconnected subsystems, we demonstrate how the IMC structure of our controllers naturally lends itself to the development of distributed policies where the communication topology mirrors the subsystem couplings.

Finally, Section V bridges the gap between theoretical developments and computations, showing how to use Recurrent Equilibrium Networks (RENs) [30, 31] to obtain a finite-dimensional parametrization of performance-boosting controllers that can include DNNs. The final part of the paper in Section VI presents several simulations by considering coordination problems for mobile robots. Specifically, we show how, similarly to RL, the freedom in specifying the optimization cost allows designing NN controllers that can boost various forms of performance and safety, reaching beyond classical optimal control objectives consisting of the sum of stage-costs over time [3].

I-B Notation

Signals and operators: The set of all sequences $\mathbf{x}=(x_{0},x_{1},x_{2},\ldots)$ , where $x_{t}\in\mathbb{R}^{n}$ , $t\in\mathbb{N}$ , is denoted as $\ell^{n}$ . Moreover, $\mathbf{x}$ belongs to $\ell_{p}^{n}\subset\ell^{n}$ with $p\in\mathbb{N}\cup\{\infty\}$ if $\left\lVert\mathbf{x}\right\rVert_{p}=\left(\sum_{t=0}^{\infty}|x_{t}|^{p}\right)^{\frac{1}{p}}<\infty$ , where $|\cdot|$ denotes any vector norm. We say that $\mathbf{x}\in\ell^{n}_{\infty}$ if $\operatorname{sup}_{t}|x_{t}|<\infty$ . When clear from the context, we omit the superscript $n$ from $\ell^{n}$ and $\ell^{n}_{p}$ . An operator $\mathbf{A}$ is said to be $\ell_{p}$ -stable if it is causal and $\mathbf{A}(\mathbf{w})\in\ell_{p}^{m}$ for all $\mathbf{w}\in\ell_{p}^{n}$ . Equivalently, we write $\mathbf{A}\in\mathcal{L}_{p}$ . (We also say that the operator is stable, for short, when the value of $p$ is clear from the context.) We say that an $\mathcal{L}_{p}$ operator $\mathbf{A}:\mathbf{w}\mapsto\mathbf{u}$ has finite $\mathcal{L}_{p}$ -gain $\gamma(\mathbf{A})>0$ if $\|\mathbf{u}\|_{p}\leq\gamma(\mathbf{A})\|\mathbf{w}\|_{p}$ , for all $\mathbf{w}\in\ell_{p}^{n}$ .

Time-series: We use the notation $x_{j:i}$ to refer to the truncation of $\mathbf{x}$ to the finite-dimensional vector $(x_{i},x_{i+1},\ldots,x_{j})$ . An operator $\mathbf{A}:\ell^{n}\rightarrow\ell^{m}$ is said to be causal if $\mathbf{A}(\mathbf{x})=(A_{0}(x_{0}),A_{1}(x_{1:0}),\ldots,A_{t}(x_{t:0}),\ldots)$ . If in addition $A_{t}(x_{t:0})=A_{t}(x_{t-1:0},0)$ , then $\mathbf{A}$ is said to be strictly causal. Similarly, we define $A_{j:i}(x_{j:0})=(A_{i}(x_{i:0}),A_{i+1}(x_{i+1:0}),\ldots,A_{j}(x_{j:0}))$ . For a matrix $M\in\mathbb{R}^{m\times n}$ , $M\mathbf{x}=(Mx_{0},Mx_{1},\ldots)\in\ell^{m}$ .

Graph theory: Given an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ described by the set of nodes $\mathcal{V}=\{1,\ldots,N\}$ and the set of edges $\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$ , we denote the set of neighbors of node $i$ , including $i$ itself, by $\mathcal{N}_{i}=\{i\}\cup\{j\ |\ \{i,j\}\in\mathcal{E}\}\subseteq\mathcal{V}$ . We denote with $col_{j\in\mathcal{V}}(v^{[j]})$ a vector which consists of the stacked subvectors $v^{[j]}$ from $j=1$ to $j=N$ and with $v^{[\mathcal{N}_{i}]}$ a vector composed of the stacked subvectors $v^{[j]}$ of all neighbors of node $i$ , i.e., $v^{[\mathcal{N}_{i}]}=col_{j\in\mathcal{N}_{i}}(v^{[j]})$ . For a signal $\mathbf{x}\in\ell^{n}$ , where $x_{t}=col_{i\in\mathcal{V}}(x^{[i]}_{t})$ , $x_{t}^{[i]}\in\mathbb{R}^{n_{i}}$ , and $n=\sum_{i=1}^{N}n_{i}$ , we denote with $\mathbf{x}^{[i]}\in\ell^{n_{i}}$ the sequence $\mathbf{x}^{[i]}=(x_{0}^{[i]},x_{1}^{[i]},\ldots)$ . Similarly, we define the sequence $\mathbf{x}^{[\mathcal{N}_{i}]}=(x_{0}^{[\mathcal{N}_{i}]},x_{1}^{[\mathcal{N}_{i}]},\ldots)$ .

II The Performance-boosting Problem

We consider nonlinear discrete-time time-varying systems

x_{t}=f_{t}(x_{t-1:0},u_{t-1:0})+w_{t}\,,~{}~{}~{}t=1,2,\ldots\,,

(1)

where $x_{t}\in\mathbb{R}^{n}$ is the state vector, $u_{t}\in\mathbb{R}^{m}$ is the control input, $w_{t}\in\mathbb{R}^{n}$ stands for unknown process noise with $w_{0}=x_{0}$ , and $f_{0}=0$ . The system model (1) is very general. For instance, it can describe the dynamics of the error between the state of a nonlinear system and a reference trajectory in $\ell_{p}$ . In operator form, system (1) is equivalent to

\mathbf{x}=\mathbf{F}(\mathbf{x},\mathbf{u})+\mathbf{w}\,,

(2)

where $\mathbf{F}:\ell^{n}\times\ell^{m}\rightarrow\ell^{n}$ is the strictly causal operator such that $\mathbf{F}(\mathbf{x},\mathbf{u})=(0,f_{1}(x_{0},u_{0}),\ldots,f_{t}(x_{t-1:0},u_{t-1:0}),\ldots)$ . Note that $\mathbf{w}=(x_{0},w_{1},\ldots)$ and $\mathbf{u}$ collect all data needed for defining the system evolution over an infinite horizon. As an example, when the system (1) takes the Linear Time Invariant (LTI) form

x_{t}=Ax_{t-1}+Bu_{t-1}+w_{t}\,,

(3)

the model (2) becomes

\begin{bmatrix}x_{0}\\ x_{1}\\ x_{2}\\ \vdots\end{bmatrix}=\begin{bmatrix}0&0&0&\cdots\\ A&0&0&\cdots\\ 0&A&0&\cdots\\ \vdots&\vdots&\vdots&\ddots\end{bmatrix}\begin{bmatrix}x_{0}\\ x_{1}\\ x_{2}\\ \vdots\end{bmatrix}+\begin{bmatrix}0&0&0&\cdots\\ B&0&0&\cdots\\ 0&B&0&\cdots\\ \vdots&\vdots&\vdots&\ddots\end{bmatrix}\begin{bmatrix}u_{0}\\ u_{1}\\ u_{2}\\ \vdots\end{bmatrix}+\begin{bmatrix}x_{0}\\ w_{1}\\ w_{2}\\ \vdots\end{bmatrix}\,.
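As a sanity check of the equivalence between the recursion (3) and the stacked operator form (2), the following minimal sketch simulates a scalar LTI system over a short horizon and verifies that $\mathbf{x}=\mathbf{F}(\mathbf{x},\mathbf{u})+\mathbf{w}$ holds entry by entry. The numerical values of `A`, `B`, `u`, and `w`, as well as the helper names `simulate_lti` and `apply_F`, are illustrative assumptions and do not come from the paper.

```python
def simulate_lti(A, B, u, w):
    """Run x_t = A x_{t-1} + B u_{t-1} + w_t for scalar A, B, with x_0 = w_0."""
    x = [w[0]]
    for t in range(1, len(w)):
        x.append(A * x[t - 1] + B * u[t - 1] + w[t])
    return x


def apply_F(A, B, x, u):
    """Strictly causal operator F: entry 0 is zero, entry t uses x_{t-1}, u_{t-1} only."""
    return [0.0] + [A * x[t - 1] + B * u[t - 1] for t in range(1, len(x))]


# Illustrative data (assumed values); w[0] plays the role of x_0.
A, B = 0.5, 1.0
u = [1.0, -1.0, 0.5, 0.0]
w = [2.0, 0.1, -0.2, 0.3]

x = simulate_lti(A, B, u, w)
Fx = apply_F(A, B, x, u)

# Largest entrywise violation of x = F(x, u) + w over the simulated horizon.
residual = max(abs(x[t] - (Fx[t] + w[t])) for t in range(len(x)))
print(x, residual)
```

The first row of the block matrices is zero because $f_{0}=0$, which is why `apply_F` returns a leading zero and the initial state enters only through $w_{0}=x_{0}$.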

We consider disturbances with support $\mathcal{W}_{t}\subseteq\mathbb{R}^{n}$ following a random vector distribution $\mathcal{D}_{t}$ , that is, $w_{t}\in\mathcal{W}_{t}$ and $w_{t}\sim\mathcal{D}_{t}$ for every $t=0,1,\ldots$ . In order to control the behavior of system (1), we consider nonlinear, state-feedback, time-varying control policies

\mathbf{u}=\mathbf{K}(\mathbf{x})=(K_{0}(x_{0}),K_{1}(x_{1:0}),\ldots,K_{t}(x_{t:0}),\ldots)\,,

(4)

where $\mathbf{K}:\ell^{n}\ \rightarrow\ell^{m}$ is a causal operator to be designed. Note that the controller $\mathbf{K}$ can be dynamic, as $K_{t}$ can depend on the whole past history of the system state. Since for each $\mathbf{w}\in\ell^{n}$ and $\mathbf{u}\in\ell^{m}$ the system (1) produces a unique state sequence $\mathbf{x}\in\ell^{n}$ , equation (2) defines a unique transition operator

\mathbfcal{F}:(\mathbf{u},\mathbf{w})\mapsto\mathbf{x}\,,

which provides an input-to-state model of system (1). Similarly, for each $\mathbf{w}\in\ell^{n}$ the closed-loop system (1)-(4) produces unique trajectories. Hence, the closed-loop mapping $\mathbf{w}\mapsto(\mathbf{x},\mathbf{u})$ is well-defined. Specifically, for a system $\mathbf{F}$ and a controller $\mathbf{K}$ , we denote the corresponding induced closed-loop operators $\mathbf{w}\mapsto\mathbf{x}$ and $\mathbf{w}\mapsto\mathbf{u}$ as $\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}]$ and $\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}]$ , respectively. Therefore, we have $\mathbf{x}=\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}](\mathbf{w})$ and $\mathbf{u}=\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}](\mathbf{w})$ for all $\mathbf{w}\in\ell^{n}$ .

Definition 1.

The closed-loop system (1)-(4) is $\ell_{p}$ -stable if $\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}]$ and $\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}]$ are in $\mathcal{L}_{p}$ .

Our goal is to synthesize a control policy $\mathbf{K}$ solving the following problem.

Problem 1 (Performance boosting).

Assume that $\mathbfcal{F}$ lies in $\mathcal{L}_{p}$ . Find $\mathbf{K}$ solving the finite-horizon Nonlinear Optimal Control (NOC) problem


\begin{align}
\min_{\mathbf{K}(\cdot)}\quad & \mathbb{E}_{w_{T:0}}\left[L(x_{T:0},u_{T:0})\right] \tag{5a}\\
\operatorname{s.t.}\quad & x_{t}=f_{t}(x_{t-1:0},u_{t-1:0})+w_{t}\,,~~w_{0}=x_{0}\,, \notag\\
& u_{t}=K_{t}(x_{t:0})\,,~~\forall t=0,1,\ldots\,, \notag\\
& (\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}],\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}])\in\mathcal{L}_{p}\,, \tag{5b}
\end{align}

where $L(\cdot)$ defines a loss over realized trajectories $x_{T:0}$ and $u_{T:0}$ , and the expectation $\mathbb{E}_{w_{T:0}}[\cdot]$ removes the effect of disturbances $w_{T:0}$ on the realized values of the loss. (Another common choice is to use $\max_{w_{T:0}\in\mathcal{W}_{T:0}}[\cdot]$ instead of the expectation. Other useful choices include $\operatorname{Var}_{w_{T:0}}[\cdot]$ , $\operatorname{CVAR}_{w_{T:0}}[\cdot]$ , and weighted combinations of all the above. In practice, one can approximate the chosen operator that removes the effect of disturbances from the cost by performing multiple experiments.)

The main feature of (5) is that the cost is optimized over the finite horizon $0,\ldots,T$ , but under the strict requirement that the closed-loop system is stable when it evolves over $0,\ldots,+\infty$ . In other words, the feedback controller must preserve stability of $\mathbfcal{F}$ , and its role is to boost the performance of the system in the transient $0,\ldots,T$ . As will become clear in the sequel, we consider iterative control design algorithms based on gradient descent that are fail-safe, in the sense that they search in sets of controllers that are stability-preserving by design. This guarantees closed-loop stability during the optimization of the policy parameters. Note also that, as is standard in NOC, we do not expect gradient descent to find the globally optimal solution for any initialization; this is generally impossible for problems beyond Linear Quadratic Gaussian (LQG) control, which enjoys convexity of the cost and linearity of the optimal policies [32, 33]. Furthermore, the expected value in (5a) can seldom be computed (for instance, because it is too costly or the distribution $\mathcal{D}$ is unknown) and is approximated by using samples of $w_{T:0}$ . Fail-safe design guarantees that, in spite of all these limitations, closed-loop stability is never lost.
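The sample-based approximation of the expectation in (5a) can be sketched as a plain Monte Carlo average over closed-loop rollouts. The sketch below is only illustrative and rests on several assumptions that are not in the paper: the dynamics are taken to be Markovian, that is, $f_{t}$ depends only on $(x_{t-1},u_{t-1})$, the state is scalar, and the plant `f`, the policy `policy`, the quadratic `loss`, and the Gaussian sampler `sample_w` are hypothetical placeholders.

```python
import random


def rollout(f, policy, w):
    """Closed-loop rollout of x_t = f(x_{t-1}, u_{t-1}) + w_t with x_0 = w[0].

    The policy receives the state history x_{t:0} as a list, so it may be dynamic.
    """
    x = [w[0]]
    u = [policy(x)]
    for t in range(1, len(w)):
        x.append(f(x[t - 1], u[t - 1]) + w[t])
        u.append(policy(x))
    return x, u


def empirical_cost(f, policy, loss, sample_w, n_samples):
    """Sample average standing in for the expectation in (5a)."""
    total = 0.0
    for _ in range(n_samples):
        x, u = rollout(f, policy, sample_w())
        total += loss(x, u)
    return total / n_samples


# Hypothetical ingredients, chosen only to make the sketch runnable.
def f(x, u):
    return 0.5 * x + u


def policy(x_hist):
    return -0.25 * x_hist[-1]


def loss(x, u):
    return sum(xi ** 2 for xi in x) + sum(ui ** 2 for ui in u)


T = 10
rng = random.Random(0)


def sample_w():
    return [rng.gauss(0.0, 1.0) for _ in range(T + 1)]


cost = empirical_cost(f, policy, loss, sample_w, n_samples=200)
print(cost)
```

Replacing the average with a maximum or an empirical variance over the sampled losses would give sample-based counterparts of the alternative operators mentioned above.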

III Unconstrained Parametrization of all Stability-preserving Controllers

As a preliminary step towards fail-safe design for stable systems, we show how to parametrize all stability-preserving policies by using an IMC control architecture [24, 25], depending on an operator $\mathbfcal{M}$ that can be freely chosen in $\mathcal{L}_{p}$ . Specifically, the block diagram of the proposed control architecture is represented in Figure 1 and it includes a copy of the system dynamics, which is used for computing the estimate $\widehat{\mathbf{w}}$ of the disturbance $\mathbf{w}$ .

Figure 1: IMC architecture parametrizing all stabilizing controllers in terms of one freely chosen operator $\mathbfcal{M}\in\mathcal{L}_{p}$ .

We are now in a position to introduce the main result.

Theorem 1.

Assume that the operator $\mathbfcal{F}$ is $\ell_{p}$ -stable, i.e., $\mathbf{x}\in\ell_{p}$ if $(\mathbf{w},\mathbf{u})\in\ell_{p}$ , and consider the evolution of (2) where $\mathbf{u}$ is chosen as

\mathbf{u}=\mathbfcal{M}(\mathbf{x}-\mathbf{F}(\mathbf{x},\mathbf{u}))\,,

(6)

for a causal operator $\mathbfcal{M}:\ell^{n}\rightarrow\ell^{m}$ . Let $\mathbf{K}$ be the operator such that $\mathbf{u}=\mathbf{K}(\mathbf{x})$ is equivalent to (6). (This operator always exists because $\mathbf{F}(\mathbf{x},\mathbf{u})$ is strictly causal. Hence $u_{t}$ depends on the inputs $u_{t-1:0}$ and can be computed recursively from past inputs and $x_{t:0}$ ; see formula (11).) The following two statements hold true.

1.

If $\mathbfcal{M}\in\mathcal{L}_{p}$ , then the closed-loop system is $\ell_{p}$ -stable.

2.

If there is a causal policy $\mathbf{C}$ such that $\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{C}],~\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{C}]\in\mathcal{L}_{p}$ , then

\mathbfcal{M}=\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{C}]\,,

(7)

gives $\mathbf{K}=\mathbf{C}$ .

Proof.

We prove $1)$ . For compactness, define $\widehat{\mathbf{w}}=\mathbf{x}-\mathbf{F}(\mathbf{x},\mathbf{u})$ . As highlighted in [25], since there is no model mismatch between the plant $\mathbfcal{F}$ and the model $\mathbf{F}$ used to define $\widehat{\mathbf{w}}$ , one has $\widehat{\mathbf{w}}=\mathbf{w}$ , hence opening the loop. More specifically, from Figure 1 and Equation (2) one has

\widehat{\mathbf{w}}=-\mathbf{F}(\mathbf{x},\mathbf{u})+\mathbf{F}(\mathbf{x},\mathbf{u})+\mathbf{w}=\mathbf{w}\,.

(8)

Therefore, by definition of the closed-loop maps, one has $\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}]=\mathbfcal{M}$ and $\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}](\mathbf{w})=\mathbf{F}(\mathbf{x},\mathbfcal{M}(\mathbf{w}))+\mathbf{w}$ , $\forall\mathbf{w}\in\ell_{p}$ . When $\mathbf{w}\in\ell_{p}$ , one has $\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}](\mathbf{w})\in\ell_{p}$ because $\mathbfcal{M}\in\mathcal{L}_{p}$ . Moreover, $\mathbfcal{M}\in\mathcal{L}_{p}$ and $\mathbfcal{F}\in\mathcal{L}_{p}$ imply that the operator $\mathbf{w}\mapsto\mathbf{x}$ defined by the composition of the operators $\mathbf{w}\mapsto(\mathbfcal{M}(\mathbf{w}),\mathbf{w})$ and $\mathbfcal{F}$ is in $\mathcal{L}_{p}$ as well. This is due to the property that the composition of operators in $\mathcal{L}_{p}$ is in $\mathcal{L}_{p}$ .

We prove $2)$ . Set, for short, $\bm{\Psi}^{\mathbf{x}}=\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{C}]$ , $\bm{\Psi}^{\mathbf{u}}=\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{C}]$ , $\bm{\Upsilon}^{\mathbf{x}}=\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}]$ , and $\bm{\Upsilon}^{\mathbf{u}}=\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}]$ . By assumption, one has $\mathbfcal{M}=\bm{\Psi}^{\mathbf{u}}$ and since $\bm{\Psi}^{\mathbf{u}}\in{\mathcal{L}}_{p}$ also $\mathbfcal{M}\in{\mathcal{L}}_{p}$ . By definition, $\bm{\Upsilon}^{\mathbf{u}}$ is the operator $\mathbf{w}\mapsto\mathbf{u}$ and, from (8) and Figure 1, it coincides with $\mathbfcal{M}$ . Hence

\bm{\Psi}^{\mathbf{u}}=\bm{\Upsilon}^{\mathbf{u}}\,.

(9)

It remains to prove that $\bm{\Upsilon}^{\mathbf{x}}=\bm{\Psi}^{\mathbf{x}}$ . Similar to [22], we proceed by induction. First, we show that $\Psi^{x}_{0}=\Upsilon^{x}_{0}$ . Since $f_{0}=0$ and $w_{0}=x_{0}$ , one has from (1) that the closed-loop map $w_{0}\mapsto x_{0}$ is the identity, irrespective of the controller. Therefore $\Upsilon_{0}^{x}=\Psi_{0}^{x}=I$ . Assume now that, for a positive $j\in\mathbb{N}$ , we have $\Upsilon^{x}_{i}=\Psi^{x}_{i}$ for all $0\leq i\leq j$ . Since $(\bm{\Upsilon}^{\mathbf{x}},\bm{\Upsilon}^{\mathbf{u}})$ and $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})$ are closed-loop maps, from (2) they verify

\Upsilon^{x}_{j+1}=F_{j+1}(\Upsilon^{x}_{j:0},\Upsilon^{u}_{j:0})+I\,,\quad\Psi^{x}_{j+1}=F_{j+1}(\Psi^{x}_{j:0},\Psi^{u}_{j:0})+I\,.

(10)

But, from (9), one has $\Psi^{u}_{j:0}=\Upsilon^{u}_{j:0}$ and, by using the inductive assumption, one obtains $\Upsilon^{x}_{j+1}=\Psi^{x}_{j+1}$ . This implies $\mathbf{K=C}$ .

∎

Several comments are in order. First, Theorem 1 is about nominal stability only as there is no model mismatch between the plant model and the one used in the controller. We analyze robust stability in Section IV. Second, it is well known that many IMC architectures are sufficient for preserving stability, both in the linear [24] and the nonlinear [25] case. (Note, however, that IMC in [25] is developed in terms of continuous-time nonlinear input-output models, for which the effect of process noise is difficult to analyze. Moreover, the control objective is to track a reference signal to the plant output, which raises the problem of approximating inverses of nonlinear operators. In our work, we use instead discrete-time input-to-state models and analyze the closed-loop maps from process noise to control inputs and system states. Moreover, our goal is to solve optimal control rather than tracking problems.) It is also known that, in the LTI setting, IMC is necessary for preserving stability [34] and provides an alternative to the Youla-Kučera parametrization [35]. In this respect, Theorem 1 shows that the IMC structure is also necessary for preserving stability of nonlinear systems. This result is perhaps not surprising given that necessary and sufficient conditions for stabilizing wide classes of input-output nonlinear models, in the spirit of the Youla-Kučera parametrization, have been derived since the 1980s [27]. However, these controllers are not conceived in the IMC form.

Following [24, 25], we argue that the IMC structure facilitates the design of performance-boosting policies. Indeed, it is straightforward to deploy controllers using the block-diagram structure shown in Figure 1. In equation form, for a chosen operator $\mathbfcal{M}$ , one simply computes the control input as follows:


	$\displaystyle\widehat{w}_{t}=x_{t}-f_{t}(x_{t-1:0},u_{t-1:0})\,,$		(11a)
	$\displaystyle u_{t}=\mathcal{M}_{t}(\widehat{w}_{t:0})\,.$		(11b)

Moreover, Theorem 1 highlights that it is sufficient to search in the space of operators $\mathbfcal{M}\in{\mathcal{L}}_{p}$ for describing all and only performance-boosting policies. While finding a parametrization of all operators $\mathbfcal{M}\in{\mathcal{L}}_{p}$ might be prohibitive, we will show in Section V that one can use NNs for describing broad subsets of these operators. Moreover, the IMC structure lends itself to the development of policies that enjoy a distributed structure (see Section IV).
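The recursion (11) can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not the implementation used in the paper: the model is taken to be Markovian and scalar, the plant `f` is a hypothetical stable system, and the operator `M` is a hand-picked decaying-memory filter followed by a saturation, standing in for a generic element of $\mathcal{L}_{p}$. The class name `IMCController` is likewise an assumption. With no model mismatch, the reconstructed disturbance $\widehat{w}_{t}$ coincides with the true $w_{t}$, as stated in (8).

```python
import math


class IMCController:
    """Sketch of the IMC policy (11) for a Markovian model f(x_prev, u_prev)."""

    def __init__(self, f, M):
        self.f = f        # nominal model of the plant (copy of the dynamics)
        self.M = M        # free operator, evaluated on the history w_hat_{t:0}
        self.w_hat = []   # reconstructed disturbances w_hat_0, ..., w_hat_t
        self.prev = None  # pair (x_{t-1}, u_{t-1}); None at t = 0

    def step(self, x_t):
        # (11a): w_hat_t = x_t - f(x_{t-1}, u_{t-1}), with w_hat_0 = x_0 since f_0 = 0.
        prediction = 0.0 if self.prev is None else self.f(*self.prev)
        self.w_hat.append(x_t - prediction)
        # (11b): u_t = M_t(w_hat_{t:0}).
        u_t = self.M(self.w_hat)
        self.prev = (x_t, u_t)
        return u_t


def f(x, u):
    # Hypothetical stable plant: |f(x, u)| <= 0.5 |x| + |u|.
    return 0.5 * math.tanh(x) + u


def M(w_hist):
    # Hypothetical operator: exponentially decaying memory, then a saturation.
    n = len(w_hist)
    z = sum(0.5 ** (n - 1 - k) * w_k for k, w_k in enumerate(w_hist))
    return -math.tanh(z)


# Simulate the true plant in closed loop with the IMC controller.
w = [1.0, 0.2, -0.1, 0.0, 0.3]  # w[0] is the initial state x_0
ctrl = IMCController(f, M)
x, u = [], []
for t, w_t in enumerate(w):
    x_t = w_t if t == 0 else f(x[-1], u[-1]) + w_t
    x.append(x_t)
    u.append(ctrl.step(x_t))

# With a perfect model, the controller recovers the disturbance sequence.
mismatch = max(abs(a - b) for a, b in zip(ctrl.w_hat, w))
print(mismatch)
```

Because $\widehat{\mathbf{w}}=\mathbf{w}$, the input sequence produced by the loop equals $\mathbfcal{M}(\mathbf{w})$, so changing `M` reshapes the closed-loop behavior without touching the stability argument of Theorem 1.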

III-1 The case of LTI systems with nonlinear costs

Consider the linear system (3) and let $z$ denote the time-shift operator. When the system is asymptotically stable, the classical Youla parametrization [35] states that all linear state-feedback stabilizing control policies $\mathbf{u}=\mathbf{K}\mathbf{x}$ can be written as

\mathbf{u}=\mathbf{Q}(z)\mathbf{x}-\frac{\mathbf{Q}(z)}{z}\left(A\mathbf{x}+B\mathbf{u}\right)\,,\quad\mathbf{Q}(z)\in\mathcal{TF}_{s}\,,

(12)

where $\mathbf{Q}(z)$ is the so-called Youla parameter. Here, $\mathcal{TF}_{s}$ denotes the set of stable transfer matrices, that is, the set of matrices whose scalar entries are stable transfer functions. The class of linear control policies is globally optimal for standard LQG problems, and it allows optimizing over $\mathbf{Q}\in\mathcal{TF}_{s}$ using simple pole approximations and convex programming; we refer to [36, 37] for state-of-the-art results. However, nonlinear policies can perform significantly better when the controller is distributed [38] or the cost function is nonlinear. As an immediate corollary of Theorem 1, and in accordance with the core contribution of [39], we have the following result for linear systems controlled by nonlinear policies.

Corollary 1.

Consider the linear system (3) and assume that it is asymptotically stable. Then, all and only control policies that make the closed-loop system $\ell_{p}$ -stable are expressed as

\mathbf{u}=\mathbfcal{M}\left(\mathbf{x}-\frac{\left(A\mathbf{x}+B\mathbf{u}\right)}{z}\right)\,,

(13)

where $\mathbfcal{M}\in\mathcal{L}_{p}$ .

Proof.

The proof follows from Theorem 1 upon realizing that the asymptotic stability of system (3) implies that the corresponding operator $\mathbfcal{F}$ is in $\mathcal{L}_{p}$ , for any $p\geq 1$ . ∎

In conclusion, as expected, the linear Youla parametrization (12) is a special case of the proposed parametrization (13) with $\mathbfcal{M}=\mathbf{Q}$ and $\mathbf{Q}\in\mathcal{TF}_{s}$ .

III-2 Relationships with [22] and nonlinear SLS

In [22], we provided a slight generalization of Theorem 1 and the results in Section III-1 by also considering unstable systems $\mathbf{x}=\tilde{\mathbf{F}}(\mathbf{x},\mathbf{u})+\mathbf{w}$ for which a pre-stabilizing controller $\mathbf{K}^{\prime}$ exists, so that the overall policy is

\mathbf{u}=\mathbf{K}^{\prime}(\mathbf{x})+\mathbfcal{M}(\widehat{\mathbf{w}})\,.

(14)

By letting $\mathbf{F}(\mathbf{x},\mathbf{u})=\tilde{\mathbf{F}}(\mathbf{x},\mathbf{K}^{\prime}(\mathbf{x})+\mathbf{u})$ , and assuming that both $\mathbfcal{F}$ and $\mathbf{K}^{\prime}$ lie in $\mathcal{L}_{p}$ , Theorem 1 coincides with Theorem 2 in [22]. However, when $\mathbf{K}^{\prime}\not\in\mathcal{L}_{p}$ , Theorem 2 in [22] highlights that $\mathbfcal{M}\in\mathcal{L}_{p}$ may no longer be a necessary condition for closed-loop $\ell_{p}$ -stability, while still being sufficient.

Moreover, as highlighted in [22], there is a deep link between Theorem 1 and the SLS parametrization of stabilizing controllers [40, 29]. The idea behind the SLS approach [40, 29] is to circumvent the difficulty of characterizing stabilizing controllers, by instead directly designing stable closed-loop maps. Let us define the set of all achievable closed-loop maps for system $\mathbf{F}$ as

\mathcal{CL}[\mathbf{F}]=\{(\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}],\bm{\Phi}^{\mathbf{u}}[\mathbf{F},\mathbf{K}])~|~\mathbf{K}\text{ is causal}\}\,,

(15)

and the set of all achievable and stable closed-loop maps as

\mathcal{CL}_{p}[\mathbf{F}]=\{(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{CL}[\mathbf{F}]~|~(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{L}_{p}\}\,.

(16)

Note that, if $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{CL}_{p}[\mathbf{F}]$ , then $\mathbf{x}=\bm{\Psi}^{\mathbf{x}}(\mathbf{w})\in\ell_{p}^{n}$ and $\mathbf{u}=\bm{\Psi}^{\mathbf{u}}(\mathbf{w})\in\ell_{p}^{m}$ for all $\mathbf{w}\in\ell_{p}^{n}$ . Based on Theorem III.3 of [29], and adding the requirement that the closed-loop maps must belong to $\mathcal{L}_{p}$ , we summarize the main SLS result for nonlinear discrete-time systems.

Theorem 2 (Nonlinear SLS parametrization [29]).

The following two statements hold true.

The set $\mathcal{CL}_{p}[\mathbf{F}]$ of all achievable and stable closed-loop responses admits the following characterization:


\begin{align}
\mathcal{CL}_{p}[\mathbf{F}]=\{ & (\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})~|~(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\text{ are causal}\,, \tag{17a}\\
& \bm{\Psi}^{\mathbf{x}}=\mathbf{F}(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})+\mathbf{I}\,, \tag{17b}\\
& (\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{L}_{p}\}\,. \tag{17c}
\end{align}

For any $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{CL}_{p}[\mathbf{F}]$ , the operator $\bm{\Psi}^{\mathbf{x}}$ is invertible and the causal controller

\mathbf{u}=\mathbf{K}(\mathbf{x})=\bm{\Psi}^{\mathbf{u}}\left((\bm{\Psi}^{\mathbf{x}})^{-1}(\mathbf{x})\right)\,,

(18)

is the only one that achieves the stable closed-loop responses $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})$ .

Theorem 2 clarifies that any policy $\mathbf{K}(\mathbf{x})$ achieving $\ell_{p}$ -stable closed-loop maps can be described in terms of two causal operators $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{L}_{p}$ complying with the nonlinear functional equality (17b). Therefore, the NOC problem admits an equivalent Nonlinear SLS (N-SLS) formulation:

\begin{aligned}\operatorname{N-SLS:}\quad\min_{(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})}\quad&\mathbb{E}_{w_{T:0}}\left[L(x_{T:0},u_{T:0})\right]&&(\star)\\
\operatorname{s.t.}\quad&x_{t}=\Psi^{x}_{t}(w_{t:0})\,,\quad u_{t}=\Psi^{u}_{t}(w_{t:0})\,,\quad t=0,1,\ldots\,,\\
&(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{CL}_{p}[\mathbf{F}]\,.\end{aligned}

According to Theorem 2, the constraint $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{CL}_{p}[\mathbf{F}]$ is equivalent to requiring that $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})$ are causal and verify (17b)-(17c). The constraint (17b) simply defines the operator $\bm{\Psi}^{\mathbf{x}}$ in terms of $\bm{\Psi}^{\mathbf{u}}$ and it can be computed explicitly because $\mathbf{F}$ is strictly causal. The main challenge is to comply with (17c). Indeed, it is hard to generate $\bm{\Psi}^{\mathbf{u}}\in\mathcal{L}_{p}$ such that the corresponding $\bm{\Psi}^{\mathbf{x}}$ satisfies $\bm{\Psi}^{\mathbf{x}}\in\mathcal{L}_{p}$ . The paper [29] suggests directly searching over $\ell_{p}$ -stable operators $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})$ and abandoning the goal of complying with (17b) exactly. One can then study robust stability when (17b) only holds approximately as per Theorem IV.2 in [29]. However, with the exception of polynomial systems [41], this way of proceeding may result in conservative control policies or fail to produce a stabilizing controller. Instead, for the case of stable or pre-stabilized systems, Theorem 1 can be seen as a way of parametrizing all stabilizing controllers that circumvents completely the problem of fulfilling (17b)-(17c).

IV Beyond Closed-loop Stability: Handling Model Uncertainty and Distributed Architectures

This section tackles the performance boosting problem (Problem 1) under more intricate real-world constraints beyond just closed-loop stability. Firstly, Theorem 1 suffers from requiring perfect plant knowledge for controller design. In reality, ensuring closed-loop stability despite an imperfect model is crucial. Secondly, control policies in large-scale applications like power grids and traffic systems are inherently distributed. This means they rely solely on local sensor data and communication, posing significant challenges to achieving network-level robustness and stability.

IV-A Robustness against model-mismatch

Let us denote the nominal model available for design as ${\widehat{\mathbf{F}}}(\mathbf{x},\mathbf{u})$ and the real unknown plant as

\mathbf{F}(\mathbf{x},\mathbf{u})=\widehat{\mathbf{F}}(\mathbf{x},\mathbf{u})+\bm{\Delta}(\mathbf{x},\mathbf{u})\,,

(19)

where $\bm{\Delta}$ is a strictly causal operator representing the model mismatch. Let $\delta_{t}(x_{t-1:0},u_{t-1:0})$ be the time representation of the mismatch operator $\mathbf{\Delta}$ . Since for each sequence of disturbances $\mathbf{w}\in\ell^{n}$ and inputs $\mathbf{u}\in\ell^{m}$ the dynamics represented by (1) with $f_{t}(x_{t-1:0},u_{t-1:0})$ replaced by $\widehat{f}_{t}(x_{t-1:0},u_{t-1:0})+\delta_{t}(x_{t-1:0},u_{t-1:0})$ produces a unique state sequence $\mathbf{x}\in\ell^{n}$ , the equation

\mathbf{x}=\mathbf{F}(\mathbf{x},\mathbf{u})+\mathbf{w}\,,

(20)

defines again a unique transition operator $\mathbfcal{F}:(\mathbf{u},\mathbf{w})\mapsto\mathbf{x}$ , which provides an input-to-state model of the perturbed system.

Here, we show that when $\bm{\Delta}$ can be described by an $\mathcal{L}_{p}$ operator with finite gain, we can always design operators $\mathbfcal{M}$ with sufficiently small $\mathcal{L}_{p}$ -gain that stabilize the real closed-loop system. More specifically, letting $\gamma_{\bm{\Delta}}$ be the maximum $\mathcal{L}_{p}$ gain of the model mismatch $\bm{\Delta}$ , it is possible to design controllers $\mathbf{K}$ that comply with the following robust version of the stability constraint (5b):

\bm{\Phi}^{*}[\widehat{\mathbf{F}}+\bm{\Delta},\mathbf{K}]\in\mathcal{L}_{p}\,,~*\in\{\mathbf{x},\mathbf{u}\}\,,~\forall\bm{\Delta}~|~\gamma(\bm{\Delta})\leq\gamma_{\bm{\Delta}}\,.

(21)

This result, which is given in the next theorem, refers to the control scheme in Figure 2.

Theorem 3.

Assume that the mismatch operator $\bm{\Delta}$ in (19) has finite $\mathcal{L}_{p}$ -gain $\gamma(\mathbf{\Delta})$ . Furthermore, assume that the operator $\mathbfcal{F}$ has finite $\mathcal{L}_{p}$ -gain $\gamma(\mathbfcal{F})$ . Then, for any $\mathbfcal{M}$ such that

\gamma(\mathbfcal{M})<\gamma(\mathbf{\Delta})^{-1}(\gamma(\mathbfcal{F})+1)^{-1}\,,

(22)

the control policy given by


	$\displaystyle\widehat{w}_{t}=x_{t}-\widehat{f}_{t}(x_{t-1:0},u_{t-1:0})\,,$		(23a)
	$\displaystyle u_{t}=\mathcal{M}_{t}(\widehat{w}_{t:0})\,,$		(23b)

stabilizes the closed-loop system.

Proof.

We first show that operators $\mathbf{F}$ and $\mathbfcal{F}$ verify

\mathbf{F}(\mathbfcal{F}(\mathbf{u},\mathbf{w}),\mathbf{u})=\mathbfcal{F}(\mathbf{u},\mathbf{w})-\mathbf{w}\,.

(24)

This follows by substituting $\mathbf{x}=\mathbfcal{F}(\mathbf{u},\mathbf{w})$ in (20). We now compute the $\mathcal{L}_{p}$ gain of the operator $\Sigma_{1}:(\mathbf{u},\mathbf{w})\mapsto\widehat{\mathbf{w}}$ in the right frame of Figure 2:

\begin{aligned}\widehat{\mathbf{w}}&=\mathbfcal{F}(\mathbf{u},\mathbf{w})-\widehat{\mathbf{F}}(\mathbfcal{F}(\mathbf{u},\mathbf{w}),\mathbf{u})\\
&=\mathbf{F}(\mathbfcal{F}(\mathbf{u},\mathbf{w}),\mathbf{u})-\widehat{\mathbf{F}}(\mathbfcal{F}(\mathbf{u},\mathbf{w}),\mathbf{u})+\mathbf{w}\\
&=\mathbf{\Delta}(\mathbfcal{F}(\mathbf{u},\mathbf{w}),\mathbf{u})+\mathbf{w}\,,\qquad\text{(25)}\end{aligned}

where the second line follows from (24) and the third from (19). Using the definition of $\mathcal{L}_{p}$ -gain for the operator $\mathbf{y}=\mathbf{\Delta}(\mathbf{x},\mathbf{u})$ one has $\|\mathbf{y}\|\leq\gamma(\mathbf{\Delta})(\|\mathbf{x}\|+\|\mathbf{u}\|)$ , and, by using (25) and $\mathbf{u}=\mathbfcal{M}(\widehat{\mathbf{w}})$ , one obtains

\begin{aligned}\|\widehat{\mathbf{w}}\|&\leq\gamma(\mathbf{\Delta})(\|\mathbfcal{F}(\mathbf{u},\mathbf{w})\|+\|\mathbf{u}\|)+\|\mathbf{w}\|\\
&\leq\gamma(\mathbf{\Delta})(\gamma(\mathbfcal{F})\|\mathbf{w}\|+\gamma(\mathbfcal{F})\|\mathbf{u}\|+\|\mathbf{u}\|)+\|\mathbf{w}\|\\
&\leq(\gamma(\mathbf{\Delta})\gamma(\mathbfcal{F})+1)\|\mathbf{w}\|+\gamma(\mathbf{\Delta})(\gamma(\mathbfcal{F})+1)\gamma(\mathbfcal{M})\|\widehat{\mathbf{w}}\|\,.\end{aligned}

The relationship above implies that

\|\widehat{\mathbf{w}}\|\leq\left(\frac{\gamma(\bm{\Delta})\gamma(\mathbfcal{F})+1}{1-\gamma(\bm{\Delta})\gamma(\mathbfcal{M})\left(\gamma(\mathbfcal{F})+1\right)}\right)\|\mathbf{w}\|\,.

(26)

Next, we plug the upper bound (26) into the inequality $\|\mathbf{u}\|\leq\gamma(\mathbfcal{M})\|\widehat{\mathbf{w}}\|$ to obtain

\|\mathbf{u}\|\leq\left(\frac{\gamma(\mathbfcal{M})\left(\gamma(\bm{\Delta})\gamma(\mathbfcal{F})+1\right)}{1-\gamma(\bm{\Delta})\gamma(\mathbfcal{M})(\gamma(\mathbfcal{F})+1)}\right)\|\mathbf{w}\|\,,

(27)

and subsequently, we plug (27) into the inequality $\|\mathbf{x}\|\leq\gamma(\mathbfcal{F})(\|\mathbf{u}\|+\|\mathbf{w}\|)$ to obtain

\|\mathbf{x}\|\leq\left(\gamma(\mathbfcal{F})\frac{1+\gamma(\mathbfcal{M})\left(1-\gamma(\bm{\Delta})\right)}{1-\gamma(\bm{\Delta})\gamma(\mathbfcal{M})(\gamma(\mathbfcal{F})+1)}\right)\|\mathbf{w}\|\,.

(28)

The last step is to verify that the maps $\mathbf{w}\rightarrow\mathbf{x}$ and $\mathbf{w}\rightarrow\mathbf{u}$ have a finite $\mathcal{L}_{p}$ -gain. This is done by checking that the bounds in (27) and (28) are finite and nonnegative when the gain of $\mathbfcal{M}$ is sufficiently small. If (22) holds, we have that $\gamma(\bm{\Delta})\gamma(\mathbfcal{M})(\gamma(\mathbfcal{F})+1)<1$ , and hence the denominator in (27) is positive. Since the numerator of (27) is always nonnegative, we conclude that the map $\mathbf{w}\rightarrow\mathbf{u}$ has finite $\mathcal{L}_{p}$ -gain. Similarly for (28), since (22) implies that $\gamma(\mathbfcal{M})\gamma(\bm{\Delta})<1$ , the numerator factor $1+\gamma(\mathbfcal{M})(1-\gamma(\bm{\Delta}))$ is positive, and the denominator is the same as in (27). This implies that the map $\mathbf{w}\rightarrow\mathbf{x}$ has finite $\mathcal{L}_{p}$ -gain, as desired. ∎

The robustness condition (22) highlights a trade-off between ( $i$ ) the degree of tolerable uncertainty in the mismatch between nominal and real dynamics, and ( $ii$ ) the extent of the set of stabilizing control policies that we are permitted to optimize over. Specifically, (22) ensures that, for any model mismatch $\bm{\Delta}\in\mathcal{L}_{p}$ , there always exists a range of admissible gains for $\mathbfcal{M}$ such that the closed-loop is stable. This enables one to freely learn over all appropriately gain-bounded operators. Further note that Theorem 3 is not conservative when $\bm{\Delta}=0$ ; this is unlike the classical application of the small-gain theorem [42], which would enforce that $\gamma(\mathbf{K})<(\gamma(\mathbfcal{F}))^{-1}$ even when $\bm{\Delta}=0$ . Indeed, when the model is fully known, $\gamma(\bm{\Delta})=0$ and the right-hand side of (22) diverges to infinity, so the gain of $\mathbfcal{M}$ can take any finite value with no upper bound imposed, which recovers the completeness result of Theorem 1.
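The trade-off expressed by (22) can be made concrete with a small numerical sketch. The helper names below are hypothetical, and the gains passed in are illustrative numbers rather than values from the experiments; the functions simply evaluate the right-hand side of (22) and the bounds (26)-(28).

```python
def max_admissible_gain(gamma_delta: float, gamma_F: float) -> float:
    """Right-hand side of (22): strict upper bound on gamma(M)."""
    if gamma_delta < 0 or gamma_F < 0:
        raise ValueError("gains must be nonnegative")
    if gamma_delta == 0.0:
        return float("inf")  # exact model: no upper bound on gamma(M)
    return 1.0 / (gamma_delta * (gamma_F + 1.0))


def closed_loop_gain_bounds(gamma_M: float, gamma_delta: float, gamma_F: float):
    """Bounds (26)-(28) on the gains of w -> w_hat, w -> u and w -> x."""
    denom = 1.0 - gamma_delta * gamma_M * (gamma_F + 1.0)
    if denom <= 0.0:
        raise ValueError("condition (22) is violated")
    w_hat = (gamma_delta * gamma_F + 1.0) / denom                 # (26)
    u = gamma_M * w_hat                                           # (27)
    x = gamma_F * (1.0 + gamma_M * (1.0 - gamma_delta)) / denom   # (28)
    return w_hat, u, x


# Illustrative values: 10% mismatch gain, plant gain 4.
bound = max_admissible_gain(0.1, 4.0)
w_hat_gain, u_gain, x_gain = closed_loop_gain_bounds(1.0, 0.1, 4.0)
```

With these illustrative numbers, any $\gamma(\mathbfcal{M})$ below `bound` keeps the denominators in (26)-(28) positive, and the bounds grow without limit as $\gamma(\mathbfcal{M})$ approaches it.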

Remark 1 (Robust stability of nonlinear SLS).

The authors of [29] characterize robust stability of nonlinear SLS against mismatch in satisfying the achievability constraint (17b). Specifically, [29] focuses on the scenario where the control policy is the mapping $\mathbf{x}\rightarrow\mathbf{u}$ in the form

	$\displaystyle\tilde{\mathbf{w}}$	$\displaystyle=\mathbf{x}-(\bm{\Psi}^{\mathbf{x}}-\mathbf{I})\tilde{\mathbf{w}}\,,$		(29)
	$\displaystyle\mathbf{u}$	$\displaystyle=\bm{\Psi}^{\mathbf{u}}(\tilde{\mathbf{w}})\,,$		(30)

for some $(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})\in\mathcal{L}_{p}$ which are not assumed to perfectly comply with (17b). Accordingly, the authors define a mismatch operator

\bm{\Xi}=\mathbf{F}(\bm{\Psi}^{\mathbf{x}},\bm{\Psi}^{\mathbf{u}})+\mathbf{I}-\bm{\Psi}^{\mathbf{x}}\,.

(31)

Then, Theorem IV.2 of [29] proves closed-loop stability as long as $\gamma\left(\bm{\Xi}\right)<1$ . Since $\bm{\Xi}$ measures the degree of violation of the achievability constraint rather than the degree of model uncertainty, a robust stability analysis based on verifying $\gamma(\bm{\Xi})<1$ tailored to the case $\mathbf{F}=\widehat{\mathbf{F}}+\bm{\Delta}$ may not be straightforward, and it is not attempted in [29]. For this case, instead, Theorem 3 provides an upper bound on the admissible gains for $\mathbfcal{M}$ ; this is achieved by exploiting the IMC structure of the policy (23), and bounding the effect of model uncertainty on the closed-loop map for the ground-truth system.

IV-B Distributed controllers for large-scale plants

When dealing with large-scale cyber-physical systems, one may consider that the plant (1) is composed of a network of $N$ dynamically interconnected nonlinear subsystems. To model this scenario, we introduce an undirected coupling graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ , where the nodes $\mathcal{V}=\{1,\dots,N\}$ represent the subsystems in the network, and the set of edges $\mathcal{E}$ encodes pairs of subsystems $\{i,j\}$ that are dynamically interconnected through state variables. Specifically, the dynamics of each subsystem $i\in\mathcal{V}$ is

x_{t}^{[i]}=f_{t}^{[i]}(x^{[\mathcal{N}_{i}]}_{t-1:0},u^{[i]}_{t-1:0})+w^{[i]}_{t},\ \ \ t=1,2,\ldots

(32)

where state and input of each subsystem $i\in\mathcal{V}$ at time $t=1,2,\ldots$ are denoted by $x_{t}^{[i]}\in\mathbb{R}^{n_{i}}$ and $u_{t}^{[i]}\in\mathbb{R}^{m_{i}}$ respectively, and the initial state is $x^{[i]}_{0}\in\mathbb{R}^{n_{i}}$ . In operator form we have

\mathbf{x}^{[i]}=\mathbf{F}^{[i]}(\mathbf{x}^{[\mathcal{N}_{i}]},\mathbf{u}^{[i]})+\mathbf{w}^{[i]},

(33)

where $\mathbf{F}^{[i]}:\ell^{n_{\mathcal{N}_{i}}}\times\ell^{m_{i}}\rightarrow\ell^{n_{i}}$ . Note that, by stacking the subsystem dynamics in (32) together, we recover a system in the form (1), where $x_{t}=\operatorname{col}_{i\in\mathcal{V}}(x_{t}^{[i]})\in\mathbb{R}^{n}$ , $u_{t}=\operatorname{col}_{i\in\mathcal{V}}(u_{t}^{[i]})\in\mathbb{R}^{m}$ , and $w_{t}=\operatorname{col}_{i\in\mathcal{V}}(w_{t}^{[i]})\in\mathbb{R}^{n}$ .

When controlling networked systems in the form (33), a common scenario is that the local feedback controller $u_{t}^{[i]}$ can only access information made available by its neighbors according to a communication network with the same topology as $\mathcal{G}$ . This requirement translates into imposing the following additional constraint to the performance-boosting problem (Problem 1):

\mathbf{u}^{[i]}=\mathbf{K}^{[i]}(\mathbf{x}^{[\mathcal{N}_{i}]}),\quad\forall i\in\mathcal{V}\,.

(34)

The challenge becomes to parametrize only those stabilizing policies that are distributed according to (34). This can be achieved by exploiting the IMC controller architecture (11) in combination with the network sparsity of $\mathbf{F}$ highlighted in (33). Let us consider, for example, the networked plant of Figure 3, where $\mathbf{u}^{[i]}$ depends on the local disturbance reconstructions $\widehat{\mathbf{w}}^{[i]}$ only, that is, $\mathbf{u}^{[i]}=\mathbfcal{M}^{[i]}(\widehat{\mathbf{w}}^{[i]})$ . In order to reconstruct $\widehat{\mathbf{w}}^{[1]}$ , agent $i=1$ needs to evaluate the local dynamics $\mathbf{F}^{[1]}(\mathbf{x}^{[1]},\mathbf{x}^{[3]},\mathbf{u}^{[1]})$ ; this, in turn, requires a measurement of the state $\mathbf{x}^{[3]}$ over time. Repeating this reasoning for the agents $i=2$ and $i=3$ , one obtains an overall control policy $\mathbf{K}(\mathbf{x})$ whose agent-wise components are computed relying on measurements from neighboring subsystems only, thus complying with (34). We formalize this reasoning in the next proposition.

Proposition 1.

Let graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ describe the topology of a plant $\mathbf{F}$ as per (33). Consider an IMC control policy (11) where the operator $\mathbfcal{M}\in\mathcal{L}_{p}$ is decentralized, that is, $\mathbfcal{M}^{[i]}(\widehat{\mathbf{w}})=\mathbfcal{M}^{[i]}(\widehat{\mathbf{w}}^{[i]})$ for every agent $i\in\mathcal{V}$ . Then, the closed-loop system is $\ell_{p}$ -stable and the corresponding control policy $\mathbf{u}=\mathbf{K}(\mathbf{x})$ is distributed according to (34).

Proof.

Since $\mathbfcal{M}\in\mathcal{L}_{p}$ , the closed-loop system is $\ell_{p}$ -stable by Theorem 1. By (33), we have $\widehat{\mathbf{w}}^{[i]}=\mathbf{x}^{[i]}-\mathbf{F}^{[i]}(\mathbf{x}^{[\mathcal{N}_{i}]},\mathbf{u}^{[i]})$ . Hence, agent $i$ only needs measurements of the neighboring states according to $\mathcal{G}$ and local past inputs, thus complying with (34). ∎
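The argument in the proof can be illustrated with a minimal simulation sketch. Everything below is an assumption made for illustration: a three-agent network with hypothetical neighbor sets (not necessarily the graph of Figure 3), scalar linear one-step local dynamics, an exact model, and a static gain as the decentralized operator $\mathbfcal{M}^{[i]}$. Each agent reconstructs its own disturbance from neighboring states and its own past input only.

```python
import numpy as np

# Hypothetical 3-agent network (0-indexed); N[i] lists the neighbours of i, including i.
N = {0: [0, 2], 1: [1, 2], 2: [0, 1, 2]}
# Coupling weights with the same sparsity as N; row sums < 1, so the open loop is stable.
A = np.array([[0.5, 0.0, 0.2],
              [0.0, 0.4, 0.1],
              [0.2, 0.1, 0.3]])


def local_model(i, x_prev_nbrs, u_prev_i):
    """F^[i]: one-step local dynamics, evaluated from neighbouring states only."""
    return sum(A[i, j] * x_prev_nbrs[j] for j in N[i]) + u_prev_i


def local_policy(w_hat_i):
    """Decentralized M^[i]: a static gain, which has finite L_p gain."""
    return -0.5 * w_hat_i


def simulate(w):
    """Closed loop with an exact model; w has shape (T, number of agents)."""
    T, n = w.shape
    x, u, w_hat = np.zeros((T, n)), np.zeros((T, n)), np.zeros((T, n))
    for t in range(T):
        for i in range(n):
            if t == 0:
                pred = 0.0                               # x_0 = w_0
            else:
                nbrs = {j: x[t - 1, j] for j in N[i]}    # only neighbouring states
                pred = local_model(i, nbrs, u[t - 1, i])
            x[t, i] = pred + w[t, i]                     # plant update
            w_hat[t, i] = x[t, i] - pred                 # local reconstruction
            u[t, i] = local_policy(w_hat[t, i])
    return x, u, w_hat


w = np.zeros((100, 3))
w[0] = [1.0, -2.0, 0.5]          # nonzero initial condition only
x, u, w_hat = simulate(w)
```

In this sketch the reconstruction `w_hat` coincides with `w` because the model is exact, and no agent ever reads a state outside its neighbor set.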

The result of Proposition 1 can be extended to more complex cases. First, one can use local operators $\mathbfcal{M}^{[i]}\in\mathcal{L}_{p}$ that, besides $\widehat{\mathbf{w}}^{[i]}$ , have access to disturbance reconstructions $\widehat{\mathbf{w}}^{[j]}$ or control variables $\mathbf{u}^{[j]}$ computed at locations $j\neq i$ . While these architectures can be beneficial, e.g. for counteracting disturbances affecting other subsystems before they propagate to the subsystem $i$ through coupling, they require additional communication channels $\{i,j\}$ if $j\not\in\mathcal{N}_{i}$ . Moreover, one has to use local operators $\mathbfcal{M}^{[i]}$ guaranteeing that the whole operator $\mathbfcal{M}$ belongs to $\mathcal{L}_{p}$ . To this purpose, in general, it is not enough that $\mathbfcal{M}^{[i]}\in\mathcal{L}_{p}$ because the dependency on $\widehat{\mathbf{w}}^{[j]}$ and $\mathbf{u}^{[j]}$ for $j\neq i$ can induce loop interconnections that can destabilize the closed-loop system. Classes of local operators $\mathbfcal{M}^{[i]}$ yielding $\mathbfcal{M}\in\mathcal{L}_{p}$ have been proposed in [43, 44] by using dissipativity theory.

V Learning to Boost Performance using Unconstrained Optimization

Leveraging the theoretical results of previous sections, we reformulate the performance-boosting problem in a form that facilitates optimizing by automatic differentiation and unconstrained gradient descent. This enables the use of highly flexible cost functions for complex nonlinear optimal control tasks. By design, the proposed approach guarantees closed-loop stability throughout the optimization process. We assess the effectiveness of the proposed methodology in achieving optimal performance through numerical experiments, in Section VI.

V-A IMC-based reformulation of performance boosting

The main value of Theorem 1 is that it enables reformulating Problem 1 as follows.


$\displaystyle\min_{\mathbfcal{M}\in\mathcal{L}_{p}}$	$\displaystyle\qquad\mathbb{E}_{w_{T:0}}\left[L(x_{T:0},u_{T:0})\right]$	(35a)
$\displaystyle\operatorname{s.t.}~{}~{}$	$\displaystyle x_{t}=f_{t}(x_{t-1:0},u_{t-1:0})+w_{t},\quad x_{0}=w_{0},$	(35b)
	$\displaystyle u_{t}=\mathcal{M}_{t}({w}_{t:0})\,,\quad t=1,2,\ldots\,.$	(35c)

Indeed, (6) corresponds to (35b)-(35c). If the exact dynamics $f_{t}$ in (35b) is not known, it must be simply replaced by the nominal model $\widehat{f}_{t}$ .

The reformulation (35) offers significant computational advantages as compared to Problem 1. In the classical linear quadratic case (that is, when $f_{t}$ and $\mathbfcal{M}$ are linear and $L$ is quadratic positive definite), (35) becomes strongly convex in $\mathbfcal{M}$ , enabling the use of efficient convex optimization for finding a globally optimal solution [45, 40, 46, 47, 36]. In the general nonlinear case, searching over nonlinear operators $\mathbfcal{M}\in\mathcal{L}_{p}$ remains significantly easier than tackling Problem 1 directly. Indeed, the set $\mathcal{K}$ of controllers $\mathbf{K}(\cdot)$ complying with (5b) is, in general, difficult to parametrize. This is mainly because, given two stabilizing policies $\mathbf{K}_{1},\mathbf{K}_{2}$ , their convex combinations $\mathbf{K}_{3}=\gamma\mathbf{K}_{1}+(1-\gamma)\mathbf{K}_{2}$ with $\gamma\in[0,1]$ and their cascaded composition $\mathbf{K}_{4}=\mathbf{K}_{2}(\bm{\Phi}^{\mathbf{x}}[\mathbf{F},\mathbf{K}_{1}])$ do not result in stabilizing policies, in general; these issues are very well-known for the special case of linear systems [48, 45]. Hence, it is difficult to parameterize stabilizing policies, for instance, by composing or summing together base stabilizing operators. Instead, thanks to $\mathcal{L}_{p}$ being convex and closed under composition, there exist methods for parametrizing rich subsets of $\mathcal{L}_{p}$ through free parameters $\theta\in\mathbb{R}^{d}$ , that is, to define operators $\mathbfcal{M}(\theta)$ such that

\mathbfcal{M}(\theta)\in\mathcal{L}_{p},\quad\forall\theta\in\mathbb{R}^{d}\,.

(36)

This allows turning (35) into an unconstrained optimization problem over $\theta\in\mathbb{R}^{d}$ .

The last issue to be addressed is the computation of the average in (35a) that, as noticed before, is generally intractable. This is usually circumvented by approximating the exact average with its empirical counterpart obtained using a set of samples $\{w_{T:0}^{s}\}_{s=1}^{S}$ drawn from the distribution $\mathcal{D}$ . One then obtains the finite-dimensional optimization problem:


\begin{aligned}\min_{\theta\in\mathbb{R}^{d}}\quad&\frac{1}{S}\sum_{s=1}^{S}L(x_{T:0}^{s},u_{T:0}^{s})&&\text{(37a)}\\
\operatorname{s.t.}\quad&x_{t}^{s}=f_{t}(x_{t-1:0}^{s},u_{t-1:0}^{s})+w_{t}^{s}\,,\quad w_{0}^{s}=x_{0}^{s}\,,&&\text{(37b)}\\
&u_{t}^{s}=\mathcal{M}_{t}(\theta)(w_{t:0}^{s})\,,\quad t=0,1,2,\ldots\,,&&\text{(37c)}\end{aligned}

where $x_{T:0}^{s}$ and $u_{T:0}^{s}$ are the states and inputs obtained when the disturbance $w_{T:0}^{s}$ is applied. While in this work we only consider the empirical cost (37a), the closed-loop performance when faced with out-of-sample noise sequences is further investigated in [49].

Finally, we highlight that (37b) and (37c) can be seen as the equations of the layer $t$ of a neural network with depth $T$ and parametrized by $\theta$ . When $\mathcal{M}_{t}$ , for $t=0,1,\ldots$ is sufficiently smooth, the absence of constraints on $\theta$ enables the use of powerful packages, such as TensorFlow [50] and PyTorch [51], leveraging automatic differentiation and backpropagation for optimizing the controller through gradient descent.
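The structure of (37) can be sketched in a few lines. The example below is only an illustration under stated assumptions: the scalar plant `f`, the quadratic loss, the FIR choice for $\mathbfcal{M}(\theta)$ (which has finite gain for every $\theta$, as required by (36)), and the crude finite-difference gradient are all stand-ins chosen to keep the code self-contained. In practice, automatic differentiation would replace `fd_grad`, and a richer operator class would replace the FIR filter.

```python
import numpy as np


def f(x_prev, u_prev):
    """Hypothetical stable scalar plant (stand-in for f_t in (37b))."""
    return 0.8 * np.tanh(x_prev) + u_prev


def rollout(theta, w):
    """Simulate (37b)-(37c) for one disturbance sequence w_{T:0}."""
    T = len(w)
    x, u, w_hat = np.zeros(T), np.zeros(T), np.zeros(T)
    for t in range(T):
        pred = 0.0 if t == 0 else f(x[t - 1], u[t - 1])
        x[t] = pred + w[t]                 # plant, with x_0 = w_0
        w_hat[t] = x[t] - pred             # IMC disturbance reconstruction
        # FIR operator M(theta): u_t = sum_k theta_k * w_hat_{t-k}
        k_max = min(len(theta), t + 1)
        u[t] = sum(theta[k] * w_hat[t - k] for k in range(k_max))
    return x, u, w_hat


def empirical_cost(theta, W):
    """Empirical average (37a) with an assumed quadratic loss L."""
    total = 0.0
    for w in W:
        x, u, _ = rollout(theta, w)
        total += np.sum(x ** 2) + 0.1 * np.sum(u ** 2)
    return total / len(W)


def fd_grad(theta, W, eps=1e-5):
    """Central finite differences; a placeholder for automatic differentiation."""
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        e = np.zeros_like(theta)
        e[k] = eps
        g[k] = (empirical_cost(theta + e, W) - empirical_cost(theta - e, W)) / (2 * eps)
    return g


rng = np.random.default_rng(0)
# Sampled initial conditions: w_0 random, w_t = 0 for t > 0.
W = [np.concatenate(([rng.normal()], np.zeros(29))) for _ in range(20)]
theta = np.zeros(3)
cost0 = empirical_cost(theta, W)
for _ in range(50):
    cand = theta - 0.05 * fd_grad(theta, W)
    if empirical_cost(cand, W) < empirical_cost(theta, W):  # accept only descent steps
        theta = cand
cost_final = empirical_cost(theta, W)
```

Because every value of `theta` yields a finite-gain operator in this sketch, the loop can be interrupted at any iteration without affecting closed-loop stability; only the achieved cost changes.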

V-B Free parameterizations of $\mathcal{L}_{2}$ subsets

As highlighted in Section V-A, the possibility of obtaining effective controllers by solving (37) critically depends on our ability to parametrize $\mathcal{L}_{p}$ operators. The main obstacle is that the space $\mathcal{L}_{p}$ is infinite-dimensional. Hence, for implementation, one usually restricts the search to subsets of $\mathcal{L}_{p}$ described by finitely many parameters. When linear systems are considered, one can search over Finite Impulse Response (FIR) transfer matrices $\mathbf{M}=\sum_{i=0}^{N}M[i]z^{-i}\in\mathcal{TF}_{s}$ and then optimize over the finitely many real matrices $M[i]$ . Less and less conservative solutions can be obtained by increasing the FIR order $N$ . However, the FIR approach limits the search to linear control policies.

Recently, [30, 31, 52] have proposed finite-dimensional DNN approximations of nonlinear $\mathcal{L}_{2}$ operators. In the sequel we briefly review the Recurrent Equilibrium Network (REN) models proposed in [31]. An operator $\mathbfcal{M}:\ell^{n}\rightarrow\ell^{m}$ is a REN if the relationship $\mathbf{u}=\mathbfcal{M}(\widehat{\mathbf{w}})$ is recursively generated by the following dynamical system:

\begin{bmatrix}\xi_{t}\\ z_{t}\\ u_{t}\end{bmatrix}=\overbrace{\begin{bmatrix}A_{1}&B_{1}&B_{2}\\ C_{1}&D_{11}&D_{12}\\ C_{2}&D_{21}&D_{22}\end{bmatrix}}^{W}\begin{bmatrix}\xi_{t-1}\\ \sigma(z_{t})\\ w_{t}\end{bmatrix}+\overbrace{\begin{bmatrix}b_{x,t}\\ b_{z,t}\\ b_{w,t}\end{bmatrix}}^{b_{t}}\,,\quad\xi_{-1}=0\,,

(38)

where $\xi_{t}\in\mathbb{R}^{q}$ , $z_{t}\in\mathbb{R}^{r}$ , $b_{x,t},b_{z,t},b_{w,t}\in\ell_{\infty}$ (this is slightly different from the original REN model [31], where these signals are assumed to be constant), and $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ , the activation function, is applied element-wise. Further, $\sigma(\cdot)$ must be piecewise differentiable and with first derivatives restricted to the interval $[0,1]$ . As noted in [31], RENs subsume many existing DNN architectures. In general, RENs define deep equilibrium network models [53] due to the implicit relationships defining $z_{t}$ in the second block row of (38). By restricting $D_{11}$ to be strictly lower-triangular, the value of $z_{t}$ can be computed explicitly, thus significantly speeding up computations [31]. To give an example of the expressivity of (38), by suitably choosing the size and zero pattern of matrices in (38), RENs can provide nonlinear systems in the form

	$\displaystyle\xi_{t}=\hat{A}\xi_{t-1}+\hat{B}\,\text{NN}^{\xi}(\xi_{t-1},\widehat{w}_{t})$
	$\displaystyle u_{t}=\hat{C}\xi_{t}+\hat{D}\,\text{NN}^{u}(\xi_{t-1},\widehat{w}_{t})$

where $\hat{A}$ , $\hat{B}$ , $\hat{C}$ , $\hat{D}$ are arbitrary matrices of suitable dimensions and $\text{NN}^{\star}$ , $\star\in\{\xi,u\}$ , are neural networks of depth $L$ given by the relations

	$\displaystyle\tilde{z}_{0,t}^{\star}=[\xi_{t-1}^{\top},\widehat{w}_{t}^{\top}]^{\top},$
	$\displaystyle\tilde{z}_{k+1,t}^{\star}=\sigma(W_{k}^{\star}\tilde{z}_{k,t}^{\star}+b_{k}^{\star}),\quad k=0,\ldots,L-1$

where $W_{k}^{\star}$ and $b_{k}^{\star}$ are the layer weights and biases, respectively, and $\tilde{z}_{L,t}^{\star}$ is the NN output.

For an arbitrary choice of $W$ and $b_{t}$ , the map $\mathbfcal{M}$ induced by (38) may not lie in $\mathcal{L}_{2}$ . The work [31] provides an explicit smooth mapping $\Theta:\mathbb{R}^{d}\rightarrow\mathbb{R}^{(q+r+m)\times(q+r+n)}$ from unconstrained training parameters $\theta\in\mathbb{R}^{d}$ to a matrix $W=\Theta(\theta)\in\mathbb{R}^{(q+r+m)\times(q+r+n)}$ defining (38), with the property that the corresponding operator $\mathbfcal{M}(\theta)$ lies in $\mathcal{L}_{2}$ by design when $b_{t}=0$ . (Furthermore, RENs enjoy contractivity, although the theoretical results of this paper do not rely on this property.) This approach can be easily generalized by including vectors $b_{t}$ , $t=1,\ldots,T$ in the set of trainable parameters and assuming $b_{t}=0$ for $t>T$ . Recently, free parameterizations of continuous-time $\mathcal{L}_{2}$ operators through RENs and port-Hamiltonian systems have been also proposed in [52] and [54], respectively.
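To illustrate why a strictly lower-triangular $D_{11}$ makes $z_{t}$ explicitly computable, the following sketch evaluates one step of (38) by forward substitution. It is a minimal illustration only: the function name `ren_step` and the random matrix `W` are assumptions, `tanh` is used as one admissible activation, and the smooth mapping $\Theta$ of [31] is not reproduced here, so this random `W` carries no $\mathcal{L}_{2}$ guarantee.

```python
import numpy as np


def ren_step(xi_prev, w_t, W, b, q, r, sigma=np.tanh):
    """One step of (38), assuming D11 is strictly lower-triangular."""
    A1, B1, B2 = W[:q, :q], W[:q, q:q + r], W[:q, q + r:]
    C1, D11, D12 = W[q:q + r, :q], W[q:q + r, q:q + r], W[q:q + r, q + r:]
    C2, D21, D22 = W[q + r:, :q], W[q + r:, q:q + r], W[q + r:, q + r:]
    b_x, b_z, b_u = b[:q], b[q:q + r], b[q + r:]

    # Forward substitution: z_t[k] depends only on sigma(z_t[j]) for j < k.
    z = np.zeros(r)
    s = np.zeros(r)                      # s = sigma(z), filled entry by entry
    base = C1 @ xi_prev + D12 @ w_t + b_z
    for k in range(r):
        z[k] = base[k] + D11[k, :k] @ s[:k]
        s[k] = sigma(z[k])

    xi = A1 @ xi_prev + B1 @ s + B2 @ w_t + b_x
    u = C2 @ xi_prev + D21 @ s + D22 @ w_t + b_u
    return xi, z, u


# Usage with hypothetical sizes and a random W (illustration only).
rng = np.random.default_rng(1)
q, r, n, m = 3, 4, 2, 2
W = rng.normal(size=(q + r + m, q + r + n))
W[q:q + r, q:q + r] = np.tril(W[q:q + r, q:q + r], -1)   # enforce the D11 structure
b = rng.normal(size=q + r + m)
xi0, w0 = np.zeros(q), rng.normal(size=n)
xi, z, u = ren_step(xi0, w0, W, b, q, r)
```

The returned `z` satisfies the implicit second block row of (38) exactly, without any iterative equilibrium solver.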

VI Numerical Experiments: the Magic of the Cost

In this section, we test the flexibility of performance boosting by considering cooperative robotics problems. Firstly, we validate the fail-safe feature of the design approach by showing that closed-loop stability is preserved during and after training — both when the system model is known and when it is uncertain. Secondly, we exploit the freedom in selecting the cost $L(x_{T:0},u_{T:0})$ to include appropriate terms aimed at promoting complex closed-loop behaviors.

In all the examples, we consider two point-mass vehicles, each with position $p_{t}^{[i]}\in\mathbb{R}^{2}$ and velocity $q_{t}^{[i]}\in\mathbb{R}^{2}$ , for $i=1,2$ , subject to nonlinear drag forces (e.g., air or water resistance). The discrete-time model for vehicle $i$ is

\begin{bmatrix}p_{t}^{[i]}\\ q_{t}^{[i]}\end{bmatrix}=\begin{bmatrix}p_{t-1}^{[i]}\\ q_{t-1}^{[i]}\end{bmatrix}+T_{s}\begin{bmatrix}q_{t-1}^{[i]}\\ (m^{[i]})^{-1}\left(-C^{[i]}(q_{t-1}^{[i]})+F_{t-1}^{[i]}\right)\end{bmatrix}\,,

(39)

where $m^{[i]}>0$ is the mass, $F_{t}^{[i]}\in\mathbb{R}^{2}$ denotes the force control input, $T_{s}>0$ is the sampling time and $C^{[i]}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$ is a drag function given by $C^{[i]}(s)=b_{1}^{[i]}s-b_{2}^{[i]}\tanh(s)$ , for some $0<b_{2}^{[i]}<b_{1}^{[i]}$ . Each vehicle must reach a target position $\overline{p}^{[i]}\in\mathbb{R}^{2}$ with zero velocity in a stable way. This elementary goal can be achieved by using a base proportional controller

{F^{\prime}}_{t}^{[i]}={K^{\prime}}^{[i]}(\bar{p}^{[i]}-p_{t}^{[i]})\,,

(40)

with $K^{\prime[i]}=\operatorname{diag}(k_{1}^{[i]},k_{2}^{[i]})$ and $k_{1}^{[i]},k_{2}^{[i]}>0$ . The overall dynamics $f_{t}(x_{t-1:0},u_{t-1:0})$ in (1) is given by (39)-(40) with $F^{[i]}_{t}=F^{\prime[i]}_{t}+u_{t}^{[i]}$ , where $x_{t}=(p_{t}^{[1]},q_{t}^{[1]},p_{t}^{[2]},q_{t}^{[2]})$ and $u_{t}=(u_{t}^{[1]},u_{t}^{[2]})$ is a performance-boosting control input to be designed. As per (1), we consider additive disturbances affecting the system dynamics. Thanks to the use of the prestabilizing controller (40), one can show that $\mathbfcal{F}(\mathbf{u},\mathbf{w})\in\mathcal{L}_{2}$ .
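The pre-stabilized vehicle model (39)-(40) is simple enough to simulate directly. The sketch below does so for a single vehicle; the numerical values of $T_{s}$, $m$, $b_{1}$, $b_{2}$ and the gains are hypothetical placeholders (the values used in the experiments are given in the appendix and are not reproduced here), and the function names are illustrative.

```python
import numpy as np


def vehicle_step(p, q, u, p_bar, Ts=0.05, m=1.0, b1=1.0, b2=0.5, k=(1.0, 1.0)):
    """One step of (39) with F = K'(p_bar - p) + u, where K' is as in (40)."""
    drag = b1 * q - b2 * np.tanh(q)            # C(s) = b1*s - b2*tanh(s), element-wise
    F = np.asarray(k) * (p_bar - p) + u        # base proportional controller plus boost input
    p_next = p + Ts * q
    q_next = q + Ts * (-drag + F) / m
    return p_next, q_next


def simulate_base(p0, p_bar, steps):
    """Roll out the pre-stabilized vehicle with u = 0 and zero initial velocity."""
    p, q = np.array(p0, dtype=float), np.zeros(2)
    for _ in range(steps):
        p, q = vehicle_step(p, q, np.zeros(2), p_bar)
    return p, q


p_bar = np.array([1.0, 1.0])
p_final, q_final = simulate_base([-2.0, -2.0], p_bar, steps=3000)
```

With these placeholder parameters the base controller alone drives the vehicle to the target with vanishing velocity; the performance-boosting input `u` is then free to shape the transient.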

The goal of the performance-boosting policy is to enforce additional desired behaviors, on top of stability, which are specified in each of the following subsections. In all cases, we parametrize the operator $\mathbfcal{M}(\theta)\in\mathcal{L}_{2}$ as a REN, see (38). Appendix A presents all the implementation details, such as parameter values and exact definitions of the cost functions. The code to reproduce our examples as well as various movies are available in our GitHub repository (https://github.com/DecodEPFL/performance-boosting_controllers.git).

VI-A Robust stability preservation during optimization

We consider the scenario mountains in Figure 4 where each vehicle must reach the target position in a stable way while avoiding collisions between themselves and with two grey obstacles. Each agent is represented with a circle that indicates its radius for the collision avoidance specifications. When using the base controller (40), the vehicles successfully reach the target. However, they do so with poor performance since collisions are not avoided, as shown in Figure 4(a).

We select a loss $L(x_{T:0},u_{T:0})$ as the sum of stage costs $l(x_{t},u_{t})$ , that is, $L(x_{T:0},u_{T:0})=\sum_{t=0}^{T}l(x_{t},u_{t})$ with

l(x_{t},u_{t})=l_{traj}(x_{t},u_{t})+l_{ca}(x_{t})+l_{obs}(x_{t})\,,

(41)

where $l_{traj}(x_{t},u_{t})=\begin{bmatrix}x_{t}^{\mathsf{T}}&u_{t}^{\mathsf{T}}\end{bmatrix}Q\begin{bmatrix}x_{t}^{\mathsf{T}}&u_{t}^{\mathsf{T}}\end{bmatrix}^{\mathsf{T}}$ with $Q\succeq 0$ penalizes the distance of agents from their targets and the control energy, and $l_{ca}(x_{t})$ and $l_{obs}(x_{t})$ penalize collisions between agents and with obstacles, respectively.

In order to train the performance-boosting controller, we solve (37), using a REN (38) of dimension $q=r=8$ . The training data consists of a set of 100 initial positions, i.e., we set $w_{0}=((p^{x}_{0})^{[1]},(p^{y}_{0})^{[1]},0,0,(p^{x}_{0})^{[2]},(p^{y}_{0})^{[2]},0,0)$ and $w_{t}=0$ , for $t>0$ , where $p^{x}$ and $p^{y}$ denote the $x$ and $y$ coordinates of the vehicles in the Cartesian plane, respectively. Initial positions are sampled from a Gaussian distribution around the nominal initial condition. Figure 4(b-c) shows the nominal and training initial conditions marked with ‘ $\times$ ’ and ‘ $\circ$ ’, respectively, and three test trajectories after the training of the IMC controller. The trained control policies avoid collisions and achieve optimized trajectories thanks to minimizing (41).

VI-A1 Early stopping of the training

We validate the fail-safe property of our IMC control policies. We consider the scenario mountains as above but where the training process is interrupted before achieving a local minimum, as per the one in Figure 4. In particular, we stop the optimization algorithm after 25%, 50%, and 75% of the total number of epochs. The obtained trajectories are shown in Figure 5. We observe that even if the performance is not optimized, closed-loop stability is always guaranteed.

VI-A2 Model mismatch

We test our trained IMC controller under model mismatch. In particular, we assume that the mass of the true vehicles is uncertain by $\pm 10\%$ , and we apply IMC control policies embedding the nominal system with the nominal mass value. Figures 6 (a-b) validate the robust $\ell_{2}$ -stability of the closed-loop trajectories when the vehicles are lighter and heavier, respectively. Theorem 3 suggests that, in this case, the gain of $\mathbfcal{M}$ may be sufficiently low to counteract the effect of model uncertainty. Note, however, that checking the sufficient condition (22) requires computing an upper bound on $\gamma(\bm{\Delta})$ , which is a cumbersome task for general nonlinear systems. Nonetheless, Theorem 3 ensures that, in practical implementation, we can always reduce $\gamma(\mathbfcal{M})$ enough to eventually meet (22).

VI-B Boosting for safety and invariance certificates

A challenging task in many control applications is to deal with stringent safety constraints on the state variables. Ideally, one would directly add the constraint that

x_{t}\in\mathcal{C}\,,\forall t=0,1,\ldots\,,

(42)

in the IMC-based performance-boosting problem (35), where $\mathcal{C}\subseteq\mathbb{R}^{n}$ defines a safety region. Unfortunately, (42) generally results in intractable constraints over $\mathbfcal{M}$ . Indeed, it may be challenging to even verify that (42) holds for a certain $\mathbfcal{M}$ due to the infinite-horizon requirement and the involved nonlinearities. Many state-of-the-art approaches for guaranteeing safety hinge on either predictive safety filters [55, 56] or Control Barrier Functions (CBFs) [57, 58]. Safety filters are used during deployment: they override the control input $\mathbf{u}=\mathbfcal{M}(\widehat{\mathbf{w}})$ with a different (suboptimal) control variable when deemed necessary for guaranteeing safety. Instead, CBFs can be used for safety verification of a given policy, as they allow characterizing $\mathcal{C}$ as a forward invariant set based on a safety-set-defining function $h(x):\mathcal{X}\rightarrow\mathbb{R}$ satisfying $h(x)\geq 0$ for all $x\in\mathcal{C}$ . Certifying the forward invariance of $\mathcal{C}$ translates into determining if $h(x)$ is a CBF through verification of some safety conditions. (An exact definition of CBFs for the discrete-time case can be found in [58]; for a more general discussion on CBFs we refer the reader to [57].) In particular, one can verify that, for any $x_{t}\in\mathcal{C}$ , if there exists an input $u_{t}$ giving $x_{t+1}$ such that it holds

h(x_{t+1})-h(x_{t})+\gamma h(x_{t})\geq 0\,,

(43)

where $0<\gamma\leq 1$ , then $h(x)$ is a CBF.
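As a brief worked step, and assuming for illustration that $\mathcal{C}$ is exactly the set where $h(x)\geq 0$ , condition (43) can be rearranged as

h(x_{t+1})\geq(1-\gamma)\,h(x_{t})\,.

If this inequality holds at every time step along a trajectory with $h(x_{0})\geq 0$ , then by induction

h(x_{t})\geq(1-\gamma)^{t}\,h(x_{0})\geq 0\,,\quad\forall t=0,1,\ldots\,,

because $0\leq 1-\gamma<1$ . Hence the trajectory never leaves $\mathcal{C}$ , and $\gamma$ limits how quickly $h$ is allowed to decay towards the boundary of the safe set.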

While optimizing over $\mathbfcal{M}$ such that (42) holds by design remains an open challenge, we aim to promote forward invariant sets by shaping the cost to include soft safety specifications over a horizon of length $T$ . In particular, the new cost term penalizes violations of (43) as per

\mathcal{L}_{\operatorname{inv}}=\sum_{t=0}^{T-1}\operatorname{ReLU}\left(h(x_{t})-h(x_{t+1})-\gamma h(x_{t})\right)\,.

(44)
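The penalty (44) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `invariance_loss` and the sample barrier values are hypothetical, and a training pipeline would typically use a differentiable framework instead of NumPy.

```python
import numpy as np

def invariance_loss(h_values, gamma):
    """Soft penalty (44): sum_t ReLU(h(x_t) - h(x_{t+1}) - gamma * h(x_t)).

    h_values holds h(x_0), ..., h(x_T) along one trajectory (hypothetical input).
    """
    h = np.asarray(h_values, dtype=float)
    # A positive entry means that condition (43) is violated at that step.
    violation = h[:-1] - h[1:] - gamma * h[:-1]
    return float(np.sum(np.maximum(violation, 0.0)))

# Hypothetical barrier values along two trajectories.
h_slow_decay = [1.0, 0.8, 0.7, 0.65]   # satisfies (43) at every step for gamma = 0.5
h_fast_decay = [1.0, 0.2, -0.1, 0.3]   # decays too fast and leaves the safe set

loss_slow = invariance_loss(h_slow_decay, gamma=0.5)
loss_fast = invariance_loss(h_fast_decay, gamma=0.5)
```

The first trajectory incurs no penalty, while the second is penalized at the two steps where $h$ drops faster than the rate allowed by $\gamma$ .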

We consider the mountains scenario again and add the requirement that $(p_{t}^{y})^{[i]}<(\bar{p}^{y})^{[i]}+0.1$ for each vehicle $i=1,2$ and every $t=0,1,\ldots$ , where $p_{t}^{y}$ denotes the $y$ -coordinate of each center-of-mass position on the Cartesian plane. In other words, we only allow an overshoot of $0.1$ in the vertical direction with respect to the target position for each vehicle. By defining $h(x_{t})=\sum_{i=1}^{2}((\bar{p}^{y})^{[i]}+0.1-(p_{t}^{y})^{[i]})$ , we add the term (44) to the loss function (37a). When training without $\mathcal{L}_{\operatorname{inv}}$ in the cost, the vehicles violate the constraints, on average, $67.49\%$ of the time over 100 runs; typical trajectories are shown in Figure 4. The violation ratio decreases to $5.43\%$ when $\mathcal{L}_{\operatorname{inv}}$ is included, as shown in Figure 6(c), where the gray area indicates the unsafe region to be avoided by the vehicles. Note that shaping the cost through $\mathcal{L}_{\operatorname{inv}}$ is also beneficial if one implements an online safety filter such as [55, 56] during deployment. This is because penalizing $\mathcal{L}_{\operatorname{inv}}$ drastically decreases constraint violations of the closed-loop system, and hence, the suboptimal online intervention of the safety filter would be much less frequent.
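For concreteness, the safety function $h$ and a violation ratio for this scenario could be computed as follows. This is a sketch under assumptions: the function names, the array layout (one row per time step, one column per vehicle) and the sample trajectory are hypothetical, and the exact way the reported violation percentages were computed is not specified in the text.

```python
import numpy as np

MARGIN = 0.1  # allowed vertical overshoot, as stated in the text

def h_mountains(p_y, p_y_target, margin=MARGIN):
    """h(x_t) = sum_i (target_y[i] + margin - p_y[t, i]) for every time step t."""
    p_y = np.asarray(p_y, dtype=float)
    return np.sum(np.asarray(p_y_target, dtype=float) + margin - p_y, axis=-1)

def violation_ratio(p_y, p_y_target, margin=MARGIN):
    """Fraction of time steps at which at least one vehicle breaks p_y < target_y + margin."""
    p_y = np.asarray(p_y, dtype=float)
    violated = np.any(p_y >= np.asarray(p_y_target, dtype=float) + margin, axis=-1)
    return float(np.mean(violated))

# Hypothetical y-coordinates of the two vehicles over four time steps.
targets_y = [2.0, 2.0]
traj_y = [[-2.0, -2.0], [1.5, 2.0], [2.2, 2.05], [2.0, 2.0]]

h_traj = h_mountains(traj_y, targets_y)
ratio = violation_ratio(traj_y, targets_y)
```

Because $h$ sums the margins of both vehicles, $h(x_{t})\geq 0$ is necessary but not sufficient for each vehicle to satisfy its own constraint, which is why the ratio above checks the vehicles individually.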

VI-C Boosting for temporal logic specifications

The success of many policy learning algorithms, e.g., in RL, is highly dependent on the choice of the reward functions for capturing the desired behavior and constraints of an agent. When tasks become complex, specifying loss functions that are the sum over time of stage costs can be restrictive. For instance, consider the case of an agent that must optimally visit a set of locations. A loss function composed of a stage cost summed over time, that is, the one considered in dynamic programming and classical optimal control [59, 3], cannot easily capture this task, as it would need a priori information about the optimal timings to visit each location. To overcome this problem, one could use more complex loss functions, such as those derived from temporal logic formulations. In particular, truncated linear temporal logic (TLTL) is a specification language leveraging a set of operators defined over finite-time trajectories [60, 61]. It allows incorporating domain knowledge and constraints (in a soft fashion) into the learning process, such as “always avoid obstacles”, “eventually visit location $a$ ”, or “do not visit location $b$ until visiting location $a$ ”. Then, using quantitative semantics, one can automatically transform TLTL formulae into real-valued loss functions that are compositions of $\min$ and $\max$ functions over a finite period of time [60, 61].
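To illustrate how such $\min$ / $\max$ compositions arise, the sketch below evaluates the specification “always avoid the obstacle and eventually visit location $a$ ” on a sampled trajectory. The helper names, radii and distance values are hypothetical; the operator-to-function mapping follows the standard quantitative semantics (always as $\min$ over time, eventually as $\max$ over time, conjunction as $\min$ ), not a specific implementation from [60, 61].

```python
import numpy as np

def always(rho):
    """Robustness of 'always phi': worst case over the horizon."""
    return float(np.min(rho))

def eventually(rho):
    """Robustness of 'eventually phi': best case over the horizon."""
    return float(np.max(rho))

def conj(*rhos):
    """Robustness of a conjunction: the least satisfied conjunct."""
    return min(rhos)

# Hypothetical distances of one vehicle to an obstacle center and to location a.
d_obstacle = np.array([2.0, 1.5, 1.2, 1.8])
d_goal_a = np.array([3.0, 1.0, 0.04, 0.5])
R_OBS, EPS = 1.0, 0.05  # assumed obstacle radius and visiting tolerance

rho_avoid = always(d_obstacle - R_OBS)    # predicate d > R_OBS has robustness d - R_OBS
rho_visit = eventually(EPS - d_goal_a)    # predicate d < EPS has robustness EPS - d
rho_spec = conj(rho_avoid, rho_visit)

# A positive robustness means the specification is satisfied; the loss is its negation.
loss = -rho_spec
```

Here the specification is satisfied with a small positive robustness, dominated by how closely the vehicle approaches location $a$ .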

To test the efficacy of TLTL specifications for shaping complex stable closed-loop behavior, we consider the waypoint-tracking scenario, shown in Figure 7, where the two vehicles have to visit a sequence of waypoints while avoiding collisions with each other and with the gray obstacles. The blue vehicle’s goal is to visit $g_{b}$ , then $g_{a}$ and then $g_{c}$ , while the goal for the orange vehicle is to visit the waypoints in the following order: $g_{c}$ , $g_{b}$ and $g_{a}$ . Following [60], the loss formulation for the orange agent is translated into plain English as “Visit $g_{c}$ then $g_{b}$ then $g_{a}$ ; and don’t visit $g_{b}$ or $g_{a}$ until visiting $g_{c}$ ; and don’t visit $g_{a}$ until visiting $g_{b}$ ; and if visited $g_{c}$ , don’t visit $g_{c}$ again; and if visited $g_{b}$ , don’t visit $g_{b}$ again; and always avoid obstacles; and always avoid collisions; and eventually stay at the final goal.” Its mathematical formulation can be found in Appendix -A2.

Figure 7 shows the waypoint-tracking scenario before and after the training of a performance-boosting controller. As described in Section V-B, we use a REN with $q=r=32$ for approximating the $\mathcal{L}_{2}$ operator $\mathbfcal{M}$ . Furthermore, we allow for a time-varying bias of the form $b_{t}^{\top}=\begin{bmatrix}0_{1\times q}&0_{1\times r}&b_{w,t}^{\top}\end{bmatrix}$ in (38), with $b_{w,t}=0$ for $t>T$ . While the system always starts at the same initial condition, indicated with ‘ $\circ$ ’, the data consists of disturbance sequences $w_{T:0}$ with fixed $w_{0}$ and $w_{T:1}$ as i.i.d. samples drawn from a Gaussian distribution with zero mean and standard deviation of $0.01$ . Our result highlights the power of complex costs, expressed here through the TLTL loss function, which promote vehicles visiting the predefined waypoints in the correct order while avoiding collisions with each other and with the obstacles.

VII Conclusion

Embedding safety and stability emerges as a crucial challenge when control systems are equipped with high-performance machine learning components. This work aims to contribute to this rapidly developing field by uncovering the theoretical and computational potential of IMC for safely boosting the performance of closed-loop nonlinear systems with machine learning models such as DNNs.

The results of this work open up several future research directions. First, motivated by the recent results of [49], it would be relevant to apply statistical learning theory to rigorously assess the generalization capabilities of performance-boosting controllers in uncertain environments and over extended timeframes. Second, drawing on insights from [62], integrating extensive RL-based offline learning with real-time adjustments similar to MPC presents a promising approach. Third, within the IMC framework, there is a significant opportunity to develop richer parametrizations of stable dynamical systems in $\mathcal{L}_{p}$ , and to theoretically prove their approximation capabilities. Lastly, building upon [63], it is interesting to explore how learning-based IMC methods could generate new optimization algorithms with formal guarantees for tackling complex optimal control and machine learning tasks.

-A Implementation details for the numerical experiments in Section VI

We set $m^{[i]}=b^{[i]}_{1}={k^{\prime}}^{[i]}_{1}={k^{\prime}}^{[i]}_{2}=1$ and $b^{[i]}_{2}=0.5$ as the parameters for each vehicle $i$ in the model (39) with the pre-stabilizing controller (40). The collision-avoidance radius of each agent is 0.5.

-A1 Mountains scenario

As shown in Figure 4, the vehicles start at $p^{[1]}_{0}=(-2,-2)$ and $p^{[2]}_{0}=(-2,2)$ , and their goal is to go to the target positions $\bar{p}^{[1]}=(2,2)$ and $\bar{p}^{[2]}=(-2,2)$ , respectively. The training data consists of $100$ initial positions sampled from a Gaussian distribution around the initial position with a standard deviation of $0.5$ .

Let $\bar{x}=(\bar{x}^{[1]},\bar{x}^{[2]})$ with $\bar{x}^{[i]}=(\bar{p}^{[i]},0_{2})$ . The terms of the cost function (41) are defined as follows:

l_{traj}(x_{t},u_{t})=(x_{t}-\bar{x})^{\top}\tilde{Q}(x_{t}-\bar{x})+\alpha_{u}u_{t}^{\top}u_{t}

l_{ca}(x_{t})=\begin{cases}\alpha_{ca}\sum_{i=0}^{N}\sum_{j,\,i\neq j}(d^{i,j}_{t}+\epsilon)^{-2}&\text{if}\,d^{i,j}_{t}\leq D_{\text{safe}}\,,\\ 0&\text{otherwise}\,,\end{cases}

where $\tilde{Q}\succ 0$ and $\alpha_{u},\alpha_{ca}>0$ are hyperparameters, $d^{i,j}_{t}=\|p^{[i]}_{t}-p^{[j]}_{t}\|_{2}\geq 0$ denotes the distance between agents $i$ and $j$ , $\epsilon>0$ is a small fixed constant such that the loss remains bounded for all distance values, and $D_{\text{safe}}$ is a safe distance between the centers of mass of the agents; we set it to 1.2.
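The two stage-cost terms above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the value of `eps` is not given in the text and is chosen arbitrarily, and the double sum is interpreted as running over all ordered pairs of distinct agents, with each pair contributing only when its distance is at most $D_{\text{safe}}$ .

```python
import numpy as np

def l_traj(x, u, x_bar, Q, alpha_u):
    """Quadratic tracking cost (x - x_bar)^T Q (x - x_bar) + alpha_u * u^T u."""
    e = np.asarray(x, dtype=float) - np.asarray(x_bar, dtype=float)
    u = np.asarray(u, dtype=float)
    return float(e @ np.asarray(Q, dtype=float) @ e + alpha_u * (u @ u))

def l_ca(positions, alpha_ca=100.0, d_safe=1.2, eps=1e-3):
    """Collision-avoidance barrier: inverse-square penalty for pairs closer than d_safe.

    positions has one row per agent; eps = 1e-3 is an assumed value.
    """
    p = np.asarray(positions, dtype=float)
    total = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            if i == j:
                continue
            d = np.linalg.norm(p[i] - p[j])
            if d <= d_safe:
                total += (d + eps) ** -2
    return alpha_ca * total

# Hypothetical configurations of two agents.
cost_far = l_ca([[0.0, 0.0], [2.0, 0.0]])    # distance 2.0 > 1.2, no penalty
cost_near = l_ca([[0.0, 0.0], [1.0, 0.0]])   # distance 1.0 <= 1.2, penalized
track = l_traj(np.ones(8), np.zeros(4), np.zeros(8), np.eye(8), 2.5e-4)
```

The penalty is zero outside the safe distance and grows sharply as two agents approach each other, while `eps` keeps it bounded at zero distance.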

Motivated by [64], we represent the obstacles based on a Gaussian density function

\eta(z;\mu,\Sigma)=\frac{1}{2\pi\sqrt{\text{det}(\Sigma)}}\exp\left(-\frac{1}{2}\left(z-\mu\right)^{\top}\Sigma^{-1}\left(z-\mu\right)\right)\,,

with mean $\mu\in\mathbb{R}^{2}$ and covariance $\Sigma\in\mathbb{R}^{2\times 2}$ with $\Sigma\succ 0$ . The term $l_{obs}(x_{t})$ is given by

l_{obs}(x_{t})=\alpha_{obs}\sum_{i=0}^{2}\Bigg(\eta\left(p^{[i]}_{t};\begin{bmatrix}2.5\\ 0\end{bmatrix},0.2\,I\right)+\eta\left(p^{[i]}_{t};\begin{bmatrix}-2.5\\ 0\end{bmatrix},0.2\,I\right)+\eta\left(p^{[i]}_{t};\begin{bmatrix}1.5\\ 0\end{bmatrix},0.2\,I\right)+\eta\left(p^{[i]}_{t};\begin{bmatrix}-1.5\\ 0\end{bmatrix},0.2\,I\right)\Bigg)\,.

For the hyperparameters, we set $\alpha_{u}=2.5\times 10^{-4}$ , $\alpha_{ca}=100$ , $\alpha_{obs}=5\times 10^{3}$ and $Q=I_{4}$ . We use stochastic gradient descent with Adam to minimize the loss function, setting a learning rate of $1\times 10^{-4}$ . We train for $5\times 10^{3}$ epochs with a batch size of one trajectory.
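The obstacle term $l_{obs}$ with the centers, covariance and weight $\alpha_{obs}$ given above can be sketched as below. The function and constant names are hypothetical, and the outer sum is taken over the two vehicles, which is an assumption about the intended index range.

```python
import numpy as np

def gaussian_density(z, mu, sigma):
    """Bivariate Gaussian density eta(z; mu, Sigma) as defined in the text."""
    diff = np.asarray(z, dtype=float) - np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    norm_const = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(sigma)))
    # Solve Sigma y = diff instead of forming the explicit inverse.
    return float(norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)))

OBSTACLE_CENTERS = [(2.5, 0.0), (-2.5, 0.0), (1.5, 0.0), (-1.5, 0.0)]
OBSTACLE_COV = 0.2 * np.eye(2)

def l_obs(positions, alpha_obs=5e3):
    """Sum of the four obstacle densities over all vehicle positions (one row each)."""
    return alpha_obs * sum(
        gaussian_density(p, c, OBSTACLE_COV)
        for p in np.asarray(positions, dtype=float)
        for c in OBSTACLE_CENTERS
    )

# Hypothetical positions: both vehicles far from the obstacles, then one on top of one.
cost_clear = l_obs([[0.0, 3.0], [0.0, -3.0]])
cost_hit = l_obs([[2.5, 0.0], [0.0, -3.0]])
peak = gaussian_density([2.5, 0.0], [2.5, 0.0], OBSTACLE_COV)
```

The density peaks at each obstacle center and decays smoothly with distance, so the term provides informative gradients well before a vehicle reaches an obstacle.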

-A2 Waypoint-tracking scenario

As shown in Figure 7, the vehicles start at $p^{[1]}_{0}=(-2,0)$ and $p^{[2]}_{0}=(0,0)$ . The goal points $g_{a}$ , $g_{b}$ and $g_{c}$ are located at $(-2,-2)$ , $(0,2)$ and $(2,-2)$ , respectively. To describe the TLTL loss, let us define, for each vehicle, the following functions of time:

• $d^{g_{i}}_{t}$ , for $i=1,2,3$ , is the distance between the vehicle and the goal point $g_{i}$ ;

• $d^{o_{i}}_{t}$ , for $i=1,2$ , is the distance between the vehicle and the $i^{\text{th}}$ obstacle;

• $d^{coll}_{t}$ is the distance between the two vehicles;

where $g_{1}$ , $g_{2}$ and $g_{3}$ are the waypoints in the correct visiting order, for each vehicle. Following the notation of [60], the temporal logic form of the cost function, for each vehicle, is

\begin{aligned}&\left(\psi_{g_{1}}\,\mathcal{T}\,\psi_{g_{2}}\,\mathcal{T}\,\psi_{g_{3}}\right)\wedge\left(\lnot\left(\psi_{g_{2}}\vee\psi_{g_{3}}\right)\,\mathcal{U}\,\psi_{g_{1}}\right)\wedge\left(\lnot\psi_{g_{3}}\,\mathcal{U}\,\psi_{g_{2}}\right)\\ &\wedge\left(\bigwedge_{i=1,2,3}\square\left(\psi_{g_{i}}\Rightarrow\bigcirc\square\lnot\psi_{g_{i}}\right)\right)\wedge\left(\bigwedge_{i=1,2}\square\psi_{o_{i}}\right)\\ &\wedge\square\psi_{coll}\wedge\lozenge\square\psi_{g_{3}}\end{aligned}

(45)

where $\psi$ are predicates defined in Table I, and $r_{obs}=1.7$ and $r_{rob}=0.5$ are the radii of the obstacles and vehicles, respectively. (Note that in the waypoint-tracking scenario, we do not model the obstacles with a Gaussian density function.) The Boolean operators $\lnot$ , $\vee$ , and $\wedge$ stand for negation (not), disjunction (or), and conjunction (and). The temporal operators $\mathcal{T}$ , $\mathcal{U}$ , $\lozenge$ , $\square$ , and $\bigcirc$ stand for ‘then’, ‘until’, ‘eventually’, ‘always’, and ‘next’. Mathematically, each term can be automatically translated following [60, 61]. For instance, $\square\psi_{coll}$ translates into

\min_{t\in[0,T]}(d^{coll}_{t}-2r_{rob}),

and $\square\left(\psi_{g_{i}}\Rightarrow\bigcirc\square\lnot\psi_{g_{i}}\right)$ translates into

\min_{t\in[0,T]}\max\Big(-(0.05-d^{g_{i}}_{t})\,,\,\min_{\tilde{t}\in[t+1,T]}-(0.05-d^{g_{i}}_{\tilde{t}})\Big).

The full mathematical expression of (45), which can be obtained following [60], is implemented in our GitHub repository.
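As an illustration of the two translated terms above, the following sketch evaluates them on sampled distance sequences. It is not taken from the authors' repository: the function names and sample data are hypothetical, and the empty window at the final time step is treated as vacuously satisfied, which is an assumption.

```python
import numpy as np

def rho_always_no_collision(d_coll, r_rob=0.5):
    """Robustness of 'always psi_coll': min_t (d_t - 2 * r_rob)."""
    return float(np.min(np.asarray(d_coll, dtype=float) - 2.0 * r_rob))

def rho_visit_once(d_goal, eps=0.05):
    """Robustness of 'always (psi_g implies next always not psi_g)'."""
    d = np.asarray(d_goal, dtype=float)
    rho_psi = eps - d  # robustness of psi_g (d < eps) at each time step
    values = []
    for t in range(len(d)):
        future = -rho_psi[t + 1:]  # robustness of 'not psi_g' on [t+1, T]
        # Assumption: an empty window (t = T) counts as vacuously satisfied.
        rho_future = np.min(future) if future.size else np.inf
        # Implication a => b has robustness max(-rho(a), rho(b)).
        values.append(max(-rho_psi[t], rho_future))
    return float(min(values))

# Hypothetical distance sequences to one waypoint.
single_visit = [1.0, 0.02, 0.5, 1.0]    # enters the waypoint once, then leaves
repeat_visit = [1.0, 0.02, 0.5, 0.01]   # comes back to the waypoint later

rho_single = rho_visit_once(single_visit)
rho_repeat = rho_visit_once(repeat_visit)
rho_coll = rho_always_no_collision([2.0, 1.2, 1.5])
```

A positive value indicates that the corresponding sub-formula is satisfied along the trajectory, so the first sequence satisfies the visit-once requirement and the second does not.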

Predicates	Expression
$\psi_{g_{1}}$	$d^{g_{1}}<0.05$
$\psi_{g_{2}}$	$d^{g_{2}}<0.05$
$\psi_{g_{3}}$	$d^{g_{3}}<0.05$
$\psi_{o_{1}}$	$d^{o_{1}}>r_{obs}$
$\psi_{o_{2}}$	$d^{o_{2}}>r_{obs}$
$\psi_{coll}$	$d^{coll}>2\,r_{rob}$

TABLE I: Predicates used in the TLTL formulation of (45).

We also add a small regularization term that encourages the vehicles to stay close to the final target point, which reads $\alpha_{\text{reg}}\left\lVert x_{t}-\bar{x}\right\rVert^{2}$ , with $\alpha_{\text{reg}}=1\times 10^{-4}$ . We use stochastic gradient descent with Adam to minimize the loss function, setting a learning rate of $5\times 10^{-4}$ . We train for 3000 epochs with a batch size of one trajectory.

References

[1] A. M. Annaswamy, K. H. Johansson, and G. J. Pappas, “Control for societal-scale challenges: Road map 2030,” IEEE Control Systems Society Publication, 2023.
[2] S. Sastry, Nonlinear systems: analysis, stability, and control. Springer Science & Business Media, 2013, vol. 10.
[3] D. P. Bertsekas, “Dynamic programming and optimal control: Vol. I-II,” Belmont, MA: Athena Scientific, 2011.
[4] L. S. Pontryagin, Mathematical theory of optimal processes. Routledge, 2018.
[5] J. B. Rawlings, D. Q. Mayne, and M. Diehl, Model predictive control: theory, computation, and design. Nob Hill Publishing Madison, WI, 2017, vol. 2.
[6] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[7] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, pp. 411–444, 2022.
[8] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning quadrupedal locomotion over challenging terrain,” Science robotics, vol. 5, no. 47, p. eabc5986, 2020.
[9] Y. Song, M. Steinweg, E. Kaufmann, and D. Scaramuzza, “Autonomous drone racing with deep reinforcement learning,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1205–1212.
[10] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforcement learning,” Nature, vol. 620, no. 7976, pp. 982–987, 2023.
[11] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” Advances in Neural Information Processing Systems 30, vol. 2, pp. 909–919, 2018.
[12] M. Zanon and S. Gros, “Safe reinforcement learning using robust MPC,” IEEE Transactions on Automatic Control, vol. 66, no. 8, pp. 3638–3652, 2020.
[13] M. Jin and J. Lavaei, “Stability-certified reinforcement learning: A control-theoretic perspective,” IEEE Access, vol. 8, pp. 229 086–229 100, 2020.
[14] T. Parisini and R. Zoppoli, “A receding-horizon regulator for nonlinear systems and a neural approximation,” Automatica, vol. 31, no. 10, pp. 1443–1451, Oct. 1995.
[15] T. Parisini, M. Sanguineti, and R. Zoppoli, “Nonlinear stabilization by receding-horizon neural regulators,” International Journal of Control, vol. 70, no. 3, pp. 341–362, Jan. 1998.
[16] A. Levin and K. Narendra, “Control of nonlinear dynamical systems using neural networks. II. Observability, identification, and control,” IEEE Transactions on Neural Networks, vol. 7, no. 1, pp. 30–42, Jan. 1996.
[17] F. Gu, H. Yin, L. El Ghaoui, M. Arcak, P. Seiler, and M. Jin, “Recurrent neural network controllers synthesis with stability guarantees for partially observed systems,” in AAAI, 2022, pp. 5385–5394.
[18] P. Pauli, J. Köhler, J. Berberich, A. Koch, and F. Allgöwer, “Offset-free setpoint tracking using neural network controllers,” in Learning for Dynamics and Control. PMLR, 2021, pp. 992–1003.
[19] R. Wang, N. H. Barbara, M. Revay, and I. Manchester, “Learning over all stabilizing nonlinear controllers for a partially-observed linear system,” IEEE Control Systems Letters, vol. 7, pp. 91–96, 2022.
[20] R. Wang and I. R. Manchester, “Youla-REN: Learning nonlinear feedback policies with robust stability guarantees,” in 2022 American Control Conference (ACC). IEEE, 2022, pp. 2116–2123.
[21] L. Furieri, C. L. Galimberti, M. Zakwan, and G. Ferrari-Trecate, “Distributed neural network control with dependability guarantees: a compositional port-Hamiltonian approach,” in Learning for Dynamics and Control Conference. PMLR, 2022, pp. 571–583.
[22] L. Furieri, C. L. Galimberti, and G. Ferrari-Trecate, “Neural system level synthesis: Learning over all stabilizing policies for nonlinear systems,” in 2022 IEEE 61st Conference on Decision and Control (CDC). IEEE, 2022, pp. 2765–2770.
[23] N. H. Barbara, R. Wang, and I. R. Manchester, “Learning over contracting and Lipschitz closed-loops for partially-observed nonlinear systems,” in 2023 62nd IEEE Conference on Decision and Control (CDC). IEEE, 2023, pp. 1028–1033.
[24] C. E. Garcia and M. Morari, “Internal model control. A unifying review and some new results,” Industrial & Engineering Chemistry Process Design and Development, vol. 21, no. 2, pp. 308–323, 1982.
[25] C. G. Economou, M. Morari, and B. O. Palsson, “Internal model control: Extension to nonlinear system,” Industrial & Engineering Chemistry Process Design and Development, vol. 25, no. 2, pp. 403–411, 1986.
[26] F. Bonassi and R. Scattolini, “Recurrent neural network-based internal model control design for stable nonlinear systems,” European Journal of Control, vol. 65, p. 100632, 2022.
[27] V. Anantharam and C. A. Desoer, “On the stabilization of nonlinear systems,” IEEE Transactions on Automatic Control, vol. 29, no. 6, pp. 569–572, 1984.
[28] K. Fujimoto and T. Sugie, “State-space characterization of Youla parametrization for nonlinear systems based on input-to-state stability,” in Proceedings of the 37th IEEE Conference on Decision and Control, vol. 3. IEEE, 1998, pp. 2479–2484.
[29] D. Ho, “A system level approach to discrete-time nonlinear systems,” in 2020 American Control Conference (ACC). IEEE, 2020, pp. 1625–1630.
[30] K.-K. K. Kim, E. R. Patrón, and R. D. Braatz, “Standard representation and unified stability analysis for dynamic artificial neural network models,” Neural Networks, vol. 98, pp. 251–262, 2018.
[31] M. Revay, R. Wang, and I. R. Manchester, “Recurrent equilibrium networks: Flexible dynamic models with guaranteed stability and robustness,” IEEE Transactions on Automatic Control, 2023.
[32] Y. Tang, Y. Zheng, and N. Li, “Analysis of the optimization landscape of linear quadratic Gaussian (LQG) control,” in Learning for Dynamics and Control. PMLR, 2021, pp. 599–610.
[33] L. Furieri and M. Kamgarpour, “First order methods for globally optimal distributed controllers beyond quadratic invariance,” in 2020 American Control Conference (ACC). IEEE, 2020, pp. 4588–4593.
[34] D. E. Rivera, M. Morari, and S. Skogestad, “Internal model control: PID controller design,” Industrial & Engineering Chemistry Process Design and Development, vol. 25, no. 1, pp. 252–265, 1986.
[35] K. Zhou and J. C. Doyle, Essentials of robust control. Prentice hall Upper Saddle River, NJ, 1998, vol. 104.
[36] M. W. Fisher, G. Hug, and F. Dörfler, “Approximation by simple poles–part I: Density and geometric convergence rate in hardy space,” IEEE Transactions on Automatic Control, 2023.
[37] ——, “Approximation by simple poles–part II: System level synthesis beyond finite impulse response,” arXiv preprint arXiv:2203.16765, 2022.
[38] L. Furieri, Y. Zheng, A. Papachristodoulou, and M. Kamgarpour, “Sparsity invariance for convex design of distributed controllers,” IEEE Transactions on Control of Network Systems, vol. 7, no. 4, pp. 1836–1847, 2020.
[39] R. Wang, N. H. Barbara, M. Revay, and I. R. Manchester, “Learning over all stabilizing nonlinear controllers for a partially-observed linear system,” IEEE Control Systems Letters, vol. 7, pp. 91–96, 2022.
[40] Y.-S. Wang, N. Matni, and J. C. Doyle, “A system-level approach to controller synthesis,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 4079–4093, 2019.
[41] L. Conger, J. S. L. Li, E. Mazumdar, and S. L. Brunton, “Nonlinear system level synthesis for polynomial dynamical systems,” in 2022 IEEE 61st Conference on Decision and Control (CDC). IEEE, 2022, pp. 3846–3852.
[42] G. Zames, “On the input-output stability of time-varying nonlinear feedback systems part one: Conditions derived using concepts of loop gain, conicity, and positivity,” IEEE transactions on automatic control, vol. 11, no. 2, pp. 228–238, 1966.
[43] L. Massai, D. Saccani, L. Furieri, and G. Ferrari-Trecate, “Unconstrained learning of networked nonlinear systems via free parametrization of stable interconnected operators,” arXiv preprint arXiv:2311.13967, 2023.
[44] D. Saccani, L. Massai, L. Furieri, and G. Ferrari-Trecate, “Optimal distributed control with stability guarantees by training a network of neural closed-loop maps,” arXiv preprint arXiv:2404.02820, 2024.
[45] D. Youla, H. Jabr, and J. Bongiorno, “Modern Wiener-Hopf design of optimal controllers–Part II: The multivariable case,” IEEE Transactions on Automatic Control, vol. 21, no. 3, pp. 319–338, 1976.
[46] L. Furieri, Y. Zheng, A. Papachristodoulou, and M. Kamgarpour, “An input–output parametrization of stabilizing controllers: amidst Youla and system level synthesis,” IEEE Control Systems Letters, 2019.
[47] Y. Zheng, L. Furieri, M. Kamgarpour, and N. Li, “System-level, input–output and new parameterizations of stabilizing controllers, and their numerical computation,” Automatica, vol. 140, p. 110211, 2022.
[48] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International Conference on Machine Learning. PMLR, 2018, pp. 1467–1476.
[49] M. G. Boroujeni, C. L. Galimberti, A. Krause, and G. Ferrari-Trecate, “A PAC-Bayesian framework for optimal control with stability guarantees,” arXiv preprint arXiv:2403.17790, 2024.
[50] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. [Online]. Available: https://www.tensorflow.org/
[51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8024–8035.
[52] D. Martinelli, C. L. Galimberti, I. R. Manchester, L. Furieri, and G. Ferrari-Trecate, “Unconstrained parametrization of dissipative and contracting neural ordinary differential equations,” in 2023 62nd IEEE Conference on Decision and Control (CDC). IEEE, 2023, pp. 3043–3048.
[53] S. Bai, J. Z. Kolter, and V. Koltun, “Deep equilibrium models,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
[54] M. Zakwan and G. Ferrari-Trecate, “Neural distributed controllers with port-Hamiltonian structures,” arXiv preprint arXiv:2403.17785, 2024.
[55] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, “Learning-based model predictive control: Toward safe learning in control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.
[56] K. P. Wabersich and M. N. Zeilinger, “A predictive safety filter for learning-based control of constrained nonlinear dynamical systems,” Automatica, vol. 129, p. 109597, 2021.
[57] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in 2019 18th European control conference (ECC). IEEE, 2019, pp. 3420–3431.
[58] A. Agrawal and K. Sreenath, “Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation.” in Robotics: Science and Systems, vol. 13. Cambridge, MA, USA, 2017, pp. 1–10.
[59] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. Scokaert, “Constrained model predictive control: Stability and optimality,” Automatica, vol. 36, no. 6, pp. 789–814, 2000.
[60] X. Li, C.-I. Vasile, and C. Belta, “Reinforcement learning with temporal logic rewards,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 3834–3839.
[61] K. Leung, N. Aréchiga, and M. Pavone, “Backpropagation through signal temporal logic specifications: Infusing logical structure into gradient-based methods,” The International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023.
[62] D. Bertsekas, Lessons from AlphaZero for optimal, model predictive, and adaptive control. Athena Scientific, 2022.
[63] A. Martin and L. Furieri, “Learning to optimize with convergence guarantees using nonlinear system theory,” arXiv preprint arXiv:2403.09389, 2024.
[64] D. Onken, L. Nurbekyan, X. Li, S. W. Fung, S. Osher, and L. Ruthotto, “A neural network approach applied to multi-agent optimal control,” in IEEE European Control Conference (ECC), 2021, pp. 1036–1041.
