Search | arXiv e-print repository

Controllability Test for Nonlinear Datatic Systems

Authors: Yujie Yang, Letian Tao, Likun Wang, Shengbo Eben Li

Abstract: Controllability is a fundamental property of control systems, serving as the prerequisite for controller design. While controllability test is well established in modelic (i.e., model-driven) control systems, extending it to datatic (i.e., data-driven) control systems is still a challenging task due to the absence of system models. In this study, we propose a general controllability test method for nonlinear systems with datatic description, where the system behaviors are merely described by data. In this situation, the state transition information of a dynamic system is available only at a limited number of data points, leaving the behaviors beyond these points unknown. Different from traditional exact controllability, we introduce a new concept called $ε$-controllability, which extends the definition from point-to-point form to point-to-region form. Accordingly, our focus shifts to checking whether the system state can be steered to a closed state ball centered on the target state, rather than exactly at that target state. On its basis, we propose a tree search algorithm called maximum expansion of controllable subset (MECS) to identify controllable states in the dataset. Starting with a specific target state, our algorithm can iteratively propagate controllability from a known state ball to a new one. This iterative process gradually enlarges the $ε$-controllable subset by incorporating new controllable balls until all $ε$-controllable states are searched. Besides, a simplified version of MECS is proposed by solving a special shortest path problem, called Floyd expansion with radius fixed (FERF). FERF maintains a fixed radius of all controllable balls based on a mutual controllability assumption of neighboring states. The effectiveness of our method is validated in three datatic control systems whose dynamic behaviors are described by sampled data.
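The abstract above describes FERF as a shortest path problem over a dataset. The sketch below illustrates that general idea only and is not the authors' algorithm: it assumes a hypothetical edge rule (sample `i` links to state `j` when the recorded next state lands within `eps` of state `j`) and runs a plain Floyd-Warshall pass over hop counts. The names `hop_distances` and `states_reaching` and the toy data are invented for illustration.

```python
import math


def hop_distances(states, transitions, eps):
    """All-pairs hop counts via Floyd-Warshall on a toy transition dataset.

    states:      list of state tuples (the sampled data points).
    transitions: list of (i, next_state) pairs, where i indexes `states`.
    eps:         fixed ball radius (assumed edge rule, not from the paper).
    """
    n = len(states)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    # Assumed edge rule: i -> j if the next state falls inside the eps-ball of j.
    for i, next_state in transitions:
        for j in range(n):
            if i != j and math.dist(next_state, states[j]) <= eps:
                dist[i][j] = 1
    # Standard Floyd-Warshall relaxation.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def states_reaching(dist, target):
    """Indices of states with a finite hop count to the target state."""
    return sorted(i for i in range(len(dist)) if dist[i][target] < math.inf)


# Toy 1-D dataset: state 0 lands near state 1, state 1 lands near state 2.
states = [(0.0,), (1.0,), (2.0,), (5.0,)]
transitions = [(0, (0.9,)), (1, (2.05,)), (3, (5.5,))]
dist = hop_distances(states, transitions, eps=0.2)
reachable = states_reaching(dist, target=2)
```

In this toy dataset, states 0, 1 and 2 have a finite hop count to the target, while state 3 does not, because its only recorded transition lands outside every ball.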

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2404.10064

The Feasibility of Constrained Reinforcement Learning Algorithms: A Tutorial Study

Authors: Yujie Yang, Zhilong Zheng, Shengbo Eben Li, Masayoshi Tomizuka, Changliu Liu

Abstract: Satisfying safety constraints is a priority concern when solving optimal control problems (OCPs). Due to the existence of infeasibility phenomenon, where a constraint-satisfying solution cannot be found, it is necessary to identify a feasible region before implementing a policy. Existing feasibility theories built for model predictive control (MPC) only consider the feasibility of optimal policy. However, reinforcement learning (RL), as another important control method, solves the optimal policy in an iterative manner, which comes with a series of non-optimal intermediate policies. Feasibility analysis of these non-optimal policies is also necessary for iteratively improving constraint satisfaction; but that is not available under existing MPC feasibility theories. This paper proposes a feasibility theory that applies to both MPC and RL by filling in the missing part of feasibility analysis for an arbitrary policy. The basis of our theory is to decouple policy solving and implementation into two temporal domains: virtual-time domain and real-time domain. This allows us to separately define initial and endless, state and policy feasibility, and their corresponding feasible regions. Based on these definitions, we analyze the containment relationships between different feasible regions, which enables us to describe the feasible region of an arbitrary policy. We further provide virtual-time constraint design rules along with a practical design tool called feasibility function that helps to achieve the maximum feasible region. We review most of existing constraint formulations and point out that they are essentially applications of feasibility functions in different forms. We demonstrate our feasibility theory by visualizing different feasible regions under both MPC and RL policies in an emergency braking control task.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.00481

Convolutional Bayesian Filtering

Authors: Wenhan Cao, Shiqi Liu, Chang Liu, Zeyu He, Stephen S. -T. Yau, Shengbo Eben Li

Abstract: Bayesian filtering serves as the mainstream framework of state estimation in dynamic systems. Its standard version utilizes total probability rule and Bayes' law alternatively, where how to define and compute conditional probability is critical to state distribution inference. Previously, the conditional probability is assumed to be exactly known, which represents a measure of the occurrence probability of one event, given the second event. In this paper, we find that by adding an additional event that stipulates an inequality condition, we can transform the conditional probability into a special integration that is analogous to convolution. Based on this transformation, we show that both transition probability and output probability can be generalized to convolutional forms, resulting in a more general filtering framework that we call convolutional Bayesian filtering. This new framework encompasses standard Bayesian filtering as a special case when the distance metric of the inequality condition is selected as Dirac delta function. It also allows for a more nuanced consideration of model mismatch by choosing different types of inequality conditions. For instance, when the distance metric is defined in a distributional sense, the transition probability and output probability can be approximated by simply rescaling them into fractional powers. Under this framework, a robust version of Kalman filter can be constructed by only altering the noise covariance matrix, while maintaining the conjugate nature of Gaussian distributions. Finally, we exemplify the effectiveness of our approach by reshaping classic filtering algorithms into convolutional versions, including Kalman filter, extended Kalman filter, unscented Kalman filter and particle filter.
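One way to see why a fractional-power rescaling only changes the noise covariance in the Gaussian case is the identity below. It is a sketch under assumptions: the exponent $\alpha \in (0, 1]$, the measurement matrix $H$ and the covariance $R$ are illustrative symbols that do not appear in the abstract, and the paper's exact construction may differ.

```latex
\mathcal{N}(y;\, Hx,\, R)^{\alpha}
  \;\propto\; \exp\!\Big(-\tfrac{\alpha}{2}\,(y - Hx)^{\top} R^{-1} (y - Hx)\Big)
  \;\propto\; \mathcal{N}\big(y;\, Hx,\, R/\alpha\big),
  \qquad 0 < \alpha \le 1 .
```

Because the powered density is still Gaussian, a Kalman-style update would keep its usual form with $R$ replaced by the inflated covariance $R/\alpha$, which is consistent with the abstract's remark that conjugacy is maintained.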

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.01768

Canonical Form of Datatic Description in Control Systems

Authors: Guojian Zhan, Ziang Zheng, Shengbo Eben Li

Abstract: The design of feedback controllers is undergoing a paradigm shift from modelic (i.e., model-driven) control to datatic (i.e., data-driven) control. Canonical form of state space model is an important concept in modelic control systems, exemplified by Jordan form, controllable form and observable form, whose purpose is to facilitate system analysis and controller synthesis. In the realm of datatic control, there is a notable absence in the standardization of data-based system representation. This paper for the first time introduces the concept of canonical data form for the purpose of achieving more effective design of datatic controllers. In a control system, the data sample in canonical form consists of a transition component and an attribute component. The former encapsulates the plant dynamics at the sampling time independently, which is a tuple containing three elements: a state, an action and their corresponding next state. The latter describes one or some artificial characteristics of the current sample, whose calculation must be performed in an online manner. The attribute of each sample must adhere to two requirements: (1) causality, ensuring independence from any future samples; and (2) locality, allowing dependence on historical samples but constrained to a finite neighboring set. The purpose of adding attribute is to offer some kinds of benefits for controller design in terms of effectiveness and efficiency. To provide a more close-up illustration, we present two canonical data forms: temporal form and spatial form, and demonstrate their advantages in reducing instability and enhancing training efficiency in two datatic control systems.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2401.16793

On the Stability of Datatic Control Systems

Authors: Yujie Yang, Zhilong Zheng, Shengbo Eben Li

Abstract: The development of feedback controllers is undergoing a paradigm shift from $\textit{modelic}$ (model-driven) control to $\textit{datatic}$ (data-driven) control. Stability, as a fundamental property in control, is less well studied in datatic control paradigm. The difficulty is that traditional stability criteria rely on explicit system models, which are not available in those systems with datatic description. Some pioneering works explore stability criteria for datatic systems with special forms such as linear systems, homogeneous systems, and polynomial systems. However, these systems imply too strong assumptions on the inherent connection among data points, which do not hold in general nonlinear systems. This paper proposes a stability verification algorithm for general datatic control systems called $η$-testing. Our stability criterion only relies on a weak assumption of Lipschitz continuity so as to extend information from known data points to unmeasured regions. This information restricts the time derivative of any unknown state to the intersection of a set of closed balls. Inside the intersection, the worst-case time derivative of Lyapunov function is estimated by solving a quadratically constrained linear program (QCLP). By comparing the optimal values of QCLPs to zero in the whole state space, a sufficient condition of system stability can be checked. We test our algorithm on three datatic control systems, including both linear and nonlinear ones. Results show that our algorithm successfully verifies the stability, instability, and critical stability of tested systems.

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2310.19022

doi 10.1109/TCYB.2023.3323316

Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback

Authors: Jingliang Duan, Jie Li, Xuyang Chen, Kai Zhao, Shengbo Eben Li, Lin Zhao

Abstract: In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when initialized near such minima. The paper concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.

Submitted 29 October, 2023; originally announced October 2023.

Journal ref: IEEE Transactions on Cybernetics, 2023

arXiv:2310.05858

DSAC-T: Distributional Soft Actor-Critic with Three Refinements

Authors: Jingliang Duan, Wenxuan Wang, Liming Xiao, Jiaxin Gao, Shengbo Eben Li

Abstract: Reinforcement learning (RL) has proven to be highly effective in tackling complex decision-making and control tasks. However, prevalent model-free RL methods often face severe performance degradation due to the well-known overestimation issue. In response to this problem, we recently introduced an off-policy RL algorithm, called distributional soft actor-critic (DSAC or DSAC-v1), which can effectively improve the value estimation accuracy by learning a continuous Gaussian value distribution. Nonetheless, standard DSAC has its own shortcomings, including occasionally unstable learning processes and the necessity for task-specific reward scaling, which may hinder its overall performance and adaptability in some special tasks. This paper further introduces three important refinements to standard DSAC in order to address these shortcomings. These refinements consist of expected value substituting, twin value distribution learning, and variance-based critic gradient adjusting. The modified RL algorithm is named as DSAC with three refinements (DSAC-T or DSAC-v2), and its performances are systematically evaluated on a diverse set of benchmark tasks. Without any task-specific hyperparameter tuning, DSAC-T surpasses or matches a lot of mainstream model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T, unlike its standard version, ensures a highly stable learning process and delivers similar performance across varying reward scales.

Submitted 28 December, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

arXiv:2309.09734

Learning Optimal Robust Control of Connected Vehicles in Mixed Traffic Flow

Authors: Jie Li, Jiawei Wang, Shengbo Eben Li, Keqiang Li

Abstract: Connected and automated vehicles (CAVs) technologies promise to attenuate undesired traffic disturbances. However, in mixed traffic where human-driven vehicles (HDVs) also exist, the nonlinear human-driving behavior has brought critical challenges for effective CAV control. This paper employs the policy iteration method to learn the optimal robust controller for nonlinear mixed traffic systems. Precisely, we consider the $H_\infty$ control framework and formulate it as a zero-sum game, the equivalent condition for whose solution is converted into a Hamilton-Jacobi inequality with a Hamiltonian constraint. Then, a policy iteration algorithm is designed to generate stabilizing controllers with desired attenuation performance. Based on the updated robust controller, the attenuation level is further optimized in sum of squares program by leveraging the gap of the Hamiltonian constraint. Simulation studies verify that the obtained controller enables the CAVs to dampen traffic perturbations and smooth traffic flow.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2304.08845

Feasible Policy Iteration

Authors: Yujie Yang, Zhilong Zheng, Shengbo Eben Li, Jingliang Duan, Jingjing Liu, Xianyuan Zhan, Ya-Qin Zhang

Abstract: Safe reinforcement learning (RL) aims to find the optimal policy and its feasible region in a constrained optimal control problem (OCP). Ensuring feasibility and optimality simultaneously has been a major challenge. Existing methods either attempt to solve OCPs directly with constrained optimization algorithms, leading to unstable training processes and unsatisfactory feasibility, or restrict policies in overly small feasible regions, resulting in excessive conservativeness with sacrificed optimality. To address this challenge, we propose an indirect safe RL framework called feasible policy iteration, which guarantees that the feasible region monotonically expands and converges to the maximum one, and the state-value function monotonically improves and converges to the optimal one. We achieve this by designing a policy update principle called region-wise policy improvement, which maximizes the state-value function under the constraint of the constraint decay function (CDF) inside the feasible region and minimizes the CDF outside the feasible region simultaneously. This update scheme ensures that the state-value function monotonically increases state-wise in the feasible region and the CDF monotonically decreases state-wise in the entire state space. We prove that the CDF converges to the solution of the risky Bellman equation while the state-value function converges to the solution of the feasible Bellman equation. The former represents the maximum feasible region and the latter manifests the optimal state-value function. Experiments show that our algorithm learns strictly safe and near-optimal policies with accurate feasible regions on classic control tasks. It also achieves fewer constraint violations with performance better than (or comparable to) baselines on Safety Gym.

Submitted 28 January, 2024; v1 submitted 18 April, 2023; originally announced April 2023.

arXiv:2210.07553

Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

Authors: Dongjie Yu, Wenjun Zou, Yujie Yang, Haitong Ma, Shengbo Eben Li, Jingliang Duan, Jianyu Chen

Abstract: Safe reinforcement learning (RL) that solves constraint-satisfactory policies provides a promising way to the broader safety-critical applications of RL in real-world problems such as robotics. Among all safe RL approaches, model-based methods reduce training time violations further due to their high sample efficiency. However, lacking safety robustness against the model uncertainties remains an issue in safe model-based RL, especially in training time safety. In this paper, we propose a distributional reachability certificate (DRC) and its Bellman equation to address model uncertainties and characterize robust persistently safe states. Furthermore, we build a safe RL framework to resolve constraints required by the DRC and its corresponding shield policy. We also devise a line search method to maintain safety and reach higher returns simultaneously while leveraging the shield policy. Comprehensive experiments on classical benchmarks such as constrained tracking and navigation indicate that the proposed algorithm achieves comparable returns with much fewer constraint violations during training.

Submitted 14 October, 2022; originally announced October 2022.

Comments: 12 pages, 6 figures

arXiv:2210.02166

Robust Bayesian Inference for Moving Horizon Estimation

Authors: Wenhan Cao, Chang Liu, Zhiqian Lan, Shengbo Eben Li, Wei Pan, Angelo Alessandri

Abstract: The accuracy of moving horizon estimation (MHE) suffers significantly in the presence of measurement outliers. Existing methods address this issue by treating measurements leading to large MHE cost function values as outliers, which are subsequently discarded. This strategy, achieved through solving combinatorial optimization problems, is confined to linear systems to guarantee computational tractability and stability. Contrasting these heuristic solutions, our work reexamines MHE from a Bayesian perspective, unveils the fundamental issue of its lack of robustness: MHE's sensitivity to outliers results from its reliance on the Kullback-Leibler (KL) divergence, where both outliers and inliers are equally considered. To tackle this problem, we propose a robust Bayesian inference framework for MHE, integrating a robust divergence measure to reduce the impact of outliers. In particular, the proposed approach prioritizes the fitting of uncontaminated data and lowers the weight of contaminated ones, instead of directly discarding all potentially contaminated measurements, which may lead to undesirable removal of uncontaminated data. A tuning parameter is incorporated into the framework to adjust the robustness degree to outliers. Notably, the classical MHE can be interpreted as a special case of the proposed approach as the parameter converges to zero. In addition, our method involves only minor modification to the classical MHE stage cost, thus avoiding the high computational complexity associated with previous outlier-robust methods and inherently suitable for nonlinear systems. Most importantly, our method provides robustness and stability guarantees, which are often missing in other outlier-robust Bayes filters. The effectiveness of the proposed method is demonstrated on simulations subject to outliers following different distributions, as well as on physical experiment data.

Submitted 2 October, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: 17 pages

arXiv:2209.04854

Performance-Driven Controller Tuning via Derivative-Free Reinforcement Learning

Authors: Yuheng Lei, Jianyu Chen, Shengbo Eben Li, Sifa Zheng

Abstract: Choosing an appropriate parameter set for the designed controller is critical for the final performance but usually requires a tedious and careful tuning process, which implies a strong need for automatic tuning methods. However, among existing methods, derivative-free ones suffer from poor scalability or low efficiency, while gradient-based ones are often unavailable due to possibly non-differentiable controller structure. To resolve the issues, we tackle the controller tuning problem using a novel derivative-free reinforcement learning (RL) framework, which performs timestep-wise perturbation in parameter space during experience collection and integrates derivative-free policy updates into the advanced actor-critic RL architecture to achieve high versatility and efficiency. To demonstrate the framework's efficacy, we conduct numerical experiments on two concrete examples from autonomous driving, namely, adaptive cruise control with PID controller and trajectory tracking with MPC controller. Experimental results show that the proposed method outperforms popular baselines and highlight its strong potential for controller tuning.

Submitted 11 September, 2022; originally announced September 2022.

Comments: Accepted by the 61st IEEE Conference on Decision and Control (CDC), 2022. Copyright @IEEE

arXiv:2204.04403

Improve Generalization of Driving Policy at Signalized Intersections with Adversarial Learning

Authors: Yangang Ren, Guojian Zhan, Liye Tang, Shengbo Eben Li, Jianhua Jiang, Jingliang Duan

Abstract: Intersections are quite challenging among various driving scenes wherein the interaction of signal lights and distinct traffic actors poses great difficulty to learn a wise and robust driving policy. Current research rarely considers the diversity of intersections and stochastic behaviors of traffic participants. For practical applications, the randomness usually leads to some devastating events, which should be the focus of autonomous driving. This paper introduces an adversarial learning paradigm to boost the intelligence and robustness of driving policy for signalized intersections with dense traffic flow. Firstly, we design a static path planner which is capable of generating trackable candidate paths for multiple intersections with diversified topology. Next, a constrained optimal control problem (COCP) is built based on these candidate paths wherein the bounded uncertainty of dynamic models is considered to capture the randomness of driving environment. We propose adversarial policy gradient (APG) to solve the COCP wherein the adversarial policy is introduced to provide disturbances by seeking the most severe uncertainty while the driving policy learns to handle this situation by competition. Finally, a comprehensive system is established to conduct training and testing wherein the perception module is introduced and the human experience is incorporated to solve the yellow light dilemma. Experiments indicate that the trained policy can handle the signal lights flexibly meanwhile realizing the smooth and efficient passing with a humanoid paradigm. Besides, APG enables a large-margin improvement of the resistance to the abnormal behaviors and thus ensures a high safety level for the autonomous vehicle.

Submitted 9 April, 2022; originally announced April 2022.

arXiv:2204.02857

Primal-dual Estimator Learning: an Offline Constrained Moving Horizon Estimation Method with Feasibility and Near-optimality Guarantees

Authors: Wenhan Cao, Jingliang Duan, Shengbo Eben Li, Chen Chen, Chang Liu, Yu Wang

Abstract: This paper proposes a primal-dual framework to learn a stable estimator for linear constrained estimation problems leveraging the moving horizon approach. To avoid the online computational burden in most existing methods, we learn a parameterized function offline to approximate the primal estimate. Meanwhile, a dual estimator is trained to check the suboptimality of the primal estimator during execution time. Both the primal and dual estimators are learned from data using supervised learning techniques, and the explicit sample size is provided, which enables us to guarantee the quality of each learned estimator in terms of feasibility and optimality. This in turn allows us to bound the probability of the learned estimator being infeasible or suboptimal. Furthermore, we analyze the stability of the resulting estimator with a bounded error in the minimization of the cost function. Since our algorithm does not require the solution of an optimization problem during runtime, state estimates can be generated online almost instantly. Simulation results are presented to show the accuracy and time efficiency of the proposed framework compared to online optimization of moving horizon estimation and Kalman filter. To the best of our knowledge, this is the first learning-based state estimator with feasibility and near-optimality guarantees for linear constrained systems.

Submitted 6 April, 2022; originally announced April 2022.

arXiv:2201.12518

Zeroth-Order Actor-Critic

Authors: Yuheng Lei, Jianyu Chen, Shengbo Eben Li, Sifa Zheng

Abstract: The recent advanced evolution-based zeroth-order optimization methods and the policy gradient-based first-order methods are two promising alternatives to solve reinforcement learning (RL) problems with complementary advantages. The former methods work with arbitrary policies, drive state-dependent and temporally-extended exploration, possess robustness-seeking property, but suffer from high sample complexity, while the latter methods are more sample efficient but are restricted to differentiable policies and the learned policies are less robust. To address these issues, we propose a novel Zeroth-Order Actor-Critic algorithm (ZOAC), which unifies these two methods into an on-policy actor-critic architecture to preserve the advantages from both. ZOAC conducts rollouts collection with timestep-wise perturbation in parameter space, first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM) alternately in each iteration. We extensively evaluate our proposed method on a wide range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.
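For readers unfamiliar with zeroth-order optimization, the sketch below shows a generic antithetic Gaussian-smoothing gradient estimator of the kind used by evolution-strategy methods. It is not ZOAC itself: there is no critic and no timestep-wise perturbation, and the function name `zeroth_order_gradient`, the toy objective `toy_return` and all parameter values are assumptions made for illustration.

```python
import random


def zeroth_order_gradient(f, theta, sigma=0.1, num_samples=2000, seed=0):
    """Estimate the gradient of f at theta using only function evaluations.

    Uses antithetic Gaussian perturbations in parameter space:
        g ~ mean over samples of  (f(theta + sigma*e) - f(theta - sigma*e)) / (2*sigma) * e
    """
    rng = random.Random(seed)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = f([t + sigma * e for t, e in zip(theta, noise)])
        minus = f([t - sigma * e for t, e in zip(theta, noise)])
        scale = (plus - minus) / (2.0 * sigma)
        for d in range(dim):
            grad[d] += scale * noise[d] / num_samples
    return grad


def toy_return(theta):
    """Hypothetical stand-in for an episode return, maximized at theta = 3."""
    return -(theta[0] - 3.0) ** 2


# The analytic gradient at theta = 0 is 6; the estimate should land close to it.
estimate = zeroth_order_gradient(toy_return, [0.0])
```

The estimator never differentiates `f`, which is why such methods work with arbitrary (including non-differentiable) policies, at the cost of needing many evaluations per update.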

Submitted 11 June, 2022; v1 submitted 29 January, 2022; originally announced January 2022.

arXiv:2111.12953 [pdf, other]

Learn Zero-Constraint-Violation Policy in Model-Free Constrained Reinforcement Learning

Authors: Haitong Ma, Changliu Liu, Shengbo Eben Li, Sifa Zheng, Wenchao Sun, Jianyu Chen

Abstract: In the trial-and-error mechanism of reinforcement learning (RL), a notorious contradiction arises when we expect to learn a safe policy: how to learn a safe policy without enough data and prior model about the dangerous region? Existing methods mostly use the posterior penalty for dangerous actions, which means that the agent is not penalized until experiencing danger. As a result, the agent cannot learn a zero-violation policy even after convergence. Otherwise, it would not receive any penalty and lose the knowledge about danger. In this paper, we propose the safe set actor-critic (SSAC) algorithm, which confines the policy update using safety-oriented energy functions, or the safety indexes. The safety index is designed to increase rapidly for potentially dangerous actions, which allows us to locate the safe set on the action space, or the control safe set. Therefore, we can identify the dangerous actions prior to taking them, and further obtain a zero constraint-violation policy after convergence. We claim that we can learn the energy function in a model-free manner similar to learning a value function. By using the energy function transition as the constraint objective, we formulate a constrained RL problem. We prove that our Lagrangian-based solutions make sure that the learned policy will converge to the constrained optimum under some assumptions. The proposed algorithm is evaluated on both the complex simulation environments and a hardware-in-loop (HIL) experiment with a real controller from the autonomous vehicle. Experimental results suggest that the converged policy in all environments achieves zero constraint violation and comparable performance with model-based baselines.

Submitted 25 November, 2021; originally announced November 2021.

arXiv:2109.13132 [pdf, ps, other]

doi 10.23919/ACC53348.2022.9867384

Optimization Landscape of Gradient Descent for Discrete-time Static Output Feedback

Authors: Jingliang Duan, Jie Li, Shengbo Eben Li, Lin Zhao

Abstract: In this paper, we analyze the optimization landscape of gradient descent methods for static output feedback (SOF) control of discrete-time linear time-invariant systems with quadratic cost. The SOF setting can be quite common, for example, when there are unmodeled hidden states in the underlying process. We first establish several important properties of the SOF cost function, including coercivity, L-smoothness, and M-Lipschitz continuous Hessian. We then utilize these properties to show that the gradient descent is able to converge to a stationary point at a dimension-free rate. Furthermore, we prove that under some mild conditions, gradient descent converges linearly to a local minimum if the starting point is close to one. These results not only characterize the performance of gradient descent in optimizing the SOF problem, but also shed light on the efficiency of general policy gradient methods in reinforcement learning.

Submitted 10 March, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Journal ref: 2022 American Control Conference (ACC)

arXiv:2109.05540 [pdf, other]

Encoding Distributional Soft Actor-Critic for Autonomous Driving in Multi-lane Scenarios

Authors: Jingliang Duan, Yangang Ren, Fawang Zhang, Yang Guan, Dongjie Yu, Shengbo Eben Li, Bo Cheng, Lin Zhao

Abstract: In this paper, we propose a new reinforcement learning (RL) algorithm, called encoding distributional soft actor-critic (E-DSAC), for decision-making in autonomous driving. Unlike existing RL-based decision-making methods, E-DSAC is suitable for situations where the number of surrounding vehicles is variable and eliminates the requirement for manually pre-designed sorting rules, resulting in higher policy performance and generality. We first develop an encoding distributional policy iteration (DPI) framework by embedding a permutation invariant module, which employs a feature neural network (NN) to encode the indicators of each vehicle, in the distributional RL framework. The proposed DPI framework is proved to exhibit important properties in terms of convergence and global optimality. Next, based on the developed encoding DPI framework, we propose the E-DSAC algorithm by adding the gradient-based update rule of the feature NN to the policy evaluation process of the DSAC algorithm. Then, the multi-lane driving task and the corresponding reward function are designed to verify the effectiveness of the proposed algorithm. Results show that the policy learned by E-DSAC can realize efficient, smooth, and relatively safe autonomous driving in the designed scenario. And the final policy performance learned by E-DSAC is about three times that of DSAC. Furthermore, its effectiveness has also been verified in real vehicle experiments.

Submitted 12 September, 2021; originally announced September 2021.

arXiv:2108.13038 [pdf]

doi 10.1088/1742-6596/2234/1/012015

Integrated Decision and Control at Multi-Lane Intersections with Mixed Traffic Flow

Authors: Jianhua Jiang, Yangang Ren, Yang Guan, Shengbo Eben Li, Yuming Yin, ** **

Abstract: Autonomous driving at intersections is one of the most complicated and accident-prone traffic scenarios, especially with mixed traffic participants such as vehicles, bicycles and pedestrians. The driving policy should make safe decisions to handle the dynamic traffic conditions and meet the requirements of on-board computation. However, most current research focuses on simplified intersections considering only the surrounding vehicles and idealized traffic lights. This paper improves the integrated decision and control framework and develops a learning-based algorithm to deal with complex intersections with mixed traffic flows, which can not only take account of realistic characteristics of traffic lights, but also learn a safe policy under different safety constraints. We first consider different velocity models for green and red lights in the training process and use a finite state machine to handle different modes of light transformation. Then we design different types of distance constraints for vehicles, traffic lights, pedestrians, and bicycles, respectively, and formulate the constrained optimal control problems (OCPs) to be optimized. Finally, reinforcement learning (RL) with value and policy networks is adopted to solve the series of OCPs. In order to verify the safety and efficiency of the proposed method, we design a multi-lane intersection with the existence of large-scale mixed traffic participants and set practical traffic light phases. The simulation results indicate that the trained decision and control policy can well balance safety and tracking performance. Compared with model predictive control (MPC), the computational time is three orders of magnitude lower.

Submitted 30 August, 2021; originally announced August 2021.

Comments: 8 pages, 10 figures, 11 equations and 14 conferences

arXiv:2108.11623 [pdf, other]

Model-based Chance-Constrained Reinforcement Learning via Separated Proportional-Integral Lagrangian

Authors: Baiyu Peng, Jingliang Duan, Jianyu Chen, Shengbo Eben Li, Gen** Xie, Congsheng Zhang, Yang Guan, Yao Mu, Enxin Sun

Abstract: Safety is essential for reinforcement learning (RL) applied in the real world. Adding chance constraints (or probabilistic constraints) is a suitable way to enhance RL safety under uncertainty. Existing chance-constrained RL methods like the penalty methods and the Lagrangian methods either exhibit periodic oscillations or learn an over-conservative or unsafe policy. In this paper, we address these shortcomings by proposing a separated proportional-integral Lagrangian (SPIL) algorithm. We first review the constrained policy optimization process from a feedback control perspective, which regards the penalty weight as the control input and the safe probability as the control output. Based on this, the penalty method is formulated as a proportional controller, and the Lagrangian method is formulated as an integral controller. We then unify them and present a proportional-integral Lagrangian method to get both their merits, with an integral separation technique to limit the integral value in a reasonable range. To accelerate training, the gradient of safe probability is computed in a model-based manner. We demonstrate our method can reduce the oscillations and conservatism of RL policy in a car-following simulation. To prove its practicality, we also apply our method to a real-world mobile robot navigation task, where our robot successfully avoids a moving obstacle with highly uncertain or even aggressive behaviors.
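
The feedback-control view in this abstract (penalty weight as control input, safe probability as control output) can be illustrated with a small sketch. This is not the authors' SPIL implementation: the class name `PILagrangeWeight`, the gains `kp` and `ki`, and the specific separation and clamping rules are assumptions chosen only to show how a proportional term, an integral term, and an integral separation threshold might combine.

```python
class PILagrangeWeight:
    """Hypothetical PI-style penalty weight with integral separation."""

    def __init__(self, kp, ki, separation_threshold, integral_max):
        self.kp = kp                      # proportional gain (assumed)
        self.ki = ki                      # integral gain (assumed)
        self.separation_threshold = separation_threshold
        self.integral_max = integral_max  # upper clamp on the integral term
        self.integral = 0.0

    def update(self, safe_prob, target_prob):
        # Positive error means the chance constraint is currently violated.
        error = target_prob - safe_prob
        # Integral separation (assumed form): accumulate only when the
        # error is small, so large transients do not wind up the integral.
        if abs(error) <= self.separation_threshold:
            self.integral += error
        # Keep the integral inside a bounded, non-negative range.
        self.integral = min(max(self.integral, 0.0), self.integral_max)
        # Proportional part acts like a penalty method, integral part
        # like a Lagrange multiplier update.
        return self.kp * max(error, 0.0) + self.ki * self.integral


weight = PILagrangeWeight(kp=10.0, ki=1.0, separation_threshold=0.1, integral_max=5.0)
w_far = weight.update(safe_prob=0.50, target_prob=0.99)   # large error, integral frozen
w_near = weight.update(safe_prob=0.95, target_prob=0.99)  # small error, integral active
```

In this sketch the first call is dominated by the proportional term because the error exceeds the separation threshold, while the second call starts accumulating the integral.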

Submitted 26 August, 2021; originally announced August 2021.

arXiv:2105.11299 [pdf, other]

doi 10.1109/TITS.2021.3136588

Fixed-Dimensional and Permutation Invariant State Representation of Autonomous Driving

Authors: Jingliang Duan, Dongjie Yu, Shengbo Eben Li, Wenxuan Wang, Yangang Ren, Ziyu Lin, Bo Cheng

Abstract: In this paper, we propose a new state representation method, called encoding sum and concatenation (ESC), for the state representation of decision-making in autonomous driving. Unlike existing state representation methods, ESC is applicable to a variable number of surrounding vehicles and eliminates the need for manually pre-designed sorting rules, leading to higher representation ability and generality. The proposed ESC method introduces a representation neural network (NN) to encode each surrounding vehicle into an encoding vector, and then adds these vectors to obtain the representation vector of the set of surrounding vehicles. By concatenating the set representation with other variables, such as indicators of the ego vehicle and road, we realize the fixed-dimensional and permutation invariant state representation. This paper has further proved that the proposed ESC method can realize the injective representation if the output dimension of the representation NN is greater than the number of variables of all surrounding vehicles. This means that by taking the ESC representation as policy inputs, we can find the nearly optimal representation NN and policy NN by simultaneously optimizing them using gradient-based updating. Experiments demonstrate that compared with the fixed-permutation representation method, the proposed method improves the representation ability of the surrounding vehicles, and the corresponding approximation error is reduced by 62.2%.
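
The encode, sum, and concatenate steps described in this abstract can be sketched as follows. This is a toy illustration, not the paper's network: the single `tanh` layer in `encode_vehicle`, the weight matrix `WEIGHTS`, and the feature layout are all hypothetical stand-ins for the learned representation NN.

```python
import math


def encode_vehicle(features, weights):
    # Stand-in for the representation NN: one tanh layer with fixed weights.
    return [math.tanh(sum(w * f for w, f in zip(row, features))) for row in weights]


def esc_state(ego, vehicles, weights):
    # Sum the per-vehicle encodings, then concatenate with ego/road variables.
    total = [0.0] * len(weights)
    for vehicle in vehicles:
        encoding = encode_vehicle(vehicle, weights)
        total = [t + e for t, e in zip(total, encoding)]
    return list(ego) + total


# Hypothetical 3-dimensional encoder over 2 features per vehicle.
WEIGHTS = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.7]]
ego = [12.0, 0.1]
cars = [[5.0, 1.0], [-3.0, 0.5], [8.0, -2.0]]

state_a = esc_state(ego, cars, WEIGHTS)
state_b = esc_state(ego, list(reversed(cars)), WEIGHTS)  # same set, other order
state_c = esc_state(ego, cars[:1], WEIGHTS)              # fewer vehicles
```

Because addition is commutative, `state_a` and `state_b` agree up to floating-point rounding, and `state_c` has the same length even though it describes a different number of vehicles.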

Submitted 4 March, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

Journal ref: IEEE Transactions on Intelligent Transportation Systems, 2021

arXiv:2103.05505 [pdf]

Approximate Optimal Filter for Linear Gaussian Time-invariant Systems

Authors: Kaiming Tang, Shengbo Eben Li, Yuming Yin, Yang Guan, Jingliang Duan, Wenhan Cao, Jie Li

Abstract: State estimation is critical to control systems, especially when the states cannot be directly measured. This paper presents an approximate optimal filter, which enables the use of the policy iteration technique to obtain the steady-state gain in linear Gaussian time-invariant systems. This design transforms the optimal filtering problem with minimum mean square error into an optimal control problem, called the Approximate Optimal Filtering (AOF) problem. The equivalence holds given certain conditions about initial state distributions and policy formats, in which the system state is the estimation error, the control input is the filter gain, and the control objective function is the accumulated estimation error. We present a policy iteration algorithm to solve the AOF problem in steady state. The approximate filter is finally evaluated on a classic vehicle state estimation problem. The results show that the policy converges to the steady-state Kalman gain, and its accuracy is within 2%.

Submitted 9 March, 2021; originally announced March 2021.

arXiv:2102.13304 [pdf]

Feasibility Enhancement of Constrained Receding Horizon Control Using Generalized Control Barrier Function

Authors: Haitong Ma, Xiangteng Zhang, Shengbo Eben Li, Ziyu Lin, Yao Lyu, Sifa Zheng

Abstract: Receding horizon control (RHC) is a popular procedure to deal with optimal control problems. Due to the existence of state constraints, optimization-based RHC often suffers from the notorious issue of infeasibility, which strongly shrinks the region of controllable states. This paper proposes a generalized control barrier function (CBF) to enlarge the feasible region of constrained RHC with only a one-step constraint on the prediction horizon. This design can reduce the constrained steps by penalizing the tendency to move towards the constraint boundary. Additionally, generalized CBF is able to handle high-order equality or inequality constraints through extending the constrained step to nonadjacent nodes. We apply this technique to an automated vehicle control task. The results show that compared to multi-step pointwise constraints, generalized CBF can effectively avoid the infeasibility issue in a larger partition of the state space, and the computing efficiency is also improved by 14%-23%.

Submitted 26 February, 2021; originally announced February 2021.

arXiv:2102.11736 [pdf, other]

Recurrent Model Predictive Control

Authors: Zhengyu Liu, Jingliang Duan, Wenxuan Wang, Shengbo Eben Li, Yuming Yin, Ziyu Lin, Qi Sun, Bo Cheng

Abstract: This paper proposes an off-line algorithm, called Recurrent Model Predictive Control (RMPC), to solve general nonlinear finite-horizon optimal control problems. Unlike traditional Model Predictive Control (MPC) algorithms, it can make full use of the current computing resources and adaptively select the longest model prediction horizon. Our algorithm employs a recurrent function to approximate the optimal policy, which maps the system states and reference values directly to the control inputs. The number of prediction steps is equal to the number of recurrent cycles of the learned policy function. With an arbitrary initial policy function, the proposed RMPC algorithm can converge to the optimal policy by directly minimizing the designed loss function. We further prove the convergence and optimality of the RMPC algorithm through the Bellman optimality principle, and demonstrate its generality and efficiency using two numerical examples.

Submitted 23 February, 2021; originally announced February 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2102.10289

arXiv:2102.10289 [pdf, other]

doi 10.1109/TIE.2022.3153800

Recurrent Model Predictive Control: Learning an Explicit Recurrent Controller for Nonlinear Systems

Authors: Zhengyu Liu, Jingliang Duan, Wenxuan Wang, Shengbo Eben Li, Yuming Yin, Ziyu Lin, Bo Cheng

Abstract: This paper proposes an offline control algorithm, called Recurrent Model Predictive Control (RMPC), to solve large-scale nonlinear finite-horizon optimal control problems. It can be regarded as an explicit solver of traditional Model Predictive Control (MPC) algorithms, which can adaptively select an appropriate model prediction horizon according to current computing resources, so as to improve the policy performance. Our algorithm employs a recurrent function to approximate the optimal policy, which maps the system states and reference values directly to the control inputs. The output of the learned policy network after N recurrent cycles corresponds to the nearly optimal solution of N-step MPC. A policy optimization objective is designed by decomposing the MPC cost function according to Bellman's principle of optimality. The optimal recurrent policy can be obtained by directly minimizing the designed objective function, which is applicable to general nonlinear and non input-affine systems. Both simulation-based and real-robot path-tracking tasks are utilized to demonstrate the effectiveness of the proposed method.

Submitted 8 April, 2022; v1 submitted 20 February, 2021; originally announced February 2021.

Journal ref: IEEE Transactions on Industrial Electronics, 2022

arXiv:2102.08539 [pdf, other]

Separated Proportional-Integral Lagrangian for Chance Constrained Reinforcement Learning

Authors: Baiyu Peng, Yao Mu, Jingliang Duan, Yang Guan, Shengbo Eben Li, Jianyu Chen

Abstract: Safety is essential for reinforcement learning (RL) applied in real-world tasks like autonomous driving. Chance constraints, which guarantee the satisfaction of state constraints at a high probability, are suitable to represent the requirements in real-world environments with uncertainty. Existing chance constrained RL methods like the penalty method and the Lagrangian method either exhibit periodic oscillations or cannot satisfy the constraints. In this paper, we address these shortcomings by proposing a separated proportional-integral Lagrangian (SPIL) algorithm. Taking a control perspective, we first interpret the penalty method and the Lagrangian method as proportional feedback and integral feedback control, respectively. Then, a proportional-integral Lagrangian method is proposed to steady the learning process while improving safety. To prevent integral overshooting and reduce conservatism, we introduce the integral separation technique inspired by PID control. Finally, an analytical gradient of the chance constraint is utilized for model-based policy optimization. The effectiveness of SPIL is demonstrated by a narrow car-following task. Experiments indicate that compared with previous methods, SPIL improves the performance while guaranteeing safety, with a steady learning process.

Submitted 16 February, 2021; originally announced February 2021.

arXiv:2102.08072 [pdf, other]

Steadily Learn to Drive with Virtual Memory

Authors: Yuhang Zhang, Yao Mu, Yujie Yang, Yang Guan, Shengbo Eben Li, Qi Sun, Jianyu Chen

Abstract: Reinforcement learning has shown great potential in developing high-level autonomous driving. However, for high-dimensional tasks, current RL methods suffer from low data efficiency and oscillation in the training process. This paper proposes an algorithm called Learn to drive with Virtual Memory (LVM) to overcome these problems. LVM compresses the high-dimensional information into compact latent states and learns a latent dynamic model to summarize the agent's experience. Various imagined latent trajectories are generated as virtual memory by the latent dynamic model. The policy is learned by propagating gradient through the learned latent model with the imagined latent trajectories and thus leads to high data efficiency. Furthermore, a double critic structure is designed to reduce the oscillation during the training process. The effectiveness of LVM is demonstrated by an image-input autonomous driving task, in which LVM outperforms the existing method in terms of data efficiency, learning stability, and control performance.

Submitted 16 February, 2021; originally announced February 2021.

Comments: Submitted to the 32nd IEEE Intelligent Vehicles Symposium

arXiv:2012.10716 [pdf, other]

Model-Based Actor-Critic with Chance Constraint for Stochastic System

Authors: Baiyu Peng, Yao Mu, Yang Guan, Shengbo Eben Li, Yuming Yin, Jianyu Chen

Abstract: Safety is essential for reinforcement learning (RL) applied in real-world situations. Chance constraints are suitable to represent the safety requirements in stochastic systems. Previous chance-constrained RL methods usually have a low convergence rate, or only learn a conservative policy. In this paper, we propose a model-based chance constrained actor-critic (CCAC) algorithm which can efficiently learn a safe and non-conservative policy. Different from existing methods that optimize a conservative lower bound, CCAC directly solves the original chance constrained problems, where the objective function and safe probability are simultaneously optimized with adaptive weights. In order to improve the convergence rate, CCAC utilizes the gradient of the dynamic model to accelerate policy optimization. The effectiveness of CCAC is demonstrated by a stochastic car-following task. Experiments indicate that compared with previous RL methods, CCAC improves the performance while guaranteeing safety, with a five times faster convergence rate. It also has 100 times higher online computation efficiency than traditional safety techniques such as stochastic model predictive control.

Submitted 16 March, 2021; v1 submitted 19 December, 2020; originally announced December 2020.

arXiv:2011.09612 [pdf, other]

Numerically Stable Dynamic Bicycle Model for Discrete-time Control

Authors: Qiang Ge, Shengbo Eben Li, Qi Sun, Sifa Zheng

Abstract: Dynamic/kinematic model is of great significance in decision and control of intelligent vehicles. However, due to the singularity of dynamic models at low speed, kinematic models have been the only choice under many driving scenarios. This paper presents a discrete dynamic bicycle model feasible at any low speed utilizing the concept of backward Euler method. We further give a sufficient condition, based on which the numerical stability is proved. Simulation verifies that (1) the proposed model is numerically stable while the forward-Euler discretized dynamic model diverges; (2) the model reduces forecast error by up to 49% compared to the kinematic model. As far as we know, it is the first time that a dynamic bicycle model is qualified for urban driving scenarios involving stop-and-go tasks.
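
The contrast between forward and backward Euler that this abstract relies on can be seen on a scalar stiff test equation, x' = -a x. The sketch below is not the paper's bicycle model; the decay rate `a` is a hypothetical stand-in for terms that grow large as vehicle speed approaches zero, and the step size `dt` is an assumed control period.

```python
def forward_euler(x0, a, dt, steps):
    # Explicit update: x_{k+1} = x_k + dt * (-a * x_k) = (1 - a*dt) * x_k
    x = x0
    for _ in range(steps):
        x = x + dt * (-a * x)
    return x


def backward_euler(x0, a, dt, steps):
    # Implicit update: x_{k+1} = x_k + dt * (-a * x_{k+1}),
    # solved in closed form as x_{k+1} = x_k / (1 + a*dt).
    x = x0
    for _ in range(steps):
        x = x / (1.0 + a * dt)
    return x


a, dt, steps = 300.0, 0.01, 20   # a*dt = 3, outside forward Euler's stable range
x_forward = forward_euler(1.0, a, dt, steps)
x_backward = backward_euler(1.0, a, dt, steps)
```

With `a * dt = 3`, the explicit update multiplies the state by -2 every step and grows without bound, while the implicit update multiplies it by 1/4 and decays for any positive `a` and `dt`.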

Submitted 18 November, 2020; originally announced November 2020.

Comments: 6 pages, 7 figures, conference

arXiv:2008.13081 [pdf, other]

Centralized Coordination of Connected Vehicles at Intersections using Graphical Mixed Integer Optimization

Authors: Qiang Ge, Qi Sun, Zhen Wang, Shengbo Eben Li, Ziqing Gu, Sifa Zheng

Abstract: This paper proposes a centralized multi-vehicle coordination scheme serving unsignalized intersections. The whole process consists of three stages: a) target velocity optimization: formulate the collision-free vehicle coordination as a Mixed Integer Linear Programming (MILP) problem, with each incoming lane representing an independent variable; b) dynamic vehicle selection: build a directed graph with the result of the optimization, and reserve only some of the vehicle nodes to coordinate by applying a subset extraction algorithm; c) synchronous velocity profile planning: bridge the gap between current speed and optimal velocity in a synchronous manner. The problem size is essentially bounded by the number of lanes instead of vehicles. Thus the optimization process is real-time with guaranteed solution quality. Simulation has verified the efficiency and real-time performance of the scheme.

Submitted 29 August, 2020; originally announced August 2020.

Comments: 6 pages, 9 figures, conference

arXiv:2008.00674 [pdf, other]

Reinforcement Solver for H-infinity Filter with Bounded Noise

Authors: Jie Li, Shengbo Eben Li, Kaiming Tang, Yao Lv, Wenhan Cao

Abstract: The H-infinity filter has been widely applied in engineering, but coping with bounded noise is still an open problem and difficult to solve. This paper considers the H-infinity filtering problem for linear systems with bounded process and measurement noise. The problem is first formulated as a zero-sum game where the dynamics of the estimation error are non-affine with respect to filter gain and measurement noise. A nonquadratic Hamilton-Jacobi-Isaacs (HJI) equation is then derived by employing a nonquadratic cost to characterize bounded noise, which is extremely difficult to solve due to its non-affine and nonlinear properties. Next, a reinforcement learning algorithm based on the gradient descent method, which can handle nonlinearity, is proposed to update the gain of the reinforcement filter, where measurement noise is fixed to tackle the non-affine property and increase the convexity of the Hamiltonian. Two examples demonstrate the convergence and effectiveness of the proposed algorithm.

Submitted 3 August, 2020; originally announced August 2020.

arXiv:2007.06810 [pdf]

Ternary Policy Iteration Algorithm for Nonlinear Robust Control

Authors: Jie Li, Shengbo Eben Li, Yang Guan, Jingliang Duan, Wenyu Li, Yuming Yin

Abstract: The uncertainties in plant dynamics remain a challenge for nonlinear control problems. This paper develops a ternary policy iteration (TPI) algorithm for solving nonlinear robust control problems with bounded uncertainties. The controller and uncertainty of the system are considered as game players, and the robust control problem is formulated as a two-player zero-sum differential game. In order to solve the differential game, the corresponding Hamilton-Jacobi-Isaacs (HJI) equation is then derived. Three loss functions and three update phases are designed to match the identity equation, minimization and maximization of the HJI equation, respectively. These loss functions are defined by the expectation of the approximate Hamiltonian in a generated state set to prevent operating all the states in the entire state set concurrently. The parameters of value function and policies are directly updated by diminishing the designed loss functions using the gradient descent method. Moreover, zero-initialization can be applied to the parameters of the control policy. The effectiveness of the proposed TPI algorithm is demonstrated through two simulation studies. The simulation results show that the TPI algorithm can converge to the optimal solution for the linear plant, and has high resistance to disturbances for the nonlinear plant.

Submitted 14 July, 2020; originally announced July 2020.

arXiv:2007.02070 [pdf, other]

Continuous-time finite-horizon ADP for automated vehicle controller design with high efficiency

Authors: Ziyu Lin, Jingliang Duan, Shengbo Eben Li, Haitong Ma, Yuming Yin

Abstract: The design of an automated vehicle controller can be generally formulated into an optimal control problem. This paper proposes a continuous-time finite-horizon approximate dynamic programming (ADP) method, which can synthesize an off-line near-optimal control policy with analytical vehicle dynamics. Relying on the general Policy Iteration framework, it employs value and policy neural networks to approximate the mappings from the system states to value function and control inputs, respectively. The proposed method can converge to the near-optimal solution of the finite-horizon Hamilton-Jacobi-Bellman (HJB) equation. We further applied our algorithm to the simulation of automated vehicle control for the path tracking maneuver. The results suggest that the proposed ADP method can obtain the near-optimal policy with 1% error and less calculation time. What is more, the proposed ADP algorithm is also suitable for nonlinear control systems, where ADP is almost 500 times faster than the nonlinear MPC ipopt solver.

Submitted 4 July, 2020; originally announced July 2020.

Comments: 7 pages, conference

arXiv:2003.00848 [pdf, other]

Mixed Reinforcement Learning with Additive Stochastic Uncertainty

Authors: Yao Mu, Shengbo Eben Li, Chang Liu, Qi Sun, Bingbing Nie, Bo Cheng, Baiyu Peng

Abstract: Reinforcement learning (RL) methods often rely on massive exploration data to search optimal policies, and suffer from poor sampling efficiency. This paper presents a mixed reinforcement learning (mixed RL) algorithm by simultaneously using dual representations of environmental dynamics to search the optimal policy with the purpose of improving both learning accuracy and training speed. The dual representations indicate the environmental model and the state-action data: the former can accelerate the learning process of RL, while its inherent model uncertainty generally leads to worse policy accuracy than the latter, which comes from direct measurements of states and actions. In the framework design of the mixed RL, the compensation of the additive stochastic model uncertainty is embedded inside the policy iteration RL framework by using explored state-action data via iterative Bayesian estimator (IBE). The optimal policy is then computed in an iterative way by alternating between policy evaluation (PEV) and policy improvement (PIM). The convergence of the mixed RL is proved using the Bellman's principle of optimality, and the recursive stability of the generated policy is proved via the Lyapunov's direct method. The effectiveness of the mixed RL is demonstrated by a typical optimal control problem of stochastic non-affine nonlinear systems (i.e., double lane change task with an automated vehicle).

Submitted 28 February, 2020; originally announced March 2020.

arXiv:2001.09816 [pdf, other]

doi 10.1049/iet-its.2019.0317

Hierarchical Reinforcement Learning for Self-Driving Decision-Making without Reliance on Labeled Driving Data

Authors: Jingliang Duan, Shengbo Eben Li, Yang Guan, Qi Sun, Bo Cheng

Abstract: Decision making for self-driving cars is usually tackled by manually encoding rules from drivers' behaviors or imitating drivers' manipulation using supervised learning techniques. Both of them rely on mass driving data to cover all possible driving scenarios. This paper presents a hierarchical reinforcement learning method for decision making of self-driving cars, which does not depend on a large amount of labeled driving data. This method comprehensively considers both high-level maneuver selection and low-level motion control in both lateral and longitudinal directions. We firstly decompose the driving tasks into three maneuvers, including driving in lane, right lane change and left lane change, and learn the sub-policy for each maneuver. Then, a master policy is learned to choose the maneuver policy to be executed in the current state. All policies including master policy and maneuver policies are represented by fully-connected neural networks and trained by using asynchronous parallel reinforcement learners (APRL), which builds a mapping from the sensory outputs to driving decisions. Different state spaces and reward functions are designed for each maneuver. We apply this method to a highway driving scenario, which demonstrates that it can realize smooth and safe decision making for self-driving cars.

Submitted 27 January, 2020; originally announced January 2020.

Journal ref: IET Intelligent Transport Systems, 2020, 14(5): 297-305

arXiv:2001.02811 [pdf, other]

doi 10.1109/TNNLS.2021.3082568

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Authors: Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Bo Cheng

Abstract: In reinforcement learning (RL), function approximation errors are known to easily lead to the Q-value overestimations, thus greatly reducing policy performance. This paper presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for continuous control setting, to improve the policy performance by mitigating Q-value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it is capable of adaptively adjusting the update stepsize of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving the state-of-the-art performance.

Submitted 11 June, 2021; v1 submitted 8 January, 2020; originally announced January 2020.

Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 2021

arXiv:1911.11397 [pdf, other]

doi 10.1016/j.neucom.2021.04.134

Adaptive dynamic programming for nonaffine nonlinear optimal control problem with state constraints

Authors: Jingliang Duan, Zhengyu Liu, Shengbo Eben Li, Qi Sun, Zhenzhong Jia, Bo Cheng

Abstract: This paper presents a constrained adaptive dynamic programming (CADP) algorithm to solve general nonlinear nonaffine optimal control problems with known dynamics. Unlike previous ADP algorithms, it can directly deal with problems with state constraints. Firstly, a constrained generalized policy iteration (CGPI) framework is developed to handle state constraints by transforming the traditional policy improvement process into a constrained policy optimization problem. Next, we propose an actor-critic variant of CGPI, called CADP, in which both policy and value functions are approximated by multi-layer neural networks to directly map the system states to control inputs and value function, respectively. CADP linearizes the constrained optimization problem locally into a quadratically constrained linear programming problem, and then obtains the optimal update of the policy network by solving its dual problem. A trust region constraint is added to prevent excessive policy update, thus ensuring linearization accuracy. We determine the feasibility of the policy optimization problem by calculating the minimum trust region boundary and update the policy using two recovery rules when infeasible. The vehicle control problem in the path-tracking task is used to demonstrate the effectiveness of this proposed method.

Submitted 8 April, 2022; v1 submitted 26 November, 2019; originally announced November 2019.

Journal ref: Neurocomputing 484 (2022) 128-141

arXiv:1909.05402 [pdf, other]

doi 10.1109/TIV.2023.3255264

Relaxed Actor-Critic with Convergence Guarantees for Continuous-Time Optimal Control of Nonlinear Systems

Authors: Jingliang Duan, Jie Li, Qiang Ge, Shengbo Eben Li, Monimoy Bujarbaruah, Fei Ma, Dezhao Zhang

Abstract: This paper presents the Relaxed Continuous-Time Actor-critic (RCTAC) algorithm, a method for finding the nearly optimal policy for nonlinear continuous-time (CT) systems with known dynamics and infinite horizon, such as the path-tracking control of vehicles. RCTAC has several advantages over existing adaptive dynamic programming algorithms for CT systems. It does not require the "admissibility" of the initialized policy or the input-affine nature of controlled systems for convergence. Instead, given any initial policy, RCTAC can converge to an admissible, and subsequently nearly optimal policy for a general nonlinear system with a saturated controller. RCTAC consists of two phases: a warm-up phase and a generalized policy iteration phase. The warm-up phase minimizes the square of the Hamiltonian to achieve admissibility, while the generalized policy iteration phase relaxes the update termination conditions for faster convergence. The convergence and optimality of the algorithm are proven through Lyapunov analysis, and its effectiveness is demonstrated through simulations and real-world path-tracking tasks.

Submitted 30 March, 2023; v1 submitted 11 September, 2019; originally announced September 2019.

Journal ref: IEEE Transactions on Intelligent Vehicles, 2023 (Early Access)

arXiv:1807.11874 [pdf, ps, other]

Parallel Optimal Control for Cooperative Automation of Large-scale Connected Vehicles via ADMM

Authors: Zhitao Wang, Yang Zheng, Shengbo Eben Li, Keyou You, Keqiang Li

Abstract: This paper proposes a parallel optimization algorithm for cooperative automation of large-scale connected vehicles. The task of cooperative automation is formulated as a centralized optimization problem taking the whole decision space of all vehicles into account. Considering the uncertainty of the environment, the problem is solved in a receding horizon fashion. Then, we employ the alternating direction method of multipliers (ADMM) to solve the centralized optimization in a parallel way, which scales more favorably to large-scale instances. Also, Taylor series is used to linearize nonconvex constraints caused by coupling collision avoidance constraints among interactive vehicles. Simulations with two typical traffic scenes for multiple vehicles demonstrate the effectiveness and efficiency of our method.

Submitted 31 July, 2018; originally announced July 2018.

Showing 1–39 of 39 results for author: Li, S E